A key skill set for a Data Scientist

Marcin Kosiński, Warsaw

From his office in Warsaw, emagine’s in-house Data Scientist, Marcin Kosiński, relays some expert advice to beginner and experienced Data Scientists alike, taking us through the most important aspects of his work.

There has been a growing demand for Data Scientists, i.e., people who analyze data in order to develop machine learning models.

The challenges Data Scientists face every day help them develop the necessary skills, such as creative problem solving. The work is far from boring, yet it requires specific skills and experience.

Which competences are the most important? Which technologies should one learn? Do you need to have a scientific background and a good head for figures? Or is it a better idea to focus on soft skills? What are the present-day requirements for a Data Scientist, and how will they change in a few or a dozen years?

Key competencies of a Data Scientist of the future

In this article, I would like to point out some of the universal elements of a Data Scientist's work. Following current trends is crucial for keeping up with how the most widely used technologies develop and how useful they remain.

However, you need to remember that most of the currently used technologies may undergo changes or become obsolete in just a few years.

Therefore, in this post, I would like to put particular emphasis on the timeless credo of a Data Scientist. The points below follow the sequence of the data analysis process and of the process of creating products based on data and machine learning models.

1. Asking the right questions

There is no research without a hypothesis and there is no project without a goal. Sometimes, formulating your hypothesis or your goal precisely requires many questions.

Customers have needs in terms of work optimization and automation, and they believe that Data Science can strengthen their market position. Yet they are often unable to specify those needs in the language of machine learning. Therefore, a key aspect of a Data Scientist's work is the ability to ask the right questions: to translate business needs into data-based solutions and to match existing solutions as closely as possible to the needs of the individual project.

Asking effective questions is also useful at the data quality and usability assessment stage. You will learn more about the never-ending need to ask questions in the section on iterative problem solving.

2. Data quality and usability assessment

No machine can replace this skill. Despite immense data volumes and petabytes of stored information, many data sources have no real potential for use in Data Science solutions. Often the quality of the data is substandard because of unsuccessful migrations, human errors, or logical errors within data structures, or because certain types of information are ultimately of little use in machine learning models.

Thus, the ability to evaluate data usability and validate the quality of the data one is working with should be an indispensable element in a Data Scientist’s toolkit.
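As a rough illustration of what a first data quality check might look like, the sketch below counts missing values per column and fully duplicated rows. The function name `quality_report`, the list of values treated as missing, and the sample records are assumptions made up for this example. A real assessment would also cover the logical and migration errors mentioned above.

```python
from collections import Counter

# Values treated as "missing". This list is an assumption; adapt it to your data.
MISSING = (None, "", "NA", "N/A", "null")


def quality_report(rows):
    """Summarise missing values and duplicate rows for a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    total = len(rows)
    missing_share = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in MISSING)
        missing_share[col] = missing / total if total else 0.0
    # Two rows count as duplicates when every column holds the same value.
    counts = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())
    return {
        "rows": total,
        "missing_share": missing_share,
        "duplicate_rows": duplicate_rows,
    }


# Hypothetical customer records, invented for illustration.
records = [
    {"id": 1, "city": "Warsaw", "age": 34},
    {"id": 2, "city": "", "age": 41},
    {"id": 3, "city": "Krakow", "age": None},
    {"id": 3, "city": "Krakow", "age": None},
]

report = quality_report(records)
print(report)
```

A report like this does not decide anything by itself. It only points to the columns and rows where a person should start asking questions.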

Most likely, the majority of the work performed by a Data Scientist will become automated in the future. For this reason, one should focus on those elements where the human factor is invaluable.

3. Iterative problem solving

This is the main and pivotal skill of a Data Scientist. It is particularly important because of the nature of the data analysis process and of the process of creating data-based products.

The entire work process of a Data Scientist is iterative. This means that it resembles a loop, within which one goes in circles. Each time, an increasingly refined product is created by moving through the same successive steps but using the experience gained during the previous iteration (loop) of the process. Recreating the process time and time again, where each cycle is improved with the knowledge gained at the previous stage, helps prepare highly effective and customized solutions.

This is an important lesson for every Data Scientist: it is a good idea to start from the simplest model, which will later be subject to numerous improvements. The advantage of starting from the simplest possible model is that it gives us an initial model against which future, improved solutions can be compared. Such an initial model may even turn out to be a satisfactory solution developed in a relatively short time, without the need to deploy the heavy guns of machine learning.

4. Autoverification

The work of a Data Scientist often results in a system of decision-making rules that makes it possible to take numerous automated, smart actions. Such a system is referred to as a machine learning model. It is therefore important to be able to assess whether a given model is effective and precise enough.

When developing machine learning models, one must always remember to have the initial (the simplest in terms of its design) model, to which subsequent, improved solutions can be compared. The developed model should also be contrasted with competitive solutions described in the literature.
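The comparison against an initial model can be sketched in a few lines. Here the simplest model always predicts the most common label from the training data, and a candidate model is judged by how much it improves on that. The labels and the candidate predictions are invented for illustration, and accuracy is only one of many possible metrics.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_label(y_train):
    """Simplest possible 'model': the most common label in the training data."""
    return Counter(y_train).most_common(1)[0][0]


# Hypothetical labels and predictions, invented for illustration.
y_train = [0, 0, 0, 1, 0, 1, 0, 0]
y_test = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
candidate_pred = [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]  # output of some improved model

# The baseline predicts the same label for every test example.
baseline_pred = [majority_label(y_train)] * len(y_test)

baseline_acc = accuracy(y_test, baseline_pred)
candidate_acc = accuracy(y_test, candidate_pred)
gain = candidate_acc - baseline_acc
print(f"baseline={baseline_acc:.2f} candidate={candidate_acc:.2f} gain={gain:+.2f}")
```

If the gain over the baseline is small, that is a signal to ask whether the added complexity of the improved model is worth it.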

5. Explaining models' output

When a data-based product is created, questions about its logic of operation often arise. A Data Scientist has to explain complex machine learning models in a way that is clear to people with no technical background.

People ask which elements the model consists of, which data matter most and how they affect the result, how strong the relationships are, and how the correctness of the model was verified. A Data Scientist should be able to explain the model, the methods used to validate it, and which of the data used in the model played a crucial role in the project.
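One common way to show which data played a crucial role is permutation importance: shuffle one input column at a time and measure how much the model's accuracy drops. The sketch below is a minimal standard-library version of that idea, not a method prescribed by this article. The toy model, the feature names `income` and `shoe_size`, and the data are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Average accuracy drop when the values of one feature are shuffled."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importance = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [r[feature] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this one feature shuffled.
            shuffled = [{**r, feature: v} for r, v in zip(rows, column)]
            drops.append(base - accuracy(model, shuffled, labels))
        importance[feature] = sum(drops) / n_repeats
    return importance


def toy_model(row):
    """Hypothetical model: the decision depends on income only."""
    return int(row["income"] > 50)


# Hypothetical data, invented for illustration.
pairs = [(20, 38), (80, 44), (35, 41), (95, 39), (10, 45), (70, 40), (60, 42), (30, 43)]
rows = [{"income": income, "shoe_size": size} for income, size in pairs]
labels = [toy_model(r) for r in rows]

importance = permutation_importance(toy_model, rows, labels)
print(importance)
```

Because the toy model ignores `shoe_size`, shuffling that column never changes a prediction and its importance is zero. A statement such as "the result depends on income, not on shoe size" is something a non-technical audience can follow.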

6. The simpler, the better

Sometimes, the people we talk to are surprised to hear this. Even though we have all the intellectual achievements of humankind at our disposal, it is often preferable to use data analysis solutions that are based on simple and explainable rules, are fast, and use as little computing power as possible.

There has been a growing urge to use the most complex and computationally hungry solutions. Yet a Data Scientist should always strive to minimize the product's computing time, reduce memory use, simplify the models, and reduce the amount of data required. A simple model is easier to manage, and its operation is easier to understand.

Needless to say, there are certain solutions where effectiveness is all that counts and where the heaviest guns from the machine learning arsenal are deployed, but there are many solutions that are appreciated for their explainability and simplicity of operation.

Experience will help you ask the right questions, which will lead you toward the best solution.

7. Searching for synergy

This skill may be new to less experienced Data Scientists. It is most often needed at higher career levels, e.g. at the managerial or executive level. It refers to the ability to look for connections among machine learning solutions, for example by trying to deploy a tool created by one team in projects that another team is working on.

Quite frequently, the goal is to find solutions and applications that kill two birds with one stone. Sometimes, a Data Scientist focuses exclusively on improving the single tool they have been working on. However, in some cases, it is necessary to look at data-based products from a wider perspective and to seek connections between projects and tools, so that the potential of already developed solutions can be used to the fullest.

8. Reproducibility

An important and yet often neglected skill. Developed machine learning models should work today but should also remain operational in the future. Sometimes, due to a heavy workload, we move on straight to a new project directly after developing a model.

However, once a project has been completed, it is recommended to take some time to equip the solution with a reproducible environment that can be easily transferred and launched on many machines. This way, we can make sure that our solution will remain operational regardless of how the versions of libraries and programming languages change over time.
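A first step toward such an environment is to record exactly what the model was built with. The sketch below, which uses only the Python standard library, collects the interpreter version and the versions of all installed packages. The function name `environment_snapshot` and the structure of the output are assumptions for this example. In practice, teams often go further with lock files or container images.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Record the interpreter version and the versions of installed packages."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that lack a package name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()

# The JSON text can be stored next to the trained model, e.g. in version control.
snapshot_json = json.dumps(snapshot, indent=2)
print(snapshot["python"], len(snapshot["packages"]), "packages recorded")
```

Saving this snapshot together with the model makes it possible to rebuild a matching environment later, even after the libraries have moved on.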

If we re-examine the list, we may conclude that the work of a Data Scientist consists of continuously asking questions.

This is precisely the case!

It means generating applications, questioning solutions, and constantly improving and verifying them. Curiosity and conscientiousness are surely desirable qualities, but in reality it is experience that helps you ask the right questions, which lead you toward the best solution.

In your pursuit of a Data Scientist career, try to keep in mind the points made in this article. This will help you achieve your goals faster.
