As a developer, when you think of data science, the first thing that comes to mind is probably math. The next thought is usually of all those foreign-sounding algorithms. For most developers, data science is nothing but fitting an algorithm to a dataset. But I'm sorry to burst the bubble: it's not as simple as that. Data science is an umbrella term, and there are many things happening in a data science project. In this post, I'll try to explain these different things as simply as I can, so that you can make informed decisions.
There are different practices under data science, such as data engineering, artificial intelligence, machine learning, deep learning, neural networks, data visualization, etc. And not all of these practices require you to know the math and statistics that you read about in most formal data science courses. In fact, for a developer, data engineering is the best place to start the journey. That is what I did: fortunately, I got the opportunity to work on a project as a data engineer. Let's see what a data engineer does.
Each company has its own definition of a data engineer. Actually, each data engineer has their own definition of what they do. In my opinion, a data engineer is a person who has some or all of the following responsibilities:
- Extracts or collects raw data from various sources
- Cleans the collected data
- Tries to make sense of the data after removing the noise from it
- Tries to see what part of the data is useful and what part of it could be discarded
- Transforms the data
- Loads the data into one or more data stores for further analysis
These responsibilities sound more like ETL jobs, and that’s not entirely wrong. But a data engineer does more than just ETL. Along with ETL, a data engineer is also responsible for making sense of the data. In a data science project, this is particularly important because the scale of data will be very large. And when you’re working with large datasets, it becomes very important to keep only the data that makes sense to the business. You don’t want to be hoarding terabytes of unwanted data. This is just going to take up storage space and add extra cost to the business.
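The responsibilities above can be sketched as a minimal ETL pipeline. This is a toy illustration in plain Python with made-up field names; in a real project the extract step would read from a file, API, or queue, and the load step would write to a database or data lake:

```python
import csv
import io

# Extract: read raw rows from a CSV source (an in-memory sample here).
raw_csv = """user_id,age,country
1,34,US
2,,DE
3,29,US
"""

def extract(source):
    return list(csv.DictReader(io.StringIO(source)))

# Transform: clean the data -- drop incomplete rows (the "noise") and
# keep only the fields the business actually needs.
def transform(rows):
    return [
        {"user_id": int(r["user_id"]), "age": int(r["age"])}
        for r in rows
        if r["age"]  # discard records with a missing age
    ]

# Load: write the cleaned rows to a target store (a plain list here).
def load(rows, store):
    store.extend(rows)
    return store

store = []
load(transform(extract(raw_csv)), store)
print(store)  # the incomplete record is discarded
```

The point of keeping the three steps as separate functions is exactly the point made above: the transform step is where you decide what data is worth storing, so unwanted data never reaches (and never bloats) the data store.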
If this kind of work interests you, it's pretty easy to get started as a data engineer. You need to learn tools such as Apache Spark, Apache HBase, Apache Drill, Apache Flink, Apache Storm, Apache Kafka, etc. Amazon provides a lot of data engineering and big data services in the AWS stack, such as DynamoDB, Athena, S3 (as a data store), and Kinesis. Microsoft and Google offer similar tools and services in their Azure and GCP offerings. Using all these tools, you can start building pipelines for extracting or collecting data, analyzing the collected data, running transformation scripts on samples of data for testing, running transformations on the entire data or streams of data in real time, and then storing the transformed data in a data store. There are many resources on all these tools and services, and in most cases, just the official documentation should be sufficient to get started.
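To give a feel for the streaming side without pulling in Kafka or Flink, here is the shape of a record-by-record stream transformation, sketched with plain Python generators (the events and field names are invented):

```python
# A toy stream: in production this would consume from a broker like Kafka;
# here it yields a fixed sample of sensor readings.
def event_stream():
    yield from [{"temp_f": 68.0}, {"temp_f": 72.5}, {"temp_f": 80.6}]

# Transform each record as it arrives, without ever materializing the
# whole stream in memory -- the same pattern a streaming job uses at scale.
def to_celsius(stream):
    for event in stream:
        yield {"temp_c": round((event["temp_f"] - 32) * 5 / 9, 1)}

for record in to_celsius(event_stream()):
    print(record)
```

Frameworks like Flink or Spark Structured Streaming add fault tolerance, partitioning, and windowing on top, but the mental model (a transformation applied to an unbounded sequence of records) is the same.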
Once a data engineer stores the transformed data in a data store, an ML or AI engineer takes over for further processing.
Machine Learning (ML) or Artificial Intelligence (AI) Engineer
The job of an ML or AI engineer comes after a data engineer prepares the data. An ML or AI engineer is responsible for figuring out which features (columns or fields in the data) are relevant for the business problem. In most cases, not all features in the dataset are relevant for the current problem, though they might be needed in some other solution. Anyway, the engineer will figure out which features need to be selected from the dataset, and check whether any feature that was required is missing.
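That selection-and-check step can be sketched in a few lines of Python. The column names here are invented for illustration:

```python
# Records as prepared by the data engineer (made-up columns).
records = [
    {"age": 34, "income": 52000, "favourite_color": "blue"},
    {"age": 29, "income": 48000, "favourite_color": "red"},
]

# The features the business problem actually needs.
required = {"age", "income", "signup_date"}

available = set(records[0])
missing = required - available   # these go back to the data engineer
selected = required & available  # these we can use right now

# Keep only the relevant columns.
rows = [{k: r[k] for k in selected} for r in records]
print(missing)  # a feature the dataset doesn't have yet
```

Here `signup_date` ends up in `missing`, which is exactly the situation described next: the data goes back one step so that feature can be collected.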
In case a feature is missing from the dataset, the data will go back one step to the data engineer to figure out how that feature could be collected or included in the dataset. If the data is as expected, the ML or AI engineer will now try to see how different datasets from different sources could be merged to get all the required features into one single dataset. I should probably make it clear here: usually, not all data required for a solution will be collected from a single source. So in this phase, it becomes necessary to merge all the data. And by merging, I don't mean download the data to your laptop and copy-paste it into a spreadsheet. Usually, the data will not even fit on your laptop's hard disk. The data will usually remain on different machines, but a virtual table or view will be created where the data appears to be merged.
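The "virtual table or view" idea can be demonstrated with Python's built-in `sqlite3` module. The two tables stand in for two different sources (the table and column names are made up); the view presents them as one merged dataset without copying any data:

```python
import sqlite3

# Two "sources" as separate tables in an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
con.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, 34), (2, 29)])
con.executemany("INSERT INTO purchases VALUES (?, ?)", [(1, 9.99), (2, 4.50)])

# The merge lives as a view: queries see one dataset,
# but the underlying data stays where it is.
con.execute("""
    CREATE VIEW merged AS
    SELECT u.user_id, u.age, p.amount
    FROM users u JOIN purchases p ON u.user_id = p.user_id
""")
rows = con.execute("SELECT * FROM merged ORDER BY user_id").fetchall()
print(rows)
```

At big-data scale the same concept shows up as external tables and views in tools like Athena, Drill, or Spark SQL, where the "tables" may live on entirely different machines.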
Depending on the problem statement, the ML or AI engineer will decide which algorithm or model to apply on the data. And depending on the algorithm selected, the data will be divided into training and testing data if it is a supervised learning model, or the model will be fit on the whole data if it is an unsupervised learning model.
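For the supervised case, the train/test split mentioned above is simple to sketch. This is a minimal version in plain Python; the 80/20 ratio is a common default, not a rule:

```python
import random

# Hold out a fraction of the labeled data to evaluate the model on.
def train_test_split(data, test_ratio=0.2, seed=42):
    shuffled = data[:]  # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))  # stand-in for 10 labeled examples
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

Libraries like scikit-learn provide this (with stratification and other options), but the idea is just this: never evaluate a supervised model on the data it was trained on.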
This is the stage where all that math and statistics knowledge is required. Depending on the problem statement, you need to figure out which algorithm to use, which model to train, and so on. And because you have this foundational knowledge (math, statistics, etc.), you know how to tune the models to increase the accuracy of your predictions or classifications. So far, I've worked only with prediction models, and I can tell you, there is always tuning required. You have to play with the features (include or exclude them) and change a lot of other model parameters to reach the accuracy that you promised your client.
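Tuning, in miniature, looks like this: sweep a parameter and keep the value that minimizes error on data the model hasn't seen. This toy example fits a slope to noisy `y ≈ 2x` data; real tuning works the same way, just over many parameters and with a proper library:

```python
# (x, y) pairs; the validation set is held out from fitting.
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
valid = [(4, 8.1), (5, 9.8)]

# Squared error of a one-parameter model y = slope * x.
def error(slope, data):
    return sum((y - slope * x) ** 2 for x, y in data)

# Grid search: try each candidate, keep the one with the lowest
# validation error.
candidates = [1.5, 1.8, 2.0, 2.2, 2.5]
best = min(candidates, key=lambda s: error(s, valid))
print(best)  # 2.0 fits the held-out data best
```

Swapping features in and out, as mentioned above, is the same loop one level up: each feature subset is a candidate, and the one with the best held-out accuracy wins.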
Most online courses do cover these tuning or optimization techniques to some extent. But unless you work on a real world problem with a real client, you’ll not learn all the optimization techniques. So yeah, I would highly recommend you try to get yourself into such a project. Also, you should look for a good mentor early in your data science career. The knowledge that you’d gain from a mentor is very valuable.
If this math and stats appeals to you more than just writing code, then this role is for you. There are many ways to get started. But if you're not familiar with the math and stats involved, let me tell you without any sugar coating: you'll need to put in a few years to get yourself up to speed. I find it boring, but for those who find it interesting or are curious, it's a rewarding experience. Anyway, you can either take courses online or start reading books. There are many free courses available online from top universities such as IIT, Harvard, and Stanford, so you don't really have to shell out a lot of money for this study. And once you think you have enough knowledge to start practicing, you should start looking at the various libraries and tools available for applying these algorithms programmatically to the data. Some examples of these tools and libraries are scikit-learn, Apache Spark MLlib, and the AWS stack.
Data Visualization Engineer
The last type of data science job (in my view) is data visualization. Let me put it this way: all the work that the data engineers and the ML or AI engineers do is of no use if your client can't see the data on their dashboards. And most business people understand charts and graphs. This is where data visualization engineers come in. I can't stress enough how important data visualization is. Looking at numbers in a remote terminal window or a spreadsheet is not as convenient as looking at a couple of charts. The saying "A picture is worth a thousand words" is very true in this case.
These engineers are sometimes called Business Intelligence (BI) engineers. But this is not always the case. BI engineers do more than just data visualization. If this interests you, please look at tools such as Microsoft’s PowerBI, Google’s Data Studio, Amazon QuickSight, and D3 charts.
But the scene is changing
What I've described so far is the traditional approach to getting into data science. But the scene is changing rapidly. With tools and services such as Google's TensorFlow and the many managed services in the AWS stack, all you have to do is call a function to train, test, and even deploy your models. With SageMaker especially, it has become very easy for a developer to get started with AI and ML. Understanding the tool itself is easy, because the documentation is clear and there are many examples to get started with. The development part has become very easy. But you still need to figure out which model to use for solving your problem, and for that, you'll still need the math knowledge to some extent.
I hope I was able to help you figure out how to get started with data science, if you're a developer, that is. Most of this is based on my personal observations. And if you look around on the internet, you'll find a thousand other ways to start this journey. This is because everyone comes from a different technical background and focuses on a different part of data science.