The wrong path to data science

Let me give you some context first. A few years ago, I was on the road to data science. I wanted to learn everything about this field; the mere idea of building something intelligent that could help someone make predictions amazed me.

Inspired by this idea, I decided I wanted to become a data scientist, and like many others, I jumped from engineering to this new landscape. Not knowing where to start, I began searching the Internet to see what data science looked like.

Soon enough, I ended up in a vast sea of blogs soaked in hype and expectations. I was reading titles such as:

  • Data Science for Beginners: FULL COURSE!
  • Must-read books for Machine Learning and Data Science
  • How to speed up pandas with one line of code!
  • 10 BEST machine learning courses

I was ready to read them all. I thought to myself:

If I can learn the best algorithm, if I build the best model to predict X, if I apply bleeding-edge techniques, I will surely stay ahead of the competition. I will be a good data scientist.

I was about to learn that skill only gets you so far.

Hard truths and disappointments

After months of hard study, I quickly started looking for a job in this new and exciting field. I was pretty good at pandas, and I was able to get my head around scikit-learn. TensorFlow? No problem.

I got a job as a data analyst. I was hyped, I felt fantastic, and I wanted to show what I could do. I tried to help the company grow and show them how to apply data analysis and machine learning to their operations.

But in my small mind, I had no idea how utterly wrong I was, and let me tell you why.

1. Data doesn’t appear magically

person holding wand on top of bowl
Photo by @art_maltsev on Unsplash, accessed 02/11/2020

Guess what? Almost every blog out there will lead you to assume that there’s a clean dataset ready to be analyzed. I fell for it as well.

As a data analyst, I was tasked with analyzing our sales, monthly revenue, cancellations, and everything else that is, without a doubt, essential for a SaaS company. To get a dataset, I had to connect to production servers, APIs, buckets, and so on.

And you could say: Well, of course, that’s expected! The only problem is that programs do not generate datasets for human consumption.

Most of the time, you will have a SQL table in a production database riddled with columns whose meaning you don’t even understand. Or JSON files that don’t even have a proper structure. Or incomplete datasets that you need to join from multiple sources to get a working dataset.
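To make that concrete, here is a minimal sketch in Python of what putting a dataset together can look like. Everything in it is an assumption for illustration: the accounts table, its columns, the event records, and the helper names are invented, and an in-memory SQLite database stands in for a production one. It reads rows from a SQL table, copes with JSON records that do not share a structure, and joins the two into flat rows.

```python
import json
import sqlite3

# Hypothetical production table; the column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT, mrr_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [(1, "pro", 4900), (2, "basic", 1900), (3, "pro", 4900)],
)

# Hypothetical event export; the records do not share a consistent structure.
raw_events = """
[{"account_id": 1, "event": "cancel", "meta": {"reason": "price"}},
 {"account": {"id": 2}, "event": "cancel"},
 {"account_id": 3, "event": "login"}]
"""


def account_id_of(event):
    """Find the account id whether it is flat or nested; None if absent."""
    if "account_id" in event:
        return event["account_id"]
    return event.get("account", {}).get("id")


def cancel_reasons(events):
    """Map account id -> cancellation reason (None when no reason was recorded)."""
    reasons = {}
    for event in events:
        if event.get("event") != "cancel":
            continue
        reasons[account_id_of(event)] = event.get("meta", {}).get("reason")
    return reasons


def build_dataset(conn, events):
    """Left-join accounts with cancellations into a list of flat rows."""
    reasons = cancel_reasons(events)
    rows = []
    query = "SELECT id, plan, mrr_cents FROM accounts ORDER BY id"
    for acc_id, plan, mrr_cents in conn.execute(query):
        rows.append({
            "account_id": acc_id,
            "plan": plan,
            "mrr": mrr_cents / 100,
            "cancelled": acc_id in reasons,
            "cancel_reason": reasons.get(acc_id),
        })
    return rows


dataset = build_dataset(conn, json.loads(raw_events))
for row in dataset:
    print(row)
```

Even in this toy version, most of the code deals with inconsistencies in the sources rather than with analysis.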

Now, imagine doing that over and over and over. The first lesson I learned was that data doesn’t magically appear: Something has to generate it, and someone has to put it together.

2. Scalability will be an issue

white staircase with pink background
Photo by @maxon on Unsplash, accessed 02/11/2020

I struggled to get data, but I finally knew my way around it. I was already building and putting together datasets, and it was about time to fire up Jupyter and start tinkering with them.

My objective was straightforward: I wanted to know why people cancel their subscriptions with us. I started my EDA right away. I found some hard truths, cleared up assumptions, and built a small model to predict someone’s probability of churning.

It wasn’t the best model, but it was good enough, and I was proud of it. I presented my findings to the stakeholders, and they were delighted. They now expected a report in their inbox every month listing the accounts most likely to cancel. The experiment was a success!

Dear reader, did you realize what I just said? If you haven’t managed a data science team before, you may think there’s no issue here. However, for someone who has to deal with such a team’s coordination, capacity, and planning, it soon becomes clear that this strategy is not scalable.

You have a data analyst (or a data scientist) extracting data by themselves, running a report locally in a Jupyter notebook that only works on their computer, and manually delivering Excel reports to the stakeholders. If you don’t think this is a recipe for disaster, I invite you to reconsider your strategy to build scalable teams.

Moreover, with this approach, you’re going to burn out. Stakeholders will depend on your ability to send them the report on time, and you’re teaching the company that they don’t need to learn about data. They have you.

I soon learned my second lesson the hard way: Data Science is nothing without the architecture to support it.
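One small step in that direction, sketched under assumptions: move the report out of the notebook and into plain functions that a scheduler could run on any machine. The function names, the toy scoring rule, the threshold, the run date, and the sample accounts below are all hypothetical; a real setup would load a trained model and read from a shared data store.

```python
import csv
import io
from datetime import date


def churn_report(accounts, score, threshold=0.5):
    """Return accounts scoring at or above the threshold, highest risk first."""
    scored = [
        {"account_id": a["account_id"], "churn_score": round(score(a), 3)}
        for a in accounts
    ]
    at_risk = [row for row in scored if row["churn_score"] >= threshold]
    return sorted(at_risk, key=lambda row: row["churn_score"], reverse=True)


def write_report(rows, out, run_date):
    """Write the report as CSV to any file-like object."""
    writer = csv.DictWriter(out, fieldnames=["run_date", "account_id", "churn_score"])
    writer.writeheader()
    for row in rows:
        writer.writerow({"run_date": run_date.isoformat(), **row})


# Hypothetical stand-in for a trained model: fewer logins means higher risk.
def toy_score(account):
    return 1.0 / (1 + account["logins_last_30d"])


# Hypothetical sample data.
accounts = [
    {"account_id": 1, "logins_last_30d": 0},
    {"account_id": 2, "logins_last_30d": 9},
    {"account_id": 3, "logins_last_30d": 1},
]

report = churn_report(accounts, toy_score)
buffer = io.StringIO()
write_report(report, buffer, date(2020, 11, 2))
print(buffer.getvalue())
```

Because the logic no longer depends on notebook state or on one person’s laptop, the same functions can be tested, versioned, and handed to whatever scheduler the team adopts.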

3. Models with no business case are useless

person writing on white paper
Photo by @kellysikkema on Unsplash, accessed 02/11/2020

Even though our processes were not scalable, we kept going, developing model after model, sometimes even the same model with different algorithms.

We had just discovered AI, and we wanted to make it ours! After all those models I built, I realized that people were asking me for things that seemed to tackle no business problem. Predict revenue? Ok. Cancellations? Here you go. Forecast new accounts? No problem.

Now, let me ask some difficult questions to my past self:

  1. Why did someone want me to predict revenue? Was there a plan to execute if our prediction was that we were about to hit a bad month?
  2. We have a model to predict churn. Do we have an action plan for users at risk?
  3. Why do you want to forecast new accounts? Is there any reason to do it at all?

If you lay out a myriad of models without any business objective or execution plan, only to satisfy stakeholders’ wonder and amazement, then let me politely tell you that you’re providing nothing of value.

Having a business objective and an execution plan is paramount to building successful AI products that will change the way you do business. Let’s go even further: your responsibility as a data scientist is to educate your stakeholders!

They trust your expertise and field knowledge to guide them through this AI revolution. Have them prepare a business case. Ask them difficult questions, and ask for execution plans, ways to mitigate failure, and explicit assumptions about the business risks involved.

You’ll provide a clearer agenda and better models, and you’ll bring more value to the company. Yet again, I learned another lesson: To build without a plan is to build nothing at all.

4. Data Science is not exclusive to one team

graphs of performance analytics on a laptop screen
Photo by @lukechesser on Unsplash, accessed 02/11/2020

We are in the Fourth Industrial Revolution, and data is on the front line. That is something that I bet most people don’t understand. Data has been such a breakthrough that it has transformed entire businesses!

Remember how important it was to know how to use a computer? I still remember when having a Microsoft Office certificate guaranteed you a job somewhere. People were replaced one by one by newer generations who were more adept at computers.

Years later, that is no longer a qualification. It is an expectation. Companies now expect you to know how to use a computer, to understand how to use Excel, and to browse the web without any issue.

If you think that data will be different, then I beg you to reconsider your priorities. We see companies investing in data democratization like there’s no tomorrow. They are teaching their employees how to handle data, interpret it, and use it to enhance their operations.

They have seen the value that data brings to the business. They know how important it is to make informed decisions, and developing strategies around data is becoming the norm.

If you think Data Science will be reserved for teams who know how to handle data, then I’m afraid you’re wrong. If you want to succeed in your business, whether you are a data analyst or a Chief Data Officer of an organization, you have to push for data democratization.

And with this, I learned a valuable lesson: If you don’t invest in data democratization, you will be overtaken by a company that does.

Conclusion and recommendations

If you see yourself in one of these points, then let me give you some recommendations:

  1. Hire a data engineer first: This person will lay the foundations for analysts and data scientists to scale their operations with ease. It’s one of the single best decisions you can make.
  2. Lay out your data strategy: Think about your data strategy and try to find holes in it. Think about how you are going to move data (Apache Airflow?), how you are going to analyze data (Deepnote?), and how you are going to deploy products (MLflow?).
  3. Build with a business case: Make a quick checklist with objectives, risks involved, mitigation plans, what-ifs, and so on.
  4. Prepare to scale data: Invest in educating people. Teach them Python or R; SQL is a must these days. These are the people who are going to be at the front of your business. Maybe they are in marketing, sales, or other operations, but they need to read and manipulate data.
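The checklist from recommendation 3 does not need to be fancy. As a rough sketch, assuming nothing more than Python’s standard library, it could be a small record that tells you which questions are still unanswered before anyone builds a model. The class and field names here are hypothetical, chosen only to mirror the items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessCase:
    """Hypothetical checklist for a model request; field names are illustrative."""
    objective: str = ""
    action_if_prediction_is_bad: str = ""
    risks: list = field(default_factory=list)
    mitigations: list = field(default_factory=list)

    def missing(self):
        """Return the names of the checklist items that are still empty."""
        return [name for name, value in vars(self).items() if not value]

    def ready(self):
        """A request is ready only when every item has been filled in."""
        return not self.missing()


request = BusinessCase(objective="Predict monthly churn")
print(request.missing())
print(request.ready())
```

A request with only an objective is not ready: it still lacks an action plan, risks, and mitigations, which is exactly the conversation to have with stakeholders.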

I hope you liked the entry. Follow me on Twitter if you want to read more entries like this 👉🏻

Also, share this article if you found it interesting. See you soon.