Data Scientists… We've Been Sold Lies

00 The Setup

I have been building machine learning models for a large financial MNC for about 4 years. Over that time, I have gained experience in building and improving machine-learning solutions to real-world data science problems. I have also found that the picture of data science and machine learning I had when I set foot in this field, formed through online learning on dummy datasets and Kaggle competitions, grossly misrepresents how the work is done in the real world.

This is my attempt to share what I have learned with you.

01 You Will Not Build Cutting-Edge Machine-Learning Models

As soon as you get your first data science job, you are excited to build your first model. So was I. I was eager to run the best machine-learning algorithm on real-world datasets and reach accuracy levels of ~95%. Remember the iris/adult/titanic datasets?

Businesses are in the business of doing business, not building cutting-edge models.

— a hard truth, year one

Businesses want solutions to problems that improve revenue, profitability, or productivity, or that save cost. This means that sometimes good-enough solutions that do not rely on machine learning can solve the business problem. Some of these are first-generation solutions that act as a launchpad, especially for new marketing solutions where data is limited.

So, if you are not building best-in-class models in your job, it is for a good reason: tight timelines, limited data, or a limited understanding of new products in the market. Over time, the gains from simple solutions may saturate, and the problem may call for innovation. That is when business understanding combined with best-in-industry models will help improve model performance and, subsequently, business outcomes.

Note to self: The job isn't to build the most sophisticated model. The job is to solve the business problem — and sometimes the simplest solution is the right one.

02 Modeling Is More Than Just Running an Algorithm

At the beginning of your data science journey, you are exposed to toy datasets like iris/titanic/adult. Once you gain confidence, you participate in Kaggle competitions. These toy problems and competitions create an illusion that modeling is running an algorithm on datasets and improving a pre-defined score — at least, I had that expectation.

Much of what goes into the model-building process is hidden, and you won't see it until you are given a real-world problem that requires generating or saving real dollars. To see the contrast, let's compare a typical Kaggle competition with the model development cycle I followed.

Kaggle vs. The Real World

🏆 Typical Kaggle Competition

Given well-prepared datasets and a pre-defined performance metric
Compete via feature engineering, hyperparameter tuning, trying multiple algorithms
Submit to a leaderboard, get scored on a private dataset, and improve your score over time
Defining the dependent variable, data prep, and scoring metric are already done for you

🏢 Modeling in the Real World

Business requests a solution, not a model — the problem definition is yours to figure out
Feasibility assessment, alignment on observations, and gathering business context
Single-model solutions preferred — deployability and maintainability matter
Governance, compliance, and sign-off can make or break everything

While Kaggle is the best simulation available for budding data scientists, much of the effort that goes into defining the dependent variable, preparing the datasets, and identifying the best scoring metric is already out of the picture.

03 The Real Modeling Development Cycle

Here is what I found building ML models for a large financial MNC with a dedicated team of data scientists providing solutions to multiple internal partners.

Requirement Planning

Business partners reach out to request a solution for a problem, e.g. identifying which customers to target for a new product when the marketing budget is limited.
The modeling team assesses requirements and checks the feasibility of building an ML model.
Both business and modeling teams align on relevant observations and gather business context. This becomes the foundation of defining the dependent variable, modeling approach, and the performance metric.

Data Prep & Modeling

The modeling team prepares datasets, explores data across relevant business segments, and verifies observations with business teams.
The team builds a first-cut solution and measures its performance both overall and within important business segments.
Performance is compared with a benchmark or production solution, if any.
The model is improved with feature engineering, hyperparameter tuning, and, if time permits, exploring alternate algorithms or approaches.
Unlike Kaggle, modeling teams try to build single-model solutions wherever possible — solutions must be simple enough to deploy and maintain.
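
To make the segment-level comparison above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the segment names, the labels, the scores, and the choice of AUC as the metric (in practice the metric is whatever the business and modeling teams agreed on during requirement planning). The helper names `auc` and `auc_by_segment` are hypothetical.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def auc_by_segment(rows, score_key):
    """AUC for the full population and for each business segment."""
    groups = defaultdict(list)
    for row in rows:
        groups["overall"].append(row)
        groups[row["segment"]].append(row)
    return {
        name: auc([r["label"] for r in group], [r[score_key] for r in group])
        for name, group in groups.items()
    }


# Made-up holdout rows: (segment, actual response, candidate score, benchmark score)
holdout = [
    ("new", 1, 0.9, 0.6),
    ("new", 0, 0.2, 0.7),
    ("new", 1, 0.7, 0.8),
    ("new", 0, 0.4, 0.3),
    ("existing", 1, 0.8, 0.9),
    ("existing", 0, 0.3, 0.2),
    ("existing", 0, 0.6, 0.4),
    ("existing", 1, 0.5, 0.7),
]
columns = ("segment", "label", "candidate", "benchmark")
rows = [dict(zip(columns, values)) for values in holdout]

candidate_auc = auc_by_segment(rows, "candidate")
benchmark_auc = auc_by_segment(rows, "benchmark")
for name in candidate_auc:
    print(f"{name:>8}: candidate={candidate_auc[name]:.3f} benchmark={benchmark_auc[name]:.3f}")
```

On this toy data the candidate beats the benchmark overall (0.9375 vs 0.90625) yet loses on the "existing" segment (0.75 vs 1.0). That is exactly the kind of finding an overall number hides and a business partner will ask about.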

Partner & Governance Approvals

The final model is presented to the business team, highlighting the lift it generates over the production solution, if any.
An impact estimation is conducted to size the business benefit.
Once partners approve, model documentation is prepared according to enterprise standards and submitted for governance approvals.
This step is critical in the highly regulated financial space, and it can make or break everything above.
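
As a rough illustration of the lift and impact estimation mentioned above, here is a small sketch. It assumes a targeting use case where the budget covers only the top `budget` customers by score. All the numbers, including `value_per_responder`, are invented, and the helper `responders_in_top_k` is hypothetical. Real impact sizing would follow whatever method the business partners sign off on.

```python
def responders_in_top_k(labels, scores, k):
    """Count actual responders among the k highest-scored customers."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k])


# Made-up backtest: 1 = customer responded, 0 = did not
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
candidate_scores = [0.90, 0.10, 0.80, 0.30, 0.20, 0.70, 0.40, 0.15, 0.35, 0.05]
production_scores = [0.60, 0.70, 0.90, 0.20, 0.80, 0.30, 0.10, 0.40, 0.50, 0.05]

budget = 4                 # assumed: marketing can contact 4 of these 10 customers
value_per_responder = 200  # assumed: business value of one response, in dollars

candidate_hits = responders_in_top_k(labels, candidate_scores, budget)
production_hits = responders_in_top_k(labels, production_scores, budget)

lift = candidate_hits / production_hits  # relative improvement over production
incremental_value = (candidate_hits - production_hits) * value_per_responder

print(f"candidate: {candidate_hits} responders, production: {production_hits} responders")
print(f"lift over production: {lift:.2f}x, estimated impact: ${incremental_value}")
```

With the same budget, the candidate reaches 3 responders against 2 for production, a 1.5x lift worth an estimated $200 on this sample. The business team cares about that last number far more than the score behind it.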

Deployment & Tracking

Once the model passes internal governance checks, it is deployed in production and generates scores at a frequency relevant to the business teams.
The job doesn't end at deployment. The model is now live and needs to be tracked for performance deterioration over time according to governance rules.
Deterioration is addressed by adjustment or a full model rebuild.
(Just like getting into a top college — you can't chill for the rest of your life. Another lie we've been sold. More on that some other time.)
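
One way to track a live model is to watch how far the score distribution drifts from what it looked like at development time. The sketch below uses the Population Stability Index (PSI), a metric commonly used for this in financial modeling. I am using it here as an assumption for illustration, not as the governance rule from my own work. The bucket counts are made up, and the thresholds in the comments are a widely quoted rule of thumb, not a standard.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two bucketed score distributions."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the shares at eps so an empty bucket does not cause log(0)
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        total += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return total


# Made-up counts of customers per score bucket (lowest to highest scores)
development_counts = [100, 100, 100, 100]  # distribution when the model was built
stable_month_counts = [100, 100, 100, 100]
shifted_month_counts = [40, 80, 120, 160]

psi_stable = psi(development_counts, stable_month_counts)
psi_shifted = psi(development_counts, shifted_month_counts)

# Rule of thumb (assumption): < 0.1 stable, 0.1 to 0.25 investigate, > 0.25 consider a rebuild
print(f"stable month PSI:  {psi_stable:.3f}")
print(f"shifted month PSI: {psi_shifted:.3f}")
```

A PSI of zero means the live scores look just like the development scores. The shifted month lands at roughly 0.23, in the "investigate" range under the rule of thumb above. PSI only flags a shift in the population. Confirming that performance has actually deteriorated still requires comparing scores against realized outcomes once they arrive.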

It doesn't take a data scientist to figure out that the model development cycle has much more to it than running your favourite algorithm.

04 Final Thoughts

Don't take this as a rant on competitions or on the iris dataset. One has to start somewhere. The idea is that there is an expectation mismatch between academic data science learning and data science for real-world problems — at least I felt it — and that you should not fret if you find yourself on the fence about data science.

Some of the best parts lie in the initial phases of the model development cycle, when you brainstorm with business teams. You learn a lot about the business, which leaves you with more knowledge at your disposal.

Hence, working with a problem-solving mindset is the best way forward. There will be times when a problem mirrors the streamlined view of data science. Take that opportunity to do everything you wanted to do when you started as a data scientist.

I hope you found this useful.