00 The Setup
I have been building machine learning models for a large financial MNC for about 4 years. Over that time, I have gained experience building and improving machine-learning solutions to real-world data science problems. I have also observed that the picture of data science and machine learning I had when I first set foot in this field, a picture formed through online courses, dummy datasets, and Kaggle competitions, grossly misrepresents what the work looks like in the real world.
This is my attempt to share those observations with you.
01 You Will Not Build Cutting-Edge Machine-Learning Models
As soon as you get your first data science job, you are excited to build your first model. So was I. I was eager to run the best machine-learning algorithm on real-world datasets and reach accuracy levels of ~95%. Remember the iris/adult/titanic datasets?
Businesses are in the business of doing business, not building cutting-edge models.
(a hard truth, year one)
Businesses want solutions to problems that improve revenue, profitability, or productivity, or that save cost. This means that sometimes good-enough solutions that do not rely on machine learning can be used to solve business problems. Some of these solutions are first-generation and act as a launchpad, especially for new marketing solutions where data is limited.
So, if you are not building top-notch, best-in-class models in your job, it is for a good reason: timelines, limited data, or a limited understanding of new products in the market. Over time, the gains from simple solutions may saturate, and further improvement may require innovation. That is when business understanding combined with best-in-industry models will help improve model performance and, subsequently, business outcomes.
02 Modeling Is More Than Running Just an Algorithm
At the beginning of your data science journey, you are exposed to toy datasets like iris/titanic/adult. Once you gain confidence, you participate in Kaggle competitions. These toy problems and competitions create an illusion that modeling is running an algorithm on datasets and improving a pre-defined score — at least, I had that expectation.
Much of what goes inside the model building process is hidden, and you won't know it unless you are given a real-world problem that requires generating or saving real dollars. To compare and contrast, let's look at a typical Kaggle competition versus the modeling development cycle I followed.
Kaggle vs. The Real World
Kaggle:
- Given well-prepared datasets and a pre-defined performance metric
- Compete via feature engineering, hyperparameter tuning, and trying multiple algorithms
- Submit to a leaderboard, where you are scored on private datasets and improve your score over time
- Defining the dependent variable, data prep, and scoring metric are already done for you
The real world:
- Business requests a solution, not a model; the problem definition is yours to figure out
- Feasibility assessment, alignment on observations, and gathering business context
- Single-model solutions preferred, because deployability and maintainability matter
- Governance, compliance, and sign-off can make or break everything
While Kaggle is the best simulation available for budding data scientists, much of the effort that goes into defining the dependent variable, preparing the datasets, and identifying the best scoring metric is already out of the picture.
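To make "defining the dependent variable" a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the function name `build_target`, the customer IDs, the purchase dates, and the 90-day and 180-day response windows are all hypothetical and not taken from any real project. The point is only that the label is something you construct from raw records and a business rule, and that a different rule gives different labels for the same customers.

```python
from datetime import date, timedelta


def build_target(customers, purchases, observation_date, window_days=90):
    """Label a customer 1 if they purchased within the window after observation_date.

    `purchases` is an iterable of (customer_id, purchase_date) pairs.
    The window length is a business decision, not something the data dictates.
    """
    window_end = observation_date + timedelta(days=window_days)
    buyers = {
        customer_id
        for customer_id, purchase_date in purchases
        if observation_date < purchase_date <= window_end
    }
    return {customer_id: int(customer_id in buyers) for customer_id in customers}


# Hypothetical data, for illustration only.
customers = ["c1", "c2", "c3"]
purchases = [
    ("c1", date(2023, 2, 10)),   # inside a 90-day window
    ("c2", date(2023, 6, 1)),    # only inside a longer window
    ("c3", date(2022, 12, 20)),  # before the observation date, never counted
]
observation_date = date(2023, 1, 1)

labels_90 = build_target(customers, purchases, observation_date, window_days=90)
labels_180 = build_target(customers, purchases, observation_date, window_days=180)
```

Here customer `c2` is a non-responder under the 90-day rule and a responder under the 180-day rule. On Kaggle that choice has already been made for you; on the job, it is typically settled with the business team before any modeling starts.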
03 The Real Modeling Development Cycle
Here is what I found building ML models for a large financial MNC with a dedicated team of data scientists providing solutions to multiple internal partners.
- Business partners reach out to request a solution for a problem — e.g. identify customers to target for a new product due to a limited marketing budget.
- The modeling team assesses requirements and checks the feasibility of building an ML model.
- Both business and modeling teams align on relevant observations and gather business context. This becomes the foundation of defining the dependent variable, modeling approach, and the performance metric.
- The modeling team prepares datasets, explores data across relevant business segments, and verifies observations with business teams.
- The team builds a first-cut solution and measures performance overall and within important business segments.
- Performance is compared with a benchmark or production solution, if any.
- The model is improved with feature engineering, hyperparameter tuning, and, if time permits, exploring alternate algorithms or approaches.
- Unlike Kaggle, modeling teams try to build single-model solutions wherever possible — solutions must be simple enough to deploy and maintain.
- The final model is presented to the business team, highlighting the lift it generates over the production solution, if any.
- An impact estimation is conducted to size the business benefit.
- Once partners approve, model documentation is prepared according to enterprise standards and submitted for governance approvals.
- This step is critical in the financial space, which is highly regulated. It can make or break everything above.
- Once the model passes internal governance checks, it is deployed in production and generates scores at a frequency relevant to the business teams.
- The job doesn't end at deployment. The model is now live and needs to be tracked for performance deterioration over time according to governance rules.
- Deterioration is addressed by adjustment or a full model rebuild.
- (Just like getting into a top college — you can't chill for the rest of your life. Another lie we've been sold. More on that some other time.)
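Three of the steps above can be sketched in a few lines of Python: measuring performance overall and by business segment, comparing against a benchmark, and flagging deterioration after deployment. This is a sketch under stated assumptions, not the process my team used. The metric (share of responders captured in the top-scored fraction), the field names `segment`, `responded`, `new_score`, and `benchmark_score`, the sample rows, and the 10% tolerance are all hypothetical.

```python
from collections import defaultdict


def capture_rate(scores, outcomes, top_fraction=0.5):
    """Share of all positive outcomes that fall in the top-scored fraction."""
    total_positives = sum(outcomes)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, round(len(ranked) * top_fraction))
    return sum(outcome for _, outcome in ranked[:cutoff]) / total_positives


def capture_rate_by_segment(rows, score_key, top_fraction=0.5):
    """Capture rate for the whole population and within each segment."""
    groups = defaultdict(list)
    for row in rows:
        groups["overall"].append(row)
        groups[row["segment"]].append(row)
    return {
        name: capture_rate(
            [member[score_key] for member in members],
            [member["responded"] for member in members],
            top_fraction,
        )
        for name, members in groups.items()
    }


def has_deteriorated(baseline_rate, current_rate, tolerance=0.10):
    """True if the metric dropped by more than `tolerance`, relative to baseline."""
    if baseline_rate == 0:
        return False
    return (baseline_rate - current_rate) / baseline_rate > tolerance


# Hypothetical scored customers, for illustration only.
rows = [
    {"segment": "retail", "responded": 1, "new_score": 0.9, "benchmark_score": 0.4},
    {"segment": "retail", "responded": 0, "new_score": 0.2, "benchmark_score": 0.8},
    {"segment": "retail", "responded": 1, "new_score": 0.7, "benchmark_score": 0.6},
    {"segment": "retail", "responded": 0, "new_score": 0.1, "benchmark_score": 0.3},
    {"segment": "premium", "responded": 1, "new_score": 0.8, "benchmark_score": 0.9},
    {"segment": "premium", "responded": 0, "new_score": 0.3, "benchmark_score": 0.2},
    {"segment": "premium", "responded": 0, "new_score": 0.4, "benchmark_score": 0.5},
    {"segment": "premium", "responded": 0, "new_score": 0.05, "benchmark_score": 0.1},
]

new_model = capture_rate_by_segment(rows, "new_score")
benchmark = capture_rate_by_segment(rows, "benchmark_score")
```

In this made-up data the new scores beat the benchmark overall and in the `retail` segment, while the two are tied in `premium`. An overall number alone would hide that kind of segment-level detail. After deployment, the same metric can be recomputed on fresh data and passed to `has_deteriorated` together with the value recorded at development time. What counts as an acceptable drop is usually set by governance rules rather than by the modeler.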
04 Final Thoughts
Don't take this as a rant against competitions or the iris dataset. One has to start somewhere. The idea is that there is a mismatch between the expectations set by academic data science learning and data science for real-world problems (at least I felt it), and that you should not fret if you find yourself on the fence about data science.
Some of the best parts lie in the initial phases of the model development cycle: brainstorming with business teams. You learn much about the business, which gives you knowledge you can readily put to use.
Hence, working with a problem-solving mindset is the best way forward. There will be times when the problems mirror the streamlined view of data science. So, take those opportunities to do everything you wanted to do when you started as a data scientist.
I hope you found this useful.