Data Science is Lego | by Twinkl Data Team | Twinkl Educational Publishers | Aug, 2022


This article was published by Oscar South, Data Scientist at Twinkl.

Here at Twinkl, one of our core values is to “Go Above and Beyond”. This manifests itself throughout the organization in many ways. There’s also a claim thrown around a lot in the data field that says “85% of data science projects fail”. I’d personally ballpark that the data team at Twinkl goes above and beyond this, to the point where 85% is closer to our success rate. We do this by taking a highly problem-focused approach grounded in a deep understanding of business strategy and objectives, then making marginal gains by accepting that success comes from an open culture of failing fast and learning. This approach isn’t without challenges, but with the right touch it certainly lends itself to a high success rate. It is one of the core drivers of the growth of the Twinkl data team and of the team’s success within the larger scope of Twinkl’s growth.

In this article I’ll discuss what some of the challenges and opportunities presented by this methodology look like in practice. In the follow-up, I’ll apply the principles discussed to evaluate one of my own ongoing projects during my tenure here.

So — why is data science Lego?

An individual Lego block is largely uninteresting: it has inputs and outputs that let it connect to other Lego blocks, and not much more than that. If you put enough of them together in the right way, though, you can build fantastic structures that are far more interesting as a whole than the sum of their parts.

In the data field we all love to talk about machine learning models. A lot of the hype and enthusiasm in the industry comes from the gravity created by research and discourse around this specific part of the overarching field. Moreover, “when you’re holding a hammer, everything looks like a nail”, i.e. most data scientists are more comfortable with one kind of tool or model than another, and when you are more comfortable with one particular tool, it intuitively feels like the right one for any problem you might encounter, regardless of whether there are better options within reach.

This can lead to a view of a model as a monolithic problem-solving entity, and often results in a very ‘model-first’ approach to data science projects, which looks a lot like this:

In my opinion, this is the cause of the anomalous 70 percentage point uplift in success rate at Twinkl (roughly 85% against the commonly quoted 15%). In our mapping between data science and Lego, the machine learning model is the individual block and the finished construction is the solution to a business problem: we put the blocks together in composable and meaningful combinations to achieve a larger goal. This is also how we think about data projects at Twinkl: small, composable projects that solve clear, well-defined problems in their own right and can be leveraged, combined and built on top of to solve larger problems with broader scope.

So, confusingly for our analogy, a Lego model is not equivalent to a machine learning model. The mapping would look more like this:

Lego block == ML model

Lego model == solution to a business problem

Flipping the logic, so that a data science project stops being a monolithic, highly optimized function and becomes an intuitive combination of the simplest versions of appropriately purposed ML models (or simply algorithms in a more general sense), moves the focus away from the technical details and towards the overall assumptions of the business problem. It’s kind of like abstract programming with blocks of business logic! It becomes very easy to think about each model/algorithm block in isolation and keep a highly complex problem conceptually simple. These blocks can also be worked on independently by colleagues, as long as a common interface is agreed on.
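To make the idea of blocks behind a common interface a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record format, the block names (`drop_inactive`, `add_click_rate`, `rank_by_click_rate`), the `compose` helper and the sample data are hypothetical and are not taken from Twinkl’s codebase.

```python
from typing import Callable, Dict, List

# Assumed shared interface: every block maps a list of records to a list of records.
Record = Dict[str, float]
Block = Callable[[List[Record]], List[Record]]


def compose(*blocks: Block) -> Block:
    """Chain blocks left to right into one block with the same interface."""
    def pipeline(records: List[Record]) -> List[Record]:
        for block in blocks:
            records = block(records)
        return records
    return pipeline


def drop_inactive(records: List[Record]) -> List[Record]:
    """Keep only records that have been viewed and clicked at least once."""
    return [r for r in records if r["views"] > 0 and r["clicks"] > 0]


def add_click_rate(records: List[Record]) -> List[Record]:
    """Return copies of the records with a click_rate field added."""
    return [{**r, "click_rate": r["clicks"] / r["views"]} for r in records]


def rank_by_click_rate(records: List[Record]) -> List[Record]:
    """Order records from highest to lowest click rate."""
    return sorted(records, key=lambda r: r["click_rate"], reverse=True)


# The finished construction: three very simple blocks plugged together.
recommend = compose(drop_inactive, add_click_rate, rank_by_click_rate)

# Made-up sample data, purely for illustration.
sample = [
    {"resource_id": 1, "views": 100, "clicks": 5},
    {"resource_id": 2, "views": 50, "clicks": 10},
    {"resource_id": 3, "views": 80, "clicks": 0},
]
ranked = recommend(sample)
```

With this sample, resource 2 ranks ahead of resource 1 and resource 3 is dropped. Because each block only depends on the agreed record format, a colleague could rewrite any one of them without touching the others.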

Since projects organized like this generally start delivering business value early, the functional ‘blocks’ which drive that value can be identified, zoomed in on and optimized independently. At this point you’re back to the monolithic ML model, but now the model is already delivering a baseline of value, so optimizing it (at least up to a point) becomes a linear relationship between time/thought invested and business value returned. The risk of failure is largely gone! Optimizing a fundamental enough ‘block’ in a valuable enough project might become a project in itself, or even a job description (or an entire team). The data team at Twinkl has grown organically in exactly this manner. As an example, my own role here (developing engagement-driven search recommendations) started life as an SQL query that a colleague wrote one morning a few years back.
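As a rough illustration of optimizing one block independently, the sketch below swaps a first-sprint baseline for a slightly better version while the rest of the project stays unchanged. The `Scorer` interface, both scoring functions, the `search` wrapper and the titles are hypothetical examples, not a description of the actual search recommendation system.

```python
from typing import Callable, List

# Assumed interface: a scorer takes a query and candidate titles
# and returns one score per title, in the same order.
Scorer = Callable[[str, List[str]], List[float]]


def baseline_scorer(query: str, titles: List[str]) -> List[float]:
    """First-sprint version: 1.0 if the whole query appears in the title, else 0.0."""
    q = query.lower()
    return [1.0 if q in t.lower() else 0.0 for t in titles]


def token_overlap_scorer(query: str, titles: List[str]) -> List[float]:
    """Later refinement: fraction of query tokens that appear in the title."""
    tokens = query.lower().split()
    scores = []
    for t in titles:
        title_tokens = set(t.lower().split())
        hits = sum(1 for tok in tokens if tok in title_tokens)
        scores.append(hits / len(tokens) if tokens else 0.0)
    return scores


def search(query: str, titles: List[str], scorer: Scorer) -> List[str]:
    """The surrounding project depends only on the Scorer interface."""
    scored = sorted(zip(scorer(query, titles), titles),
                    key=lambda pair: pair[0], reverse=True)
    return [title for score, title in scored if score > 0]


# Made-up titles, purely for illustration.
titles = ["Phonics Flashcards", "Times Tables Worksheet", "Tables and Charts Poster"]
baseline_results = search("times tables", titles, baseline_scorer)
improved_results = search("times tables", titles, token_overlap_scorer)
```

The baseline already returns a useful result on day one; the refined scorer surfaces an additional partial match, and nothing outside the scorer had to change to get it.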

The brilliant part is that once a project (the Lego ‘model’ of the analogy) exists, it becomes a composable ‘block’ with inputs and outputs in its own right. Now you can take these ‘Lego model’ projects, plug their interfaces into each other and make new, even more valuable composite ‘Lego models’ that answer totally different business questions.
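A small sketch of what plugging two finished projects into each other might look like is given here. The two stand-in projects, their input and output formats, and the sample catalogue and events are all assumptions invented for the example; they are not Twinkl’s actual systems.

```python
from typing import Callable, Dict, List

# Assumed project interfaces (hypothetical):
#   engagement project: list of event dicts -> {resource_id: engagement score}
#   search project:     query string        -> list of matching resource ids


def engagement_project(events: List[dict]) -> Dict[int, float]:
    """Stand-in project: count downloads per resource."""
    scores: Dict[int, float] = {}
    for event in events:
        if event["type"] == "download":
            rid = event["resource_id"]
            scores[rid] = scores.get(rid, 0.0) + 1.0
    return scores


def make_search_project(catalogue: Dict[int, str]) -> Callable[[str], List[int]]:
    """Stand-in project: keyword match over a small catalogue."""
    def search(query: str) -> List[int]:
        q = query.lower()
        return [rid for rid, title in catalogue.items() if q in title.lower()]
    return search


def make_engagement_search(search: Callable[[str], List[int]],
                           engagement: Dict[int, float]) -> Callable[[str], List[int]]:
    """Composite project: search hits re-ordered by engagement score."""
    def ranked_search(query: str) -> List[int]:
        hits = search(query)
        return sorted(hits, key=lambda rid: engagement.get(rid, 0.0), reverse=True)
    return ranked_search


# Made-up data, purely for illustration.
catalogue = {1: "Fractions Worksheet", 2: "Fractions Poster", 3: "Phonics Game"}
events = [
    {"resource_id": 2, "type": "download"},
    {"resource_id": 2, "type": "download"},
    {"resource_id": 1, "type": "download"},
    {"resource_id": 1, "type": "view"},
]

ranked_search = make_engagement_search(make_search_project(catalogue),
                                       engagement_project(events))
fractions_results = ranked_search("fractions")
```

Neither stand-in project knows the other exists. The composite only relies on their outputs, which is what lets it answer a different question (which matching resources do people actually engage with?) than either project answers alone.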

The implicit challenge here is that very short initial sprint lengths of just a couple of days may impose limitations, but those limitations also both motivate and facilitate creativity that ultimately helps to drive business value. They motivate by favoring simple, creative solutions that cut straight to the point. They facilitate by freeing people to experiment with interesting solutions to problems: it doesn’t matter if your initial idea fails on Monday, because there are four more days to try out alternative approaches.

In the follow-up to this article, I’ll discuss a case study of one particular combination of my own successes and failures. It led to deliverables with real business value (and allowed me to engage with some really fulfilling data science methodologies) that would have been impossible or impractical, and would inevitably have failed, had that outcome initially been scoped out as a monolithic goal.
