What Is It Like Leading Data Scientists and Engineers Together in the Technology Space?

Dr M Maruf Hossain, PhD, GAICD
Feb 21

Updated: Feb 26

Most corporate structures consist of various departments that contribute to the company’s overall mission and goals. Common departments include Marketing, Finance, and Operations (often collectively referred to as Business), Human Resources, and Information Technology (IT). These five divisions, or three if the Business functions are counted as one, represent the major departments within a publicly traded company, though there are often smaller departments within autonomous firms.

With recent interest in Artificial Intelligence and its potential to support organisations through data-driven decision-making systems, organisations are placing data leaders and data professionals in either the business or technology divisions. In rare cases, they also place some in the human resources department.

Originally published at LinkedIn Pulse on 8 May 2020.

Photo by Startup Stock Photos from Pexels

After working on both the business and technology sides as a data professional for over a decade, I have always advocated for a dedicated data space within an organisation. I always consider the business to be the owner of all organisational data. But processing that data and making it consumable, so that the business can generate actionable insight, requires great technological skill. For this very reason, most organisations place data capabilities, specifically the ones that are more advanced in nature and computationally expensive, such as machine learning, natural language processing, and artificial intelligence, in the technology space.

This brings forth an interesting dilemma for the leaders of the technology space. Traditionally, technology requires its people to have a more process-oriented mindset. Architectural design and serviceability both require following stricter, more rigid processes. In contrast, data professionals such as data analysts, data scientists, predictive analysts, and machine learning scientists or engineers need a flexible, curious, and data-driven mindset.

Scrum is not the only type of agile methodology

IT loves Scrum as its preferred agile methodology because it makes it easy to time-box activities and tasks, respond quickly to changes, and deliver value in a timely fashion. This becomes easy when everyone in the team has a process-oriented mindset. After all, agile is the best approach for building in the ever-changing business world. And Scrum is the most adopted agile practice. In fact, it is so widespread that, at many organisations, the terms Scrum and agile are used interchangeably, and that’s where data professionals face their day-to-day challenges.

The major principle behind Scrum is time boxing

Scrum divides the project into sprints, or time boxes, by defining the set of tasks the team must complete within a given period. No additional task can be added within the sprint. Sprints are designed to minimise external interruptions from stakeholders. Between sprints, the team reviews the previous sprint and plans for the next one. Therefore, once the sprint goal is set, no changes occur without good reason.

This structure is due to the nature of software development: it requires concentration. Letting people move the sprint goals around in a project is a major source of frustration for everyone.

Data science is nothing like software development

When we build software, we develop a set of specifications that may include architectural designs that clearly define how each component will be designed and how the components interact with one another. This spec can then be broken down into features, stories, tasks, etc. In other words, it is relatively straightforward to write a spec to ship the first software demo. Once the first version goes out and user feedback comes back, the development team can then iterate on the design. A good specification is more likely to lead to a good product.

In data science, however, a good spec, even if we can come up with one, does not guarantee success, let alone a complete product.

Most of software development’s ambiguity comes from known unknowns—we know what kind of software we want to build, but we may not know what kind of code to write to get there.

In data science, we deal with unknown unknowns—we have no idea at all what the data will show us or whether we can achieve what we set out to do.

Imagine an engineer with perfect coding ability. This engineer will have no problem completing any software project. But the same person still has to address data issues in a data science project and may need to change the goal because the data is not available. Even if the data is there, it may not be the right data to achieve what the engineer set out to achieve.

Dissatisfaction with applying Scrum in data science

Scrum works when we can turn a project’s roadmap into a spec. The Scrum process then turns the spec into a product. But we can’t do that in data science. Every piece of analysis by the data scientists generates new insights, which in turn can change priorities and render the sprint goal useless.

Often in data science projects, what the business wanted to achieve through machine learning turns out to be impossible, and there is nothing we can do but pivot to a new goal or just embed business logic to achieve the same. Data can often challenge (even wrongfully) the most popular beliefs, making it impossible to put such a data-driven model into production, either because it is wrong or because it doesn’t align with the business’s goals.

Say we are in the middle of a two-week sprint, and our data scientists find that our sprint goal is impossible. What can we do? Common sense says we can ditch the sprint goal and meet with the team to set a new one within the sprint, so we don’t waste another week. But this means we expect interruptions during the sprint. If the goal of sprints is to minimise interruptions and boost productivity, and we know interruptions are inevitable, why impose a time-box in the first place?

Every data science project has two stages: an exploration stage and a product stage

The product stage is similar to a software development project. Here, Scrum works. If we know what we need to build and can put the steps into a spec, then we can go ahead and run the sprints.

The exploration stage, however, is where data scientists dance with ambiguities. This is where running Scrum without context turns a project into a disaster.

The solution is Kanban

Kanban is a task queue. A basic Kanban board has 3 parts:

To-Do
Doing (Work in Progress, or WIP)
Done

The key constraint in Kanban is a limit on the number of tasks in WIP at any given time. If the limit is 4, the team can work on 4 tasks at once, no more. In the Kanban world, tasks are called cards.

In Scrum, the constraint is time: sprints are time-boxes. In Kanban, the constraint is the WIP limit. Frameworks use constraints to limit project complexity and scope, and that’s why they work. In the To-Do queue, tasks are ranked by priority, much like a Scrum backlog. When a task in WIP is finished, the team picks up the next one from the top of the To-Do queue.
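The board mechanics described above can be sketched in a few lines of Python. This is only an illustrative model, not a reference to any particular Kanban tool: the `KanbanBoard` class, its method names, and the card titles are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KanbanBoard:
    """Hypothetical three-column board: To-Do, Doing (WIP), Done."""
    wip_limit: int
    todo: deque = field(default_factory=deque)  # highest priority at the left
    doing: list = field(default_factory=list)
    done: list = field(default_factory=list)

    def add(self, card: str) -> None:
        """Append a card to the bottom of the To-Do queue."""
        self.todo.append(card)

    def pull(self) -> Optional[str]:
        """Move the top To-Do card into WIP, unless the WIP limit is reached."""
        if len(self.doing) >= self.wip_limit or not self.todo:
            return None
        card = self.todo.popleft()
        self.doing.append(card)
        return card

    def finish(self, card: str) -> None:
        """Move a card from WIP to Done, which frees one WIP slot."""
        self.doing.remove(card)
        self.done.append(card)


# Illustrative usage with made-up cards and a WIP limit of 2.
board = KanbanBoard(wip_limit=2)
for title in ["profile the data", "check label quality", "baseline model"]:
    board.add(title)

first = board.pull()     # "profile the data"
second = board.pull()    # "check label quality"
blocked = board.pull()   # None: the WIP limit of 2 is reached
board.finish(first)
third = board.pull()     # "baseline model", pulled once a slot is free
```

The only rule the sketch enforces is the WIP limit: nothing new is started until something in progress is finished.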

A common concern is, what happens when a task gets stuck in WIP indefinitely? This is a real problem. So, in Kanban, the project manager tracks cycle time, i.e., the time between starting and delivering the task, or in other words, how long the card sits in WIP. The team can set an agreement to review a card if any card has been stuck in WIP for, say, one week, and then take action to resolve the issue.
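As a rough sketch of that review agreement, the check below flags cards that have been in WIP for longer than an agreed threshold. The function name `stale_cards`, the one-week default, and the card names and dates are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def stale_cards(wip_started, now, max_age=timedelta(weeks=1)):
    """Return cards that have sat in WIP longer than max_age, oldest first.

    wip_started maps a card title to the datetime it entered WIP.
    """
    overdue = [
        (now - started, card)
        for card, started in wip_started.items()
        if now - started > max_age
    ]
    # Sort by time spent in WIP, longest first.
    return [card for _, card in sorted(overdue, reverse=True)]


# Made-up cards and start dates, for illustration only.
wip_started = {
    "OCR colour check": datetime(2020, 4, 28),
    "sampling audit": datetime(2020, 5, 6),
    "resolution check": datetime(2020, 4, 20),
}
to_review = stale_cards(wip_started, now=datetime(2020, 5, 8))
```

Here `to_review` contains the two cards that have been in progress for more than a week, with the longest-running one first; the card started two days earlier is left alone.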

How does Kanban help in exploration?

Say, after initial analysis, the newfound insight changes the priority and makes the sprint goal useless. In Kanban, managing task priority is the same as managing the order of cards in the To-Do list. Therefore, when a new priority arises, instead of having to call a meeting to make a new plan of who works on what, the project manager can simply change the order of cards and notify the team of the reasons. This means fewer interruptions, and things get done.

This solution raises another question: if it is impossible to come up with a spec in the exploration stage, then what does it mean to get things done in the exploration stage?

The answer lies in understanding the basic unit of work for data scientists when exploring. In exploration there is no spec; the basic unit of work is an experiment derived from a hypothesis about the data, and productivity is measured by the number of experiments.

The experiment that a data scientist carries out is not really what engineers demand as “proof of concept” or what business users demand as “proof of value”.

Experiments answer queries

There are always questions around data:

What is the source of the data: where does it come from, and how is it collected?
What is the quality of the data? Can we trust the data without first manipulating it? Will it be consistent with what we see during experimentation, or will it change during production?
Did any of the above assumptions change at some point in the data?

For example, if we have XML data from running OCR on several thousand bank statements, we might ask: Will the yellow text in some statements be correctly OCRed every time we run OCR on similar statements, or will it be missed due to colour saturation?

Then we can dive into the data and see whether our assumption holds for our question. Depending on what we see, it may lead to another question: Is the resolution always going to be the same? Can we trust the sampling? etc.

The goal of the project may be to build a model, but we can’t build a robust model without understanding the business and verifying our hypotheses around the datasets. We need to run experiments, revise assumptions around data, and then run another one.

From the project management side, we need to add the experiment to To-Do, run it in WIP, document the findings, then push it to Done. This is a great workflow to help the team find the data traps and pitfalls in the process of building a robust model.
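One way to picture this workflow is to treat each experiment as a card that cannot reach Done until its findings are written down. The sketch below is a hypothetical model: the `Experiment` class, the `Stage` names, and the rule that findings are mandatory are assumptions for illustration, and the findings text is a placeholder rather than a real result.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    TODO = "To-Do"
    WIP = "Doing"
    DONE = "Done"


@dataclass
class Experiment:
    """Hypothetical card for one experiment in the exploration stage."""
    question: str
    hypothesis: str
    stage: Stage = Stage.TODO
    findings: str = ""

    def start(self) -> None:
        """Move the experiment from To-Do into WIP."""
        self.stage = Stage.WIP

    def close(self, findings: str) -> None:
        """Record the findings, then push the experiment to Done."""
        if not findings.strip():
            raise ValueError("document the findings before moving to Done")
        self.findings = findings
        self.stage = Stage.DONE


def completed(experiments) -> int:
    """Count the experiments that have reached Done."""
    return sum(1 for e in experiments if e.stage is Stage.DONE)


# Illustrative usage based on the OCR question discussed earlier.
ocr = Experiment(
    question="Is yellow text OCRed correctly on every run?",
    hypothesis="Colour saturation causes yellow text to be missed",
)
resolution = Experiment(
    question="Is the scan resolution always the same?",
    hypothesis="Resolution varies between statements",
)
ocr.start()
ocr.close("placeholder: summary of what the sample showed")
finished = completed([ocr, resolution])
```

Counting the Done cards gives the number of completed experiments, and the recorded findings become the trail of revised assumptions that the team can revisit when building the model.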

I suggest…

Not to use out-of-the-box Scrum. Throw away the parts of Scrum that make no sense for data science and save ourselves the frustration.
To use Kanban during the exploration phase.
To choose the agile practices that work well in our context.

Finally, if there are two workflows in data science, how can we balance our team’s time between them?

Start the project with only data scientists to perform exploratory tasks. Once they are at a comfortable stage to provide direction for building the product, bring in architects and engineers to build the actual product.