Data Science Operating Model: Data Science as a Service

Dr M Maruf Hossain, PhD, GAICD
Feb 26

To drive value from data, analytics needs to be operationalised. Ad-hoc data exploration can be unclear: it depends on each functional team’s preferences for generating insight, and its findings are acted on and then forgotten under a pile of documentation. Those preferences can differ in how experiments are conducted, which tools are used, and how findings are disseminated. Every project starts from scratch. And if someone with vast knowledge leaves the organisation, that knowledge is completely lost.

Originally published at LinkedIn Pulse on 19 November 2019.

Organisations can take their analytics process a step further with a centre of excellence model, centralising their scarce analytics talent by co-locating people in a cross-functional “hub”. However, that does not ensure that the team avoids unnecessary data pre-processing or clean-up tasks, or being caught up in repetitive BI-type work. And even though the IT team may have given them a powerful Azure environment to run their analyses, it sometimes lacks the right packages or tools, rendering it useless.

In a data-driven organisation, each step of the analytical process is documented on the intranet as a set of best practices that each new member of the team must learn. There should also be a formal control process where a second team-member “double-does” the work to ensure consistent results. Once done, they save their source data, code, and the report to a designated folder tagged with their names and the date.

The analytics team functions as a “shared service” and has recurring meetings with marketing, risk, finance, and operations. Once results are finalised, the team works closely with the software engineers to translate its models from R into the production systems. Shelf lives are predetermined, and models get reviewed depending on compliance requirements. If the data science team were to disappear, it would have a major impact on many parts of the business.

A truly data-driven organisation will take its analytics to an even higher plane by ingraining its ways of working into the infrastructure of the team. Before anyone starts a project, the system will search the metadata of all previous projects and flag the most relevant prior work. There will be a single repository of all analytical work done throughout the organisation, and results cannot be published back to the library unless the necessary metadata is populated. When data scientists want to explore a recommendation engine for the checkout process, they should see not only the prior modellers’ efforts but also a full record of their exploratory process, including discarded model variants.
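The repository behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ProjectLibrary` class, the `REQUIRED_METADATA` fields and the keyword-overlap ranking are hypothetical choices, not a description of any particular product.

```python
from dataclasses import dataclass, field

# Hypothetical set of metadata fields that must be filled in before publishing.
REQUIRED_METADATA = ("title", "owner", "business_question", "tags")


@dataclass
class ProjectLibrary:
    """A single in-memory repository of analytical work (illustrative only)."""
    projects: list = field(default_factory=list)

    def publish(self, metadata: dict) -> None:
        # Refuse to publish results unless the necessary metadata is populated.
        missing = [key for key in REQUIRED_METADATA if not metadata.get(key)]
        if missing:
            raise ValueError(f"cannot publish, missing metadata: {missing}")
        self.projects.append(metadata)

    def find_related(self, query: str, top_n: int = 3) -> list:
        # Rank prior projects by how many query words appear in their metadata.
        query_terms = set(query.lower().split())
        scored = []
        for project in self.projects:
            text = " ".join(
                [project["title"], project["business_question"], *project["tags"]]
            )
            overlap = len(query_terms & set(text.lower().split()))
            if overlap:
                scored.append((overlap, project["title"]))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [title for _, title in scored[:top_n]]
```

Because discarded variants are published with the same metadata as successful ones, a search for "recommendation engine for checkout" would surface them alongside the models that went live.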

It is important that data scientists are working to help improve the entire analytical lifecycle from data discovery to business action. Educated business users identify more opportunities for data science, bring valuable context to the analytical process, and act as advocates for the data science team. True data democratisation requires deep, ongoing education for the organisation as a whole.

The following diagram depicts the data science operating model that an organisation can adopt in order to become a truly data-driven organisation.

Identify. This initial phase focuses on understanding the project objectives. Business outcomes need to be validated through Design Thinking workshops.

A common misunderstanding is that only the data scientist needs to understand the business issues, while the business knows exactly what it wants. Often the business intends to ‘make smarter decisions by using data’ but lacks an understanding of what analytical models are, how they can or should be used, and what realistic expectations around model effectiveness are. As such, the business itself needs to transform to work with analytical models.

The intent and expected outcomes from the data science sprints should be documented.

Another issue with this phase is that project objectives and project requirements usually originate from different parts of the organisation. The objectives typically come from a higher management level than the requirements. Ignoring this fact often leads to a situation where, after the model has been developed, its end-users are required to post-rationalise it, which causes a lot of dissatisfaction.

Acquire. In this phase, business and data analysts need to validate whether the datasets needed for the sprint are available in the data platform. If datasets are missing, they need to identify the right sources for enrichment. Organisations acquire external data sources to enhance the data they already possess so they can make more informed decisions.

All data, no matter the source, begins in its raw form. When collected data flows into a central data store, it is often ingested as discrete datasets. The result is frequently data dumped into a data lake, or a data swamp, full of raw information that is not useful outside narrow contexts.

Ingest. Once external data has been acquired to enrich the existing data, data engineers, data modellers or data developers need to leverage the data microservices to convert data formats, and to persist and catalogue the raw data.

They also need to work on data profiling, cross-file analysis and causality analysis while external data is loaded into the data platform.
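As a rough sketch of the ingest phase, the snippet below converts a raw CSV extract to JSON lines, registers it in a catalogue and records a basic profile of each column. The `ingest` function and the shape of the catalogue entry are assumptions made for illustration; a real data platform would use its own microservices and storage.

```python
import csv
import io
import json


def ingest(name, csv_text, catalogue):
    """Convert raw CSV text to JSON lines and register a profiled catalogue entry."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []

    # Basic data profiling: missing and distinct value counts per column.
    profile = {}
    for column in columns:
        present = [row[column] for row in rows if row[column] not in (None, "")]
        profile[column] = {
            "missing": len(rows) - len(present),
            "distinct": len(set(present)),
        }

    # Catalogue the dataset so later phases can discover it.
    catalogue[name] = {"row_count": len(rows), "profile": profile}

    # Format conversion: one JSON document per input row.
    return "\n".join(json.dumps(row) for row in rows)
```

A profile like this is usually enough to spot columns that are mostly empty or keys that are not unique before the data moves on to curation.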

Curate. After external datasets are ingested into the data platform, data engineers, data modellers or data developers load the data into its target structures, match and merge records, and manage keys. Data needs to be actively managed through its life cycle of interest so that it retains its usefulness.

Reference data are also looked up, transformed and enriched at this phase.
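A minimal sketch of the curate phase, assuming records are matched on a normalised email address: the `curate` function, the `customer_key` surrogate key and the `COUNTRY_REFERENCE` lookup are all hypothetical names chosen for this example.

```python
from itertools import count

# Hypothetical reference data used to enrich curated records.
COUNTRY_REFERENCE = {"AU": "Australia", "NZ": "New Zealand"}


def curate(internal, external):
    """Match records on email, merge their fields, and assign surrogate keys."""
    next_key = count(1)
    merged = {}
    for record in [*internal, *external]:
        # Match: normalise the email so trivially different spellings line up.
        match_key = record["email"].strip().lower()
        if match_key not in merged:
            # Manage keys: each distinct entity gets one surrogate key.
            merged[match_key] = {"customer_key": next(next_key)}
        # Merge: later sources fill gaps but never overwrite earlier values.
        for field_name, value in record.items():
            merged[match_key].setdefault(field_name, value)

    # Reference lookup: translate codes into descriptive values.
    for row in merged.values():
        row["country_name"] = COUNTRY_REFERENCE.get(row.get("country_code"), "Unknown")
    return list(merged.values())
```

Letting the first source win on conflicts is only one possible survivorship rule; a team might instead prefer the most recent or the most trusted source.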

Model. Data scientists formulate a hypothesis, based on the business questions and available data.

They then select data mining techniques that are relevant to the problem (the more techniques, the better), and calibrate the model parameters for optimal values. Typically, there are several techniques for the same problem type. Some techniques have specific requirements on the format of the data. Therefore, going back to the previous phases is sometimes necessary.

Experimentation may require refining the machine learning model incrementally to improve performance, changing to an entirely different format of the data, or even a different interpretation or adjustment of the business question.
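To make the idea of calibrating model parameters concrete, here is a small sketch that tries several values of k for a one-dimensional nearest-neighbour classifier and keeps the one that scores best on validation data. The technique and the function names (`knn_predict`, `calibrate_k`) are assumptions picked for brevity, not the method the operating model prescribes.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict a label for x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Share of validation points that the model labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)


def calibrate_k(train, validation, candidates=(1, 3, 5)):
    """Try each candidate k and return the best one with all scores."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # the first candidate wins a tie
    return best_k, scores
```

Keeping the full `scores` dictionary, rather than only the winner, also preserves a record of the variants that were tried and discarded.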

Analyse. Data scientists and data miners continue to review and test the model performance to reduce the effect of the outliers. They select the best performing machine learning model by applying a benchmark against outcomes.

The model and analyse phases can go hand in hand and may take multiple iterations before an acceptable solution is found.
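The selection step can be sketched as a comparison of candidate models against a benchmark on held-out outcomes. In this illustration, `select_model` and the accuracy benchmark are assumed names and choices; returning `None` signals that no candidate is acceptable and another modelling iteration is needed.

```python
def select_model(candidates, holdout, benchmark_accuracy):
    """Score each candidate on held-out outcomes and keep the best if it beats the benchmark."""
    results = {}
    for name, predict in candidates.items():
        hits = sum(predict(features) == outcome for features, outcome in holdout)
        results[name] = hits / len(holdout)

    best = max(results, key=results.get)
    # Accept the best candidate only if it outperforms the benchmark.
    accepted = results[best] > benchmark_accuracy
    return (best if accepted else None), results


# Usage with toy candidates: each maps a numeric feature to a True/False outcome.
holdout = [(1, False), (4, False), (6, True), (9, True)]
candidates = {
    "threshold_5": lambda x: x >= 5,
    "threshold_8": lambda x: x >= 8,
    "always_true": lambda x: True,
}
chosen, results = select_model(candidates, holdout, benchmark_accuracy=0.8)
```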

Present. Once an acceptable solution is found, data storytellers or UX consultants use visualisation tools to develop the storyboard for the hypothesis.

They use an informative and appealing visual representation of the results and the ranking of the outcome based on various datasets.

Act. Operational staff take the necessary business actions based on the analytic findings.

Organisations that rely more on ad-hoc data exploration usually skip the ingest and curate phases.

A data-driven organisation, on the other hand, will have two additional phases.

Maintain. Data scientists and machine learning engineers will deploy the models to the production environment and re-train them at regular intervals, or as necessary, so that the models capture the changing patterns in new data, a phenomenon known as concept drift.

This will ensure the predictions continue to be accurate.
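One simple way to decide when re-training is necessary is to track accuracy over a rolling window of recent predictions whose actual outcomes are now known. The `DriftMonitor` class below is a hypothetical sketch of that idea; the window size and accuracy threshold are assumptions that would need tuning for a real model.

```python
from collections import deque


class DriftMonitor:
    """Flag a model for re-training when its recent accuracy drops too low."""

    def __init__(self, window=50, min_accuracy=0.8):
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        # Store whether the prediction matched what actually happened.
        self.outcomes.append(predicted == actual)

    def needs_retraining(self):
        # Wait until the window is full so one early miss does not trigger a re-train.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return sum(self.outcomes) / len(self.outcomes) < self.min_accuracy
```

A scheduled re-train at a regular interval can run alongside a monitor like this, with the monitor catching drift that occurs between scheduled runs.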

Predict. The productionised model continues to predict the outcome whenever new data comes in or a new user action is taken.

If the machine learning model is part of a bigger solution or platform, then the platform uses the predicted outcome to decide on features such as the default value to put in an input box or the default button to select.
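As an illustration of a platform using a prediction to pick a default, the sketch below pre-selects an option only when the model is confident enough. The `choose_default` function, the `predict_proba` callable and the 0.6 confidence threshold are assumptions for this example, not part of any specific platform.

```python
def choose_default(predict_proba, user_context, options, min_confidence=0.6, fallback=None):
    """Return the option the model favours, or the fallback if it is not confident."""
    scores = predict_proba(user_context)  # maps each option to a probability
    best = max(options, key=lambda option: scores.get(option, 0.0))
    return best if scores.get(best, 0.0) >= min_confidence else fallback


# Usage with a stand-in model: returning customers tend to pay by card.
def toy_model(context):
    if context["returning"]:
        return {"card": 0.7, "paypal": 0.3}
    return {"card": 0.5, "paypal": 0.5}


default_for_returning = choose_default(toy_model, {"returning": True}, ["card", "paypal"])
default_for_new = choose_default(toy_model, {"returning": False}, ["card", "paypal"])
```

Falling back to no default when confidence is low avoids nudging users towards a choice the model has little evidence for.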

A truly data-driven organisation will have all ten phases in its data science operating model.