Data Scientist Mindset

Dr M Maruf Hossain, PhD, GAICD
Feb 17

Updated: Feb 26

There are many definitions of data science, enough to confuse anyone, and the question remains: what is data science? So, before discussing this topic, I often lay out the definition I go by.

Data science is an interdisciplinary field that uses scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, much like Data Mining or Knowledge Discovery in Databases (KDD).

My definition of data science is rather simple: data analytics + data mining.

This definition, though, opens an avenue for discussing and defining two terms: one not so new, the other relatively new!

Originally published at LinkedIn Pulse on 17 November 2019.

Data Mining

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of computational intelligence (popularly known as artificial intelligence), machine learning, statistics, operations research, and database systems. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualisation, and online updating. Data mining is the analysis step of the KDD process.

The term is, indeed, a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of cognitive decision support systems, including artificial intelligence, machine learning, and business intelligence.

Machine learning is sometimes mistakenly used as a synonym for data mining (also known as Knowledge Discovery in Databases), where the latter subfield focuses more on getting valuable insights from a large volume of data by using techniques from statistics, machine learning, computational intelligence, operations research and database systems.

Machine Learning and other associated areas

Machine learning is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. Evolved from the study of pattern recognition and artificial intelligence, machine learning is mainly a research and academic area that explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms go beyond following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making using computers. The difference lies in the problems each field aims to tackle. While computational statistics addresses traditional data-driven problems, early machine learning algorithms focused on problems defined over very large databases (VLDB), which computational statistics was unable to solve because of their computational complexity and the enormous computing power required. With the advent of modern technology, most of the earlier problems addressed by machine learning are now also solvable by computational statistics techniques.

Machine learning also has strong ties to mathematical optimisation, which delivers methods, theory and application domains to the field.

Structured, Unstructured and Big Data

The focus of machine learning and other relevant areas was primarily on structured data. Structured data are data that have pre-defined models and are usually represented in tabular form; they can be stored in a range of formats, from plain text files to sophisticated data warehouses and data marts.

With growing computational power and a wealth of available unstructured data, a 21st-century challenge is to efficiently harness valuable information from those data. Unstructured data are data that neither have a pre-defined model nor are organised in a pre-defined manner. Merrill Lynch, the wealth management division of Bank of America, has cited a rule of thumb that somewhere around 80 to 90 per cent of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on any quantitative research but is nonetheless accepted by many. Initially, massive amounts of text (e.g., web pages, tweets etc.) were considered unstructured data, but the term soon extended from text to other forms of data such as images, graph data or streaming data (e.g., audio, video etc.).

Traditional VLDB systems have proven inefficient at handling such unstructured data, and hence a new set of architectures has been proposed. These new architectures are commonly referred to as Big Data. The most widely used big data framework is Apache Hadoop, which, like other similar frameworks, can handle petabytes of data in a relatively short time. Apache Hadoop employs Google’s MapReduce programming model to break large datasets and tasks into smaller components and run them in parallel in a distributed environment. Though the MapReduce model was introduced by Google, in 2014 the company announced that it had largely moved away from MapReduce in favour of its Cloud Dataflow service, and in the same year Apache Mahout stopped accepting new MapReduce-based algorithms. Not all big data jobs can be solved with the MapReduce model, and Apache Hadoop can be very slow when a large job involves iterative tasks. Apache Spark, an in-memory cluster computing framework for big data, can handle both MapReduce and iterative tasks more quickly than Hadoop.
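To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample text are assumptions chosen for illustration, and the chunks are processed one after another rather than on separate nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node;
    # here the chunks are handled sequentially to keep the sketch simple.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Illustrative input only: two "chunks" of a larger text.
    print(word_count(["big data big ideas", "data beats ideas"]))
```

The point of the split is that the map and reduce steps never need to see the whole dataset at once, which is what allows a framework to distribute them.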

Cognitive systems: Systems that learn

Machine learning undertakes a single problem at a time and solves it. For example, a bank may want to estimate the likelihood that customers applying for home loans who are not first home buyers are also willing to take an offset account. Targeting such a customer segment can be beneficial. Applying machine learning to a problem like that will require a team of analysts and data scientists to run a project to develop and maintain a machine learning model that identifies potential customers who will be willing to take additional products. However, there are several things to consider in order to gain long-term benefit from such projects:

When would the bank like to target customers?
How often does the machine learning model need to run or be re-trained?
Who will have ownership of the project?
If the model needs to be run on a regular basis, how effective will it be to have a dedicated team execute a monotonous process over and over?

The delay in identifying potential customers will surely cost the bank significant revenue. If they are identified in real time, however, the bank could not only provide the right set of products to its customers but also earn revenue from them from the first day.

This is just one use case. What will the process look like if the bank must take on several use cases where machine learning solutions can be deployed? To avoid the complexities of deploying multiple machine learning solutions, as well as to leverage the benefit of real-time implementation, an integrated real-time system is required. Such an integrated real-time system that comprises multiple machine learning algorithms is referred to as a Cognitive Computing System. Gartner Research defined cognitive computing as a “disruptive platform with a shift more impactful than many other technologies in the last 20 years”.

Cognitive systems augment humans’ skills, scaling up expertise and productivity for modern businesses.

Data science processes

Data science has three well-defined processes: Discover, Access and Distil, or D-A-D for short.

Discover. Find or identify the sources of good data, and the metrics. Sometimes request the data to be created (work with data engineers and business analysts).
Access. Access the data. Sometimes via an API, a web crawler, an Internet download, database access (through SQL) or sometimes in-memory within a data source (e.g., Hadoop HDFS).
Distil. Extract essence from data, the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system).
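As one small illustration of the Access step, the sketch below reads data through SQL using Python's built-in sqlite3 module. The in-memory database, the bids table and its columns are hypothetical stand-ins for whatever source the Discover step identified; a production system would connect to its own database instead.

```python
import sqlite3


def build_demo_database():
    """Create a throwaway in-memory database with a hypothetical 'bids' table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE bids (keyword TEXT, bid_price REAL)")
    connection.executemany(
        "INSERT INTO bids VALUES (?, ?)",
        [("loans", 1.20), ("offset account", 0.85), ("loans", 1.45)],
    )
    connection.commit()
    return connection


def load_rows(connection, query, params=()):
    """Run a query with bound parameters and return the rows as dicts."""
    connection.row_factory = sqlite3.Row
    cursor = connection.execute(query, params)
    return [dict(row) for row in cursor.fetchall()]


if __name__ == "__main__":
    conn = build_demo_database()
    rows = load_rows(
        conn,
        "SELECT keyword, bid_price FROM bids WHERE keyword = ? ORDER BY bid_price",
        ("loans",),
    )
    print(rows)
```

Binding the keyword as a parameter rather than formatting it into the query string keeps the access code safe to re-use with untrusted input.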

The distilling process involves:

Exploring the data (creating a data dictionary and exploratory analysis),
Cleaning (removing impurities),
Refining (data summarisation, sometimes multiple layers of summarisation or hierarchical summarisation),
Analysing: statistical analyses (sometimes including stuff like experimental design that can take place even before the Access stage), both automated and manual. Might or might not require statistical modelling.
Presenting results or integrating results in some automated process.
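The five distilling steps above can be sketched as a chain of small Python functions. This is only an illustration under stated assumptions: the records, field names and the choice of a per-segment mean are invented for the example, and real projects would use far richer exploration and analysis.

```python
from statistics import mean

# Hypothetical raw records, including a missing value and an untidy label.
RAW_RECORDS = [
    {"segment": "first home buyer", "loan_amount": "420000"},
    {"segment": "investor", "loan_amount": "650000"},
    {"segment": "investor", "loan_amount": ""},
    {"segment": "First Home Buyer ", "loan_amount": "380000"},
]


def explore(records):
    """Explore: a tiny data dictionary (field -> number of non-empty values)."""
    fields = {}
    for record in records:
        for name, value in record.items():
            fields[name] = fields.get(name, 0) + (1 if str(value).strip() else 0)
    return fields


def clean(records):
    """Clean: normalise labels and drop rows whose amount is missing."""
    cleaned = []
    for record in records:
        amount = record["loan_amount"].strip()
        if not amount:
            continue
        cleaned.append(
            {"segment": record["segment"].strip().lower(), "loan_amount": float(amount)}
        )
    return cleaned


def refine(records):
    """Refine: summarise the cleaned rows by grouping amounts per segment."""
    by_segment = {}
    for record in records:
        by_segment.setdefault(record["segment"], []).append(record["loan_amount"])
    return by_segment


def analyse(by_segment):
    """Analyse: one simple statistic (mean loan amount) per segment."""
    return {segment: mean(amounts) for segment, amounts in by_segment.items()}


def present(summary):
    """Present: format the results for a report or a downstream process."""
    return [f"{segment}: {value:,.0f}" for segment, value in sorted(summary.items())]


if __name__ == "__main__":
    print(explore(RAW_RECORDS))
    print(present(analyse(refine(clean(RAW_RECORDS)))))
```

Keeping each step as its own function makes it easy to swap one stage (for example, a different cleaning rule) without touching the others.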

What makes a person a data scientist?

There are so many areas of study involved in data mining that it becomes obvious that a data scientist is not just a person with a set, or even a subset, of those skills alone; it is also commonly said that one needs business acumen. Multiple Venn diagrams are floating around the Internet with anywhere between 3 and 10 circles and numerous combinations of intersections, which makes one thing clear: a Venn diagram is not the most appropriate visualisation technique to represent this information.

A 2012 Harvard Business Review article called it the sexiest job of the 21st century. In 2018, one of its authors, Tom Davenport, wrote, “a more accurate subtitle might have been ‘Sexiest Job of the 2010-2019 Decade,’ because I am not sure how much longer data scientists will be in great demand.”

Though data scientists’ skillsets are heavy on tech, data scientists serve the business. It is not a research and development role. So obviously, one must think business. Chances are one will learn the business on the job. Talking to business analysts is the key to understanding the business. Without understanding the business, it is not possible to give value back to it. And without giving back to the business one cannot grow. Bootcamps, tutorials, training etc. can only teach technologies and algorithms. So, it is important to be a team player.

Characteristics of a data scientist

Apart from being a team player, one must be curious. He needs to ask questions. Data-centric questions. Questions that can be answered with data. However, not all questions can immediately be answered with the data at hand. The gaps in the data that make it impossible to answer the questions need to be found. Is there a data provider? Can more data be requested from the provider? Or can external sources be identified for the missing data?

Being a scientist is about having a mindset, not a qualification or a technical aptitude. It is the mental attitude and aptitude to take on and solve problems through a process of hypothesis and experimentation. The weapon of choice may be a Bunsen burner or a Spectrometer or R Statistical Package – it does not matter. What matters is the desire for knowledge and understanding, even better if one is of the mindset that he wants to share it with everyone afterwards.

The process of scientific experimentation is what has driven invention for hundreds of years. It does not even matter if one does not get the results he expects – it is the journey that counts. William Perkin was trying to make synthetic quinine and ended up revolutionising dye-making. Even something as seemingly complex as the implantable cardiac pacemaker was invented by accident, in that case by Wilson Greatbatch. Other examples of accidental discoveries include the microwave oven, Teflon and Coca-Cola. So, it is important to be open to unexpected end results.

My work at Taskforce Eligo was one of the key moments in my career. The task force was initially planned to become public, operational and productive at the end of three years of data processing, planning and analysis. But it was made public after just 9 months, because the task force confiscated $580 million in cash, drugs and assets and made a hundred arrests throughout Australia, based on leads generated by my algorithm. The success of Taskforce Eligo led to my placement on a more critical project, national security operations, in 2013.

My findings there were the opposite. I could not identify a single lead! Being true to the data, I made a bold statement based on my analysis that the Australian Muslim community has no direct link with the Islamic State. My report was forwarded to the PM&C. Eventually, it saved billions of taxpayers’ dollars from being spent on fighting an unreal threat. So, one must have confidence in his analysis! To be confident in an analysis, it is important that it answers the key business questions and avoids any pitfalls.

A data scientist must possess a passion for solving problems. A data scientist needs to go beyond identifying and analysing a problem – he needs to solve it. An abundance of data does not necessarily mean an abundance of good data. If someone simply runs data through a block of code, he will not have a successful solution. Successful data scientists do not just have the biggest data or implement the most advanced algorithm; they solve a problem. It is the people who have an innate drive to find solutions for the right problem who will be the most successful as data scientists.

Re-using the code base is also important. This is a characteristic of programmers: they build code in such a way that they do not have to re-write it when a similar task comes up in future. This is a quality all data scientists need to adopt, making sure that they also write their code in a way that lets it be re-used, without copying the codebase, across different projects that involve a similar process.
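A small Python sketch of what re-usable code can look like: instead of copying a summary calculation into each project, one parameterised function is written once and called wherever it is needed. The function name summarise and the sample figures are assumptions made for illustration.

```python
from statistics import mean, median


def summarise(values, label="value"):
    """Return basic summary statistics for any numeric sequence."""
    values = list(values)
    if not values:
        raise ValueError(f"no data supplied for {label!r}")
    return {
        "label": label,
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # The same function serves two unrelated (hypothetical) projects.
    print(summarise([2, 4, 6], label="loan_term_years"))
    print(summarise([31, 45, 27, 52], label="customer_age"))
```

In practice such helpers would live in a shared module or package that each project imports, so a fix made once benefits every project.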

Acting Like a Data Scientist

If someone is new to data science, they could consider the following six easy ways so that they can start acting like a data scientist:

Clean the data. Data scientists know the value of having clean data. Cleaning data, or wrangling data, is the process of finding errors in the dataset and fixing them in preparation for analysis. This includes:

typos,
mislabelled items,
incomplete information,
inaccuracies, and
irrelevancies.

It is important to be vigilant with the data. By putting the time and effort into ensuring the data quality is reliable, a data scientist will garner the best results. The output is only as good as the data that is going into it.
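A minimal cleaning sketch in Python, assuming records arrive as dictionaries. The field names (name, state), the table of known typos and the decision to set incomplete rows aside rather than fill them in are all illustrative assumptions, not a prescribed method.

```python
def clean_records(records, required=("name", "state"), label_fixes=None):
    """Return (clean_rows, rejected_rows) after basic wrangling."""
    label_fixes = label_fixes or {}
    clean, rejected = [], []
    for record in records:
        # Trim stray whitespace in every text field.
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        # Correct known typos and mislabelled values; otherwise just upper-case.
        state = row.get("state") or ""
        row["state"] = label_fixes.get(state.lower(), state.upper())
        # Set aside rows with incomplete information instead of guessing.
        if any(not row.get(field) for field in required):
            rejected.append(row)
        else:
            clean.append(row)
    return clean, rejected


if __name__ == "__main__":
    records = [
        {"name": " Alice ", "state": "vic"},
        {"name": "Bob", "state": "Victora"},  # typo in the label
        {"name": "", "state": "NSW"},         # incomplete record
    ]
    fixes = {"victora": "VIC", "victoria": "VIC"}
    good, bad = clean_records(records, label_fixes=fixes)
    print(good)
    print(bad)
```

Returning the rejected rows, rather than silently dropping them, keeps a record of how much data was lost and why.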

Think critically. Weird correlations exist, and good data scientists will be able to recognise the difference between correlation and causation. Are most of the customers really from the postcode 3000? A good data scientist would think critically, question whether a value like this is genuine, and set aside such faulty data.

When analysing data, the following questions may come in handy:

Are there other variables that account for this outcome?
What was the size of the sample used in this dataset?
Is this a correlation or causation?
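Some of these checks can be partly automated. The Python sketch below flags a small sample and a single value that dominates a column, as in the postcode example. The thresholds (30 rows, a 50 per cent share) are arbitrary assumptions for illustration, and a flag is only a prompt to investigate, not proof that the data are faulty.

```python
from collections import Counter


def dominant_value(values, threshold=0.5):
    """Return (value, share) if one value exceeds `threshold` of the sample, else None."""
    values = list(values)
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    share = count / len(values)
    return (value, share) if share > threshold else None


def sanity_checks(values, min_sample=30, threshold=0.5):
    """Collect human-readable warnings about a single column of data."""
    values = list(values)
    warnings = []
    if len(values) < min_sample:
        warnings.append(f"small sample: {len(values)} rows")
    hit = dominant_value(values, threshold)
    if hit:
        value, share = hit
        warnings.append(f"{value!r} accounts for {share:.0%} of rows")
    return warnings


if __name__ == "__main__":
    # Hypothetical postcode column in which one value looks suspiciously common.
    postcodes = ["3000"] * 7 + ["3121", "3053", "3182"]
    for warning in sanity_checks(postcodes):
        print(warning)
```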

Avoid analysis paralysis. It is easy to get overwhelmed with data and then do nothing with it. One needs to start somewhere. To prevent inaction, start with the “low-hanging fruit”. Decide on at least three insights that can be captured from the data and use any analytics tool to create a story.

A good mindset to have is to be curious. Data scientists love to follow hunches, discover insights, and answer questions. Once there is a story to visualise, it’ll be easier to take the appropriate action. A good starting point is to do some descriptive analytics.

Blend datasets. By blending datasets from different sources, one can establish a broader understanding of the business and find valuable insights. He should start by importing all the different data sources across the business into an analytics tool, then create a series of charts with the KPIs for each business function. By seeing the organisation as a whole, he may discover insights that may have gone unnoticed before.
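At its simplest, blending means joining records from two sources on a shared key. The Python sketch below performs a left join on lists of dictionaries; the sources (sales and support), the customer_id key and the figures are hypothetical, and an analytics tool or dataframe library would normally do this work.

```python
def blend(left, right, key):
    """Left-join two lists of dicts on `key`, keeping every row from `left`."""
    # If `right` repeats a key, the last matching row wins.
    lookup = {row[key]: row for row in right}
    blended = []
    for row in left:
        match = lookup.get(row[key], {})
        # Fields from `left` take precedence when both sides share a name.
        blended.append({**match, **row})
    return blended


if __name__ == "__main__":
    sales = [
        {"customer_id": 1, "revenue": 1200},
        {"customer_id": 2, "revenue": 300},
    ]
    support = [{"customer_id": 1, "tickets": 4}]
    for row in blend(sales, support, "customer_id"):
        print(row)
```

Customers with no match in the second source are kept with their original fields only, which makes gaps between sources visible instead of hiding them.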

Design a dashboard. Data science is more than just analysis; it is about presenting the findings in clear and compelling stories. It is important to choose the charts and graphs that will best tell that story. Visual communication is also a great way to convey complex information to an audience.

Communicate. Ideas are like a virus. They spread when people share. For one to share his insights and help them spread, he needs to be able to communicate effectively in print, vocally and on stage.

Concluding remark

Any execution would fall flat without the proper foundation. In the beginning, one must learn the basics. There is no escaping the magnitude of the impact the fundamentals have on one’s career as a data scientist.