What is Data Science?

2. What is Data Science?

In short, data science is about extracting and communicating relevant insights from complex data with the help of digital techniques.

Unlike established fields such as mathematics, physics, or history, data science is relatively new. If you ask ten data scientists to define their field, you will likely get ten different answers. Some view it as a distinct discipline, others as a technical approach or mindset, and still others consider it synonymous with statistics. Many authors have contributed definitions or descriptions of what data science is [Cao, 2017, Donoho, 2017, Blei and Smyth, 2017, Carmichael and Marron, 2018, Grus, 2019], but for now let’s start with a rather general and accessible description of data science as

the art of gaining and communicating insights from complex data through digital techniques.

Many quantitative scientists would argue that they do similar work, as they strive to learn from data and use digital tools extensively. This overlap does not diminish the importance of data science; it simply indicates that many scientists must also be data scientists to stay current in their fields. Rapid advancements in digital techniques, including machine learning, are transforming many research areas.

Opinions on what exactly data science is vary, often depending on the application area. In consulting and business, data science might mean something different than in academia. However, most agree on a Venn diagram that is frequently used to illustrate data science [Carmichael and Marron, 2018, Conway, 2010]: the intersection of Digital Techniques, Statistics, and Domain Expertise.

Figure 1. Venn diagram to indicate the intersection of fields for data science.

2.1. Data is Nothing New. So Why Data Science Now?

Data has been a cornerstone of human understanding for millennia, from ancient civilizations keeping records of harvests and astronomical observations to modern businesses tracking sales and performance. Data in itself is clearly not a new concept. However, the emergence and ascendancy of data science as a discipline is a relatively recent phenomenon. So, why now?

The prominence of data science in today’s world can be attributed to several concurrent developments:

(1) The exponential increase in the volume of data generated. Thanks to digitalization and the rise of the Internet, mobile devices, and the IoT (Internet of Things), we are producing data at a previously unimaginable scale. This big data presents both a challenge and an opportunity: the challenge is how to handle and process such a vast amount of information, and the opportunity lies in the valuable insights that can be gleaned from it.

This is accompanied by an increased recognition of the importance of data-driven decision-making across diverse sectors [Provost and Fawcett, 2013]. Various industries, governments, and institutions have realized that leveraging the power of data can lead to increased efficiency, better decision-making, and a competitive advantage.

The existence (and appreciation) of ever larger amounts of data can be seen as the substrate for the rise of data science, but working with such data properly required a combination of several other developments (Fig. 2.1).

(2) The evolution and expansion of statistical methodologies has been a key driver. Statistics provides the foundational principles and techniques for analyzing data, making inferences, and predicting future trends. In the era of big data, classical and modern statistical techniques form the backbone of most analyses in data science. The relationship is so close that in the 1990s, statisticians such as Jeff Wu even suggested renaming statistics to “Data Science”, a debate that is still ongoing [Carmichael and Marron, 2018]. Despite all the overlap, both terms still exist and usually mean related but different things (see also [Hassani et al., 2021]).

(3) The strides we’ve made in data handling capabilities have greatly facilitated the rise of data science. This includes the drastic advances in computational power and storage capacity that made it possible to collect, store, and analyze massive datasets, as well as many developments from computer science, such as databases. Just a few decades ago, working with the volumes of data we deal with today would have been impractical, if not unimaginable.

(4) There has been significant progress in the field of algorithms, which also includes machine learning. Algorithms are at the heart of nearly every tool that we use as data scientists for understanding and interpreting data. They range from optimization methods dating back more than 200 years (e.g., the method of least squares) all the way to current deep learning approaches. These advancements have opened up new possibilities for predictive analytics, automation, and artificial intelligence.
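To make the least squares example a bit more concrete, here is a minimal sketch of fitting a straight line to a handful of points using the closed-form least squares solution. The function name `fit_line` and the data are made up for illustration; in practice you would typically rely on an established library rather than writing this by hand.

```python
def fit_line(xs, ys):
    """Fit y ≈ slope * x + intercept by ordinary least squares.

    Hypothetical helper for illustration only. It uses the closed-form
    solution that minimizes the sum of squared vertical distances
    between the line and the data points.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(f"slope={slope}, intercept={intercept}")
```

With noisy real-world data the points will not fall exactly on a line, and the same formulas then return the line with the smallest total squared error.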

(5) Lastly, the often-underestimated field of data visualization has seen revolutionary advancements. Effective data visualization makes complex data more comprehensible, accessible, and actionable. The development of powerful visualization tools enables us to present data in a visually compelling manner that fosters understanding and drives informed decisions [Munzner, 2014, Healy, 2018, Wilke, 2019].

So, while data is not new, the volume of data, our ability to process it, and the recognition of its value are. These changes have given rise to the burgeoning field of data science, marking a new era in our relationship with data.

Fig. 2.1 The concurrent developments leading to Data Science [1].
