28. What are the next steps?

Once you have familiarized yourself with the basics of data science, you can take many different exciting routes.

28.1. Domain-specific knowledge and techniques

While many of the basic techniques can be applied to a wide range of domains and use cases, every domain comes with its own data types, questions, and challenges.

In nearly all cases I have seen, data scientists specialize over time in certain techniques, data types, or domains. This makes sense: the better you understand the data you work with, as well as the underlying questions and related difficulties, the better you can extract new knowledge from such data.

Let’s take a few examples.

28.1.1. Geo-spatial data

Working with geographic data, maps, and the like is relevant to many fields, for instance urban development, agriculture, and logistics, but also geopolitics and more. In all those cases, you would have to learn the basics of how to interpret geographic or geo-spatial data. How do you compute distances or areas (unless you consider the earth to be flat… well, but then you are probably no data scientist…)?

In Python, there are many libraries that can help you work with such data. For an introduction to the topic, have a look here: https://geographicdata.science/book/intro.html
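To make the distance question above a bit more concrete, here is a minimal sketch that uses only the Python standard library. It assumes a perfectly spherical Earth with a mean radius of 6371 km and uses the haversine formula, which is one common choice for great-circle distances rather than the only one. The function names and the city coordinates are illustrative assumptions; dedicated geo-spatial libraries handle this more accurately.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flat_distance_km(lat1, lon1, lat2, lon2):
    """Naive 'flat map' distance: treats degrees as planar coordinates."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    return km_per_degree * math.hypot(lat2 - lat1, lon2 - lon1)


# Approximate city-center coordinates (latitude, longitude), for illustration
berlin = (52.52, 13.405)
paris = (48.8566, 2.3522)

print(f"haversine: {haversine_km(*berlin, *paris):.0f} km")
print(f"flat map:  {flat_distance_km(*berlin, *paris):.0f} km")
```

The flat-map version overestimates the distance considerably here, because one degree of longitude covers far fewer kilometers at these latitudes than at the equator.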

28.1.2. Biological or medical data

If you have ever looked at genetic datasets, you will quickly understand that simply running a copy-and-paste data science script on them might not be the way to go. Biological or medical data often involve high-dimensional datasets, complex structures, and stringent data privacy and security requirements. This data can range from electronic health records to genomic sequences, each with its own set of challenges for data processing, pattern recognition, and interpretation.

In the medical domain, you must also account for the ethical considerations and regulatory compliance requirements when handling data. This might include understanding the intricacies of GDPR in Europe, HIPAA in the United States, or other relevant legislation. Furthermore, the stakes are high; the insights derived from data can directly affect patient outcomes. Therefore, a deep understanding of statistical methods that are robust against false discoveries is paramount, and it’s often necessary to work closely with domain experts to ensure the results are clinically relevant and interpretable.
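As one illustration of what "robust against false discoveries" can mean in practice, the sketch below implements the Benjamini-Hochberg procedure, a widely used way to control the false discovery rate when many hypotheses are tested at once. The choice of this particular method, the function name, and the p-values are assumptions made for this example; the p-values are made up and do not come from a real study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Controls the false discovery rate at level alpha (assuming independent
    or positively dependent tests).
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject all hypotheses up to and including that rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values from ten hypothetical tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)

print("significant without correction:", sum(naive))
print("significant after Benjamini-Hochberg:", sum(corrected))
```

With these made-up numbers, five tests fall below 0.05 individually, but only the two smallest p-values survive the correction.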

28.2. Software Development and Data Engineering

Hopefully, this book helped to convince you that you can do very complex and detailed data analysis with a rather limited amount of code (because we can rely on many great Python libraries!). Data science, however, is more than just creating and running Jupyter Notebooks. No doubt, those notebooks are great for fast, exploratory work, and they are very often used to run data science workflows. But they come with serious limitations, too.

As soon as you want to build something longer-lasting (maybe something that should work robustly for months or even years), or something larger in scope (maybe a project so complex that a full team will work on it), Jupyter notebooks can no longer fulfill all requirements. This is the point when you will have to acquire new data science super-powers that go beyond the basics covered in this book, most of all better (Python) software development skills! This includes the ability to properly package and version Python code, to add automated testing pipelines, and to implement tools that help with code quality, security, and robustness.
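To give a first impression of what automated testing looks like, here is a minimal sketch using `unittest` from the Python standard library. The function `min_max_scale` is a hypothetical helper invented for this example. In a real project, the function and its tests would usually live in separate files of a package and be run automatically, often with a test runner such as pytest.

```python
import unittest


def min_max_scale(values):
    """Scale numbers linearly to the range [0, 1] (hypothetical helper)."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        # All values are identical, so there is no spread to scale
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


class TestMinMaxScale(unittest.TestCase):
    def test_scales_to_unit_interval(self):
        self.assertEqual(min_max_scale([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input(self):
        self.assertEqual(min_max_scale([3, 3]), [0.0, 0.0])

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            min_max_scale([])


# Run the tests explicitly, which also works inside a notebook cell
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMinMaxScale)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The point is less the specific function than the habit: every time the code changes, the tests can be rerun to check that the expected behavior still holds.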

Another common scenario in data science is that the data handling side goes far beyond simply importing a modest-sized CSV file. Think of really big data, for instance data that is too big to fit into memory, or data that has to be processed in real time. That is where data engineering and high-performance computing come into play. This includes topics such as efficient data storage, retrieval, and processing. You might want to learn how to use specific tools for big data scenarios, such as Spark, as well as how to work with cloud platforms. But a first step often starts within your own code base, namely with profiling and optimizing your algorithms and code. Python is a great language for data science, but it needs specific tricks to make code run fast and efficiently. At some point you will probably have to worry about parallelization and compiling, and about using libraries such as numba or dask. But that is all material for another book…
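As a small taste of the profiling step mentioned above, the following sketch uses `cProfile` and `pstats` from the standard library to compare two ways of solving the same toy problem. Both functions, the helper `profile_call`, and the data are made up for illustration; real profiling sessions usually target much larger pieces of code.

```python
import cProfile
import io
import pstats


def count_duplicates_slow(values):
    """Count values that occur more than once, using a quadratic scan."""
    duplicates = 0
    for i, value in enumerate(values):
        # Slicing plus 'in' scans the list again on every iteration
        if value in values[:i] or value in values[i + 1:]:
            duplicates += 1
    return duplicates


def count_duplicates_fast(values):
    """Same result, but with a dictionary of counts built in one pass."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return sum(1 for value in values if counts[value] > 1)


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, stream.getvalue()


data = [i % 500 for i in range(2000)]  # made-up data with many repeats

slow_result, slow_report = profile_call(count_duplicates_slow, data)
fast_result, fast_report = profile_call(count_duplicates_fast, data)

print(slow_report)
print(fast_report)
```

Both functions return the same number, but the reports show where the time is spent, which is usually the first thing to find out before optimizing anything.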