I wanted to write a series of articles that explore data science from a high-level perspective. These articles aim to answer the following questions:
- WHAT data science?: What is data science and how did it come to be? When exactly did it become a field of knowledge and what developments led to its formation?
- WHEN data science?: When should you use data science, and when is it a good fit for the problems you are facing? What type of problems can data science solve? When is data science a bad fit for creating a solution?
- WHY data science?: Why is it an important field of study, and why should you spend time familiarizing yourself with the topic? Why do companies spend money on data science and why is there so much momentum behind it?
- HOW data science?: How do you apply data science to solve problems, and what's the general workflow of a data science project?
These are a lot of questions, but hopefully, after finishing this series you'll have answers for all of them.
The goal is not to provide you with practical data science knowledge (maybe a future series will handle that); it is to give you a general idea of this exciting field. After reading this you will be in a much better position to decide if you want to dive deeper into the topic and become a data scientist, or if it might not be a good fit for your career aspirations.
Let's start answering our first question: What is data science?
Cut to the chase: what's data science?
As the name implies, data science is a discipline centered around the analysis of... well, data.
You will find several different definitions of data science on the internet. What most of them have in common is that data science is a discipline centered on the collection, manipulation, analysis, and interpretation of data. This encompasses techniques and knowledge from different disciplines that contribute to the goal of extracting valuable insights from data. Among these disciplines you can find:
- Probability and statistics. It uses techniques from this field for the analysis of data.
- Data mining. It relies on the knowledge of this field for finding patterns in structured and unstructured data.
- Machine learning. It uses the techniques and technology of the field to extract valuable information from data and create models.
- Big data. It uses big data techniques and technology to enable the processing and analysis of petabytes of data.
- Business analytics (or other forms of analytics). Like analytics, it is responsible for handling data from the beginning of the process (collection) to the last stages (analysis and communication of the insights extracted).
- Data visualization. It uses data visualization to aid data scientists in the analysis of data and to assist them in the task of communicating the results to other professionals.
Based on the original comic made by SandSerif
The age of the internet brought an unprecedented explosion in the amount of information generated every day. The analysis of this volume of data without the aid of technology became impossible. This led to the development of the field we know today as data science.
The goal of data science is to improve decision making by grounding choices in real-world data. It enables humans to extract valuable insights from data that would go unnoticed otherwise.
These insights need to be both non-obvious and useful. Non-obvious means that a human being would not be able to find the pattern by themselves, and useful means that it's possible to take action as a result of the knowledge gained. These actions could be things like:
- Identify tendencies in consumers and offer them products they are more likely to enjoy or find useful, and therefore buy.
- Find out which subscribers to your service are likely to leave you for a competitor and take action in time to regain their goodwill.
- Identify abnormal events to prevent fraud or terrorist attacks.
- Identify specific types of cancer and provide a more effective treatment for every particular case.
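To make one of these actions more concrete, here is a minimal sketch of flagging abnormal events in a list of transaction amounts. It is only an illustration: the function name `flag_abnormal`, the threshold, and the sample amounts are made up for this example, and it uses a simple robust statistic (the modified z-score, based on the median) rather than whatever a production fraud system would actually rely on.

```python
from statistics import median


def flag_abnormal(amounts, threshold=3.5):
    """Return the positions of values that sit unusually far from the median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The threshold of 3.5 is a common rule of thumb,
    assumed here for illustration.
    """
    med = median(amounts)
    mad = median([abs(x - med) for x in amounts])
    if mad == 0:
        # All values are (almost) identical, so nothing stands out.
        return []
    return [
        i for i, x in enumerate(amounts)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Made-up transaction amounts; one of them is clearly out of the ordinary.
transactions = [12.5, 9.9, 14.2, 11.0, 13.7, 10.4, 950.0, 12.1]
suspicious = flag_abnormal(transactions)
print(suspicious)
```

With eight values the odd one out is easy to spot by eye. The point is that the same few lines keep working when the list has millions of entries, which is where hand analysis stops being an option.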
Data science differs from the traditional data analysis approach in its reliance on automated tools for the analysis of data. Analyzing the volume of data currently available by hand is impossible, which is why machines have become every data analyst's best friend.
The definition I think best describes the field is the following:
Data science is a multidisciplinary approach to the automatic gathering, transformation, analysis, and presentation of data to extract valuable and actionable insights from it.
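The four stages in this definition can be sketched as a tiny pipeline. This is a toy example under stated assumptions, not a real project: the subscription data is invented, the function names (`gather`, `transform`, `analyze`, `present`) are hypothetical, and a real pipeline would pull data from databases, APIs, or logs instead of a string.

```python
import csv
import io
from collections import defaultdict

# Invented sample data standing in for a real data source.
RAW = """plan,months_active,cancelled
basic,3,yes
basic,14,no
premium,20,no
basic,2,yes
premium,8,no
premium,1,yes
"""


def gather(text):
    """Gathering: read raw records (here, CSV text) into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: convert raw strings into typed, clean records."""
    return [
        {
            "plan": row["plan"],
            "months": int(row["months_active"]),
            "cancelled": row["cancelled"] == "yes",
        }
        for row in rows
    ]


def analyze(records):
    """Analysis: compute the share of cancelled subscriptions per plan."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for rec in records:
        totals[rec["plan"]] += 1
        cancelled[rec["plan"]] += rec["cancelled"]  # True counts as 1
    return {plan: cancelled[plan] / totals[plan] for plan in totals}


def present(rates):
    """Presentation: turn the numbers into something a person can read."""
    return "\n".join(
        f"{plan}: {rate:.0%} cancelled" for plan, rate in sorted(rates.items())
    )


rates = analyze(transform(gather(RAW)))
report = present(rates)
print(report)
```

Keeping each stage in its own function mirrors the definition: every step can be automated, tested, and swapped out independently as the data source or the question changes.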
While it might seem as if the field is a new hot thing, most of the foundations data science builds upon have been around for a while. We will now talk about important developments that led to the creation of the field we nowadays know as data science.
One notch on a stick, two notches on a stick
The creation of this field didn't happen overnight. Humanity has been developing its tools and foundations for millennia, way longer than you would imagine. Let's recap some of the most important developments that led to the field we know today as data science:
Before 41000 BC: Humans start writing marks on bones and sticks to keep track of events or perform simple calculations. We start creating rudimentary mathematical concepts and keeping track of natural phenomena.
3200BC-2000BC: Mesopotamians invent bookkeeping and start keeping track of commercial transactions and other types of information. Egyptians realize that having an idea of the resources available is good for budgeting food, collecting taxes and raising armies, so they start running periodic censuses across the empire. Past this point, we just keep getting better and better at writing records and raising taxes.
800AD-1900AD: We invent statistics. As the name implies, its goal is to deal with information about the state: demographics, economics, and other things. We soon find out we can apply these techniques to all sorts of data, and statistics starts to play a very important role in science and engineering. We also invent probability and probability distributions, enabling us to perform statistical learning. Gauss and Legendre develop the least-squares method and William Playfair creates the field of data visualization. Oh, and we create devices for analyzing large amounts of data: computers.
1900AD-1950AD: Turing lays the theoretical foundations of modern computers, and the invention of the transistor catalyzes the creation of more powerful electronic components. Research in intelligent systems starts and we create theoretical models for what will eventually become modern machine learning concepts, like neural networks. Humanity starts to employ computers for analyzing large amounts of data. The foundations of information theory are created, and the use of multivariate statistical techniques becomes widespread.
1950AD-2000AD: A lot of interest in AI leads to the development of modern techniques like SVMs, k-means clustering, decision trees and backpropagation in neural networks. Edgar Codd publishes a paper on the relational data model, leading to the development of technologies like relational databases and SQL. Data warehouses are born as a result of companies needing to centralize their information processing and integrate huge amounts of data. The challenges of analyzing large amounts of data give rise to the field of data mining (originally known as knowledge discovery in databases). The term data science starts to be adopted in the 90s as a result of the discussion about using computers to analyze large amounts of data. The age of the internet begins.
2000AD-NOW: Mobile technologies and massive internet availability increase the amount of daily data produced to unprecedented levels. Companies invest in the creation of software tools and libraries (like Hadoop and Spark) for the analysis of massive amounts of data. Powerful GPUs and other hardware are available at consumer prices. Universities start offering data science as an option for students. Data science gathers lots of interest and becomes a mainstream field with lots of useful applications.
Cool, but why now?
Well, if I had to guess I'd say it's the combination of 3 factors:
Availability of powerful hardware and libraries: Until recently, the computational cost of many of the most powerful data science techniques was beyond the reach of most organizations and individuals. Nowadays, you can use your own personal computer and open source libraries to create useful data applications, and there is plenty of literature available to teach you how to do this.
The amount of data available: With modern computers and the internet, we became pretty good at gathering data and making it available. The availability of massive datasets dramatically improved the accuracy of the models we can create.
Belief in the opportunities of data-driven decision making: Companies and organizations realized the power hidden behind the huge datasets produced every day. Realizing that powerful and important insights that would otherwise go unnoticed can be gained from data led organizations to invest in the advancement of the field.
Great, but what can I use it for?
We just finished explaining what data science is.
In the next article of the series, we will see some examples where data science is a good fit. This is an incredibly exciting field with lots of applications and opportunities! So let's see which types of problems we can solve with it.
What to do next
- Share this article with friends and colleagues. Thank you for helping me reach people who might find this information useful.
- The comic for ML is based on an original comic made by SandSerif
- This series is based on the MIT Essential Knowledge series books on data science and machine learning. These and other very helpful books can be found in the recommended reading list.
- Send me an email with questions, comments or suggestions (it's on the About Me page).