Data Science


Data Science: An Overview


In simple terms, Data Science is the study of data. It is a multi-disciplinary field in which people use specially designed techniques and systems to extract insights from any kind of data, whether structured or unstructured. The data could be ordinary data held in local databases, or it could be Big Data spread across multiple clusters around the world. The field is multi-disciplinary because it requires an understanding of mathematics, computer science skills, and business strategy and acumen. That is Data Science in a nutshell.

Following that line of thought, Data Scientists are people who have both the technical and the analytical skills needed to solve problems by studying data and extracting valuable insights from it. Data Analysts can be seen as a subset of Data Scientists: they focus on analytical skills and so are more inclined towards preparing and analysing data. They may not have detailed technical knowledge of programming or system architecture, but they have enough analytical skill to dissect data using the available tools and techniques. People with knowledge of statistics, computer science, and the business domain are therefore best suited to work as Data Scientists or, as applicable, as Data Analysts. A Data Scientist would typically have more technical insight than a Data Analyst.

As computer systems have become faster, more advanced, and cheaper, and as they have reached more and more users, one obvious consequence has followed: the generation of huge amounts of data. People use systems such as Google, Amazon, Facebook, Twitter, YouTube, WhatsApp, Instagram, online maps, scientific research systems, and countless websites and portals, and generate huge amounts of data every day. Data storage has also advanced to the point where computer systems can store all of this data easily, without fear of losing it. We now measure data in terabytes, petabytes, exabytes, and even zettabytes. With such an abundance of data, we need a scientific approach to dig meaningful information out of it, and that is where Data Science comes into the picture. Pretty much all science is Data Science, in the sense that we:
  1.  Observe data
  2.  Make Predictions
  3.  Test the Predictions
  4.  Update our ideas or get new insights
For example, given a dataset of meteorite landings on Earth over the past 10 years, we can come up with questions that the data might answer, such as:
  1. What area is most likely to get hit by meteorites?
  2. How does atmospheric pressure affect a meteorite's trajectory?
We can then write a little code that trains a machine learning model on that data and predicts the answer. We can use an existing model, and there are a lot of them, or build our own. This looks complex, and traditionally it would have needed a PhD, but the truth is that with digital data doubling every two years (figure) and machine learning algorithms getting more powerful, anyone can become a Data Scientist, provided they have two things: time and motivation. Machine learning democratises scientific discovery.
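
The observe, predict, test, and update cycle described above can be sketched in a few lines of Python. This is only a minimal illustration: the landing records and region names are invented, and a simple frequency count stands in for a trained machine learning model.

```python
from collections import Counter

# Hypothetical records: (year, region) for each recorded landing.
# These values are invented purely for illustration.
landings = [
    (2015, "north"), (2015, "south"), (2016, "north"),
    (2016, "north"), (2017, "east"), (2017, "north"),
    (2018, "south"), (2018, "north"), (2019, "north"),
    (2019, "east"),
]

# 1. Observe: count landings per region in the earlier years.
train = [region for year, region in landings if year < 2019]
test = [region for year, region in landings if year >= 2019]
counts = Counter(train)

# 2. Predict: the most frequent region so far is our best guess.
predicted_region = counts.most_common(1)[0][0]

# 3. Test: how often does the guess match the held-out year?
hits = sum(1 for region in test if region == predicted_region)
hit_rate = hits / len(test)

# 4. Update: fold the new observations back into the counts.
counts.update(test)

print(predicted_region, hit_rate)
```

With real data, the counting step would usually be replaced by a proper model, but the four-step loop stays the same.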

Analytics vis-à-vis Data Science

As per Niall Sclater, Analytics is the analysis of data, typically large sets of data, by the use of mathematics, statistics, and computer software.

As per Dimitris Bertsimas, Analytics is the science of using data to build models that lead to better decisions that in turn add value to individuals, companies, and institutions.

There are basically three types of Analytics:
  1. Descriptive: Describes what has already happened, based on a study of an existing dataset
  2. Predictive: Tells us what will probably happen in the future, based on trends seen in past data
  3. Prescriptive: Helps us prescribe the right course of action, based on Data Science results
So, Analytics and Data Science are like two sides of the same coin. While Data Science gives us the picture from a large heap of data, Analytics gives us the meaning of that picture. The person on the Data Science side is called a Data Scientist, and the person on the Analytics side a Data Analyst.
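
To make the three types of Analytics concrete, here is a small sketch using only the Python standard library. The monthly sales figures and the stock level are assumptions invented for the example, and the straight-line forecast is just one simple way to do the predictive step.

```python
from statistics import mean

# Hypothetical monthly unit sales, invented for illustration.
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: summarise what has already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares line and extend it one month ahead.
mean_x, mean_y = mean(months), mean(sales)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x
forecast_month_7 = slope * 7 + intercept

# Prescriptive: turn the forecast into a recommended action.
# The stock level of 150 units is an assumed threshold.
stock_on_hand = 150
action = "reorder" if forecast_month_7 > stock_on_hand else "hold"

print(average_sales, forecast_month_7, action)
```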

Terms used in Data Science
  • Artificial Intelligence: AI is where a computer behaves like a human. This is done by writing programs, built on algorithms, that let a system make human-like decisions based on the available information through repetitive learning.
  • Machine Learning: This is where a computer or machine learns by itself. It does so through training, and it improves as it is given more inputs. Training the computer means providing the system with various sets of data that define something. For example, we could provide the system with 5 years of data on the shoe brands purchased by male and female customers with varying foot sizes in the retail shops of a region. With this input data, the system can progressively learn to identify which brand of shoe a male or female customer with a particular foot size would probably buy. That is possible because the system gradually learns the pattern linking brand selection to customer gender and foot size.
  • Deep Learning: Machine Learning is a part of AI, and Deep Learning is a part of Machine Learning. Deep Learning uses neural networks with many layers to learn progressively more abstract features from the same input, and the training can be supervised, semi-supervised, or unsupervised.
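
The shoe-brand example under Machine Learning can be sketched as follows. This is a deliberately simple stand-in for a learning algorithm: it just counts past purchases and predicts the most common brand for each combination of gender and foot size. The purchase records, the brand names, and the train and predict helpers are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (gender, foot_size, brand).
purchases = [
    ("female", 6, "BrandA"), ("female", 6, "BrandA"), ("female", 6, "BrandB"),
    ("female", 7, "BrandB"), ("male", 9, "BrandC"), ("male", 9, "BrandC"),
    ("male", 9, "BrandA"), ("male", 10, "BrandB"),
]


def train(records):
    """Count how often each brand was bought per (gender, foot size)."""
    model = defaultdict(Counter)
    for gender, size, brand in records:
        model[(gender, size)][brand] += 1
    return model


def predict(model, gender, size):
    """Return the most frequently bought brand, or None if unseen."""
    brands = model.get((gender, size))
    if not brands:
        return None
    return brands.most_common(1)[0][0]


model = train(purchases)
print(predict(model, "female", 6))
print(predict(model, "male", 9))
```

The more records the model is trained on, the more reliable these counts become, which is the sense in which the system "learns" from additional input.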

Python for Data Science

Python is widely regarded as one of the best and most widely used programming languages for scientific computing and for writing Machine Learning algorithms. That is because Python has the following qualities:
  1. Designed for Readability: Easy to learn and understand, with a syntax that reads much like English
  2. General Purpose Language: Designed for writing software in a wide variety of application domains, rather than being domain-specific
And so the scientific community has adopted Python as one of its main languages and created many scientific libraries that we can import and use for our own purposes, something that is less readily available in many other languages. To work on a specific topic, we need to install the dependencies, that is, the code packages that our code relies on for that topic. For example, for Machine Learning work in Data Science, we need the scikit-learn package. We follow these steps:
  1. Assuming that Python is already installed on the system and the environment is set up, upgrade pip, the Python package installer that helps us install dependencies. In the command prompt, type this (for Windows): python -m pip install -U pip
  2. Then install the dependency for Data Science: pip install -U scikit-learn
  3. Use an IDE such as JetBrains' PyCharm, or Jupyter Notebook.
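
Once the steps above are done, a quick way to confirm that the packages are visible to Python is a small check like the one below. It is only a sketch: the is_installed helper is a hypothetical name, and the list of modules checked is an assumption (scikit-learn is imported under the name sklearn, and it depends on NumPy and SciPy).

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found."""
    return importlib.util.find_spec(module_name) is not None


# scikit-learn is imported under the name "sklearn".
for name in ("sklearn", "numpy", "scipy"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```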

Using popular tools for working in Data Science

As mentioned above, you can use an Integrated Development Environment (IDE) such as JetBrains' PyCharm, or Jupyter Notebook, to write the Python code that analyses your data. These two tools are enough for most Data Science work. If you need a more user-friendly, managed environment, you can use the Anaconda distribution, which is not covered here.
  1. Using JetBrains' PyCharm: PyCharm is a cross-platform IDE that provides a consistent experience on the Windows, macOS, and Linux operating systems. See the PyCharm installation guide for more details.
  2. Using classical Jupyter Notebook: While Jupyter runs code in many programming languages, Python is a requirement for installing the Jupyter Notebook (older Jupyter documentation cited Python 3.3 or greater, or Python 2.7; current releases require Python 3). As an existing Python user, you may wish to install Jupyter using Python’s package manager, pip, instead of Anaconda.
    First, ensure that you have the latest pip; older versions may have trouble with some dependencies: pip install --upgrade pip
    Then install the Jupyter Notebook using: pip install jupyter
    To launch Jupyter Notebook enter in command prompt: jupyter notebook
  3. Using JupyterLab: To install JupyterLab, enter in command prompt: pip install jupyterlab. To launch JupyterLab, enter in command prompt: jupyter lab (or, equivalently, jupyter-lab).

Using Scikit-learn for Machine Learning

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
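
As a minimal sketch of what working with scikit-learn looks like, the example below trains one of the algorithms mentioned above, a random forest, on the small Iris dataset that ships with the library. The choice of dataset, the 25% test split, and the fixed random_state values are assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to test the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a random forest; random_state fixes the seed for repeatability.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Score the predictions against the held-out labels.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same pattern of creating the model, calling fit on training data, and calling predict on new data, so swapping in a different algorithm usually changes only the line that constructs the model.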

Practical Examples on Machine Learning

See these examples for a quick understanding of Machine Learning: