Data Science: An Overview
In simple terms, Data Science is the study of data. It is a multidisciplinary field in which people use purpose-built techniques and systems to extract insights from any kind of data, structured or unstructured. The data may sit in local databases, or it may be Big Data spread across multiple clusters around the world. The field is multidisciplinary because it requires an understanding of mathematics, computer science skills, and experience in business strategy. That is Data Science in a nutshell.
Along the same lines, Data Scientists are people who have both the technical and the analytical skills needed to solve problems by studying data and extracting valuable insights from it. Data Analysts are a subset of Data Scientists: they rely mainly on analytical skills and so are more inclined towards preparing and analysing data. They may not have detailed knowledge of programming or system architecture, but they have enough analytical skill to dissect data using available tools and techniques. People with knowledge of statistics, computer science, and a business domain are therefore best suited to work as Data Scientists or, where applicable, as Data Analysts, with the Data Scientist having the deeper technical insight of the two.
As computer systems have become faster, more advanced, and cheaper, and as they have reached more and more users, one obvious consequence has followed: the generation of huge amounts of data. Users of systems such as Google, Amazon, Facebook, Twitter, YouTube, WhatsApp, Instagram, online maps, scientific research systems, and various websites and portals generate huge amounts of data every day. Data storage has also advanced to the point where computer systems can store all this data easily without fear of losing it. We now have data measured in terabytes, petabytes, exabytes, or even zettabytes. With such an abundance of data, we need a scientific approach to dig meaningful information out of it, and that is where Data Science comes into the picture. Pretty much all sciences are Data Science, in that we:
- Observe data
- Make Predictions
- Test the Predictions
- Update our ideas or get new insights
Examples of questions this process can help answer:
- What area is most likely to get hit by meteorites?
- How does atmospheric pressure affect meteorite trajectory?
Analytics vis-à-vis Data Science
As per Niall Sclater, Analytics is the analysis of data, typically large sets of data, by the use of mathematics, statistics, and computer software.
As per Dimitris Bertsimas, Analytics is the science of using data to build models that lead to better decisions that in turn add value to individuals, companies, and institutions.
Analytics is broadly of three types:
- Descriptive: Describes what has already happened, based on studying an existing dataset
- Predictive: Tells us what will probably happen in the future, based on trends seen in past data
- Prescriptive: Helps us prescribe the right course of action based on Data Science results
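The three types can be illustrated on a toy dataset. The sketch below is only an illustration, not a standard method: the monthly sales figures, the stock level, and the reorder rule are all hypothetical, and the trend line is a plain least-squares fit written out by hand.

```python
from statistics import mean

# Hypothetical monthly unit sales for months 1..6 (illustrative data only).
months = [1, 2, 3, 4, 5, 6]
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: summarise what has already happened.
average_sales = mean(sales)
best_month = months[sales.index(max(sales))]

# Predictive: fit a least-squares line sales = slope * month + intercept
# and extrapolate it one month ahead.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast_month_7 = slope * 7 + intercept

# Prescriptive: turn the forecast into a recommended action.
# The stock level and the rule itself are assumptions made for this sketch.
current_stock = 120
units_to_order = max(0, round(forecast_month_7) - current_stock)

print(f"Average monthly sales: {average_sales}")
print(f"Best month so far: {best_month}")
print(f"Forecast for month 7: {forecast_month_7:.1f}")
print(f"Recommended order: {units_to_order} units")
```

In practice the predictive step would usually come from a library model and the prescriptive step from business rules or an optimiser; the point here is only how the three types build on one another.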
Terms used in Data Science
- Artificial Intelligence: AI refers to a computer behaving like a human. This is achieved by writing programs, called algorithms, to build systems that can make human-like decisions based on available information through repetitive learning.
- Machine Learning: A computer or machine learns by itself. This is done by training the computer with more and more inputs, that is, by providing the system with various sets of data that define something. For example, we could provide the system with 5 years of data on the shoe brands purchased by male and female customers with varying foot sizes in retail shops in a region. With this input data, the system can progressively learn to predict which brand of shoe a male or female customer with a particular foot size would probably buy. That is possible because the system gradually learns the pattern of brand selection by customers of each foot size.
- Deep Learning: Machine Learning is a part of AI, and Deep Learning is a part of Machine Learning. It learns different aspects of a single input, based on supervised, semi-supervised, or unsupervised training.
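The shoe-brand example under Machine Learning can be sketched in a few lines. This is a deliberately simple stand-in for a real learning algorithm: it only counts which brand each (gender, foot size) group bought most often and predicts that brand. The purchase records, the brand names, and the helper names train and predict are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (gender, foot_size, brand_bought).
purchases = [
    ("female", 6, "BrandA"),
    ("female", 6, "BrandA"),
    ("female", 6, "BrandB"),
    ("male", 9, "BrandC"),
    ("male", 9, "BrandC"),
    ("male", 9, "BrandA"),
    ("male", 10, "BrandB"),
]


def train(records):
    """Count how often each brand was bought per (gender, foot_size) group."""
    counts = defaultdict(Counter)
    for gender, foot_size, brand in records:
        counts[(gender, foot_size)][brand] += 1
    return counts


def predict(model, gender, foot_size):
    """Return the most frequently bought brand for this group, or None if unseen."""
    group = model.get((gender, foot_size))
    if not group:
        return None
    return group.most_common(1)[0][0]


model = train(purchases)
print(predict(model, "female", 6))
print(predict(model, "male", 9))
print(predict(model, "male", 12))  # no data for this group
```

Feeding in more records changes the counts and therefore the predictions, which is the sense in which the system learns from data. A real Machine Learning model would also generalise to foot sizes it has never seen, which this lookup cannot do.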
Python for Data Science
Python is widely regarded as one of the best and most widely used programming languages for scientific computing and for writing Machine Learning algorithms. That is because Python has the following qualities:
- Designed for Readability: Easy to learn and understand, with English-like syntax
- General Purpose Language: Designed to be used for writing software in the widest variety of application domains (general-purpose), and not domain-specific
To set up Python for Data Science:
- Assuming that Python is already installed on the system and the environment is set up, install or upgrade pip, the Python package manager that helps us install dependencies. At the command prompt, type (on Windows): python -m pip install -U pip
- Then install the dependency for Data Science: pip install -U scikit-learn
- Use an IDE such as JetBrains' PyCharm or Jupyter Notebook.
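After installation, a quick check from Python can confirm that the packages are importable. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, and sklearn is the import name of the scikit-learn package.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None


print("Python version:", sys.version.split()[0])
for name in ("pip", "sklearn"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If sklearn is reported as missing, the pip install command was probably run against a different Python installation than the one executing this script.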
Using popular tools for working in Data Science
As mentioned above, you can use an Integrated Development Environment (IDE) such as JetBrains' PyCharm, or Jupyter Notebook, to write Python code to analyse your data. These two tools are enough for most Data Science work. If you need a more user-friendly, managed environment, you can use Anaconda, which is not covered here.
- Using JetBrains' PyCharm: PyCharm is a cross-platform IDE that provides a consistent experience on the Windows, macOS, and Linux operating systems. See the PyCharm documentation for installation details.
- Using classical Jupyter Notebook: While Jupyter runs code in many programming languages, Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook. As an existing Python user, you may wish to install Jupyter using Python’s package manager, pip, instead of Anaconda.
First, ensure that you have the latest pip; older versions may have trouble with some dependencies: pip install --upgrade pip
Then install the Jupyter Notebook using: pip install jupyter
To launch Jupyter Notebook, enter in command prompt: jupyter notebook
- Using Jupyter Lab: To install Jupyter Lab, enter in command prompt: pip install jupyterlab. To launch Jupyter Lab, enter in command prompt: jupyter-lab.
Using Scikit-learn for Machine Learning
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
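The fit-then-predict pattern that scikit-learn uses can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed as described earlier; the measurements and the labels "small" and "large" are hypothetical. It uses two of the algorithms named above: a support vector machine for classification and k-means for clustering.

```python
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical measurements: [height_cm, weight_kg] (illustrative data only).
X = [[150, 50], [155, 52], [160, 55], [180, 80], [185, 85], [190, 90]]
y = ["small", "small", "small", "large", "large", "large"]

# Classification: a support vector machine learns from labelled examples.
classifier = SVC(kernel="linear")
classifier.fit(X, y)
predicted = classifier.predict([[152, 51], [188, 88]])
print("Predicted labels:", list(predicted))

# Clustering: k-means groups the same points without using the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print("Cluster ids:", list(cluster_ids))
```

The cluster ids are arbitrary integers, so only the grouping matters, not which group is numbered 0 or 1. Fixing random_state makes the clustering repeatable between runs.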
Practical Examples on Machine Learning
See these examples for a quick understanding of Machine Learning:
- Example 1: Simple Gender Classifier
- Example 2: Gender Classifier with CSV File
- Example 3: Gender Classifiers Comparison
- Example 4: Comparing Multiple Classifiers
- Example 5: Simple Clustering and Visualisations
- Example 6: Reading a File in Python
- Example 7: Creating Word Cloud for Word Count
- Example 8: Sentiment Analysis of Tweets from Twitter API
- Example 9: Exploring Pandas Library
- Example 10: Movies Data Analysis
- Example 11: Titanic Survival Predictions
- Example 15: Exploring Folium Map Plot
- Example 20: TensorFlow - Rock or Mine Prediction
- Example 21: WordCloud in Python