- Data science is the study of large quantities of data, which can reveal insights that help organizations make strategic choices.
- Data scientists need to be curious, judgmental, and argumentative.
- Many algorithms are used to extract insights from data.
A methodology is a system of methods and a guideline for decision-making during the scientific process.
Data science methodology guides the data scientist in solving complex problems with data.
Foundational methodology, a cyclical, iterative data science methodology developed by John Rollins, consists of 10 stages.
- Business understanding
- What is the problem you are trying to solve?
- Understand the business problem and determine the data needed to answer the core business question.
- Analytic Approach
- How can you use data to answer the question?
- If the question is to determine the probabilities of an action, then use a predictive model
- If the question is to show the relationships, then use a descriptive model
- If the question requires a yes or no answer, then use a classification model
- Data Requirements
- Identify the correct and necessary data content, formats, and sources needed for the specific analytical approach.
- Data Collection
  - Identify and gather available data sources (these can be in the form of structured, unstructured, and even semi-structured data relevant to the problem domain).
- Data Understanding
- Focused on exploring and analyzing the collected data to ensure that the data is representative of the problem to be solved.
- Data Preparation
- Where data is cleaned, transformed, and formatted for further analysis, including feature engineering and text analysis.
- Modeling
- Evaluation
- Deployment
- Feedback
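The cyclical flow through the stages above can be sketched as a toy, runnable example. All of the stage functions here are simplified stand-ins invented for illustration, not a real framework; in practice each stage involves substantial work and the loop repeats as feedback comes in.

```python
# Toy sketch of the methodology's core stages (hypothetical helpers).

def choose_approach(question):
    # Analytic Approach: yes/no questions suggest a classification model.
    return "classification" if question.endswith("(yes/no)?") else "descriptive"

def collect_data():
    # Data Collection: toy labeled records (text, label).
    return [("SPAM offer", 1), ("hello friend", 0), ("spam alert", 1)]

def prepare(records):
    # Data Preparation: a trivial cleaning step (lowercase the text).
    return [(text.lower(), label) for text, label in records]

def build_model(approach, data):
    # Modeling: trivially "model" the majority label.
    labels = [label for _, label in data]
    return max(set(labels), key=labels.count)

def evaluate(model, data):
    # Evaluation: accept the model if it matches at least half the labels.
    correct = sum(1 for _, label in data if label == model)
    return correct / len(data) >= 0.5

def run_methodology(question):
    approach = choose_approach(question)    # Analytic Approach
    data = prepare(collect_data())          # Collection + Preparation
    model = build_model(approach, data)     # Modeling
    return model if evaluate(model, data) else None  # Evaluation

print(run_methodology("Is this message spam (yes/no)?"))  # -> 1
```

In the full methodology the result would then be deployed, feedback gathered, and the loop re-entered to refine the approach.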
- CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
- CRISP-DM, an open-source data methodology, combines several of the stages above into one stage and omits the Feedback stage, resulting in a six-stage methodology.
- Business Understanding
- Data Understanding [Combination of Data Requirements, Collection and Understanding]
- Data Preparation
- Modeling
- Evaluation
- Deployment
Based on Questions
- Descriptive Questions: What is the current status?
- Diagnostic Questions: Why did it happen?
- Predictive Questions: What is likely to happen?
- Prescriptive Questions: What should we do?
- Classification Questions: What category does this belong to?
Descriptive statistics are appropriately named, as they provide insights into the main features of our data.
- Mean - Average
- Median - Middle Value
- Mode - Most Frequently Occurring Value
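All three measures are available in Python's standard library `statistics` module; a quick sketch with an assumed toy dataset:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # toy dataset for illustration

print(statistics.mean(data))    # average: 30 / 6 = 5
print(statistics.median(data))  # middle of sorted values: (3 + 5) / 2 = 4.0
print(statistics.mode(data))    # most frequently occurring value: 3
```

Note that with an even number of values, the median is the average of the two middle values.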
The most common way to gauge variability is the standard deviation.
It measures how much the values in a dataset vary around the mean.
A low standard deviation indicates values clustered tightly around the mean, while a high standard deviation indicates a wider spread around the mean.
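As a rough sketch (toy numbers assumed), the population standard deviation can be computed directly from its definition, the square root of the mean squared deviation, and checked against the standard library:

```python
import math
import statistics

clustered = [4, 5, 5, 6]   # values hug the mean (5)
spread = [0, 2, 8, 10]     # same mean (5), much wider spread

def pstdev(values):
    """Population standard deviation: sqrt of mean squared deviation."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

print(pstdev(clustered))  # small: values lie close to the mean
print(pstdev(spread))     # large: values lie far from the mean
print(math.isclose(pstdev(clustered), statistics.pstdev(clustered)))  # True
```

Both datasets have the same mean, so only the standard deviation distinguishes how tightly the values cluster.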