The Data Science Project Life Cycle

vamsi krishna
4 min read · Oct 4, 2021

A Data Science project begins with a problem statement and ends with a solution to that problem. The approach differs depending on the problem statement, but it always involves stages such as data extraction, data cleaning, EDA, and several more. In this post I explain the life cycle of a Data Science project step by step, as I understand it.

  • Problem Statement (Start)
  • Data collection
  • Exploratory Data Analysis (EDA)
  • Data cleaning
  • Feature Engineering
  • Model selection (End)

Problem Statement:

Understanding the problem is the first step in a Data Science project. Examples of problem statements are e-commerce demand forecasting, fraudulent transaction prediction, sales forecasting, back-order prediction, new disease prediction, etc.

For more details on problem statements, see https://www.indeed.com/career-advice/career-development/what-is-a-problem-statement

Data collection:

Once we understand the problem statement, the second step is collecting the relevant data for solving it.

We collect data from databases such as SQL or MongoDB, from files such as CSV, Excel, Word, JPG, text, MP4, or MP3, or through web scraping, etc.

Depending on the problem statement and the availability of relevant data, we may collect data from a single source or from multiple sources.
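As a rough sketch (the file name, connection string, and table name below are placeholders, not from the original post), loading data from a CSV file and a SQL table with pandas might look like this:

```python
import pandas as pd
from sqlalchemy import create_engine

# Load data from a CSV file ("patients.csv" is just a placeholder name).
csv_df = pd.read_csv("patients.csv")

# Load data from a SQL database (connection string and table name are placeholders).
engine = create_engine("sqlite:///hospital.db")
sql_df = pd.read_sql("SELECT * FROM lab_results", engine)

# Combine data collected from multiple sources into a single DataFrame.
data = pd.concat([csv_df, sql_df], ignore_index=True)
print(data.shape)
```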

Here is an example of what we need to keep in mind while collecting data:

If we approach a doctor about a health issue, the doctor looks at our reports. Those reports might contain information about gender, height, weight, blood group, blood tests, scan reports, and other medical tests that help the doctor understand the patient's problem. From a data science point of view, each patient's report is an observation, each individual piece of information is a feature, and whether the person has a specific disease or not is the target variable.

As in the example above, we need to collect the data that is relevant to our problem statement.

Exploratory Data Analysis:

Exploratory Data Analysis, or EDA for short, is a very important step for understanding the data properly; if we do not explore and understand the data, the rest of the project suffers heavily.

The things we do in the EDA phase are:

  • Visualize the data
  • Detecting missing data
  • Detecting Outliers
  • Detecting imbalanced data
  • Understanding the distribution of the data

These are the things we typically do in EDA.

For visualization in Python we have libraries such as Matplotlib and Seaborn. Visualization is a powerful step in EDA: it reveals different dimensions of the data and hidden patterns we would otherwise miss.
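As a minimal sketch, assuming the data is already in a pandas DataFrame called `data` and using placeholder column names such as `age` and `disease`, some typical EDA plots look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(data["age"], kde=True)        # distribution of a numerical feature
plt.show()

sns.boxplot(x=data["age"])                 # box plot highlights potential outliers
plt.show()

sns.countplot(x="disease", data=data)      # class counts reveal imbalance in the target
plt.show()

sns.heatmap(data.corr(numeric_only=True), annot=True)  # correlations between numerical features
plt.show()
```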

Detecting missing data, outliers, and imbalanced data also gives us hypotheses about how to handle these data issues.
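A few one-line checks, again on the assumed `data` DataFrame with placeholder column names, can surface these issues quickly:

```python
print(data.isnull().sum())                            # missing values per column
print(data.describe())                                # min/max/quartiles hint at outliers
print(data["disease"].value_counts(normalize=True))   # class proportions show imbalance
print(data["age"].skew())                             # skewness summarizes the distribution shape
```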

Through EDA we become deeply familiar with the data and can form the assumptions that guide the next steps.

Data cleaning:

Once we understand the data properly through EDA, the next step is data cleaning. Data cleaning directly impacts model performance. The things we handle in data cleaning are:

  • Missing data
  • Outliers
  • Imbalanced data
  • Distribution of the data

In data cleaning, the first step is to handle missing data. How we handle it depends on the number of records with missing values, the importance of those features, and their correlation with the other features and the target variable. Depending on these factors, we either remove the affected records or features, impute missing values with statistical measures such as the mean, median, or mode of the feature, or use a machine learning algorithm to fill in missing values based on correlations with other features.
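A minimal sketch of these options, assuming the same placeholder DataFrame and using scikit-learn's `SimpleImputer` (the 70% threshold is an illustrative judgment call):

```python
from sklearn.impute import SimpleImputer

# Drop rows where the (placeholder) target column itself is missing.
data = data.dropna(subset=["disease"])

# Drop columns where most of the values are missing.
data = data.loc[:, data.isnull().mean() < 0.7]

# Impute remaining numerical gaps with the median, categorical gaps with the mode.
num_cols = data.select_dtypes(include="number").columns
cat_cols = data.select_dtypes(exclude="number").columns
data[num_cols] = SimpleImputer(strategy="median").fit_transform(data[num_cols])
data[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(data[cat_cols])
```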

Outliers are data points in a feature or target variable that lie far away from the mean.

Regression algorithms are sensitive to outliers. For univariate outlier detection we generally use the IQR (interquartile range), standard deviation, or box plots; for multivariate outlier detection we use scatter plots. Handling outliers is one of the important steps in data cleaning, although tree-based ensemble algorithms such as decision trees, random forests, and XGBoost are not sensitive to them. Handling outliers means either dropping those records or capping them at the nearest quartile boundary.
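Here is a sketch of IQR-based detection and handling for a single placeholder column, showing both options (dropping or capping the records):

```python
# IQR-based outlier detection for a numerical column ("age" is a placeholder).
q1 = data["age"].quantile(0.25)
q3 = data["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data["age"] < lower) | (data["age"] > upper)]
print(f"{len(outliers)} outliers detected")

# Option 1: drop the outlier records.
# data = data[(data["age"] >= lower) & (data["age"] <= upper)]

# Option 2: cap the values at the IQR boundaries.
data["age"] = data["age"].clip(lower=lower, upper=upper)
```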

When a categorical feature or the target variable has many examples of some categories and very few of others, we call it imbalanced data. For example, in a binary classification problem where 98% of the target values are "Yes" and 2% are "No", a model trained on this data can reach 98% accuracy just by always predicting "Yes", because of class bias. To handle imbalanced data we use oversampling, undersampling, or both. Oversampling means adding more examples of the minority class until it matches the size of the majority class; in our example we add more "No" records, filling the other features with values close to the actual "No" records. Undersampling means removing some "Yes" records until they match the number of "No" records, but this loses some information. A third option is to apply oversampling and undersampling together.
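A simple sketch of oversampling with `sklearn.utils.resample`, assuming a placeholder `disease` target with "Yes" as the majority class (dedicated libraries such as imbalanced-learn offer more options):

```python
import pandas as pd
from sklearn.utils import resample

majority = data[data["disease"] == "Yes"]
minority = data[data["disease"] == "No"]

# Oversampling: sample the minority class with replacement until it matches the majority.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

# Undersampling would instead shrink the majority class:
# majority_downsampled = resample(majority, replace=False, n_samples=len(minority), random_state=42)

print(balanced["disease"].value_counts())
```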

The distribution of the data describes how the values of a numerical feature or target variable are spread according to their density. If a feature is heavily skewed or irregularly distributed, prediction quality can suffer, so we check the distribution and correct it where needed.
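As one common way to check and correct a skewed distribution (the `income` column and the log transform are illustrative choices, not from the original post):

```python
import numpy as np

# Skewness near 0 means a roughly symmetric distribution ("income" is a placeholder column).
print(data["income"].skew())

# A log transform is a common way to make a right-skewed feature closer to normal.
data["income_log"] = np.log1p(data["income"])
print(data["income_log"].skew())
```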

Feature Engineering:

After data cleaning is over, we need to select the features we want. During data cleaning we mostly work on row-level operations rather than column-level operations, though we may drop some columns, such as ID columns, columns with too many outliers or missing values, and columns that are not relevant to our problem statement.

In feature engineering, we apply feature engineering techniques and then select our final features; only those selected features are used as inputs to the final model.

We select features in two ways:

  • Using domain knowledge and selecting relevant features.
  • Using correlation matrices

By correlation-based selection I mean measures such as the Pearson correlation, as well as importance measures from machine learning algorithms, such as the Gini impurity used by tree-based models.

Depending on the problem statement and the available data, we use these techniques to select the best features.
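A sketch of both approaches, using a correlation matrix plus the Gini-based feature importances of a random forest on the placeholder columns assumed earlier:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Correlation of each numerical feature with the others.
corr_matrix = data.corr(numeric_only=True)
print(corr_matrix)

# Feature importance from a tree-based model (which splits on Gini impurity by default).
# "disease" is the placeholder target; X keeps only numerical features for simplicity.
X = data.select_dtypes(include="number")
y = data["disease"]
model = RandomForestClassifier(random_state=42).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

# Keep the most relevant features as the final model inputs.
selected_features = importances.head(5).index.tolist()
```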

Model Selection:

Model selection is the final step in a machine learning project (here I am not covering deployment, front-end applications, or design).

Everything we have done so far falls under data preprocessing. In model selection we train with different machine learning algorithms: if the data is linearly correlated we may use regression-style algorithms, and if it is not we may use ensemble algorithms such as Random Forest, XGBoost, or AdaBoost. After selecting a model we tune its hyperparameters and then save the final model as a pickle file or in another required format.
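A condensed sketch of this final stage with scikit-learn, reusing the placeholder features and target assumed earlier, comparing a linear model with an ensemble model, then tuning and saving the best one:

```python
import pickle
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train/test split on the selected features and placeholder target.
X_train, X_test, y_train, y_test = train_test_split(
    X[selected_features], y, test_size=0.2, random_state=42
)

# Try a linear model and an ensemble model and compare their test accuracy.
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]:
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))

# Tune hyperparameters of the chosen model with a simple grid search.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

# Save the final model as a pickle file.
with open("final_model.pkl", "wb") as f:
    pickle.dump(grid.best_estimator_, f)
```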
