In previous parts, we discussed data preprocessing, EDA, feature selection, and cleaning. Now we are going to discuss model selection, but before that we need to know some more information and metrics.


Mean Squared Error (MSE) is the average of the squared differences between the original and predicted values in the dataset. When the residuals have zero mean, it equals the variance of the residuals.
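As a minimal sketch of the definition above, the following computes MSE in plain Python. The function name and the toy values are assumptions made for illustration only; they do not come from the article.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Toy values chosen only for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)
```

Because the errors are squared, a single large miss contributes far more than several small ones, which is why MSE is sensitive to outliers.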


Identifying outliers is important for every data scientist. It helps detect abnormal data points, or data that do not fit the expected pattern.
From a machine learning perspective, tools for outlier detection and outlier treatment hold great significance, because outliers can have a large influence on the predictive model.

An outlier is an observation that is distant from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error.
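One common way to make "distant from other observations" concrete is the interquartile range (IQR) rule. The sketch below is only an assumption about how one might flag outliers; the article excerpt does not name a specific method, and the function name, the multiplier `k=1.5`, and the readings are illustrative.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Whether a flagged point is an error to remove or genuine variability to keep is a judgment call that the rule itself cannot make.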

Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high…


In cases where there is a huge number of variables, it is quite difficult to visualize the dataset or draw inferences from it. Dimensionality reduction techniques therefore try to extract a smaller set of variables that captures most of the information carried by the original set. So if we have a dataset of X dimensions, we can convert it into a representation of Y dimensions, where Y is smaller than X. This is called dimensionality reduction.
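To illustrate going from X dimensions to Y, here is a rough sketch that reduces 2-D points to 1-D by projecting them onto the direction of greatest variance, which is the idea behind principal component analysis. It uses power iteration on the covariance matrix instead of a library routine; the function name, the iteration count, and the sample points are all assumptions made for illustration.

```python
import math
import statistics


def first_principal_direction(rows, iterations=100):
    """Approximate the direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d) of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    direction = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d))
                   for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            raise ValueError("data has no variance along the starting vector")
        direction = [x / norm for x in product]
    return direction, centered


# Illustrative 2-D points whose two coordinates are strongly correlated.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
direction, centered = first_principal_direction(points)

# Project each centered point onto the direction: 2 dimensions become 1.
reduced = [sum(c * w for c, w in zip(row, direction)) for row in centered]
retained = statistics.variance(reduced)
largest_original = max(statistics.variance(col) for col in zip(*points))
print(direction, retained, largest_original)
```

The single projected coordinate carries at least as much variance as either original column, which is the sense in which the reduced data keeps most of the information.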

Dimensionality reduction may be either linear or non-linear, depending on the method used. The prime linear method, called Principal Component Analysis

Principal Component Analysis

It works on a condition…


When building a machine learning model, it’s rare that all the variables in the dataset are useful. Feature selection is the process of selecting a subset of relevant features from all the features, which is then used to build the model.
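As one simple illustration of picking a subset of relevant features, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top `k`. This filter-style approach is an assumption chosen for brevity, not necessarily the method the article goes on to use, and the feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (spread_x * spread_y)


def select_top_k(features, target, k):
    """Return the names of the k features most correlated with the target."""
    scores = {name: abs(pearson(column, target))
              for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical columns; "noise" and "constant" should not be selected.
features = {
    "area": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "noise": [7, 1, 5, 2, 6],
    "constant": [1, 1, 1, 1, 1],
}
price = [100, 120, 160, 200, 240]
selected = select_top_k(features, price, k=2)
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking like this is a starting point and not a final answer.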

If we have few features, the model is easier to interpret and less likely to overfit, but its prediction accuracy may be lower.
If we have more features, the model is harder to interpret and more likely to overfit, although it may achieve higher prediction accuracy. …


Data preprocessing is the intermediate step after EDA and before data mining. To obtain value from the dataset through data mining, we first need to prepare, or preprocess, the data. This involves data cleaning, transformation, and reduction.


Normalization, in the min-max sense, scales each feature to the range 0.0 to 1.0 while retaining the proportional distances between its values. A different, per-sample form of normalization rescales each sample with at least one non-zero component independently of other samples so that its norm equals one.
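The two operations described above are easy to confuse, so here is a small sketch of both in plain Python. The function names and sample values are assumptions for illustration; the Euclidean (L2) norm is assumed for the per-sample case.

```python
import math


def min_max_scale(column):
    """Scale one feature column to the range 0.0 to 1.0."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        return [0.0 for _ in column]  # constant column: no range to scale
    return [(v - lowest) / (highest - lowest) for v in column]


def l2_normalize(sample):
    """Rescale one sample (row) so that its Euclidean norm equals one."""
    norm = math.sqrt(sum(v * v for v in sample))
    if norm == 0:
        return list(sample)  # all-zero samples are left unchanged
    return [v / norm for v in sample]


ages = [20, 30, 40, 60]           # one feature across four samples
scaled = min_max_scale(ages)

unit = l2_normalize([3.0, 4.0])   # one sample with two features
print(scaled, unit)
```

Min-max scaling works down a column (one feature, many samples), while unit-norm scaling works across a row (one sample, many features).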


Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test.

Hypothesis testing is a statistical method used to make statistical decisions from experimental data. A hypothesis is basically an assumption that we make about a population parameter.
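As a rough sketch of how two mutually exclusive statements are weighed against sample data, the following runs a two-sided one-sample z-test. It assumes the population standard deviation is known, and the sample, the null value `mu0=50`, `sigma=3`, and the 0.05 significance level are all hypothetical choices for illustration.

```python
import math
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma):
    """Two-sided one-sample z-test; returns the z statistic and p-value."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / math.sqrt(n))
    # Probability of a statistic at least this extreme if the null is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Null hypothesis: the population mean is 50.
# Alternative hypothesis: the population mean is not 50.
sample = [52, 48, 55, 51, 49, 53, 54, 50, 56]
z, p = z_test(sample, mu0=50, sigma=3)
reject_null = p < 0.05
print(z, p, reject_null)
```

A small p-value means the sample would be unlikely if the null hypothesis were true; it does not by itself prove the alternative.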

Ronald Coase said, “Torture the data, and it will confess to anything.” …


EDA is the process of visualizing and analyzing data to extract insights from it. In other words, EDA is the process of summarizing the important characteristics of data in order to gain a better understanding of the dataset.

In this article, we are going to introduce you to the process of EDA through the analysis of the House dataset available here. We will talk about some common methods used for EDA and will let you know how to apply them for extracting meaningful insights from raw data.
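A first step in summarizing the important characteristics of a column is usually a handful of descriptive statistics. The sketch below uses only the Python standard library; the function name is an assumption and the prices are made-up toy values, not taken from the House dataset mentioned above.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Hypothetical prices with one much larger value.
prices = [120, 150, 150, 180, 200, 400]
summary = summarize(prices)
for name, value in summary.items():
    print(name, value)
```

A mean well above the median, as in this toy column, is a hint that the distribution is skewed to the right and worth plotting before modeling.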

We can host a website for free on GitHub Pages. This can be used to host a personal website, blog, portfolio, project, etc. without spending much effort or a single rupee. Because it serves only static files, GitHub Pages is typically fast compared to WordPress and other CMSs. There is no database or back-end execution in GitHub Pages, so the attack surface is much smaller than that of a dynamic site.

First, we need a GitHub account and a terminal.

Log in to the GitHub account and create a new repository. This is where we keep our project data.

Jekyll is a static site generator written in Ruby. Jekyll works best for personal blogs, portfolios, and other static websites. We can also take existing templates and edit them, and free templates are available from various sources on the internet. The beauty of Jekyll is that you can provide the content in a markup language (as plain text) and Jekyll will automatically generate static HTML pages.

After selecting a template, we can clone it to the local system, make changes as per our requirements, build it locally, and then serve it. We can change the website name, description…

Sri Harsha Tanamala
