Data pre-processing is a data mining technique which is used to transform raw data into a useful format.
Steps Involved in Data Pre-processing:
1. Data Cleaning
“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin)
One of the most common problems I have faced in exploratory analysis is handling missing values. I feel that there is no universally good way to deal with missing data. There are many imputation methods, suited to different kinds of problems (time series analysis, ML, regression and so on), which makes it difficult to choose between them. So, let’s explore the most commonly used methods and try to find some that fit our needs.
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling of missing data, noisy data etc.
Missing Data
This situation arises when some data is missing in our dataset. Before jumping to the various methods of handling missing data, we have to understand the reason why data goes missing.
Data Goes Missing Randomly: Data is missing at random when the reason a data point is missing is not related to the values in the observed dataset.
Data Goes Missing Not Randomly: There are two possible cases: the missingness depends on the hypothetical (unobserved) value itself, or it depends on the value of some other variable. As an example of the first case, people with high salaries often do not want to reveal their incomes in surveys, so we can suspect that a missing value was actually quite large and fill it with a correspondingly high hypothetical value. As an example of the second case, if women are less likely than men to report their age in a survey, then the missing values in the age column are influenced by the gender column.
If data goes missing randomly, it is safe to remove the tuples that contain missing values. If it does not, removing observations with missing values can bias the model. So we have to be quite careful when removing tuples.
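One rough way to check whether missingness depends on another variable is to compare the share of missing values across groups. The sketch below uses a small made-up survey table (the column names and values are assumptions for illustration only); a large gap between groups hints that the data is not missing randomly, although it does not prove it.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: "age" is missing more often for one gender.
survey = pd.DataFrame({
    "gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "age": [np.nan, 31.0, np.nan, np.nan, 45.0, 38.0, np.nan, 29.0],
})

# Share of missing "age" values within each gender group.
missing_rate = survey["age"].isna().groupby(survey["gender"]).mean()
print(missing_rate)
```

If the rates were roughly equal across groups, dropping the affected rows would be less likely to distort the analysis.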
P.S. – Data imputation does not guarantee better results.
Dropping Observations
Tuple deletion removes all data for an observation that has one or more missing values. If the missing data is limited to a small number of observations, you may just opt to eliminate those cases from the dataset. However, in most cases, it can produce bias in the analysis because we can never be totally sure that the data has gone missing randomly.
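As a minimal sketch of tuple deletion with pandas, assuming the data lives in a DataFrame (the table below is invented for illustration), `dropna` removes every row that has at least one missing value. Counting the rows lost is a cheap check on how much data the deletion costs.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a few missing entries.
mydata = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "salary": [50000.0, 62000.0, np.nan, 58000.0],
    "city": ["Delhi", "Mumbai", "Pune", "Delhi"],
})

# Drop every row (tuple) that has at least one missing value.
complete_rows = mydata.dropna(axis=0, how="any")

# Report how many observations the deletion removed.
rows_lost = len(mydata) - len(complete_rows)
print(f"Dropped {rows_lost} of {len(mydata)} rows")
```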
Dropping Variables
It is always better to keep data than to discard it. Sometimes you can drop a variable (column) if its data is missing for more than 60% of the rows, but only if that column is insignificant. Even so, dropping tuples is generally preferred over dropping columns.
mydata.drop('column_name', axis=1, inplace=True)
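The 60% rule of thumb can be applied programmatically. The sketch below, on an invented table, computes the fraction of missing values per column and drops the columns above the threshold. Whether a column is really insignificant is still a judgment call that this code does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: "notes" is missing in 4 of 5 rows (80%).
mydata = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 33.0, 41.0],
    "notes": [np.nan, np.nan, np.nan, "ok", np.nan],
})

threshold = 0.6  # drop columns with more than 60% missing values

# Fraction of missing values in each column.
missing_fraction = mydata.isna().mean()

# Names of the columns that exceed the threshold.
sparse_columns = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = mydata.drop(columns=sparse_columns)
print(sparse_columns)
```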
Fill the missing values
There are various ways to do this task. You can choose to fill the missing values manually, or by using the mean, mode or median.
Utilising the overall mean, median or mode is a very straight-forward imputation method. It is quite fast to perform, but has clear disadvantages, one of them being that mean imputation reduces variance in the dataset.
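import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed from scikit-learn
from sklearn.impute import SimpleImputer
values = mydata.values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
transformed_values = imputer.fit_transform(values)
# strategy can be changed to "median" or "most_frequent"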
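The variance reduction caused by mean imputation is easy to see on a toy example. The sketch below uses a made-up series of ages and compares the sample variance of the observed values with the variance after filling the gaps with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
ages = pd.Series([20.0, 30.0, np.nan, 50.0, np.nan, 60.0])

# Sample variance of the observed values (pandas skips NaN by default).
var_observed = ages.var()

# Fill the gaps with the mean of the observed values.
filled = ages.fillna(ages.mean())
var_filled = filled.var()

print(var_observed, var_filled)
```

Every filled value sits exactly at the mean, so it adds nothing to the spread while increasing the number of observations, which is why the variance shrinks.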
Regression
Data can be made smooth by fitting it to a regression function. The regression used can be simple linear (having one independent variable) or multiple (having several independent variables).
To start, the most significant variables are identified using a correlation matrix. They are used as the independent variables in a regression equation. The dependent variable is the one which has missing values. Tuples with complete data are used to fit the regression equation, and the equation is then used to predict the missing values of the dependent variable.
Regression provides good estimates for missing values. However, the method has several disadvantages which tend to overshadow its advantages. First, because the inserted values are predicted from the other variables, they fit together too well, so the standard errors become biased (typically underestimated). Second, we assume a linear relationship between the variables used in the regression equation when there may not be one.
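The regression steps described above can be sketched with scikit-learn's `LinearRegression`. The table, the column names and the choice of a single predictor are assumptions for illustration; in practice the predictors would come from the correlation analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "salary" has gaps, "experience" is fully observed.
mydata = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "salary": [30.0, 40.0, np.nan, 60.0, np.nan, 80.0],
})

predictors = ["experience"]  # independent variables
target = "salary"            # dependent variable with missing values

# Split into tuples with a known target and tuples where it is missing.
known = mydata[mydata[target].notna()]
unknown = mydata[mydata[target].isna()]

# Fit the regression equation on the complete tuples only.
model = LinearRegression().fit(known[predictors], known[target])

# Predict the missing values and write them into a copy of the data.
imputed = mydata.copy()
imputed.loc[unknown.index, target] = model.predict(unknown[predictors])
print(imputed)
```

Note that the predicted values fall exactly on the fitted line, which illustrates why this kind of imputation understates the natural scatter in the data.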
KNN (K Nearest Neighbours)
In this method, the k nearest neighbours of a tuple are chosen based on some distance measure, and their average is used as a hypothetical value to fill in the missing data. KNN can predict both discrete values (the most frequent value among the k nearest neighbours) and continuous values (the mean among the k nearest neighbours).
Different formulas / concepts are used for calculating distance according to the type of data:
Continuous Data: Most commonly used distance formulas are – Euclidean, Manhattan and Cosine
Categorical Data: Hamming distance is generally used for categorical imputation. It iterates through all the categorical attributes and, for each one, counts one if the value differs between the two tuples. The number of attributes on which the tuples differ is the Hamming distance.
One of the drawbacks of the KNN algorithm is that it becomes time-consuming on large datasets, because it searches the entire dataset for similar instances. Moreover, with high-dimensional data KNN’s accuracy can degrade severely, because there is little difference between the distance to the nearest and the farthest neighbour in many dimensions.
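The Hamming distance described above for categorical attributes can be sketched in a few lines of plain Python. The function name and the example tuples are made up for illustration.

```python
def hamming_distance(row_a, row_b):
    """Count the attributes on which two equal-length tuples disagree."""
    if len(row_a) != len(row_b):
        raise ValueError("Both tuples must have the same number of attributes")
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


# Two hypothetical tuples of categorical attributes.
first = ("red", "small", "cotton", "yes")
second = ("red", "large", "cotton", "no")

# The tuples differ on the second and fourth attributes.
distance = hamming_distance(first, second)
print(distance)
```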
from fancyimpute import KNN
# Use 5 nearest rows which have a feature to fill in each row's missing features
# (recent fancyimpute releases use fit_transform; older ones used .complete)
knnOutput = KNN(k=5).fit_transform(mydata)
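If fancyimpute is not available, scikit-learn ships a similar tool, `KNNImputer`. The sketch below is an alternative to the snippet above rather than a reproduction of it: it runs on a tiny invented array and uses scikit-learn's default NaN-aware Euclidean distance with two neighbours.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data; the second row is missing its second feature.
data = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Fill each missing entry with the mean of that feature over the
# 2 nearest rows in which the feature is present.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(data)
print(filled)
```

Here the two nearest rows to the incomplete one are the first and the third, so the gap is filled with the mean of their second feature.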
Read Full Article Here – https://brain-mentors.com/concepts-of-data-pre-processing/