The 5 steps to Preprocessing Data
Data preprocessing is a technique that transforms raw data into a format an algorithm can work with. Data comes in many types (text, audio, video, images, and so on), and since a machine ultimately operates only on binary digits, raw data has to be converted into a clean, consistent numerical form first. That is where data preprocessing comes into play.
Need for data pre-processing
Machine Learning (ML) and Deep Learning (DL) algorithms rely heavily on data, so efficient preprocessing is required to improve model accuracy. Data collected from the real world is typically noisy, with missing and erroneous values.
The most common defects in collected data are:
- The dataset can be incomplete: it can lack required attributes or contain NULL values, which most algorithms cannot process.
- The data might be inconsistent: a change of format, a change of marking scheme, numbers that do not add up, and so on.
- The data can be erroneous and contain outliers. Such noisy data can lead to cumulative errors in the output of the algorithm.
According to a Gartner Research report, poor data quality costs a company an average of $15 million every year, which for most companies is too high a cost to bear.
Training any AI algorithm requires large amounts of data. Every minute, people online are making transactions or browsing the internet, generating a massive amount of data, and computer science has evolved to run computations on data sets of this size. This data is the raw material that the algorithms learn from.
There are a few points to remember while collecting data.
- Check for data imbalances. For instance, suppose we collect fingerprint data for a computer security algorithm from person A and person B, but the algorithm ends up receiving far more data from person A. The model we train will then be heavily biased toward person A. In the real world, however, the model might see both people equally often, so a model that leans toward person A would get person B wrong much of the time.
- We also need to split the data into two parts. The first, used for training the model, is the majority of the dataset. The second is used for evaluating the trained model's performance. The split can use an 80-20 or 70-30 ratio, depending on the size of the data. In ML, more data generally leads to a better model and higher accuracy.
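The two checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `class_balance` and `train_test_split`, the sample data, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples that belongs to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical fingerprint samples: 8 from person A, only 2 from person B.
samples = [(f"scan_{i}", "A") for i in range(8)]
samples += [(f"scan_{i}", "B") for i in range(8, 10)]

balance = class_balance([label for _, label in samples])
train, test = train_test_split(samples, test_ratio=0.2)  # 80-20 split

print(balance)                # share of each person in the dataset
print(len(train), len(test))  # sizes of the two parts
```

A heavily skewed `balance` is a signal to collect more data for the under-represented class before training. With strongly imbalanced data, a purely random split can also leave one class out of the test set entirely, so a split that preserves class proportions may be the better choice.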
Steps for data pre-processing
Sometimes the data we collect needs further adjustment and manipulation, such as removing duplicates, normalizing values, and correcting errors. The five steps of data preprocessing are:
1. Data Cleaning:
This is the process of correcting inconsistent data by filling in missing values and removing duplicate records. Missing values are handled either by eliminating the affected entry or by filling the gap with an imputed value such as the mean, median, or mode of that attribute.
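As a rough sketch of these two cleaning operations, the snippet below removes exact duplicates and fills missing values with a statistic of the observed values. The record layout, the column name `age`, and the helper names `drop_duplicates` and `impute` are hypothetical; missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute(rows, column, strategy=mean):
    """Return new rows where None in `column` is replaced by
    strategy(observed values), e.g. mean, median, or mode."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = strategy(observed)
    return [
        dict(r, **{column: fill if r[column] is None else r[column]})
        for r in rows
    ]


records = [
    {"id": 1, "age": 20},
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 40},
    {"id": 3, "age": 40},    # duplicate record
]

cleaned = impute(drop_duplicates(records), "age", strategy=mean)
print(cleaned)
```

Passing `strategy=median` or `strategy=mode` switches the fill value. The mean is sensitive to outliers, so the median is often the safer choice for skewed numerical data, while the mode suits categorical attributes.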
2. Data Integration:
Data integration involves acquiring the data and combining it according to its features, which reduces memory use and processing time. Two techniques are used to decide which features to work with: Feature Selection and Feature Extraction.
Feature selection entails prioritizing some characteristics over others. For example, an algorithm checking for spam email gives attributes such as the IP address and email ID more weight than the person's name, age, or gender. Feature extraction, by contrast, converts M attributes into N new attributes, where N may be smaller than, larger than, or equal to M. The extraction algorithm depends on the format of the data: audio, video, or text.
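To make the distinction concrete, here is a small sketch based on the spam example. It simplifies selection to keeping or dropping attributes rather than weighting them, and the derived attributes in `extract_features` are illustrative assumptions, not a recommended feature set for spam detection.

```python
def select_features(record, keep):
    """Feature selection: keep only the named attributes."""
    return {name: record[name] for name in keep}


def extract_features(record):
    """Feature extraction: derive new attributes from existing ones.
    Here M = 2 input attributes become N = 3 derived attributes."""
    local_part, _, domain = record["email"].partition("@")
    return {
        "email_domain": domain.lower(),
        "local_part_length": len(local_part),
        "ip_first_octet": int(record["ip_address"].split(".")[0]),
    }


# Hypothetical message metadata.
message = {
    "name": "Pat",
    "age": 31,
    "gender": "F",
    "email": "promo@Example.com",
    "ip_address": "203.0.113.7",
}

selected = select_features(message, keep=["ip_address", "email"])
derived = extract_features(selected)
print(selected)
print(derived)
```

Selection leaves the surviving attributes unchanged, whereas extraction produces attributes that did not exist in the raw record.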
3. Data Transformation:
Data can come from various sources and may therefore arrive in different formats. Before computations can be applied, the data needs to be consolidated and converted into a common format. Changing the formatting, structure, or values of data is called transformation. Dirty data is processed using methods such as regression, aggregation, normalization, or clustering.
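Normalization is the easiest of these methods to show in code. The sketch below uses min-max scaling, which is only one of several normalization schemes; the function name and the sample values are assumptions for illustration.

```python
def min_max_normalize(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the
    largest becomes 1.0. A constant column is mapped to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150, 160, 170, 180, 190]
normalized = min_max_normalize(heights_cm)
print(normalized)
```

After scaling, attributes that were measured in different units share the same 0 to 1 range, so no single attribute dominates purely because of its scale.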
4. Data Reduction:
As the dimensionality of the data increases, so does the difficulty of applying an algorithm to it. Reduction is the process of shrinking the collected data without losing its meaning. For example, data from the self-monitoring devices of individual people could be grouped by age to study trends in weight, activity time, heart rate, and so on.
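The age-wise grouping can be sketched as a simple aggregation. The ten-year bracket width, the field names, and the readings below are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def age_group(age, width=10):
    """Label the bracket an age falls into, e.g. 23 -> '20-29'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


def reduce_by_age_group(readings):
    """Replace per-person readings with one mean heart rate per group."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[age_group(reading["age"])].append(reading["heart_rate"])
    return {group: mean(rates) for group, rates in grouped.items()}


# Hypothetical readings from self-monitoring devices.
readings = [
    {"age": 23, "heart_rate": 70},
    {"age": 27, "heart_rate": 74},
    {"age": 35, "heart_rate": 68},
    {"age": 38, "heart_rate": 72},
    {"age": 31, "heart_rate": 76},
]

summary = reduce_by_age_group(readings)
print(summary)
```

Five individual records collapse into one value per age bracket, which keeps the trend of interest while discarding per-person detail.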
5. Data Discretization:
Data can be of two types: categorical or numerical. Mapping numerical data to concept labels or intervals is called discretization. It reduces the size of the dataset and helps prepare the data for computation.
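A minimal sketch of discretization, assuming hand-picked bin edges and labels (both are hypothetical and would normally come from domain knowledge or from the data distribution):

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a number to a concept label.

    `edges` holds the ascending lower bounds of every bin after the
    first, so there must be exactly one more label than edges."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly len(edges) + 1 labels")
    return labels[bisect_right(edges, value)]


edges = [13, 20, 65]  # bins: <13, 13-19, 20-64, >=65
labels = ["child", "teen", "adult", "senior"]

ages = [8, 13, 19, 42, 70]
binned = [discretize(age, edges, labels) for age in ages]
print(binned)
```

Each numerical age is replaced by one of four labels, so downstream steps deal with a handful of categories instead of an open-ended range of numbers.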
Data preprocessing is akin to spending time sharpening the ax before cutting the tree: the effort spent up front makes everything that follows more reliable. Many researchers are working on new methodologies to make the process easier, and tools such as R, Weka, RapidMiner, Python, and OpenRefine already support it. There are more detailed steps and nuances, but these five steps serve as a good foundation for thinking through the problem, giving us a common language for each step and a starting point for going deeper.