Uploaded on
27 Dec 2022
Skill-Lync
Often, while working with datasets, we encounter scenarios where the available data is very scarce. Because of this scarcity, dividing the data into test and training sets leads to a loss of information.
The need to use the full data while still keeping the test and training data separate led to the concept of cross-validation.
Most machine learning enthusiasts use the following types of cross-validation:
In leave-p-out cross-validation, we have a dataset with n sample points, of which p points are set aside. The model is trained on the remaining n-p data points and tested on the p held-out points. This is repeated exhaustively for every possible choice of p points out of n, until all combinations have been used. In the figure below, each brown box represents one data point.
p=1 means leave-one-out cross-validation
p=2 means leave-two-out cross-validation
These techniques have low bias, but they suffer from a major flaw: the computational time is high.
For example, if we have n=10 and p=1, we will have to run the model 10 times, and in each run one data point is set aside for testing. With p=2, we have 10C2 runs, that is, 45 runs, and so on. In the pictures above, the examples are shown for p=1 and p=2. Note that for p=2, not all 45 options are shown.
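The run counts in this example can be checked with a short sketch. The helper name `leave_p_out_splits` is made up for illustration, and the code only enumerates index splits; fitting a model on each split is left out.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n, p):
    """Yield (train_indices, test_indices) for every way of holding out p of n points."""
    all_indices = range(n)
    for test in combinations(all_indices, p):
        held_out = set(test)
        train = [i for i in all_indices if i not in held_out]
        yield train, list(test)


# Count the runs for the n=10 example in the text.
runs_p1 = sum(1 for _ in leave_p_out_splits(10, 1))
runs_p2 = sum(1 for _ in leave_p_out_splits(10, 2))
print(runs_p1, runs_p2)  # 10 45
print(comb(10, 2))       # 45, the same count from the binomial coefficient
```

The number of runs grows quickly with n and p, which is where the computational cost comes from. In practice, a library routine such as scikit-learn's `LeavePOut` would likely be used instead of a hand-written generator.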
The holdout technique divides the data once in a fixed ratio such as 70:30 or 60:40. The training is done on the larger chunk and the testing on the smaller chunk. One of its major disadvantages is that it does not work well for imbalanced datasets, because a single random split may leave a class under-represented in one of the chunks. Second, a major chunk of the data is never used for learning.
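A single split of this kind might be sketched as follows. The function name `holdout_split`, the fixed seed, and the choice to shuffle before cutting are assumptions made for illustration, not something the text prescribes.

```python
import random


def holdout_split(n, test_fraction=0.3, seed=0):
    """Shuffle the indices 0..n-1 once and cut them into train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    # The first n_test shuffled indices form the test chunk, the rest the training chunk.
    return indices[n_test:], indices[:n_test]


train, test = holdout_split(10, test_fraction=0.3)
print(len(train), len(test))  # 7 3
```

Because the split is made only once, the points that land in the test chunk never contribute to training.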
In K-fold cross-validation, the data is divided into K groups (folds). In each run, K-1 groups go into training and the remaining one goes into testing. A total of K runs is needed to exhaust the entire dataset, so every point is used for testing exactly once. This validation technique has low bias, but it can fail badly on an imbalanced dataset, because a fold may contain few or no samples of the minority class. Each pink box below contains a lot of data points.
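The rotation of the test group over K runs can be sketched like this. The helper `k_fold_splits` is a hypothetical name, and the folds are taken as contiguous blocks without shuffling, which is a simplifying assumption.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over n points."""
    # Spread any remainder over the first n % k folds so sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for run, (train, test) in enumerate(k_fold_splits(10, 5), start=1):
    print(f"run {run}: test={test} train={train}")
```

Every index appears in exactly one test fold. scikit-learn's `KFold` provides the same idea with optional shuffling.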
Repeated random subsampling randomly splits the dataset into training and testing data, and does so several times. Due to the intrinsic randomness, this is also referred to as Monte Carlo cross-validation. Here neither the split ratio nor the number of iterations is fixed. In the end, the accuracy is given by the average over all the runs. Due to the randomness, some points may never be selected in any split while others are selected several times, which is one of the major disadvantages of this technique. With imbalanced datasets, again, this validation technique fails.
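A minimal sketch of repeated random splitting is given here, assuming a hypothetical helper `monte_carlo_splits` with a user-chosen test fraction and iteration count. It also checks which points were never drawn into a test split, to illustrate the uneven coverage that random draws can produce.

```python
import random


def monte_carlo_splits(n, n_iterations, test_fraction, seed=0):
    """Yield n_iterations independent random (train_indices, test_indices) splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(n_iterations):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh shuffle for every iteration
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


splits = list(monte_carlo_splits(n=10, n_iterations=4, test_fraction=0.3))

# Track which points ever landed in a test split.
tested = set()
for train, test in splits:
    tested.update(test)
never_tested = sorted(set(range(10)) - tested)
print("points never used for testing:", never_tested)
```

In a full evaluation, a model would be scored on each split and the scores averaged. scikit-learn's `ShuffleSplit` offers this kind of splitter.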
In stratified K-fold cross-validation, K folds are made by splitting the data into K groups. However, while making the K groups, it is ensured that each class is represented in every fold in roughly the same proportion as in the original dataset. This makes sure that the technique also works well with imbalanced datasets.
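One simple way to keep class proportions in every fold is to deal the indices of each class round-robin into the K folds. The sketch below does that; the function name `stratified_k_fold_splits` and the toy labels are made up for illustration, and real libraries may assign samples differently.

```python
from collections import Counter, defaultdict


def stratified_k_fold_splits(labels, k):
    """Yield (train_indices, test_indices) with class proportions kept in each fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Imbalanced toy labels: 8 samples of class 0 and 4 of class 1 (made up for illustration).
labels = [0] * 8 + [1] * 4
for train, test in stratified_k_fold_splits(labels, k=4):
    print(test, Counter(labels[i] for i in test))
```

With these labels, each test fold holds two samples of class 0 and one of class 1, matching the 2:1 ratio of the full dataset. scikit-learn's `StratifiedKFold` is the usual ready-made choice.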
Data that changes with time needs special treatment in time series cross-validation, because we cannot use future data to predict the past. To ensure this, we always train on the data up to t-1 and test on the data at t. Similarly, the next time we train on the data up to t and test on the data at t+1.
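The train-up-to-t-1, test-on-t scheme can be sketched as an expanding window. The name `expanding_window_splits` and the `min_train` parameter (the smallest training size to start from) are assumptions added for illustration.

```python
def expanding_window_splits(n, min_train=1):
    """Yield (train_indices, test_index): train on 0..t-1, then test on t."""
    for t in range(min_train, n):
        yield list(range(t)), t


for train, test in expanding_window_splits(5, min_train=2):
    print(f"train on {train}, test on {test}")
```

The training window only ever contains indices earlier than the test index, so no future observation leaks into training. scikit-learn's `TimeSeriesSplit` follows a similar idea with blocks of test points rather than single ones.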
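In nested cross-validation, one cross-validation loop runs inside another. The inner loop, typically a K-fold or stratified K-fold cross-validation, varies the model hyperparameters and picks the values that give the best accuracy. The outer loop then tests the model tuned in this way on data that was not used for the tuning, so the reported accuracy is not biased by the hyperparameter search. The selected hyperparameters are used for further prediction.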
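A common formulation of nested cross-validation uses an inner loop to select hyperparameters and an outer loop to score the tuned model. The sketch below follows that formulation. Everything in it is hypothetical: the toy constant-prediction model, its `shrinkage` hyperparameter, the candidate values, and the made-up target values. It scores with mean squared error rather than accuracy because the toy problem is a regression.

```python
from statistics import mean


def k_fold(indices, k):
    """Yield (train, test) lists, taking every k-th element of `indices` as a test fold."""
    for fold in range(k):
        test = indices[fold::k]
        train = [i for pos, i in enumerate(indices) if pos % k != fold]
        yield train, test


def fit_predict(y_train, shrinkage):
    """Toy model: predict a constant, the training mean shrunk towards zero."""
    return sum(y_train) / (len(y_train) + shrinkage)


def mse(y_true, prediction):
    """Mean squared error of a constant prediction."""
    return mean((y - prediction) ** 2 for y in y_true)


def nested_cv(y, candidates, outer_k=3, inner_k=2):
    """Return one (chosen hyperparameter, outer test error) pair per outer fold."""
    outer_scores = []
    for outer_train, outer_test in k_fold(list(range(len(y))), outer_k):
        # Inner loop: pick the hyperparameter using only the outer training data.
        def inner_error(shrinkage):
            errors = []
            for inner_train, inner_val in k_fold(outer_train, inner_k):
                pred = fit_predict([y[i] for i in inner_train], shrinkage)
                errors.append(mse([y[i] for i in inner_val], pred))
            return mean(errors)

        best = min(candidates, key=inner_error)
        # Outer loop: refit with the chosen value and score on untouched data.
        pred = fit_predict([y[i] for i in outer_train], best)
        outer_scores.append((best, mse([y[i] for i in outer_test], pred)))
    return outer_scores


y = [2.0, 2.5, 1.5, 3.0, 2.2, 1.8, 2.7, 2.1, 2.4]  # made-up targets
for best, error in nested_cv(y, candidates=[0.0, 1.0, 5.0]):
    print(f"chosen shrinkage={best}, outer test MSE={error:.3f}")
```

The outer test points never take part in choosing the hyperparameter, which is the point of nesting the two loops.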
Author
Navin Baskar
Author
Skill-Lync