
Uploaded on 27 Dec 2022

Cross-Validation Techniques For Data

Skill-Lync

When working with datasets, we often encounter scenarios where the available data is scarce. With so little data, dividing it into fixed training and test sets means the model never learns from the points held out for testing, which leads to a loss of information.

The need to use the full dataset while still keeping the training and test data separate led to the concept of cross-validation.

Major Types of Cross-Validation

The following types of cross-validation are commonly used in machine learning:

Leave p out cross-validation:

Here we have a dataset with n sample points, of which p points are set aside. The model is trained on the remaining n-p points and tested on the p held-out points. This is repeated exhaustively for every possible choice of p points out of the n, so each combination serves as the test set once. In the figure below, each brown box represents one data point.

p=1 means leave-one-out cross-validation

p=2 means leave-two-out cross-validation

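These leave-p-out techniques have low bias, because almost all of the data is used for training in every run, but they suffer from a major flaw: the computational time is high, since the number of runs grows combinatorially with n and p.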

For example, if we have n=10 and p=1, we have to run the model 10 times, and in each run one data point is set aside for testing. With p=2, we have 10C2 = 45 runs, and so on. The pictures above show examples for p=1 and p=2. Note that for p=2, not all 45 options are shown.
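The run counts above can be checked with a short sketch. It uses only the Python standard library; the function name `leave_p_out_splits` is a hypothetical helper introduced here for illustration, not something from the original article.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n, p):
    """Yield (train_indices, test_indices) for every way of holding out p of n points."""
    all_indices = range(n)
    for test in combinations(all_indices, p):
        held_out = set(test)
        train = [i for i in all_indices if i not in held_out]
        yield train, list(test)


n = 10
for p in (1, 2):
    n_runs = sum(1 for _ in leave_p_out_splits(n, p))
    # The number of runs equals the binomial coefficient C(n, p).
    assert n_runs == comb(n, p)
    print(f"n={n}, p={p}: {n_runs} runs")
# prints:
# n=10, p=1: 10 runs
# n=10, p=2: 45 runs
```

Because the count is C(n, p), it grows quickly: with n=100 and p=2 there would already be 4950 runs.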

Holdout cross-validation:

This technique divides the data into two parts, typically in a 70:30 or 60:40 ratio. Training is done on the larger chunk and testing on the smaller chunk. One major disadvantage is that it works poorly for imbalanced datasets, because a single random split may not preserve the class proportions. Second, a sizeable chunk of the data is never used for learning.
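A minimal sketch of a 70:30 holdout split is given below. It works on sample indices only, and the helper `holdout_split` and its `test_fraction` and `seed` parameters are assumptions made for this illustration.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the indices once and cut them into a training and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(100, test_fraction=0.3)
print(len(train_idx), len(test_idx))  # prints: 70 30
```

The 30 test points are never seen during training, which is the loss of information described above.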

K Fold cross-validation:

In this technique, the data is divided into K groups (folds). In each run, K-1 folds go into training and the remaining fold goes into testing. A total of K runs is needed so that every fold is used for testing once. The estimate has relatively low bias, since each model is trained on most of the data, but the technique performs poorly on an imbalanced dataset, because a fold may contain few or no samples of a minority class. Each pink box below contains many data points.
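The fold rotation can be sketched as follows, assuming contiguous folds with no shuffling. The helper `k_fold_splits` is a hypothetical name used only for this example.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for run, (train, test) in enumerate(k_fold_splits(10, 5), start=1):
    print(f"run {run}: test on {test}, train on {len(train)} points")
```

Every index appears in exactly one test fold, so after K runs the whole dataset has been used for testing once and for training K-1 times.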

Repeated Random Sampling Validation:

This technique repeatedly splits the dataset at random into training and testing data. Due to this intrinsic randomness, it is also referred to as Monte Carlo cross-validation. Here neither the split ratio nor the number of iterations is fixed. In the end, the accuracy is given by the average over all the runs. Because the splits are drawn independently, some points may never be selected for testing, or never for training, while others are selected several times, which is one of the major disadvantages of this technique. With imbalanced datasets, this validation technique again performs poorly.
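The uneven coverage caused by independent random splits can be seen in a small sketch. The helper `monte_carlo_splits`, the 70:30 ratio and the five iterations are all assumptions chosen for illustration; in a real run a model would be scored on each split and the scores averaged.

```python
import random


def monte_carlo_splits(n_samples, n_iterations, test_fraction=0.3, seed=0):
    """Yield n_iterations independent random (train_indices, test_indices) splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_iterations):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


# Count how often each point lands in the test set across the runs.
n_samples, n_iterations = 20, 5
test_counts = [0] * n_samples
for train, test in monte_carlo_splits(n_samples, n_iterations):
    for i in test:
        test_counts[i] += 1

never_tested = [i for i, count in enumerate(test_counts) if count == 0]
print("test-set appearances per point:", test_counts)
print("points never tested:", never_tested)
```

Unlike K fold, nothing forces the counts to be equal, so the list of appearances typically contains both zeros and values above one.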

Stratified K fold Cross Validation:

In stratified K fold cross-validation, the data is again split into K folds. However, while making the folds, it is ensured that each class is represented in every fold in the same proportion as in the original dataset. This makes the technique work well with imbalanced datasets too.
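One simple way to build such folds is to deal the samples of each class round-robin across the folds. The sketch below assumes this approach and does no shuffling; `stratified_k_fold` is a hypothetical helper, and the 80:20 toy labels are made up for the example.

```python
from collections import Counter, defaultdict


def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) with class proportions kept per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for fold in folds:
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(fold)


# Imbalanced toy labels: 16 samples of class 0 and 4 of class 1 (80:20).
labels = [0] * 16 + [1] * 4
for train, test in stratified_k_fold(labels, k=4):
    # Each test fold holds 4 samples of class 0 and 1 sample of class 1.
    print(Counter(labels[i] for i in test))
```

Every test fold keeps the 80:20 ratio of the full dataset, which a plain K fold split of the same ordered labels would not.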

Time Series Cross Validation:

Data that changes with time needs special treatment, because we cannot use future data to predict the past. To ensure this, we always train on the data up to t-1 and test on the data at t. The next time, we train on the data up to t and test on t+1, and so on.
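The expanding training window can be sketched as below. The helper `time_series_splits` and its `min_train_size` parameter (how many points to collect before the first test) are assumptions introduced for this example.

```python
def time_series_splits(n_samples, min_train_size=1):
    """Yield (train_indices, test_index): train on everything before t, test on t."""
    for t in range(min_train_size, n_samples):
        yield list(range(t)), t


for train, test in time_series_splits(6, min_train_size=3):
    print(f"train on {train}, test on {test}")
# prints:
# train on [0, 1, 2], test on 3
# train on [0, 1, 2, 3], test on 4
# train on [0, 1, 2, 3, 4], test on 5
```

In every split the largest training index is smaller than the test index, so no future observation leaks into training.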

Nested Cross Validation:

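In nested cross-validation, one cross-validation loop is placed inside another. The inner loop, usually a K fold or stratified K fold cross-validation, is run on the training portion of each outer fold to compare model hyperparameters, and the hyperparameters that give the best accuracy are picked. The outer loop then tests the tuned model on data the inner loop never saw, which gives a less optimistic estimate of its accuracy than tuning and testing on the same folds.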
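A minimal sketch of the usual nested arrangement follows, assuming an inner loop that picks a hyperparameter and an outer loop that scores the tuned model. Everything in it is hypothetical and chosen to keep the example short: the toy model predicts a shrunken mean of the training targets, `shrink` is its only hyperparameter, and mean squared error stands in for accuracy because the toy problem is a regression.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    bounds = [round(i * n_samples / k) for i in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        test = list(range(lo, hi))
        train = list(range(0, lo)) + list(range(hi, n_samples))
        yield train, test


def fit_predict(train_y, shrink):
    """Hypothetical toy model: predict a shrunken mean of the training targets."""
    return shrink * sum(train_y) / len(train_y)


def mse(y_true, prediction):
    return sum((y - prediction) ** 2 for y in y_true) / len(y_true)


def cv_error(y, shrink, k):
    """Mean squared error of the toy model, averaged over k folds."""
    errors = []
    for train, test in k_fold_splits(len(y), k):
        prediction = fit_predict([y[i] for i in train], shrink)
        errors.append(mse([y[i] for i in test], prediction))
    return sum(errors) / len(errors)


def nested_cv(y, candidates, outer_k=5, inner_k=3):
    outer_errors, chosen = [], []
    for train, test in k_fold_splits(len(y), outer_k):
        train_y = [y[i] for i in train]
        # Inner loop: choose the hyperparameter using the outer-training data only.
        best = min(candidates, key=lambda s: cv_error(train_y, s, inner_k))
        chosen.append(best)
        # Outer loop: score the tuned model on data the inner loop never saw.
        prediction = fit_predict(train_y, best)
        outer_errors.append(mse([y[i] for i in test], prediction))
    return sum(outer_errors) / len(outer_errors), chosen


y = [5.0, 5.2, 4.8, 5.1, 4.9, 5.3, 4.7, 5.0, 5.1, 4.9]
error, chosen = nested_cv(y, candidates=[0.0, 0.5, 1.0])
print("outer-loop MSE estimate:", error)
print("hyperparameter chosen in each outer fold:", chosen)
```

The key design point is that the outer test fold is never passed to `cv_error`, so the reported error is not influenced by the hyperparameter search.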


Author

Navin Baskar



