
Curve fitting in MATLAB:


MATLAB

ARTH SOJITRA
updated on 19 Jun 2020

Aim:

Perform the linear and cubic fit for a given set of data and then gain insights on the several parameters that can be used to identify a good fit.

Theory:

Curve fitting is the process of constructing a mathematical curve that fits a set of data points according to some criterion, yielding a formula that can be used to interpolate or extrapolate values. It is a powerful method for predicting data at future times, or at points where no discrete value was measured. It is widely used in finance and actuarial mathematics, for example to forecast stock prices and to assess financial risk so that investments can be made accordingly.

Curve fitting works by choosing a mathematical function that reproduces the specified data to within an acceptable tolerance. Polynomial functions are commonly used to fit the data and are of the form

$f(x) = A_0 + A_1 x + A_2 x^2 + A_3 x^3 + \dots + A_n x^n$

where an appropriate order of the polynomial can be selected to get the desired curve fit.

Many statistical packages such as R and numerical software such as Maple, MATLAB, Mathematica, GNU Octave, and SciPy include commands for curve fitting in a variety of scenarios. In this report we limit our focus to MATLAB, as it makes it easy to program and to visualize the data.

Several parameters can be used to measure the goodness of a fit.

1.) Sum of Squares of the errors:

Consider the diagram shown below, in which we have to perform a curve fit with an $n^{th}$-degree polynomial.

Consider the curve fit as:

$f(x) = A_0 + A_1 x + A_2 x^2 + A_3 x^3 + \dots + A_n x^n$

Evaluated at the data points, this curve gives a fitted value for each of them.

In the sum of squares of the error (SSE), we take the difference between the actual data and the fitted value at each discrete data point, square it, and sum over all the data points. Our goal is to find the coefficients for which this sum is a minimum.

The SSE is a function of the coefficients $A_0, A_1, A_2, \dots, A_n$. To minimize it we therefore partially differentiate it with respect to each unknown coefficient and set the result equal to zero. This gives n + 1 equations in n + 1 unknowns, which can then be solved to obtain the coefficients.

Let the data points be: $(x_1, y_1),\ (x_2, y_2),\ (x_3, y_3),\ \dots,\ (x_N, y_N)$

So the sum of squares of the error is:

$SSE = \sum_{i=1}^{N} [y_i - f(x_i)]^2$

Then to find the coefficients $A_0, A_1, A_2, \dots, A_n$:

$\frac{\partial (SSE)}{\partial A_k} = 0, \quad k = 0, 1, 2, \dots, n$

Solving these n + 1 equations in n + 1 unknowns we get the coefficients $A_0, A_1, A_2, \dots, A_n$ and hence our curve fit.

The sum of squares of the error can also be used to check the goodness of a given curve fit. For the same data set, the lower the sum of squares of the error, the better the fit.
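As a minimal sketch of the minimization described above, the conditions obtained by setting each partial derivative of the SSE to zero can be written as a linear system (the normal equations) and solved directly. The sample data, the degree n = 2 and the names V and A_ls are assumptions made for illustration and are not part of this report's data set; polyfit is used here only as a cross-check.

```matlab
% Hypothetical sample data (NOT the Cp data used in this report)
x = [0; 1; 2; 3; 4];
y = [1.0; 2.7; 5.8; 11.1; 17.9];
n = 2;                               % assumed polynomial degree

% Design matrix with columns 1, x, x.^2, ..., x.^n
V = zeros(numel(x), n + 1);
for k = 0:n
    V(:, k + 1) = x.^k;
end

% Setting d(SSE)/dA_k = 0 for k = 0..n gives (V'*V)*A = V'*y
A_ls = (V' * V) \ (V' * y);          % A_ls(1) = A_0, A_ls(2) = A_1, ...

% polyfit returns the coefficients in descending powers, so flip to compare
p = polyfit(x, y, n);
max_diff = max(abs(flipud(A_ls) - p(:)));
fprintf('Largest coefficient difference vs polyfit: %g\n', max_diff);
```

For higher degrees, forming V'*V explicitly can become poorly conditioned, so in practice polyfit is usually the safer choice; the sketch only illustrates where the coefficients come from.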

2.) R-Squared

This is another quantity that measures how well the curve fits the given data. For a least-squares fit it equals the square of the correlation coefficient between the fitted values and the actual data. If the curve fit is a good one, we expect a strong correlation between the two, since they rise and fall together.

Consider the following diagram showing the data set and the parameters used to calculate R-squared.

If $y = f(x)$ is the curve fit, then the quantity SSR (the sum of squares due to regression) can be defined as:

$SSR = \sum_{i=1}^{N} [f(x_i) - \bar{y}]^2$ where $\bar{y}$ is the mean of the original data set

and SSE can be defined as follows:

$SSE = \sum_{i=1}^{N} [y_i - f(x_i)]^2$

Now we can define the term SST (the total sum of squares) as:

$SST = SSR + SSE$

and the R-squared can be defined as:

$R^2 = \frac{SSR}{SST}$

Since SSR and SSE are both non-negative, the value of $R^2$ lies between 0 and 1. It can be interpreted like this:

If $R^2$ is very close to zero, the curve fit is very bad, i.e. the fitted values and the data have a very weak correlation.
If $R^2$ is very close to one, the curve fit is very good, i.e. the fitted values and the data have a very strong correlation.

3.) RMSE ( Root Mean Square Error )

This is another statistical measure of how good the curve fit is. As the name suggests, we square the errors, take their mean, and finally take the square root.

It is closely related to the SSE ( Sum of Squares of the Error ) discussed earlier in the report. Mathematically,

$RMSE = \sqrt{\frac{\sum_{i=1}^{N} [y_i - f(x_i)]^2}{N}}$

therefore we can see that:

$RMSE = \sqrt{\frac{SSE}{N}}$. The smaller the RMSE, the better the curve fit. Unlike the SSE, the RMSE has the same units as the data.
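The three measures can be collected in a few vectorized lines. The following is a sketch under stated assumptions: the sample vectors x and y are made up for illustration (they are not the Cp data), and the names SSE, SSR, SST, R2 and RMSE simply mirror the symbols used in the formulas of this section.

```matlab
% Hypothetical sample data (NOT the Cp data used in this report)
x = (1:10)';
y = [2.1; 3.9; 6.2; 8.1; 9.8; 12.2; 13.9; 16.1; 18.0; 20.2];

% Least-squares linear fit and its values at the data points
p = polyfit(x, y, 1);
f = polyval(p, x);

% Goodness-of-fit measures as defined in the theory section
SSE  = sum((y - f).^2);            % sum of squares of the error
SSR  = sum((f - mean(y)).^2);      % sum of squares due to regression
SST  = SSR + SSE;                  % total sum of squares
R2   = SSR / SST;                  % R-squared
RMSE = sqrt(SSE / numel(y));       % root mean square error

% For a least-squares polynomial fit, SST also equals the total
% variation of the data about its mean
SST_direct = sum((y - mean(y)).^2);

fprintf('SSE = %.4f, R2 = %.4f, RMSE = %.4f\n', SSE, R2, RMSE);
```

The check against SST_direct relies on the fit being a least-squares polynomial with a constant term; for an arbitrary curve the identity SST = SSR + SSE need not hold.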

Setup:

To compare the goodness of the various fits we will use data for the variation of the specific heat $C_p$ with temperature T. We shall compare the parameters discussed above for the linear and the cubic fit.

To start, we first need to import the data into the MATLAB workspace, which is done with the following commands.

clear all;
close all;
clc;

% Loading the data into MATLAB with the help of an inbuilt command
load('data')

%
% Now this data will contain around 3200 rows and 2 columns
%
% The column 1 is the data set for temperature
%
% The column 2 is the data set for the Cp Values
%

Explanation: This code first clears the MATLAB workspace to remove unnecessary variables, closes all the figures and clears the command window. It then imports the data from the data file. The data file must be present in MATLAB's current folder (or on the search path) so that MATLAB can read it.

After importing the data the variable data will have around 3200 rows and 2 columns as is shown below:

In this, the first column corresponds to the temperature data points and the second column corresponds to the specific heat data points.

We need to extract these two data sets into separate vectors to allow easy operations on them. It is done with the following MATLAB commands.

% 
% Extracting the temperature data as the first column
temperature_data = data(:,1);

%
% Extracting the Cp data
cp_data = data(:,2);

Explanation: This code extracts the first column of the data table and stores it in a vector called temperature_data, and stores the second column, which holds the specific heat values, in the cp_data vector.

Once the data is loaded, it is important to look at it in graphical form. This is achieved with the following MATLAB code:

%% Plotting raw data

% Plotting the initial data set
figure(1);clf;

% Resizing the figure
set(gcf,'Position',[100,100,900,700]);

% Plotting the data
plot(temperature_data,cp_data,'linew',4,'color','b');

% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Specific Heat in [kJ/(kmol K)] ');
title(' Initial Data set ');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph of the initial data was obtained:

As the graph clearly shows, the specific heat is not constant with temperature, although it is often assumed to be constant to simplify calculations. Rather, it appears to be a complex function of temperature, which is what makes curve fitting useful here.

Now we will try to fit a polynomial function to the data above:

1.) Linear fit :

The idea behind the linear fit is very simple. We want to fit the linear function $f(T) = A + BT$ where T is the variable and A, B are the constants. It is achieved with the following MATLAB commands:

%% Fitting the linear data

% Using the Polyfit command
coeffs_linear = polyfit(temperature_data,cp_data,1);

% Creating the linear fit function
linear_fit = polyval(coeffs_linear,temperature_data);

Explanation: This code computes the coefficients of the linear polynomial and stores them in the variable coeffs_linear. polyfit returns the coefficients in descending powers, so coeffs_linear(1) is B and coeffs_linear(2) is A. Using these coefficients we then generate a new data set corresponding to the linear fit by evaluating the function at the discrete temperature points.

Now we shall compare the linear fit with the original data. It is done with the following MATLAB commands:

% Plotting and comparing the initial vs the fit curve 
figure(2);clf;

% Resizing the figure
set(gcf,'Position',[100,100,700,700]);

% Plotting the data
plot(temperature_data,cp_data,'linew',4,'color','b');
hold on;
plot(temperature_data,linear_fit,'linew',4,'color','r');


% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Specific Heat in [kJ/(kmol K)] ');
title(' Linear fit vs original data ');

% Adding the legend
legend('Original Data','Using the Curve fit for a linear function',...
    'location','northwest');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph was obtained:

As the graph clearly shows, the linear fit is a very crude approximation to the original data. Our calculation of the statistical parameters will also confirm this.

First we find the error between the original and the curve-fitted data, which is done with the following MATLAB commands:

% Linear Data
size_of_data = max(size(linear_fit));

% Calculating the sum of the squares of the error 
for i=1:size_of_data
    
    % Computing the squared difference between the actual data and 
    % the fitted value
    square_linear(i) = ( cp_data(i) - linear_fit(i) )^2;
    
end

% Computing the sum
sum_of_square_linear = sum(square_linear);

Explanation: This code creates a vector in which each element is the square of the difference between the actual data value and the corresponding value of the curve fit, and then sums these elements.

On running the code that computes sum_of_square_linear, the following output was obtained:

As can be seen, the error is quite high. This confirms our observation that the linear fit is very crude. We can also look at the square of the error at each discrete data point. It is done with the following MATLAB commands:
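The element-by-element loop can also be written in vectorized form, which is the more idiomatic MATLAB style. The sketch below uses short made-up vectors (cp_demo and fit_demo are hypothetical names and values, not the report's data) to show that both forms give the same sum.

```matlab
% Hypothetical values standing in for the data and the fitted curve
cp_demo  = [29.0; 29.5; 30.2; 31.0; 31.9];
fit_demo = [28.8; 29.6; 30.4; 31.2; 32.0];

% Loop form, with the result vector preallocated
sq_loop = zeros(size(cp_demo));
for i = 1:numel(cp_demo)
    sq_loop(i) = (cp_demo(i) - fit_demo(i))^2;
end
sse_loop = sum(sq_loop);

% Vectorized form: element-wise subtraction and squaring
sq_vec  = (cp_demo - fit_demo).^2;
sse_vec = sum(sq_vec);

fprintf('Loop SSE = %.4f, vectorized SSE = %.4f\n', sse_loop, sse_vec);
```

Preallocating with zeros avoids growing the vector on every iteration, which matters for a data set with a few thousand rows.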

% Plotting the error:
figure(3);clf;

% Resizing the figure
set(gcf,'Position',[100,100,900,700]);

% Plotting the data
plot(temperature_data,square_linear,'linew',4,'color','b');

% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Squared error ');
title(' Squared error at each discrete data point ');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph was obtained:

As the graph shows, the squared error at the initial data points is quite high, so the curve fit is poor there. From the middle of the range towards the end, however, the error is fairly close to zero, so the curve fit is better there.

Now we shall calculate the R-squared term for the linear fit. The following MATLAB commands are used:

% Calculating the mean of cp_data, to be used in the calculation of
% R-squared.
mean_cp = mean(cp_data);

% Calculating the squared deviation of the fit from the mean
for i=1:size_of_data
    
    % Computing the squared difference between the mean and the fitted value
    R_linear(i) = ( mean_cp - linear_fit(i) )^2;
    
end

% Summing up to get the SSR term
SSR_squared_linear = sum(R_linear);

% Finding the SST term
SST_linear = SSR_squared_linear + sum_of_square_linear;

% Finally finding the R-square
R_square_linear = SSR_squared_linear/SST_linear;

This code implements the formula discussed in the Theory section and finds the R-squared. The following output is obtained:

The R-squared for the linear fit is fairly close to one, which taken alone would suggest a reasonable fit. The large sum of squared errors shows why R-squared should not be judged in isolation. Now we shall calculate the RMSE ( Root Mean Square Error ). It is implemented with the following MATLAB command:

% Calculating the RMSE for the linear fit
RMSE_linear = sqrt(sum_of_square_linear / size_of_data );

The following output was obtained:

2.) Cubic fit :

The idea behind the cubic fit is equally simple. We want to fit the cubic function $f(T) = A + BT + CT^2 + DT^3$ where T is the variable and A, B, C, D are the constants. It is achieved with the following MATLAB commands:

% Using the Polyfit command
coeffs_cubic = polyfit(temperature_data,cp_data,3);

% Creating the cubic fit function
cubic_fit = polyval(coeffs_cubic,temperature_data);

Explanation: This code computes the coefficients of the cubic polynomial and stores them in the variable coeffs_cubic. Using these coefficients we then generate a new data set corresponding to the cubic fit by evaluating the function at the discrete temperature points.

Now we shall compare the cubic fit with the original data. It is done with the following MATLAB commands:

% Plotting and comparing the initial vs the cubic fit curve 
figure(4);clf;

% Resizing the figure
set(gcf,'Position',[100,100,700,700]);

% Plotting the data
plot(temperature_data,cp_data,'linew',4,'color','b');
hold on;
plot(temperature_data,cubic_fit,'linew',4,'color','r');


% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Specific Heat in [kJ/(kmol K)] ');
title(' Cubic fit vs Original Data ');

% Adding the legend
legend('Original Data','Using the Curve fit for a cubic function',...
    'location','northwest');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph was obtained:

As the graph clearly shows, the cubic fit is a considerably better approximation to the original data than the linear fit. Our calculation of the statistical parameters will also confirm this.

First we find the error between the original and the curve-fitted data, which is done with the following MATLAB commands:

% Calculating the sum of the squares of the error
for i=1:size_of_data
    
    % Computing the squared difference between the actual data and the fitted value
    square_cubic(i) = ( cp_data(i) - cubic_fit(i) )^2;
    
end

% Computing the sum
error_square_cubic = sum(square_cubic);

Explanation: This code creates a vector in which each element is the square of the difference between the actual data value and the corresponding value of the curve fit, and then sums these elements.

On running this MATLAB command the following output was obtained:

As can be seen, the error is still sizeable but much lower than in the linear case. This confirms our observation that the cubic fit is a better approximation than the linear fit. We can also look at the square of the error at each discrete data point. It is done with the following MATLAB commands:

% Plotting the error:
figure(5);clf;

% Resizing the figure
set(gcf,'Position',[100,100,900,700]);

% Plotting the data
plot(temperature_data,square_cubic,'linew',4,'color','b');

% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Squared error ');
title(' Squared error at each discrete data point for cubic case ');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph was obtained:

As the graph shows, the squared error at the initial data points is quite low, so the curve fit is good there. Towards the end of the range, however, the error rises, so the fit is cruder there. Overall, the cubic fit stays close to the data with only small fluctuations.

Now we shall calculate the R-squared term for the cubic fit. The following MATLAB commands are used:

% Calculating the squared deviation of the fit from the mean
for i=1:size_of_data
    
    % Computing the squared difference between the mean and the fitted value
    R_cubic(i) = ( mean_cp - cubic_fit(i) )^2;
    
end


% Summing up to get the SSR term
SSR_squared_cp_cubic = sum(R_cubic);

% Calculating the SST term for the cubic fit
% (uses the scalar sum error_square_cubic, not the vector square_cubic)
SST_cubic = SSR_squared_cp_cubic + error_square_cubic;

% Calculating the R-squared term
R_squared_cubic = SSR_squared_cp_cubic/SST_cubic;

This code implements the formula discussed in the Theory section and finds the R-squared. The following output is obtained:

The R-squared for the cubic fit is very close to one, suggesting that it is a very good fit. Now we shall calculate the RMSE ( Root Mean Square Error ). It is implemented with the following MATLAB command:

% Calculating the RMSE term for the cubic fit
RMSE_cubic = sqrt(error_square_cubic / size_of_data );

The following output was obtained:

Confirming that our calculations are correct:

For a least-squares fit, the correlation coefficient between the data and the fitted values is the square root of $R^2$. We compute it from our formula in MATLAB and compare it with the value given by MATLAB's built-in command:

% From our formula the correlation coefficient
correlation_cubic_fit = sqrt(R_squared_cubic)

% Using the MATLAB inbuilt command
correlation_matlab_function = corr(cp_data,cubic_fit)

The following output is obtained:

As can be seen, the value from the formula and the value from MATLAB's built-in command are the same.

The workspace after the final calculations is shown below:

Having calculated the parameters of the linear and cubic fits, we can now compare them:
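The corr function belongs to the Statistics and Machine Learning Toolbox. Where that toolbox is not available, the same check can be sketched with corrcoef from base MATLAB. The data below is made up for illustration (a cubic trend with a small deterministic wiggle, not the Cp data), and the names r_formula and r_builtin are hypothetical.

```matlab
% Hypothetical data: cubic trend plus a small deterministic wiggle
x = linspace(0, 2, 50)';
y = 1 + 2*x - 0.5*x.^2 + 0.3*x.^3 + 0.05*sin(25*x);

% Cubic least-squares fit evaluated at the data points
f = polyval(polyfit(x, y, 3), x);

% Correlation coefficient from the R-squared formula
SSE = sum((y - f).^2);
SSR = sum((f - mean(y)).^2);
r_formula = sqrt(SSR / (SSR + SSE));

% Correlation coefficient from corrcoef (returns a 2x2 matrix)
C = corrcoef(y, f);
r_builtin = C(1, 2);

fprintf('formula: %.6f, corrcoef: %.6f\n', r_formula, r_builtin);
```

The two values agree because the fit is a least-squares polynomial with a constant term; this is the same condition under which SST = SSR + SSE holds.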

Comparison of Parameters

| Fit | Sum of squares of error | R-squared | RMSE | Correlation coefficient |
|---|---|---|---|---|
| Linear fit | 2163049.08 | 0.9249 | 25.991 | 0.9617 |
| Cubic fit | 94272.02 | 0.9967 | 5.4277 | 0.9983 |

Clearly, as per our discussion on the good fit, the parameters of the cubic fit are in favor of qualifying it as a better fit than the linear fit.

Now we can answer some of the questions:

1.) How to make a curve fit perfectly?

Ans: A perfect fit is one in which every data point satisfies the fitted function, i.e. the error is zero. For N data points with distinct x-values it can be achieved with a polynomial of degree N - 1, which has N coefficients, one for each data point. Such a fit reproduces the data exactly, but for a large number of points it tends to be ill-conditioned and to oscillate between the points, so a perfect fit is not necessarily a useful one.
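A small sketch of this idea, using five made-up points (not the Cp data): a polynomial of degree N - 1 = 4 passes through all N = 5 points, so the residual is zero up to round-off. The names x_demo, y_demo and max_residual are hypothetical.

```matlab
% Five hypothetical data points with distinct x-values
x_demo = [1; 2; 3; 4; 5];
y_demo = [2; -1; 4; 0; 3];

% Degree N-1 polynomial: as many coefficients as data points
N = numel(x_demo);
p_exact = polyfit(x_demo, y_demo, N - 1);

% The fitted curve reproduces every data point
max_residual = max(abs(polyval(p_exact, x_demo) - y_demo));
fprintf('Largest residual: %g\n', max_residual);
```

With the roughly 3200 points of the Cp data set this approach would be impractical, which is why lower-order or piecewise fits are used instead.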

2.) How to get the best fit?

Ans: Getting the best fit means experimenting with basic parameters such as the order of the polynomial and the range of data to be fitted. The best fit is obtained by trial and error, comparing the SSE, R-squared and RMSE of each attempt until the desired level of accuracy is reached.

3.) What could be done to improve the cubic fit?

Ans: There are several ways to improve a cubic fit. One is to perform the cubic fit piecewise, i.e. separately on several intervals, and then combine the pieces to form a global fit. This is shown in the MATLAB code below:

%% Improving the cubic fit:

% Partitioning the T into sub-intervals
T1 = temperature_data(1:1100);
T2 = temperature_data(1101:2099);
T3 = temperature_data(2100:end);

% Partitioning the Cp into subintervals
C1 = cp_data(1:1100);
C2 = cp_data(1101:2099);
C3 = cp_data(2100:end);


% Polyfitting the curve in these subintervals
coeffs_1 = polyfit(T1,C1,3);
coeffs_2 = polyfit(T2,C2,3);
coeffs_3 = polyfit(T3,C3,3);

% Evaluating the polynomials
P1 = polyval(coeffs_1,T1);
P2 = polyval(coeffs_2,T2);
P3 = polyval(coeffs_3,T3);

% Combining the data
P_overall = [P1(:)',P2(:)',P3(:)'];

This code partitions the domain into 3 sub-intervals and performs a cubic fit on each. After the piecewise cubic fitting we plot the data sets on top of one another. It is done using the following MATLAB commands:

% Plotting the overall curve fit obtained by piecewise splitting and
% fitting the cubic curves individually
% Plotting and comparing the initial vs the piecewise cubic fit curve 
figure(6);clf;

% Resizing the figure
set(gcf,'Position',[100,100,800,700]);

% Plotting the data
plot(temperature_data,cp_data,'linew',4,'color','b');
hold on;
plot(temperature_data,P_overall,'linew',4,'color','r');


% Adding the labels and title
xlabel(' Temperature in [K]');
ylabel(' Specific Heat in [kJ/(kmol K)] ');
title(' Comparison using the piecewise splitting ');

% Adding the legend
legend('Original Data',['Using the Curve fit for'...
    ' a cubic function by piecewise splitting'],...
    'location','northwest');

% Turning on the grid
grid on;
grid minor;

% Increasing the font size
set(gca,'FontSize',20)

The following graph was obtained:

As the graph shows, we get a very good fit by splitting the data piecewise and performing the cubic fit in each sub-interval. Now we shall compute the statistical parameters to confirm this. It is done by the following code:

%% Calculation of statistical parameters for piecewise fit

% Calculating the sum of the squares of the error
for i=1:size_of_data
    
    % Computing the squared difference between the actual data and the fitted value
    P_cubic(i) = ( cp_data(i) - P_overall(i) )^2;
    
end

% Computing the sum
error_square_cubic_P = sum(P_cubic);

% Calculating the squared deviation of the fit from the mean
for i=1:size_of_data
    
    % Computing the squared difference between the mean and the fitted value
    R_cubic_P(i) = ( mean_cp - P_overall(i) )^2;
    
end


% Summing up the data
SSR_squared_cp_cubic_P = sum(R_cubic_P);

% Calculating the SST term for the cubic fit
SST_cubic_P = SSR_squared_cp_cubic_P + error_square_cubic_P;

% Calculating the R-squared term
R_squared_cubic_P = SSR_squared_cp_cubic_P/SST_cubic_P;

% Calculating the RMSE term for the piecewise cubic fit
RMSE_cubic_P = sqrt(error_square_cubic_P / size_of_data );

% From our formula the correlation coefficient
correlation_cubic_fit_P = sqrt(R_squared_cubic_P);

The following outputs were obtained:

The parameters strongly suggest that this is a much better fit than the single cubic and linear polynomial fits, which confirms the accuracy gained by piecewise splitting.

Errors Faced While Solving the Challenge:

1.) Mis-indexing the imported data:

% 
% Extracting the temperature data as the first column
temperature_data = data(1,:);

This line extracts the first row of data (a 1-by-2 vector) instead of the first column, and by itself it does not raise any error. Later in the program, however, the code crashes because the matrix dimensions do not agree, as is shown below. The fix is to index the column with data(:,1).
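The difference between the two indexing forms can be seen in a minimal sketch with a small made-up matrix (data_demo and its values are hypothetical, standing in for the imported data):

```matlab
% Hypothetical 4-by-2 matrix: column 1 = temperature, column 2 = Cp
data_demo = [300 29.1; 400 29.4; 500 29.9; 600 30.5];

% Row indexing: first row only (one temperature and one Cp value)
wrong_vector = data_demo(1, :);

% Column indexing: all temperatures
right_vector = data_demo(:, 1);

fprintf('data(1,:) is %d-by-%d\n', size(wrong_vector, 1), size(wrong_vector, 2));
fprintf('data(:,1) is %d-by-%d\n', size(right_vector, 1), size(right_vector, 2));
```

Checking size() on the extracted vectors right after loading is a cheap way to catch this mistake before it surfaces as a dimension error further down.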

2.) Misordering of the inputs in the polyfit command:

Sometimes the user makes the mistake of passing the input arguments in the wrong order, as is shown here:

% Using the Polyfit command
coeffs_linear = polyfit(cp_data,temperature_data,1);

% Creating the linear fit function
linear_fit = polyval(coeffs_linear,temperature_data);

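This code by itself does not give any error, but polyfit now fits temperature as a function of Cp, and polyval then evaluates that function at the temperature values. The resulting graph is therefore completely meaningless, as is shown below:

This error is easily fixed by passing the input arguments in the correct order: the independent variable (temperature) first and the dependent variable (Cp) second.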

3.) The wrong partitioning of the interval:

Another common mistake is to partition the interval incorrectly, either by including an extra point or by missing a point, as is shown below:

% Partitioning the T into sub-intervals
T1 = temperature_data(1:1100);
T2 = temperature_data(1100:2100);
T3 = temperature_data(2100:end);

% Partitioning the Cp into subintervals
C1 = cp_data(1:1100);
C2 = cp_data(1100:2100);
C3 = cp_data(2100:end);

This code does not give any error by itself, but the partitioning is clearly wrong. Points 1100 and 2100 are each taken twice, so the combined vector is longer than the original vector, and this results in an error in the subsequent part of the code, as is shown below:
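A quick way to catch this is to check that the lengths of the pieces add up to the length of the original vector. The sketch below assumes a vector of N = 3200 points (the report only states "around 3200" rows, so this value is an assumption) and compares the correct and the overlapping index ranges.

```matlab
% Assumed number of data points (the report states "around 3200")
N = 3200;

% Correct partition: no index appears twice
len_good = [numel(1:1100), numel(1101:2099), numel(2100:N)];

% Faulty partition: indices 1100 and 2100 each appear twice
len_bad = [numel(1:1100), numel(1100:2100), numel(2100:N)];

fprintf('Correct partition covers %d points\n', sum(len_good));
fprintf('Faulty partition covers %d points\n', sum(len_bad));
```

In the actual script, a comparison such as numel(P_overall) == numel(temperature_data) placed before plotting would flag the mismatch early.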

Conclusion:

In this report, we studied the nuances of curve fitting and the statistical parameters that help determine how good a fit is. Curve fitting has several applications in actuarial and applied mathematics. In this report, for example, we performed the curve fit on the Cp data. If for some reason the Cp data cannot be obtained at very high temperatures, a curve fit of appropriate accuracy can provide an approximation of the values we want. Though an extrapolated value may still be far from the real value, curve fitting at least gives a first estimate.