1. Perform Gradient Descent in Python with any loss function.
Refer to Jupyter Notebook attached.
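Since the notebook itself is not reproduced here, a minimal, self-contained sketch of gradient descent on a mean-squared-error loss for simple linear regression (the toy data and hyperparameters are illustrative choices, not taken from the notebook) might look like this:

```python
import numpy as np

# Toy data for y ≈ 3x + 2 (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3 * X + 2 + rng.normal(0, 1, size=100)

w, b = 0.0, 0.0   # parameters to learn
lr = 0.01         # learning rate
n = len(X)

for epoch in range(5000):
    y_pred = w * X + b
    error = y_pred - y
    loss = np.mean(error ** 2)           # MSE loss
    dw = (2 / n) * np.sum(error * X)     # dL/dw
    db = (2 / n) * np.sum(error)         # dL/db
    w -= lr * dw                         # gradient-descent update
    b -= lr * db

print(f"learned w={w:.3f}, b={b:.3f}, final MSE={loss:.4f}")
```

The same loop structure works for any differentiable loss; only the gradient expressions change.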
2. Difference between the L1 & L2 gradient descent methods.
The L1 and L2 Gradient Descent methods refer to optimization techniques used to minimize a loss function in machine learning, often in conjunction with regularization. The key difference lies in how they handle regularization and the resulting impact on the model.
L1 Gradient Descent (Lasso Regularization)
Regularization Term: $\lambda \sum_i |w_i|$ (absolute value of weights).
Penalty: Adds a penalty proportional to the absolute value of weights.
Effect on Weights:
Encourages sparsity by shrinking some weights to exactly 0, effectively performing feature selection.
Suitable for datasets with many irrelevant features.
Optimization Behavior:
The gradient of the L1 term is not smooth at $w_i = 0$, which can lead to suboptimal convergence in some cases.
Usage:
Preferable when you expect many features to be irrelevant and want to simplify the model.
L2 Gradient Descent (Ridge Regularization)
Regularization Term: $\lambda \sum_i w_i^2$ (squared value of weights).
Penalty: Adds a penalty proportional to the square of the weights.
Effect on Weights:
Shrinks weights smoothly towards 0, but they rarely reach exactly 0.
Retains all features but reduces their impact.
Optimization Behavior:
The gradient of the L2 term is smooth, leading to more stable and faster convergence.
Usage:
Suitable when all features are expected to contribute to the prediction but need regularization to prevent overfitting.
Key Differences
| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Regularization Term | $\lambda \sum_i \lvert w_i \rvert$ | $\lambda \sum_i w_i^2$ |
| Weight Behavior | Sparse (some weights = 0) | Shrinks weights but retains all |
| Feature Selection | Yes (automatic selection) | No |
| Gradient Smoothness | Non-smooth at $w_i = 0$ | Smooth |
| Model Complexity | Simpler, fewer features | Retains complexity |
| Use Case | Irrelevant features expected | All features contribute |
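To make the update rules concrete, here is a minimal sketch (illustrative, NumPy-based) of a single gradient-descent step on an MSE loss with either penalty; the regularization strength `lam` and the use of `np.sign` as the subgradient at zero are assumptions of this sketch:

```python
import numpy as np

def gd_step(w, X, y, lr=0.01, lam=0.1, penalty="l2"):
    """One gradient-descent step on MSE + regularization for a linear model."""
    n = len(y)
    error = X @ w - y
    grad = (2 / n) * (X.T @ error)      # gradient of the MSE term
    if penalty == "l1":
        # Subgradient of lam * sum(|w|); np.sign(0) = 0 handles w_i = 0
        grad += lam * np.sign(w)
    elif penalty == "l2":
        # Gradient of lam * sum(w^2)
        grad += 2 * lam * w
    return w - lr * grad
```

With the L1 penalty, the regularization pulls each weight toward zero by a constant amount per step, independent of the weight's magnitude, which is what produces exact zeros and sparsity; with the L2 penalty, the pull is proportional to the weight, so weights shrink smoothly but rarely reach exactly zero.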
3. What are the different loss functions for regression?
In regression tasks, the choice of loss function depends on the specific problem and the desired behavior of the model. Common choices include Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber loss, Log-Cosh loss, Quantile loss, Mean Squared Logarithmic Error (MSLE), and Poisson loss.
7. Poisson Loss
Description:
Use when modeling count data or event occurrences.
8. Custom Loss Functions
Description:
Tailored loss functions can be created for specific business objectives or problem constraints.
Example:
Weighted loss functions to prioritize certain errors.
Choosing the Right Loss Function
MSE: Use when large errors need to be penalized heavily.
MAE: Use when robustness to outliers is critical.
Huber/Log-Cosh: Use when a balance between MSE and MAE is needed.
Quantile Loss: Use for conditional quantile predictions.
MSLE: Use for non-negative targets with a wide range.
Poisson Loss: Use for count data.
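As an illustrative sketch (plain NumPy; the Huber `delta` and quantile `q` defaults are assumptions), several of these losses can be written directly:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large ones
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small, 0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

def quantile_loss(y_true, y_pred, q=0.9):
    # Pinball loss: penalizes under- and over-prediction asymmetrically
    err = y_true - y_pred
    return np.mean(np.maximum(q * err, (q - 1) * err))

def msle(y_true, y_pred):
    # Assumes non-negative targets and predictions
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
```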
4. What is the importance of learning rate?
The learning rate is one of the most important hyperparameters in machine learning, especially in optimization algorithms like gradient descent. It controls how much the model adjusts its parameters in response to the error at each iteration. The learning rate plays a crucial role in determining the efficiency and effectiveness of the training process.
Importance of Learning Rate
Controls the Step Size in Gradient Descent:
The learning rate determines the magnitude of updates to the model parameters.
A small learning rate results in small steps, leading to slow convergence.
A large learning rate may result in overshooting the optimal solution or cause the model to diverge (see the sketch after this list).
Affects Convergence Speed:
Optimal Learning Rate: Balances speed and accuracy, allowing the model to converge efficiently.
Too Low: The training process becomes excessively slow, and the model may get stuck in local minima.
Too High: The model may oscillate around the minimum or fail to converge.
Prevents Overfitting or Underfitting:
A well-tuned learning rate ensures proper updates to parameters, helping the model generalize better.
A poorly chosen learning rate can prevent the model from training properly, leading to underfitting (if too low) or to unstable, erratic updates that never settle near a good solution (if too high).
Helps Avoid Vanishing or Exploding Gradients:
Gradients that are too small or too large can destabilize training. An appropriate learning rate mitigates these issues by controlling the parameter updates.
Balances Exploration and Exploitation:
A larger learning rate encourages exploration of the loss surface, which can help escape shallow local minima.
A smaller learning rate focuses on fine-tuning and exploitation near the optimal solution.
Enables Smooth Convergence:
A well-chosen learning rate ensures that the model converges smoothly to the global minimum, avoiding abrupt changes in the loss function.
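As a small illustration of the "too low / too high" behavior described above, here is a sketch of gradient descent on the one-dimensional loss $L(w) = w^2$ (the specific rates are assumptions chosen to show each regime):

```python
def minimize_quadratic(lr, steps=20, w=10.0):
    """Gradient descent on L(w) = w^2, whose gradient is 2w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.001, 0.1, 1.1):   # too low, reasonable, too high
    print(lr, minimize_quadratic(lr))
# lr=0.001 barely moves toward 0, lr=0.1 converges, lr=1.1 diverges
```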
Practical Considerations
Learning Rate Schedulers:
Dynamically adjust the learning rate during training (e.g., reduce it as training progresses); a minimal step-decay sketch appears after this section.
Common schedulers:
Step decay
Exponential decay
Cosine annealing
Reduce on plateau
Adaptive Learning Rates:
Optimizers like Adam, RMSprop, and Adagrad adapt the learning rate for each parameter based on gradients, reducing the need for manual tuning.
Warm-up Learning Rate:
Start with a small learning rate and gradually increase it to stabilize initial training.
Learning Rate Tuning:
Use techniques like grid search, random search, or automated tools like Optuna or Ray Tune to find the optimal learning rate.
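Returning to the schedulers listed above, a step-decay schedule can be sketched in a few lines (the decay factor and drop interval are illustrative assumptions):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Example: starting at 0.1, the rate becomes 0.05 at epoch 10, 0.025 at epoch 20, ...
lrs = [step_decay(0.1, e) for e in range(30)]
```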
How to Choose the Learning Rate
Perform a learning rate range test:
Gradually increase the learning rate during training and observe the loss.
Select a learning rate where the loss decreases steadily without oscillations.
Start with a common default value (e.g., 0.01 or 0.001) and refine based on results.
In summary, the learning rate is critical for the success of training, as it directly influences the model's ability to learn efficiently and converge to an optimal solution.
5. How to evaluate linear regression?
Evaluating a linear regression model involves assessing how well the model fits the data and how accurately it predicts the target variable. The evaluation typically combines statistical metrics (such as $R^2$, adjusted $R^2$, RMSE, and MAE), residual analysis, and visual inspection of predictions against actual values.
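As a minimal sketch of that evaluation in code (assuming scikit-learn; the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic feature matrix X and target y (stand-ins for real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R^2 :", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE :", mean_absolute_error(y_test, y_pred))

# Residual analysis: residuals should look like random noise centered on zero
residuals = y_test - y_pred
print("mean residual:", residuals.mean())
```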
6. What is the difference between multiple and adjusted coefficient of determination?
The multiple coefficient of determination ($R^2$) and the adjusted coefficient of determination ($R^2_{\text{adjusted}}$) are both metrics used to evaluate the fit of a regression model. However, they differ in how they account for the complexity of the model and the number of predictors. Here's a detailed comparison:
1. Multiple Coefficient of Determination ($R^2$)
Definition: $R^2$ measures the proportion of the variance in the dependent variable ($y$) that is explained by the independent variables ($X$) in the model.
$R^2 = 0$: The model explains none of the variance.
$R^2 = 1$: The model explains all of the variance.
Increases (or remains constant) as more predictors are added, regardless of their relevance.
Does not penalize for overfitting or the inclusion of irrelevant variables.
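For reference, the standard definition in terms of the residual and total sums of squares is:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$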
2. Adjusted Coefficient of Determination ($R^2_{\text{adjusted}}$)
Definition: $R^2_{\text{adjusted}}$ adjusts $R^2$ to account for the number of predictors in the model, penalizing the addition of variables that do not improve the model's explanatory power.
Adjusts $R^2$ downward as more predictors are added unless they significantly improve the model.
Can decrease if irrelevant predictors are included.
Provides a more realistic measure of the model's performance, especially for models with many predictors.
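The standard adjustment, for $n$ observations and $p$ predictors, is:
$$R^2_{\text{adjusted}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$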
Key Differences
| Aspect | Multiple $R^2$ | Adjusted $R^2$ |
|---|---|---|
| Purpose | Measures total variance explained. | Adjusts for the number of predictors. |
| Effect of Adding Predictors | Always increases or stays the same. | Increases only if the predictor improves the model. |
| Overfitting | Does not penalize for overfitting. | Penalizes for overfitting. |
| Interpretation | Proportion of variance explained. | Proportion of variance explained, adjusted for model complexity. |
| Use Case | Use for simple models with few predictors. | Use for complex models with multiple predictors. |
Example
Suppose you fit a regression model with $n = 100$ observations and try adding predictors:
Initial Model:
$R^2 = 0.85$
$R^2_{\text{adjusted}} = 0.83$
After Adding an Irrelevant Predictor:
$R^2 = 0.86$ (slight increase due to the added variable).
$R^2_{\text{adjusted}} = 0.82$ (decrease due to the penalty).
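A quick sketch of this adjustment in code (the predictor count `p = 10` is an assumption, since the example does not state it):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With an assumed p = 10 predictors, R^2 = 0.85 gives roughly the 0.83 quoted above
print(round(adjusted_r2(0.85, n=100, p=10), 3))  # 0.833
```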
When to Use Each
$R^2$:
Use when evaluating the explanatory power of the model without concern for the number of predictors.
$R^2_{\text{adjusted}}$:
Use when comparing models with different numbers of predictors or when concerned about overfitting.