Kaggle_Benz

A Summary For Kaggle Featured Prediction Competition: Mercedes-Benz Greener Manufacturing


Table of Contents

  • Background
  • ML Model 1
  • ML Model 2
  • Deep Learning Model
  • Contributor

Background

Mercedes-Benz Greener Manufacturing is a Kaggle featured competition held two years ago with $25,000 in prize money. Contestants were asked to make predictions based on the given datasets. The training set contains 4209 examples with 378 features each. Fewer than 10 of the features are categorical; the rest are sparse features, and none of the features has NaNs or outliers, which makes this a good dataset to practice on. The test set follows the same distribution and also has 4209 rows to predict. In this repository I present three models that apply different methods to reduce feature dimensionality and make predictions. The best model is the second one, which achieves an R2 score of 0.54998 on the final private leaderboard.

ML Model 1

This model uses a Random Forest Regressor to reduce feature dimensionality. The following is a brief summary of the method.

  • Feature Selection (a code sketch follows at the end of this section)

    • Set the total number of epochs to 3000
    • Use KFold cross-validation in each epoch
    • In each epoch, randomly draw 10 features from the full feature set with bootstrap (sampling with replacement)
    • Train on these 10 features and record the r2_score of the epoch
    • Record the importance of every feature in each epoch
    • Select the top 5 features of each epoch and store their importances in an array of shape (epoch, 5)
    • After the first 30 epochs, check whether the r2_score of the current epoch is higher than the average of the previous 30 epochs; if so, record the epoch, otherwise discard it
  • Model

    • Use pd.factorize() to encode the categorical features; LabelEncoder would give the same result
    • Use GridSearchCV to tune and train Random Forest Regressor and XGBoost models (see the second sketch after this section)
  • Result

    • Scores 0.52 on the private leaderboard.
  • Submission

  • Summary

    • This model could be improved in the following ways:
      • Tune the Random Forest Regressor parameters more carefully
      • Factorize the categorical features first, then use the model to select features
      • Draw more than 10 features in each epoch
      • Try more than 7 features in the prediction models
  • Warning

    • After finishing this model, I found I had made several mistakes. Since most features are encoded as 0 and 1, it is reasonable to guess that several columns together represent one original feature (this is similar to our factorize process). The second model was built to correct this.
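
The feature-selection loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exact training script: the file name `train.csv`, the forest size, and the fold count are my guesses, and the encoding step uses `pd.factorize` as the bullet points suggest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

train = pd.read_csv("train.csv")                      # assumed Kaggle file name
y = train["y"].values
X = train.drop(columns=["ID", "y"])

# Encode the few categorical columns with pd.factorize (LabelEncoder works too).
for col in X.select_dtypes(include="object").columns:
    X[col], _ = pd.factorize(X[col])

n_epochs = 3000
rng = np.random.default_rng(0)
epoch_scores, kept_epochs, top5_importances = [], [], []

for epoch in range(n_epochs):
    # Draw 10 candidate features with replacement (bootstrap).
    cols = rng.choice(X.columns.to_numpy(), size=10, replace=True)
    X_sub = X[cols].to_numpy()

    # K-fold cross-validation of a small random forest on this feature subset.
    fold_scores, importances = [], np.zeros(len(cols))
    for tr, va in KFold(n_splits=5, shuffle=True, random_state=epoch).split(X_sub):
        rf = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
        rf.fit(X_sub[tr], y[tr])
        fold_scores.append(r2_score(y[va], rf.predict(X_sub[va])))
        importances += rf.feature_importances_

    score = float(np.mean(fold_scores))
    epoch_scores.append(score)

    # Warm up for 30 epochs, then keep an epoch only if it beats the average
    # of the previous 30 epochs; record its top-5 feature importances.
    if epoch < 30 or score > np.mean(epoch_scores[-31:-1]):
        kept_epochs.append(epoch)
        top5 = np.argsort(importances)[::-1][:5]
        top5_importances.append(list(zip(cols[top5], importances[top5])))
```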

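For the modelling step, a hedged sketch of the grid-search training is below, reusing `X` and `y` from the sketch above. `selected_cols` is a hypothetical name for the feature list distilled from the selection loop, and the parameter grids are illustrative, not the grids actually used.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

X_sel = X[selected_cols]          # features distilled from the selection loop (hypothetical)

rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
    scoring="r2",
    cv=5,
)
rf_grid.fit(X_sel, y)

xgb_grid = GridSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=0),
    param_grid={"n_estimators": [300, 500], "max_depth": [2, 4], "learning_rate": [0.01, 0.05]},
    scoring="r2",
    cv=5,
)
xgb_grid.fit(X_sel, y)

print("RF  best CV r2:", rf_grid.best_score_)
print("XGB best CV r2:", xgb_grid.best_score_)
```
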
ML Model 2

This model does not apply a sophisticated feature-reduction method; it uses TruncatedSVD, PCA, and FastICA to extract 12 components each.

  • Model

    • Concatenate the train and test data
    • Use LabelEncoder to convert the categorical features to numerical ones
    • Use TruncatedSVD, PCA, and FastICA to extract 12 components each; note that PCA is not a great choice for sparse data since it requires centering
    • Create a stacked model composed of one Lasso, one gradient boosting regressor, and one Lasso for the final prediction (a code sketch follows at the end of this section)
    • Create a separate XGBoost model
    • Train the XGBoost model and the stacked model on the 12-component representations
    • Calculate the R2 score
  • Result

    • Unfortunately, the XGBoost model gets 0.39 while the stacked model scores 0.45; neither is a good score. It seems I should not have reduced the features so aggressively.
  • Revised Model

    • Since most of our features are sparse, I should avoid reducing them and simply use the entire original feature set. Let's see what happens.
    • Model
      • Concatenate the train and test data
      • Use LabelEncoder to convert the categorical features to numerical ones
      • Create the same stacked model and train it on the full feature set (see the short sketch after this section)
      • Calculate the R2 score
  • Result

    • Bravo! I get 0.5498 on the final private leaderboard, which approaches the top 100 in this competition!
  • Submission

  • Summary

    • It seems plausible that feature reduction had a negative effect on the result here. I guess this is because most features in this competition are sparse, which means several columns together represent one feature of the original dataset. Because of this sparsity, keeping all features does not cost much space when running the model. In general, however, one should still do feature reduction.
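
A rough sketch of the decomposition-plus-stacking pipeline is below. It uses sklearn's StackingRegressor as one possible reading of the "Lasso + gradient boosting + Lasso" stack described above; the file names, alphas, tree counts, and learning rates are assumptions for illustration, not the values actually used.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")                      # assumed file names
test = pd.read_csv("test.csv")
y = train["y"].values
n_train = len(train)

# Concatenate train and test so both share the same encoding and decomposition.
full = pd.concat([train.drop(columns=["y"]), test], axis=0).drop(columns=["ID"])
for col in full.select_dtypes(include="object").columns:
    full[col] = LabelEncoder().fit_transform(full[col])

# 12 components from each of TruncatedSVD, PCA, and FastICA (36 columns total).
parts = [
    TruncatedSVD(n_components=12, random_state=0).fit_transform(full),
    PCA(n_components=12, random_state=0).fit_transform(full),   # needs centering: risky on sparse data
    FastICA(n_components=12, random_state=0, max_iter=1000).fit_transform(full),
]
components = np.hstack(parts)
X_train, X_test = components[:n_train], components[n_train:]

# One reading of the "Lasso + gradient boosting + Lasso" stack.
stack = StackingRegressor(
    estimators=[("lasso", Lasso(alpha=0.001)),
                ("gbr", GradientBoostingRegressor(random_state=0))],
    final_estimator=Lasso(alpha=0.001),
)

xgb = XGBRegressor(objective="reg:squarederror", n_estimators=500,
                   max_depth=4, learning_rate=0.005, random_state=0)

# Cross-validated R2 on the 36-component representation.
print("stack CV r2:", cross_val_score(stack, X_train, y, scoring="r2", cv=5).mean())
print("xgb   CV r2:", cross_val_score(xgb, X_train, y, scoring="r2", cv=5).mean())
```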

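The revised model keeps the same LabelEncoder step and the same stack but simply skips the decomposition. A minimal sketch, reusing the variable names from the block above (the submission file name is an assumption):

```python
# Train the same stacked model on the full encoded feature matrix instead of
# the 36 decomposition components.
X_train_full = full.iloc[:n_train].to_numpy()
X_test_full = full.iloc[n_train:].to_numpy()

stack.fit(X_train_full, y)
pred = stack.predict(X_test_full)

submission = pd.DataFrame({"ID": test["ID"], "y": pred})
submission.to_csv("submission.csv", index=False)      # file uploaded to Kaggle
```
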
Deep Learning Model

A deep learning model should do well when there are many features, but here the result is mediocre, as expected. I construct a 3-layer dense neural network and add an extra layer for the prediction (a minimal sketch is shown below). I will not say much about this method, because this competition is not a good instance for applying deep learning.
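
For completeness, here is a minimal sketch of a 3-layer dense network of the kind described above, written with Keras for illustration; the framework choice, layer widths, activations, and training settings are all assumptions, not the exact architecture used.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features: int) -> keras.Model:
    # Three dense hidden layers plus a single-unit output layer for regression.
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1),                      # extra layer that produces the prediction
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Example usage with the full-feature matrix from the previous section:
# model = build_model(X_train_full.shape[1])
# model.fit(X_train_full, y, epochs=50, batch_size=32, validation_split=0.2)
```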

Contributor
