Link to Kaggle competition
Contributors: Edouard Lacourt, Maxime Raillat & Antoine Royer
We entered this competition as a project for the Machine Learning course at our university. Our personal goal was to learn about the process of building a Machine Learning model and, of course, to win the competition.
In this document, we describe how we built our model and the mindset we adopted to learn more about Machine Learning.
The data is available on the Kaggle competition page, but we decided to try the Kaggle API to get the data directly from the command line.
!kaggle competitions download -c epf-montpellier-2022
!unzip epf-montpellier-2022.zip
!rm epf-montpellier-2022.zip
Once the data is downloaded, we load the train and test files into pandas dataframes.
import pandas as pd

data_train = pd.read_csv("train.csv")
data_train.head()
In this section, we try to understand the dataset and do the data wrangling. There are many variables, which are explained on the competition data page.
Data fields
- auctionId - a random anonymous id unique to a given auction
- timeStamp - the timestamp (in seconds) of the start of the auction
- placementId - the id of a zone in the web page where the ad would be displayed
- websiteId - the id of the corresponding website
- refererDeepThree - the URL of the page where the ad would be displayed, truncated at its 3rd level and anonymized
- country - the id of the user's country
- operatingSystem - the user's operating system
- browser - the user's browser
- browserVersion - the user's browser version
- device - the user's device type
- environmentType - the type of the webpage environment
- integrationType - the id of the webpage integration type
- articleSafenessCategorization - the categorization of the content of the article
Target variable
- isSold - a boolean variable defining whether the auction led to a sale or not
We can see that the data is balanced. Also, the target is either `True` or `False`. This means that later we will use a classification model.
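A quick check of the target distribution, for example with `value_counts` (a minimal sketch assuming the `data_train` dataframe loaded above), is enough to confirm this:

# Check the class balance of the target (proportions of True vs False)
data_train["isSold"].value_counts(normalize=True)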
We prepared the data by removing the columns that we don't need (i.e. auctionId). Then we took care of the columns that needed extra processing, such as timeStamp, and checked how others had treated this value.
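A minimal sketch of this cleaning step could look like the following. The derived hour and day-of-week features are only an illustration of one common way to treat a Unix timestamp, not necessarily the exact transformation we submitted:

def basicCleaning(df):
    """Drop unused columns and derive simple time features from timeStamp."""
    cleanDf = df.drop(columns=["auctionId"])
    # timeStamp is a Unix timestamp in seconds; derive coarser features from it
    timestamps = pd.to_datetime(cleanDf["timeStamp"], unit="s")
    cleanDf["hour"] = timestamps.dt.hour
    cleanDf["dayOfWeek"] = timestamps.dt.dayofweek
    return cleanDf.drop(columns=["timeStamp"])

data_train_clean = basicCleaning(data_train)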
After implementing a baseline model (we will talk about modeling later), we had to do feature engineering, so we created new features by concatenating different columns.
We made sure that these steps were written as functions, to make them easier to understand and easier to reuse for feature engineering with different models. We also wrote docstrings for anyone who wants to use our model or imitate a production-style workflow.
def concatFeaturesToTarget(df, featuresToConcat, target):
    """Concatenate features to the target column.

    Args:
        df (pandas DataFrame): dataframe containing the features and the target column
        featuresToConcat (list): list of features to concatenate
        target (str): name of the target column

    Returns:
        concatDf (pandas DataFrame): dataframe with the features concatenated to the target column
    """
    concatDf = pd.DataFrame()
    for feature in featuresToConcat:
        # cast both columns to string before joining them with a comma
        concatDf[feature + "X" + target] = df[target].astype(str).str.cat(df[feature].astype(str), sep=",")
    return concatDf
After that, we applied the different functions to our dataset.
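For example, the crossing function above can be applied like this (the column choices here are purely illustrative, not necessarily the combinations we kept):

# Sketch: cross a few columns and append the result to the cleaned dataset
crossedFeatures = concatFeaturesToTarget(
    data_train_clean,
    featuresToConcat=["browser", "device"],
    target="operatingSystem",
)
data_train_clean = pd.concat([data_train_clean, crossedFeatures], axis=1)

Note that the crossed columns are strings, so they still need to be encoded (e.g. label-encoded) before being fed to a model.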
A lot of our classmates were using Random Forest, and we had already used Random Forest a lot in practical work at school, so we wanted to explore new models. Our Machine Learning teacher advised us to look at XGBoost and LightGBM because they are famous for winning Kaggle competitions.
First we split our data into train and test.
from sklearn.model_selection import train_test_split

X = data_train_clean.drop(columns=["isSold"]).to_numpy()
y = data_train_clean["isSold"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
NOT FINISHED YET... Update soon on GitHub.
Like XGBoost, we didn't know much about this model, and when you don't know a function or a model, it is always good to search for resources: a post on Towards Data Science and the official documentation.
Reading these resources helped us a lot to understand how to use this model and what the different parameters do. Here are the common tips about parameters:
Hyperparameters that control the tree structure:
- In LGBM, the most important parameter to control the tree structure is `num_leaves`. As the name suggests, it controls the number of decision leaves in a single tree.
- The higher `max_depth`, the more levels the tree has, which makes it more complex and prone to overfit. Too low, and you will underfit. A good range is between 3 and 12.
- A good rule of thumb is that the maximum limit for `num_leaves` should be `2^(max_depth)`.
- `min_data_in_leaf`: this is a very important parameter to prevent over-fitting in a leaf-wise tree. Its optimal value depends on the number of training samples and `num_leaves`. Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.
Hyperparameters for better accuracy:
- `n_estimators`: controls the number of decision trees.
- `learning_rate`: the step size parameter of the gradient descent.
- The ideal setup for these two parameters (`n_estimators` and `learning_rate`) is to use many trees with early stopping and set a low value for `learning_rate`.
- Use a large `max_bin` (may be slower).
Hyperparameters to control overfitting:
- Use a small `max_bin`.
- Use a small `num_leaves`.
- Use `min_data_in_leaf` and `min_sum_hessian_in_leaf`.
- Use bagging by setting `bagging_fraction` and `bagging_freq`.
- Use feature sub-sampling by setting `feature_fraction`.
- Use bigger training data.
- Try `lambda_l1`, `lambda_l2` and `min_gain_to_split` for regularization.
- Try `max_depth` to avoid growing a deep tree.
- Try `extra_trees`.
- Try increasing `path_smooth`.
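As a concrete illustration of several of these overfitting-control knobs, here is a sketch of a more heavily regularized configuration than the one we ended up submitting. It uses the scikit-learn style aliases exposed by `LGBMClassifier`, which map to the native parameter names listed above:

import lightgbm as lgb

# Sketch only: a regularized configuration, not the one we submitted
clf_regularized = lgb.LGBMClassifier(
    num_leaves=31,            # small num_leaves
    max_depth=6,              # cap the tree depth
    min_child_samples=200,    # alias of min_data_in_leaf
    subsample=0.8,            # alias of bagging_fraction
    subsample_freq=1,         # alias of bagging_freq
    colsample_bytree=0.8,     # alias of feature_fraction
    reg_alpha=0.1,            # alias of lambda_l1
    reg_lambda=0.1,           # alias of lambda_l2
    min_split_gain=0.0,       # alias of min_gain_to_split
)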
We tried different sets of hyperparameters to submit the best model. We didn't worry much about computing time, since LightGBM was the fastest model we tried, about 1.5 times faster than XGBoost. We wanted good accuracy without overfitting too much. This is what our model's parameters looked like:
import lightgbm as lgb

clf_lgb = lgb.LGBMClassifier(
    max_bin=500,
    max_depth=8,
    num_leaves=100,
    min_data_in_leaf=500,
    learning_rate=0.08,
    n_estimators=500,
)
To find the best parameters, we first tried different combinations of hyperparameters by hand, which led us to the parameters above. We will talk about other methods later.
The evaluation metric for this competition is the Mean F1-Score. The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision `p` and recall `r`: F1 = 2pr / (p + r).
The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other. For example, p = 0.8 and r = 0.6 gives F1 ≈ 0.69, while p = 1.0 and r = 0.4 only gives F1 ≈ 0.57.
After setting our model's parameters, we fit the model, predict on the test split, and compute the F1 score. We did that for all the different models.
from sklearn.metrics import f1_score

clf_lgb.fit(X_train, y_train)
y_pred_lgb = clf_lgb.predict(X_test)
f1_score(y_test, y_pred_lgb)
Then we compare it to the score on the training sample to check whether the difference between the two F1 scores is significant. To choose our model, we looked for the best test F1 score combined with the smallest gap between the training and test F1 scores, to avoid overfitting.
y_training_pred_lgb = clf_lgb.predict(X_train)
f1_score(y_train,y_training_pred_lgb)
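Before submitting, the selected model has to score the competition test set and its predictions have to be written to a CSV file in the format Kaggle expects. The snippet below is a sketch of that step: the test file name and the submission column names are assumptions based on the data fields above, and `basicCleaning` stands in for whatever preprocessing and feature engineering was applied to the training set.

# Sketch: build the submission file (file and column names are assumptions)
data_test = pd.read_csv("test.csv")

# The test set must go through the same preprocessing as the training set
data_test_clean = basicCleaning(data_test)

submission = pd.DataFrame({
    "auctionId": data_test["auctionId"],
    "isSold": clf_lgb.predict(data_test_clean.to_numpy()),
})
submission.to_csv("testSubmissionFile_bestVersion.csv", index=False)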
We submit our model to the Kaggle competition using the Kaggle API. Our final score on Kaggle was `0.76377`.
!kaggle competitions submit -c epf-montpellier-2022 -f testSubmissionFile_bestVersion.csv -m "Trying to get back to top 1 !!"
This section describes all the things we tried that unfortunately didn't work. However, testing other techniques and methods taught us a lot.
First, we wanted to do the competition using `pyspark`, but the deadline was too close, so we decided to focus more on the process of implementing a Machine Learning model. We still wanted to try parallel computing, so we took a look at `Dask`.
What `Dask` does is split the data into smaller pieces and send these pieces to different workers. The workers do the work and send the results back to the master, which merges them and returns the final result to the user. It was supposed to accelerate the training of a model, but it didn't work as well as we wished, so we just used the native versions of our models, as they offer multi-threading too.
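For reference, a minimal Dask setup looks roughly like this. It is only a sketch of the kind of experiment we ran, not a configuration we kept:

import dask.dataframe as dd
from dask.distributed import Client

# Start a local cluster: the scheduler ("master") splits work across worker processes
client = Client(n_workers=4)

# Dask reads the CSV lazily in partitions and distributes them to the workers
ddf = dd.read_csv("train.csv")

# Nothing runs until .compute() gathers the partial results from the workers
print(ddf["isSold"].value_counts().compute())

client.close()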
Instead of testing parameters by hand, we also tried `GridSearchCV`, so we read about it to find the best parameters. `GridSearchCV` is a class that finds the best parameters for a model by trying every combination in the ranges of parameters you provide. It also performs cross-validation and returns the best model. However, it takes a lot of time to compute.
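A sketch of how this can be set up with our LightGBM classifier follows; the parameter grid shown here is illustrative, not the exact one we searched:

from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

# Illustrative grid: every combination of these values is tried with 3-fold CV
param_grid = {
    "num_leaves": [50, 100, 150],
    "max_depth": [6, 8, 10],
    "learning_rate": [0.05, 0.08, 0.1],
}

grid_search = GridSearchCV(
    estimator=lgb.LGBMClassifier(n_estimators=500),
    param_grid=param_grid,
    scoring="f1",   # the competition metric
    cv=3,
    n_jobs=-1,      # use all available cores
)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)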
We finished second in the competition. We were really disappointed because our primary goal was to finish first. On the other hand, this competition was a great opportunity to learn more about Machine Learning and to try different techniques.