YouTube Trend Prediction
YouTube is the first platform most people think of for online video. About 300 hours of video are uploaded to it every minute, covering lectures, entertainment, science, and millions of other topics, and it currently has about 1.3 billion users. At this scale, YouTube is effectively a huge data warehouse. Machine learning models can learn from this data to make useful predictions, which in turn can help make videos more popular.
This project is a simple example of that idea.
It builds a machine learning regression model that predicts the target column of a trending video, defined as the video's like count divided by its total review count.
The trained model can therefore estimate how much potential a trending video has to keep its position among the trending videos.
The dataset contains features of trending videos such as channel, category, video tags, total_like, and total_review, together with datetime features such as publish_date and trend_date.
The output column is target, computed as total_like / total_review for each trending video.
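As a minimal sketch of the target definition, the ratio can be computed column-wise with pandas. The column names total_like and total_review come from the text; the values below are made up for illustration only.

```python
import pandas as pd

# Toy rows; the numbers are invented and are not from the project's dataset.
videos = pd.DataFrame({
    "total_like": [50, 120, 0],
    "total_review": [1000, 1500, 400],
})

# target = total_like / total_review, as defined in the text
videos["target"] = videos["total_like"] / videos["total_review"]
print(videos)
```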
Initial features of the dataset
MAIN STEPS OF THE PROJECT:
1. Preprocessing:
- Missing values are identified. A few NaN values occur in the description and duration_second columns, about 2% of the dataset, so those rows are removed.
- Some columns are converted to appropriate data types (for example, object to datetime or object to float).
- The result column is target, and computing it requires the like count. Videos with rating_disabled=True have no usable like data and cannot be used to train the model, so they are removed.
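The preprocessing steps above could look roughly like the following sketch. The column names are taken from the text; the function name preprocess and the toy rows are hypothetical, and the actual project may order or implement these steps differently.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unusable rows and fix dtypes (illustrative sketch only)."""
    # Remove rows with missing description or duration_second
    df = df.dropna(subset=["description", "duration_second"])
    # Remove videos whose ratings are disabled: no like count, so no target
    df = df[~df["rating_disabled"].astype(bool)].copy()
    # Convert object columns to datetime / float
    df["publish_date"] = pd.to_datetime(df["publish_date"])
    df["trend_date"] = pd.to_datetime(df["trend_date"])
    df["duration_second"] = df["duration_second"].astype(float)
    return df.reset_index(drop=True)

# Toy data, invented for illustration
raw = pd.DataFrame({
    "description": ["a", None, "c", "d"],
    "duration_second": ["120", "300", None, "45"],
    "rating_disabled": [False, False, False, True],
    "publish_date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
    "trend_date": ["2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"],
})
clean = preprocess(raw)
print(clean.dtypes)
```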
2. Encoding:
- All columns with a categorical or Boolean structure (category_id, comments_disabled, has_tumbnail) are converted to a narrow-range numeric type.
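The text does not say which encoding was used, so the sketch below is only one plausible reading: Booleans become 0/1 and category_id becomes integer category codes, both stored as int8. The column names are spelled as in the text; the toy values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "category_id": ["10", "24", "10", "1"],
    "comments_disabled": [False, True, False, False],
    "has_tumbnail": [True, True, False, True],
})

# Boolean columns -> 0/1 stored in a narrow integer type
for col in ["comments_disabled", "has_tumbnail"]:
    df[col] = df[col].astype("int8")

# Categorical column -> integer codes (assumption: plain label encoding)
df["category_id"] = df["category_id"].astype("category").cat.codes.astype("int8")
print(df.dtypes)
```

Label codes impose an arbitrary order on the categories, which tree-based models such as LightGBM usually tolerate; a linear model would more likely need one-hot encoding.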
3. Feature Engineering:
- New features are derived from the existing ones through different calculations and combinations.
- Some of the new features:
  - Time features (Publish_day, Publish_hour, Curent_day): capture the time-related information in the datetime columns.
  - total_trend_day: the total number of days the video has been trending.
  - Channel_average_targets: the mean target of all videos that belong to the same channel.
  - Category_average_targets: the mean target of all videos that belong to the same category.
  - Popular_tag_count: the number of the video's tags that count as popular, where a tag is popular if it is among the 50 most frequent tags in the whole dataset.
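The features listed above can be sketched with pandas group-wise transforms. Several details here are assumptions not stated in the text: the video_id column, the "|" tag separator, reading Curent_day as the weekday of trend_date, and the helper name popular_tag_count. The toy rows are invented.

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "video_id": ["a", "a", "b", "c"],           # hypothetical identifier column
    "channel": ["ch1", "ch1", "ch1", "ch2"],
    "category_id": [10, 10, 24, 10],
    "publish_date": pd.to_datetime(
        ["2021-01-01 08:00", "2021-01-01 08:00", "2021-01-02 20:30", "2021-01-03 12:15"]),
    "trend_date": pd.to_datetime(["2021-01-03", "2021-01-04", "2021-01-04", "2021-01-05"]),
    "video_tags": ["music|pop", "music|pop", "news", "music|live"],  # "|" separator assumed
    "target": [0.05, 0.06, 0.02, 0.04],
})

# Time features
df["Publish_day"] = df["publish_date"].dt.dayofweek
df["Publish_hour"] = df["publish_date"].dt.hour
df["Curent_day"] = df["trend_date"].dt.dayofweek  # assumption about its meaning

# Number of distinct trending days per video
df["total_trend_day"] = df.groupby("video_id")["trend_date"].transform("nunique")

# Mean target per channel and per category
df["Channel_average_targets"] = df.groupby("channel")["target"].transform("mean")
df["Category_average_targets"] = df.groupby("category_id")["target"].transform("mean")

def popular_tag_count(tags: pd.Series, top_n: int = 50) -> pd.Series:
    """Count how many of each video's tags are among the top_n most frequent tags."""
    tag_lists = tags.apply(lambda s: s.split("|"))
    counts = Counter(tag for tag_list in tag_lists for tag in tag_list)
    popular = {tag for tag, _ in counts.most_common(top_n)}
    return tag_lists.apply(lambda tag_list: sum(tag in popular for tag in tag_list))

# The text uses the top 50 tags; top_n=2 keeps the toy example meaningful
df["Popular_tag_count"] = popular_tag_count(df["video_tags"], top_n=2)
print(df)
```

Because the channel and category averages are built from the target itself, computing them on the training rows only and then mapping them onto the test rows helps avoid leaking test targets into the features.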
4. Feature Selection:
- After feature engineering, the resulting columns are the candidate features for training a model.
- A trial LightGBM model was trained on all the data. The feature importances of this model were then ranked, and the features that were unimportant for the prediction were identified and removed.
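A rough sketch of importance-based selection follows. It uses LightGBM's scikit-learn wrapper (LGBMRegressor and its feature_importances_ attribute); the helper names, the n_estimators value, and the min_importance cutoff are hypothetical, since the text does not state how "unimportant" was decided.

```python
import pandas as pd

def fit_trial_model(X: pd.DataFrame, y: pd.Series):
    """Fit a trial LightGBM regressor on all candidate features."""
    from lightgbm import LGBMRegressor  # imported lazily; requires `lightgbm`
    return LGBMRegressor(n_estimators=200, random_state=0).fit(X, y)

def rank_features(model, feature_names) -> pd.Series:
    """Return feature importances sorted from most to least important."""
    importance = pd.Series(model.feature_importances_, index=list(feature_names))
    return importance.sort_values(ascending=False)

def drop_unimportant(X: pd.DataFrame, importance: pd.Series, min_importance=1) -> pd.DataFrame:
    """Keep only columns whose importance reaches the (hypothetical) cutoff."""
    keep = set(importance[importance >= min_importance].index)
    return X[[col for col in X.columns if col in keep]]

# Usage, assuming X (features) and y (target) already exist:
#   model = fit_trial_model(X, y)
#   importance = rank_features(model, X.columns)
#   X_selected = drop_unimportant(X, importance)
```

By default LightGBM reports split-count importance, so a cutoff of 1 drops only features that the trees never split on; a stricter cutoff is a judgment call.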
5. Modeling:
- The data is separated into training and test sets.
- To choose the best algorithm, I used an AutoML framework. It takes parameters such as the metric (MAE for this project), the estimator list (I chose lgbm, xgboost, and catboost, which generally give high accuracy or low loss), and the task (regression for this project). After many iterations of trying and comparing hyperparameter combinations, the AutoML function returns the best model together with its best hyperparameters.
- After this process, the best estimator was LightGBM.
Training is finished, and on evaluation the trained model has an MAE of 0.004.
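The modeling steps can be sketched as below. The text does not name the AutoML framework; the parameters it lists (metric, estimator list, task) and the estimator names lgbm, xgboost, and catboost match FLAML's AutoML.fit, so FLAML is assumed here. The helper name search_best_model, the time_budget value, the 80/20 split, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def search_best_model(X_train, y_train, time_budget=600):
    """Run an AutoML search over the three estimators named in the text."""
    from flaml import AutoML  # assumption: FLAML; needs flaml, xgboost, catboost
    automl = AutoML()
    automl.fit(
        X_train=X_train,
        y_train=y_train,
        task="regression",
        metric="mae",
        estimator_list=["lgbm", "xgboost", "catboost"],
        time_budget=time_budget,  # seconds; hypothetical value
    )
    return automl

# Synthetic stand-in for the engineered features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.uniform(0.0, 0.2, size=200)

# Hold out a test set (the split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Usage, once the optional dependencies are installed:
#   automl = search_best_model(X_train, y_train)
#   print(automl.best_estimator, automl.best_config)
#   print(mean_absolute_error(y_test, automl.predict(X_test)))
```

MAE is in the same units as the target, so it is easiest to interpret next to the typical range of total_like / total_review in the data.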