Order Return Prediction

Today, most companies use technology-supported techniques and analyses to increase their profits. Among the most popular and effective of these are artificial intelligence algorithms.

This project trains a machine learning model that predicts whether any of the products a user bought in the same session will later be returned, using sample data from a large-scale e-commerce company. The same approach can be applied to predict product returns for your own e-commerce business. Such predictions can help you design your returns process around customer behavior, which can ease a range of operational problems.

The sample data shared by the company falls under 5 main headings:

- product features (product_id, product_content_id, product_variant_id, product_name, brand_id, brand_name, gender_id etc.)

- product reviews (product_content_id, rate, comment, review_like_count, supplier_id)

- user information (user_id, birth_date, membership_date, gender)

- transaction data (order_date, user_id, is_elite_user, supplier_id, order_line_item_id, order_parent_id, product_content_id, product_variant_id, original_price etc.)

- supplier information (supplier_id, return_rate)

Figure: product ID hierarchy diagram

MAIN STEPS OF THE PROJECT:

1. Cleaning Data:

Missing values are filled with the mean, the median, or another imputation technique.
Redundant columns are removed.
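
As a rough illustration of this step, here is a minimal pandas sketch. The sample values, the `constant_col` column, and the `clean` helper are all made up for the example; the actual project may have used different imputation rules per column.

```python
import pandas as pd

# Hypothetical sample: column names follow the tables above, values are invented.
orders = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "original_price": [100.0, None, 300.0, 200.0],
    "gender": ["F", None, "M", "F"],
    "constant_col": [0, 0, 0, 0],  # hypothetical redundant column
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values and drop columns that carry no information."""
    out = df.copy()
    # Numeric gaps: fill with the column median (the mean is the other common choice).
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # Non-numeric gaps: fill with an explicit "unknown" label.
    cat_cols = out.select_dtypes(exclude="number").columns
    out[cat_cols] = out[cat_cols].fillna("unknown")
    # Redundant columns: here, any column that holds a single value.
    constant = [c for c in out.columns if out[c].nunique() <= 1]
    return out.drop(columns=constant)

cleaned = clean(orders)
```

The median is usually the safer default for skewed columns such as prices, since a few very expensive products do not pull it upward.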

2. Encoding:

All categorical columns are converted to a compact (narrow-range) numeric type.
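
One possible way to do this is with pandas category codes, sketched below. The sample values and the `encode_categoricals` helper are assumptions for illustration; the original encoding method is not stated.

```python
import pandas as pd

# Hypothetical sample values for two categorical columns.
df = pd.DataFrame({
    "brand_name": ["acme", "zeta", "acme", "nova"],
    "gender": ["F", "M", "F", "F"],
})

def encode_categoricals(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Replace each categorical column with small integer codes."""
    out = df.copy()
    for col in columns:
        # cat.codes numbers the sorted categories 0..n-1 and uses the
        # smallest signed integer dtype that fits (int8 for few categories).
        out[col] = out[col].astype("category").cat.codes
    return out

encoded = encode_categoricals(df, ["brand_name", "gender"])
```

Keeping the codes in a narrow integer type reduces memory use, which matters when the transaction table is large.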

3. Preprocessing:

Text columns are prepared for the embedding step (stopwords are removed, all words are lowercased, punctuation is removed, etc.).
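
The three operations named above could look roughly like this, using only the standard library. The stopword list is a tiny placeholder I made up; a real pipeline would use a full list for the language of the data.

```python
import string

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"the", "a", "an", "and", "for", "is", "with"}

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def preprocess(text: str) -> list:
    """Lowercase, strip punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The Slim-Fit Jeans, for Men!")
```

Here `tokens` comes out as `["slim", "fit", "jeans", "men"]`: the hyphen and comma become spaces, and "the" and "for" are dropped.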

4. Embedding:

The preprocessed text columns are converted to numeric vectors so that the machine learning algorithm can process them. This is done with word2vec, one of the embedding techniques of Natural Language Processing. (Example columns: product_name, brand_name, product reviews, etc.)

5. Feature Engineering:

New features are derived from the existing ones through different calculations and combinations (user return rate, product return rate, user age, whether the same product appears in different sizes in the same session, etc.).
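
The features listed above might be computed along these lines. The sample rows, the `is_returned` label name, and the `add_features` helper are hypothetical. I am also assuming that `order_parent_id` identifies a session and that different `product_variant_id` values of one `product_content_id` correspond to different sizes.

```python
import pandas as pd

# Hypothetical order lines; "is_returned" is an assumed name for the label.
lines = pd.DataFrame({
    "order_parent_id": [10, 10, 11, 12],
    "user_id": [1, 1, 1, 2],
    "product_content_id": [500, 500, 501, 500],
    "product_variant_id": [9001, 9002, 9003, 9001],
    "order_date": pd.to_datetime(["2021-06-01"] * 4),
    "is_returned": [1, 0, 0, 1],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "birth_date": pd.to_datetime(["1990-05-20", "2000-01-15"]),
})

def add_features(lines: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    out = lines.merge(users, on="user_id", how="left")
    # Share of returned lines per user and per product.
    out["user_return_rate"] = out.groupby("user_id")["is_returned"].transform("mean")
    out["product_return_rate"] = out.groupby("product_content_id")["is_returned"].transform("mean")
    # Age in whole years at the time of the order.
    age_days = (out["order_date"] - out["birth_date"]).dt.days
    out["user_age"] = (age_days / 365.25).astype(int)
    # 1 if the same product was ordered in more than one variant in one session.
    variants = out.groupby(["order_parent_id", "product_content_id"])["product_variant_id"].transform("nunique")
    out["multi_size_in_session"] = (variants > 1).astype(int)
    return out

features = add_features(lines, users)
```

Note that in this sketch the return rates are computed from the same rows that carry the label. In a real pipeline they should be computed only from orders placed before the one being predicted, otherwise the label leaks into the features.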

6. Filtering:

Columns that are not used in model training are filtered out, leaving only the data the model needs.

7. Feature Selection:

A trial model was trained on all the data using LightGBM. Using this model, the features were ranked by importance, and those that were unimportant for classification were identified and removed.

8. Modeling:

The model is trained on the data with the selected features. (Algorithm: LightGBM)

9. Evaluation:

Held-out test data, set aside in advance, was used to measure the model's performance. Accuracy and F1 score were selected as the performance metrics.
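
To make the two metrics concrete, here is a small scikit-learn example. The labels and predictions are invented; they are not results from the project.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented test labels and predictions (1 = returned, 0 = not returned).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# F1: harmonic mean of precision and recall for the positive class.
f1 = f1_score(y_true, y_pred)
```

In this toy case there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so accuracy is 6/8 = 0.75, precision and recall are both 3/4, and the F1 score is also 0.75. On real data the two metrics usually differ, and F1 is the more informative one when returned orders are much rarer than kept ones.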

Let's check manually whether the model works as expected.

I chose an example product that has an original price of 1800 dollars and no discount.

It has also had a high return rate among other users in the past. The model picks up on these details and predicts 1 (likely to be returned).
