Used Car Price Prediction — with Machine Learning

Reinhard Jonathan Silalahi
8 min read · Aug 22, 2024


https://www.kget.com/business/press-releases/ein-presswire/715217627/indonesia-used-car-market-growth-grew-by-idr-440-trillion-revenue-during-2019-2023/

With more cars being bought and sold than ever before, predicting used car prices has become a hot topic. In developing countries, where used cars offer a more affordable option, many people are choosing them over new ones. This shift has sparked a growing interest in understanding how used car prices will change. In this Medium post, we’ll explore why predicting these prices matters so much and what’s driving the trend.

Dataset

For this project, we used a dataset sourced from raw, scraped data from OLX and Carmudi, which contains around 800,000 samples. It’s important to note that the dataset is entirely unprocessed and has not been cleaned prior to our analysis.

Objectives

A primary objective of this project is to estimate used car prices using attributes that are highly correlated with the price. This will assist sellers in determining the best price for their cars. By considering a variety of relevant features — such as car brand, model, type, age, mileage, transmission type, and more — we aim to build a reliable predictive model.

Scope

  1. Create a used car price recommendation system based on specific attributes.
  2. Develop a system capable of assessing the similarity between newly scraped data and the existing data within the system. (Coming soon on Part 2)

Limitations

It’s important to acknowledge some limitations of this project. The dataset we used is limited to the Indonesian region, which may affect the generalizability of our findings to other areas with different market dynamics. Additionally, the features available in the dataset are relatively sparse, missing key attributes such as car type, fuel type, and other potentially influential factors. These limitations could impact the accuracy and comprehensiveness of our price predictions.

Project Phases

  1. Data Exploration and Preprocessing
  2. Feature Engineering
  3. Model Training
  4. Model Evaluation
  5. Predictive Insights
  6. Hyperparameter Tuning (Coming soon on Part 2)
  7. Build ML Ops (Coming soon on Part 2)

Data Exploration and Preprocessing

Objective: To prepare and clean the dataset for machine learning by exploring the data and handling missing values, outliers, and other issues.

Load data

# Required imports; the connection parameters (username, password, host, port,
# database) are assumed to be defined elsewhere, e.g. loaded from environment variables
import pandas as pd
from sqlalchemy import create_engine

# Create the database connection
engine = create_engine(f'postgresql://{username}:{password}@{host}:{port}/{database}')

# Query to get the data
query = '''
select
scrapper.id,
brand.name as brand,
model.name as model,
type.name as type,
color.name as color,
year,
mileage,
transmission,
engine_capacity,
condition,
price,
province,
region,
scrapper.data_source
from scrapper
join brand on scrapper.brand_id = brand.id
join model on scrapper.model_id = model.id
join type on scrapper.type_id = type.id
join color on scrapper.color_id = color.id
'''

# Load data into a DataFrame
df = pd.read_sql(query, engine)

# Close the connection (if necessary)
engine.dispose()

First, we need to load the data so that we can carry out further exploration and preprocessing and turn the raw scraped records into a valuable asset.

unique_values = df['data_source'].unique()
print(f"Data sources: {unique_values}")

# Output | Data sources: ['OLX' 'CARMUDI']

Remove Duplicates
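
The original post shows this step as an image. Below is a minimal sketch of how duplicate removal could look in pandas, assuming that exact duplicate rows (ignoring the scraper id, which differs between scrapes of the same listing) should be dropped:

# Count and drop exact duplicate listings; excluding the id column is an assumption
dup_mask = df.drop(columns=['id']).duplicated()
print(f"Duplicate rows: {dup_mask.sum()}")

df = df[~dup_mask].reset_index(drop=True)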

Handling Missing Values

In the image above, we see that there are 115 null values in the province column. We will address this issue after splitting the data. Since we need the data to build a machine learning model, it’s important to split the data first and then handle missing values and outliers. However, if the goal is only data exploration, splitting the data is not necessary.
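
The null counts referenced above come from an image in the original post; a quick way to reproduce that check is:

# Inspect missing values per column
missing_counts = df.isna().sum()
missing_pct = (missing_counts / len(df)) * 100

print(pd.DataFrame({'missing': missing_counts, 'percent': missing_pct.round(2)}))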

Handling Missing Values by Removing Columns

We remove a column entirely when its percentage of missing values exceeds our threshold of 20%; a sketch of this rule is shown below.
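
A minimal sketch of this rule, assuming it is applied to the raw DataFrame before splitting (the 20% threshold comes from the text above):

# Drop columns whose missing-value ratio exceeds the 20% threshold
MISSING_THRESHOLD = 0.20

missing_ratio = df.isna().mean()
cols_to_drop = missing_ratio[missing_ratio > MISSING_THRESHOLD].index.tolist()

print(f"Columns dropped: {cols_to_drop}")
df = df.drop(columns=cols_to_drop)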

Fix Wrong Condition Values Based on Mileage
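
The original post shows this step as an image and does not spell out the exact rule, so the sketch below is an assumption: a listing labeled as new but reporting substantial mileage is relabeled as used. The label names and the mileage threshold are illustrative, not from the original post.

# Assumption: listings labeled 'new' with significant mileage are actually used
MILEAGE_LIMIT_FOR_NEW = 100  # km; illustrative threshold

wrong_condition = (df['condition'].str.lower() == 'new') & (df['mileage'] > MILEAGE_LIMIT_FOR_NEW)
print(f"Listings with inconsistent condition: {wrong_condition.sum()}")

df.loc[wrong_condition, 'condition'] = 'used'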

Split Train & Test

from sklearn.model_selection import train_test_split
train_car, test_car = train_test_split(df, test_size = 0.2, random_state=42)

Splitting the data into training and testing sets before performing any preprocessing is crucial to avoid data leakage and ensure realistic model evaluation. This approach keeps the test set independent, avoiding skewed preprocessing that could lead to overoptimistic performance estimates. By preprocessing the training set separately and applying the same methods to the test set, we maintain the integrity of our model’s evaluation and better reflect real-world scenarios.

Handle the previous missing value on province

train_mode_province = train_car[~train_car['province'].isna()]['province'].mode()[0]
train_mode_province  # the mode, i.e., the most frequent province in the training data
# 'Jakarta D.K.I.'

For numerical features we would impute with the median; for categorical features we use the mode. Since province is categorical, we compute its mode on the training data and then use that value to fill the missing entries in both the training and test sets.

# Fill missing province values with the mode computed from the training data
train_car['province'] = train_car['province'].fillna(train_mode_province)

test_car['province'] = test_car['province'].fillna(train_mode_province)

Handling Outliers

Sample of the outlier data

From the image above, we observed that the price for Daihatsu is abnormal; the actual price is not Rp 85 billion. This issue likely arises from data input by unverified users rather than reliable sources. To address this, we need to group and match the data by brand, model, and type. Additionally, we should create separate plots for each brand, model, and type to detect outliers more effectively. I will tackle this in the next part of the project.

# Function to identify and handle outliers based on the IQR
def handle_outliers_iqr(group):
    Q1 = group['price'].quantile(0.25)
    Q3 = group['price'].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Clip outliers to the nearest bound
    group['price'] = group['price'].clip(lower=lower_bound, upper=upper_bound)

    return group

# Apply the outlier handling function to each (brand, model, type) group
train_car = train_car.groupby(['brand', 'model', 'type']).apply(handle_outliers_iqr).reset_index(drop=True)

print("DataFrame with outliers handled:")
print(train_car)

If we check the plot, we may still see outliers because the data is aggregated into a single plot. In the future, we should separate the data by brand, model, and type, as shown in the image below.
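
A minimal sketch of how such per-group plots could be produced, assuming we only visualize a handful of the most frequent (brand, model, type) combinations:

import matplotlib.pyplot as plt

# Plot price distributions for the six most frequent (brand, model, type) groups
top_groups = train_car.groupby(['brand', 'model', 'type']).size().nlargest(6).index

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, (brand, model, car_type) in zip(axes.flat, top_groups):
    subset = train_car[(train_car['brand'] == brand) &
                       (train_car['model'] == model) &
                       (train_car['type'] == car_type)]
    ax.boxplot(subset['price'])
    ax.set_title(f"{brand} {model} {car_type}", fontsize=9)
    ax.set_ylabel('price')

plt.tight_layout()
plt.show()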

Feature Engineering

Encoding

For encoding, I used a target encoder for each categorical column. Additionally, I added a new feature called “age,” calculated as the difference between the current year and the production year.

import datetime
from category_encoders import TargetEncoder, OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

# Calculate car age from the production year
train_car['age'] = datetime.datetime.now().year - train_car['year']
test_car['age'] = datetime.datetime.now().year - test_car['year']

# Separate features and target (price is what we want to predict); the original
# post does not show this step, so it is reconstructed here, assuming identifier
# columns such as id and data_source were dropped earlier in the pipeline
X_train = train_car.drop(columns=['price'])
y_train = train_car['price']
X_test = test_car.drop(columns=['price'])
y_test = test_car['price']


# One Hot Encoding
# --------------------------------
# Apply One-Hot Encoding for transmission and condition
ohe_cols = ['transmission', 'condition']

encoder = OneHotEncoder()
encoder.fit(X_train[ohe_cols])

one_hot_encoded_output_train = encoder.transform(X_train[ohe_cols])
one_hot_encoded_output_test = encoder.transform(X_test[ohe_cols])

X_train = pd.concat([X_train, one_hot_encoded_output_train], axis = 1)
X_test = pd.concat([X_test, one_hot_encoded_output_test], axis = 1)

X_train.drop(ohe_cols, axis = 1, inplace = True)
X_test.drop(ohe_cols, axis = 1, inplace = True)


# Target Encoding
# --------------------------------
encoder = TargetEncoder(cols = ['brand', 'model', 'type', 'color', 'province', 'region'])

# Fit and transform the encoder on the training data
X_train_encoded = encoder.fit_transform(X_train, y_train)

# Transform the test data using the fitted encoder
X_test_encoded = encoder.transform(X_test)

Feature Selection

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Build a training frame of encoded features plus the target, then compute correlations
train_data = pd.concat([X_train_encoded, y_train], axis=1)
correlation_matrix = train_data.corr(method='pearson')

# Plot heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Heatmap')
plt.show()

Based on the correlation heatmap, we concluded that no further feature selection was needed.

Model Training

Linear Regression: I started with Linear Regression to establish a baseline. This straightforward approach helps understand the basic relationship between the features and the target variable.

K-Nearest Neighbors Regressor: Next, I used the K-Nearest Neighbors Regressor. This model is useful for capturing more complex patterns in the data by considering the similarity between data points.

PLS Regression: I then applied PLS Regression (Partial Least Squares), which is effective in scenarios with high-dimensional data. It helps in reducing the dimensionality while maintaining the relationship between features and the target.

Deep Learning (Soon on part 2): For the next phase of the project, I plan to explore Deep Learning techniques. This advanced approach promises to uncover even deeper patterns in the data and will be covered in Part 2 of the project.
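
The post does not include the training code itself; below is a minimal sketch of how the three models described above could be fitted on the encoded features. The hyperparameters are scikit-learn defaults chosen for illustration, and feature scaling (suggested by the MinMaxScaler import in the encoding block) is omitted for brevity.

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.cross_decomposition import PLSRegression

# Baseline models described above; the hyperparameters are assumptions
models = {
    'Linear Regression': LinearRegression(),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
    'PLS Regression': PLSRegression(n_components=5),
}

for name, model in models.items():
    model.fit(X_train_encoded, y_train)
    print(f"{name} trained.")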

Model Evaluation

From the regression plots, the K-Nearest Neighbors model performed better than the other approaches, so I adopted it as a temporary baseline.
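
The regression plots referenced here come from images in the original post. As a complement, and continuing from the training sketch above, the models could also be compared numerically on the held-out test set; the metric choice below is an assumption, not from the original post.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Compare the fitted models on the test set
for name, model in models.items():
    y_pred = np.ravel(model.predict(X_test_encoded))
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{name}: RMSE={rmse:,.0f}  MAE={mae:,.0f}  R2={r2:.3f}")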

Predictive Insights

As noted above, the K-Nearest Neighbors model currently serves as our temporary baseline. While it offers a solid foundation for understanding the data, there is still room for improvement, and future efforts will focus on refining the model and exploring additional techniques to enhance accuracy.

We still need to perform further cleaning and feature engineering to enhance the data’s quality. Additionally, balancing the data samples and acquiring more comprehensive datasets are crucial steps in developing a more robust and accurate predictive model. By addressing these areas, we aim to achieve more reliable and insightful predictions in the future.

Github Repository

https://github.com/reinhardjs/capstone-dibimbing
