Analyzing Used Car Listings on eBay Kleinanzeigen


We will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. The version of the dataset we are working with is a sample of 50,000 data points.

The data dictionary provided with data is as follows:

The aim of this project is to clean the data and analyze the included used car listings.

Our dataset contains 20 columns, most of which are stored as strings. There are a few columns with null values, but no columns have more than ~20% null values. There are some columns that contain dates stored as strings.

We'll start by cleaning the column names to make the data easier to work with.

Clean Columns

We'll make a few changes here:

Initial Data Exploration and Cleaning

We'll start by exploring the data to find obvious areas where we can clean the data.

Our initial observations:

It looks like the num_photos column has 0 for every column. We'll drop this column, plus the other two we noted as mostly one value.

There are two columns, price and auto, which are numeric values with extra characters being stored as text. We'll clean and convert these.

Exploring Odometer and Price

We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field. Additionally, there are more high mileage than low mileage vehicles.

Again, the prices in this column seem rounded, however given there are 2357 unique values in the column, that may just be people's tendency to round prices on the site.

There are 1,421 cars listed with $0 price - given that this is only 2% of the of the cars, we might consider removing these rows. The maximum price is one hundred million dollars, which seems a lot, let's look at the highest prices further.

There are a number of listings with prices below \$30, including about 1,500 at \\$0. There are also a small number of listings with very high values, including 14 at around or over $1 million.

Given that eBay is an auction site, there could legitimately be items where the opening bid is \$1. We will keep the \\$1 items, but remove anything above \$350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers.

Exploring the date columns

There are a number of columns with date information:

These are a combination of dates that were crawled, and dates with meta-information from the crawler. The non-registration dates are stored as strings.

We'll explore each of these columns to learn more about the listings.

Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.

The crawler recorded the date it last saw any listing, which allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.

There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.

The year that the car was first registered will likely indicate the age of the car. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented and the maximum is 9999, many years into the future.

Dealing with Incorrect Registration Year Data

Because a car can't be first registered before the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

One option is to remove the listings with these values. Let's determine what percentage of our data has invalid values in this column:

Given that this is less than 4% of our data, we will remove these rows.

It appears that most of the vehicles were first registered in the past 20 years.

Exploring Price by Brand

German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.

There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

Of the top 5 brands, there is a distinct price gap:

Exploring Mileage

The range of car mileages does not vary as much as the prices do by brand, instead all falling within 10% for the top brands. There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.