When learning Python I got really interested in data scraping. It coincided with an urge to buy a flat in Szczecin, and since I was working a lot with data anyway, I wanted to at least analyze what the housing market in Szczecin looked like. So I got into web scraping and downloaded all the housing advertisements for Szczecin from otodom.pl. This turned out to be a really nice database. I also used it to create a price heatmap of Szczecin.
I automated the process and have been downloading the data every Saturday night, to keep the load on the site low. I ended up with a nice database of advertisements from Szczecin spanning nearly two years.
I no longer wish to buy a flat in Szczecin, but it is still interesting to show what is in the data.
Note: These are the results of gathering publicly available data from otodom.pl. I do not use this data for commercial purposes and have no claim to ownership of it. Below is just a simple analysis, and given that buying a flat is a serious investment, it should not be used in any decision making. I take no responsibility for these results. All of this work was done in my spare time and was not financed by anyone. I stopped gathering data as of July 2018.
import os, re, datetime, json
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (10,6)
ddir = r"/home/mateusz/projects/szczecin_analysis/data/"
files = os.listdir(ddir)
# Keep only the weekly data files and collect their date and line count
data = {
    'file': [],
    'date': [],
    'lines': []
}
for f in files:
    if not re.match(r"^szczecin_[0-9]+\.jl", f):
        continue
    data['file'].append(f)
    # the file name encodes the download date as szczecin_DDMMYY.jl
    date = datetime.datetime.strptime(f[9:15], "%d%m%y").date()
    data['date'].append(date)
    num_lines = sum(1 for line in open(os.path.join(ddir, f)))
    data['lines'].append(num_lines)
df = pd.DataFrame(data)
df = df.sort_values(by='date')
# df = df.set_index('date')
df.describe()
We see that there are observations with 0 lines, where for some reason the data was not downloaded. We next get rid of observations with fewer than 500 lines.
df = df[df['lines'] > 500]
df.describe()
We can next graph how the number of observations changed during the time of data gathering.
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
yearsFmt = mdates.DateFormatter('%Y')
fig, ax = plt.subplots()
ax.plot(df['date'], df['lines'])
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
ax.xaxis.set_minor_locator(months)
ax.set_ylabel('Number of observations')
ax.set_xlabel('Date')
ax.grid(color='g', linestyle='--', linewidth=1, alpha=.3)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax.spines["top"].set_visible(False)
#ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
#ax.spines["left"].set_visible(False)
We conclude that since the beginning of the experiment the number of offers on the site has been dropping substantially. The biggest drop was recorded at the beginning of 2018, when the number of offers fell from around 4000 to fewer than 2000.
The question remains whether this is because there are fewer offers on the market, or because the site is losing ground to some competitor. Unfortunately, given this dataset, we cannot answer that question.
Data analysis¶
We first need to load the data, which is stored as JSON Lines.
data = []
for idx in df.index:
    read_date = pd.read_json(os.path.join(ddir, df.loc[idx, 'file']), lines=True)
    read_date['date'] = df.loc[idx, 'date']
    data.append(read_date)
read_date = None
data = pd.concat(data, axis=0)
data.tail()
We see that we won't be needing some columns, such as _type, sellType or town. The details column includes the whole description of the offer. Let's see what's inside the data. We keep url for checking for duplicates.
data['date'] = pd.to_datetime(data['date'])
data = data.drop(columns=['_type', 'sellType', 'town'])
data.describe()
The average size of an offered apartment was 68.4 m$^2$ and the average price was around 300k PLN.
We see that the geographical coordinates, area and price are numeric. Apparently the floor variable has some non-numeric values.
data['floor'].value_counts()
We can see that the ground floor is described with the Polish word parter. There are also some apartments below ground level (suterena), but we will treat them as ground floor. We also map the attic (poddasze) to floor 5 and cap the floor number at 10, so that everything above the 10th floor ends up in a single top category.
data.loc[data['floor'] == 'parter', 'floor'] = '0'    # ground floor
data.loc[data['floor'] == 'suterena', 'floor'] = '0'  # basement level, treated as ground floor
data.loc[data['floor'] == 'poddasze', 'floor'] = '5'  # attic
data.loc[data['floor'] == '> 10', 'floor'] = '11'
data['floor'] = pd.to_numeric(data['floor'])
data.loc[data['floor'] > 10, 'floor'] = 10            # cap floors at 10
data['floor'].value_counts()
data['rooms'].value_counts()
data.loc[data['rooms'] == 'więcej niż 10', 'rooms'] = '10'  # 'więcej niż 10' means 'more than 10'
data['rooms'] = pd.to_numeric(data['rooms'])
data['rooms'].value_counts()
SQL database export¶
The floor and room numbers look fine now. Next we can visualize our data; for the visualizations we will create a sample without duplicates. Before we do that, let's write the data into a better format for further analysis. I want the option of loading just the apartment data without the descriptions, so I will go with an SQL database; a simple SQLite database will do.
We create a simple schema with just two tables with the following code:
CREATE TABLE adds (
id integer PRIMARY KEY AUTOINCREMENT,
description varchar,
details varchar,
location varchar,
url varchar,
price float,
pow float,
rooms integer,
floor integer,
data_lat float,
data_lon float
);
CREATE TABLE dates (
id integer PRIMARY KEY AUTOINCREMENT,
add_id integer,
date datetime
);
import sqlite3 as db
table_clear1 = '''DROP TABLE IF EXISTS adds; '''
table_clear2 = '''DROP TABLE IF EXISTS dates;'''
table_create1 = '''CREATE TABLE adds (
id integer PRIMARY KEY AUTOINCREMENT,
description varchar,
details varchar,
location varchar,
url varchar,
price float,
pow float,
rooms integer,
floor integer,
data_lat float,
data_lon float
);'''
table_create2 = '''
CREATE TABLE dates (
id integer PRIMARY KEY AUTOINCREMENT,
add_id integer,
date datetime
);
'''
scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
scn_cur = scn_db.cursor()
scn_cur.execute(table_clear1)
scn_cur.execute(table_clear2)
scn_cur.execute(table_create1)
scn_cur.execute(table_create2)
scn_db.commit()
Now we can populate our tables with observations from the DataFrame.
unique_cols = ['location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
unique_adds_ids = ~data.duplicated(unique_cols, keep='first')
unique_adds = data[unique_adds_ids].drop(columns=['date', 'poi_lat', 'poi_lon'])
unique_adds = unique_adds.reset_index(drop=True)
unique_adds['id'] = unique_adds.index
unique_adds.tail()
The description column is still coded as a list. We need to correct that.
unique_adds['description'] = unique_adds['description'].apply(lambda x: x[0])
We can now write the data to the SQLite database.
unique_adds.to_sql("adds", scn_db, if_exists='append', index=False)
print(scn_cur.execute('SELECT * FROM adds WHERE details != "" LIMIT 2;').fetchall())
It worked! Now we merge the unique DataFrame back with the full data DataFrame and keep just the ids from unique_adds together with the dates.
data_export = data.reset_index()
data_export['index'] = data_export.index
all_dates = data_export.merge(unique_adds, how='left', on=unique_cols)[['index', 'date', 'id']]
print(data_export.shape, unique_adds.shape, all_dates.shape)
all_dates.tail()
all_dates = all_dates.rename(columns={'index' : 'id', 'id' : 'add_id'})
all_dates = all_dates.sort_values('id')
all_dates.tail()
all_dates.to_sql("dates", scn_db, if_exists='append', index=False)
print(scn_cur.execute('SELECT * FROM dates LIMIT 5;').fetchall())
scn_db.close()
Success! All in all the export worked, and we now have a 92 MB database instead of 680 MB of JSON Lines files.
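If we later want to load just the apartment data without the descriptions, as intended, this layout makes that easy. A minimal sketch of such a query (the column selection and the adds_slim name are only illustrative):
import sqlite3 as db
import pandas as pd
# Reopen the database and pull only the numeric columns, joining every
# advertisement with the dates on which it was observed.
scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
query = '''
SELECT a.id, a.price, a.pow, a.rooms, a.floor, a.data_lat, a.data_lon, d.date
FROM adds a JOIN dates d ON d.add_id = a.id;
'''
adds_slim = pd.read_sql_query(query, scn_db, parse_dates=['date'])
scn_db.close()
adds_slim.head()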
Summaries¶
First we want to look at how the price depends on the area.
duplicate_verify = ['details', 'location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
sample = ~data.duplicated(duplicate_verify, keep='first')
data[sample].plot.scatter('price', 'pow')
We see that there are some outliers with an area greater than 5000 m$^2$. We will also treat everything priced above 3M PLN as an outlier.
data = data.loc[(data['pow'] < 5000) & (data['price'] < 3e6)]
sample = ~data.duplicated(duplicate_verify, keep='first')
data[sample].plot.scatter('pow', 'price', alpha=0.1)
The variance seems to grow with price. It is good to check the distribution of price and of log price, so we will plot the histograms and then the scatter plot once more, this time in logs. We will also calculate the log of area.
fig, ax = plt.subplots(ncols=2, nrows=1)
data.loc[sample, 'price'].plot.hist(bins=100, ax=ax[0])
ax[0].set_xlabel('Price, PLN')
data.loc[sample, 'price'].plot.hist(bins=100, logx=True, ax=ax[1])
ax[1].set_xlabel('Price, log PLN')
data['lprice'] = np.log(data['price'])
data['lpow'] = np.log(data['pow'])
data[sample].plot.scatter('lpow', 'lprice', alpha=0.1)
Now let's see how the price, area and price per m$^2$ changed in time.
data['price_m'] = data['price'] / data['pow']
summ_cols = ['date', 'price', 'pow', 'price_m']
time_averages = data[summ_cols].groupby('date').aggregate(np.mean)
fig, ax = plt.subplots(ncols=1, nrows=3)
time_averages.plot.line(y='price', ax=ax[0])
time_averages.plot.line(y='pow', ax=ax[1])
time_averages.plot.line(y='price_m', ax=ax[2])
for axis in [0, 1, 2]:
    ax[axis].xaxis.set_major_locator(years)
    ax[axis].xaxis.set_major_formatter(yearsFmt)
    ax[axis].xaxis.set_minor_locator(months)
    ax[axis].grid(color='g', linestyle='--', linewidth=1, alpha=.3)
    ax[axis].get_xaxis().tick_bottom()
    ax[axis].get_yaxis().tick_left()
    ax[axis].spines["top"].set_visible(False)
    #ax[axis].spines["bottom"].set_visible(False)
    ax[axis].spines["right"].set_visible(False)
    #ax[axis].spines["left"].set_visible(False)
    if axis < 2:
        ax[axis].set_xticklabels([])
        ax[axis].set_xlabel("")
    else:
        ax[axis].set_xlabel('Date')
time_averages.tail()
We see that the average area does not change much, while the price rises continuously over the analyzed period.
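To put a rough number on that rise, we can compare the average price per m$^2$ at the start and at the end of the period; a quick sketch using the time_averages frame from above (the exact figures depend on the sample at hand):
# Rough change of the average price per square meter between the first
# and the last weekly snapshot.
first, last = time_averages['price_m'].iloc[0], time_averages['price_m'].iloc[-1]
print("start: {:.0f} PLN/m2, end: {:.0f} PLN/m2, change: {:.1f}%".format(
    first, last, 100 * (last / first - 1)))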
We also look at how long an advertisement stays online on average.
# assign an integer id to each unique advertisement
data['group'] = data.groupby(duplicate_verify).grouper.group_info[0]
gr = data.groupby('group')
gr_dur = gr['date'].max() - gr['date'].min()
print(gr_dur.describe())
gr_dur.dt.days.hist(bins=60)
We see that on average an ad is alive for 62 days, while the median is 22 days, around 3 weeks. Unfortunately, our data acquisition runs on a 7-day cycle, so these results are likely biased. There are also many advertisements with a survival time of 0, indicating that they were already gone before the next weekly run.
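To gauge how strong this effect is, we can check what share of advertisements disappeared before the second weekly run; a quick sketch on the gr_dur series computed above:
# Fraction of unique advertisements seen in only one weekly snapshot,
# i.e. gone before the next download.
gone_after_one_week = (gr_dur.dt.days == 0).mean()
print("Share of ads observed only once: {:.1%}".format(gone_after_one_week))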
Linear regression analysis¶
Having cleaned our data, we can next analyze prices according to different criteria such as area, number of rooms, and the floor the apartment is on. For that we will be using the statsmodels package.
We also want to use only the first occurrence of each advertisement.
import statsmodels.api as sm
sample = ~data.duplicated(duplicate_verify, keep='first')
formula = "price ~ pow"
m0 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m0.summary()
From this initial regression we see that increasing the area by 1 m$^2$ increases the price by about 3400 PLN. Let's see how this changes when we add the floor number and the number of rooms.
We also add the location and the length of the description.
data['desc_len'] = data['details'].apply(len)
formula = "price ~ pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
So each additional room adds about 13k PLN to the price, while each floor up lowers the price by about 2.1k PLN. The longer the description, the higher the price: each letter "costs" around 10 PLN ;)
Let's see what happens after adding an interaction between quarter and area.
data['quarter'] = data['date'].dt.year.apply(str) + \
"Q" + data['date'].dt.quarter.apply(str)
formula = "price ~ C(quarter) * pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len + C(quarter)"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
Here we can observe that the price per m$^2$ relationship (the coefficient on area) is higher in the last two quarters compared to 2016Q4.
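One way to see this more directly is to pull the quarter-by-area interaction terms out of the fitted model and plot them. A sketch, assuming the usual patsy naming of the form C(quarter)[T.2017Q1]:pow:
# Coefficients of the quarter x area interaction, relative to the base quarter.
interactions = m1.params.filter(like=':pow')
interactions.index = [name.split('[T.')[1].split(']')[0] for name in interactions.index]
interactions.plot.bar()
plt.ylabel('Extra PLN per m$^2$ vs. the base quarter')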
Being curious, we also want to see how the time an advertisement spends online depends on our independent variables.
life = gr_dur.reset_index()
life = life.rename(columns={'date':'life'})
life['life'] = life['life'].dt.days
data = data.merge(life, how='left', left_on='group', right_on='group')
data['price_k'] = data['price'] / 1e3
sample = ~data.duplicated(duplicate_verify, keep='first')
formula = "life ~ price_k + pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
A quick look at the coefficients suggests that the bigger and more expensive an apartment is, the longer it stayed on the website, while the more rooms it had, the quicker it was gone. The effect of geographical position is large, but not significantly different from 0.
Also, the longer the description, the longer an ad stayed on the website.
The R$^2$ of this model is really low, indicating that most of what we record is noise and that the model does not explain this dependence well.
Conclusions¶
This turns out to be a fairly nice dataset, and this is not everything that can be done with it. I will try to show what else can be done, given that I have the geographical coordinates of each apartment and the full text of every advertisement. More will come.