When learning Python I got really interested in data scraping. It coincided with an urge to buy a flat in Szczecin, and since I was working a lot with data anyway, I wanted to at least analyze what the housing market in Szczecin looked like. So I got into web scraping and downloaded all the housing advertisements for Szczecin from otodom.pl. This turned out to be a really nice database. I also used it to create a price heatmap of Szczecin.
I automated the process and have been downloading the data every Saturday night, to keep the load on the site low. I ended up with a nice database of advertisements from Szczecin spanning nearly two years.
I no longer wish to buy a flat in Szczecin, but it is still interesting to show what is in the data.
Note: These are the results of gathering publicly available data from otodom.pl. I do not use this data for commercial purposes and have no claim to ownership of it. Below is just a simple analysis, and given that buying a flat is a serious investment, it should not be used in any decision making. I take no responsibility for these results. All of this work was done in my spare time and was not financed by anyone. I stopped gathering data as of July 2018.
import os, re, datetime, json
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = (10,6)
ddir = r"/home/mateusz/projects/szczecin_analysis/data/"
files = os.listdir(ddir)
# Keep only the weekly data files and collect their date and line count
data = {
    'file': [],
    'date': [],
    'lines': []
}
for f in files:
    if not re.match(r"^szczecin_[0-9]+\.jl", f):
        continue
    data['file'].append(f)
    # the file name encodes the download date as szczecin_DDMMYY.jl
    date = datetime.datetime.strptime(f[9:15], "%d%m%y").date()
    data['date'].append(date)
    num_lines = sum(1 for line in open(os.path.join(ddir, f)))
    data['lines'].append(num_lines)
df = pd.DataFrame(data)
df = df.sort_values(by='date')
# df = df.set_index('date')
df.describe()
We see that there are observations with 0 lines, where for some reason the data was not downloaded. We next get rid of observations with fewer than 500 lines.
df = df[df['lines'] > 500]
df.describe()
We can next graph how the number of observations changed during the time of data gathering.
years = mdates.YearLocator() # every year
months = mdates.MonthLocator() # every month
yearsFmt = mdates.DateFormatter('%Y')
fig, ax = plt.subplots()
ax.plot(df['date'], df['lines'])
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
ax.xaxis.set_minor_locator(months)
ax.set_ylabel('Number of observations')
ax.set_xlabel('Date')
ax.grid(color='g', linestyle='--', linewidth=1, alpha=.3)
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()
ax.spines["top"].set_visible(False)
#ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
#ax.spines["left"].set_visible(False)
We conclude that since the beginning of the experiment the number of offers on the site has been dropping substantially. The biggest drop was recorded at the beginning of 2018, when the number of offers fell from around 4000 to fewer than 2000.
The question remains whether this is because there are fewer offers on the market, or because the site is losing ground to some competitor. Unfortunately, given this dataset, we cannot answer that question.
Data analysis¶
We first need to load the data, which is stored as JSON Lines.
data = []
for idx in df.index:
    read_date = pd.read_json(os.path.join(ddir, df.loc[idx, 'file']), lines=True)
    read_date['date'] = df.loc[idx, 'date']
    data.append(read_date)
read_date = None
data = pd.concat(data, axis=0)
data.tail()
We see that we won't be needing some columns, such as _type, sellType or town. The details column includes the whole description of the offer. Let's see what's inside the data. We keep url for checking for duplicates.
data['date'] = pd.to_datetime(data['date'])
data = data.drop(columns=['_type', 'sellType', 'town'])
data.describe()
The average size of an offered apartment was 68.4 m$^2$ and the average price was around 300k PLN.
We see that the geographical coordinates, area and price are numeric. Apparently the floor variable has some non-numeric values.
data['floor'].value_counts()
We can see that the ground floor is described with the Polish word parter. There are also some apartments below ground level (suterena), but we will treat them as ground floor. We also map the attic (poddasze) to floor 5 and cap the floor number at 10, so that everything above the 10th floor ends up in a single top category.
data.loc[data['floor'] == 'parter', 'floor'] = '0'    # ground floor
data.loc[data['floor'] == 'suterena', 'floor'] = '0'  # basement level, treated as ground floor
data.loc[data['floor'] == 'poddasze', 'floor'] = '5'  # attic
data.loc[data['floor'] == '> 10', 'floor'] = '11'
data['floor'] = pd.to_numeric(data['floor'])
data.loc[data['floor'] > 10, 'floor'] = 10            # cap floors at 10
data['floor'].value_counts()
data['rooms'].value_counts()
data.loc[data['rooms'] == 'więcej niż 10', 'rooms'] = '10'  # 'więcej niż 10' means 'more than 10'
data['rooms'] = pd.to_numeric(data['rooms'])
data['rooms'].value_counts()
SQL database export¶
The floor and room numbers look fine now. Next we can visualize our data; for the visualizations we will create a sample without duplicates. Before we do that, let's write the data into a better format for further analysis. I want the option of loading just the apartment data without the descriptions, so I will go with an SQL database; a simple SQLite database will do.
We create a simple schema with just two tables with the following code:
CREATE TABLE adds (
id integer PRIMARY KEY AUTOINCREMENT,
description varchar,
details varchar,
location varchar,
url varchar,
price float,
pow float,
rooms integer,
floor integer,
data_lat float,
data_lon float
);
CREATE TABLE dates (
id integer PRIMARY KEY AUTOINCREMENT,
add_id integer,
date datetime
);
import sqlite3 as db
table_clear1 = '''DROP TABLE IF EXISTS adds; '''
table_clear2 = '''DROP TABLE IF EXISTS dates;'''
table_create1 = '''CREATE TABLE adds (
id integer PRIMARY KEY AUTOINCREMENT,
description varchar,
details varchar,
location varchar,
url varchar,
price float,
pow float,
rooms integer,
floor integer,
data_lat float,
data_lon float
);'''
table_create2 = '''
CREATE TABLE dates (
id integer PRIMARY KEY AUTOINCREMENT,
add_id integer,
date datetime
);
'''
scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
scn_cur = scn_db.cursor()
scn_cur.execute(table_clear1)
scn_cur.execute(table_clear2)
scn_cur.execute(table_create1)
scn_cur.execute(table_create2)
scn_db.commit()
Now we can populate our tables with observations from the DataFrame.
unique_cols = ['location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
unique_adds_ids = ~data.duplicated(unique_cols, keep='first')
unique_adds = data[unique_adds_ids].drop(columns=['date', 'poi_lat', 'poi_lon'])
unique_adds = unique_adds.reset_index(drop=True)
unique_adds['id'] = unique_adds.index
unique_adds.tail()
The description column is still coded as a list. We need to correct that.
unique_adds['description'] = unique_adds['description'].apply(lambda x: x[0])
We can now write the data to the SQLite database.
unique_adds.to_sql("adds", scn_db, if_exists='append', index=False)
print(scn_cur.execute('SELECT * FROM adds WHERE details != "" LIMIT 2;').fetchall())
It worked! Now we merge the unique DataFrame back with the full data DataFrame and keep just the ids from unique_adds together with the dates.
data_export = data.reset_index()
data_export['index'] = data_export.index
all_dates = data_export.merge(unique_adds, how='left', on=unique_cols)[['index', 'date', 'id']]
print(data_export.shape, unique_adds.shape, all_dates.shape)
all_dates.tail()
all_dates = all_dates.rename(columns={'index' : 'id', 'id' : 'add_id'})
all_dates = all_dates.sort_values('id')
all_dates.tail()
all_dates.to_sql("dates", scn_db, if_exists='append', index=False)
print(scn_cur.execute('SELECT * FROM dates LIMIT 5;').fetchall())
scn_db.close()
Success! All in all the export worked, and we now have a 92 MB database instead of 680 MB of JSON Lines files.
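If we later want to load just the apartment data without the descriptions, as intended, this layout makes that easy. A minimal sketch of such a query (the column selection and the adds_slim name are only illustrative):
import sqlite3 as db
import pandas as pd
# Reopen the database and pull only the numeric columns, joining every
# advertisement with the dates on which it was observed.
scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
query = '''
SELECT a.id, a.price, a.pow, a.rooms, a.floor, a.data_lat, a.data_lon, d.date
FROM adds a JOIN dates d ON d.add_id = a.id;
'''
adds_slim = pd.read_sql_query(query, scn_db, parse_dates=['date'])
scn_db.close()
adds_slim.head()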
Summaries¶
First we want to look at how the price depends on the area.
duplicate_verify = ['details', 'location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
sample = ~data.duplicated(duplicate_verify, keep='first')
data[sample].plot.scatter('price', 'pow')
We see that there are some outliers with an area greater than 5000 m$^2$. We will also treat everything priced above 3M PLN as an outlier.
data = data.loc[(data['pow'] < 5000) & (data['price'] < 3e6)]
sample = ~data.duplicated(duplicate_verify, keep='first')
data[sample].plot.scatter('pow', 'price', alpha=0.1)
The variance seems to grow with price. It is good to check the distribution of price and of log price, so we will plot the histograms and then the scatter plot once more, this time in logs. We will also calculate the log of area.
fig, ax = plt.subplots(ncols=2, nrows=1)
data.loc[sample, 'price'].plot.hist(bins=100, ax=ax[0])
ax[0].set_xlabel('Price, PLN')
data.loc[sample, 'price'].plot.hist(bins=100, logx=True, ax=ax[1])
ax[1].set_xlabel('Price, log PLN')
data['lprice'] = np.log(data['price'])
data['lpow'] = np.log(data['pow'])
data[sample].plot.scatter('lpow', 'lprice', alpha=0.1)
Now let's see how the price, area and price per m$^2$ changed in time.
data['price_m'] = data['price'] / data['pow']
summ_cols = ['date', 'price', 'pow', 'price_m']
time_averages = data[summ_cols].groupby('date').aggregate(np.mean)
fig, ax = plt.subplots(ncols=1, nrows=3)
time_averages.plot.line(y='price', ax=ax[0])
time_averages.plot.line(y='pow', ax=ax[1])
time_averages.plot.line(y='price_m', ax=ax[2])
for axis in [0, 1, 2]:
    ax[axis].xaxis.set_major_locator(years)
    ax[axis].xaxis.set_major_formatter(yearsFmt)
    ax[axis].xaxis.set_minor_locator(months)
    ax[axis].grid(color='g', linestyle='--', linewidth=1, alpha=.3)
    ax[axis].get_xaxis().tick_bottom()
    ax[axis].get_yaxis().tick_left()
    ax[axis].spines["top"].set_visible(False)
    #ax[axis].spines["bottom"].set_visible(False)
    ax[axis].spines["right"].set_visible(False)
    #ax[axis].spines["left"].set_visible(False)
    if axis < 2:
        ax[axis].set_xticklabels([])
        ax[axis].set_xlabel("")
    else:
        ax[axis].set_xlabel('Date')
time_averages.tail()
We see that the average area does not change much, while the price rises continuously over the analyzed period.
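To put a rough number on that rise, we can compare the average price per m$^2$ at the start and at the end of the period; a quick sketch using the time_averages frame from above (the exact figures depend on the sample at hand):
# Rough change of the average price per square meter between the first
# and the last weekly snapshot.
first, last = time_averages['price_m'].iloc[0], time_averages['price_m'].iloc[-1]
print("start: {:.0f} PLN/m2, end: {:.0f} PLN/m2, change: {:.1f}%".format(
    first, last, 100 * (last / first - 1)))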
We also look at how long an advertisement stays online on average.
# assign an integer id to each unique advertisement
data['group'] = data.groupby(duplicate_verify).grouper.group_info[0]
gr = data.groupby('group')
gr_dur = gr['date'].max() - gr['date'].min()
print(gr_dur.describe())
gr_dur.dt.days.hist(bins=60)
We see that on average an ad is alive for 62 days, while the median is 22 days, around 3 weeks. Unfortunately, our data acquisition runs on a 7-day cycle, so these results are likely biased. There are also many advertisements with a survival time of 0, indicating that they were already gone before the next weekly run.
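To gauge how strong this effect is, we can check what share of advertisements disappeared before the second weekly run; a quick sketch on the gr_dur series computed above:
# Fraction of unique advertisements seen in only one weekly snapshot,
# i.e. gone before the next download.
gone_after_one_week = (gr_dur.dt.days == 0).mean()
print("Share of ads observed only once: {:.1%}".format(gone_after_one_week))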
Linear regression analysis¶
Having cleaned our data, we can next analyze prices according to different criteria such as area, number of rooms, and the floor the apartment is on. For that we will be using the statsmodels package.
We also want to use only the first occurrence of each advertisement.
import statsmodels.api as sm
sample = ~data.duplicated(duplicate_verify, keep='first')
formula = "price ~ pow"
m0 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m0.summary()
From this initial regression we see that increasing the area by 1 m$^2$ increases the price by about 3400 PLN. Let's see how this changes when we add the floor number and the number of rooms.
We also add the location and the length of the description.
data['desc_len'] = data['details'].apply(len)
formula = "price ~ pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
So each additional room adds about 13k PLN to the price, while each floor up lowers the price by about 2.1k PLN. The longer the description, the higher the price: each letter "costs" around 10 PLN ;)
Let's see what happens after adding an interaction between quarter and area.
data['quarter'] = data['date'].dt.year.apply(str) + \
"Q" + data['date'].dt.quarter.apply(str)
formula = "price ~ C(quarter) * pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len + C(quarter)"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
Here we can observe that the price per m$^2$ relationship (the coefficient on area) is higher in the last two quarters compared to 2016Q4.
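One way to see this more directly is to pull the quarter-by-area interaction terms out of the fitted model and plot them. A sketch, assuming the usual patsy naming of the form C(quarter)[T.2017Q1]:pow:
# Coefficients of the quarter x area interaction, relative to the base quarter.
interactions = m1.params.filter(like=':pow')
interactions.index = [name.split('[T.')[1].split(']')[0] for name in interactions.index]
interactions.plot.bar()
plt.ylabel('Extra PLN per m$^2$ vs. the base quarter')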
Being curious, we also want to see how the time an advertisement spends online depends on our independent variables.
life = gr_dur.reset_index()
life = life.rename(columns={'date':'life'})
life['life'] = life['life'].dt.days
data = data.merge(life, how='left', left_on='group', right_on='group')
data['price_k'] = data['price'] / 1e3
sample = ~data.duplicated(duplicate_verify, keep='first')
formula = "life ~ price_k + pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
A quick look at the coefficients suggests that the bigger and more expensive an apartment is, the longer it stayed on the website, while the more rooms it had, the quicker it was gone. The effect of geographical position is large, but not significantly different from 0.
Also, the longer the description, the longer an ad stayed on the website.
The R$^2$ of this model is really low, indicating that most of what we record is noise and that the model does not explain this dependence well.
Conclusions¶
This turns out to be a fairly nice dataset, and this is not everything that can be done with it. I will try to show what else can be done, given that I have the geographical coordinates of each apartment and the full text of every advertisement. More will come.