While reading through Hands-On Machine Learning with Scikit-Learn and TensorFlow I finished the chapter on Recurrent Neural Networks (RNNs). After the first reading I felt lost, but now I understand much more.
I'm generally a person who learns the most by doing. One of the exercises for this chapter encouraged the reader to go through the official TensorFlow tutorial on RNNs. It seemed easy. I went through it, but I added a twist: I did not want to use the original data. Instead, I had the idea to create a text generator trained on the Polish Parliament minutes, which are freely available on the internet as PDFs.
It seemed entertaining to me to create an artificial Polish Parliament session generator, and it was an idea I wanted to try out.
In this notebook I will show how to download the minutes, convert them into text and then train the model using the TF tutorial code.
from bs4 import BeautifulSoup
import requests
import re
import time
import os
import numpy as np
import tensorflow as tf
import pyprind
# Matplotlib options
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
save_folder = './rnn_sejm/data/'
# Enable eager execution
tf.enable_eager_execution()
Scraping Polish Parliament minutes
I wanted to try out my web scraping skills, so I decided to automate downloading the minutes using loops and BeautifulSoup to crawl the main HTML document. Thankfully the Parliament's website is pretty simple, and it was easy to download the page for every year of the current term, from 2015 until this year.
The HTML includes a simple table with links to the PDFs for each sitting, marked with class="pdf". For each such link I simply downloaded the file into the output folder. I also added a 1 s pause so as not to spam the server.
I also added a progress bar using PyPrind.
I ended up with 213 PDFs weighing over 375 MiB.
for year in [2019, 2018, 2017, 2016, 2015]:
    sejm_url = 'http://www.sejm.gov.pl/Sejm8.nsf/stenogramy.xsp?rok={}'
    sejm_url = sejm_url.format(year)
    page_response = requests.get(sejm_url, timeout=5)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    table = page_content.find('table')
    links = table.findAll('a', {'class': 'pdf'})
    dl_bar = pyprind.ProgBar(
        len(links),
        track_time=True,
        title='Downloading PDFs for {}'.format(year)
    )
    for link in links:
        pdf_url = link.get('href')
        dl_name = pdf_url.split('/')[-1:][0]
        if re.match(r'.*\.pdf$', dl_name):
            r = requests.get(pdf_url)
            if r.status_code == 200:
                save_file = os.path.join(save_folder, 'pdf', dl_name)
                with open(save_file, 'wb') as f:
                    f.write(r.content)
                time.sleep(1)
            else:
                print('Could not download: ', pdf_url)
        dl_bar.update()
I decided to use pdftotext from Xpdf to convert the downloaded documents to text.
It turned out it wasn't as easy as I'd expected. I tried using the Python PyPDF2 library, but I couldn't get any reliable results quickly. I also tried Calibre's converter, but the PDFs have a two-column layout and the converter mixed up the lines. Only Xpdf converted the text with the fewest mistakes.
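For reference, the PyPDF2 attempt looked roughly like the sketch below. This is reconstructed from memory rather than the exact code I ran, but it shows the kind of extraction that did not give me reliable results on these PDFs.
import PyPDF2
# Rough sketch of the PyPDF2 approach (a reconstruction, not the exact code I used)
sample_pdf = os.path.join(save_folder, 'pdf',
                          os.listdir(os.path.join(save_folder, 'pdf'))[0])
with open(sample_pdf, 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    # Extract the text of every page
    pages = [reader.getPage(i).extractText()
             for i in range(reader.getNumPages())]
print(pages[0][:500])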
import subprocess
import pyprind

pdf_files = os.listdir(os.path.join(save_folder, 'pdf'))
bar = pyprind.ProgBar(
    len(pdf_files),
    track_time=True,
    title='Converting PDF to txt'
)
for f in pdf_files:
    in_pdf = os.path.join(save_folder, 'pdf', f)
    txt_name = f.split('.')[0] + '.txt'
    out_txt = os.path.join(save_folder, 'raw_text', txt_name)
    subprocess.run(["pdftotext", in_pdf, out_txt])
    bar.update()
Unfortunately the conversion wasn't perfect. I got a lot of additional material that I did not need, like the table of contents, headers, footers and page numbers.
Below is a quick and dirty approach to removing this unnecessary text from the content. Some unnecessary text still remains, mostly the MPs' names printed at the beginning of each page to indicate the current speaker. For this short exercise I think it is enough; for a more serious analysis I would have to be more thorough.
def clean_steno(in_file, out_file):
    text = ''
    with open(in_file, 'r') as f:
        text = f.read()
    # Remove form feed (page break) characters
    text = re.sub(r'\f', '', text)
    # Remove the table of contents ('Spis treści')
    toc = re.search(r'\(Na posiedzeniu', text)
    if toc is not None:
        text = text[toc.span(0)[0]:]
    else:
        print('Did not find TOC for ', in_file)
    # Remove the ending
    toc = re.search(r'\nTŁOCZONO ', text)
    if toc is not None:
        text = text[:toc.span(0)[0]]
    else:
        print('Did not find ending for ', in_file)
    # Normalize some characters
    for ch in ['–', '—', '−']:
        text = text.replace(ch, '-')
    for ch in ['«', '»', '”', '„']:
        text = text.replace(ch, '"')
    text = text.replace('§', 'par. ')
    # Remove headers, multiple blank lines, line breaks, page numbers, etc.
    text = re.sub(r'\n\n[0-9]+\n+([0-9]+\n)?', '\n[line]\n', text)
    text = re.sub(r'\nSpis treści\n', '\n', text)
    text = re.sub(r'[0-9]+\. posiedzenie .*\n+', '\n[line]\n', text)
    text = re.sub(r'\[line\]\n(.*\n){2}\n+[A-Z].*\n', '\n[line]\n', text)
    text = re.sub(r'-\n+([A-Z\[].*\n+)+', '', text)
    text = re.sub(r'\n\n(.*\n)\[line\]', '\n[line]', text)
    text = re.sub(r'\[line\]\n[A-Z].*\n\n', '', text)
    text = re.sub(r'\[line\]\n', '\n', text)
    text = re.sub(r'-\n+', '', text)
    text = re.sub(r'…', '...', text)
    text = re.sub(r'([^\.\)\?\!:\n])\n+', r'\1 ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Save the cleaned file
    with open(out_file, 'w') as f:
        f.write(text)
txt_files = os.listdir(os.path.join(save_folder, 'raw_text'))
bar = pyprind.ProgBar(
    len(txt_files),
    track_time=True,
    title='Cleaning text files'
)
for txt_name in txt_files:
    in_file = os.path.join(save_folder, 'raw_text', txt_name)
    out_file = os.path.join(save_folder, 'clean_text', txt_name)
    clean_steno(in_file, out_file)
    bar.update()
I decided to select just one full sitting (the 23rd) to train my model on. Unfortunately I only have my laptop, with Linux running in a virtual machine, so my computational power is limited. Even for the 2.5M characters selected, each epoch of training takes around an hour to complete.
# Build corpus
files = os.listdir(os.path.join(save_folder, 'raw_text'))
# We read the 23rd sitting
files = [file for file in files if re.match(r'^23', file)]
corpus = ''
for file in files:
    text_file = os.path.join(save_folder, 'clean_text', file)
    text = open(text_file, 'r').read()
    corpus += text
print('Length of text: {} characters'.format(len(corpus)))
Model training
In the next lines I follow the approach from the official TensorFlow tutorial.
I didn't modify the code much; I only changed the loss function to tf.losses.sparse_softmax_cross_entropy. The tutorial used a nightly build of TF while I use the official release, and I got an error when using the loss function from the example.
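From what I remember, the tutorial defined its loss roughly as in the sketch below (not a verbatim quote); presumably it was the from_logits keyword that my stable release did not accept, which is why I define my own loss function further down.
# Roughly the tutorial's loss, from memory (not verbatim); this is the variant
# that raised an error on the stable TF release I had installed
def tutorial_loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True
    )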
At this point I feel a bit like a script kiddie following someone else's code. I really like to write my own procedures, because this is how I learn. Even when learning Python, ML or NNs from books I refrained from copy-pasting the code from the repositories. Instead I've written everything myself, by hand. For me this is a way of getting to know the most commonly used classes and functions. I see no added value in copying and pasting someone else's code when learning a new library; it would just give me a false feeling of accomplishment instead of understanding, or even familiarity with, each step needed. Of course I've gone through the code line by line, but it is just not the same for me as writing it myself, and that should not be the point of learning. I also see this as an easy trap to fall into when learning how to code. That's why it was my ambition to at least try it out with my own data.
# The unique characters in the file
vocab = sorted(set(corpus))
print('{} unique characters'.format(len(vocab)))
This is quite a lot; the original text had around 60 characters. But that is a peculiarity of the Polish language, which includes characters like ąłśćźżó.
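Out of curiosity (this is not part of the tutorial), I can list the non-ASCII characters that inflate the vocabulary:
# Show the non-ASCII characters found in the corpus (curiosity check only)
non_ascii = [ch for ch in vocab if ord(ch) > 127]
print('{} non-ASCII characters: {}'.format(len(non_ascii), ''.join(non_ascii)))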
# Creating a mapping from unique characters to indices
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in corpus])
tf.enable_eager_execution()
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(corpus) // seq_length
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE
# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 1000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
import functools
rnn = functools.partial(
    tf.keras.layers.GRU,
    recurrent_activation='sigmoid'
)
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            batch_input_shape=[batch_size, None]
        ),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True
        ),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE
)
model.summary()
def loss(labels, logits):
    return tf.losses.sparse_softmax_cross_entropy(
        labels,
        logits
    )
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    loss=loss
)
# Directory where the checkpoints will be saved
checkpoint_dir = './rnn_sejm/model'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)
EPOCHS = 10
history = model.fit(
    dataset.repeat(),
    epochs=EPOCHS,
    steps_per_epoch=steps_per_epoch,
    callbacks=[checkpoint_callback]
)
As I mentioned before, each epoch takes around three quarters of an hour to train. Initially I played with only 1 epoch, but in the end I let it run through 10 epochs while I was away at work. When I returned I had my 10-epoch model trained.
The graph below suggests that there is still much room for improvement, but for practice this should suffice.
epoch = range(EPOCHS)
plt.plot(epoch, history.history['loss'], label='Training set loss')
plt.xlabel('Epoch')
plt.ylabel('loss')
plt.legend(loc='best', prop={'size': 20})
pred_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
pred_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
pred_model.build(tf.TensorShape([1, None]))
pred_model.summary()
def generate_text(model, start_string, num_generate=100):
    # Evaluation step (generating text using the learned model)
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # Empty list to store our results
    text_generated = []
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = .5
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # Use a multinomial distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))
Simulation results
After training I generated a sample of text. To my astonishment, the text is even readable! Given that the model generates text letter by letter, I'm impressed. Even with the complexities of Polish grammar, there are fewer mistakes than I expected. Of course the sentences are too long and they switch sense in the middle, but there is some cohesion in them. Moreover, the model catches some peculiarities of the discussion well.
I'm curious to check how the results would look after more training.
I would like to play with the models more myself, but currently I'm handicapped by my laptop's computational power. Still, I really like Keras and how easy it is to build models in it compared to raw TensorFlow.
I'll keep the ideas coming!
print(generate_text(pred_model, start_string=u"Wicemarszałek ", num_generate=1500))