Using deep learning to generate offensive license plates
Combining a rich corpus with Keras in R
If you’ve been on the internet for long enough, you’ve seen quality content generated by deep learning algorithms. This includes algorithms trained on band names, video game titles, and Pokémon. As a data scientist who wants to keep up with modern trends in the field, I figured there would be no better way to learn how to use deep learning myself than to find a fun topic to generate text for. I sat on the idea for a year before I found just the right data set to do it.
I happened to stumble on a list of banned license plates in Arizona. This list contains all of the personalized license plates that people requested but were denied by the Arizona Motor Vehicle Division. The dataset contains over 30,000 license plates, which makes it a great corpus for a deep learning algorithm. I included the data as text in my GitHub repository so other people can use it if they so choose. Unfortunately the data is from 2012, but I have an active Public Records Request to the state of Arizona for an updated list. I highly recommend you look through it; it’s very funny.
The next step was to actually learn how deep learning works and write the code. Since I do most of my work in R I was hoping to use that, and thankfully RStudio has a package, keras, which makes this easy to do. While the tool itself is easy to use, I had to piece together an intro to deep learning myself by perusing the internet. In retrospect, I probably should have just bought a book. The rest of this post is my summary of what I learned so that other people can read it and figure out how to generate their own fun text. If you want to see my actual code, check out my GitHub repository.
A colloquial intro to how deep learning works for text generation
When people talk about using deep learning to generate text, they are almost always referring to recurrent neural networks (RNNs). These are a special form of artificial neural network that works well on sequences of data points (for instance, a sequence of words in a book). Ignoring the mathematics behind this, the idea is that given a sequence of words, for example “this is a boring example of a sequence of words,” we can train a model that predicts each word in the sequence based on the words that came before it (and their order). So we pick a small number of previous words to consider before we pick our next word. If that number of previous words were 3, then our small sequences would create a data set that looks like:
Previous word 1 | Previous word 2 | Previous word 3 | ➡️ | Next word
---|---|---|---|---
This | is | a | ➡️ | boring
is | a | boring | ➡️ | example
a | boring | example | ➡️ | of
Since the recurrent neural network can’t deal with words directly and instead needs numbers, we arbitrarily assign each word a number. For instance, a = 1, boring = 2, this = 7, and so on. Then our data becomes:
Previous word 1 | Previous word 2 | Previous word 3 | ➡️ | Next word
---|---|---|---|---
7 | 5 | 1 | ➡️ | 2
5 | 1 | 2 | ➡️ | 4
1 | 2 | 4 | ➡️ | 3
But unfortunately even this is still too complicated for the RNN. We need to represent each number as a categorical variable by converting it to a binary vector (this is called one-hot encoding). So for example, we could code the 5th of 7 words as a binary vector of length 7 with only the fifth element set to 1: (0,0,0,0,1,0,0). If we do this for our whole dataset we get:
Previous word 1 | Previous word 2 | Previous word 3 | ➡️ | Next word
---|---|---|---|---
(0,0,0,0,0,0,1) | (0,0,0,0,1,0,0) | (1,0,0,0,0,0,0) | ➡️ | (0,1,0,0,0,0,0)
(0,0,0,0,1,0,0) | (1,0,0,0,0,0,0) | (0,1,0,0,0,0,0) | ➡️ | (0,0,0,1,0,0,0)
(1,0,0,0,0,0,0) | (0,1,0,0,0,0,0) | (0,0,0,1,0,0,0) | ➡️ | (0,0,1,0,0,0,0)
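As a side note, this one-hot encoding is only a few lines of base R. Here’s a minimal sketch using the toy vocabulary above (the word-to-index mapping is the one implied by the tables; “sequence” at position 6 never actually appears in the example):

```r
# Toy vocabulary, ordered so each word's position matches its
# integer code from the tables (a = 1, boring = 2, ..., this = 7)
vocab <- c("a", "boring", "of", "example", "is", "sequence", "this")

# One-hot encode a word: a length-7 vector with a single 1
one_hot <- function(word, vocab) {
  vec <- rep(0, length(vocab))
  vec[match(word, vocab)] <- 1
  vec
}

one_hot("is", vocab)
#> [1] 0 0 0 0 1 0 0
```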
This is now data that the RNN can work on. The model fits to it, and then we can give it a set of three previous words and have it guess what the next word will be. Since there are many possible next words, what the model actually produces is a weighting across all the possible words, with weights that add up to one. These represent the probabilities of each word being next:
Previous word 1 | Previous word 2 | Previous word 3 | ➡️ | Next word
---|---|---|---|---
(0,0,0,0,1,0,0) | (0,0,0,0,1,0,0) | (0,0,0,1,0,0,0) | ➡️ | (0.25, 0.1, 0.2, 0.1, 0.3, 0.05, 0.0)
To generate a sequence of words, we pick three words as a starting point, and then draw the fourth word by sampling from the probabilities the model predicts for each word. For instance, using the predicted distribution above as a set of weights over the possible words, we might draw the second element, which corresponds to the word “boring”; that becomes the next word in the sequence. We then append it and use the three most recent words in the sequence to generate the fifth word, and so on.
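That sampling step is essentially one line in R. A minimal sketch, reusing the toy `vocab` from above and the predicted probabilities from the table:

```r
# Predicted probabilities for each of the 7 words in the vocabulary
preds <- c(0.25, 0.1, 0.2, 0.1, 0.3, 0.05, 0.0)

# Draw the index of the next word, weighted by those probabilities
next_index <- sample(seq_along(preds), size = 1, prob = preds)
vocab[next_index]  # e.g. "boring" if index 2 happens to be drawn
```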
Setting up the license plate data
For the license plate data, we had to make a few alterations to this main idea.
- The “words” in this case are letters and numbers in the plate (not full words).
- Since a license plate may not use the full length, the model needs to know when to stop. Thus, we append a stop character ‘+’ to each license plate, so the plate ABC001 becomes ABC001+. If we ever generate the symbol ‘+’ while creating new plates, that’s our signal to stop the sequence.
- The sequences that we use for training data are subsequences of each plate in the data. So for instance if ABC001+ is in the data, we want to train the model on (A → B, AB → C, ABC → 0, …). Because each sequence of training data has to be the same length, we pad the data at the beginning (~~~~~~A → B, ~~~~~AB → C, ~~~~ABC → 0, …). We generate all the sequences from each plate and put them into one large training set (see the sketch after this list).
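Both alterations are just a bit of string manipulation. Here’s a rough sketch of how it can be done (the function and variable names are illustrative, not from my actual code):

```r
# Append the stop character '+' to every plate
plates <- c("ABC001", "XYZ9")
plates <- paste0(plates, "+")

max_len <- max(nchar(plates)) - 1  # longest possible input prefix

# For one plate, build every (padded prefix -> next character) training pair
make_pairs <- function(plate, max_len, pad = "~") {
  n <- nchar(plate)
  do.call(rbind, lapply(seq_len(n - 1), function(i) {
    prefix <- substr(plate, 1, i)
    data.frame(
      input     = paste0(strrep(pad, max_len - i), prefix),
      next_char = substr(plate, i + 1, i + 1)
    )
  }))
}

training <- do.call(rbind, lapply(plates, make_pairs, max_len = max_len))
head(training, 3)
#>    input next_char
#> 1 ~~~~~A         B
#> 2 ~~~~AB         C
#> 3 ~~~ABC         0
```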
Using R and Keras for Tensorflow
I ended up using three different technologies stacked on top of each other to run the model. These terms get thrown around a lot, so it’s best to clearly distinguish each one:
- TensorFlow (base technology): TensorFlow is an open-source set of libraries for running deep learning algorithms. It’s incredibly powerful, but also somewhat of a hassle to deploy directly.
- Keras (Python API): Keras is a Python library that makes TensorFlow way easier to use. It takes all of the hassle out of setting up and deploying TensorFlow, and really made deep learning accessible to Python users.
- keras (R package): This is the package by RStudio that allows you to use Keras in R instead of Python. This isn’t Keras itself, but rather a wrapper on top of the Keras Python library. The good news is that it works extremely well, follows modern R programming principles, and is easy to install. The bad news is that it’s a wrapper on a wrapper, so sometimes searching for help isn’t easy. But as far as I can tell, anything you can do in Python Keras you can do in R Keras too.
I figured out how this stuff all works by following the text generation example of the keras package. I ran their code line by line to understand how it worked, and then modified it for my license plate project. To get the code running you need to install TensorFlow and Python, which the RStudio package helps you with.
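Installation is pleasantly short, since the R package ships a helper that sets up Python and TensorFlow for you:

```r
install.packages("keras")
library(keras)
install_keras()  # installs TensorFlow and the Python dependencies
```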
Their example text generation code really only does a few key things:
- Load up the example text data and do any formatting needed. In my case, I needed to add the license plate modifications from the previous section.
- Create sequences of words to use as training data and format them as a 3-dimensional array X. The dimensions of the array are: which sequence the data comes from, which position in the sequence we are referring to, and which word. The element X[i,j,k] = 1 if the jth word of sequence i is word k. The targets (the next word for each sequence) get stored in a 2-dimensional array Y, whose dimensions are: which sequence we are referring to, and which word comes next. (See the first sketch after this list.)
- Run a keras model on the data. When specifying a model, you need to list the different layers of neurons that make up the artificial neural network. Honestly, I have almost no idea how to make a “good” specification, so for my code I used the same one as the RStudio sample (the second sketch below). The art of choosing layers for a neural network feels like the most unclear part of this whole process, and I am hoping to learn more. If you’re trying to write your own deep learning code for the first time, I recommend piggybacking off the specification of someone who is solving a similar problem.
- Use the created model to generate words and create new sequences from those words (the final sketch below shows this step).
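To make the array-building step concrete, here is a sketch of how X and Y get filled, assuming the `training` data frame and `max_len` from the earlier license plate sketch (the loop structure is mine; the real example code arranges this a bit differently):

```r
# Character vocabulary, including the pad '~' and stop '+' characters
vocab <- sort(unique(unlist(strsplit(
  c(training$input, training$next_char), ""
))))

num_seq   <- nrow(training)
input_len <- max_len
num_char  <- length(vocab)

X <- array(0, dim = c(num_seq, input_len, num_char))
Y <- array(0, dim = c(num_seq, num_char))

for (i in seq_len(num_seq)) {
  chars <- strsplit(training$input[i], "")[[1]]
  for (j in seq_along(chars)) {
    X[i, j, match(chars[j], vocab)] <- 1         # one-hot: position j of sequence i
  }
  Y[i, match(training$next_char[i], vocab)] <- 1 # one-hot next character
}
```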
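The model specification itself I lifted almost verbatim from the RStudio text generation example: a single LSTM layer feeding a dense softmax over the characters. Roughly (the hyperparameters are the sample’s, not tuned by me):

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(input_len, num_char)) %>%
  layer_dense(units = num_char) %>%
  layer_activation("softmax")

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(learning_rate = 0.01)
)

model %>% fit(X, Y, batch_size = 128, epochs = 20)
```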
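And finally, generation: feed in a padded seed, sample a character from the predicted distribution, append it, and repeat until the model emits the stop character. A sketch under the same assumptions (the helper name is mine):

```r
generate_plate <- function(model, vocab, input_len, pad = "~") {
  plate <- ""
  repeat {
    # One-hot encode the current padded sequence as a 1 x input_len x num_char array
    seed <- paste0(strrep(pad, input_len - nchar(plate)), plate)
    x <- array(0, dim = c(1, input_len, length(vocab)))
    for (j in seq_len(input_len)) {
      x[1, j, match(substr(seed, j, j), vocab)] <- 1
    }
    # Sample the next character, weighted by the predicted probabilities
    # (a trained model essentially never predicts the pad character)
    probs <- as.numeric(predict(model, x))
    next_char <- vocab[sample(seq_along(vocab), size = 1, prob = probs)]
    if (next_char == "+" || nchar(plate) >= input_len) break  # stop character reached
    plate <- paste0(plate, next_char)
  }
  plate
}

generate_plate(model, vocab, input_len)
```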
Once I had the model working, I took a bunch of randomly generated plates and threw them in an image. Altogether this project took me around 8 hours, and it would have been faster had I not had a few off-by-one errors in my code. I was extremely happy with how it turned out, and am surprised at how straightforward the whole process was. If I had known it would be this easy I would have done it months ago. If you’re interested in deep learning I highly recommend you get a text corpus and try generating new text yourself. It has taught me a ton about the process and I feel much more equipped to use deep learning for more advanced projects. The code and data are available on GitHub so you can try making offensive things yourself!