Last month, I experimented with building a reddit comment bot that generated natural-language replies by combining two pre-trained deep learning models: GPT-2 and BERT. I wrote another post on the motivation and background, but here I wanted to give a step-by-step walkthrough so others can work with what I've built. If you prefer, you can jump straight to the project code.
And to see the work that I based this on, see this and this. Before getting into the nitty-gritty, I wanted to give a general overview of the process that I'm going to be using.
This flow diagram shows the three models that I needed to train, as well as the process for hooking the models together to generate the output.
There are quite a few steps, but I hope it doesn't get too confusing. Check out my previous post for an even higher-level architecture overview.
Here are the steps I'll be explaining in this post. As with any machine learning project, nothing can start until you have data to train your model on. The data I used to fine-tune the models came from a large database of previously retrieved reddit comments. There is an ongoing project that scrapes many sites around the web and stores the results in a set of Google BigQuery tables.
To me, it's very surprising that I couldn't find a central page about such a big project, but I used a few reddit and medium posts to piece together the format of the queries I'd need.
To start, I just downloaded a bunch of comment and reply information for the subreddits 'writing', 'scifi', 'sciencefiction', 'MachineLearning', 'philosophy', 'cogsci', 'neuro', and 'Futurology'. I used the BigQuery Python API to automate the generation of the queries I needed to download the data across a number of months. In the end, I want to be able to prime the GPT-2 network with a comment and have it generate a reply. To do this, I needed to reformat the data so that each example contains both parts, separated by a special [SEP] string that lets the algorithm know which part is which.
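As a sketch of how that query generation might look: the `fh-bigquery.reddit_comments` table naming is an assumption based on the public reddit dataset, and the columns are illustrative, so adjust them against the real schema.

```python
# Illustrative sketch: build one BigQuery SQL string per monthly table.
# The fh-bigquery.reddit_comments table layout is an assumption based on
# the public reddit comment dataset; the column list is illustrative.
SUBREDDITS = ["writing", "scifi", "sciencefiction", "MachineLearning",
              "philosophy", "cogsci", "neuro", "Futurology"]

def comment_query(table_month, subreddits=SUBREDDITS):
    """Return SQL selecting comment bodies and parent ids for one month."""
    subs = ", ".join("'{}'".format(s) for s in subreddits)
    return ("SELECT id, parent_id, subreddit, body, score "
            "FROM `fh-bigquery.reddit_comments.{}` "
            "WHERE subreddit IN ({})".format(table_month, subs))

# One query per month, ready to run with something like
# google.cloud.bigquery.Client().query(sql).to_dataframe()
queries = [comment_query("2019_{:02d}".format(m)) for m in range(1, 4)]
```

Generating the SQL strings programmatically like this makes it easy to sweep over many monthly tables without hand-editing queries.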
Each line of the training data file will look like the following. After I train the model on data in this format, I can feed the trained model a string like "some new primary comment text [SEP]" and it will start to generate the remaining "some new reply" that it thinks fits best based on the training data.
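A minimal helper for producing those lines might look like this; the whitespace collapsing is my own addition to keep each example on a single line.

```python
def to_training_line(comment, reply, sep="[SEP]"):
    """Join a comment and its reply into one training line."""
    # Collapse internal newlines so each example occupies a single line.
    comment = " ".join(comment.split())
    reply = " ".join(reply.split())
    return "{} {} {}".format(comment, sep, reply)

def write_training_file(pairs, path):
    """Write one formatted line per (comment, reply) pair."""
    with open(path, "w", encoding="utf-8") as f:
        for comment, reply in pairs:
            f.write(to_training_line(comment, reply) + "\n")
```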
I'll explain in more detail below about how to feed this kind of data into the GPT-2 fine-tuning script. The major advantage of using GPT-2 is that it has been pre-trained on a massive dataset of millions of pages of text on the internet.
However, if you were to use GPT-2 straight "out-of-the-box," you'd end up generating text that could look like anything you might find on the internet.
Sometimes it'll generate a news article, sometimes a cooking blog recipe, sometimes a rage-filled facebook post. You don't really have much control, and therefore you won't really be able to use it to effectively generate reddit comments.
Take the bots of r/SubSimulatorGPT2, a subreddit populated entirely by GPT-2 models: posts are frequently incoherent or contain non sequiturs, and the bots make obvious factual errors. Flubs aside, the bots are remarkable creations. Interestingly, they even manage to mimic the metatext of Reddit.
They quote one another (although the quotes are made up) and link to fake YouTube videos and Imgur posts. All this AI hubbub is the creation of redditor disumbrationist, who explains some of the technical details behind the project here. Each bot is trained on a pretty small text file, as little as 80 MB in size, containing some of the most popular posts and comments scraped from different subreddits.
Fake text generation like this is undeniably impressive, but it raises worrying questions as well; it's part of why OpenAI initially held back its largest model, a decision that was controversial in the usually-open world of AI research.
Over the past two years, NLP and language generation have experienced a renaissance moment.
With the onset of the transformer neural network architecture, there has been an explosion of work and hype around the plethora of potential use cases for these types of models. I had initially tracked this dataset down a couple of years ago for another project from my undergrad that carried out sentiment analysis and clustering on dialogue in film scripts. Thankfully, it still exists, and I was able to build script buddy using it. IMSDB is an online repository of film scripts.
While the number of scripts is modest, a screenplay contains about 30,000 words on average, so the dataset holds close to 40 million words. Since the screenplays are stored in plain text on the site, it was relatively easy to design a scraper with Scrapy to iterate over each script URL. Screenplays are highly structured pieces of text, with visual cues to denote scene action, scene location, and character dialogue.
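The author used Scrapy for the crawl itself; as a dependency-free sketch of the extraction step, the snippet below pulls screenplay text out of a `<pre>` block, which is an assumption about how IMSDB script pages lay out their markup.

```python
from html.parser import HTMLParser

# Sketch of the extraction step: keep only the text inside <pre> tags,
# where (by assumption) IMSDB script pages hold the screenplay body.
class ScriptExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        # Only keep text that appears inside the screenplay block.
        if self.in_pre:
            self.chunks.append(data)

def extract_script(html):
    parser = ScriptExtractor()
    parser.feed(html)
    return "".join(parser.chunks)
```

In a real Scrapy spider, this logic would live in the `parse` callback, yielding one item per script URL.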
I wanted the model to be able to generate entire sequences of script with mixed script elements in each sequence, so it was important that the training data represented both the text and the structural layout of a screenplay, allowing the model to optimize towards that specific structure. With every script scraped, I was also able to collect genre metadata. To load the script data into the model in batches, it needs to be in the correct tokenized format for GPT-2. I created a ScriptData class (see below) that splits the entire script dataset into tokenized blocks of tensors.
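The ScriptData class itself isn't reproduced in this excerpt, but based on the description it might look roughly like this; the block size and the injectable tokenizer are my choices, and the original presumably passed a GPT-2 tokenizer from transformers.

```python
import torch
from torch.utils.data import Dataset

# Hypothetical reconstruction of the ScriptData class described above:
# concatenate every script, tokenize, and cut the token stream into
# fixed-size blocks of tensors. In practice you would pass something like
# transformers.GPT2Tokenizer.from_pretrained("gpt2-medium") as tokenizer.
class ScriptData(Dataset):
    def __init__(self, scripts, tokenizer, block_size=512):
        ids = []
        for text in scripts:
            ids.extend(tokenizer.encode(text))
        # Drop the ragged tail so every block is exactly block_size long.
        n_blocks = len(ids) // block_size
        self.blocks = [
            torch.tensor(ids[i * block_size:(i + 1) * block_size])
            for i in range(n_blocks)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        return self.blocks[idx]
```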
Once run, these blocks are ready to be loaded in batches into GPT-2 in a training loop. Fine-tuning seemed very daunting to me before I got hands-on with it. Thanks to Martin Frolov for his post detailing how he fine-tuned GPT-2 on a dataset of jokes; it was a huge help.
I used gpt2-medium, the medium-sized version of GPT-2 provided by transformers.
This version of the model has 24 layers and roughly 345 million parameters. With these two objects, the model and its tokenizer, you can use GPT-2 as is; but to fine-tune or optimize it on a custom dataset of tokenized text, you need a training loop that progressively loads batches of script sequences from the entire dataset.
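A minimal version of such a loop might look like this, assuming transformers-style models whose forward pass returns the language-modeling loss when labels are supplied; the learning rate is illustrative, not the project's actual setting.

```python
import torch
from torch.utils.data import DataLoader

def load_gpt2_medium(device="cpu"):
    # Imported lazily; downloads the pretrained weights on first use.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium").to(device)
    return model, tokenizer

def train_epoch(model, loader, optimizer, device="cpu"):
    """One pass over the tokenized blocks; returns the mean loss."""
    model.train()
    total = 0.0
    for batch in loader:
        batch = batch.to(device)
        # With labels == input_ids, transformers models compute the
        # shifted next-token language-modeling loss internally.
        loss = model(batch, labels=batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        total += loss.item()
    return total / max(len(loader), 1)

# Usage sketch (batch size 7 as in the text):
# model, tok = load_gpt2_medium("cuda")
# loader = DataLoader(script_data, batch_size=7, shuffle=True)
# opt = torch.optim.AdamW(model.parameters(), lr=3e-5)
# train_epoch(model, loader, opt, "cuda")
```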
Choosing a batch size for the data loader can be tricky; pick too large a size and you'll run out of GPU memory fast. In the end I managed to train the model with a batch size of 7.

For us at Humanise, the thoughts below distill many of the lessons and experiences we learnt throughout the year that culminated in the success we had at its end.
When Google demonstrated Duplex, it was perhaps the first example of a narrow, domain-specific AI that blew my mind. Duplex is able to book a restaurant table or an appointment for a haircut, communicating naturally with humans over a phone call. But, really, booking a restaurant table is a very narrow use case.
What Duplex really demonstrated was that if you put enough effort into a narrow scenario, you can do something pretty awesome. I expect more narrow use cases to impress as the lessons from Duplex begin to emerge.
Last year the technology behind Duplex started rolling out to our smartphones. Make no mistake: this is a significant moment for AI. Virtual assistants will continue to improve, but in a slow and incremental fashion. The smart money will go into narrow domain use cases for the foreseeable future.
After a number of years of stagnation in the NLP technology market, last year was the one in which we began to see really exciting progress. New algorithms and frameworks using transformer-based techniques emerged throughout the year. But big challenges with these new techniques remain: models like BERT can be far too slow and require far too much expensive compute capacity for many applications. Slimming the models and improving performance has been a recent focus and will continue to be one. A very small but specialist community is leading here, notably the tech startup HuggingFace, who just raised their Series A investment round.
I expect more investment in, and the emergence of, niche NLP outfits. Right now, using the latest techniques requires quite deep expertise; these are not Chatfuel-style point-and-click tools.
Do I think these new NLP techniques will be transformational and usher in a new era? Let me put it this way: I expect them to bring meaningful improvements.

OpenAI released GPT-2 in stages: first a 117M-parameter model, then 345M, then 774M, and finally, last November, they open-sourced the full 1.5B-parameter model. We made it work and generated some text, but it was not of very good quality.
As OpenAI kept publishing better models, we kept trying them, and the results kept improving. Finally, the 1.5B model arrived. With previous models we had seen that the output would sometimes be totally unrelated to the input, but the 1.5B version stays on topic far more reliably. To make GPT-2 based text generation available for all enthusiasts to test, we started working on a demo, and it is now available at: Text generation Using GPT-2 Demo.
You can provide an input and select the length of the text you would like to generate. To improve the output further, the model needed to be fine-tuned. GPT-2 is already trained on a very large amount of text: 40GB drawn from 8 million web pages.
But that text is general rather than domain-specific. When we generated text with the fine-tuned model, the output improved a lot. Once we have the fine-tuned model, we can churn out articles very quickly by providing it various inputs on the same topic.
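Churning out articles for a batch of prompts might be sketched like this; the sampling parameters (top_k, top_p) are my guesses, not the project's settings.

```python
import torch

def generate_article(model, tokenizer, prompt, length=300, device="cpu"):
    """Prime the model with a prompt and sample `length` new tokens."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    output = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + length,
        do_sample=True,       # sample instead of greedy decoding
        top_k=40,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)

def generate_batch(model, tokenizer, prompts, **kwargs):
    # One article per input prompt on the same topic.
    return [generate_article(model, tokenizer, p, **kwargs) for p in prompts]
```

Varying `length` per call is what lets a demo expose "select the length of the text" as a user control.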
We could generate thousands of sample articles within a couple of days for various inputs. After witnessing the improvement for articles on Artificial Intelligence, we went ahead and fine-tuned the GPT-2 model on other topics. Below are lists and links of topics for which we fine-tuned the GPT-2 model and generated sample articles. Check out the sample articles on machinewrites. Let us know in the comments what you would like us to do with the GPT-2 model.
If we find it interesting, we may work on it.
We took the approach of making generation domain-specific by creating a dataset for each domain. Even so, we often get jumbled, fully meaningless text, and we would like to make the output better by fine-tuning the model further. We are also working on making it multi-lingual.
Our initial tests showed that the normal GPT-2 model is not able to generate proper text in languages other than English.

GPT-2 is a deep learning model that is able to generate astonishingly coherent English text. Its creators at OpenAI were so impressed by the model's performance that they originally didn't release it, for fear of it being too easy to abuse.
I think they were right to be concerned. Here is an excerpt that the model generated, taken from their release page.
We are now very close to effectively simulating human creativity. I find machine imitation of human communication fascinating; in fact, it's something I've explored in my fiction writing previously. But since I've never worked on natural language generation or deep learning, I decided to look more closely at just what this machine could do.
My goal was to see how close I could come to impersonating a real human with algorithmically generated text and almost no manual quality control. I decided that one of the easiest places to test such a system would be in the responses to comments on the social media website, reddit. My goal became to generate a bot that would respond topically to comments, garner upvotes, and see if it can promote discussion.
In case you are worried about the ethicality of releasing a surreptitious human on reddit, rest assured I have only deployed the bot sparingly to avoid generating too much annoyance in the world. And I have manually reviewed every comment to ensure that it produced nothing too offensive.
Honestly, I was hoping I could use this tool to become a little more popular on this whole internet thing.
I've been pretty much terrible at interacting on social media, so I figured maybe I could automate the problem away. I quickly learned that just using GPT-2 on its own is not quite adequate to impersonate a human most of the time. But with a little modification, I've found that building a frighteningly passable reddit commenter is not only possible; it's pretty easy. What GPT-2's creators fail to mention is that while almost everything the model generates is grammatically and syntactically correct, only a tiny fraction of the outputs make any damn sense.
Here is another excerpt that shows just how non-human the output normally looks. When I first started experimenting, I generated a lot of similar gibberish. As it turns out, GPT2 on its own is fairly prone to getting into weird unintelligible rants.
Here are some examples. Clearly the algorithm is getting confused about how quotations work. Then there's this one, which makes grammatical sense but is clearly a series of statements that no regular person would ever say unless they were trolling. Worse still, a lot of the time GPT2 will just start repeating a few crazy phrases over and over. You can check out some of the model's output to get a taste of the kinds of things it generates in the raw. I wouldn't want to build a bot that spewed crazy-looking responses like that all the time.
It would be incredibly annoying to other redditors and would probably be flagged right away. Still, I didn't want to give up on the idea completely. I started brainstorming about ways that I could fix the performance problems with GPT2 and make it more robust, and I came up with something that was able to filter out a lot of the crap responses. To fix the problem, I borrowed an idea from another deep learning architecture called a generative adversarial network or GAN.
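In sketch form, the borrowed idea works like this: generate several candidate replies, score each with a discriminator, and keep only the most realistic-looking one. Here the scoring function stands in for a trained classifier (in my case a fine-tuned BERT-style model); its name and the threshold are illustrative.

```python
def pick_best_reply(candidates, realism_score, threshold=0.5):
    """Filter GPT-2 outputs with a discriminator, GAN-style.

    realism_score stands in for a trained classifier (e.g. a fine-tuned
    BERT model) returning P(reply looks human) in [0, 1]. Returns the
    best-scoring candidate, or None if nothing clears the threshold.
    """
    if not candidates:
        return None
    best_score, best = max((realism_score(c), c) for c in candidates)
    return best if best_score >= threshold else None
```

Unlike a true GAN, the generator and discriminator here aren't trained jointly; the discriminator just filters out the crap responses at inference time.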