UPDATE: This project is no longer being maintained, but is being kept here for you to fork.
Visit r/SubSimGPT2Interactive and the associated Discord channel to chat with other users and find out about any evolved forks.
Minimum Python requirement: Python 3.7
Full support/developed on Ubuntu-flavor Linux.
Works on Windows with Python 3.8+ (support for SQLite3's JSON field type is required)
This is the framework for an AI/Machine Learning Reddit Chatbot.
The bot works by reading Reddit comments and then using OpenAI's GPT-2 model to generate a new comment. The AI model is smart enough to generate a new comment on the topic.
The model also uses a small Bert model (courtesy of the Detoxify package) to classify toxic and offensive content.
This framework also contains scripts and tools for fine-tuning the GPT-2 model. Fine-tuning will give your chatbot a personality and make it generate text in particular themes.
An overview of AI/Machine Learning Text Generation can be found here: https://huggingface.co/tasks/text-generation
Bots using this framework have been banned by reddit. Bot owner's main account and any other accounts using the same IP address. If you create a toxic bot that creates offensive content it's highly likely to get banned by Reddit. There is nothing Subreddit moderators can do about that.
Although the Subreddit Automoderator might remove some posts and comments, Reddit might still ban your bot for posting offensive content even if nobody except the moderator team saw it. The subreddit moderators have no control over this ban. It's very important to use the negative_keywords feature in the config to prevent bad text being posted to Reddit in the first place.
The best way to avoid getting your bot banned is to train it with safe material in the first place. Some tips for cleaning the data are:
- Choose subreddits with safe content
- Modify the output_finetuning_data script to exclude comments with offensive content
- Choose content where the context is clear and unambiguous
- Find/replace on the training output data to change phrases to safe content
- Use the negative_keywords parameter in ssi-bot ini file to stop the bot posting negative content
Choosing good training material for your bot is very important. Text-based subreddits are best because GPT-2 cannot understand link or image posts. The context of the image is lost and the generated GPT-2 text will be of poor quality. If you download link posts, you can easily exclude them from the training data by modifying the output_finetuning_data script, some code examples are available near the end of this README.md
Meme-type subreddits are often poor data sources for GPT-2. In meme subreddits, the real/funny content is often in a meme image. GPT-2 cannot read images, so the context and funnys of the meme are lost.
When you run your bot, it will use the data in a different context compared to the original source. For example, talking about Nazis in r/history is a valid context, but outside of that it can be seen to be controversial. The context is important when deciding if your bot is posting offensive content or not.
simpletransformers
An open source Python package made by Thilina Rajapakse. It wraps pytorch and enables fine tuning and text generation of huggingface transformer models and others.
Documentation: https://simpletransformers.ai/docs/installation/
peewee
A database ORM that creates Python access to the database. SQL functions and queries can be completed using Python functions. It's like SQLAlchemy but much much easier to use!
Documentation: http://docs.peewee-orm.com/en/latest/index.html
praw
A Python package to interface with Reddit's API. It streamlines a lot of the hard work of interacting with the API.
Documentation: https://praw.readthedocs.io/en/latest/
detoxify
A python package wrapper around a Bert model that can classify the toxicity of comments.
Documentation: https://pypi.org/project/detoxify/
This is a very broad overview of the workflow for creating an ssi-bot
- Decide your bot's personality and choose which subreddit to get training data from. Be aware that GPT-2 cannot understand images, so subreddits which focus on images are not really suitable. Subreddits where text is the main content are most suitable.
- Download the subreddit training data from Pushshift
- Format and output the training data into a text file
- Finetune the GPT-2 model on a GPU (Google Collaboratory or locally)
- Setup a server (or 24/7 computer) to run the reddit bot on
- Install/unzip the model in
models/
directory - Create your reddit bot on reddit and enable the API acccess
- Setup the config files
- Run the bot!
A GPT-2 model already has a very good knowledge of language and a large vocabulary.
However, we can finetune the model with our chosen training data, which is what will give the bot its personality. Finetuning will also teach it about reddit-style comments and conversation style. This is done by showing it a few dozen megabytes of existing reddit data.
The reddit training data will have metatags applied so that the model can distinguish between posts and replies (which in real life have different styles of language, from the perspective of the author).
For example in the training we will surround all comments with:
<|sor|>This is a comment reply!<|eor|>
The GPT-2 alrgorithm will learn that <|sor|> means to start generating comment reply-type text.
It will also learn average length of text and so on, so training data with short replies will produce short replies and other such nuances.
The final finetuned model will reflect the data you have trained it with.
In the model_finetuning/
folder are some scripts used to assist downloading reddit data from Pushshift (a reddit mirror) and outputting the training data.
download_reddit_training_data.py
This script downloads submission and comment JSON files from Pushshift and saves them to the hard disk. It will take a long time due to the rate limiting on Pushshift.
It then parses the JSON file and pushes it into a database.
Putting the file into a database makes it easier to filter the data (by score, exclude NSFW, etc).
You will need to download a few hundred Mb of data to produce enough training data.
output_training_data.py
This script will output all of the data from the pushshift database into two text files for finetuning.
One text file is the training data and the other is a control sample used for evaluating the fine tuning process.
Using the 124M GPT-2 model, 6-10mb of training data is preferred. With less than 6mb of data you are at risk of overfitting the model to the data and you won't get good results.
To use these scripts, copy dataset_template.ini
to dataset.ini
and configure it accordingly.
ssi-bot_finetuning_notebook.ipynb
This interactive Python notebook contains all the code and instructions for fine-tuning the GPT-2 model.
It can be uploaded to Google Colaboratory to use a free GPU, or can be run locally on Jupyter Notebook/etc if you have your own GPU.
Google Query
Some people have used Google Query to download the training data faster. You'll need to write your own script to output the data into the same structure of the output_training_data.py script.
The default script will just output all of the training data, excluding that which contains negative keywords.
By writing custom queries and joining up multiple sets of lists together we can create a custom training text file
Here are some examples of using Peewee to filter the reddit data we downloaded.
Text submissions are better for GPT-2. With link submissions, the context of the image is often lost.
# Filter only text submissions
all_submissions = list(db_Submission.select().
where(db_Submission.is_self) &
(fn.Lower(db_Submission.subreddit).in_([s.lower() for s in training_subreddits])) &
(fn.Lower(db_Submission.author).not_in([a.lower() for a in author_blacklist]))))
Subreddits that have enormous volumes of data can be filtered down using a keyword on the title.
# Filtering by one subreddit
# Excluding titles by a single keyword
all_submissions = []
filtered_submissions = list(db_Submission.select().
where((fn.Lower(db_Submission.subreddit) == 'minecraft') &
(fn.Lower(db_Submission.title).contains('java')))
all_submissions.extend(filtered_submissions)
Repetitive content makes the model and the bot repeat that content too much. We can avoid it by excluding posts that include certain keywords.
# Excluding multiple strings
import operator
from functools import reduce
all_submissions = []
exclude_title_strings = ['Weekly Thread', 'Moderator News']
# creates a list of OR filters excluding each of the title strings
# ~ means negative, to exclude all titles containing the word java
exclude_title_filters = reduce(operator.and_, [~fn.Lower(db_Submission.title).contains(s) for s in exclude_title_strings])
filtered_submissions = list(db_Submission.select().
where((fn.Lower(db_Submission.subreddit) == 'learnpython') &
(exclude_title_filters))
all_submissions.extend(filtered_submissions)
Submissions with short titles often lack context when trained with GPT-2. We can filter for longer titles with more words which will be more interesting.
# Filter all selftext posts only
# Filter by title having a minimum length of 50,
# but selftext being < 1000 characters
filtered_submissions = list(db_Submission.select().
where((fn.Lower(db_Submission.subreddit) == 'showerthoughts') &
(db_Submission.is_self) &
(fn.Length(db_Submission.title) > 50) &
(fn.Length(db_Submission.selftext) < 1000))
all_submissions.extend(filtered_submissions)
Official documentation for the peewee ORM is here: https://docs.peewee-orm.com/en/latest/peewee/querying.html#filtering-records and a full list of peewee's query operators: https://docs.peewee-orm.com/en/latest/peewee/query_operators.html
The cheapest way to finetune the model is to use Google Colaboratory, which gives free access to a GPU for periods of 8-12 hours.
A Python notebook file (ssi-bot_finetuning_notebook.ipynb
) is kept in the model_finetuning
directory.
Navigate to https://colab.research.google.com/ and click Upload. Upload the .ipynb file and then follow the instructions.
After training, the optimum trained model will be saved in the best_model
folder. Download the model and unzip it into the models/
folder of your ssi-bot project.
If you have a powerful GPU at home, you can finetune the bot on your own computer. Copy the code from the Google Colab above into a Python script and run it on your computer. (And place a pull request on Github too, so we can improve the codebase).
Although the bot is finetuned on a GPU, a CPU is sufficient for using the model to generate text.
Any modern CPU can be used; having around 4Gb of RAM or more is the main requirement.
In order to run on SubSimGPT2Interactive, we require the bot to be running 24/7. This means putting it on a VPS/server, or an old laptop in your house could suffice too.
- Install packages with
pip install -Ur requirements.txt
(Advised: Use virtualenv) To keep a terminal window open on Ubuntu Server, use an application calledtmux
ssi-bot Config file
- Copy and rename ssi-bot_template.ini to ssi-bot.ini
- Where you have section [bot_1_username], change the section to your bot's username.
- Populate bot's section with filepath to model and negative keywords you want to use. The program already includes some basic negative keywords but we suggest you add more.
Create the bot account, setup reddit app and associated PRAW Config file
- Create the bot account on reddit
- Logged in as the bot, navigate to https://www.reddit.com/prefs/apps
- Click "are you a developer? Create an app.." and complete the flow
- Copy praw_template.ini to praw.ini
- Just as you did with ssi-bot.ini, rename the [bot_1_username] section to your bot's username. The sections should match, between ssi-bot.ini and praw.ini.
- Set all the data in praw.ini as from step 3 above
Running the bot
- The bot is run by typing
python run.py