The pretty HTML version is here and the source for this guide is on Github.
A summary of how to set up a full Wikipedia.org mirror using three different approaches.
DEMO: https://other-wiki.zervice.io
Did you know that Wikipedia.org just runs a mostly-traditional LAMP stack on ~350 servers? (as of 2019)
Unfortunately, Wikipedia attracts lots of hate from people and nation-states who object to certain articles or want to hide information from the public eye.
Wikipedia's infrastructure (2 racks the USA, 1 in Holland, and 1 in Singapore, + CDNs) cant always stand up to large DDoS attacks, but thankfully they provide regular database dumps and static HTML archives to the public, and have permissive licensing that allows for rehosting with modification (even for profit!).
Growing up in China behind the GFC I often experienced Wikipedia unavailability, and in light of the recent DDoS I decided to make a guide for people to help demystify the process of running a mirror. I'm also a big advocate for free access to information, and I'm the maintainer of a major internet archiving project called ArchiveBox (a self-hosted internet archiver powered by headless Chromium).
This aim of this guide is to encourage people to use these publicly available dumps to host Wikipedia mirrors, so that malicious actors don't succeed in limiting public access to one of the world's best sources of information.
A full English Wikipedia.org clone in 3 steps.
DEMO: https://other-wiki.zervice.io
# 1. Download the Kiwix-Serve static binary from https://www.kiwix.org/en/downloads/kiwix-serve/
wget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64.tar.gz'
tar -xzf kiwix-tools_linux-x86_64-3.0.1.tar.gz && cd kiwix-tools_linux-x86_64-3.0.1
# 2. Download a compressed Wikipedia dump from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ (79GB, images included!)
wget --continue "https://download.kiwix.org/zim/wikipedia_en_all_maxi.zim"
# 3. Start the kiwix server, then visit http://127.0.0.1:8888
./kiwix-serve --verbose --port 8888 "$PWD/wikipedia_en_all_maxi_2018-10.zim"
Wikipedia.org itself is powered by a PHP backend called WikiMedia, using MariaDB for data storage, Varnish and Memcached for request and query caching, and ElasticSearch for full-text search. Production Wikipedia.org also runs a number of extra plugins and modules on top of MediaWiki.
π₯ There are several ways to host your own mirror of Wikipedia (with varying complexity):
- Run a caching proxy in front of Wikipedia.org (disk used on-demand for cache, low CPU use)
- Serve the static HTML ZIM archive with Kiwix (10~80GB for compressed archive, low CPU use)
- Run a full MediaWiki server (hardest to set up, ~600GB for XML & database, high CPU use)
π Don't expect it to look perfect on the first try
Setting up a Wikipidea mirror involves a complex dance between software, data, and devops, so beginners are encouraged to start with the static html archive or proxy and before attempting to run a full MediaWiki Server. Users should expect their mirrors to be able to serve articles with images and search, but should not expect it to look exactly like Wikipedia.org on the first try, or the second...
β Choosing an approach
Each method in this guide has its pros and cons. A caching proxy is the most lightweight option, but if the upstream servers go down and a request comes in that hasn't been seen before and cached it will 404, so it's not a fully redundant mirror. The static ZIM mirror is lightweight to download and host (and requests are easy to cache), it has full-text search, but it has no interactivity, talk page history, or Wikipedia-style category pages (though they are coming soon). MediaWiki/XOWA are the most complex, but they can provide a full working Wikipedia mirror complete with history revisions, users, talk pages, search, and more.
Running a full MediaWiki server is by far the hardest method to set up. Expect it to take multiple days/weeks depending on available system resources, and expect it to look fairly broken since the production Wikipedia.org team run many tweaks and plugins that take extra work to set up locally.
For more info, see the Wikipedia.org index of all dump types available, with descriptions.
Some mirrors load a page from the Wikimedia servers directly every time someone requests a page from them. They alter the text in some way, such as framing it with ads, then send it on to the reader. This is called remote loading, and it is an unacceptable use of Wikimedia server resources. Even remote loading websites with little legitimate traffic can generate significant load on our servers, due to search engine web crawlers. https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Remote_loading
Luckily, regardless of how you choose to rehost Wikipedia text, you are not breaking any terms and conditions or violating copyright law as long as you don't remove their copyright statements (however, note the article images and videos on Wikimedia.org may not be licensed for re-use).
Every contribution to the English Wikipedia has been licensed for re-use, including commercial, for-profit websites. Republication is not necessarily a breach of copyright, so long as the appropriate licenses are complied with. https://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks#Things_you_need_to_know
[TOC]
See the HTML version of this guide for the best browsing experience. See pirate/wikipedia-mirror on Github for example config source, docker-compose files, binaries, folder structure, and more.
-
Provision a server to act as your Wikipedia mirror
You can use a cheap VPS provider like DigitalOcean, Vultr, Hetzner, etc. For the static ZIM archive and MediaWiki server methods you will need significant disk space, so a home server with a cheap external HD may be a better option.
The setup examples below are based on Ubuntu 19.04 running on a home server, however they should work across many other OS's with minimal tweaking (e.g. FreeBSD, macOS, Arch, etc.).
-
Purchase a new domain or create a subdomain to host your mirror
You can use Google Domains, NameCheap, GoDaddy, etc. any registrar will work.
In the setup examples below, replace
wiki.example.com
with the domain you chose. -
Point the DNS records for the domain to your mirror server
Configure these records via your DNS provider (e.g. NameCheap, DigitalOcean, CloudFlare, etc.):
wiki.example.com
A
->your server's public ip
(the root domain)en.wiki.example.com
CNAME
->wiki.example.com
(the wiki domain)upload.wiki.example.com
CNAME
->wiki.example.com
(the uploads/media domain)
-
Create a directory to store the project, and a dotenv file for your config options
Not all of these values are needed for all the methods, but it's easier to just define all of them in one place and remove things later that turn out to be unneeded.
mkdir -p /opt/wiki # change PROJECT_DIR below to match nano /opt/wiki/.env
Create the
.env
config file indotenv
/bash
syntax with the contents below. Make sure to replace the example values likewiki.example.com
with your own.PROJECT_DIR="/opt/wiki" # folder for all project state CONFIG_DIR="$PROJECT_DIR/etc/nginx" CACHE_DIR="$PROJECT_DIR/data/cache" CERTS_DIR="$PROJECT_DIR/data/certs" LOGS_DIR="$PROJECT_DIR/data/logs" LANG="en" # Wikipedia language to mirror LISTEN_PORT_HTTP="80" # public-facing HTTP port to bind LISTEN_PORT_HTTPS="443" # public-facing HTTPS port to bind LISTEN_HOST="wiki.example.com" # root domain to listen on LISTEN_WIKI="$LANG.$LISTEN_HOST" # wiki domain to listen on LISTEN_MEDIA="upload.$LISTEN_HOST" # uploads domain to listen on UPSTREAM_HOST="wikipedia.org" # main upstream domain UPSTREAM_WIKI="$LANG.$UPSTREAM_HOST" # upstream domain for wiki UPSTREAM_MEDIA="upload.wikimedia.org" # upstream domain for uploads # Only needed if using an nginx reverse proxy: SSL_CRT="$CERTS_DIR/$LISTEN_HOST.crt" SSL_KEY="$CERTS_DIR/$LISTEN_HOST.key" SSL_DH="$CERTS_DIR/$LISTEN_HOST.dh" CACHE_SIZE="100G" # or "500GB", "1GB", "200MB", etc. CACHE_REQUESTS="GET HEAD POST" # or "GET HEAD", "any", etc. CACHE_RESPONSES="200 206 302" # or "200 302 404", "any", etc. CACHE_DURATION="max" # or "1d", "30m", "12h", etc. ACCESS_LOG="'$LOGS_DIR/nginx.out' trace" # or "off", etc. ERROR_LOG="'$LOGS_DIR/nginx.err' warn" # or "off", etc.
The setup steps below depend on this file existing and the config values being correct, so make sure you create it and replace all example values with your own before proceeding!
- https://download.kiwix.org/zim/wikipedia/ (for BitTorrent add
.torrent
to the end of any.zim
url) - https://en.wikipedia.org/wiki/MediaWiki
- https://www.mediawiki.org/wiki/MediaWiki
- https://www.mediawiki.org/wiki/Download
- https://www.wikidata.org/wiki/Wikidata:Database_download
- https://dumps.wikimedia.org/backup-index.html
Wikipedia HTML dumps are provided in a highly-compressed web-archiving format called ZIM. They can be served using a ZIM server like Kiwix (the most common one), or ZimReader, GoZIM, & others.
- Kiwix.org full ZIM archive list or Kiwix.org Wikipedia-specific ZIM archive list
- Wikimedia.org ZIM archive list
- List of ZIM BitTorrent links
ZIM archive dumps are usually published yearly, but the release schedule is not guaranteed. As of August 2019 the latest available dump containing all English articles is from October 2018:
wikipedia_en_all_mini_2019-09.zim
(torrent) (10GB, mini English articles, no pictures or video)
wikipedia_en_all_nopic_2018-09.zim
(torrent) (35GB, all English articles, no pictures or video)
wikipedia_en_all_maxi_2018-10.zim
(torrent) (79GB, all English articles w/ pictures, no video)
wikipedia_en_simple_all_maxi_2020-01.zim
(1.6GB, SimpleWiki English only, good for testing)
Download your chosen Wikipedia ZIM archive (e.g. wikipedia_en_all_maxi_2018-10.zim
)
mkdir -p /opt/wiki/data/dumps && cd /opt/wiki/data/dumps
# Download via BitTorrent:
transmission-cli --download-dir . 'magnet:?xt=urn:btih:O2F3E2JKCEEBCULFP2E2MRUGEVFEIHZW'
# Or download via HTTPS from one of the mirrors:
wget -c 'https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'
wget -c 'https://ftpmirror.your.org/pub/kiwix/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'
wget -c 'https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2018-10.zim'
# Optionally after download, verify the length (fast) or MD5 checksum (slow):
stat --printf="%s" wikipedia_en_all_maxi_2018-10.zim | grep 83853668638
md5sum wikipedia_en_all_maxi_2018-10.zim | openssl dgst -md5 -binary | openssl enc -base64 | grep 01eMQki29P9vD5F2h6zWwQ
Database dumps are usually published monthly. As of August 2019, the latest dump containing all English articles is from July 2019:
enwiki-20190720-pages-articles.xml.bz2
(15GB, all English articles, no pictures/videos)
simplewiki-20170820-pages-meta-current.xml.bz2
(180MB, SimpleWiki only, good for testing)
Download your chosen Wikipedia XML dump (e.g. enwiki-20190720-pages-articles.xml.bz2
)
mkdir -p /opt/wiki/data/dumps && cd /opt/wiki/data/dumps
# Download via BitTorrent:
transmission-cli --download-dir . 'magnet:?xl=16321006399&dn=enwiki-20190720-pages-articles.xml.bz2'
# Download via HTTP:
# lol no. no one wants to serve you a 15GB file via HTTP
Complexity: Low
Minimal setup and operations requirements, no download of large dumps needed.
Disk space requirements: On-Demand
Disk is only used as pages are requested (can be 1gb up to 2TB+ depending on usage).
CPU requirements: Very Low
Lowest out of the three options, can be run on a tiny VPS or home-server.
Content freshness: Very Fresh
Configurable to cache content indefinitely or pull fresh data for every request.
Set the following options in your /opt/wiki/.env
config file:
UPSTREAM_HOST=wikipedia.org
UPSTREAM_WIKI=en.wikipedia.org
UPSTREAM_MEDIA=upload.wikimedia.org
Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx.
Then restart nginx to apply your config with systemctl restart nginx
.
Your mirror should now be running and proxying requests to Wikipedia.org!
Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).
Alternatively, check out a similar setup that uses Caddy instead of Nginx as the reverse proxy: https://github.com/CristianCantoro/wikiproxy
Complexity: Moderate
Static binary makes it easy to run, but it requires downloading a large dump file.
Disk space requirements: >80GB
The ZIM archive is a highly-compressed collection of static HTML articles only.
CPU requirements: Very Low
Low, especially with a CDN in front (more than a proxy, but less than a full server).
Content freshness: Often Stale
ZIM archives are published yearly (ish) by Wikipedia.org.
First download a ZIM archive dump like wikipedia_en_all_maxi_2018-10.zim
into /opt/wiki/data/dumps
as described above.
Run kiwix-serve
with docker like so:
docker run \
-v '/opt/wiki/data/dumps:/data' \
-p 8888:80 \
kiwix/kiwix-serve \
'wikipedia_en_all_maxi_2018-10.zim'
Or create /opt/wiki/docker-compose.yml
and run docker-compose up
:
version: '3'
services:
kiwix:
image: kiwix/kiwix-serve
command: 'wikipedia_en_all_maxi_2018-10.zim'
ports:
- '8888:80'
volumes:
- "./data/dumps:/data"
-
Download the latest
kiwix-serve
binary for your OS & CPU architectureFind the latest release for your architecture here and copy its URL to download it below: https://download.kiwix.org/release/kiwix-tools/
cd /opt/wiki wget 'https://download.kiwix.org/release/kiwix-tools/kiwix-tools_linux-x86_64-3.0.1.tar.gz' tar -xzf 'kiwix-tools_linux-x86_64-3.0.1.tar.gz' mv 'kiwix-tools_linux-x86_64-3.0.1' 'bin'
-
Run
kiwix-serve
, passing it a port to listen on and your ZIM archive file/opt/wiki/bin/kiwix-serve --port 8888 /opt/wiki/data/dumps/wikipedia_en_all_maxi_2018-10.zim
Your server should now be running!
Visit http://en.yourdomainhere.com:8888 to see it in action!
Set the following options in your /opt/wiki/.env
config file:
UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org
Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx. To run nginx inside docker-compose next to Kiwix, see the Run Nginx via docker-compose section below.
Your mirror should now be running and proxying requests to kiwix-serve
!
Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).
Complexity: Very High
Complex multi-component setup with an intricate setup process and high resource use.
Disk space requirements: >550GB (>2TB needed for import phase)
The uncompressed database is very large (multiple TB with revision history and stubs).
CPU requirements: Moderate (very high during import phase)
Depends on usage, but it's the most demanding out of the 3 options.
Content freshness: Very fresh
Udpated database dumps are published monthly (ish) by Wikipedia.org.
First download a database dump like enwiki-20190720-pages-articles.xml.bz2
into /opt/wiki/data/dumps
as described above.
If you need to decompress it, pbzip2
is much faster than bzip2
:
pbzip2 -v -d -k -m10000 enwiki-20190720-pages-articles.xml.bz2
# -m10000 tells it to use 10GB of RAM, adjust accordingly
https://github.com/QuantumObject/docker-xowa
docker run \
-v /opt/wiki/data/xowa:/opt/xowa/ \
-p 8888 \
sblop/xowa_offline_wikipedia
version: '3'
services:
xowa:
image: sblop/xowa_offline_wikipedia
ports:
- 8888:80
volumes:
- './data/xowa:/opt/xowa'
- https://hub.docker.com/_/mediawiki
- https://github.com/wikimedia/mediawiki-docker
- https://github.com/AirHelp/mediawiki-docker
- https://en.wikipedia.org/wiki/MediaWiki
- https://www.mediawiki.org/wiki/MediaWiki
- https://www.mediawiki.org/wiki/Download
- https://www.wikidata.org/wiki/Wikidata:Database_download
- https://dumps.wikimedia.org/backup-index.html
Configure your docker-compose.yml
file
Default MediaWiki config file: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/DefaultSettings.php
Create the following /opt/wiki/docker-compose.yml
file then run docker-compose up
:
version: '3'
services:
database:
image: mariadb
command: --max-allowed-packet=256M
environment:
MYSQL_DATABASE: wikipedia
MYSQL_USER: wikipedia
MYSQL_PASSWORD: wikipedia
MYSQL_ROOT_PASSWORD: wikipedia
mediawiki:
image: mediawiki
ports:
- 8080:80
depends_on:
- database
volumes:
- './data/html:/var/www/html'
# After initial setup, download LocalSettings.php into ./data/html
# and uncomment the following line, then docker-compose restart
# - ./LocalSettings.php:/var/www/html/LocalSettings.php
Then import the XML dump into the MediaWiki database:
- https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
- https://hub.docker.com/r/ueland/mwdumper/
- https://www.mail-archive.com/[email protected]/msg02108.html
Do not attempt to import it directly with importDump.php
, it will take months:
php /var/www/html/maintenance/importDump.php enwiki-20170320-pages-articles-multistream.xml
Instead, convert the XML dump into compressed chunks of SQL then import individually:
Warning: For large imports (e.g. English) this process can still take 5+ days depending on the system.
apt install -y openjdk-8-jre zstd pbzip2
# Download patched mwdumper version and pre/post import SQL scripts
wget "https://github.com/pirate/wikipedia-mirror/raw/master/bin/mwdumper-1.26.jar"
wget "https://github.com/pirate/wikipedia-mirror/raw/master/preimport.sql"
wget "https://github.com/pirate/wikipedia-mirror/raw/master/postimport.sql"
DUMP_NAME="enwiki-20190720-pages-articles"
# Decompress the XML dump using all available cores and 10GB of memory
pbzip2 -v -d -k -m10000 "$DUMP.xml.bz2"
# Convert the XML file into a SQL file using mwdumper
java -server \
-jar ./wikipedia-importing-tools/mwdumper-1.26.jar \
--format=sql:1.5 \
"$DUMP.xml" \
> wikipedia.sql
# Split the generated SQL file into compressed chunks
split --additional-suffix=".sql" --lines=1000 wikipedia.sql
for partial in $(ls *.sql); do
zstd -z $partial
done
# Fix a schema issue that may otherwise cause import bugs
docker-compose exec database \
mysql --user=wikipedia --password=wikipedia --database=wikipedia \
"ALTER TABLE page ADD page_counter bigint unsigned NOT NULL default 0;"
# Import the compressed chunks into the database
for partial in $(ls *.sql.zst); do
zstd -dc preimport.sql.zst $partial postimport.sql.zst \
| docker-compose exec database \
mysql --force --user=wikipedia --password=wikipedia --database=wikipedia
done
Credit for these steps goes to https://github.com/wayneworkman/wikipedia-importing-tools.
Set the following options in your /opt/wiki/.env
config file:
UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org
Then run all the setup steps below under Nginx Reverse Proxy to set up Nginx. To run nginx inside docker-compose next to MediaWiki, see the Run Nginx via docker-compose section below.
Your mirror should now be running and proxying requests to your wiki server!
Visit https://en.yourdomainhere.com to see it in action (e.g. https://en.wiki.example.com).
You can optionally set up an Nginx reverse proxy in front of kiwix-serve
, Wikipedia.org
, or a MediaWiki
server to add caching and HTTPS support.
Make sure the options in /opt/wiki/.env
are configured correctly for the type of setup you're trying to achieve.
- To run nginx in front of
kiwix-serve
on localhost, set:UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org
- To run nginx in front of Wikipedia.org, set:
UPSTREAM_HOST=wikipedia.org
UPSTREAM_WIKI=en.wikipedia.org
UPSTREAM_MEDIA=upload.wikimedia.org
- To run nginx in front of a MediaWiki server on localhost, set:
UPSTREAM_HOST=localhost:8888
UPSTREAM_WIKI=localhost:8888
UPSTREAM_MEDIA=upload.wikimedia.org
- To run nginx in front of a docker container via docker-compose: See Run Nginx via docker-compose section below.
# Install the dependencies: nginx and certbot
add-apt-repository -y -n universe
add-apt-repository -y -n ppa:certbot/certbot
add-apt-repository -y -n ppa:nginx/stable
apt update -qq
apt install -y nginx-extras certbot python3-certbot-nginx
systemctl enable nginx
systemctl start nginx
# Load your config values from step 4 into the environment, and create dirs
source /opt/wiki/.env
mkdir -p "$CONFIG_DIR" "$CACHE_DIR" "$CERTS_DIR" "$LOGS_DIR"
# Get an SSL certificate and generate the Diffie-Hellman parameters file
certbot certonly \
--nginx \
--agree-tos \
--non-interactive \
-m "ssl@$LISTEN_HOST" \
--domain "$LISTEN_HOST,$LISTEN_WIKI,$LISTEN_MEDIA"
openssl dhparam -out "$PROJECT_DIR/data/certs/$DOMAIN.dh" 2048
# Link the certs into your project directory
ln -s /etc/letsencrypt/live/$DOMAIN/fullchain.pem $PROJECT_DIR/data/certs/$DOMAIN.crt
ln -s /etc/letsencrypt/live/$DOMAIN/privkey.pem $PROJECT_DIR/data/certs/$DOMAIN.key
LetsEncrypt certs must be renewed every 90 days or they'll expire and you'll get "Invalid Certificate" errors. To have certs automatically renewed periodically, add a systemd timer or cron job to run certbot renew
. Here's an example tutorial on how to do that:
https://gregchapple.com/2018/02/16/auto-renew-lets-encrypt-certs-with-systemd-timers/
# Load your config options into the environment
source /opt/wiki/.env
# Download the nginx config template
curl --silent \
"https://github.com/pirate/wikipedia-mirror/raw/master/etc/nginx/nginx.conf.template" \
> "$CONFIG_DIR/nginx.conf.template"
# Fill your config options into nginx.conf.template to create nginx.conf
envsubst \
"$(printf '${%s} ' $(bash -c "compgen -A variable"))"\
< "$CONFIG_DIR/nginx.conf.template" \
> "$CONFIG_DIR/nginx.conf"
# Link the your nginx.conf into the system's default nginx config location
ln -s -f "$CONFIG_DIR/nginx.conf" "/etc/nginx/nginx.conf"
# Restart nginx to load the new config
systemctl restart nginx
Now you can visit https://en.yourdomainhere.com to see it in action with HTTPS!
For troubleshooting, you can find the nginx logs here:
/opt/wiki/data/logs/nginx.err
/opt/wiki/data/logs/nginx.out
Set the config values in your /opt/wiki/.env
file to correspond to the docker container's hostname that you want to proxy, and tweak the directory paths to be the paths inside the container. e.g. for mediawiki
:
UPSTREAM_HOST=mediawiki:8888`
UPSTREAM_WIKI=mediawiki:8888`
UPSTREAM_MEDIA=upload.wikimedia.org
CERTS_DIR=/certs
CACHE_DIR=/cache
LOGS_DIR=/logs
Then regenerate your nginx.conf
file with envsubst
as described in Nginx Reverse Proxy below.
Then add the nginx
service to your existing /opt/wiki/docker-compose.yml
file:
version: '3'
services:
...
nginx:
image: nginx:latest
volumes:
- ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
- ./data/certs:/certs
- ./data/cache:/cache
- ./data/logs:/logs
ports:
- 80:80
- 443:443
- https://github.com/openzim/mwoffliner (archiving only, no serving)
- https://www.yunqa.de/delphi/products/wikitaxi/index (Windows only)
- https://www.nongnu.org/wp-mirror/ (last updated in 2014, Dockerfile)
- https://github.com/dustin/go-wikiparse
- https://www.learn4master.com/tools/python-and-java-libraries-to-parse-wikipedia-dump-dataset
- https://dkpro.github.io/dkpro-jwpl/
- https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
- https://meta.wikimedia.org/wiki/Data_dumps/Import_examples#Import_into_an_empty_wiki_of_a_subset_of_en_wikipedia_on_Linux_with_MySQL
- https://github.com/shimondoodkin/wikipedia-dump-import-script/blob/master/example-result.sh
- https://github.com/wayneworkman/wikipedia-importing-tools
- https://github.com/chrisbo246/mediawiki-loader
- https://dzone.com/articles/how-clone-wikipedia-and-index
- https://www.xarg.org/2016/06/importing-entire-wikipedia-into-mysql/
- https://dengruo.com/blog/running-mediawiki-your-own-copy-restore-whole-mediwiki-backup
- https://brionv.com/log/2007/10/02/wiki-data-dumps/
- https://www.evanjones.ca/software/wikipedia2text.html
- https://lists.gt.net/wiki/wikitech/160482
- https://helpful.knobs-dials.com/index.php/Harvesting_wikipedia
- https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community