mbert-study

CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY


Preprocessing scripts

Preparation

We first describe the basic usage of our scripts.
init.sh clones the official BERT repo and creates a test_data_folder with dummy text.
preprocess_corpus.py takes in a text file and tokenizes it; an additional flag controls whether the text should be converted into a fake language (see the sketch below).
run.sh shards the text files, creates a vocabulary for them, creates BERT-readable TensorFlow records, and uploads them to Google Cloud.
create_pretraining_data_permutation.py creates pre-training data with permuted sentences; the permutation probability and method can be freely chosen.
frequency_based_shuffle.py takes in a text corpus and shuffles it such that every word is replaced by a random word drawn from its vocabulary's frequency distribution. Sketches of these transformations are shown below.
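
To our understanding, the fake language is produced by shifting every character's Unicode code point by a fixed offset, giving a corpus with English structure but no character or vocabulary overlap with real English. A minimal sketch of that idea, assuming this approach (the offset and the make_fake helper below are illustrative, not the repository's exact code):

# Sketch: derive a "fake" language by shifting Unicode code points.
# FAKE_OFFSET is illustrative; any shift into otherwise unused code points works.
FAKE_OFFSET = 0x10000
def make_fake(line):
    # Keep whitespace so tokenization is unchanged; shift everything else.
    return "".join(c if c.isspace() else chr(ord(c) + FAKE_OFFSET) for c in line)
print(make_fake("the quick brown fox"))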
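
For the permutation step, one simple method is to shuffle the words of each sentence with a chosen probability. The sketch below illustrates that behaviour and is only an assumption about what create_pretraining_data_permutation.py does; the helper name maybe_permute is ours:

# Sketch of sentence permutation (assumed behaviour, not the repository's exact code):
# with probability permute_prob, shuffle the word order of a sentence.
import random
def maybe_permute(sentence, permute_prob, rng):
    words = sentence.split()
    if rng.random() < permute_prob:
        rng.shuffle(words)
    return " ".join(words)
rng = random.Random(0)
print(maybe_permute("multilingual bert is surprisingly cross lingual", 1.0, rng))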
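
For the frequency-based shuffle, one plausible reading is that every token is replaced by a token sampled from the corpus's unigram frequency distribution, which destroys word identity and order while preserving marginal word statistics. The exact procedure in frequency_based_shuffle.py may differ; this is only a sketch:

# Sketch of a frequency-based shuffle (assumed procedure): each token is replaced
# by a token sampled from the corpus's unigram frequency distribution.
import random
from collections import Counter
def frequency_based_shuffle(lines, seed=0):
    rng = random.Random(seed)
    counts = Counter(tok for line in lines for tok in line.split())
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    return [" ".join(rng.choices(vocab, weights=weights, k=len(line.split()))) for line in lines]
print(frequency_based_shuffle(["the cat sat on the mat", "the dog sat on the rug"]))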

An example run that creates data containing English and Fake-English:

./init.sh
python preprocess_corpus.py \
    --corpus test_data_folder/raw_text/test.txt \
    --output test_data_folder/txt/en.txt
python preprocess_corpus.py \
    --corpus test_data_folder/raw_text/test.txt \
    --output test_data_folder/txt/en-fake.txt \
    --make_fake
./run.sh

run.sh requires a valid Google Cloud Storage bucket to which the data is uploaded, and it uses gsutil to copy the files to that bucket.