TF-binding prediction can often be achieved simply by scoring a sequence against known motifs; however, overlapping sequences remain difficult to classify accurately. This GitHub repository is the implementation of the paper: https://drive.google.com/file/d/16eQ9wBZkh2MYc5stWbwQUWuM3819rsWp/view?usp=sharing.
We set up everything with conda as shown below:
conda create -n tfbinding python=3.12
Then, you can install the requirements using:
pip install -r requirements.txt
Note: we require Python 3.12 for the pyranges1 package.
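Putting the setup together, and assuming you activate the environment before installing (conda activate is the standard way to do this), the full sequence is:
conda create -n tfbinding python=3.12
conda activate tfbinding
pip install -r requirements.txt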
We host the preprocessed data (alongside the other required files) on our Google Drive:
- Sequence information and TF regions needed for the models: https://drive.google.com/file/d/1UgGh8bTUN7pOOCwPaKt5_kgtzi7GYpJd/view
- The structural features used to enhance the model are found here: https://drive.google.com/file/d/19oz42DGXyzThQAhL74M-4sPO18xmgEuk/view?usp=sharing
- The same structural features, slightly larger so as to incorporate a 200 context window around the TF region: https://drive.google.com/file/d/1FVmIdu91k1Ggo26NnNs3C4KCil62o_PE/view?usp=sharing
Otherwise, if you wish to prepare the data yourself, follow these instructions:
- Human genome sequence data, available from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz. Move this under the data/fasta folder and unzip it (see the sketch after this list).
- Active regulatory regions in BED format. See https://en.wikipedia.org/wiki/BED_(file_format) for more information.
- A set of genomic coordinates for true transcription factor binding sites, in a txt file. This usually comes from Factorbook.
- A set of motifs for each transcription factor, encoded as a position weight matrix that suggests how likely a sequence is to be bound by a given transcription factor, also in a txt file.
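A rough sketch of the manual route is shown below; it assumes wget and tar are available, and the bed/txt file names are simply the defaults the preprocessing step expects:
# download and unpack the hg19 FASTA files into data/fasta
mkdir -p data/fasta
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar -xzf chromFa.tar.gz -C data/fasta
# then place the remaining inputs under data/, for example:
#   data/wgEncodeRegTfbsClusteredV3.GM12878.merged.bed
#   data/factorbookMotifPos.txt
#   data/factorbookMotifPwm.txt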
Make sure to place them all under the same directory. The preprocessing scripts default to the data directory; however, feel free to specify the paths explicitly:
python src/preprocess/preprocess.py --fasta_data_dir <str> --chip_seq_file <str> --true_tf_file <str> --pwm_file <str>
If you are unzipping from the Google Drive and have everything under data/fasta, data/wgEncodeRegTfbsClusteredV3.GM12878.merged.bed, data/factorbookMotifPos.txt, and data/factorbookMotifPwm.txt respectively, then you can simply run:
python src/preprocess/preprocess.py --tf <str>
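For example, to preprocess a single factor (PAX5 here is only an illustration):
python src/preprocess/preprocess.py --tf PAX5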
However, if you wish to preprocess all the TFs, you can use the following command. It preprocesses the data and writes the output to the directory specified by --output_dir (data/tf_sites by default). Warning: this will likely take a long time, as it runs over every TF by default; we suggest running the preprocessing script only on the TFs you need.
make preprocess
The currently supported structural features are:
- MGW: https://rohslab.usc.edu/ftp/hg19/hg19.MGW.wig.bw
- ProT: https://rohslab.usc.edu/ftp/hg19/hg19.ProT.wig.bw
- HelT: https://rohslab.usc.edu/ftp/hg19/hg19.HelT.wig.bw
- Roll: https://rohslab.usc.edu/ftp/hg19/hg19.Roll.wig.bw
- OC2: https://rohslab.usc.edu/ftp/hg19/hg19.OC2.wig.bw
Either download them to the same directory, where you can then further preprocess them (these files are huge) for faster training, or use the preprocessed files from https://drive.google.com/file/d/19oz42DGXyzThQAhL74M-4sPO18xmgEuk/view?usp=sharing as above. You can add compatibility for more structural features by adding to src/models/config.py.
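As a sketch, the five files can be fetched into one directory as follows; the data/bigwig location is only an illustration and should match whatever you later pass as --bigwig_dir:
mkdir -p data/bigwig
wget -P data/bigwig https://rohslab.usc.edu/ftp/hg19/hg19.MGW.wig.bw
wget -P data/bigwig https://rohslab.usc.edu/ftp/hg19/hg19.ProT.wig.bw
wget -P data/bigwig https://rohslab.usc.edu/ftp/hg19/hg19.HelT.wig.bw
wget -P data/bigwig https://rohslab.usc.edu/ftp/hg19/hg19.Roll.wig.bw
wget -P data/bigwig https://rohslab.usc.edu/ftp/hg19/hg19.OC2.wig.bw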
If you wish to preprocess your own, simply run:
python src/preprocess/preprocess.py --tf PAX5 --bigwig_dir <str> --bigwigs hg19.MGW.wig.bw hg19.HelT.wig.bw hg19.ProT.wig.bw hg19.OC2.wig.bw hg19.Roll.wig.bw --context_window <int>
The output will be written to the specified bigwig directory under <tf>/bigwig_processed. The context window flag is used if you wish to preprocess the bigwigs so that they include additional structural information before and after the TF region.
Note: these preprocessed bigwig files are not compatible with arbitrary context window lengths. To make them compatible, be sure to preprocess them beforehand using the --context_window flag.
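For example, to build bigwigs matching the 200 context window files hosted on the Drive (the data/bigwig path is the illustrative location from above):
python src/preprocess/preprocess.py --tf PAX5 --bigwig_dir data/bigwig --bigwigs hg19.MGW.wig.bw hg19.HelT.wig.bw hg19.ProT.wig.bw hg19.OC2.wig.bw hg19.Roll.wig.bw --context_window 200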
To run a training loop, simply run:
python src/main.py -c configs/<config.yaml>
Or add any of the flags found under src/models/config.py to override any values in the yaml file.
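For example, assuming the flag names mirror the config keys described below, you could override the batch size and epoch count like this (the values are illustrative):
python src/main.py -c configs/simple.yaml --batch_size 64 --epochs 20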
For training, we also use MLflow, a framework for organizing ML training runs. To access the dashboard, simply run:
mlflow server --port 5000
or run:
make mlflow
and access the dashboard through the URL shown in the terminal.
Take configs/simple.yaml as an example, where you can set the following:
- architecture: must be set. See src/models/config.py for a list of all models.
- tf: must be specified.
- preprocess_data_dir: default is data/tf_sites.
- train_split: default is 0.8.
- batch_size: default is 32.
- pwm_file: file containing the Position Weight Matrix. It should be the same file used in the preprocessing step and look like:
GFI1 8 1.000000,1.000000,0.000000,0.000000,0.626866,0.000000,0.671642,0.000000, 0.000000,0.000000,0.298507,1.000000,0.000000,0.716418,0.000000,0.000000, 0.000000,0.000000,0.000000,0.000000,0.000000,0.283582,0.000000,1.000000, 0.000000,0.000000,0.701493,0.000000,0.373134,0.000000,0.328358,0.000000,
where the tabs delimit the nucleotide probabilities.
- pred_struct_data_dir: the directory containing the BigWig files. Note these files must all be in the same directory.
- pred_struct_features: a list input for the types of structural features to include.
- <feat>_file_name: the file name for the specified structural feature. Defaults to <pred_struct_data_dir>/hg19.<pred_struct>.wig.bw.
- use_probs: whether or not to use the per-nucleotide probability scores from the position weight matrix.
- restart_train: specifies whether or not to reuse the previously found model given the exact same parametrization.
- random_seed: specifies the random seed to use, ensuring the same data splits across runs and model comparisons.
- epochs: specifies the number of epochs to train the model for.
- context_window: specifies the extra context window for the model to use. Be sure that the proper bigwig files can handle the context window.
- device: specifies the torch device to use.
- dtype: specifies the data type to use for training the MLP model.
- use_seq: specifies whether to use the one-hot encoding of the sequence itself.
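Putting these options together, a hypothetical config in the spirit of configs/simple.yaml might look like the sketch below. Every value is illustrative, and the architecture name in particular is an assumption; check src/models/config.py for the names that actually exist:
architecture: mlp                   # assumption: pick a model name from src/models/config.py
tf: PAX5
preprocess_data_dir: data/tf_sites
pwm_file: data/factorbookMotifPwm.txt
train_split: 0.8
batch_size: 32
use_probs: true
use_seq: true
pred_struct_data_dir: data/bigwig   # illustrative directory holding the BigWig files
pred_struct_features: [MGW, HelT]
context_window: 200
restart_train: false
random_seed: 42
epochs: 10
device: cpu
dtype: float32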
Many others are included in the config file itself, and are model dependent. Read src/models/config.py for more information.
We use autoformatters such as black to ensure readability. To set up the pre-commit hooks, run:
pip install pre-commit
pre-commit install