Train and Predict¶
In machine-learning(ML)approaches, the outcome largely depends on the training. In case of an appropriate training dataset availability, it might be worth to perform predictions with a newly trained model. For that purpose, you need to first build a trianing dataset and train the model.
Train¶
Perform following:
- Generate training dataset (Usage Prepare training):
- The training dataset comprised of 2 classes: Positive classes as enhancers (provided by you) and negative classes which contain different genomic regions as promoter, gene-body and heterochromatin in 5:3:2 ratio approximately. Following are the input files required:
Necessary files
- Positive samples: A 3 column bed file containing positive samples (Enhancers)
- List: A text file containing a list of all the feature files and names (see /example/input/list.txt for an example at https://github.com/ShaluJhanwar/GEP/)
- Chromosome: A two column tab-delimited file with Chr and sizes
- TSS: A three column tab-delimited coding TSS file with Chr, Pos, Strand
- Exon: A three-column exon bed file
- Intron: A three-column intron bed file
- aTSS: A six column bedfile with all (coding and non-coding TSS)
Note:
For Human (hg19) and mouse (mm9), refer (/misc/ folder) for Chromosome, TSS, aTSS files. For Human (hg19), intron and exon files are provided (/misc/ folder)
Build trining data:
Perl buildTrainingData.pl –chrSize <positive_samples.txt> --l <list.txt> --gmSize <Chromosome_sizes.txt> --tss <tssFile> --gbFile <exon.bed> –inFile <intron.bed> --o <outputFolder> --aTSS <All TSS file>
Visit Prepare training for output
- Building model (Usage Train and Predict GEP)
Build a model on your own trainign data using GEP. Provide more number of cores to enable parallel processing.
Necessary files
- Output of <buildTrainingData.pl>: In above example, <outfolder/matrix.txt> file
- feature column index: index of the feature columns in <outfolder/matrix.txt> file
- label column: index of label column in <outfolder/matrix.txt> file
- Optional params
Run:
python trainAndPredictGEP.py --data-file <outfolder/matrix.txt> --output-folder outputFolder --feature-columns “<comma separated column indices>” --label-column <label column index> --percent-test-size <float value> --fold-cross-validation <int> --save-file “<filename>” --n-estimators "<comma separated n-estimators for optimization>" --max-depth-start <int> --max-depth-end <int> --n-jobs <int> --verbosity <binary value>
Note:
Training can be performed using SVM classifier. See :ref:`howitworks/tapS`
- Visit Train and Predict GEP for output
Predict¶
Perform prediction on the new model in the simialr way as using in-built models. Visit Quick-Start
- Visit Predict for output