Usage

Help usage of different functionalities of the package is shown here:

Feature selection

python featureSelection.py –h

Usage: featureSelection.py [options]

Options:
-h, --help            show this help message and exit
--data-file=DATA_FILE
            A tab delimited file: 1st column as rownames, 2nd column as class and additional columns as tab-
            delimited features: see example folder for test file
--output-folder=OUTPUT_FOLDER
                An output folder: Default is ../output
--label-column=LABEL_COLUMN
                Column index containing class of instances: Default: 1 (2nd column)
--feature-columns=DATA_COLUMNS
                Column index containing features in data-file
--plot-bar-without-std=PLOT_BAR_WITHOUT_STD
                Bar plot without std deviation: Default:on
--plot-bar-with-std=PLOT_BAR_WITH_STD
                Bar plot with std deviation: Default:off, provide 1 value to turn on
--plot-line=PLOT_LINE
                Line plot: Default off: provide 1 value to turn on
--n-estimators=N_ESTIMATORS
                n_estimator: Default=100
--verbosity=VERBOSITY
            Verbosity: Default on: provide 0 to turn off

Predict

python genomewidePrediction.py –h

Usage: genomewidePrediction.py [required: --data-file <Filename> --feature-columns <feature column indices> --label-column <label column>] [optional: --output-folder <Output_folder> --fold-cross-validation <no. of cross-validation-folds> --save-file <output-filename> --n-estimators <value> --max-depth <value>]

Options:
-h, --help            show this help message and exit
--output-folder=OUTPUT_FOLDER
                An output folder: Default is ../output
--feature-columns=DATA_COLUMNS
                  Column index containing features in data-file
--fold-cross-validation=FOLD_CROSS_VALIDATION
                        n-fold: Default:10
--save-file=SAVE_FILE
                        Output filename: Deafult: output_file
--n-jobs=N_JOBS         No. of CPUs <value>
--genome_file=GENOME_FILE
                        Genome-wide file: 1st column as rownames and additional columns as tab-delimite features: see example folder for test file
--model_file=MODEL_FILE
                        A file containing model
--scalar_file=SCALAR_FILE
                        A file containing scaled training data in pkl format:
                        see example folder for help
--verbosity=VERBOSITY
                        Verbosity: Default on: provide 0 to turn off

Train and Predict GEP

python trainAndPredictGEP.py –h

Usage: trainAndPredictGEP.py [required: --data-file <Filename> --feature-columns <feature column indices> --label-column <label column>] [optional: --output-folder <Output_folder> --fold-cross-validation <no. of cross-validation-folds> --percent-test-size <float> --save-file <output-filename> --n-estimators <value> --max-depth-start <value> --max-depth-end <value>--n-jobs <int>]

Options:
-h, --help            show this help message and exit
--data-file=DATA_FILE
            A tab delimited file: 1st column as rownames, 2nd column as class and additional columns as tab-delimited features: see example folder for test file
--output-folder=OUTPUT_FOLDER
                An output folder: Default is ../output
--label-column=LABEL_COLUMN
                Column index containing class of instances: Default: 1 (2nd column)
--feature-columns=DATA_COLUMNS
                Column index containing features in data-file
--percent-test-size=PERCENT_TEST_SIZE
                    Size of the test dataset in percentage: Default: 20
--fold-cross-validation=FOLD_CROSS_VALIDATION
                        n-fold: Default:10
--save-file=SAVE_FILE
            Output filename: Deafult: output_file
--n-estimators=N_ESTIMATORS
                n_estimator list: Default: [10, 100, 1000]
--max-depth-start=MAX_DEPTH_START
                max-depth start: Default:5
--max-depth-end=MAX_DEPTH_END
                max-depth end: Default: 10
--n-jobs=N_JOBS       no. of cores: Default:1
--verbosity=VERBOSITY
            Verbosity: Default on: provide 0 to turn off

Train and Predict SVM

python trainAndPredictSVM.py –h

Usage: trainAndPredictSVM.py [required: --data-file <Filename> --feature-columns <feature column indices> --label-column <label column>] [optional: --output-folder <Output_folder> --fold-cross-validation <no. of cross-validation-folds> --percent-test-size <float> --save-file <output-filename> --n-estimators <value> --max-depth-start <value> --max-depth-end <value>--n-jobs <int>]

Options:
-h, --help            show this help message and exit
--data-file=DATA_FILE
            A tab delimited file: 1st column as rownames, 2nd column as class and additional columns as tab-
            delimited features: see example folder for test file
--output-folder=OUTPUT_FOLDER
                An output folder: Default is ../output
--label-column=LABEL_COLUMN
                Column index containing class of instances: Default: 1 (2nd column)
--feature-columns=DATA_COLUMNS
                Column index containing features in data-file
--percent-test-size=PERCENT_TEST_SIZE
                Size of the test dataset in percentage: Default: 20
--fold-cross-validation=FOLD_CROSS_VALIDATION
                n-fold: Default:10
--save-file=SAVE_FILE
            Output filename: Deafult: output_file
--SVM_C_min=SVM_C_MIN
            C <power of 10>: Default: -2 == 0.01
--SVM_C_max=SVM_C_MAX
            C <power of 10>: Default: 9 == 1000000000.0
--SVM_gamma_min=SVM_GAMMA_MIN
            gamma: <power of 10> default: -4 == 0.0001
--SVM_gamma_max=SVM_GAMMA_MAX
            gamma: <power of 10> default: 5 == 100000.0
--n-jobs=N_JOBS       no. of CPUs: Default:10
--verbosity=VERBOSITY
            Verbosity: Default on: provide 0 to turn off

n-fold cross-validation

python crossValidation.py –h

Usage:

python crossValidation.py [required: --data-file <Filename> --feature-columns <feature column indices> --label-column <label column>] [optional: --output-folder <Output_folder> --fold-cross-validation <no. of cross-validation-folds> --save-file <output-filename> --n-estimators <value> --max-depth <value>]

Options:
-h, --help            show this help message and exit
--data-file=DATA_FILE
            A tab delimited file: 1st column as rownames, 2nd column as class and additional columns as tab-
            delimited features: see example folder for test file
--output-folder=OUTPUT_FOLDER
                An output folder: Default is ../output
--label-column=LABEL_COLUMN
                Column index containing class of instances: Default: 1 (2nd column)
--feature-columns=DATA_COLUMNS
                Column index containing features in data-file
--fold-cross-validation=FOLD_CROSS_VALIDATION
                n-fold: Default:10
--save-file=SAVE_FILE
                Output filename: Deafult: output_file
--RF_n-estimators=RF_N_ESTIMATORS
                RF_n_estimator list: Default: 100
--RF_max-depth=RF_MAX_DEPTH
                RF_max_depth=<value>: Default:5
--n-jobs=N_JOBS       no. of cores: Default:10
--SVM_C=SVM_C         SVM_C <power of 10>: Default: 8 == 100000000.0
--SVM_gamma=SVM_GAMMA
            SVM_gamma: <power of 10> default: -2 == 0.01
--method=METHOD       Method: 'RF': Random Forest, 'SVM': Support Vector
                        Machine: Default: RF
--verbosity=VERBOSITY
                Verbosity: Default on: provide 0 to turn off

Prepare genomewide prediction

perl prepare_genomeWidePrediction.pl –h

Description: Prepare genome to perform prediction using GEP
System requirements:
Perl:
Module - Cwd
bedtools - Assumed it in the path

Usage:

perl prepare_genomeWidePrediction.pl --l  FeatureFileList --gmSize <ChromosomeSize.txt> --tss <A three column file containing TSS to exclude from genome> --aTSS <A six column bed file containing all coding and non-coding TSS> --active <active histones bedFile> --o <output_folder> <optional parameters>

### Required parameters:

--l | --listFeatureFile                     <A tab delimited file containing the name of the files (along with the path) and the name of the feature to be displayed>

--gmSize | --genomeSizeFile         <A tab delimited file containing chromosome name and its size>
For Human hg19: Hg19_ChromosomeSize.txt
For Mouse mm9: mm9_ChromosomeSize.txt

--tss | --tssFile                   <A three column file containing TSS>

--active | --activeRegionFiles              <Active region bed files containing three regions: chrName, start and end

--aTSS | --allTssFile                       <A six column bed file containing all coding and non-coding TSS>
                                For Mouse mm9: Please mention "Mouse_gencode.vM1_tss_coding_non-coding_6_column.bed" for annotation from gencode.vM1
                                For Human hg19: Please mention "Human_gencode.v19_tss_coding_non-coding_6_column.bed" for annotation from gencode.v19

### Optional parameters:

--f | --fractionOverlap                     <Fraction cut-off of the bin required to overlap with the feature in order to consider the signal in that bin>

--h | --help                                <Print help usage>

--o | --outDir                              <output_folder: All the output files will be saved in the output folder>
                            default output folder:current folder/output_folder

--bin                                       <Bin size in bp: default is 500>


This script was last edited on 29th July 2015.

Prepare training

perl buildTrainingData.pl –h

Description: Form training datatset of positive and negative samples in 1:1 ratio

System requirements:
Perl:
Module - Cwd
bedtools - Assumed it in the path

Usage:
perl buildTrainingData.pl --chrSize <pos_samples.bed> --gmSize <ChromosomeSize.txt> --l FeatureFileList --tss <tssFile> --gbFile <exonBed> --inFile <intronBed> --aTSS <A six column bed file containing all coding and non-coding TSS> <optional parameters>


### Required parameters:
--chrSize | --chrSizeFile           <A tab delimited file of positive samples containing chrName, start and end>

--l | --listFeatureFile                     <A tab delimited file containing 2 columns: i) the name of the files (along             with the path) ii) the name of the feature to be displayed>

--gmSize | --genomeSizeFile         <A tab delimited file containing chromosome name and sizes>
                                For Human hg19: Hg19_ChromosomeSize.txt
                                For Mouse mm9: mm9_ChromosomeSize.txt

--tss | --tssFile               <A three column: <chrom><txStart><strand> tab delimited file containing TSS
                                corresponding to protein coding genes>
                                    For Mouse mm9 gencode.vM1 annotation, please mention: "Mouse_gencode.vM1_tss_coding.bed"
                                    For Human hg19: Please mention "Human_gencode.v19_tss_coding.bed"

--gbFile | --geneBodyFile           <A three column bed file containing all the exons information>
                                For Human hg19: Human_gencode.v19_exon_Protein_coding.bed

--inFile | --intronFile                     <A three column bed file containing all the introns information>
                                For Human hg19: Please mention: Human_gencode.v19_intron_Protein_coding.bed

--aTSS | --allTssFile                       <A six column bed file containing all coding and non-coding TSS>
                                Already preprocessed Files provided with the package are:
                                For Mouse mm9: Please mention "Mouse_gencode.vM1_tss_coding_non-coding_6_column.bed" for annotation from gencode.vM1
                                For Human hg19: Please mention "Human_gencode.v19_tss_coding_non-coding_6_column.bed" for annotation from gencode.v19

### Optional parameters:

--f | --fractionOverlap                     <Fraction cut-off of the bin required to overlap with the feature in order to consider the signal in that bin>

--h | --help                                <Print help usage>

--o | --outDir                              <output_folder: All the output files will be saved in the output folder>
                            default output folder:current folder/output_folder
--bin                                       <Bin size in bp: default is 200>