Outputs

The outputs produced by each script are described below:

Feature selection

python featureSelection.py

  • Prints the relative importance of the features provided by GEP on the command prompt
  • Depending on the options chosen, it also plots the relative importance of the features as a bar or line graph, with or without variance (see the /example folder)

Predict

python genomewidePrediction.py

  • Provides the statistics on the command prompt

  • Predicted enhancer positions (.txt): There are four columns corresponding to “chr”, “start”, “End”, “Confidence”

  • UCSC browser file: A file (.bed) to upload on UCSC browser (default: hg19 assembly). Once uploaded in the browser, the colour corresponds to the confidence of prediction (dark blue - high confidence)

    Note: If you are working with another species, edit the 2nd line of the UCSC browser file and change db = hg19 to db = “your species”. For example, if you are working with mouse mm9, write db = “mm9” in the file.
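
    The edit described in the note can also be scripted. The following is a minimal sketch, not part of the toolkit: the function name set_ucsc_db and the sample header lines are made up for illustration, and it assumes the 2nd line contains a db setting written as db=hg19, db = hg19 or db = "hg19". Only the assembly name is replaced; spacing and quoting are left as they were.

```python
import re

# Hypothetical helper: swaps the assembly name in the db setting on line 2.
DB_PATTERN = re.compile(r'(\bdb\s*=\s*"?)\w+')

def set_ucsc_db(lines, assembly):
    """Return a copy of `lines` with the db value on the 2nd line replaced."""
    updated = list(lines)
    if len(updated) < 2:
        raise ValueError("expected at least two lines in the browser file")
    # Keep the original 'db =' prefix (group 1) and replace only the value.
    new_line, n = DB_PATTERN.subn(lambda m: m.group(1) + assembly,
                                  updated[1], count=1)
    if n == 0:
        raise ValueError("no db setting found on the 2nd line")
    updated[1] = new_line
    return updated

# Example with made-up header lines (the real file may look different).
example = [
    'browser position chr1:1-1000',
    'track name="enhancers" db=hg19 useScore=1',
    'chr1\t10\t20',
]
print(set_ucsc_db(example, "mm9")[1])
```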

Train and Predict GEP

python trainAndPredictGEP.py

  • Provides best parameters and classification report with machine-learning measures (precision, recall, F-measure) on the command prompt
  • Learning_Curve: A learning curve shows the training and validation scores for different numbers of training samples. It is used to determine whether the model would benefit from additional samples.
  • Model: Trained model on your training data
  • ROC (Receiver Operating Characteristic) Curve: ROC curve obtained on the test dataset using the trained model
  • Scaler: A scaler storing the statistics of the training data
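
The scaler is saved so that the statistics learned from the training data can be reapplied unchanged to new data at prediction time. The source does not say which kind of scaling is used; the sketch below assumes standardization (zero mean, unit variance per feature), and the function names fit_scaler and transform are illustrative only.

```python
import statistics

def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; use 1.0 for constant features
    # so that the division in transform() is always defined.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, scaler):
    """Scale rows with statistics that were learned on the training data."""
    means, stds = scaler
    return [[(x - m) / s for x, m, s in zip(row, means, stds)]
            for row in rows]

train = [[1.0, 10.0], [3.0, 10.0]]
scaler = fit_scaler(train)                 # learned once, on training data
print(transform(train, scaler))            # training data, scaled
print(transform([[5.0, 12.0]], scaler))    # new data, same statistics
```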

Train and Predict SVM

python trainAndPredictSVM.py

  • Provides best parameters and classification report with machine-learning measures (precision, recall, F-measure) on the command prompt
  • Learning_Curve: A learning curve shows the training and validation scores for different numbers of training samples. It is used to determine whether the model would benefit from additional samples.
  • Model: Trained SVM model on your training data
  • ROC (Receiver Operating Characteristic) Curve: ROC curve obtained on the test dataset using the trained model
  • Scaler: A scaler storing the statistics of the training data
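
For readers unfamiliar with ROC curves: the curve plots the true positive rate against the false positive rate as the decision threshold is lowered, and the area under it (AUC) summarizes it in one number. The following is a generic, self-contained sketch of that computation, not the script's own code; it assumes binary labels (1 = enhancer, 0 = background) with at least one sample of each class.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```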

n-fold cross-validation

python crossValidation.py

  • Measures (.txt): A text file with machine-learning measures (accuracy, precision, recall, F-measure) for random and true classification at each fold. It also provides the average of each measure over all folds
  • ROC (Receiver Operating Characteristic) Curve
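
As a reminder of what n-fold cross-validation does: the samples are divided into n folds, each fold serves once as the test set while the others are used for training, and the per-fold measures are then averaged. The sketch below illustrates only this bookkeeping. How crossValidation.py actually splits the data (for example shuffled or stratified) is not described here, so the interleaved split and the function names are assumptions.

```python
def fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold."""
    indices = list(range(n_samples))
    for k in range(n_folds):
        test = indices[k::n_folds]      # every n_folds-th sample, offset k
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

def average_measures(per_fold):
    """Average each measure (one dict per fold) over all folds."""
    keys = per_fold[0].keys()
    return {key: sum(fold[key] for fold in per_fold) / len(per_fold)
            for key in keys}

for train, test in fold_indices(6, 3):
    print("train:", train, "test:", test)

# Made-up per-fold results, only to show the averaging step.
print(average_measures([{"accuracy": 0.8}, {"accuracy": 0.9}]))
```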

Prepare genomewide prediction

perl prepare_genomeWidePrediction.pl

  • matix.txt: A 2D matrix with features as columns (in the same order as the training data) and samples as rows

Prepare training

perl buildTrainingData.pl

  • matix.txt: A 2D matrix with features as columns and samples as rows. You can use this file as input during the model building step
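
To illustrate the samples-as-rows, features-as-columns layout, here is a small sketch that separates a class-label column from the feature columns, in the spirit of the label-column and feature-columns options of validationTestSet.py. The exact file layout is not specified here, so the sketch assumes a tab-delimited file without a header line and 0-based column indices; split_matrix is a hypothetical helper, not part of the toolkit.

```python
def split_matrix(lines, label_column, feature_columns, delimiter="\t"):
    """Split matrix rows into class labels and numeric feature vectors."""
    labels, features = [], []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        labels.append(fields[label_column])
        # One row per sample, one value per selected feature column.
        features.append([float(fields[i]) for i in feature_columns])
    return labels, features

# Made-up rows: class label in column 0, two features in columns 1 and 2.
rows = ["1\t0.5\t2.0\n", "0\t0.1\t3.5\n"]
labels, features = split_matrix(rows, label_column=0, feature_columns=[1, 2])
print(labels)
print(features)
```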

Plot size distribution

Rscript plot_size_distribution.R <size_file.txt>

  • Histogram of enhancer sizes

Comparison with other enhancer set

perl enhancerDistribution.pl --eFile <enhancer bedFile> --l <list (a tab-delimited file with fileName and name of the states)> --temp <tempDir>

  • Reports the number of enhancers overlapping each of the regions provided by the user

Calculate ML measures

python calculateML_measures.py

  • Prints different measures (e.g. recall, precision, F-measure, PPV, Matthews correlation coefficient, etc.) on stdout
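
For reference, these measures can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions and is not taken from calculateML_measures.py; the function name and the example counts are made up, and it assumes no denominator is zero.

```python
import math

def ml_measures(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts."""
    precision = tp / (tp + fp)        # also known as PPV
    recall = tp / (tp + fn)           # also known as sensitivity
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient: ranges from -1 to 1.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy, "mcc": mcc}

# Made-up counts, only to show the call.
for name, value in ml_measures(tp=40, fp=10, tn=45, fn=5).items():
    print(f"{name}: {value:.3f}")
```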

Validation dataset

python validationTestSet.py --output-folder <outFolder> --label-column <"class indices"> --feature-columns <"feature indices"> --test_file "test_file.txt" --model_file <ModelFile> --scalar_file <ScalerFile> --save-file "File Prefix" --verbosity 1

  • Prints on stdout the ML measures obtained by applying the model to the test dataset
  • ROC curve

Enrichment analysis

perl votingScore.pl

  • HistoFile.txt: Gives the number of characteristic elements supporting each enhancer
  • votingTable.txt: Presence or absence of each element for each enhancer, in tabular format
  • No_validation_evidence.txt: An additional table listing the enhancers not supported by any element
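
The relationship between the three outputs can be sketched as follows: from a presence/absence table, the number of supporting elements per enhancer is a row sum, and enhancers with a sum of zero have no validation evidence. This is an illustrative sketch only. The in-memory table format, the function name and the element names (p300, DHS) are assumptions, not the file formats written by votingScore.pl.

```python
def summarize_votes(voting_table):
    """voting_table maps enhancer ID -> {element name: 0 or 1}."""
    # Number of elements supporting each enhancer (row sums).
    support = {enh: sum(flags.values())
               for enh, flags in voting_table.items()}
    # Enhancers that no element supports.
    unsupported = [enh for enh, count in support.items() if count == 0]
    return support, unsupported

# Made-up table with hypothetical element names.
table = {
    "enh1": {"p300": 1, "DHS": 1},
    "enh2": {"p300": 0, "DHS": 0},
    "enh3": {"p300": 0, "DHS": 1},
}
support, unsupported = summarize_votes(table)
print(support)
print(unsupported)
```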

Enrichment test

Rscript enrichmentAnalysis.R <enhancer bedFile> <gFile> <list> <outputFolder> <outFileName>

  • .txt: One .txt file is generated for each factor, containing the means of random overlaps
  • .pdf: One .pdf file is generated showing the distribution of the means of random overlaps