GitHub - btjaden/TargetExpression: Improving Prediction of Bacterial sRNA Regulatory Targets with Expression Data

==========

Identifying Targets of sRNA Regulation

`analyze_dataset.py`

analyze_dataset.py trains a machine learning model on a provided dataset and reports the model's performance at identifying bacterial sRNA regulatory targets. The data folder contains several example datasets for E. coli and Salmonella.

EXAMPLE USAGE: python analyze_dataset.py data/training_Salmonella.csv data/testing_Salmonella.csv

As input, analyze_dataset.py requires a .csv file of training data (for training the machine learning model) and a .csv file of testing data (for evaluating the model's performance in identifying targets of sRNA regulation). analyze_dataset.py performs the following steps:

1. Read in the training data and testing data
2. Undersample the majority class
3. Scale the data so that the values for each feature have zero mean and unit variance
4. For each feature, compute the mutual information, the ANOVA F-statistic, and the corresponding p-value between the feature and the class label (interaction or non-interaction)
5. Train two Gradient Boosting Classifiers: one using 9 features (no expression data) and one using 15 features (the same 9 plus 6 features derived from expression data)
6. Report the performance (sensitivity, false positive rate, area under the ROC curve) of both classifiers on the testing data
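To make the undersampling, scaling, and reporting steps concrete, here is a minimal standard-library sketch. It is not the code in analyze_dataset.py: the function names, the 1:1 class balance, fitting the scaling statistics on the training data only, and the 0.5 decision threshold are all assumptions made for illustration, and the classifier itself is left out.

```python
import random
from statistics import mean, pstdev


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    The 1:1 ratio is an assumption; the script may balance differently.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((pos, neg), key=len)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def standardize(train_rows, test_rows):
    """Scale each feature to zero mean and unit variance.

    Means and standard deviations come from the training rows only
    (an assumption) and are then applied to both sets.
    """
    columns = list(zip(*train_rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # constant feature: avoid /0

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]

    return scale(train_rows), scale(test_rows)


def evaluate(labels, scores, threshold=0.5):
    """Return (sensitivity, false positive rate, area under the ROC curve)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    sensitivity = sum(s >= threshold for s in pos) / len(pos)
    fpr = sum(s >= threshold for s in neg) / len(neg)
    # AUC equals the probability that a random positive outscores a random
    # negative, with ties counted as one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return sensitivity, fpr, wins / (len(pos) * len(neg))
```

Here labels are 1 for interactions and 0 for non-interactions, and scores stand for the predicted probabilities a trained classifier would assign to the testing rows.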

==========

Creating a Custom Dataset

`ICA.py` and `calculate_feature_values.py`

To run analyze_dataset.py, you need a dataset describing candidate sRNA and target interactions. Example datasets are provided in the data folder. To create your own dataset, start with a .csv file of normalized log TPM values obtained from a set of genome-wide expression experiments, e.g., a set of RNA-seq experiments (example files are provided in the data folder). The rows of the .csv file correspond to genes and the columns correspond to experiments, so each entry is the normalized log TPM value of one gene in one experiment. ICA.py performs Independent Component Analysis (ICA) on this matrix, writing the source matrix to the file data/S.csv and the mixing matrix to the file data/A.csv.

EXAMPLE USAGE: python ICA.py data/TPM.csv
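
For orientation, ICA is commonly written as the factorization below. The symbols n (genes), m (experiments), and k (independent components) are introduced here for illustration only, and the orientation shown is an assumption: ICA.py may center the data or store either matrix transposed.

$$
X \approx S A, \qquad X \in \mathbb{R}^{n \times m}, \quad S \in \mathbb{R}^{n \times k}, \quad A \in \mathbb{R}^{k \times m}
$$

Under this reading, X is the expression matrix, each column of the source matrix S assigns every gene a weight in one component, and the matching row of the mixing matrix A gives that component's activity in each experiment.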

Once ICA.py has computed the source matrix, the program calculate_feature_values.py calculates feature values for candidate sRNA and target interactions from it.

EXAMPLE USAGE: python calculate_feature_values.py Salmonella

calculate_feature_values.py outputs a dataset as a .csv file that can be used as input to analyze_dataset.py to identify regulatory interactions between sRNAs and targets.
