Python Modules

You will need to install the following modules:

pip install biopython
pip install cupy-cuda12x
pip install fair-esm
pip install logomaker
pip install matplotlib
pip install numpy
pip install openpyxl
pip install pandas
pip install playsound
pip install PyObjC
pip install seaborn
pip install scikit-learn
pip install torch
pip install wordcloud
pip install xgboost

For GPU acceleration on an NVIDIA GPU, you will need to install a different version of PyTorch:
- NOTE: Ignore this section if your computer does not have an NVIDIA graphics card.
If you already ran the previous PyTorch installation, you will need to uninstall it with:
```
pip uninstall torch
```
To install a CUDA 12.1 enabled version of PyTorch, run:
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```
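
To confirm that the CUDA-enabled build was picked up, a quick check along these lines may help. This is a minimal sketch; the function name `cuda_status` is hypothetical and is not part of the repository's scripts.

```python
def cuda_status():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    try:
        import torch  # imported here so the check also works when torch is missing
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        # Device 0 is the first GPU that PyTorch detected
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but CUDA is not available"


print(cuda_status())
```

If the last message is printed on a machine with an NVIDIA card, the CPU-only build of PyTorch is probably still installed.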

Keyboard Shortcuts

When a figure has been plotted, and it is selected, press a key for the following shortcuts:

Esc: Close the current figure
E: End the script that is currently running
R: Rerun the script

Note:

For the shortcuts to work, the displayed figure must be selected.
If nothing happens after pressing a key, click on the figure and try again.
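
Shortcuts like these are typically wired up through matplotlib's `key_press_event`. The sketch below shows one way this could be done; the names `KEY_ACTIONS`, `action_for_key`, `on_key`, and `enable_shortcuts` are hypothetical, and the rerun strategy (`os.execv`) is an assumption rather than a description of how these scripts do it.

```python
import os
import sys

# matplotlib reports the Esc key as 'escape' and letters as lowercase strings
KEY_ACTIONS = {'escape': 'close', 'e': 'end', 'r': 'rerun'}


def action_for_key(key):
    """Map a matplotlib key name to an action name, or None if unmapped."""
    if key is None:
        return None
    return KEY_ACTIONS.get(key.lower())


def on_key(event):
    """Callback for 'key_press_event'; only fires when the figure has focus."""
    action = action_for_key(event.key)
    if action == 'close':
        import matplotlib.pyplot as plt  # lazy import, only needed here
        plt.close(event.canvas.figure)
    elif action == 'end':
        sys.exit()
    elif action == 'rerun':
        # Replace the current process with a fresh run of the same script
        os.execv(sys.executable, [sys.executable] + sys.argv)


def enable_shortcuts(fig):
    """Attach the shortcuts to a figure; returns the connection id."""
    return fig.canvas.mpl_connect('key_press_event', on_key)
```

Because key events are delivered to the figure window that has focus, this also illustrates why the figure must be selected before a shortcut has any effect.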

General Information

These scripts are designed to extract a protein substrate sequence from a longer DNA sequence.

The extracted sequences are grouped into two categories: the Unfiltered (or Initial) sort and the Filtered (or Final) sort.

Unfiltered: contains the background data, consisting of randomized substrate sequences
Filtered: contains the substrates that were initially in the Unfiltered set, were found to be active, and were then collected with Fluorescence-Activated Cell Sorting (FACS)

The specificity of the enzyme is evaluated by calculating the Enrichment Score of each amino acid (AA).

The enrichment score compares the frequency of each AA in the Filtered dataset to its frequency in the Unfiltered dataset.
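
As an illustration of this comparison, the sketch below computes a per-position score as the log2 ratio of the two frequencies. The log2 form is an assumption made for the example (the exact formula used by the scripts is not stated here), and the function names are hypothetical.

```python
import math
from collections import Counter


def position_frequencies(substrates):
    """Return one {AA: frequency} dict per substrate position."""
    length = len(substrates[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in substrates)
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()})
    return freqs


def enrichment_scores(filtered, unfiltered):
    """Assumed score: log2(frequency in Filtered / frequency in Unfiltered)."""
    scores = []
    for final_pos, init_pos in zip(position_frequencies(filtered),
                                   position_frequencies(unfiltered)):
        # AAs missing from the background are skipped to avoid dividing by zero
        scores.append({aa: math.log2(final_pos[aa] / init_pos[aa])
                       for aa in final_pos if aa in init_pos})
    return scores


# Toy two-residue substrates, invented for illustration only
unfiltered = ['AL', 'GL', 'AQ', 'GQ']
filtered = ['AL', 'AL', 'AQ', 'GL']
scores = enrichment_scores(filtered, unfiltered)
```

With this definition, a positive score means the AA is more common at that position after sorting than in the background, and a negative score means it is depleted.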

Express DNA

Run extractSubs.py to express DNA sequences and extract the protein substrates.

Input parameters you will need to know include:

Input 1:

inFileName: select files you want to process.
- Important: do not mix files from the initial and final sorts; these must be processed separately. Files within the same set can be processed in one batch or individually. How you do this is up to you, but I recommend at minimum combining the forward (R1) and reverse (R2) reads.
inPathFolder: define path to the folder for your
inPathDNASeqs: add the name of the folder with your fastq data
inSaveFileName: define the name of your saved files (This should be related to the input files)

Input 2:

inSaveFileName: define the name of your saved files (This should be related to the input files)
- If:
```
  inFileName = ['Fyn-I_S6_L001_R1_001', 'Fyn-I_S6_L001_R2_001']
```
- Recommend:
```
  inSaveFileName = 'Fyn-I_S6_L001'
```
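
If you would rather derive the save name than type it, something like the following could work. It is only a sketch: `suggest_save_name` is a hypothetical helper, and it assumes the file names end in `_R1_<digits>` or `_R2_<digits>` as in the example above.

```python
import re


def suggest_save_name(file_names):
    """Strip the read suffix (_R1_001, _R2_001, ...) shared by paired files."""
    stems = {re.sub(r'_R[12]_\d+$', '', name) for name in file_names}
    if len(stems) != 1:
        raise ValueError(f'Files do not share a common name: {sorted(stems)}')
    return stems.pop()


inFileName = ['Fyn-I_S6_L001_R1_001', 'Fyn-I_S6_L001_R2_001']
inSaveFileName = suggest_save_name(inFileName)
print(inSaveFileName)
```

The `ValueError` is a deliberate guard: it triggers if files from different samples are passed in together.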

Input 3:

inAAPositions: name the positions in the substrate
- Recommended: R1 to RN, where N = substrate length
  
  If: N = 8
```
inAAPositions = ['R1','R2','R3','R4','R5','R6','R7','R8']
```
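
For longer substrates, the same list can be generated instead of typed out. A minimal sketch, where `aa_positions` is a hypothetical helper name:

```python
def aa_positions(length):
    """Build position labels R1..RN for a substrate of the given length."""
    return [f'R{i}' for i in range(1, length + 1)]


inAAPositions = aa_positions(8)
print(inAAPositions)
```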

Input 4:

inPrintNumber: how many expressed and extracted substrates to print, so you can inspect how the script is performing with your input parameters
inStartSeqR1, inStartSeqR2: what DNA sequences do you expect to see to the left of your substrate
inEndSeqR1, inEndSeqR2: what DNA sequences do you expect to see to the right of your substrate
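
To show how the flanking sequences are used, the sketch below finds the DNA between a start and end sequence in a forward read and translates it with the standard genetic code. It is not the code from extractSubs.py: the function name, the flanking sequences, and the example read are all invented for illustration, and reverse reads (R2) are not handled.

```python
# Standard genetic code, built from the conventional TCAG codon ordering
BASES = 'TCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_substrate(read, start_seq, end_seq, length):
    """Return the translated substrate between the flanks, or None."""
    start = read.find(start_seq)
    if start == -1:
        return None
    begin = start + len(start_seq)
    end = read.find(end_seq, begin)  # only search to the right of the start flank
    if end == -1:
        return None
    dna = read[begin:end]
    if len(dna) != 3 * length:
        return None  # wrong length, e.g. an insertion or deletion
    protein = ''.join(CODON_TABLE.get(dna[i:i + 3], 'X')
                      for i in range(0, len(dna), 3))
    # Discard reads with unknown bases ('X') or stop codons ('*')
    if 'X' in protein or '*' in protein:
        return None
    return protein


# Hypothetical flanks and read, for illustration only
read = 'AAAA' + 'GGATCC' + 'GCTGTTCTGCAGTCTGGTTTTCGT' + 'CTCGAG' + 'TTTT'
substrate = extract_substrate(read, 'GGATCC', 'CTCGAG', length=8)
print(substrate)
```

Printing a handful of results like this, as `inPrintNumber` allows, is a quick way to check that the chosen flanking sequences frame the substrate correctly.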

Input 5:

inFixedLibrary: are any of the residues in your substrate not randomized
inFixedResidue: make a list of what residue(s) should always be in the substrate
inFixedPosition: where are these AAs expected to be

If your substrate is NNNLQNNN, with Leucine and Glutamine fixed at the 4th and 5th positions, define the inputs as:
```
inFixedLibrary = True
inFixedResidue = ['L', 'Q']
inFixedPosition = [4, 5]
```
If you have multiple AAs at the same position such as NNN(L/M/F/Y)NNNN, put these residues in a list within the list:
```
  inFixedResidue = [['L', 'M', 'F', 'Y']]
  inFixedPosition = [4]
```
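
The two forms above (a single residue per position, or a nested list of allowed residues) can be checked with the same logic. The following is a sketch of such a check, with positions counted from 1 as in the examples; `matches_fixed` is a hypothetical name, not a function from the repository.

```python
def matches_fixed(substrate, fixed_residues, fixed_positions):
    """True if every fixed position holds one of its allowed residues."""
    for residues, position in zip(fixed_residues, fixed_positions):
        # Accept either a single residue ('L') or a list (['L', 'M', 'F', 'Y'])
        allowed = residues if isinstance(residues, list) else [residues]
        if substrate[position - 1] not in allowed:  # positions are 1-based
            return False
    return True


print(matches_fixed('AVSLQSGF', ['L', 'Q'], [4, 5]))
print(matches_fixed('AVSMQSGFA', [['L', 'M', 'F', 'Y']], [4]))
```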

Input 6:

inAlertPath: an optional input to play a sound when the script has finished processing the files

Evaluate Substrates (Incomplete Section)

Run evaluateSubstrates.py to load all processed protein substrate files and evaluate enrichment.

Don't run this script until you have processed all of your fastq/fasta files with extractSubs.py

If you are ready to process the extracted data, find the "filePaths" function in functions.py:

Create a conditional for your enzyme that includes the names you used to save the extracted substrates and the labels for the AA positions:

  def filePaths(enzyme):
       if enzyme == 'enzymeName':
            inFileNamesInitialSort = ['fileNameA', 'fileNameB']
            inFileNamesFinalSort = ['fileName1', 'fileName2'] 
            inAAPositions = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']

Use elif statements to make it easy to process multiple datasets:

    def filePaths(enzyme):
         if enzyme == 'enzymeName':
              inFileNamesInitialSort = ['fileNameA', 'fileNameB']
              inFileNamesFinalSort = ['fileName1', 'fileName2']
              inAAPositions = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']
         elif enzyme == 'enzymeName2':
              inFileNamesInitialSort = ['fileNameC', 'fileNameD']
              inFileNamesFinalSort = ['fileName3', 'fileName4']
              inAAPositions = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']
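
As the number of datasets grows, a lookup table is one possible alternative to a long if/elif chain. The sketch below is a hypothetical variant, not the function in functions.py: the names `DATASETS` and `file_paths`, the dictionary keys, and the return value are assumptions made for the example.

```python
POSITIONS = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']

# One entry per enzyme; the names are the placeholders used above
DATASETS = {
    'enzymeName': {
        'initial': ['fileNameA', 'fileNameB'],
        'final': ['fileName1', 'fileName2'],
        'positions': POSITIONS,
    },
    'enzymeName2': {
        'initial': ['fileNameC', 'fileNameD'],
        'final': ['fileName3', 'fileName4'],
        'positions': POSITIONS,
    },
}


def file_paths(enzyme):
    """Return (initial sort files, final sort files, AA position labels)."""
    try:
        entry = DATASETS[enzyme]
    except KeyError:
        raise ValueError(f'Unknown enzyme: {enzyme!r}') from None
    return entry['initial'], entry['final'], entry['positions']


initial, final, positions = file_paths('enzymeName2')
```

An unknown enzyme name raises an error immediately, instead of leaving the file lists undefined.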

Input parameters you will need to know include:

Input 1:

inEnzymeName: select the enzyme you are working with
inFilePath: define path to the folder with your extracted data
inSavePathFigures: define path to the folder used to store your figures
inSaveFigures: Set as True to automatically save your figures, Set as False to not automatically save the figures (this will not overwrite previously saved figures)

Input 2:

Select which figure you want to plot

Input 3:

inFilterSubstrates: do you want to select substrates with specific residue(s) in the substrate
inFixedResidue: make a list of what residue(s) should always be in the substrate
inFixedPosition: where are these AAs expected to be
inExcludeResidues: do you want to select substrates without specific residue(s) in the substrate
inExcludedResidue: define excluded AAs
inExcludedPosition: define positions to exclude the AAs
inMinimumSubstrateCount: exclude substrates with counts below this value
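
The combined effect of these filters can be pictured with the sketch below. It assumes the substrates are held as a dictionary of sequence to count and that positions are counted from 1. The function name, its parameters, and the example counts are hypothetical.

```python
def filter_substrates(counts, min_count=1, fixed=None, excluded=None):
    """Keep substrates that pass the count, fixed, and excluded filters.

    counts:   {substrate: count}
    fixed:    {position: allowed residues}, positions are 1-based
    excluded: {position: forbidden residues}, positions are 1-based
    """
    fixed = fixed or {}
    excluded = excluded or {}
    kept = {}
    for seq, count in counts.items():
        if count < min_count:
            continue
        if any(seq[pos - 1] not in allowed for pos, allowed in fixed.items()):
            continue
        if any(seq[pos - 1] in banned for pos, banned in excluded.items()):
            continue
        kept[seq] = count
    return kept


# Invented counts, for illustration only
counts = {'AVSLQSGF': 12, 'AVSMQSGF': 30, 'PVSLQSGF': 40, 'GVSLQSGF': 3}
kept = filter_substrates(counts, min_count=5,
                         fixed={4: ['L']}, excluded={1: ['P']})
print(kept)
```

Here one substrate is dropped by each filter: one for a low count, one for lacking the fixed residue, and one for carrying an excluded residue.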

Motif Eval (Incomplete Section)

Input 2:

inMotifPositions: Label the residues in the motif sequence

  inMotifPositions = ['P4', 'P3', 'P2', 'P1', 'P1\'', 'P2\'']

inIndexNTerminus: Define the index of the first position in "inMotifPositions"
- If the positions in the full substrate sequence are:
```
pos = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']
```
  - If: inIndexNTerminus = 0, the motif will start at 'R1', and extend to 'R6'
  - If: inIndexNTerminus = 1, the motif will start at 'R2', and extend to 'R7'
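
The relationship between the motif labels and the substrate positions is a simple slice. A minimal sketch, where `motif_window` is a hypothetical helper name:

```python
def motif_window(positions, motif_positions, index_n_terminus):
    """Map each motif label to the substrate position it covers."""
    end = index_n_terminus + len(motif_positions)
    if end > len(positions):
        raise ValueError('The motif extends past the end of the substrate')
    return dict(zip(motif_positions, positions[index_n_terminus:end]))


pos = ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8']
inMotifPositions = ['P4', 'P3', 'P2', 'P1', "P1'", "P2'"]
window = motif_window(pos, inMotifPositions, 1)
print(window)
```

With a six-residue motif in an eight-residue substrate, the valid values of `inIndexNTerminus` are 0, 1, and 2.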

Miscellaneous

Figures: Word Cloud

The orientations of the words, and their colors will be randomized with each figure.
To remake the figure with a new arrangement, press the "R" key (when the figure is selected) to rerun the script.
- Recommendation: Turn off the switches that plot the Enrichment Maps, and Logos to avoid spending time plotting unnecessary figures before plotting the Word Cloud
