
Conversation

@cwusinich

No description provided.

README.md Outdated
Comment on lines 32 to 33
1. Transfer MEG, behavior, and MRI data to data processing folder on biowulf (this is just done with a BASH command)
2. Update demographics spreadsheet for use in analysis

These may not require manual intervention. The more you can automate, the more scalable and less error-prone your processing will be.

README.md Outdated
1. Script name will be **01_megpreprocess1**, and I would like it to do the following:
- Change file name using CTF command (to make file names look like “subject#_MID.ds” so they are more uniform)
- Set markers for stimuli (cue markers) and output file (.txt file) of the times of the cue marks (this file will be used in the next step)
- I already have some of the CTF commands for this, but I would like to figure out how to give this script (and all of the following ones) a list of subject names to loop through so I can do this all at once.

snakemake may be a useful approach for this. Or, since you are working in human neuroimaging, nipype might be worth it. These tools for building and executing workflows are hard at the beginning, but they pay off in the long run. An added advantage of nipype is that many of the generic neuroimaging processing tools use it, so you will be able to debug/modify those pipelines more easily.
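
Even before adopting a workflow manager, a plain loop over subject IDs gets you the batching you describe. A minimal sketch (the subject IDs, paths, and CTF command name below are placeholders, not your actual pipeline):

import subprocess

subjects = ["sub01", "sub02", "sub03"]          # hypothetical subject IDs
for subject in subjects:
    ds_path = f"/data/meg/{subject}_MID.ds"     # hypothetical directory layout
    # "someCTFcommand" stands in for whichever CTF tool you would run by hand
    subprocess.run(["someCTFcommand", ds_path], check=True)  # check=True stops on failure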

Author


Thank you so much for all of these responses and suggestions!!

README.md Outdated
## Behavior processing
1. Script name will be **02_behprocess**, and it will do the following:
- Makes marker files for more MEG processing; these markers will designate win/loss/neutral cues to be marked in the MEG file and are output as three separate .txt files to the subject’s MEG directory
- Pulls columns from behavior data file and calculates mean RTs and accuracy by trial and subject and appends that to master behavior data sheet (I will probably work with this sheet in R unless I can do an ANOVA as another step)

Look into pandas for working with tabular data. It will make things a lot easier. And using csv files instead of plain text files might be useful.
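
For example, a minimal pandas sketch of the RT/accuracy summary (the file and column names here are made up; adjust to your behavior files):

import pandas as pd

beh = pd.read_csv("sub01_behavior.csv")          # hypothetical per-subject file
summary = (beh.groupby("trial_type")             # hypothetical column names
              .agg(mean_rt=("rt", "mean"), accuracy=("correct", "mean"))
              .reset_index())
summary["subject"] = "sub01"

# append this subject's summary to the running master sheet
master = pd.read_csv("master_behavior.csv")
master = pd.concat([master, summary], ignore_index=True)
master.to_csv("master_behavior.csv", index=False)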


Python has advanced tools for statistical learning/analysis. statsmodels is probably what you need here, but as the sophistication of your analysis grows you could add in packages such as lme, scikit-learn, and tensorflow. There are certainly times to use R, but if you can complete everything in python it will help you a lot; learning how to debug bash, python, and R all at once is never a fun task.
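
For instance, a one-way ANOVA on the master sheet is only a few lines with statsmodels (file and column names are hypothetical):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("master_behavior.csv")                  # hypothetical master sheet
model = smf.ols("mean_rt ~ C(trial_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))                   # prints the ANOVA table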

README.md Outdated
1. Script name will be **02_behprocess**, and it will do the following:
- Makes marker files for more MEG processing; these markers will designate win/loss/neutral cues to be marked in the MEG file and are output as three separate .txt files to the subject’s MEG directory
- Pulls columns from behavior data file and calculates mean RTs and accuracy by trial and subject and appends that to master behavior data sheet (I will probably work with this sheet in R unless I can do an ANOVA as another step)
- I have already made two separate jupyter notebook files to do each of the above tasks, but I can't figure out how to do it without finding and replacing the subject ID; obviously a loop would be a lot easier, so I would like to figure that out and collapse both notebooks into one script

I suspect a pandas dataframe plus string formatting should be adequate. Your processing should likely be encapsulated in functions (and, in turn, in modules). Any function could take the subject id as an argument to modify its behavior appropriately.
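
Something along these lines (the paths and column names are placeholders, not your actual data layout):

import pandas as pd

def make_marker_files(subject_id, meg_dir="/data/meg"):
    """Write win/loss/neutral cue-time files for one subject (hypothetical layout)."""
    beh = pd.read_csv(f"{meg_dir}/{subject_id}/{subject_id}_behavior.csv")
    for cue in ("win", "loss", "neutral"):
        times = beh.loc[beh["cue_type"] == cue, "onset"]
        times.to_csv(f"{meg_dir}/{subject_id}/{cue}_markers.txt",
                     index=False, header=False)

for subject in ["sub01", "sub02"]:
    make_marker_files(subject)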

README.md Outdated
- Copy talairached .nii file into a group folder for group analysis
2. Problems:
- This step uses a parameter file that I would like to be able to create and distribute among subject directories whenever we want to change parameters (e.g. the time window we’re looking at in the task)
- Each of the three SAM commands takes a bit of time, but I heard that using swarm in biowulf may be a good way to do this. I am not sure how it would work with a loop just yet, but it’s an aspiration to make this somewhat efficient if possible.

This is where nipype would shine. It is true that swarm would distribute this, but then it is not very portable. And workflow managers do lots of intelligent management of the processing, so that when you make a change you only recompute the bits of the analysis that you need to.
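
To give a flavour of it, a minimal nipype sketch that fans one step out across subjects (the run_sam function, the command it calls, and the paths are all placeholders):

from nipype import Node, Workflow
from nipype.interfaces.utility import IdentityInterface, Function

def run_sam(subject_id):
    """Hypothetical wrapper around one SAM command for a single subject."""
    import subprocess
    subprocess.run(["samcommand", f"/data/meg/{subject_id}_MID.ds"], check=True)
    return subject_id

subjects = Node(IdentityInterface(fields=["subject_id"]), name="subjects")
subjects.iterables = ("subject_id", ["sub01", "sub02"])   # expands into one run per subject

sam = Node(Function(input_names=["subject_id"],
                    output_names=["subject_id"],
                    function=run_sam),
           name="sam")

wf = Workflow(name="sam_pipeline", base_dir="/scratch/work")
wf.connect(subjects, "subject_id", sam, "subject_id")
wf.run(plugin="MultiProc")   # other plugins can hand jobs to the cluster scheduler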

README.md Outdated

## Permissions
1. I want a script that will set all of the permissions correctly for my group for all of the output files above because umask in my bash profile does not seem to work for this.
2. The commands in this script may just be put at the end of each of the above ones depending on how this works out, or should it just be a bash script?

Bash is fine; Python would be better. There's not too much point in rewriting lots of bash scripts in python, but with that said, if you tie everything together with python it will be easier to maintain and debug.


It occurred to me that some tools completely ignore umask. Perhaps that is your problem. Assuming you have written all your output to "/data/$USER/output_dir" you could make sure your output is well behaved regarding permissions by ending each swarm job with something like:

OUTPUT_DIR=/data/$USER/output_dir
GROUP=mygroup
find $OUTPUT_DIR -group $USER -exec chown :$GROUP {} \; -exec chmod g+rx {} \;

README.md Outdated
2. The commands in this script may just be put at the end of each of the above ones depending on how this works out, or should it just be a bash script?

# Additional Questions:
- Can I have a script do “module load afni” (and ctf, and R) so that I don’t have to?

You can certainly put this in your ~/.bashrc file (or ~/.bash_profile, depending on what OS you are on). For the class project you are developing a python package, and a python package (a pip-installed one) will often just assume the appropriate dependencies are installed on the system.

README.md Outdated

# Additional Questions:
- Can I have a script do “module load afni” (and ctf, and R) so that I don’t have to?
- If the commands are from CTF, can they be in a python script? I had just been running them in the terminal, so would I have to add something special to the beginning of each one to include it in a python script?

The subprocess package may be what you need here, or the above-mentioned workflow managers.
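
A hedged example of calling an external command from python with subprocess (the command name and flags are placeholders; the real CTF tools just need to be on your PATH, e.g. via module load, before python starts):

import subprocess

# the command name and arguments below are placeholders for a real CTF call
result = subprocess.run(
    ["someCTFcommand", "-marker", "cue", "sub01_MID.ds"],
    capture_output=True, text=True, check=True)
print(result.stdout)     # whatever the tool printed to stdout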

README.md Outdated
# Additional Questions:
- Can I have a script do “module load afni” (and ctf, and R) so that I don’t have to?
- If the commands are from CTF, can they be in a python script? I had just been running them in the terminal, so would I have to add something special to the beginning of each one to include it in a python script?
- Can I make a giant script that will run all of these steps (01-05 and setting permissions) on a list of subjects?

It could be one giant script: a script that provides a command-line interface to the user (use argparse) and then calls the appropriate functionality (imported from other python modules in your package).

Or you might just have several smaller scripts if that is more convenient. Python entry points might be useful as you try to achieve this. The scripts keyword in setup.py may also be useful if you do wish to install non-python scripts as part of your package.
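
A minimal sketch of the CLI layer (run_pipeline here is a stand-in for whatever functions you import from your package):

import argparse

def run_pipeline(subject, step="all"):
    """Placeholder for the real processing functions in your package."""
    print(f"would run step '{step}' for {subject}")

def main():
    parser = argparse.ArgumentParser(description="Run the MEG/behavior pipeline")
    parser.add_argument("subjects", nargs="+", help="one or more subject IDs")
    parser.add_argument("--step", default="all",
                        choices=["all", "meg", "beh", "sam"],
                        help="which stage of the pipeline to run")
    args = parser.parse_args()
    for subject in args.subjects:
        run_pipeline(subject, step=args.step)

if __name__ == "__main__":
    main()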

README.md Outdated
- Can I have a script do “module load afni” (and ctf, and R) so that I don’t have to?
- If the commands are from CTF, can they be in a python script? I had just been running them in the terminal, so would I have to add something special to the beginning of each one to include it in a python script?
- Can I make a giant script that will run all of these steps (01-05 and setting permissions) on a list of subjects?
- How can I set my default python version in biowulf without manually having to load 3.7 every time? Can I do that at the beginning of the first script or in my profile?

Yes. I have a conda environment that is set in my profile as the default, so I am always using the python in that environment.

module load can often be quite slow, so I often create a bash alias to software within a conda environment. For example, alias git=/home/$USER/anaconda3/envs/misc_env/bin/git takes no time to load at login but gives me git from that environment when I want it.

@cwusinich changed the title from "Christina Wusinich Project Outline" to "Christina Wusinich Project" on Apr 23, 2020
@leej3

leej3 commented Apr 23, 2020

> Am I doing git right?

Yes, it appears so.

> Is there a reason that something would work when I run a command individually vs. running as a swarm? Trouble-shooting ideas would be great because my whole project won’t work if I can’t figure this out…

Sounds like checking both the .o and .e files will help to debug this.

> Related to q#1: how do I figure out how many CPUs and threads to assign to a job? I’ve been doing a lot of googling and am still a bit confused. I’ve just been assigning a lot until the command runs successfully, but that seems kind of sloppy and maybe a disrespectful use of shared resources. And what about threads?

No need to worry about taking more than you have allocated; they don't let you, so you only harm your own processing. It is an sbatch system, so you can likely use variables like the ones listed here. Try SLURM_CPUS_PER_TASK.
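
For example, inside a job you can read the allocation from the environment and size your processing to match (the second variable is only set if memory was requested for the job):

import os

# inside an sbatch/swarm job, SLURM exports the allocation as environment variables
n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
mem_per_node = os.environ.get("SLURM_MEM_PER_NODE")   # may be unset depending on the request
print(f"allocated CPUs for this task: {n_cpus}, memory: {mem_per_node}")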

> I need some help conceptualizing how to bring multiple scripts together so that I can run them in order on a bunch of subjects at once while allocating enough memory for each so that this doesn’t take forever. Also I have a combination of python scripts I’ve written and some bash swarm files (which one of the python scripts generates), which is adding to some of my confusion. In Dr. Lee’s comments on my outline, he mentioned looking into pipelines, which I hope to do by next week but will probably still have some questions about organizing this and which one to pick.

Have a look at snakemake and nipype. They're worth it in the long run, just maybe not for this project.

> Other questions if there is time:
> Additionally, I have been encountering a lot of errors when swarming a bunch of subjects at once which sometimes seem to get lost in the shuffle (so I don’t realize it until a later step) or are very vague (i.e. it will just say “FAILED” and I have to then go figure out why); so it would be nice to figure out how to build in some more informative error messages and mechanisms that will stop later scripts in the chain if one of the earlier ones fails (also messages that tell me if things have run successfully--maybe some kind of report output?).
> Another higher level question is that the neuroimaging data I’m working with just seems to be kind of unpredictable (or error-inducing?), and I wonder if it’s even wise to combine too many steps into one, though maybe the error messages will help with that. Or maybe is there something I can do early in the process to make sure the datasets are ready to go through the pipeline?

@leej3

leej3 commented May 3, 2020

f-strings will make a lot of this easier to construct and more legible. For example...

subject = "01"
session = "a"
cmd = f"""
    command 
        -ax 
        --another-flag 
        -s {subject}
        --session {session}
    """
cmd = ' '.join(cmd.split())  # collapse the newlines and indentation into single spaces
print(cmd)

This will generate "command -ax --another-flag -s 01 --session a", which is a lot more legible than adding strings together. Something to bear in mind in future.

Also, add a setup.py and some tests...

@cwusinich
Author

cwusinich commented May 5, 2020

Thank you for the f-strings suggestion! If I don't change it for the project deadline, I definitely will at some point moving forward.

Christina Wusinich and others added 3 commits May 4, 2020 22:17
… draft--plz don't look at it yet), added blank setup.py and blank test file
@cwusinich
Author

@leej3 If we don't have any functions (I just have scripts), what should the tests be? Should I make one of them a function just so it can have tests?

@leej3

leej3 commented May 6, 2020 via email

@cwusinich
Author

Thanks! Will do.

Christina Wusinich and others added 6 commits May 6, 2020 19:51
@cwusinich
Author

@leej3 I'm still planning to add more tests today and clean some of the scripts up if there's time, but the one test I have in there now works, and so do all of my modules, as far as I can tell. I also tested installing the package on a different computer, and that worked, and the test ran successfully. I'm wondering why it's not passing CircleCI; I read the details of the failure and looked at other people's pull requests that passed, but I'm still a little confused about what's wrong with mine. I'm guessing something isn't formatted or set up correctly, but I'm not sure what. If you have time, can you check to see what's going wrong?

cwusinich and others added 3 commits May 7, 2020 03:23
test files should start with test to be picked up by pytest

You should run your tests using pytest

The creation of test data has been tidied into a function to
make troubleshooting it a little easier. Ultimately creating a
test fixture would be ideal for this sort of function.

Using Path objects from pathlib helps when working with lots of
directories and filenames. It adds the appropriate path separators
("/" or "\") depending on your operating system, you no longer have
to worry about trailing slashes, and it has lots of neat methods.

If you wish to create variables throughout a .py file, the correct style
is to capitalize the names
@leej3

leej3 commented May 7, 2020

Great work. I've submitted some suggested fixes to help you on your way.

Quick summary:

  • Encapsulating fake data creation is a good idea
  • Test modules must start with "test" for pytest to pick them up
  • Try to use pathlib.Path more. Just as f-strings change one's life with strings, Path objects make working with paths much easier. The one gotcha (which is becoming less frequent) is that some tools insist on a string as input instead of a Path object. The quick fix is my_path_as_a_string = str(my_path_object). A small sketch follows below.
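
To make the last two points concrete, a minimal test-module sketch (the file and column names are invented for illustration):

# test_behavior.py -- the file name starts with "test" so pytest collects it
from pathlib import Path
import pandas as pd

def make_fake_behavior(directory: Path) -> Path:
    """Write a tiny fake behavior file; the columns are made-up placeholders."""
    csv_path = directory / "sub01_behavior.csv"      # Path objects join with "/"
    pd.DataFrame({"cue_type": ["win", "loss"], "onset": [1.0, 2.5]}).to_csv(
        csv_path, index=False)
    return csv_path

def test_fake_behavior_roundtrip(tmp_path):
    # tmp_path is a built-in pytest fixture giving a fresh temporary directory as a Path
    csv_path = make_fake_behavior(tmp_path)
    assert csv_path.exists()
    assert list(pd.read_csv(csv_path).columns) == ["cue_type", "onset"]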

@cwusinich
Author

Thank you! It looks so much better now!

@cwusinich
Author

Submitting now. Thank you for your help with this! I learned way more than I thought I would!

@leesup

leesup commented May 11, 2020

You did an awesome job on your project! You provided a truly impressive description of your project and tests. Great job using NumPy and pandas; I hope these become your best friends when using Python! I would suggest having more white space for improved readability. In addition, I would suggest checking whether the libraries/modules you imported are actually being used: I saw that you imported the csv module, but you ended up using pandas to read the csv files instead. Using pandas is completely fine, but you don’t need to import csv in that case. Also, you should have a look at pathlib and argparse. Again, great job with your project!

-Paul

@cwusinich
Author

Thank you @suhwanplee !! Those are great suggestions; I am definitely going to look into them. Take care!

