Creating an app that will import a blastn output file and parse out and import the details.
mysql data type of “float” will accept small scientific notation, however, 2e-64 will be converted to 0.
“Double” will hold it though. I’m checking to see where the line is. I need to change the data type of the expect column.
Found this in a comment, but can’t find it in any actual documentation.
I am rather confused by the multiple ranges.
DOUBLE[(M,D)] [ZEROFILL] holds double-precision numeric values, pretty similar to FLOAT double-precision, except for its allowable range, which is -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308. FLOAT[(M[,D])] [ZEROFILL] stores floating point numbers in the range of -3.402823466E+38 to -1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38. If precision isn't specified, or <= 24, it's SINGLE precision, otherwise FLOAT is DOUBLE precision. When specified alone, precision can range from 0 to 53. If the scale is defined, too, precision may be up to 255, scale up to 253.
dev.mysql.com/doc/refman/5.7/en/numeric-type-overview.html mariadb.com/kb/en/double/ mariadb.com/kb/en/float/
I’m not sure what the limits in ruby or sunspot would be, but I suppose that I should check.
Ruby seems to be ok down to 1e-323
irb(main):016:0> 1e-323 => 1.0e-323 irb(main):017:0> 1e-324 => 0.0
Sunspot seems to also be sensitive to float and double differences
The dynamic column names are different. And they also seem to muck with the data when faceting
So, I’ve extracted all of the accession numbers, gi numbers and taxids from the nt database. And I’ve blasted a fasta file to the nt database. Yet there are a bunch of accession numbers in the blast output file that were NOT extracted.
For example, there are a number of matches to 3IZT in the blast output.
> grep ">pdb|3IZT|" trinity_non_human_paired.blastn.txt | wc -l 12 > grep 3IZT trinity_non_human_paired.blastn.txt | head -1 pdb|3IZT|B Chain B, Structural Insights Into Cognate Vs. Near-Co... 71.3 2e-09
Yet, there is no extracted accession number 3IZT, but there are 3IZT_A and 3IZT_B.
> grep 3IZT data/nt.accession_gi_taxid.csv 3IZT_A,326634209,83333 3IZT_B,326634210,83333
What? Why?
Identifiers were extracted from the ncbi nt database (>26million)
> blastdbcmd -db /Volumes/cube/working/indexes/nt -entry all -outfmt '%a,%g,%T' -out nt.accession_gi_taxid.csv
Recently, after blasting to the viral_genomics database, I found that all of the accession numbers didn’t have associated identifiers. This is apparently due to the viral_genomics database using REFSEQ data. I tried to extract them from the database as I did for nt but found that this database doesn’t include taxids.
ftp.ncbi.nih.gov/gene/DATA/ ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz
#Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)
sed -e 1d gene2accession | awk ‘{print $8“,”$9“,”$1}’ | grep -vs “-,-” > gene2accession.accession_gi_taxid.csv & sort uniq
gene2refseq is just a subset of gene2accession
ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz sed -e 1d gene2refseq | awk ‘{print $8“,”$9“,”$1}’ | grep -vs “-,-” > gene2refseq.accession_gi_taxid.csv & sort uniq
Unfortunately, these datasets have a number of problems which halt data importing. A few missing gi numbers. A handful of duplicate accession and gi numbers with alternate taxids. I randomly removed some offending duplicates just to get the data imported. I still have some blast data with no associated identifier. I think that this may be due to different accession versions. I don’t quite get this either as I vaguely remember reading that every accession number gets a new gi number so all of the old ones should still be there.
gene2accession recalculated daily
This file is a comprehensive report of the accessions that are related to a GeneID. It includes sequences from the international sequence collaboration, Swiss-Prot, and RefSeq. The RefSeq subset of this file is also available as gene2refseq. Because this file is updated daily, the RefSeq subset does not reflect any RefSeq release. Versions of RefSeq RNA and protein records may be more recent than those included in an annotation release (build) or those in the current RefSeq release. To identify the annotation release/build to which the genomic RefSeqs belong, please refer to the species-specific README_CURRENT_RELEASE or README_CURRENT_BUILD file in the genomes ftp site: ftp://ftp.ncbi.nih.gov/genomes/ For example: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/README_CURRENT_RELEASE ftp://ftp.ncbi.nih.gov/genomes/Ailuropoda_melanoleuca/README_CURRENT_BUILD More notes about this file: tab-delimited one line per genomic/RNA/protein set of sequence accessions Column header line is the first line in the file. NOTE: Because this file is comprehensive, it may include some RefSeq accessions that are not current, because they are part of the annotation of the current genomic assembly. In other words, the annotation of a genome is not continuous, but depends on a data freeze. Sub-genomic RefSeqs, however, are updated continuously. Thus some RefSeqs may have been replaced or suppressed after a data freeze assocated with a genomic annotation. Until the release of a new genomic annotation, all RefSeqs that are included in the current annotation are reported in this file.
tax_id:
the unique identifier provided by NCBI Taxonomy for the species or strain/isolate
GeneID:
the unique identifier for a gene
status:
status of the RefSeq if a refseq, else '-' RefSeq values are: INFERRED, MODEL, NA, PREDICTED, PROVISIONAL, REVIEWED, SUPPRESSED, VALIDATED
RNA nucleotide accession.version:
may be null (-) for some genomes
RNA nucleotide gi:
the gi for an RNA nucleotide accession, '-' if not applicable
protein accession.version:
will be null (-) for RNA-coding genes
protein gi:
the gi for a protein accession, '-' if not applicable
genomic nucleotide accession.version:
may be null (-)
genomic nucleotide gi:
the gi for a genomic nucleotide accession, '-' if not applicable
start position on the genomic accession:
position of the gene feature on the genomic accession, '-' if not applicable position 0-based NOTE: this file does not report the position of each exon. For positions on RefSeq contigs and chromosomes, use the seq_gene.md file in the appropriate build directory. For example, for the human genome, ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/ This file has one line for each annotation, with the feature name, feature_id and feature_type columns indicating the name and type of feature. Note that the GeneID value in the feature_id column can be used to find all locations for a gene by GeneID. WARNING: Positions in seq_gene.md files are one-based, not 0-based NOTE: if genes are merged after an annotation is released, there may be more than one location reported on a genomic sequence per GeneID, each resulting from the annotation before the merge.
end position on the genomic accession:
position of the gene feature on the genomic accession, '-' if not applicable position 0-based NOTE: this file does not report the position of each exon. For positions on RefSeq contigs and chromosomes, use the seq_gene.md file in the appropriate build directory. For example, for the human genome, ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/ This file has one line for each annotation, with the feature name, feature_id and feature_type columns indicating the name and type of feature. Note that the GeneID value in the feature_id column can be used to find all locations for a gene by GeneID. WARNING: Positions in seq_gene.md files are one-based, not 0-based NOTE: if genes are merged after an annotation is released, there may be more than one location reported on a genomic sequence per GeneID, each resulting from the annotation before the merge.
orientation:
orientation of the gene feature on the genomic accession, '?' if not applicable
assembly:
the name of the assembly '-' if not applicable
mature peptide accession.version:
will be null (-) if absent
mature peptide gi:
the gi for a mature peptide accession, '-' if not applicable
Symbol:
the default symbol for the gene
OK then. So. Lots of little problems and gaps. Downloading a newer blast nt database along with human_genomic, refseq_genomic and other_genomic. I’m going to extract accession/gi/taxid from all and see if that works.
update_blastdb.pl nt update_blastdb.pl refseq_genomic update_blastdb.pl other_genomic update_blastdb.pl human_genomic cat refseq_genomic.??.tar.gz.md5 > refseq_genomic.md5list gmd5sum -c refseq_genomic.md5list echo 'for file in `ls refseq_genomic.??.tar.gz` ; do tar xfzv $file ; done' | bash
Nodes and Names came from ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz ( I did not import all of nodes, or any of the other data. ) ( I only use the scientific name, so next time I will only import those. )
*.dmp files are bcp-like dump from GenBank taxonomy database. General information. Field terminator is "\t|\t" Row terminator is "\t|\n" nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields: tax_id -- node id in GenBank taxonomy database parent tax_id -- parent node id in GenBank taxonomy database rank -- rank of this node (superkingdom, kingdom, ...) embl code -- locus-name prefix; not unique division id -- see division.dmp file inherited div flag (1 or 0) -- 1 if node inherits division from parent genetic code id -- see gencode.dmp file inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent mitochondrial genetic code id -- see gencode.dmp file inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet comments -- free-text comments and citations Taxonomy names file (names.dmp): tax_id -- the id of node associated with this name name_txt -- name itself unique name -- the unique variant of this name if name not unique name class -- (synonym, common name, ...)
After downloading newer versions of names and nodes,
bundle exec rake app:names:import bundle exec rake app:nodes:import bundle exec rake app:nodes:import_scientific_names bundle exec rake app:nodes:build_nested_set bundle exec rake sunspot:reindex
This will require updating of the preset taxonomy ranges in BlastResults. Get the lft and rgt values from .…
Node.where(:scientific_name => "Viruses").first (taxid 10239)
Node.where(:scientific_name => "Bacteria").order('depth ASC').first # two actually!!! (taxid 2)
Node.where(:scientific_name => "Homo Sapiens").first.taxid (really deep) (taxid 9606)
=> 9606
Node.where(:taxid => 9606).first.parent.parent.parent.parent.parent.parent.parent.parent.taxid
=> 9443
Primates (taxid 9443)
May have to check the preset taxids for Virus and such, but I would expect the taxids to NOT change, but the lft and rgt very much may have.
Getting .…
RSolr::Error::Http (RSolr::Error::Http - 500 Internal Server Error Error: Java heap space java.lang.OutOfMemoryError: Java heap space
How to get current value?
java -XX:+PrintFlagsFinal -version
How to set new value?
config/sunspot.yml allows for adjusting max and min
I have found on my mac, that using Sunspot with Java 1.7 will cause all kinds of random grief.
This sucks now as sunspot 2.0.0 needs patches to work with rails 4.
This is all good now. Yay!
If brand new mariadb installation, MUST do this FIRST. as it creates files with the proper ownership and permissions in /opt/local/var/db/mariadb/
sudo mysql_install_db --user=mysql sudo /opt/local/share/mariadb/support-files/mysql.server start bundle exec rake sunspot:solr:start
Reimport (~30 million) file_names = BlastResult.group(:file_name).count select = file_names.keys.select{|f|f.match(/^fallon_SFP/)}.select{|f|f.match(/blastn.txt$/)} BlastResult.where(:file_name => select).find_each{|b|b.destroy} bundle exec rake app:blast_results:import file=“blast_results/fallon_SFPB001?/trinity_non_human_{single,paired}.blastn.txt”
bundle exec rake app:blast_results:import file=“blast_results/fallon_SFPB001/trinity_non_human_single.blastn.txt“
BlastResult.where(:file_name => “fallon_SFPB001B_filtered_20130731_trinity_non_human_single.blastn.txt”).find_each{|b| b.delete }
BlastResult.where(BlastResult.arel_table.gt(19895047)).find_each{|b|b.index}
Copyright © 2013 [Jake Wendt], released under the MIT license