GitHub

Taxonomy

Creating an app that will import a blastn output file and parse out and import the details.

Another piece of new info

mysql data type of “float” will accept small scientific notation, however, 2e-64 will be converted to 0.

“Double” will hold it though. I’m checking to see where the line is. I need to change the data type of the expect column.

Found this in a comment, but can’t find it in any actual documentation.

I am rather confused by the multiple ranges.

DOUBLE[(M,D)] [ZEROFILL] holds double-precision numeric values, pretty similar to FLOAT double-precision, except for its allowable range, which is -1.7976931348623157E+308 to -2.2250738585072014E-308, 0, and 2.2250738585072014E-308 to 1.7976931348623157E+308.

FLOAT[(M[,D])] [ZEROFILL] stores floating point numbers in the range of -3.402823466E+38 to -1.175494351E-38 and 1.175494351E-38 to 3.402823466E+38. If precision isn't specified, or <= 24, it's SINGLE precision, otherwise FLOAT is DOUBLE precision. When specified alone, precision can range from 0 to 53. If the scale is defined, too, precision may be up to 255, scale up to 253.

dev.mysql.com/doc/refman/5.7/en/numeric-type-overview.html mariadb.com/kb/en/double/ mariadb.com/kb/en/float/

I’m not sure what the limits in ruby or sunspot would be, but I suppose that I should check.

Ruby seems to be ok down to 1e-323

irb(main):016:0> 1e-323
=> 1.0e-323
irb(main):017:0> 1e-324
=> 0.0

Sunspot seems to also be sensitive to float and double differences

The dynamic column names are different.
And they also seem to muck with the data when faceting

Confusion?

So, I’ve extracted all of the accession numbers, gi numbers and taxids from the nt database. And I’ve blasted a fasta file to the nt database. Yet there are a bunch of accession numbers in the blast output file that were NOT extracted.

For example, there are a number of matches to 3IZT in the blast output.

> grep ">pdb|3IZT|" trinity_non_human_paired.blastn.txt | wc -l
12

> grep 3IZT trinity_non_human_paired.blastn.txt | head -1
pdb|3IZT|B  Chain B, Structural Insights Into Cognate Vs. Near-Co...  71.3    2e-09

Yet, there is no extracted accession number 3IZT, but there are 3IZT_A and 3IZT_B.

> grep 3IZT data/nt.accession_gi_taxid.csv 
3IZT_A,326634209,83333
3IZT_B,326634210,83333

What? Why?

Imported Identifiers

Identifiers were extracted from the ncbi nt database (>26million)

> blastdbcmd -db /Volumes/cube/working/indexes/nt -entry all -outfmt '%a,%g,%T' -out nt.accession_gi_taxid.csv

Recently, after blasting to the viral_genomics database, I found that all of the accession numbers didn’t have associated identifiers. This is apparently due to the viral_genomics database using REFSEQ data. I tried to extract them from the database as I did for nt but found that this database doesn’t include taxids.

ftp.ncbi.nih.gov/gene/DATA/ ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz

#Format: tax_id GeneID status RNA_nucleotide_accession.version RNA_nucleotide_gi protein_accession.version protein_gi genomic_nucleotide_accession.version genomic_nucleotide_gi start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation assembly mature_peptide_accession.version mature_peptide_gi Symbol (tab is used as a separator, pound sign - start of a comment)

sed -e 1d gene2accession | awk ‘{print $8“,”$9“,”$1}’ | grep -vs “-,-” > gene2accession.accession_gi_taxid.csv & sort uniq

gene2refseq is just a subset of gene2accession

ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz sed -e 1d gene2refseq | awk ‘{print $8“,”$9“,”$1}’ | grep -vs “-,-” > gene2refseq.accession_gi_taxid.csv & sort uniq

Unfortunately, these datasets have a number of problems which halt data importing. A few missing gi numbers. A handful of duplicate accession and gi numbers with alternate taxids. I randomly removed some offending duplicates just to get the data imported. I still have some blast data with no associated identifier. I think that this may be due to different accession versions. I don’t quite get this either as I vaguely remember reading that every accession number gets a new gi number so all of the old ones should still be there.

gene2accession recalculated daily

This file is a comprehensive report of the accessions that are 
related to a GeneID.  It includes sequences from the international
sequence collaboration, Swiss-Prot, and RefSeq. The RefSeq subset
of this file is also available as gene2refseq.

Because this file is updated daily, the RefSeq subset does not 
reflect any RefSeq release. Versions of RefSeq RNA and protein 
records may be more recent than those included in an annotation
release (build) or those in the current RefSeq release.

To identify the annotation release/build to which the 
genomic RefSeqs belong, please refer to the species-specific
README_CURRENT_RELEASE or README_CURRENT_BUILD
file in the genomes ftp site:

  ftp://ftp.ncbi.nih.gov/genomes/

For example:
  ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/README_CURRENT_RELEASE
  ftp://ftp.ncbi.nih.gov/genomes/Ailuropoda_melanoleuca/README_CURRENT_BUILD            

More notes about this file:

tab-delimited
one line per genomic/RNA/protein set of sequence accessions
Column header line is the first line in the file.

NOTE: Because this file is comprehensive, it may include
some RefSeq accessions that are not current, because they are
part of the annotation of the current genomic assembly. In other 
words, the annotation of a genome is not continuous, but depends
on a data freeze. Sub-genomic RefSeqs, however, are updated 
continuously. Thus some RefSeqs may have been replaced or 
suppressed after a data freeze assocated with a genomic annotation. 
Until the release of a new genomic annotation, all
RefSeqs that are included in the current annotation are reported
in this file.

tax_id:

the unique identifier provided by NCBI Taxonomy
for the species or strain/isolate

GeneID:

the unique identifier for a gene

status:

status of the RefSeq if a refseq, else '-'
RefSeq values are: INFERRED, MODEL, NA, PREDICTED, PROVISIONAL,
REVIEWED, SUPPRESSED, VALIDATED

RNA nucleotide accession.version:

may be null (-) for some genomes

RNA nucleotide gi:

the gi for an RNA nucleotide accession, '-' if not applicable

protein accession.version:

will be null (-) for RNA-coding genes

protein gi:

the gi for a protein accession, '-' if not applicable

genomic nucleotide accession.version:

may be null (-)

genomic nucleotide gi:

the gi for a genomic nucleotide accession, '-' if not applicable

start position on the genomic accession:

position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based

NOTE: this file does not report the position of each exon.
For positions on RefSeq contigs and chromosomes, 
use the seq_gene.md file in the appropriate build directory.
For example, for the human genome,
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/

This file has one line for each annotation, with the feature name, feature_id and
feature_type columns indicating the name and type of feature. Note that the GeneID 
value in the feature_id column can be used to find all locations for a gene by GeneID.

WARNING: Positions in seq_gene.md files are one-based, not 0-based

NOTE: if genes are merged after an annotation is released, there 
may be more than one location reported on a genomic sequence 
per GeneID, each resulting from the annotation before the merge.

end position on the genomic accession:

position of the gene feature on the genomic accession,
'-' if not applicable
position 0-based

NOTE: this file does not report the position of each exon.
For positions on RefSeq contigs and chromosomes, 
use the seq_gene.md file in the appropriate build directory.
For example, for the human genome,
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/

This file has one line for each annotation, with the feature name, feature_id and
feature_type columns indicating the name and type of feature. Note that the GeneID 
value in the feature_id column can be used to find all locations for a gene by GeneID.

WARNING: Positions in seq_gene.md files are one-based, not 0-based

NOTE: if genes are merged after an annotation is released, there 
may be more than one location reported on a genomic sequence 
per GeneID, each resulting from the annotation before the merge.

orientation:

orientation of the gene feature on the genomic accession,
'?' if not applicable

assembly:

the name of the assembly
'-' if not applicable

mature peptide accession.version:

will be null (-) if absent

mature peptide gi:

the gi for a mature peptide accession, '-' if not applicable

Symbol:

the default symbol for the gene

OK then. So. Lots of little problems and gaps. Downloading a newer blast nt database along with human_genomic, refseq_genomic and other_genomic. I’m going to extract accession/gi/taxid from all and see if that works.

update_blastdb.pl nt
update_blastdb.pl refseq_genomic
update_blastdb.pl other_genomic
update_blastdb.pl human_genomic

cat refseq_genomic.??.tar.gz.md5 > refseq_genomic.md5list
gmd5sum -c refseq_genomic.md5list
echo 'for file in `ls refseq_genomic.??.tar.gz` ; do tar xfzv $file ; done' | bash

Imported Nodes and Names

Nodes and Names came from ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz ( I did not import all of nodes, or any of the other data. ) ( I only use the scientific name, so next time I will only import those. )

*.dmp files are bcp-like dump from GenBank taxonomy database.

General information.
Field terminator is "\t|\t"
Row terminator is "\t|\n"

nodes.dmp file consists of taxonomy nodes. The description for each node includes the following
fields:
	tax_id					-- node id in GenBank taxonomy database
	parent tax_id				-- parent node id in GenBank taxonomy database
	rank					-- rank of this node (superkingdom, kingdom, ...) 
	embl code				-- locus-name prefix; not unique
	division id				-- see division.dmp file
	inherited div flag  (1 or 0)		-- 1 if node inherits division from parent
	genetic code id				-- see gencode.dmp file
	inherited GC  flag  (1 or 0)		-- 1 if node inherits genetic code from parent
	mitochondrial genetic code id		-- see gencode.dmp file
	inherited MGC flag  (1 or 0)		-- 1 if node inherits mitochondrial gencode from parent
	GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
	hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
	comments				-- free-text comments and citations

Taxonomy names file (names.dmp):
	tax_id					-- the id of node associated with this name
	name_txt				-- name itself
	unique name				-- the unique variant of this name if name not unique
	name class				-- (synonym, common name, ...)

After downloading newer versions of names and nodes,

bundle exec rake app:names:import
bundle exec rake app:nodes:import
bundle exec rake app:nodes:import_scientific_names
bundle exec rake app:nodes:build_nested_set
bundle exec rake sunspot:reindex

This will require updating of the preset taxonomy ranges in BlastResults. Get the lft and rgt values from .…

Node.where(:scientific_name => "Viruses").first (taxid 10239)

Node.where(:scientific_name => "Bacteria").order('depth ASC').first		#	two actually!!! (taxid 2)

Node.where(:scientific_name => "Homo Sapiens").first.taxid  (really deep) (taxid 9606)
=> 9606
Node.where(:taxid => 9606).first.parent.parent.parent.parent.parent.parent.parent.parent.taxid
=> 9443
Primates (taxid 9443)

May have to check the preset taxids for Virus and such, but I would expect the taxids to NOT change, but the lft and rgt very much may have.

Java Error

Getting .…

RSolr::Error::Http (RSolr::Error::Http - 500 Internal Server Error
Error:     Java heap space

java.lang.OutOfMemoryError: Java heap space

How to get current value?

java -XX:+PrintFlagsFinal -version

How to set new value?

config/sunspot.yml allows for adjusting max and min

Java Version

I have found on my mac, that using Sunspot with Java 1.7 will cause all kinds of random grief.

This sucks now as sunspot 2.0.0 needs patches to work with rails 4.

This is all good now. Yay!

Running

If brand new mariadb installation, MUST do this FIRST. as it creates files with the proper ownership and permissions in /opt/local/var/db/mariadb/

sudo mysql_install_db --user=mysql

sudo /opt/local/share/mariadb/support-files/mysql.server start

bundle exec rake sunspot:solr:start

Reimport (~30 million) file_names = BlastResult.group(:file_name).count select = file_names.keys.select{|f|f.match(/^fallon_SFP/)}.select{|f|f.match(/blastn.txt$/)} BlastResult.where(:file_name => select).find_each{|b|b.destroy} bundle exec rake app:blast_results:import file=“blast_results/fallon_SFPB001?/trinity_non_human_{single,paired}.blastn.txt”

bundle exec rake app:blast_results:import file=“blast_results/fallon_SFPB001/trinity_non_human_single.blastn.txt“

BlastResult.where(:file_name => “fallon_SFPB001B_filtered_20130731_trinity_non_human_single.blastn.txt”).find_each{|b| b.delete }

BlastResult.where(BlastResult.arel_table.gt(19895047)).find_each{|b|b.index}

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
app		app
config		config
db		db
doc		doc
lib		lib
log		log
public		public
script		script
solr		solr
test		test
vendor		vendor
.gitignore		.gitignore
Gemfile		Gemfile
Gemfile.lock		Gemfile.lock
MIT-LICENSE		MIT-LICENSE
README.rdoc		README.rdoc
Rakefile		Rakefile
config.ru		config.ru

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Taxonomy

Another piece of new info

Confusion?

Imported Identifiers

Imported Nodes and Names

Java Error

Java Version

Running

About

Uh oh!

Releases

Packages

Languages

License

ccls/taxonomy

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages