Development

Setup

Clone the repository:

git clone git@github.com:Imageomics/TaxonoPy.git
cd TaxonoPy

Set up and activate a virtual environment.

Install the package in development mode:

pip install -e ".[dev]"

OpenAPI Specification Management and Type Generation

TaxonoPy integrates with the GNVerifier API and generates its types from GNVerifier's OpenAPI specification.

The script scripts/generate_gnverifier_types.py handles this: it saves the specification to api_specs/gnverifier_openapi.json and generates src/taxonopy/types/gnverifier.py from it.

To check for changes in the OpenAPI specification, run:

python scripts/generate_gnverifier_types.py
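
If you want to confirm whether the specification actually changed between two runs, one option is to compare fingerprints of the saved JSON. The following is a minimal sketch, not the logic of generate_gnverifier_types.py itself; the helper names spec_fingerprint and spec_changed are hypothetical and exist only for illustration.

```python
import hashlib
import json


def spec_fingerprint(spec: dict) -> str:
    """Return a SHA-256 digest of a spec that ignores key order and whitespace."""
    # Hypothetical helper: canonicalize the JSON so formatting-only
    # differences do not register as changes.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def spec_changed(old_spec: dict, new_spec: dict) -> bool:
    """Return True when the two specs differ in content."""
    return spec_fingerprint(old_spec) != spec_fingerprint(new_spec)
```

To use it, load a copy of api_specs/gnverifier_openapi.json with json.load before running the script, load the file again afterwards, and pass both dictionaries to spec_changed.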

Cache inspection

Identify the namespace via CLI

taxonopy --cache-dir ~/diskcache \
     --cache-input /fs/ess/PAS2136/thompsonmj/projects/TaxonoPy/redlist_species_data_9dfee105-4f82-4d9e-bac6-cb0b39b7cd9c/taxonopy_input \
     --cache-stats

Output:

TaxonoPy Cache Statistics:
  namespace: /users/PAS2136/thompsonmj/diskcache/resolve_v0.1.0b0_27496709ec8027fe
  total_size_bytes: 170095739
  db_file_count: 3
  entry_count: 79869
  meta_count: 79869
  prefix_counts: {'taxonomic_entries': 1, 'entry_groups': 1, 'resolution_chain': 79867}

Inspect the namespace in a REPL

from itertools import islice
from diskcache import Cache

namespace = "/users/PAS2136/thompsonmj/diskcache/resolve_v0.1.0b0_27496709ec8027fe"
cache = Cache(directory=namespace)

# List the first few keys
for key in islice(cache.iterkeys(), 6):
    print(key)

# Example output:
# entry_groups_ac704841132e2e625fc24aa8c8b9f90b
# entry_groups_ac704841132e2e625fc24aa8c8b9f90b::meta
# resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c
# resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c::meta
# resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30
# resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30::meta
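
The key listing suggests a simple layout: each value key is a prefix, an underscore, and a hex digest, with a companion key ending in ::meta. Assuming that layout holds for the whole namespace (an inference from the example output, not a documented guarantee), a tally similar to prefix_counts can be sketched as follows. The function name count_prefixes is hypothetical.

```python
from collections import Counter

META_SUFFIX = "::meta"


def count_prefixes(keys):
    """Count value keys per prefix, skipping the companion ::meta keys."""
    counts = Counter()
    for key in keys:
        if key.endswith(META_SUFFIX):
            continue
        # Assumption: the digest after the last underscore has no underscores.
        prefix, _, _digest = key.rpartition("_")
        counts[prefix] += 1
    return dict(counts)


# The six keys shown in the example output above.
sample_keys = [
    "entry_groups_ac704841132e2e625fc24aa8c8b9f90b",
    "entry_groups_ac704841132e2e625fc24aa8c8b9f90b::meta",
    "resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c",
    "resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c::meta",
    "resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30",
    "resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30::meta",
]
print(count_prefixes(sample_keys))
# -> {'entry_groups': 1, 'resolution_chain': 2}
```

In the REPL session, count_prefixes(cache.iterkeys()) applies the same tally to the full namespace.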

Inspect metadata for a key

value_key = "entry_groups_ac704841132e2e625fc24aa8c8b9f90b"
meta = cache.get(f"{value_key}::meta")
print(meta)

# -> {'checksum': 'fe190ad3f491a023d6d630ad78fead71e0877da731c7132e62d8bdf04e0565c6', 'timestamp': '2025-12-16T10:09:30.188566', 'serializer': 'diskcache', 'version': 1, 'function': 'create_entry_groups', 'execution_time': 1.3435392379760742}

Inspect the cached data

value = cache.get(value_key)
print(len(value))           # number of entries cached
print(value[0])             # first TaxonomicEntry object (dataclass)

For a resolution chain:

chain_key = "resolution_chain_35a9570c5995fb46e23b36163061b42e4901b33749c039e8fd9e8c8eb3ad3fdd"
chain = cache.get(chain_key)

print(len(chain))           # Number of attempts in the chain
print(chain[-1]["status"])  # Final status
print(chain[-1]["resolved_classification"])
print(chain[-1])            # All fields in final chain link

Why inspect the cache?

Provenance debugging: the cache holds the full attempt chains per entry group. Inspecting them helps explain why a particular entry resolved (or failed) without rerunning GNVerifier.
Performance tuning: the metadata stores execution times for cached functions, so you can monitor how long parsing/grouping took in past runs.
Offline reanalysis: by reading entry_groups_* and taxonomic_entries_* directly, you can reconstruct the intermediate datasets for ad-hoc analysis without reprocessing the raw files.
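
For the performance-tuning case, the per-key metadata can be aggregated by function name. This is a minimal sketch that assumes every ::meta record is a dict with the 'function' and 'execution_time' fields shown in the metadata example; records without them are skipped. The function name execution_times_by_function is hypothetical.

```python
from collections import defaultdict


def execution_times_by_function(cache):
    """Summarize cached execution times per function from ::meta records.

    `cache` can be any object that iterates over its keys and offers
    .get(key), such as a diskcache.Cache or a plain dict.
    """
    times = defaultdict(list)
    for key in cache:
        if not str(key).endswith("::meta"):
            continue
        meta = cache.get(key)
        # Assumption: metadata is a dict; skip anything else.
        if not isinstance(meta, dict):
            continue
        function = meta.get("function")
        elapsed = meta.get("execution_time")
        if function is None or elapsed is None:
            continue
        times[function].append(elapsed)
    return {
        function: {
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for function, values in times.items()
    }
```

Calling execution_times_by_function(cache) on the Cache object opened in the REPL session returns one summary per cached function. On a namespace with tens of thousands of entries this reads every metadata record, so expect it to take a moment.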

Eventually, these types of insights should be incorporated into trace functionality.
