Development
Matt Thompson edited this page Dec 16, 2025 · 3 revisions
Clone the repository and install the package in development mode with an activated virtual environment:
git clone git@github.com:Imageomics/TaxonoPy.git
cd TaxonoPy
Set up and activate a virtual environment.
Install the package in development mode:
pip install -e ".[dev]"
TaxonoPy integrates with the GNVerifier API using types generated from its OpenAPI specification.
The script scripts/generate_gnverifier_types.py handles this: it saves the specification to api_specs/gnverifier_openapi.json and generates src/taxonopy/types/gnverifier.py from it.
To check for changes in the OpenAPI specification, run:
python scripts/generate_gnverifier_types.py
- Identify the namespace via CLI
taxonopy --cache-dir ~/diskcache \
--cache-input /fs/ess/PAS2136/thompsonmj/projects/TaxonoPy/redlist_species_data_9dfee105-4f82-4d9e-bac6-cb0b39b7cd9c/taxonopy_input \
--cache-stats
Output:
TaxonoPy Cache Statistics:
namespace: /users/PAS2136/thompsonmj/diskcache/resolve_v0.1.0b0_27496709ec8027fe
total_size_bytes: 170095739
db_file_count: 3
entry_count: 79869
meta_count: 79869
prefix_counts: {'taxonomic_entries': 1, 'entry_groups': 1, 'resolution_chain': 79867}
- Inspect the namespace in a REPL
from itertools import islice
from diskcache import Cache
namespace = "/users/PAS2136/thompsonmj/diskcache/resolve_v0.1.0b0_27496709ec8027fe"
cache = Cache(directory=namespace)
# List the first few keys
for key in islice(cache.iterkeys(), 6):
    print(key)
# Example output:
# entry_groups_ac704841132e2e625fc24aa8c8b9f90b
# entry_groups_ac704841132e2e625fc24aa8c8b9f90b::meta
# resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c
# resolution_chain_000096fc60c37d82fc5a7099f45d402217ba2d240bd139615926ce439db82a6c::meta
# resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30
# resolution_chain_0000d930710191d6390d9b5a58984d1337f1c21a9630666e2fad5ad59e459c30::meta
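The key listing above suggests a naming pattern: a prefix, an underscore, and a hash, with a `::meta` companion key for each value. As a minimal sketch under that assumption, the prefix counts reported by `--cache-stats` can be approximated from the key names alone. The helper name `count_prefixes` is hypothetical, and this is not necessarily how TaxonoPy computes its statistics.

```python
from collections import Counter

META_SUFFIX = "::meta"


def count_prefixes(keys):
    """Count value keys per prefix, ignoring their '::meta' companions.

    Assumes each key looks like '<prefix>_<hash>', as in the listing above.
    """
    counts = Counter()
    for key in keys:
        if not isinstance(key, str) or key.endswith(META_SUFFIX):
            continue
        prefix, sep, _hash = key.rpartition("_")
        if sep:  # skip keys that do not match the assumed pattern
            counts[prefix] += 1
    return dict(counts)
```

In the REPL session above, `count_prefixes(cache.iterkeys())` would scan every key, which may take a moment on a large namespace.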
- Inspect metadata for a key
value_key = "entry_groups_ac704841132e2e625fc24aa8c8b9f90b"
meta = cache.get(f"{value_key}::meta")
print(meta)
# -> {'checksum': 'fe190ad3f491a023d6d630ad78fead71e0877da731c7132e62d8bdf04e0565c6', 'timestamp': '2025-12-16T10:09:30.188566', 'serializer': 'diskcache', 'version': 1, 'function': 'create_entry_groups', 'execution_time': 1.3435392379760742}
- Inspect the cached data
value = cache.get(value_key)
print(len(value)) # number of entries cached
print(value[0]) # first TaxonomicEntry object (dataclass)
For a resolution chain:
chain_key = "resolution_chain_35a9570c5995fb46e23b36163061b42e4901b33749c039e8fd9e8c8eb3ad3fdd"
chain = cache.get(chain_key)
print(len(chain)) # Number of attempts in the chain
print(chain[-1]["status"]) # Final status
print(chain[-1]["resolved_classification"])
print(chain[-1]) # All fields in final chain link
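To look at more than one chain at a time, the same lookups can be wrapped in a loop. The following is a rough sketch rather than part of TaxonoPy: the function name `final_status_counts` is hypothetical, and it assumes every `resolution_chain_*` value is a list of dict-like links with a `"status"` field, as in the single-chain example above.

```python
from collections import Counter

CHAIN_PREFIX = "resolution_chain_"
META_SUFFIX = "::meta"


def final_status_counts(cache):
    """Tally the final status of every cached resolution chain.

    `cache` can be any object that iterates over its keys and offers
    .get(key), such as a diskcache.Cache or a plain dict.
    """
    counts = Counter()
    for key in cache:
        if not isinstance(key, str):
            continue
        if not key.startswith(CHAIN_PREFIX) or key.endswith(META_SUFFIX):
            continue
        chain = cache.get(key)
        if chain:  # skip empty or missing chains
            counts[chain[-1]["status"]] += 1
    return dict(counts)
```

Calling `final_status_counts(cache)` in the REPL session above reads every chain from disk, so it may be slow on a namespace with tens of thousands of entries.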
- Why inspect the cache?
- Provenance debugging: the cache holds the full attempt chains per entry group. Inspecting them helps explain why a particular entry resolved (or failed) without rerunning GNVerifier.
- Performance tuning: the metadata stores execution times for cached functions, so you can monitor how long parsing/grouping took in past runs.
- Offline reanalysis: by reading entry_groups_* and taxonomic_entries_* directly, you can reconstruct the intermediate datasets for ad-hoc analysis without reprocessing the raw files.
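The performance-tuning point can be sketched as a pass over the `::meta` entries. This is an illustrative helper, not an existing TaxonoPy function: the name `summarize_execution_times` is hypothetical, and it assumes each metadata dict carries the `function` and `execution_time` fields seen in the metadata example (with times presumably in seconds).

```python
META_SUFFIX = "::meta"


def summarize_execution_times(cache):
    """Return {function_name: (call_count, total_execution_time)}.

    `cache` can be any object that iterates over its keys and offers
    .get(key), such as a diskcache.Cache or a plain dict.
    """
    summary = {}
    for key in cache:
        if not isinstance(key, str) or not key.endswith(META_SUFFIX):
            continue
        meta = cache.get(key) or {}
        if "function" not in meta or "execution_time" not in meta:
            continue  # metadata without the assumed fields is ignored
        count, total = summary.get(meta["function"], (0, 0.0))
        summary[meta["function"]] = (count + 1, total + meta["execution_time"])
    return summary
```

With a cache opened as in the REPL steps, `summarize_execution_times(cache)` gives a per-function view of where time was spent in past runs.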
Eventually, these types of insights should be incorporated into trace functionality.