Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -141,3 +141,4 @@ jsonid-integration-files/

# Secreta
token.pypi
jsonid_pronom.xml
54 changes: 54 additions & 0 deletions README.md
@@ -413,6 +413,60 @@ PRONOM IDs that can then be referenced in the JSONID output.
Eventually, PRONOM or a PRONOM-like tool might host an authoritative version
of the JSONID registry.

### JSONID for PRONOM Signature Development

JSONID provides a high-level language for generating PRONOM-compatible
signatures. The feature set is still in BETA, but it already offers
two distinct capabilities:

#### 1. Registry output

JSONID's registry can be output using the `--pronom` flag. A signature file
named `jsonid_pronom.xml` is created, which can be imported into DROID
to identify document types registered with JSONID.

The registry is output alongside a handful of baseline JSON signatures
designed to capture "plain" JSON that is not yet encoded in the registry.

#### 2. Signature development

A standalone `json2pronom` utility is provided for creating
DROID-compatible signatures that aim to be robust.

Because JSONID acts as a high-level language, signatures can be defined in
easy-to-understand syntax and then output consistently via `json2pronom`.
Generated signatures include sensible defaults for whitespace and other
details that are difficult for signature developers to anticipate
consistently when writing JSON-based signatures.

Given a [sample pattern file](./pronom_example/patterns_example.json) a DROID
compatible snippet can be output as follows (UTF-8 shown for brevity):

<!--markdownlint-disable-->

```xml
<?xml version="1.0" ?>
<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="1" DateCreated="2026-01-04T16:14:16Z">
<InternalSignatureCollection>
<InternalSignature ID="1" Specificity="Specific">
<ByteSequence Reference="BOF" Sequence="{0-4095}7B" MinOffset="0" MaxOffset="4095"/>
<ByteSequence Reference="VAR" Sequence="226B65793122{0-16}3A" MinOffset="" MaxOffset=""/>
<ByteSequence Reference="VAR" Sequence="226B65793222{0-16}3A" MinOffset="" MaxOffset=""/>
<ByteSequence Reference="EOF" Sequence="7D{0-4095}" MinOffset="0" MaxOffset="4095"/>
</InternalSignature>
</InternalSignatureCollection>
<FileFormatCollection>
<FileFormat ID="1" Name="JSONID2PRONOM Conversion (UTF-8)" PUID="jsonid2pronom/1" Version="" MIMEType="application/json" FormatType="structured text">
<InternalSignatureID>1</InternalSignatureID>
<Extension>json</Extension>
</FileFormat>
</FileFormatCollection>
</FFSignatureFile>
```

<!--markdownlint-enable-->
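
The `VAR` byte sequences in the snippet above can be read as a quoted JSON
key, a bounded gap for optional whitespace, and a colon. The following is a
minimal sketch of that mapping for UTF-8 only; `key_to_sequence` and its
`max_gap` parameter are hypothetical names used for illustration and are not
part of the `json2pronom` API.

```python
def key_to_sequence(key: str, max_gap: int = 16) -> str:
    """Build a PRONOM-style sequence for a JSON key followed by a colon.

    Hypothetical helper: UTF-8 only, illustrating the pattern seen in the
    example output rather than JSONID's actual implementation.
    """
    # Hex-encode the key including its surrounding double quotes.
    quoted = f'"{key}"'.encode("utf-8").hex().upper()
    # Hex-encode the colon that separates a key from its value.
    colon = ":".encode("utf-8").hex().upper()
    # Allow between 0 and max_gap arbitrary bytes (e.g. whitespace) between them.
    return f"{quoted}{{0-{max_gap}}}{colon}"


print(key_to_sequence("key1"))  # 226B65793122{0-16}3A
print(key_to_sequence("key2"))  # 226B65793222{0-16}3A
```

The two printed values match the `VAR` sequences shown in the XML example,
which suggests the sample pattern file matches on the keys `key1` and `key2`.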

Feedback on this utility is welcome.

## Output format

Previously JSONID output YAML containing all result object metadata. It has
145 changes: 144 additions & 1 deletion docs/jsonid/export.html
@@ -55,7 +55,7 @@ <h2 class="section-title" id="header-functions">Functions</h2>
</summary>
<pre><code class="python">def exportJSON() -&gt; None: # pylint: disable=C0103
&#34;&#34;&#34;Export to JSON.&#34;&#34;&#34;
-logger.debug(&#34;exporting registry ad JSON&#34;)
+logger.debug(&#34;exporting registry as JSON&#34;)
data = registry_data.registry()
json_obj = []
id_ = {
@@ -74,9 +74,144 @@ <h2 class="section-title" id="header-functions">Functions</h2>
</details>
<div class="desc"><p>Export to JSON.</p></div>
</dd>
<dt id="src.jsonid.export.export_pronom"><code class="name flex">
<span>def <span class="ident">export_pronom</span></span>(<span>) ‑> None</span>
</code></dt>
<dd>
<details class="source">
<summary>
<span>Expand source code</span>
</summary>
<pre><code class="python">def export_pronom() -&gt; None:
&#34;&#34;&#34;Export a PRONOM compatible set of signatures.

Export is done in two phases. A set of proposed &#34;Baseline&#34; JSON
signatures to catch many JSON instances.

Second the JSONID registry is exported.

Every export has a priority over the other so that there should
be no multiple identification results.
&#34;&#34;&#34;

# pylint: disable=R0914; too-many local variables.

logger.debug(&#34;exporting registry as PRONOM&#34;)

reg_data = registry_data.registry()
formats = []

encodings = (&#34;UTF-8&#34;, &#34;UTF-16&#34;, &#34;UTF-16BE&#34;, &#34;UTF-32LE&#34;)
priorities = []

increment_id = 0

for encoding in encodings:
all_baseline = pronom.create_baseline_json_sequences(encoding)
for baseline in all_baseline:
increment_id += 1
fmt = pronom.Format(
id=increment_id,
name=f&#34;JSON (Baseline - fmt/817) ({encoding})&#34;,
version=&#34;&#34;,
puid=&#34;jsonid:0000&#34;,
mime=&#34;application/json&#34;,
classification=&#34;structured text&#34;,
external_signatures=[
pronom.ExternalSignature(
id=increment_id,
signature=&#34;json&#34;,
type=pronom.EXT,
)
],
internal_signatures=[baseline],
priorities=priorities,
)
priorities.append(f&#34;{increment_id}&#34;)
formats.append(fmt)

for encoding in encodings:
for entry in reg_data:
increment_id += 1
json_puid = f&#34;{entry.json()[&#39;identifier&#39;]};{encoding}&#34;
name_ = f&#34;{entry.json()[&#39;name&#39;][0][&#39;@en&#39;]} ({encoding})&#34;
markers = entry.json()[&#34;markers&#34;]
try:
mime = entry.json()[&#34;mime&#34;][0]
except IndexError:
mime = &#34;&#34;
try:
sequences = pronom.process_markers(
copy.deepcopy(markers),
increment_id,
encoding=encoding,
)
except pronom.UnprocessableEntity as err:
logger.error(
&#34;%s %s: cannot handle: %s&#34;,
json_puid,
name_,
err,
)
for marker in markers:
logger.debug(&#34;--- START ---&#34;)
logger.debug(&#34;marker: %s&#34;, marker)
logger.debug(&#34;--- END ---&#34;)
continue
fmt = pronom.Format(
id=increment_id,
name=name_,
version=&#34;&#34;,
puid=json_puid,
mime=mime,
classification=&#34;structured text&#34;,
external_signatures=[
pronom.ExternalSignature(
id=increment_id,
signature=&#34;json&#34;,
type=pronom.EXT,
)
],
internal_signatures=sequences,
priorities=copy.deepcopy(list(set(priorities))),
)
priorities.append(f&#34;{increment_id}&#34;)
formats.append(fmt)

pronom.process_formats_and_save(formats, PRONOM_FILENAME)</code></pre>
</details>
<div class="desc"><p>Export a PRONOM compatible set of signatures.</p>
<p>Export is done in two phases. A set of proposed "Baseline" JSON
signatures to catch many JSON instances.</p>
<p>Second the JSONID registry is exported.</p>
<p>Every export has a priority over the other so that there should
be no multiple identification results.</p></div>
</dd>
</dl>
</section>
<section>
<h2 class="section-title" id="header-classes">Classes</h2>
<dl>
<dt id="src.jsonid.export.PRONOMException"><code class="flex name class">
<span>class <span class="ident">PRONOMException</span></span>
<span>(</span><span>*args, **kwargs)</span>
</code></dt>
<dd>
<details class="source">
<summary>
<span>Expand source code</span>
</summary>
<pre><code class="python">class PRONOMException(Exception):
&#34;&#34;&#34;Exception class if we can&#39;t create a PRONOM signature as expected.&#34;&#34;&#34;</code></pre>
</details>
<div class="desc"><p>Exception class if we can't create a PRONOM signature as expected.</p></div>
<h3>Ancestors</h3>
<ul class="hlist">
<li>builtins.Exception</li>
<li>builtins.BaseException</li>
</ul>
</dd>
</dl>
</section>
</article>
<nav id="sidebar">
@@ -92,6 +227,14 @@ <h2 class="section-title" id="header-functions">Functions</h2>
<li><h3><a href="#header-functions">Functions</a></h3>
<ul class="">
<li><code><a title="src.jsonid.export.exportJSON" href="#src.jsonid.export.exportJSON">exportJSON</a></code></li>
<li><code><a title="src.jsonid.export.export_pronom" href="#src.jsonid.export.export_pronom">export_pronom</a></code></li>
</ul>
</li>
<li><h3><a href="#header-classes">Classes</a></h3>
<ul>
<li>
<h4><code><a title="src.jsonid.export.PRONOMException" href="#src.jsonid.export.PRONOMException">PRONOMException</a></code></h4>
</li>
</ul>
</li>
</ul>
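
The priority bookkeeping in `export_pronom` can be illustrated in isolation:
each format receives an incrementing ID together with a snapshot of every ID
issued before it. The sketch below is a simplified stand-in that uses plain
dictionaries instead of `pronom.Format`; `assign_priorities` is a hypothetical
name, and how JSONID consumes the priorities afterwards is not shown in this
diff.

```python
import copy


def assign_priorities(names):
    """Give each entry an incrementing ID plus a snapshot of all earlier IDs.

    Hypothetical illustration of the bookkeeping only, not JSONID's code.
    """
    seen = []  # IDs issued so far, stored as strings like in the export
    entries = []
    for increment_id, name in enumerate(names, start=1):
        entries.append(
            {
                "id": increment_id,
                "name": name,
                # Copy the list so later appends do not change this entry.
                "priorities": copy.deepcopy(seen),
            }
        )
        seen.append(f"{increment_id}")
    return entries


for entry in assign_priorities(["baseline", "format-a", "format-b"]):
    print(entry["id"], entry["name"], entry["priorities"])
```

Taking a copy at each step matters: passing the shared list itself would let
every entry observe IDs that were appended after it was created.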
118 changes: 118 additions & 0 deletions docs/jsonid/export_helpers.html
@@ -0,0 +1,118 @@
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1">
<meta name="generator" content="pdoc3 0.11.6">
<title>src.jsonid.export_helpers API documentation</title>
<meta name="description" content="Helpers for the export functions.">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/10up-sanitize.css/13.0.0/sanitize.min.css" integrity="sha512-y1dtMcuvtTMJc1yPgEqF0ZjQbhnc/bFhyvIyVNb9Zk5mIGtqVaAB1Ttl28su8AvFMOY0EwRbAe+HCLqj6W7/KA==" crossorigin>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/10up-sanitize.css/13.0.0/typography.min.css" integrity="sha512-Y1DYSb995BAfxobCkKepB1BqJJTPrOp3zPL74AWFugHHmmdcvO+C48WLrUOlhGMc0QG7AE3f7gmvvcrmX2fDoA==" crossorigin>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/styles/default.min.css" crossorigin>
<style>:root{--highlight-color:#fe9}.flex{display:flex !important}body{line-height:1.5em}#content{padding:20px}#sidebar{padding:1.5em;overflow:hidden}#sidebar > *:last-child{margin-bottom:2cm}.http-server-breadcrumbs{font-size:130%;margin:0 0 15px 0}#footer{font-size:.75em;padding:5px 30px;border-top:1px solid #ddd;text-align:right}#footer p{margin:0 0 0 1em;display:inline-block}#footer p:last-child{margin-right:30px}h1,h2,h3,h4,h5{font-weight:300}h1{font-size:2.5em;line-height:1.1em}h2{font-size:1.75em;margin:2em 0 .50em 0}h3{font-size:1.4em;margin:1.6em 0 .7em 0}h4{margin:0;font-size:105%}h1:target,h2:target,h3:target,h4:target,h5:target,h6:target{background:var(--highlight-color);padding:.2em 0}a{color:#058;text-decoration:none;transition:color .2s ease-in-out}a:visited{color:#503}a:hover{color:#b62}.title code{font-weight:bold}h2[id^="header-"]{margin-top:2em}.ident{color:#900;font-weight:bold}pre code{font-size:.8em;line-height:1.4em;padding:1em;display:block}code{background:#f3f3f3;font-family:"DejaVu Sans Mono",monospace;padding:1px 4px;overflow-wrap:break-word}h1 code{background:transparent}pre{border-top:1px solid #ccc;border-bottom:1px solid #ccc;margin:1em 0}#http-server-module-list{display:flex;flex-flow:column}#http-server-module-list div{display:flex}#http-server-module-list dt{min-width:10%}#http-server-module-list p{margin-top:0}.toc ul,#index{list-style-type:none;margin:0;padding:0}#index code{background:transparent}#index h3{border-bottom:1px solid #ddd}#index ul{padding:0}#index h4{margin-top:.6em;font-weight:bold}@media (min-width:200ex){#index .two-column{column-count:2}}@media (min-width:300ex){#index .two-column{column-count:3}}dl{margin-bottom:2em}dl dl:last-child{margin-bottom:4em}dd{margin:0 0 1em 3em}#header-classes + dl > dd{margin-bottom:3em}dd dd{margin-left:2em}dd p{margin:10px 0}.name{background:#eee;font-size:.85em;padding:5px 10px;display:inline-block;min-width:40%}.name:hover{background:#e0e0e0}dt:target 
.name{background:var(--highlight-color)}.name > span:first-child{white-space:nowrap}.name.class > span:nth-child(2){margin-left:.4em}.inherited{color:#999;border-left:5px solid #eee;padding-left:1em}.inheritance em{font-style:normal;font-weight:bold}.desc h2{font-weight:400;font-size:1.25em}.desc h3{font-size:1em}.desc dt code{background:inherit}.source > summary,.git-link-div{color:#666;text-align:right;font-weight:400;font-size:.8em;text-transform:uppercase}.source summary > *{white-space:nowrap;cursor:pointer}.git-link{color:inherit;margin-left:1em}.source pre{max-height:500px;overflow:auto;margin:0}.source pre code{font-size:12px;overflow:visible;min-width:max-content}.hlist{list-style:none}.hlist li{display:inline}.hlist li:after{content:',\2002'}.hlist li:last-child:after{content:none}.hlist .hlist{display:inline;padding-left:1em}img{max-width:100%}td{padding:0 .5em}.admonition{padding:.1em 1em;margin:1em 0}.admonition-title{font-weight:bold}.admonition.note,.admonition.info,.admonition.important{background:#aef}.admonition.todo,.admonition.versionadded,.admonition.tip,.admonition.hint{background:#dfd}.admonition.warning,.admonition.versionchanged,.admonition.deprecated{background:#fd4}.admonition.error,.admonition.danger,.admonition.caution{background:lightpink}</style>
<style media="screen and (min-width: 700px)">@media screen and (min-width:700px){#sidebar{width:30%;height:100vh;overflow:auto;position:sticky;top:0}#content{width:70%;max-width:100ch;padding:3em 4em;border-left:1px solid #ddd}pre code{font-size:1em}.name{font-size:1em}main{display:flex;flex-direction:row-reverse;justify-content:flex-end}.toc ul ul,#index ul ul{padding-left:1em}.toc > ul > li{margin-top:.5em}}</style>
<style media="print">@media print{#sidebar h1{page-break-before:always}.source{display:none}}@media print{*{background:transparent !important;color:#000 !important;box-shadow:none !important;text-shadow:none !important}a[href]:after{content:" (" attr(href) ")";font-size:90%}a[href][title]:after{content:none}abbr[title]:after{content:" (" attr(title) ")"}.ir a:after,a[href^="javascript:"]:after,a[href^="#"]:after{content:""}pre,blockquote{border:1px solid #999;page-break-inside:avoid}thead{display:table-header-group}tr,img{page-break-inside:avoid}img{max-width:100% !important}@page{margin:0.5cm}p,h2,h3{orphans:3;widows:3}h1,h2,h3,h4,h5,h6{page-break-after:avoid}}</style>
<script defer src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.9.0/highlight.min.js" integrity="sha512-D9gUyxqja7hBtkWpPWGt9wfbfaMGVt9gnyCvYa+jojwwPHLCzUm5i8rpk7vD7wNee9bA35eYIjobYPaQuKS1MQ==" crossorigin></script>
<script>window.addEventListener('DOMContentLoaded', () => {
hljs.configure({languages: ['bash', 'css', 'diff', 'graphql', 'ini', 'javascript', 'json', 'plaintext', 'python', 'python-repl', 'rust', 'shell', 'sql', 'typescript', 'xml', 'yaml']});
hljs.highlightAll();
/* Collapse source docstrings */
setTimeout(() => {
[...document.querySelectorAll('.hljs.language-python > .hljs-string')]
.filter(el => el.innerHTML.length > 200 && ['"""', "'''"].includes(el.innerHTML.substring(0, 3)))
.forEach(el => {
let d = document.createElement('details');
d.classList.add('hljs-string');
d.innerHTML = '<summary>"""</summary>' + el.innerHTML.substring(3);
el.replaceWith(d);
});
}, 100);
})</script>
</head>
<body>
<main>
<article id="content">
<header>
<h1 class="title">Module <code>src.jsonid.export_helpers</code></h1>
</header>
<section id="section-intro">
<p>Helpers for the export functions.</p>
</section>
<section>
</section>
<section>
</section>
<section>
<h2 class="section-title" id="header-functions">Functions</h2>
<dl>
<dt id="src.jsonid.export_helpers.get_utc_timestamp_now"><code class="name flex">
<span>def <span class="ident">get_utc_timestamp_now</span></span>(<span>)</span>
</code></dt>
<dd>
<details class="source">
<summary>
<span>Expand source code</span>
</summary>
<pre><code class="python">def get_utc_timestamp_now():
&#34;&#34;&#34;Get a formatted UTC timestamp for &#39;now&#39; that can be used when
a timestamp is needed.
&#34;&#34;&#34;
return datetime.datetime.now(timezone.utc).strftime(UTC_TIME_FORMAT)</code></pre>
</details>
<div class="desc"><p>Get a formatted UTC timestamp for 'now' that can be used when
a timestamp is needed.</p></div>
</dd>
<dt id="src.jsonid.export_helpers.new_prettify"><code class="name flex">
<span>def <span class="ident">new_prettify</span></span>(<span>c)</span>
</code></dt>
<dd>
<details class="source">
<summary>
<span>Expand source code</span>
</summary>
<pre><code class="python">def new_prettify(c):
&#34;&#34;&#34;Remove excess newlines from DOM output.

via: https://stackoverflow.com/a/14493981
&#34;&#34;&#34;
reparsed = parseString(c)
return &#34;\n&#34;.join(
[
line
for line in reparsed.toprettyxml(indent=&#34; &#34; * 2).split(&#34;\n&#34;)
if line.strip()
]
)</code></pre>
</details>
<div class="desc"><p>Remove excess newlines from DOM output.</p>
<p>via: <a href="https://stackoverflow.com/a/14493981">https://stackoverflow.com/a/14493981</a></p></div>
</dd>
</dl>
</section>
<section>
</section>
</article>
<nav id="sidebar">
<div class="toc">
<ul></ul>
</div>
<ul id="index">
<li><h3>Super-module</h3>
<ul>
<li><code><a title="src.jsonid" href="index.html">src.jsonid</a></code></li>
</ul>
</li>
<li><h3><a href="#header-functions">Functions</a></h3>
<ul class="">
<li><code><a title="src.jsonid.export_helpers.get_utc_timestamp_now" href="#src.jsonid.export_helpers.get_utc_timestamp_now">get_utc_timestamp_now</a></code></li>
<li><code><a title="src.jsonid.export_helpers.new_prettify" href="#src.jsonid.export_helpers.new_prettify">new_prettify</a></code></li>
</ul>
</li>
</ul>
</nav>
</main>
<footer id="footer">
<p>Generated by <a href="https://pdoc3.github.io/pdoc" title="pdoc: Python API documentation generator"><cite>pdoc</cite> 0.11.6</a>.</p>
</footer>
</body>
</html>
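
The two helpers documented in `export_helpers` can be tried out on their own.
This is a minimal sketch under one assumption: the value of `UTC_TIME_FORMAT`
is not shown in the diff, so the format string below is inferred from the
`DateCreated` attribute in the README example and may differ from the real
constant.

```python
import datetime
from xml.dom.minidom import parseString

# Assumption: inferred from DateCreated="2026-01-04T16:14:16Z" in the README.
UTC_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def get_utc_timestamp_now() -> str:
    """Return a formatted UTC timestamp for 'now'."""
    return datetime.datetime.now(datetime.timezone.utc).strftime(UTC_TIME_FORMAT)


def new_prettify(xml_text: str) -> str:
    """Pretty-print XML and drop the whitespace-only lines minidom emits."""
    reparsed = parseString(xml_text)
    return "\n".join(
        line
        for line in reparsed.toprettyxml(indent="  ").split("\n")
        if line.strip()
    )


raw = "<root>\n  <child/>\n\n</root>"
print(get_utc_timestamp_now())
print(new_prettify(raw))
```

Filtering on `line.strip()` is what removes the blank lines that
`toprettyxml` produces when the input already contains whitespace text nodes.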
14 changes: 7 additions & 7 deletions docs/jsonid/helpers.html
@@ -273,19 +273,19 @@ <h2 class="section-title" id="header-functions">Functions</h2>
# pylint: disable=R0911

if replace_me.__name__ == &#34;dict&#34;:
-return &#34;map&#34;
+return TYPE_MAP
if replace_me.__name__ == &#34;int&#34;:
-return &#34;integer&#34;
+return TYPE_INTEGER
if replace_me.__name__ == &#34;list&#34;:
-return &#34;list&#34;
+return TYPE_LIST
if replace_me.__name__ == &#34;str&#34;:
-return &#34;string&#34;
+return TYPE_STRING
if replace_me.__name__ == &#34;float&#34;:
-return &#34;float&#34;
+return TYPE_FLOAT
if replace_me.__name__ == &#34;bool&#34;:
-return &#34;bool&#34;
+return TYPE_BOOL
if replace_me.__name__ == &#34;NoneType&#34;:
-return &#34;NoneType&#34;
+return TYPE_NONE
if not isinstance(replace_me, type):
pass
return replace_me</code></pre>
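
The change above swaps string literals for named constants. As a rough
sketch of the same mapping, the constants can drive a lookup table instead of
a chain of `if` statements. The constant values are taken from the literals
they replace in this hunk; `type_label` and `_LABELS` are hypothetical names,
since the enclosing function's name is not visible in the diff.

```python
# Values taken from the string literals replaced in the hunk above.
TYPE_MAP = "map"
TYPE_INTEGER = "integer"
TYPE_LIST = "list"
TYPE_STRING = "string"
TYPE_FLOAT = "float"
TYPE_BOOL = "bool"
TYPE_NONE = "NoneType"

# Hypothetical lookup keyed by the Python type's __name__.
_LABELS = {
    "dict": TYPE_MAP,
    "int": TYPE_INTEGER,
    "list": TYPE_LIST,
    "str": TYPE_STRING,
    "float": TYPE_FLOAT,
    "bool": TYPE_BOOL,
    "NoneType": TYPE_NONE,
}


def type_label(replace_me):
    """Return the label for a Python type, or the input unchanged if unknown."""
    name = getattr(replace_me, "__name__", None)
    return _LABELS.get(name, replace_me)


print(type_label(dict))        # map
print(type_label(type(None)))  # NoneType
print(type_label("other"))     # other
```

Keeping the labels in named constants means callers can compare against
`TYPE_MAP` and friends rather than repeating the string values.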