client.Search not returning parsed data: epo_ops.ParseSearch not parsing XML properly #2

@lsg551

Description

The OPS API natively returns XML via HTTP. v0.1.0 introduced parsing capabilities for methods, so that users can work with nicely formatted and easy-to-use Go structs instead of clunky stringified XML. Unfortunately, this does not seem to work properly for the Client.Search method, which delegates XML parsing to epo_ops.ParseSearch.

package main

import (
	"context"
	"fmt"

	ops "github.com/patent-dev/epo-ops"
)

const (
	key    = "…"
	secret = "…"
)

func main() {
	client, err := ops.NewClient(&ops.Config{
		ConsumerKey:    key,
		ConsumerSecret: secret,
	})
	if err != nil {
		fmt.Printf("authenticate OPS API: %v", err)
		return
	}

	patents, err := client.Search(context.Background(), "ti=battery", "1-5")
	if err != nil {
		fmt.Printf("search patents: %v", err)
		return
	}

	fmt.Printf("TotalCount: %d\n", patents.TotalCount)
	fmt.Printf("len(.Results): %d\n", len(patents.Results))
	for _, patent := range patents.Results {
		fmt.Printf("- %s\n", patent.DocNumber)
	}
}

$ go run main.go
TotalCount: 10000
len(.Results): 0

There's no documentation for what SearchResultData.TotalCount means, although I guess it is just the total number of results in their database, as reported back by the OPS API. In that case 10,000 (see footnote 1) should be correct for the broad search ti=battery.

However, the actual search results slice patents.Results is empty, although it should contain exactly 5 items.

XML Output

Replace Client.Search with ops.SearchRaw to skip the parsing step and obtain the XML:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="../../style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
    <ops:biblio-search total-result-count="10000" publications-count="5">
        <ops:query syntax="CQL">ti = battery</ops:query>
        <ops:range begin="1" end="5"/>
        <ops:search-result>
            <ops:publication-reference system="ops.epo.org" family-id="78468024">
                <document-id document-id-type="docdb">
                    <country>ES</country>
                    <doc-number>3051365</doc-number>
                    <kind>T3</kind>
                </document-id>
            </ops:publication-reference>
            <ops:publication-reference system="ops.epo.org" family-id="76686681">
                <document-id document-id-type="docdb">
                    <country>ES</country>
                    <doc-number>3051364</doc-number>
                    <kind>T3</kind>
                </document-id>
            </ops:publication-reference>
            <ops:publication-reference system="ops.epo.org" family-id="77338870">
                <document-id document-id-type="docdb">
                    <country>AU</country>
                    <doc-number>2025271196</doc-number>
                    <kind>A1</kind>
                </document-id>
            </ops:publication-reference>
            <ops:publication-reference system="ops.epo.org" family-id="85412822">
                <document-id document-id-type="docdb">
                    <country>AU</country>
                    <doc-number>2025271216</doc-number>
                    <kind>A1</kind>
                </document-id>
            </ops:publication-reference>
            <ops:publication-reference system="ops.epo.org" family-id="74100935">
                <document-id document-id-type="docdb">
                    <country>AU</country>
                    <doc-number>2025271176</doc-number>
                    <kind>A1</kind>
                </document-id>
            </ops:publication-reference>
        </ops:search-result>
    </ops:biblio-search>
</ops:world-patent-data>

Steps to reproduce

Basically just run the above code, but be aware of #1 and my temporary fix for testing: #1 (comment).

Cause

As far as I understand, this is caused by epo_ops.ParseSearch, and in particular by the annotated struct searchXML that it uses to unmarshal the stringified XML.

searchXML seems to be designed for Client.SearchWithConstituents rather than for Client.Search. Apparently, both methods use ops.ParseSearch and try to unmarshal into searchXML, even though the OPS API returns different XML structures depending on the constituents used during the search.

Right now, searchXML expects an XML element named <ops:biblio-search>, whereas the "normal" search (i.e. without constituents) returns its elements within <ops:search-result>. The elements parsed below that level are similar, but not identical, so the current implementation of searchXML won't work for all cases.

Also, it seems like there are no test cases that could have caught this bug.

Proposed Fix

A minimal working example is to modify xml.go as follows:

// xml.go

type searchXML struct {
    // […] existing fields
    // add:
    SearchResult struct {
        Publications []struct {
            System     string `xml:"system,attr"`
            FamilyID   string `xml:"family-id,attr"`
            DocumentID struct {
                Country   string `xml:"country"`
                DocNumber string `xml:"doc-number"`
                Kind      string `xml:"kind"`
            } `xml:"document-id"`
        } `xml:"publication-reference"`
    } `xml:"search-result"`
    // […] existing fields
}

func ParseSearch(xmlData string) (*SearchResultData, error) {
    // […] after line 1113, add

    for _, pub := range raw.BiblioSearch.SearchResult.Publications {
        data.Results = append(data.Results, SearchResult{
            System:    pub.System,
            FamilyID:  pub.FamilyID,
            Country:   pub.DocumentID.Country,
            DocNumber: pub.DocumentID.DocNumber,
            Kind:      pub.DocumentID.Kind,
        })
    }

    // […] remaining code

}

This accounts for the different XML structure when no constituents are used by adding the nested struct SearchResult xml:"search-result" to searchXML. The explicit "parsing" added to ParseSearch is needed because ParseSearch does NOT return a 1:1 XML-to-struct copy, but actively modifies the structure.

This makes it work, although I do NOT recommend actually using it this way. I believe the OPS API can return even more XML structures, depending on the constituents specified.

I can think of two different fixes:

  1. Use searchXML for every possible XML structure that the search returns, with and without constituents. The root elements are similar (identical?) and only the nested elements change slightly. Unmarshal into searchXML and omit nil values. This would hopefully cover all cases. However, SearchResultData and SearchResult, which hold the actual properties of each patent, might also need to be modified accordingly.
  2. Define different structures for each possible XML structure. Either write multiple ParseSearch methods (one for each XML structure) or make ParseSearch aware of this.

Footnotes

  1. IIRC from reading the OPS API docs, 10,000 is the maximum number of search results the OPS API keeps a cursor for, so it will never report more than 10,000 results.
