Can we skip content parsing entirely by adding a config key in settings? #846

@shahariaazam

We are seriously considering FSCrawler as our document indexing tool. We process more than 32 million documents per day on average.

Our use case is indexing content from an aggregated archive source that contains many file types (PHP scripts, JS scripts, CSS files, HTML and non-HTML files), mostly source code. In that scenario the Tika parser fails to parse the document most of the time, and as a result those documents are not indexed at all.

Use case

For example,

14:39:02,423 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/source_code/november/footer.php]  -> XML parse error -> The value of attribute "src" associated with an element type "img" must not contain the '<' character.

The file footer.php contained:

<img src="{$src}">

Tika failed to parse it because the src attribute is not valid markup. It cannot be, because the src value comes from a PHP variable.

This is just one use case. My team and I looked closely at the source code, and it may not be too hard to bypass the parsing functionality with a configuration key. If you want, I can open a PR.

My suggestion

settings.yaml

ignore_tika_parser: true

With that set, we would just extract the raw contents of the file and index them.

I am open to discussion on this topic.

Note: If this is already implemented, I apologize for raising this issue. We may simply have missed it in the documentation.
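To make the idea concrete, here is a minimal sketch of what "extract raw contents" could mean when the proposed key is true. Everything in it is an assumption for illustration: the class RawContentExtractor and the method extractRaw are hypothetical names and are not part of FSCrawler, and the 100,000 character cap simply mirrors the limit shown in the log line above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not an FSCrawler class.
final class RawContentExtractor {

    private RawContentExtractor() {}

    /**
     * Reads the file as plain UTF-8 text without any format-aware parsing
     * and returns at most maxChars characters.
     */
    static String extractRaw(Path file, int maxChars) throws IOException {
        // Loads the whole file into memory; a streaming read would be
        // preferable for very large files.
        byte[] bytes = Files.readAllBytes(file);
        // Malformed byte sequences are replaced with U+FFFD instead of
        // raising an exception, so binary-ish files do not abort indexing.
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.length() > maxChars ? text.substring(0, maxChars) : text;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the footer.php example from this issue in a temp file.
        Path tmp = Files.createTempFile("footer", ".php");
        Files.write(tmp, "<img src=\"{$src}\">".getBytes(StandardCharsets.UTF_8));
        System.out.println(extractRaw(tmp, 100_000));
        Files.delete(tmp);
    }
}
```

The trade-off is that no metadata or structure is extracted, and the file is assumed to be UTF-8 text, which seems acceptable for source code but not for formats such as PDF or Office documents.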

Labels: new (For new features or options)
