Description
We are seriously considering FSCrawler as our document indexing tool. We process more than 32 million documents per day on average.
Our use case is indexing content from an aggregated archive source that contains many file types (PHP scripts, JS scripts, CSS files, HTML and non-HTML files), mostly source code. In that scenario the Tika parser fails to parse the document most of the time, and those documents end up not being indexed at all.
Use case
For example,
14:39:02,423 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/source_code/november/footer.php] -> XML parse error -> The value of attribute "src" associated with an element type "img" must not contain the '<' character.
The file footer.php contained:
<img src="{$src}">
Tika failed to parse it because the src attribute is not valid markup. It cannot be, because the src value comes from a PHP variable.
This is just one use case. My team and I looked closely at the source code, and it may not be too hard to bypass the parsing functionality with a configuration key. If you want, I can open a PR.
My suggestions
settings.yaml
ignore_tika_parser: true
With this key set, FSCrawler would skip Tika, extract the raw contents of the file, and index them as-is.
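To make the idea concrete, here is a rough sketch of what the bypass could look like. It is not FSCrawler code: the class `RawContentExtractor`, its constructor parameters, and the `extract` method are hypothetical names used only for illustration, and the character limit is assumed to correspond to the 100000 figure in the log line above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: none of these names exist in FSCrawler.
final class RawContentExtractor {
    // Stand-in for the proposed ignore_tika_parser setting.
    private final boolean ignoreTikaParser;
    // Maximum number of characters to keep (assumed to match the [100000] in the log).
    private final int maxChars;

    RawContentExtractor(boolean ignoreTikaParser, int maxChars) {
        this.ignoreTikaParser = ignoreTikaParser;
        this.maxChars = maxChars;
    }

    // Returns the file contents as text without any parsing or validation.
    String extract(Path file) throws IOException {
        if (!ignoreTikaParser) {
            // The normal Tika path is deliberately left out of this sketch.
            throw new UnsupportedOperationException("Tika parsing is not part of this sketch");
        }
        // Decoding as UTF-8 replaces malformed bytes instead of throwing,
        // so files that are not valid UTF-8 still yield indexable text.
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return raw.length() > maxChars ? raw.substring(0, maxChars) : raw;
    }
}
```

With something like this, footer.php would be indexed with its literal text, template placeholders included, instead of being dropped on a parse error.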
I am open to discussing this.
Note: If this is already implemented, I apologize for raising the issue. We may simply have missed it in the documentation.