Can we ignore parsing content at all by adding config key in settings?

We are seriously considering `fscrawler` as our document indexing tool. We process more than 32 million documents per day on average.

In our use case, we index content from an aggregated archive source that contains many types of files (PHP scripts, JS scripts, CSS files, HTML and non-HTML files), mostly source code. In that scenario the Tika parser fails to parse the document most of the time, and as a result those documents are not indexed at all.

### Use case

For example:
```
14:39:02,423 WARN  [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/source_code/november/footer.php]  -> XML parse error -> The value of attribute "src" associated with an element type "img" must not contain the '<' character.
```

The file `footer.php` contained:
```
<img src="{$src}">
```
Tika failed to parse it because the `src` attribute does not hold a valid value. It cannot, because the value is filled in from a PHP variable at render time.
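
To make the failure mode concrete, here is a minimal sketch using Python's standard-library XML parser as a stand-in for a strict XML parser. It is not Tika and not FSCrawler code. The helper name `is_well_formed_xml` is hypothetical, and the second sample (an inline PHP tag inside `src`) is my assumption about the kind of template markup that produces the `'<'` error in the log above.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Plain markup with a literal attribute value parses fine.
plain = '<img src="logo.png"/>'

# Assumed template markup: a raw '<' inside an attribute value is not
# allowed in XML, so a strict parser rejects it.
templated = '<img src="<?php echo $src; ?>"/>'

for sample in (plain, templated):
    print(sample, "->", is_well_formed_xml(sample))
```

Template files are only valid markup after the template engine has run, so any strict parser is likely to reject a large share of them.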

This is just one example. My team and I looked closely at the source code, and it may not be too hard to bypass the parsing step with a configuration key. If you want, I can open a PR.

### My suggestions
`settings.yaml`
```
ignore_tika_parser: true
```
With this set, FSCrawler would skip Tika, extract the raw content of each file, and index that as-is.
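
The behaviour I have in mind can be sketched roughly as follows. This is an illustration in Python, not FSCrawler's actual Java code: the function `extract_content`, the `parse` callback and the `indexed_chars` parameter are hypothetical names, and the 100000-character default only mirrors the limit shown in the log above. UTF-8 decoding with replacement is also an assumption; a real implementation would need a proper charset strategy.

```python
from pathlib import Path
from typing import Callable


def extract_content(
    path: str,
    ignore_tika_parser: bool,
    parse: Callable[[str], str],
    indexed_chars: int = 100_000,
) -> str:
    """Return the text to index for the file at `path`."""
    if ignore_tika_parser:
        # Bypass parsing: read the bytes and decode them as UTF-8,
        # replacing undecodable bytes instead of failing.
        raw = Path(path).read_bytes()
        return raw.decode("utf-8", errors="replace")[:indexed_chars]
    # Default behaviour: delegate to the regular parser.
    return parse(path)
```

With the flag enabled, a file such as `footer.php` would be indexed with its source text verbatim, and a parser error could no longer cause the document to be dropped.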

I am open to discussion on this topic.

**Note:** If this is already implemented, I apologize for raising this issue. We may simply have missed it in the documentation.

Can we ignore parsing content at all by adding config key in settings? #846