plugins.indexer package¶
An RVT2 plugin to:
parse documents and directories using Apache Tika, saving the data as JSON.
parse the output from other modules and save the data as JSON.
index these JSON documents into an ElasticSearch server.
query, tag and export the data in an ElasticSearch server.
In order to use these modules, you’ll need a Tika server and/or an ElasticSearch server running on the network.
Submodules¶
plugins.indexer.blindsearches module¶
-
class
plugins.indexer.blindsearches.
BlindSearches
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to annotate documents that match a blind search.
- Configuration:
keyword_tag_field: Identification of the field to use for annotations. Remember
ElasticSearchBulkSender
allows appending to annotation lists by using a field ending in “-new”.
strip_match: If true, return only the identifier of the document and the annotation, if any. If false, return the whole document.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ The path must point to a JSON file, the output of a previous ElasticSearchAdapter.
plugins.indexer.elastic module¶
Modules to index documents parsed by other modules into ElasticSearch.
-
class
plugins.indexer.elastic.
ElasticSearchAdapter
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to adapt the results from other modules to a format suitable to be indexed in bulk into ElasticSearch.
This module also registers its own execution and results in the index defined in configuration rvtindex.
- Configuration:
name: the name of the index in ElasticSearch. The name will be converted to lower case, since ES only accepts lower case names.
doc_type: the doc_type in ElasticSearch. Do not change the default value
"_doc"
, since doc types are deprecated in ES>6.
rvtindex: The name of the index where the run of this module will be registered. The name MUST be in lower case. If empty or None, the job is not registered.
operation: The operation for ElasticSearch. Possible values are “index” (default), which overwrites data, or “update”, which updates existing data with new information. An update always performs an upsert.
casename: The name of the case
server: The URL of the file server to access directly to the files.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ - Returns
An iterator with the adapted JSON.
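As an illustration, a minimal sketch of the shape such an adapted action might take, assuming the bulk format of the elasticsearch-py helpers. The function, the use of the document path as `_id`, and the field choices below are hypothetical, not the module’s actual code:

```python
def adapt(doc, index_name, operation="index"):
    """Wrap a parsed document as one bulk action (elasticsearch-py helpers format)."""
    action = {
        "_op_type": operation,          # "index" overwrites, "update" upserts
        "_index": index_name.lower(),   # ES only accepts lower case index names
        "_id": doc.get("path"),         # illustrative choice of identifier
    }
    if operation == "update":
        # an update always behaves as an upsert (see the operation option above)
        action["doc"] = doc
        action["doc_as_upsert"] = True
    else:
        action["_source"] = doc
    return action

print(adapt({"path": "/evidence/file.txt", "content": "hello"}, "MyCase"))
```

Each adapted document becomes one such action, ready to be written as a line of a bulk file or passed to a bulk helper.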
-
class
plugins.indexer.elastic.
ElasticSearchBulkSender
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to index the results from the
ElasticSearchAdapter
into an ElasticSearch server.
- Configuration:
es_hosts: a space separated list of hosts of ElasticSearch. Example:
http://localhost:9200
. The port is mandatory.
- name: the name of the index in ElasticSearch. If the index does not exist, it is created using mapping.
The name will be converted to lower case, since ES only accepts lower case names.
mapping: If the index name must be created, use this file for initial settings and mappings.
chunk_size: The number of documents to send to elastic in each batch.
tag_fields: A space separated list of names of the fields that include tags. A new tag is appended using the special field “tag_field-new”. For example, you can append to the field “tags” a tag in “tags-new”.
only_test: If True, do not submit the final queries to ElasticSearch but yield them.
offset: ignore this number of lines at the beginning of the file.
restartable: if True, save the current status in the store to allow restarting the job.
max_retries: max number of retries after an HTTP 429 from the server (too many requests)
retry_on_timeout: If True, retry on TimeOut (the server is busy)
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd: Run this command to know the number of actions to send.
cooloff.every: after sending cooloff.every number of items, wait cooloff.seconds.
cooloff.seconds: after sending cooloff.every number of items, wait cooloff.seconds.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ - Parameters
path (str) –
The path to the file to upload to the ElasticSearch server. Each line is an action as described in <https://elasticsearch-py.readthedocs.io/en/master/helpers.html>.
”index” actions are also compatible with elasticdump, and you can upload the file using elasticdump if you prefer.
You MUST use this module if the operation from ElasticSearchAdapter is “update”, since elasticdump always overwrites data.
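To make the file format concrete, here is a hedged sketch of reading such an NDJSON action file in batches, honoring the offset and chunk_size options described above. The function name and its exact behavior are illustrative assumptions, not the module’s code:

```python
import json

def read_actions(lines, offset=0, chunk_size=2):
    """Skip `offset` lines, then yield bulk actions in chunks of `chunk_size`."""
    chunk = []
    for i, line in enumerate(lines):
        if i < offset or not line.strip():
            continue
        chunk.append(json.loads(line))   # one action per NDJSON line
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

ndjson = [
    '{"_index": "mycase", "_id": "1"}\n',
    '{"_index": "mycase", "_id": "2"}\n',
    '{"_index": "mycase", "_id": "3"}\n',
]
for batch in read_actions(ndjson, offset=1):
    print(len(batch))
```

In the real module each batch would then go to the server through a bulk helper, with the retry and cool-off options applied between batches.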
-
class
plugins.indexer.elastic.
ElasticSearchQuery
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Query ES and yield the results.
The path is ignored.
- Configuration section:
- es_hosts
An array of strings with the ES servers.
- name
The name of the index to query. The name will be converted to lower case, since ES only accept lower case names.
- query
The query in lucene language.
- source_includes
a space separated list of fields to include in the answer. Use empty string for all fields.
- source_excludes
a space separated list of fields NOT to include in the answer.
- progress.disable
if True, disable the progress bar.
- max_results
If the query affects more than this number of documents, raise an RVTCritical error to stop the execution. Set to 0 to disable.
- retry_on_timeout
If True, retry after ES returned a timeout error.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
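A minimal sketch of the request body such a query module might build from its options, assuming a lucene-style query_string and standard `_source` filtering. The function and its defaults are illustrative assumptions:

```python
def build_search_body(query, source_includes="", source_excludes=""):
    """Build an ES search body: lucene query plus _source include/exclude lists."""
    body = {"query": {"query_string": {"query": query}}}
    source = {}
    if source_includes:
        # space separated list of fields to include; empty means all fields
        source["includes"] = source_includes.split()
    if source_excludes:
        source["excludes"] = source_excludes.split()
    if source:
        body["_source"] = source
    return body

print(build_search_body("tags:relevant", source_includes="path tags"))
```

The real module would send this body to the configured es_hosts and yield each hit.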
-
class
plugins.indexer.elastic.
ElasticSearchQueryRelated
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Query ES and yield all documents related to the query: containers, attachments…
The path is ignored.
- Configuration section:
- es_hosts
An array of strings with the ES servers.
- name
The name of the index to query. The name will be converted to lower case, since ES only accepts lower case names.
- query
The query in lucene language.
- source_includes
a space separated list of fields to include in the answer. Use empty string for all fields.
- source_excludes
a space separated list of fields NOT to include in the answer.
- retry_on_timeout
If True, retry after ES returned a timeout error.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.elastic.
ExportFiles
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
plugins.indexer.elastic.
coolOff
(every=100, seconds=10)¶ A generator to wait a number of seconds after calling ‘every’ times. If every == 0, do not wait ever.
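The described behavior can be sketched as follows; the implementation details (counting via a module-level counter, sleeping inline) are assumptions:

```python
import time

def cool_off(every=100, seconds=10):
    """Generator: each next() counts one item and sleeps `seconds`
    after every `every` items. If every == 0, never wait."""
    count = 0
    while True:
        count += 1
        if every and count % every == 0:
            time.sleep(seconds)
        yield count

waiter = cool_off(every=3, seconds=0)  # seconds=0 so the demo does not block
for _ in range(5):
    next(waiter)
```

A sender would call `next(waiter)` once per submitted item, letting the generator throttle the loop.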
-
plugins.indexer.elastic.
get_esclient
(es_hosts, retry_on_timeout=False, logger=logging)¶ Get an elasticsearch.Elasticsearch object.
- Attrs:
- es_hosts
A list of ElasticSearch servers
- retry_on_timeout
If true, retry queries after a timeout
- logger
The logger to use
- Returns
An elasticsearch.Elasticsearch object
- Raises
base.job.RVTError – if none of the servers are available
Todo
Add security.
plugins.indexer.events module¶
-
class
plugins.indexer.events.
BrowsersCookies
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers cookies to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
BrowsersDownloads
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers downloads to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
BrowsersHistory
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers history to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
EventLogs
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts windows event logs to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
NetworkConnections
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts SRUM Network Connections information to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
NetworkUsage
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts SRUM Network Usage information to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
Prefetch
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts prefetch execution times to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
RecentFiles
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts Lnk and Jumplists to events. After this, you can save this file using events.save.
- Configuration section:
classify: If True, categorize the files in the output.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.events.
Registry
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts Windows Registry information to Elastic. After this, you can save this file using events.save.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
SuperTimeline
(*args, **kwargs)¶ Bases:
base.job.BaseModule
Main class to adapt any forensic source containing timestamped events to JSON format suitable for Elastic Common Schema (ECS)
- Configuration section:
classify: If True, categorize the files in the output.
-
common_fields
()¶ Get a new dictionary of mandatory fields for all sources
-
filegroup
(entry, classify=True)¶ Return the category group given an extension, path or content_type
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.events.
Timeline
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Convert a BODY file to events. After this, you can save this file using events.save
- Configuration section:
include_filename: if True, include FILENAME entries in the output.
classify: If True, categorize the files in the output.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Converts a BODY file read from from_module into an Elastic Common Schema document.
-
class
plugins.indexer.events.
USB
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts usb setup.api to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
UsnJrnl
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts windows usnjrnl to Elastic. After this, you can save this file using events.save.
-
attributes
(attributes)¶ Converts a string of attributes into a list:
Example: ‘ARCHIVE NOT_CONTENT_INDEXED ‘ -> [“archive”, “not_content_indexed”]
-
reasons
(reasons)¶ Parse a string of reasons into a suitable event.action:
Example: ‘DATA_EXTEND FILE_CREATE CLOSE ‘ -> ‘file-created’
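The two conversions above can be sketched as below. The attribute parsing follows the documented example directly; the reason-to-action table is a simplified assumption (the real module maps many more UsnJrnl flags):

```python
def parse_attributes(attributes):
    """'ARCHIVE NOT_CONTENT_INDEXED ' -> ['archive', 'not_content_indexed']"""
    return [a.lower() for a in attributes.split()]

# Assumed, reduced mapping from UsnJrnl reason flags to an ECS-style event.action
REASON_ACTIONS = {"FILE_CREATE": "file-created", "FILE_DELETE": "file-deleted"}

def parse_reasons(reasons):
    """Pick the first recognized reason flag; fall back to a generic action."""
    for flag in reasons.split():
        if flag in REASON_ACTIONS:
            return REASON_ACTIONS[flag]
    return "file-changed"  # assumed fallback

print(parse_attributes('ARCHIVE NOT_CONTENT_INDEXED '))
print(parse_reasons('DATA_EXTEND FILE_CREATE CLOSE '))
```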
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
plugins.indexer.events.
decompose_url
(full_url)¶ Returns a dictionary with multiple fields for a full url using Elastic Common Schema
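A sketch of such a decomposition using the standard library; the exact set of ECS url.* fields the real function emits is an assumption:

```python
from urllib.parse import urlparse

def decompose_url(full_url):
    """Split a URL into Elastic Common Schema url.* fields."""
    parsed = urlparse(full_url)
    return {
        "url.original": full_url,
        "url.scheme": parsed.scheme,
        "url.domain": parsed.hostname,
        "url.port": parsed.port,
        "url.path": parsed.path,
        "url.query": parsed.query,
        "url.fragment": parsed.fragment,
    }

print(decompose_url("https://example.com:8443/a/b?q=1#top"))
```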
-
plugins.indexer.events.
filetype
(tsk_permisssion)¶ Get file type from a tsk permission string
-
plugins.indexer.events.
permissions_to_octal
(tsk_permisssion)¶ Convert a tsk permission string to octal format
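The octal conversion can be sketched as follows, assuming a TSK permission string such as 'r/rrwxr-xr-x' whose last nine characters are the user/group/other rwx triplets (the input format is an assumption):

```python
def permissions_to_octal(tsk_permission):
    """Convert the rwx part of a TSK permission string into octal, e.g. '755'."""
    rwx = tsk_permission[-9:]  # last nine chars: user/group/other triplets
    digits = []
    for triplet in (rwx[0:3], rwx[3:6], rwx[6:9]):
        value = 0
        for char, weight in zip(triplet, (4, 2, 1)):
            if char != "-":
                value += weight
        digits.append(str(value))
    return "".join(digits)

print(permissions_to_octal("r/rrwxr-xr-x"))  # → 755
```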
-
plugins.indexer.events.
sanitize_dashes
(value)¶ Some Elastic field formats do not accept ‘-‘ as a value
-
plugins.indexer.events.
to_date
(strtimestamp)¶ Converts a UNIX timestamp string into a date
-
plugins.indexer.events.
to_iso_format
(timestring)¶ Converts a date string into an ISO format date
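Both helpers can be sketched with the standard library. The accepted input format of to_iso_format is an assumption (the real function may parse more formats), and UTC is assumed for the timestamp conversion:

```python
from datetime import datetime, timezone

def to_date(strtimestamp):
    """UNIX timestamp string -> ISO date string (UTC assumed)."""
    ts = int(strtimestamp)
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

def to_iso_format(timestring):
    """'YYYY-MM-DD HH:MM:SS' date string -> ISO 8601 string."""
    return datetime.strptime(timestring, "%Y-%m-%d %H:%M:%S").isoformat()

print(to_date("1093261210"))
print(to_iso_format("2004-08-23 11:40:10"))
```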
plugins.indexer.export_pst module¶
Modules to parse the output from the RVT related to emails.
-
class
plugins.indexer.export_pst.
CreatePstHtml
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Creates html items from pst/ost.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Exports to html items from pst/ost
- Parameters
path (str) – path to pff-n.export folder
-
-
class
plugins.indexer.export_pst.
ExportPstEml
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Exports to html items from pst/ost.
-
export_eml
(item, write=True)¶ Exports to eml an item parsed with pff-export
- Parameters
item (str) – path to item
write (bool) – writes eml file or returns eml content
-
run
(path)¶ Exports to eml items from pst/ost
- Parameters
path (str) – path to item to export (Message, Task, Appointment, Activity, Meeting, Note, Contact)
-
-
plugins.indexer.export_pst.
get_text
(filename)¶ Get the content of a file
-
plugins.indexer.export_pst.
repl_lt_gt
(text)¶ Replace characters in text for safe display in HTML
plugins.indexer.pstparser module¶
Modules to export, parse and index PST and OST files using the external utility pffexport.
-
class
plugins.indexer.pstparser.
EmlxParseMessage
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.pstparser.
ExportPst
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Export all pst and ost files from a mounted image using pffexport. This module depends on the common plugin.
- Configuration:
pffexport: path to the pffexport binary
outdir: outputs to this directory
delete_exists: if True, delete the output directory if it already exists
Returns: a list of {filename, outdir, index}
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Export all pst and ost files in a mounted image. Path is ignored.
-
class
plugins.indexer.pstparser.
MacMailParser
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Parses the macosx mailbox client files in the directory specified by path.
- Parameters
path (str) – the path to pass to from_module, which must yield, for each PST or OST file, a dictionary: {outdir: ”path to the output directory of the export”, filename: ”filename of the PST or OST file”}.
- Configuration:
exportdir: the export main directory.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
MailParser
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Parses the exported pst files returned by from_module. Use the output from ExportPST as from_module.
- Parameters
path (str) – the path to pass to from_module, which must yield, for each PST or OST file, a dictionary: {outdir: ”path to the output directory of the export”, filename: ”filename of the PST or OST file”}.
- Configuration:
exportdir: the export main directory.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
MailboxCSV
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Generates a dictionary with information about messages in a chain. Use a (probably forked) MailParser as from_module.
-
run
(path)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.pstparser.
ParseMacMailbox
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses an object (Message, Meeting, Contact, Task or Appointment) in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the object. It must be a directory.
- Yields
Information about the object and its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseAppointment
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses an Appointment in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Appointment. It must be a directory.
- Yields
Information about the Appointment but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseContact
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Contact in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Message. It must be a directory.
- Yields
Information about the Contact but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseMeeting
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Meeting in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Meeting. It must be a directory.
- Yields
Information about the Meeting but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseMessage
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Message in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Message. It must be a directory.
- Yields
Information about the Message but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseObject
(*args, **kwargs)¶ Bases:
base.job.BaseModule
Parses an object (Message, Meeting, Contact, Task or Appointment) in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the object. It must be a directory.
- Yields
Information about the object and its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseTask
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Task in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Task. It must be a directory.
- Yields
Information about the Task but not its attachments.
Todo
We couldn’t test this module
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
plugins.indexer.pstparser.
decodeEmailDateHeader
(header, ignore_errors=True)¶ Decodes a header (as returned by the standard email library) assuming it is a date.
- Parameters
ignore_errors – If True, return None if the date cannot be parsed. Else, raise ValueError.
>>> decodeEmailDateHeader('Mon, 23 Aug 2004 11:40:10 -0400')
'2004-08-23T11:40:10-04:00'
>>> decodeEmailDateHeader('nanana')
>>> decodeEmailDateHeader('nanana', ignore_errors=False)
Traceback (most recent call last):
    ...
ValueError: ('Unknown string format:', 'nanana')
>>> decodeEmailDateHeader(None)
>>> decodeEmailDateHeader('')
-
plugins.indexer.pstparser.
decodeEmailHeader
(header, errors='replace')¶ Decodes an international header (as returned by the standard email library) into unicode.
- Parameters
header (str) – The value of the header, as returned by the standard email library. It may contain different parts with different encodings.
errors (str) – After an encoding error, you can be ‘strict’, ‘replace’ or ‘ignore’.
>>> decodeEmailHeader('=?utf-8?b?WW91J3JlIFNpbXBseSB0aGUgQmVzdCEgc3V6w6BubmUg77uH77uH77uH77uH77uH?=')
"You're Simply the Best! suzànne ﻇﻇﻇﻇﻇ"
>>> decodeEmailHeader('=?utf-8?b?SMOpY3RvciBGZXJuw6FuZGV6?= <hector@fernandez>')
'Héctor Fernández <hector@fernandez>'
>>> decodeEmailHeader(None)
>>> decodeEmailHeader('')
-
plugins.indexer.pstparser.
readMessageFile
(filename, stop_on_empty_line=True, encoding='utf-8', errors='replace')¶ Reads an email header file.
This function is a simplified version of email.message_from_file(), but headers can also include spaces. The body, if present, is ignored. If filename does not exist, an empty dictionary is returned. Header names will be saved in lower case.
- Parameters
filename (str) – The name of the file to open.
stop_on_empty_line (boolean) – If True, stop processing headers after an empty line.
encoding (str) – The encoding of filename.
errors (str) – After an encoding error, you can be ‘strict’, ‘replace’ or ‘ignore’.
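A simplified sketch of the behavior just described; the continuation-line handling of the real function is reduced to plain `name: value` parsing here:

```python
import os
import tempfile

def read_message_file(filename, stop_on_empty_line=True,
                      encoding="utf-8", errors="replace"):
    """Read 'name: value' header lines; names lower-cased; missing file -> {}."""
    headers = {}
    if not os.path.exists(filename):
        return headers
    with open(filename, encoding=encoding, errors=errors) as f:
        for line in f:
            stripped = line.rstrip("\n")
            if not stripped.strip():
                if stop_on_empty_line:
                    break  # the body after the blank line is ignored
                continue
            name, sep, value = stripped.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
    return headers

# Demo on a temporary file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("Subject: Hello\nX-Custom: 1\n\nbody is ignored\n")
print(read_message_file(path))
os.remove(path)
```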
plugins.indexer.tikaparser module¶
Tika management.
In this file:
metadata refers to a file’s metadata as returned by Tika. A metadata entry may not have a name suitable for coding (such as the slash in content-type), its name depends on the file type and on the specific Tika parser, and there are hundreds of different names. Metadata with the same semantics in different file types may have different names. For example, dc:author, author, metadata:author…
field refers to an entry in the dictionary the modules in this file return. Metadata is mapped to a field, possibly normalized or converted, or maybe ignored. There is a file to configure the mapping between fields and metadata.
Todo
Allow using a standalone Tika, not the server mode.
Tell Tika somehow which parser it must use for a file. Especially useful to improve the time to parse plain text files, since Tika tests all possible parsers on them by default.
-
class
plugins.indexer.tikaparser.
TikaParser
(*attrs, **kwattrs)¶ Bases:
base.job.BaseModule
A module to parse files using Tika.
A Tika server must be running before creating this module. If the configured Tika server is not available, an exception will be raised.
Documents are parsed and their metadata and content are returned as a dictionary. By default, only some metadata (fields) are returned.
Beware of the character limit on files parsed by Tika: 100k characters.
- Configuration:
tika_server: a URL to a preexisting TIKA server. Example:
http://localhost:9998
. The port is mandatory.
mapping: path to the mapping configuration file. By default, not all metadata is yielded. Metadata must be mapped to a field name, and only mapped fields are returned. This prevents the “field explosion problem” and metadata with the same semantics but different names for different file types.
The mapping file is a standard cfg file. See the default example:
Sections are content types.
Values map
metadata=field
. If the field name is empty or IGNORED and include_unknown
is set to False
, this metadata will be ignored. In field names, try to use Python standard names: change - to _, no spaces…
save_mapping: if True, save in the mapping configuration file any new metadata you found.
include_unknown: if True, add metadata without a mapping to the results.
error_status (int): Error status code to report in case Tika is not available while parsing a specific file.
file_max_size (int): The max size in bytes of a file to be parsed. If the file is larger than this, it is not parsed and tika_status is set to 413 (payload too large).
content_max_size (int): The max size in bytes of the content to be parsed. If the content is larger than this, it is removed and tika_status is set to 413 (payload too large).
tika_encoding (str): The encoding of the answers from Tika.
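As a hedged illustration of the mapping file, a standard cfg sketch; the section and key names below are made-up examples, not the shipped defaults:

```ini
# Sections are content types; values map metadata=field.
[application/pdf]
dc:creator = author
pdf:docinfo:created = creation_date
pdf:unmapped_example = IGNORED

[text/plain]
Content-Encoding = encoding
```

With include_unknown set to False, metadata mapped to an empty name or IGNORED (like the hypothetical pdf:unmapped_example above) is dropped from the output.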
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Parses a file using Tika.
A file could be:
single, if it does not have any embedded files.
composite, if it includes several single files (compressed files, PDF, emails…)
Trying to parse directories is an error. If you want to parse a directory, configure a DirectoryFilter module before a TikaParser.
- Parameters
path (str) – The absolute path to the file to parse
- Yields
A dictionary with the fields of the parsed file. Keep in mind files can be composite; in this case, a dictionary will be yielded for each individual single file.
-
shutdown
()¶ Function to call at the end of a parsing session.
If option
save_mapping
is True, write the current mapping configuration into the mapping configuration file. This is useful to discover new metadata fields not yet mapped.
-
tika_parse_file
(filepath, tika_server)¶ Call a tika server to parse a file.
- Parameters
filepath (str) – the path to the composite file
tika_server (str) – the URL of the Tika server
- Returns
A dictionary
{status, metadata}.
If the file cannot be parsed, metadata is not provided. If the file is larger than file_max_size
, status is set to 413
and no metadata is returned.