plugins.indexer package¶
An RVT2 plugin to:
parse documents and directories using Apache Tika, saving the data as JSON.
parse the output from other modules and save the data as JSON.
index these JSON documents into an ElasticSearch server.
query, tag and export the data in an ElasticSearch server.
In order to use these modules, you’ll need a Tika server and/or an ElasticSearch server running on the network.
Submodules¶
plugins.indexer.blindsearches module¶
-
class
plugins.indexer.blindsearches.
BlindSearches
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to annotate documents that match a blind search.
- Configuration:
keyword_tag_field: Identification of the field to use for annotations. Remember
ElasticSearchBulkSender
allows appending to annotation lists by using a field ending in “-new”.
strip_match: If true, return only the identifier of the document and the annotation, if any. If false, return the whole document.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ The path must point to a JSON file, the output of a previous ElasticSearchAdapter.
plugins.indexer.elastic module¶
Modules to index documents parsed by other modules into ElasticSearch.
-
class
plugins.indexer.elastic.
ElasticSearchAdapter
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to adapt the results from other modules to a format suitable to be indexed in bulk into ElasticSearch.
This module also registers its own execution and results in the index defined in configuration rvtindex.
- Configuration:
name: the name of the index in ElasticSearch. The name will be converted to lower case, since ES only accepts lower case names.
doc_type: the doc_type in ElasticSearch. Do not change the default value
"_doc"
, since doc types are deprecated in ES>6.
rvtindex: The name of the index where the run of this module will be registered. The name MUST be in lower case. If empty or None, the job is not registered.
operation: The operation for ElasticSearch. Possible values are “index” (default), which overwrites data, or “update”, which updates existing data with new information. An update always performs an upsert.
casename: The name of the case
server: The URL of the file server to access directly to the files.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ - Returns
An iterator with the adapted JSON.
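As an illustration, a minimal sketch of the shape such an adapted action might take, assuming the bulk format of the elasticsearch-py helpers. The function, the use of the document path as `_id`, and the field choices below are hypothetical, not the module’s actual code:

```python
def adapt(doc, index_name, operation="index"):
    """Wrap a parsed document as one bulk action (elasticsearch-py helpers format)."""
    action = {
        "_op_type": operation,          # "index" overwrites, "update" upserts
        "_index": index_name.lower(),   # ES only accepts lower case index names
        "_id": doc.get("path"),         # illustrative choice of identifier
    }
    if operation == "update":
        # an update always behaves as an upsert (see the operation option above)
        action["doc"] = doc
        action["doc_as_upsert"] = True
    else:
        action["_source"] = doc
    return action

print(adapt({"path": "/evidence/file.txt", "content": "hello"}, "MyCase"))
```

Each adapted document becomes one such action, ready to be written as a line of a bulk file or passed to a bulk helper.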
-
class
plugins.indexer.elastic.
ElasticSearchBulkSender
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to index the results from the
ElasticSearchAdapter
into an ElasticSearch server.
- Configuration:
es_hosts: a space separated list of hosts of ElasticSearch. Example:
http://localhost:9200
. The port is mandatory.
- name: the name of the index in ElasticSearch. If the index does not exist, it is created using mapping.
The name will be converted to lower case, since ES only accepts lower case names.
mapping: If the index name must be created, use this file for initial settings and mappings.
chunk_size: The number of documents to send to elastic in each batch.
tag_fields: A space separated list of names of the fields that include tags. A new tag is appended using the special field “tag_field-new”. For example, you can append to the field “tags” a tag in “tags-new”.
only_test: If True, do not submit the final queries to ElasticSearch but yield them.
offset: ignore this number of lines at the beginning of the file.
restartable: if True, save the current status in the store to allow restarting the job.
max_retries: max number of retries after an HTTP 429 from the server (too many requests)
retry_on_timeout: If True, retry on TimeOut (the server is busy)
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd: Run this command to know the number of actions to send.
cooloff.every: after sending cooloff.every number of items, wait cooloff.seconds.
cooloff.seconds: after sending cooloff.every number of items, wait cooloff.seconds.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ - Parameters
path (str) –
The path to the file to upload to the ElasticSearch server. Each line is an action as described in <https://elasticsearch-py.readthedocs.io/en/master/helpers.html>.
”index” actions are also compatible with elasticdump, and you can upload the file using elasticdump if you prefer.
You MUST use this module if the operation from ElasticSearchAdapter is “update”, since elasticdump always overwrites data.
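To make the file format concrete, here is a hedged sketch of reading such an NDJSON action file in batches, honoring the offset and chunk_size options described above. The function name and its exact behavior are illustrative assumptions, not the module’s code:

```python
import json

def read_actions(lines, offset=0, chunk_size=2):
    """Skip `offset` lines, then yield bulk actions in chunks of `chunk_size`."""
    chunk = []
    for i, line in enumerate(lines):
        if i < offset or not line.strip():
            continue
        chunk.append(json.loads(line))   # one action per NDJSON line
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

ndjson = [
    '{"_index": "mycase", "_id": "1"}\n',
    '{"_index": "mycase", "_id": "2"}\n',
    '{"_index": "mycase", "_id": "3"}\n',
]
for batch in read_actions(ndjson, offset=1):
    print(len(batch))
```

In the real module each batch would then go to the server through a bulk helper, with the retry and cool-off options applied between batches.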
-
class
plugins.indexer.elastic.
ElasticSearchQuery
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Query ES and yield the results.
The path is ignored.
- Configuration section:
- es_hosts
An array of strings with the ES servers.
- name
The name of the index to query. The name will be converted to lower case, since ES only accept lower case names.
- query
The query in lucene language.
- source_includes
a space separated list of fields to include in the answer. Use empty string for all fields.
- source_excludes
a space separated list of fields NOT to include in the answer.
- progress.disable
if True, disable the progress bar.
- max_results
If the query affects more than this number of documents, raise an RVTCritical error to stop the execution. Set to 0 to disable.
- retry_on_timeout
If True, retry after ES returned a timeout error.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
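A minimal sketch of the request body such a query module might build from its options, assuming a lucene-style query_string and standard `_source` filtering. The function and its defaults are illustrative assumptions:

```python
def build_search_body(query, source_includes="", source_excludes=""):
    """Build an ES search body: lucene query plus _source include/exclude lists."""
    body = {"query": {"query_string": {"query": query}}}
    source = {}
    if source_includes:
        # space separated list of fields to include; empty means all fields
        source["includes"] = source_includes.split()
    if source_excludes:
        source["excludes"] = source_excludes.split()
    if source:
        body["_source"] = source
    return body

print(build_search_body("tags:relevant", source_includes="path tags"))
```

The real module would send this body to the configured es_hosts and yield each hit.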
-
class
plugins.indexer.elastic.
ElasticSearchQueryRelated
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Query ES and yield all documents related to the query: containers, attachments…
The path is ignored.
- Configuration section:
- es_hosts
An array of strings with the ES servers.
- name
The name of the index to query. The name will be converted to lower case, since ES only accepts lower case names.
- query
The query in lucene language.
- source_includes
a space separated list of fields to include in the answer. Use empty string for all fields.
- source_excludes
a space separated list of fields NOT to include in the answer.
- retry_on_timeout
If True, retry after ES returned a timeout error.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.elastic.
ExportFiles
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
plugins.indexer.elastic.
coolOff
(every=100, seconds=10)¶ A generator to wait a number of seconds after calling ‘every’ times. If every == 0, do not wait ever.
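The described behavior can be sketched as follows; the implementation details (counting via a module-level counter, sleeping inline) are assumptions:

```python
import time

def cool_off(every=100, seconds=10):
    """Generator: each next() counts one item and sleeps `seconds`
    after every `every` items. If every == 0, never wait."""
    count = 0
    while True:
        count += 1
        if every and count % every == 0:
            time.sleep(seconds)
        yield count

waiter = cool_off(every=3, seconds=0)  # seconds=0 so the demo does not block
for _ in range(5):
    next(waiter)
```

A sender would call `next(waiter)` once per submitted item, letting the generator throttle the loop.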
-
plugins.indexer.elastic.
get_esclient
(es_hosts, retry_on_timeout=False, logger=logging)¶ Get an elasticsearch.Elasticsearch object.
- Attrs:
- es_hosts
A list of ElasticSearch servers
- retry_on_timeout
If true, retry queries after a timeout
- logger
The logger to use
- Returns
An elasticsearch.Elasticsearch object
- Raises
base.job.RVTError – if none of the servers are available
Todo
Add security.
plugins.indexer.events module¶
-
class
plugins.indexer.events.
BrowsersCookies
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers cookies to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
BrowsersDownloads
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers downloads to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
BrowsersHistory
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts browsers history to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
EventLogs
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts windows event logs to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
NetworkConnections
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts SRUM Network Connections information to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
NetworkUsage
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts SRUM Network Usage information to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
Prefetch
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts prefetch execution times to events. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
RecentFiles
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Converts Lnk and Jumplists to events. After this, you can save this file using events.save.
- Configuration section:
classify: If True, categorize the files in the output.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.events.
Registry
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts Windows Registry information to Elastic. After this, you can save this file using events.save.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
SuperTimeline
(*args, **kwargs)¶ Bases:
base.job.BaseModule
Main class to adapt any forensic source containing timestamped events to JSON format suitable for Elastic Common Schema (ECS)
- Configuration section:
classify: If True, categorize the files in the output.
-
common_fields
()¶ Get a new dictionary of mandatory fields for all sources
-
filegroup
(entry, classify=True)¶ Return the category group given an extension, path or content_type
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.events.
Timeline
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Convert a BODY file to events. After this, you can save this file using events.save
- Configuration section:
include_filename: if True, include FILENAME entries in the output.
classify: If True, categorize the files in the output.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Converts a BODY file read from from_module into an Elastic Common Schema document.
-
class
plugins.indexer.events.
USB
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts usb setup.api to Elastic. After this, you can save this file using events.save.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.events.
UsnJrnl
(*args, **kwargs)¶ Bases:
plugins.indexer.events.SuperTimeline
Adapts windows usnjrnl to Elastic. After this, you can save this file using events.save.
-
attributes
(attributes)¶ Converts a string of attributes into a list:
Example: ‘ARCHIVE NOT_CONTENT_INDEXED ‘ -> [“archive”, “not_content_indexed”]
-
reasons
(reasons)¶ Parse a string of reasons into a suitable event.action:
Example: ‘DATA_EXTEND FILE_CREATE CLOSE ‘ -> ‘file-created’
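The two conversions above can be sketched as below. The attribute parsing follows the documented example directly; the reason-to-action table is a simplified assumption (the real module maps many more UsnJrnl flags):

```python
def parse_attributes(attributes):
    """'ARCHIVE NOT_CONTENT_INDEXED ' -> ['archive', 'not_content_indexed']"""
    return [a.lower() for a in attributes.split()]

# Assumed, reduced mapping from UsnJrnl reason flags to an ECS-style event.action
REASON_ACTIONS = {"FILE_CREATE": "file-created", "FILE_DELETE": "file-deleted"}

def parse_reasons(reasons):
    """Pick the first recognized reason flag; fall back to a generic action."""
    for flag in reasons.split():
        if flag in REASON_ACTIONS:
            return REASON_ACTIONS[flag]
    return "file-changed"  # assumed fallback

print(parse_attributes('ARCHIVE NOT_CONTENT_INDEXED '))
print(parse_reasons('DATA_EXTEND FILE_CREATE CLOSE '))
```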
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
plugins.indexer.events.
decompose_url
(full_url)¶ Returns a dictionary with multiple fields for a full url using Elastic Common Schema
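A sketch of such a decomposition using the standard library; the exact set of ECS url.* fields the real function emits is an assumption:

```python
from urllib.parse import urlparse

def decompose_url(full_url):
    """Split a URL into Elastic Common Schema url.* fields."""
    parsed = urlparse(full_url)
    return {
        "url.original": full_url,
        "url.scheme": parsed.scheme,
        "url.domain": parsed.hostname,
        "url.port": parsed.port,
        "url.path": parsed.path,
        "url.query": parsed.query,
        "url.fragment": parsed.fragment,
    }

print(decompose_url("https://example.com:8443/a/b?q=1#top"))
```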
-
plugins.indexer.events.
filetype
(tsk_permisssion)¶ Get file type from a tsk permission string
-
plugins.indexer.events.
permissions_to_octal
(tsk_permisssion)¶ Convert a tsk permission string to octal format
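The octal conversion can be sketched as follows, assuming a TSK permission string such as 'r/rrwxr-xr-x' whose last nine characters are the user/group/other rwx triplets (the input format is an assumption):

```python
def permissions_to_octal(tsk_permission):
    """Convert the rwx part of a TSK permission string into octal, e.g. '755'."""
    rwx = tsk_permission[-9:]  # last nine chars: user/group/other triplets
    digits = []
    for triplet in (rwx[0:3], rwx[3:6], rwx[6:9]):
        value = 0
        for char, weight in zip(triplet, (4, 2, 1)):
            if char != "-":
                value += weight
        digits.append(str(value))
    return "".join(digits)

print(permissions_to_octal("r/rrwxr-xr-x"))  # → 755
```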
-
plugins.indexer.events.
sanitize_dashes
(value)¶ Some Elastic field formats do not accept ‘-‘ as a value
-
plugins.indexer.events.
to_date
(strtimestamp)¶ Converts a UNIX timestamp string into a date
-
plugins.indexer.events.
to_iso_format
(timestring)¶ Converts a date string into an ISO format date
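Both helpers can be sketched with the standard library. The accepted input format of to_iso_format is an assumption (the real function may parse more formats), and UTC is assumed for the timestamp conversion:

```python
from datetime import datetime, timezone

def to_date(strtimestamp):
    """UNIX timestamp string -> ISO date string (UTC assumed)."""
    ts = int(strtimestamp)
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

def to_iso_format(timestring):
    """'YYYY-MM-DD HH:MM:SS' date string -> ISO 8601 string."""
    return datetime.strptime(timestring, "%Y-%m-%d %H:%M:%S").isoformat()

print(to_date("1093261210"))
print(to_iso_format("2004-08-23 11:40:10"))
```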
plugins.indexer.export_pst module¶
Modules to parse the output from the RVT related to emails.
-
class
plugins.indexer.export_pst.
CreatePstHtml
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Creates html items from pst/ost.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Exports to html items from pst/ost
- Parameters
path (str) – path to pff-n.export folder
-
-
class
plugins.indexer.export_pst.
ExportPstEml
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Exports to html items from pst/ost.
-
export_eml
(item, write=True)¶ Exports to eml an item parsed with pff-export
- Parameters
item (str) – path to item
write (bool) – writes eml file or returns eml content
-
run
(path)¶ Exports to eml items from pst/ost
- Parameters
path (str) – path to item to export (Message, Task, Appointment, Activity, Meeting, Note, Contact)
-
-
plugins.indexer.export_pst.
get_text
(filename)¶ Get the content of a file
-
plugins.indexer.export_pst.
repl_lt_gt
(text)¶ Replace characters in text for safe display in HTML
plugins.indexer.pstparser module¶
Modules to export, parse and index PST and OST files using the external utility pffexport.
-
class
plugins.indexer.pstparser.
EmlxParseMessage
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.pstparser.
ExportPst
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Export all pst and ost files from a mounted image using pffexport. This module depends on the common plugin.
- Configuration:
pffexport: path to the pffexport binary
outdir: outputs to this directory
delete_exists: if True, delete the output directory if it already exists
Returns: a list of {filename, outdir, index}
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Export all pst and ost files in a mounted image. Path is ignored.
-
class
plugins.indexer.pstparser.
MacMailParser
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Parses the macosx mailbox client files in the directory specified by path.
- Parameters
path (str) – the path to pass to from_module, which must yield, for each PST or OST file, a dictionary: {outdir: ”path to the output directory of the export”, filename: ”filename of the PST or OST file”}.
- Configuration:
exportdir: the export main directory.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
MailParser
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Parses the exported pst files returned by from_module. Use the output from ExportPST as from_module.
- Parameters
path (str) – the path to pass to from_module, which must yield, for each PST or OST file, a dictionary: {outdir: ”path to the output directory of the export”, filename: ”filename of the PST or OST file”}.
- Configuration:
exportdir: the export main directory.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path='')¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
MailboxCSV
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Generates a dictionary with information about messages in a chain. Use a (probably forked) MailParser as from_module.
-
run
(path)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
plugins.indexer.pstparser.
ParseMacMailbox
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses an object (Message, Meeting, Contact, Task or Appointment) in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the object. It must be a directory.
- Yields
Information about the object and its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseAppointment
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses an Appointment in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Appointment. It must be a directory.
- Yields
Information about the Appointment but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseContact
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Contact in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Message. It must be a directory.
- Yields
Information about the Contact but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseMeeting
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Meeting in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Meeting. It must be a directory.
- Yields
Information about the Meeting but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseMessage
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Message in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Message. It must be a directory.
- Yields
Information about the Message but not its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseObject
(*args, **kwargs)¶ Bases:
base.job.BaseModule
Parses an object (Message, Meeting, Contact, Task or Appointment) in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the object. It must be a directory.
- Yields
Information about the object and its attachments.
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
plugins.indexer.pstparser.
PffExportParseTask
(*args, **kwargs)¶ Bases:
plugins.indexer.pstparser.PffExportParseObject
Parses a Task in a directory exported by pffexport.
- Parameters
path (str) – the path to directory of the Task. It must be a directory.
- Yields
Information about the Task but not its attachments.
Todo
We couldn’t test this module
-
run
(path, containerid=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
plugins.indexer.pstparser.
decodeEmailDateHeader
(header, ignore_errors=True)¶ Decodes a header (as returned by the standard email library) assuming it is a date.
- Parameters
ignore_errors – If True, return None if the date cannot be parsed. Else, raise ValueError.
>>> decodeEmailDateHeader('Mon, 23 Aug 2004 11:40:10 -0400')
'2004-08-23T11:40:10-04:00'
>>> decodeEmailDateHeader('nanana')
>>> decodeEmailDateHeader('nanana', ignore_errors=False)
Traceback (most recent call last):
    ...
ValueError: ('Unknown string format:', 'nanana')
>>> decodeEmailDateHeader(None)
>>> decodeEmailDateHeader('')
-
plugins.indexer.pstparser.
decodeEmailHeader
(header, errors='replace')¶ Decodes an international header (as returned by the standard email library) into unicode.
- Parameters
header (str) – The value of the header, as returned by the standard email library. It may contain different parts with different encodings.
errors (str) – After an encoding error, you can be ‘strict’, ‘replace’ or ‘ignore’.
>>> decodeEmailHeader('=?utf-8?b?WW91J3JlIFNpbXBseSB0aGUgQmVzdCEgc3V6w6BubmUg77uH77uH77uH77uH77uH?=')
"You're Simply the Best! suzànne ﻇﻇﻇﻇﻇ"
>>> decodeEmailHeader('=?utf-8?b?SMOpY3RvciBGZXJuw6FuZGV6?= <hector@fernandez>')
'Héctor Fernández <hector@fernandez>'
>>> decodeEmailHeader(None)
>>> decodeEmailHeader('')
-
plugins.indexer.pstparser.
readMessageFile
(filename, stop_on_empty_line=True, encoding='utf-8', errors='replace')¶ Reads an email header file.
This function is a simplified version of email.message_from_file(), but headers can also include spaces. The body, if present, is ignored. If filename does not exist, an empty dictionary is returned. Header names will be saved in lower case.
- Parameters
filename (str) – The name of the file to open.
stop_on_empty_line (boolean) – If True, stop processing headers after an empty line.
encoding (str) – The encoding of filename.
errors (str) – After an encoding error, you can be ‘strict’, ‘replace’ or ‘ignore’.
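A simplified sketch of the behavior just described; the continuation-line handling of the real function is reduced to plain `name: value` parsing here:

```python
import os
import tempfile

def read_message_file(filename, stop_on_empty_line=True,
                      encoding="utf-8", errors="replace"):
    """Read 'name: value' header lines; names lower-cased; missing file -> {}."""
    headers = {}
    if not os.path.exists(filename):
        return headers
    with open(filename, encoding=encoding, errors=errors) as f:
        for line in f:
            stripped = line.rstrip("\n")
            if not stripped.strip():
                if stop_on_empty_line:
                    break  # the body after the blank line is ignored
                continue
            name, sep, value = stripped.partition(":")
            if sep:
                headers[name.strip().lower()] = value.strip()
    return headers

# Demo on a temporary file
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("Subject: Hello\nX-Custom: 1\n\nbody is ignored\n")
print(read_message_file(path))
os.remove(path)
```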
plugins.indexer.tikaparser module¶
Tika management.
In this file:
metadata refers to a file’s metadata as returned by Tika. A metadata entry may not have a name suitable for coding (such as the slash in content-type), its name depends on the file type and on the specific Tika parser, and there are hundreds of different names. Metadata with the same semantics in different file types may have different names. For example, dc:author, author, metadata:author…
field refers to an entry in the dictionary the modules in this file return. Metadata is mapped to a field, possibly normalized or converted, or maybe ignored. There is a file to configure the mapping between fields and metadata.
Todo
Allow using a standalone Tika, not the server mode.
Tell Tika somehow which parser it must use for a file. Especially useful to improve the time to parse plain text files, since Tika tests all possible parsers on them by default.
-
class
plugins.indexer.tikaparser.
TikaParser
(*attrs, **kwattrs)¶ Bases:
base.job.BaseModule
A module to parse files using Tika.
A Tika server must be running before creating this module. If the configured Tika server is not available, an exception will be raised.
Documents are parsed and their metadata and content are returned as a dictionary. By default, only some metadata (fields) are returned.
Beware of the character limit on files parsed by Tika: 100k characters.
- Configuration:
tika_server: a URL to a preexisting TIKA server. Example:
http://localhost:9998
. The port is mandatory.
mapping: path to the mapping configuration file. By default, not all metadata is yielded. Metadata must be mapped to a field name, and only mapped fields are returned. This prevents the “field explosion problem” and metadata with the same semantics but different names for different file types.
The mapping file is a standard cfg file. See the default example:
Sections are content types.
Values map
metadata=field
. If the field name is empty or IGNORED and include_unknown
is set to False
, this metadata will be ignored. In field names, try to use Python standard names: change - to _, no spaces…
save_mapping: if True, save in the mapping configuration file any new metadata you found.
include_unknown: if True, add metadata without a mapping to the results.
error_status (int): Error status code to report in case Tika is not available while parsing a specific file.
file_max_size (int): The max size in bytes of a file to be parsed. If the file is larger than this, it is not parsed and tika_status is set to 413 (payload too large).
content_max_size (int): The max size in bytes of the content to be parsed. If the content is larger than this, it is removed and tika_status is set to 413 (payload too large).
tika_encoding (str): The encoding of the answers from Tika.
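As a hedged illustration of the mapping file, a standard cfg sketch; the section and key names below are made-up examples, not the shipped defaults:

```ini
# Sections are content types; values map metadata=field.
[application/pdf]
dc:creator = author
pdf:docinfo:created = creation_date
pdf:unmapped_example = IGNORED

[text/plain]
Content-Encoding = encoding
```

With include_unknown set to False, metadata mapped to an empty name or IGNORED (like the hypothetical pdf:unmapped_example above) is dropped from the output.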
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Parses a file using Tika.
A file could be:
single, if it does not have any embedded files.
composite, if it includes several single files (compressed files, PDF, emails…)
Trying to parse directories is an error. If you want to parse a directory, configure a DirectoryFilter module before a TikaParser.
- Parameters
path (str) – The absolute path to the file to parse
- Yields
A dictionary with the fields of the parsed file. Keep in mind files can be composite; in this case, a dictionary will be yielded for each individual single file.
-
shutdown
()¶ Function to call at the end of a parsing session.
If option
save_mapping
is True, write the current mapping configuration into the mapping configuration file. This is useful to discover new metadata fields not yet mapped.
-
tika_parse_file
(filepath, tika_server)¶ Call a tika server to parse a file.
- Parameters
filepath (str) – the path to the composite file
tika_server (str) – the URL of the Tika server
- Returns
A dictionary
{status, metadata}.
If the file cannot be parsed, metadata is not provided. If the file is larger than file_max_size
, status is set to 413
and no metadata is returned.