base package

The base classes and function helpers for the RVT2.

It includes the base RVT2 modules, the configuration manager, helper functions and some general jobs.

Submodules

base.commands module

Utility functions and modules to run commands.

class base.commands.Command(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Run a command before or after the execution of other modules.

Configuration section:
run_before

If True, run the command before the execution of from_module.

run_after

If True, run the command after the execution of from_module.

from_dir

Run the external command from this directory

cmd

The external command to run. It is a python string template with two optional parameters: infile and outfile

infile

The infile parameter, if needed. Default: empty.

outfile

The outfile parameter, if needed. Default: empty.

delete_exists

If True, delete the outfile if it exists.

stdout

If empty, do not overwrite stdout. If provided, save stdout to this filename

append

If True, append to the output file. If False and the file exists, remove it
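As a rough illustration of the cmd option, the sketch below expands a command template with infile and outfile. It assumes the template uses str.format placeholders, as estimate_iterations does with cmd.format(path=path). build_command is a hypothetical helper and is not part of base.commands.

```python
def build_command(cmd, infile='', outfile=''):
    """Expand a command template; both parameters default to the empty string."""
    # Assumption: the template uses str.format-style placeholders.
    return cmd.format(infile=infile, outfile=outfile)


# Both parameters are optional: a template may use one, both or none of them.
print(build_command('strings {infile} > {outfile}', infile='disk.img', outfile='strings.txt'))
print(build_command('sort {infile}', infile='names.txt'))
```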

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.commands.RegexFilter(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A module to select lines that match a list of regular expressions. Yields a dict with the line, regex, tag and keyword_file as keys.

Configuration:
  • keyword_file: The keyword file to use. One keyword per line. Empty lines are ignored. Format: ANNOTATION:::REGEX or REGEX. In the last case, the annotation will be the regex.

  • keyword_list: A list of regex expressions to execute. Overwrites keyword_file if not empty. Same format as keyword_file.

  • keyword_dir: Load keyword files from this directory.

  • cmd: Run this external command to perform a search.

  • from_dir: Run the external command from this directory. If None, run from current directory.

  • encoding: The encoding to decode subprocess binary output

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)
Parameters

path – Search regex in this file.

Yields

For each line that matches, a dictionary where match is the matching line, regex is the regex that matched, tag is the tag of the regex and keyword_file is the file the regex was read from, or None.
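The keyword format and the yielded dictionaries can be sketched with the standard library alone. This is a minimal sketch, not the module's code: parse_keyword and filter_lines are hypothetical names, and using re.search for the match is an assumption.

```python
import re


def parse_keyword(line):
    """Split 'ANNOTATION:::REGEX' into (tag, regex); a bare REGEX is its own tag."""
    tag, sep, regex = line.partition(':::')
    return (tag, regex) if sep else (line, line)


def filter_lines(lines, keywords, keyword_file=None):
    """Yield one dictionary per (line, keyword) pair whose regex matches the line."""
    compiled = []
    for keyword in keywords:
        if not keyword.strip():
            continue  # empty lines are ignored
        tag, regex = parse_keyword(keyword)
        compiled.append((tag, regex, re.compile(regex)))
    for line in lines:
        for tag, regex, pattern in compiled:
            if pattern.search(line):
                yield dict(match=line, regex=regex, tag=tag, keyword_file=keyword_file)


hits = list(filter_lines(['user=admin', 'nothing here'], ['LOGIN:::user=\\w+', '', 'here']))
for hit in hits:
    print(hit)
```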

base.commands.estimate_iterations(path, cmd, from_dir=None, logger=logging)

Estimate the number of iterations using an external command.

Parameters
  • path (str) – The path to use on the command.

  • from_dir (str) – If specified, run the external command from this directory.

  • cmd – The external command to run, as a string or an array. If cmd is a string, run the command as a shell command. It is a template that will be formatted as cmd.format(path=path).

Returns

The estimated number of iterations as an integer number. float('inf') if the number of iterations cannot be estimated.
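A minimal sketch of this behaviour follows. It assumes the command prints the count as the first token of its output (as wc -l does); estimate_iterations_sketch is a hypothetical name and the error handling is an assumption, not the function's actual code.

```python
import subprocess
import sys


def estimate_iterations_sketch(path, cmd, from_dir=None):
    """Run cmd formatted with the path and parse the first token of its output as an integer."""
    try:
        if isinstance(cmd, str):
            # A string is run as a shell command.
            output = subprocess.check_output(cmd.format(path=path), shell=True, cwd=from_dir)
        else:
            output = subprocess.check_output([part.format(path=path) for part in cmd], cwd=from_dir)
        return int(output.decode('utf-8').split()[0])
    except (subprocess.CalledProcessError, OSError, ValueError, IndexError):
        # The number of iterations cannot be estimated.
        return float('inf')


# The Python interpreter is used as the external command to keep the example portable.
count = estimate_iterations_sketch('abc', [sys.executable, '-c', 'print(len("{path}"))'])
unknown = estimate_iterations_sketch('abc', [sys.executable, '-c', 'print("n/a")'])
print(count, unknown)
```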

base.commands.run_command(cmd, stdout=None, stderr=None, logger=logging, from_dir=None)

Runs an external command using subprocess.

Parameters
  • cmd (str) – The command to run, as a string or an array. If cmd is a string, run the command as a shell command.

  • stdout (file) – If provided, set the stdout to this stream

  • stderr (file) – If provided, set the stderr to this stream

  • logger (logging.Logger) – If provided, use this logger. If not, use the global logging system.

  • from_dir (str) – Run the external command from this directory. If None, run from current directory.

Returns

If stdout is provided, returns None. If no stdout, returns the decoded UTF-8 output.

base.commands.yield_command(cmd, stderr=None, logger=logging, from_dir=None)

Runs an external command using subprocess and yields the output line by line.

Parameters
  • cmd – The command to run, as a string or an array. If cmd is a string, run the command as a shell command.

  • logger – If provided, use this logger. If not, use the global logging system.

  • from_dir – Run the external command from this directory. If None, run from current directory.

Yields

UTF-8 decoded lines from the output of the command.
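The line-by-line behaviour can be approximated with subprocess.Popen. This is a sketch under assumptions: yield_command_sketch is a hypothetical name, logging and stderr handling are omitted, and stripping the line terminator is a choice made here, not something the documentation states.

```python
import subprocess
import sys


def yield_command_sketch(cmd, from_dir=None):
    """Run cmd and yield its standard output line by line, decoded as UTF-8."""
    # A string is run through the shell; an array is executed directly.
    with subprocess.Popen(cmd, shell=isinstance(cmd, str),
                          stdout=subprocess.PIPE, cwd=from_dir) as process:
        for raw_line in process.stdout:
            yield raw_line.decode('utf-8').rstrip('\r\n')


lines = list(yield_command_sketch([sys.executable, '-c', 'print("first"); print("second")']))
print(lines)
```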

base.config module

Classes and helper functions to manage the global, local and job configuration.

Warning

Since this module is used to configure the logging system, it cannot log any message.

class base.config.ColoredFormatter(fmt=None, datefmt=None, style='%')

Bases: logging.Formatter

A formatter with colors for the logging system.

Based on ideas from: <https://stackoverflow.com/questions/384076/how-can-i-color-python-logging-output> and <https://github.com/borntyping/python-colorlog>

use_color

If True, use color on the output. If False, this formatter behaves the same as a regular Formatter.

Type

Boolean

format(record)

Format the specified record as text.

The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.

use_color = True
class base.config.Config(filenames=None, config=None, job_name=None)

Bases: object

Configuration of modules and jobs. It is a wrapper around a configparser.SafeConfigParser object.

Parameters
  • filenames (array of str) – if not None, read configuration from these files

  • job_name (str) – The name of the job currently in execution.

  • config (configparser.SafeConfigParser) – the actual configuration object.

config

the actual configuration object.

Type

configparser.SafeConfigParser

copy()

Returns a deep copy of this configuration object

get(section, option, default=None)

Get a configuration value.

Parameters
  • section (str) – name of the section.

  • option (str) – name of the option

  • default (Object) – if not None, the default value to return.

Returns

The value of the option.

has_section(section)

Return True if the configuration has a section

options(section)

Return the options in a section

read(path, pattern='**/*.cfg')

Read configuration from a file or directory. The configuration file is appended to the current configuration.

Parameters
  • path (str) – The path of the single file or directory to read the configuration from.

  • pattern (str) – If the path is a directory, use this glob pattern to select configuration files.

sections()

Returns a list of sections in this config

set(section, option, value)

Add a configuration to a section. If the section does not exist, add it.

store_get(option, default, job_name=None)

Reads and returns an option from the local store.

A local store can be used to save and retrieve options between runs, or to communicate between modules in the same job. Do not rely on these options to exist and always use a default value.

The local store is saved optionally in the file configured in section job_name, option localstore. Options are saved in a section named after the current job_name. Notice you can read from any job_name, but only save options on your own job_name.

Parameters
  • option – the name of the option

  • default – the default value of the option. Do not rely on these options to exist and always use a default value.

  • job_name – if provided, read the option from this job_name.

store_set(option=None, value=None, save=False)

Store an option in the local store.

A local store can be used to save and retrieve options between runs, or to communicate between modules in the same job. Do not rely on these options to exist and always use a default value.

The local store is saved optionally in the file configured in section job_name, option localstore. Options are saved in a section named after the current job_name. Notice you can read from any job_name, but only save options on your own job_name.

Parameters
  • option (str) – the name of the option. If None, do not store an option. The local store can be saved if option=None and save=True.

  • value (str) – the value of the option. If None, remove the option.

  • save (Boolean) – whether the local store must be saved immediately in the file configured in section job_name, option localstore. If there is no file configured, do not save the local store. If the local store is not dirty, it is not saved.
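The semantics of store_get and store_set can be sketched with a small stand-alone class. This is only an illustration built on configparser: LocalStoreSketch, its constructor arguments and the dirty attribute are assumptions, not the actual implementation in base.config.Config.

```python
import configparser
import os
import tempfile


class LocalStoreSketch:
    """Keep options per job_name and optionally persist them to a file."""

    def __init__(self, job_name, filename=None):
        self.job_name = job_name
        self.filename = filename
        self.dirty = False
        self.store = configparser.ConfigParser(interpolation=None)
        if filename and os.path.exists(filename):
            self.store.read(filename)

    def store_get(self, option, default, job_name=None):
        # Reading is allowed from any job_name; missing options return the default.
        return self.store.get(job_name or self.job_name, option, fallback=default)

    def store_set(self, option=None, value=None, save=False):
        # Writing is only allowed in the section of the current job_name.
        if option is not None:
            if not self.store.has_section(self.job_name):
                self.store.add_section(self.job_name)
            if value is None:
                self.store.remove_option(self.job_name, option)
            else:
                self.store.set(self.job_name, option, str(value))
            self.dirty = True
        # Save only if requested, a file is configured and something changed.
        if save and self.dirty and self.filename:
            with open(self.filename, 'w') as handle:
                self.store.write(handle)
            self.dirty = False


store_file = os.path.join(tempfile.mkdtemp(), 'localstore.cfg')
writer = LocalStoreSketch('indexer', store_file)
writer.store_set('last_directory', '/cases/001/docs', save=True)
reader = LocalStoreSketch('report', store_file)
print(reader.store_get('last_directory', 'unknown', job_name='indexer'))
```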

class base.config.MyExtendedInterpolation

Bases: configparser.ExtendedInterpolation

Adds support for inheritance to the extended interpolator.

When getting the value of an option from a section, if the option is not defined in the current section, check whether the section has an inherits option. If there is an inherits option, look for the option in the inherited section and then in the DEFAULTS section.
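The lookup order can be sketched with plain dictionaries. This is not the interpolator itself: lookup is a hypothetical helper, the configuration is modelled as a dict of dicts, and following a chain of several inherits levels is an assumption.

```python
def lookup(sections, section, option, defaults_section='DEFAULTS'):
    """Search the section, then the sections it inherits from, then the defaults."""
    visited = set()
    current = section
    while current is not None and current not in visited:
        visited.add(current)  # protect against circular inheritance
        options = sections.get(current, {})
        if option in options:
            return options[option]
        current = options.get('inherits')
    return sections.get(defaults_section, {}).get(option)


sections = {
    'DEFAULTS': {'encoding': 'utf-8', 'outdir': '/tmp'},
    'base_job': {'outdir': '/cases/out', 'cmd': 'ls'},
    'child_job': {'inherits': 'base_job', 'cmd': 'ls -l'},
}
print(lookup(sections, 'child_job', 'cmd'))       # defined in the section itself
print(lookup(sections, 'child_job', 'outdir'))    # found in the inherited section
print(lookup(sections, 'child_job', 'encoding'))  # found in the defaults
```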

before_get(parser, section, option, value, defaults)
class base.config.TelegramHandler(level=20, token='', chatids=[])

Bases: logging.Handler

A logging handler to send messages to a list of telegram chatids

emit(record)

Do whatever it takes to actually log the specified logging record.

This version is intended to be implemented by subclasses and so raises a NotImplementedError.

base.config.check_server(server)

Check whether a server can be reached.

>>> check_server(None)
False
>>> check_server('https://www.google.es')
False
>>> check_server('https://www.google.es:443')
True
Parameters

server (str) – a URL to connect to a server, such as “http://localhost:9998”. A scheme, hostname and port must be provided. A malformed server will return False.

Returns

True if the server is listening
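One way to obtain this behaviour is to parse the URL and try to open a TCP connection. The sketch below is an assumption about the approach, not the actual implementation; check_server_sketch and its timeout parameter are hypothetical.

```python
import socket
from urllib.parse import urlparse


def check_server_sketch(server, timeout=2.0):
    """Return True only if scheme, hostname and port are given and the port accepts connections."""
    if not server:
        return False
    try:
        parsed = urlparse(server)
        port = parsed.port  # raises ValueError if the port is not a valid number
    except ValueError:
        return False
    if not parsed.scheme or not parsed.hostname or port is None:
        return False
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# These calls return before any network access because the URL is incomplete.
print(check_server_sketch(None))
print(check_server_sketch('https://www.google.es'))
```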

base.config.configure_logging(config, basic=False)

Configure the logging system. Some variables can be configured from the [logging] section in the configuration.

Parameters
  • config (config.Config) – the global configuration object.

  • basic (Boolean) – if True, configure a basic but colored logging system to the console.

Todo

We couldn’t find a way to configure the log filename easily for each case using configuration files. Temporarily, the logging subsystem is configured using a dictionary and not a configuration file.

Configuration:
  • console.level: The logging level for the console handler. Defaults to WARN.

  • file.level: The logging level for the file handler. Defaults to INFO.

  • file.logfile: The filename for the file handler. Defaults to rvt2.log.

  • telegram.level: The logging level for the telegram handler. Defaults to INFO.

  • telegram.token: the token for the telegram bot. Defaults to None (do not send messages)

  • telegram.chatids: a space separated list of chatids to send messages.

base.config.parse_conf_array(value)

Parses a value in an option and returns it as an array.

Values are separated using spaces or new lines. Double quotes can be used as quoting chars. Spaces can be escaped with a backslash.

>>> parse_conf_array('hello')
['hello']
>>> parse_conf_array('hello world')
['hello', 'world']
>>> parse_conf_array('hello\ world')
['hello world']
>>> parse_conf_array('"hello world" bye')
['hello world', 'bye']
>>> parse_conf_array('base.module.test{"param":"value1\ value2"}')
['base.module.test{"param":"value1 value2"}']
>>> parse_conf_array('base.module.test{"param":"value1\ value2"} base.module.test{"param":"value3\ value4"}')
['base.module.test{"param":"value1 value2"}', 'base.module.test{"param":"value3 value4"}']
>>> parse_conf_array(None)
[]
>>> parse_conf_array('')
[]
Parameters

value (str) – The value to parse.

Returns

An array of strings.

base.directory module

Modules to parse directories and subdirectories.

class base.directory.DirectoryClear(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Remove the file or directory specified by ‘target’. Useful when certain jobs that append results to a file are called again, avoiding duplication of output.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.directory.DirectoryFilter(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

The module gets a path to a directory and sends to from_module the path to all files inside this path. Optionally, the walker only manages a set of extensions, or excludes files using a regular expression.

Module description:
  • path: the absolute path to a file or directory.

  • from_module: mandatory. If path is a file, this module is transparent.

    If path is a directory, list all the files in the directory and its subdirectories (filters might apply) and call from_module for each one of them.

  • yields: whatever from_module yields each time is called.

Configuration:
  • void_extension (Boolean): If True, files without an extension are always parsed even if a filter is set.

  • followlinks (Boolean): If True, follow symbolic links

  • filter: List of file categories to parse. If not provided, parse all files. Categories are section names to be read.

  • progress.disable (Boolean): If True, disable the progress bar.

  • progress.cmd (String): The shell command to run to estimate the number of subdirectories in the path.

  • exclude_pattern: If the path of the files matches this pattern, exclude the file.

  • restartable: If True, use the local store to save the name of the last directory fully parsed. The parsing won’t continue until this directory is found.

Todo

Files in the last parsed directory might be parsed twice
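The walking and filtering described above can be approximated with os.walk. This is a simplified sketch: walk_files is a hypothetical function, it filters by a set of extensions instead of the configured file categories, and the progress bar and restartable options are left out.

```python
import os
import re
import tempfile


def walk_files(path, extensions=None, void_extension=False,
               exclude_pattern=None, followlinks=False):
    """Yield the paths of the files under path that pass the filters."""
    if os.path.isfile(path):
        yield path  # a single file is passed through unchanged
        return
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    for root, dirs, files in os.walk(path, followlinks=followlinks):
        dirs.sort()  # visit subdirectories in a stable order
        for name in sorted(files):
            full_path = os.path.join(root, name)
            if exclude is not None and exclude.search(full_path):
                continue
            extension = os.path.splitext(name)[1].lower()
            if extensions is not None:
                if not extension and not void_extension:
                    continue
                if extension and extension not in extensions:
                    continue
            yield full_path


# Build a small tree to try the filters on.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'cache'))
for relative in ('report.docx', 'notes.txt', 'README', os.path.join('cache', 'old.docx')):
    with open(os.path.join(top, relative), 'w') as handle:
        handle.write('sample')

found = [os.path.basename(item) for item in walk_files(
    top, extensions={'.docx'}, void_extension=True, exclude_pattern=r'[/\\]cache[/\\]')]
print(found)
```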

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Gets a path and calls from_module for each file in the directory or subdirectories.

class base.directory.FileClassifier(*args, **kwargs)

Bases: base.job.BaseModule

Classifies a piece of data according to its content-type, extension or path.

This class can be used as a module or as a stand-alone object.

Configuration section:
  • categories: list of categories to use. Categories are section names with extension and content type.

  • check_extension: When used as module: if True, check path extension; if False, check only content_type to decide the category.

Example

>>> import base.config
>>> import base.job
>>> c = base.config.Config(filenames=['conf/file_categories.cfg'])
>>> fc = FileClassifier(c, local_config=dict(categories='compressed office'))
>>> print(fc.classifyByExtension('.docx'))
office
>>> print(fc.classifyByContentType('application/x-compress'))
compressed
>>> print(fc.classify(dict(extension='.docx')))
office
>>> print(fc.classify(dict(extension='.docx', content_type='application/x-compress')))
compressed
>>> print(fc.classify(dict(path='filename')))
None
>>> print(fc.classify(dict(path='filename.docx')))
office
classify(data)

Classifies a piece of data. Data is a dictionary that must include either content_type, extension or path.

classifyByContentType(content_type)

Classifies a content type.

Parameters

content_type – The content type to classify. For example, ‘application/x-msaccess’.

Returns

The name of the category, or None

classifyByExtension(extension)

Classifies an extension.

Parameters

extension – The extension to classify. For example, ‘.docx’.

Returns

The name of the category, or None

classifyByPath(path)

Classifies a path. This method extracts the extension from the path and calls classifyByExtension.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Classifies all items sent by from_module

class base.directory.FileParser(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Call a job for each file in a path that matches a regex.

Module description:
  • path: run from_module on this path. If the path matches a regex, also call the configured jobname.

  • from_module: optional. If None, not used.

  • yields: whatever from_module and jobname yield each time they are called.

Configuration:
  • parsers: A list of regex and modules. First, the regular expression matching a filename; second, the jobname to run on this filename.
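The dispatch implied by the parsers option can be sketched as a list of (regex, jobname) pairs. jobs_for_path and the job names below are hypothetical, and matching with re.search is an assumption.

```python
import re


def jobs_for_path(path, parsers):
    """Return the names of the jobs whose regex matches the path, in configuration order."""
    return [jobname for pattern, jobname in parsers if re.search(pattern, path)]


# Hypothetical job names, used only for illustration.
parsers = [
    (r'\.evtx$', 'hypothetical.events'),
    (r'(?i)ntuser\.dat$', 'hypothetical.registry'),
]
print(jobs_for_path('/mnt/Users/ana/NTUSER.DAT', parsers))
print(jobs_for_path('/mnt/notes.txt', parsers))
```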

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.directory.GlobFilter(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

The module gets a glob pattern as a path, and runs from_module on all items matching the pattern.

See: https://docs.python.org/3.6/library/glob.html

Module description:
  • path: a glob pattern. Run from_module on all items matching this pattern.

  • from_module: mandatory.

  • yields: whatever from_module yields each time it is called.

Configuration:
  • recursive: Passes this parameter to glob.iglob: whether the path must run recursively or not.

  • ftype: either “file”, “directory” or “all”.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Parses objects matching a glob pattern.

If the glob can match files and directories, you probably want to feed the results to a DirectoryFilter.

Parameters

path (str) – the glob pattern. It will be recursive. See https://docs.python.org/3.6/library/glob.html
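A minimal sketch of the recursive and ftype options on top of glob.iglob follows. glob_items is a hypothetical function, not the module's code.

```python
import glob
import os
import tempfile


def glob_items(pattern, recursive=True, ftype='all'):
    """Yield the items matching a glob pattern, optionally only files or only directories."""
    for item in glob.iglob(pattern, recursive=recursive):
        if ftype == 'file' and not os.path.isfile(item):
            continue
        if ftype == 'directory' and not os.path.isdir(item):
            continue
        yield item


# Build a small tree: one file at the top level and one inside a subdirectory.
top = tempfile.mkdtemp()
os.mkdir(os.path.join(top, 'sub'))
for relative in ('a.txt', os.path.join('sub', 'b.txt')):
    with open(os.path.join(top, relative), 'w') as handle:
        handle.write('sample')

text_files = sorted(glob_items(os.path.join(top, '**', '*.txt'), ftype='file'))
directories = list(glob_items(os.path.join(top, '*'), ftype='directory'))
print(text_files)
print(directories)
```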

class base.directory.MirrorOptions(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Return the value of the local options

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

base.help module

Jobs to assist in the use of the RVT2: list available jobs, show help about a job or module.

class base.help.AvailableJobs(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A module to list all available jobs in the rvt

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.help.Help(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A module to show help about a job or module whose name is passed as the path of the module.

Configuration section:
  • show_vars: List of variables in the section to show. If “ALL”, show all variables. If empty, do not show context variables.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

base.input module

Some simple file readers to be used as input for other modules.

class base.input.AllLinesInFile(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Yields every line in a file as a string

Configuration:
  • encoding (String): The encoding to use. Defaults to utf-8

  • progress.disable (Boolean): If True, disable the progress bar.

  • progress.cmd (String): The shell command to run to estimate the number of lines in the file.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

Read all lines from the path. from_module is ignored

class base.input.CSVReader(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Yields every line in a CSV file.

Configuration:
  • encoding (String): The encoding to use. Defaults to “utf-8”

  • delimiter (String): The delimiter to use. Defaults to ;

  • quotechar (String): The quotechar. Defaults to the double quote character (").

  • restkey (String): The restkey of the DictReader. Defaults to “extra”.

  • restval (String): The restval of the DictReader. Defaults to the empty string.

  • content_type: The content_type to set, if fill_common_fields is set

  • fieldnames: A space separated list of header names. If None, use the first line. Warning: if provided, the first line will be considered data unless ignore_lines is set to >0

  • ignore_lines (int): Ignore this number of initial lines. If fieldnames is provided, the first line is also ignored.

  • progress.disable (Boolean): If True, disable the progress bar.

  • progress.cmd (String): The shell command to run to estimate the number of lines in the file.
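The options above map closely onto csv.DictReader from the standard library. The sketch below shows how the documented defaults behave on a short sample; read_csv_rows is a hypothetical function, and skipping ignore_lines raw lines before parsing is an assumption about how that option is applied.

```python
import csv
import io


def read_csv_rows(handle, delimiter=';', quotechar='"', restkey='extra', restval='',
                  fieldnames=None, ignore_lines=0):
    """Yield each row of a CSV stream as a dictionary."""
    for _ in range(ignore_lines):
        next(handle, None)  # skip initial lines before parsing
    reader = csv.DictReader(handle, fieldnames=fieldnames, delimiter=delimiter,
                            quotechar=quotechar, restkey=restkey, restval=restval)
    for row in reader:
        yield dict(row)


# The second line has one field too many, the third one has a missing field.
sample = io.StringIO('name;city\nAna;"Valencia; Spain";unexpected\nLuis\n')
rows = list(read_csv_rows(sample))
for row in rows:
    print(row)
```

Extra fields end up in a list under the restkey, and missing fields take the restval.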

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

Read CSV file in the path. from_module is ignored

class base.input.DummyReader(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A dummy reader that creates as many empty dictionaries as requested in number.

Use for debugging.

Configuration:
  • number: Yield this many empty dictionaries

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

path and from_module are ignored

class base.input.ForAllLinesInFile(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Pass to from_module each line in a file as the path.

Configuration:
  • encoding (String): The encoding to use. Defaults to utf-8

  • progress.disable (Boolean): If True, disable the progress bar.

  • progress.cmd (String): The shell command to run to estimate the number of lines in the file.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

Read all lines from the path and pass them to from_module

class base.input.GeneratorReader(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Manages from_module as a generator, not a module, and yields its contents.

You can use this module to inject an array into another module down in the chain.

Example

Save a list in a CSV file:

data = [
    dict(greeting='Hello', language='English'),
    dict(greeting='Hola', language='Spanish')
]
base.output.CSVSink(
    config,
    from_module=GeneratorReader(config, from_module=data),
    local_config=dict(outfile='outfile.csv')
).run()
from_module

Any generator-like object such as a list. Yields its contents.

run(path=None)

Path is ignored.

class base.input.JSONReader(config, section=None, local_config=None, from_module=None)

Bases: base.input.AllLinesInFile

Load every line in a file as a JSON dictionary and yields it.

run(path)

Read JSON file in the path. from_module is ignored
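Reading one JSON document per line can be sketched as follows. read_json_lines is a hypothetical function, and skipping blank lines is an assumption made here.

```python
import io
import json


def read_json_lines(handle):
    """Parse each non-empty line of a text stream as a JSON document and yield it."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


sample = io.StringIO('{"greeting": "Hello", "language": "English"}\n'
                     '\n'
                     '{"greeting": "Hola", "language": "Spanish"}\n')
documents = list(read_json_lines(sample))
print(documents)
```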

class base.input.SQLiteReader(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Returns the cursor of a query on a sqlite database.

Rows in the database are returned as dictionaries.

Configuration:
  • query: The SQL query to run.

Current job section:
  • query: If the query in the module section is empty, read the SQL query from the job section.

  • read_only: If True, open the database in read_only mode

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

Read database from the path. from_module is ignored

base.job module

Jobs and modules.

class base.job.BaseModule(config, section=None, local_config=None, from_module=None)

Bases: object

The base for all modules. Do not use this module directly, always extend it.

Configuration:
  • stop_on_error: If True, stop the execution on an error.

  • logger_name: The name of the logger to use.

Parameters
  • config (base.config.Config) – Global configuration for the application.

  • section (str) – the name of the configuration section for this module in the global configuration object. If None, use the classname.

  • local_config (dict) – local configuration for this module. This configuration overrides the values in the section in the global configuration.

  • from_module (base.job.BaseModule) – If in a chain, the next module in the chain, or None.

check_params(path, check_from_module=False, check_path=False, check_path_exists=False)

Check the module is configured correctly. Extend this function to run your own tests.

Parameters
  • path (str) – The path passed to the run() method.

  • check_from_module (boolean) – If True, check a from_module is defined.

  • check_path (boolean) – If True, check the path is not None.

  • check_path_exists (boolean) – If True, check the path exists.

Raises

RVTError if the tests are not passed.

logger()

Get the logger for this parser.

Warning

Do not store the logger as an internal variable: the user may want to change the logger at any time.

myconfig(option, default=None)

Get the value of a configuration for this module.

Parameters
  • option (str) – the name of the option

  • default – the default value of the option

myflag(option, default=False)

A convenience method for self.config.getboolean(self.section, option, False)

options()

Return a dictionary with the available options to this job

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

set_default_config(option, default=None)

Set the default value of a configuration option for this module.

Parameters
  • option (str) – the name of the option

  • default (str) – the default value of the option. It MUST be a string.

shutdown()

This function will be called at the end of the execution of a job.

The shutdown() function of the from_module is called recursively.
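The chaining idea behind from_module can be illustrated without the RVT2 itself. SketchModule below is a stand-in written for this example, not base.job.BaseModule: it only keeps a local configuration and a reference to the next module, and every class name in the block is hypothetical.

```python
class SketchModule:
    """Stand-in for a module: local options plus the next module in the chain."""

    def __init__(self, local_config=None, from_module=None):
        self.local_config = dict(local_config or {})
        self.from_module = from_module
        self.closed = False

    def myconfig(self, option, default=None):
        return self.local_config.get(option, default)

    def run(self, path=None):
        return iter(())

    def shutdown(self):
        # Propagate the shutdown to the rest of the chain.
        self.closed = True
        if self.from_module is not None:
            self.from_module.shutdown()


class Lines(SketchModule):
    """Yield one dictionary per configured line."""

    def run(self, path=None):
        for line in self.myconfig('lines', []):
            yield dict(line=line)


class Upper(SketchModule):
    """Read items from from_module and yield them with the line in upper case."""

    def run(self, path=None):
        for item in self.from_module.run(path):
            item['line'] = item['line'].upper()
            yield item


source = Lines(local_config=dict(lines=['first', 'second']))
chain = Upper(from_module=source)
results = list(chain.run())
chain.shutdown()
print(results)
```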

class base.job.CascadeWrapper(*args, **kwargs)

Bases: base.job.BaseModule

run(*args, **kwargs)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

exception base.job.RVTCritical

Bases: Exception

A special class for Exceptions inside the RVT. The rvt2 cannot continue.

exception base.job.RVTError

Bases: Exception

A special class for Exceptions inside the RVT. The module or job cannot continue.

base.job.get_path_array(job_name, myparams, extra_config, default_path, config)

Get the path array. Sources are read in this order: 1. myparams (job_with_params); 2. extra_config; 3. the default_path parameter; 4. config.

base.job.load_module(config, confsection, from_module=None, extra_config=None)

Loads a module from a section name.

Parameters
  • config (base.config.Config) – global configuration object to pass to the module.

  • confsection (str) – The section name, and optional local configuration. The section name is searched in the configuration. If the section is present and it has a “module” attribute, load the class “module”. If the section is not present or it doesn’t have a “module” attribute, try to load the section name as a class. Format: SECTIONAME{‘OPTION’: ‘VALUE’, ‘OPTION2’: ‘VALUE2’}.

  • from_module (base.job.BaseModule) – pass this value as the from_module configuration. Default: None

  • extra_config (dict) – extra local configuration for the module. Default: None.

base.job.parse_modules_chain(job_name, myparams, config)

Parse a list of modules or jobs from a conf_name, taking into account local configuration.

Parameters
  • job_name (str) – The name of the job. The modules or jobs will be loaded first from “jobs” and, if not found, from “modules”. The default parameters will be loaded from “default_params”. The chain string will be managed as a string template, where the “default_params” is applied. If the job does not have “jobs” or “modules” configuration, just return the name of the job. The system will assume this name is a class name.

  • myparams (dict) – The local parameters, as returned by parse_modules_name()

  • config (base.config.Config) – The configuration object for the application.

Returns

A list of modules to load.

base.job.parse_modules_name(input_name, default='True')

Parse a module or job name searching for local configurations.

Parameters
  • input_name (str) – The name of a module or job with an optional configuration. The optional configuration is appended next to the job name, as pair name=value, or only name. See examples below.

  • default (str) – The default value of a param, when only the name is given.

Returns

A tuple. The first member is the name of the module or job. The second is the local configuration, if any.

>>> parse_modules_name('funcname')
('funcname', OrderedDict())
>>> parse_modules_name('funcname ')
('funcname', OrderedDict())
>>> parse_modules_name('funcname greetings="good morning" name="Jim" morning')
('funcname', OrderedDict([('greetings', 'good morning'), ('name', 'Jim'), ('morning', 'True')]))
base.job.run_job(config, job_name_with_params, path=None, extra_config=None, from_module=None)

Runs a job from the configuration. This job has ‘jobs’, ‘modules’ or ‘cascade’.

Parameters
  • config (base.config.Config) – The configuration object

  • job_name_with_params (str) – The name of the job to run. It must be a section in the configuration. This string will be parsed using parse_modules_name() and it may include additional parameters.

  • path (list of str) – Run the job on these paths.

  • extra_config (dict) – extra local configuration for all the modules in the job. Default: None

  • from_module (base.job.BaseModule) – use this as the from_module of the last module (only in single jobs)

Returns

If the job is single (it has ‘modules’ or ‘cascade’), a generator with the result of the execution. If the job is composite (it has ‘jobs’), an empty list, since the results of the individual jobs are probably not related to each other. You MUST read each item from the returned generator.
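The reason every item must be read is that a Python generator does no work until it is consumed. The sketch below shows this with a hypothetical job_sketch generator; it does not call run_job.

```python
def job_sketch(log):
    """Pretend job: each step only runs when the caller asks for the next item."""
    for step in ('read', 'parse', 'save'):
        log.append(step)
        yield step


log = []
results = job_sketch(log)
steps_before = list(log)  # still empty: creating the generator runs nothing

for _ in results:
    pass  # consuming the generator is what executes the job
steps_after = list(log)

print(steps_before, steps_after)
```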

base.job.run_single_job(config, job_name_with_params, default_path=None, extra_config=None, from_module=None)

Runs a job from the configuration. This job has only ‘modules’, it does not include ‘jobs’.

Parameters
  • config (base.config.Config) – The configuration object

  • job_name_with_params (str) – The name of the job to run. It must be a section in the configuration. This string will be parsed using parse_modules_params() and it may include additional parameters.

  • default_path (list of str) – Run the job on these paths. The order to read paths for a job is: 1. job_with_params (single path); 2. extra_config; 3. this parameter; 4. config

  • extra_config (dict) – extra local configuration for all the modules in the job. Default: None

  • from_module (base.job.BaseModule) – use this as the from_module of the last module in the chain.

Returns

A generator that yields each of the results of the execution.

base.mutations module

Modules to mutate data yielded by other modules: convert using specific converters, remove fields, set fields to default values…

class base.mutations.AddFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Get data from from_module, add some new fields loaded from configuration and yield again.

Module description:
  • path: not used, passed to from_module.

  • from_module: Data is updated.

  • yields: The updated data.

Configuration:
  • section: Section from configuration where new values are to be retrieved

  • fields: A dictionary of fields to be set. fields will be managed as a string template, passing the options from the configuration section as parameter.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.Collapse(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Collapse different documents sent by from_module with a common field into just one document.

Warning: collapsing may take a lot of time and memory.

Configuration section:
  • field: collapse documents using this field name as the common field.
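
A minimal sketch of the idea, assuming that documents with the same value in the common field are merged with dict.update (so later keys win). The merge rule and the function name collapse_sketch are assumptions, not the RVT2 implementation.

```python
from collections import OrderedDict


def collapse_sketch(documents, field):
    """Merge documents that share the same value of `field` (assumed rule: later keys win)."""
    merged = OrderedDict()
    for doc in documents:
        merged.setdefault(doc.get(field), {}).update(doc)
    # every input document must be read before the first result can be yielded
    yield from merged.values()


docs = [
    {'path': '/a', 'size': 10},
    {'path': '/b', 'size': 20},
    {'path': '/a', 'mime': 'text/plain'},
]
collapsed = list(collapse_sketch(docs, 'path'))
```

The sketch also shows where the cost in the warning comes from: all documents are held in memory until the input is exhausted.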

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.CommonFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Adds common fields for a document: path, filename, dirname, extension, content_type and _id if they don’t exist yet.

Module description:
  • path: not used, passed to from_module.

  • from_module: mandatory. Copy the information sent by from_module and add fields if they don’t exist yet.

  • yields: the modified data.

Configuration:
  • calculate_id: if True, calls base.utils.generate_id to generate an identifier in the _id field.

  • disabled: if True, do not add anything and just yield the result. Useful in configurable module chains
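
The "add only if missing" behaviour can be sketched with dict.setdefault. The way each field is derived below (os.path helpers, mimetypes, an extension that keeps its leading dot) is an assumption for illustration; the _id field is left out.

```python
import mimetypes
import os.path


def add_common_fields_sketch(data):
    """Return a copy of data with derived fields added only where they are missing."""
    result = dict(data)
    path = result.get('path', '')
    guessed_type, _ = mimetypes.guess_type(path)
    derived = {
        'filename': os.path.basename(path),
        'dirname': os.path.dirname(path),
        'extension': os.path.splitext(path)[1],
        'content_type': guessed_type or 'application/octet-stream',
    }
    for key, value in derived.items():
        result.setdefault(key, value)   # existing values are never overwritten
    return result


doc = add_common_fields_sketch({'path': '/morgue/case/report.txt', 'extension': 'custom'})
```

The extension supplied by the caller survives, while filename and dirname are filled in from the path.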

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.DateFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Converts some fields into ISO date strings.

Fields might be:

  • An integer, or a string representing an integer: it is a UNIX timestamp.

  • A string: the module will use the dateutil package to parse it.

If the field cannot be converted and stop_on_error is not set, the field is popped out from the data.

Module description:
  • path: not used, passed to from_module.

  • from_module: mandatory. Get data and update fields.

  • yields: The modified data.

Configuration:
  • fields: A space separated list of fields to check to convert
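
The conversion rules can be sketched with the standard library only. In this sketch datetime.fromisoformat stands in for the dateutil parser, so it accepts far fewer string formats, and timestamps are interpreted as UTC. Both choices, and the function names, are assumptions.

```python
from datetime import datetime, timezone


def to_iso(value):
    """Convert a UNIX timestamp (int or numeric string) or an ISO-like string."""
    try:
        return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    except (TypeError, ValueError):
        # stand-in for the dateutil parser; raises if the string is not ISO-like
        return datetime.fromisoformat(value).isoformat()


def convert_date_fields(data, fields, stop_on_error=False):
    """Convert the space separated `fields` of data in place."""
    for name in fields.split():
        if name not in data:
            continue
        try:
            data[name] = to_iso(data[name])
        except (TypeError, ValueError):
            if stop_on_error:
                raise
            data.pop(name)   # fields that cannot be converted are dropped
    return data


doc = convert_date_fields(
    {'created': 0, 'modified': '2020-01-02 03:04:05', 'bad': 'n/a', 'other': 1},
    'created modified bad',
)
```

The field bad is removed because it cannot be parsed, and other is untouched because it is not listed in fields.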

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

The path will be passed to the mandatory from_module

class base.mutations.ForEach(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Runs a job for each data yielded by from_module. The data is passed as params of the job.

Module description:
  • path: not used, passed to from_module.

  • from_module: mandatory. The data is passed to run_job as its extra_config parameter.

  • yields: None

Configuration:
  • run_job: The name of the job to run

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.GetFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Get data from from_module, yield fields specified.

Module description:
  • path: not used, passed to from_module.

  • from_module: Data dict.

  • yields: The updated dict data.

Configuration:
  • section: Section from configuration where new values are to be retrieved

  • fields: A list of fields to be yielded.

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.RemoveFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Drops some fields from data.

Module description:
  • path: not used, passed to from_module.

  • from_module: mandatory. Get data and remove fields.

  • yields: The modified data.

Configuration:
  • fields: A space separated list of fields to drop

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.mutations.SetFields(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

Get data from from_module, set or update some of its fields and yield again.

Module description:
  • path: not used, passed to from_module.

  • from_module: mandatory. Data is updated.

  • yields: The updated data.

Configuration:
  • presets: A dictionary of fields to be set, unless already set by data yielded by from_module.

  • fields: A dictionary of fields to be set. fields will be managed as a string template, passing the data yielded by from_module as parameter.
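
The difference between presets and fields can be sketched as follows. The sketch assumes that templates use str.format syntax and that fields are applied after presets; neither detail is confirmed here, and set_fields_sketch is a hypothetical name.

```python
def set_fields_sketch(data, presets=None, fields=None):
    """Apply presets (only if missing) and fields (always) to a copy of data."""
    result = dict(data)
    for key, value in (presets or {}).items():
        result.setdefault(key, value)            # presets never overwrite
    for key, template in (fields or {}).items():
        result[key] = template.format(**data)    # template filled from the incoming data
    return result


doc = set_fields_sketch(
    {'filename': 'report.txt', 'source': 'disk1'},
    presets={'source': 'unknown', 'reviewed': 'no'},
    fields={'label': '{source}:{filename}'},
)
```

The preset for source is ignored because the incoming data already defines it, while reviewed and label are added.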

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

base.output module

Print the results from other modules to the console or a file.

class base.output.BaseSink(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

An abstract module that prints the results from other modules to a file or standard output.

Do not use this module directly, but one of its extensions.

The from_module of a BaseSink object can be a base.job.BaseModule or an array. This allows a sink to be applied directly to in-memory data while still using the common configuration.

Example

Save a list into a CSV file:

import base.config
import base.job

m = base.job.load_module(
    base.config.Config(), 'base.output.CSVSink',
    extra_config=dict(outfile='outfile.csv'),
    from_module=[
        dict(greeting='Hello', language='English'),
        dict(greeting='Hola', language='Spanish')
    ]
)
list(m.run())

Configuration:
  • outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name: prints to standard output.

  • file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.

Current job section:
  • outfile (str): outfile can be defined in the job section if the outfile in the section is empty

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.output.CSVSink(config, section=None, local_config=None, from_module=None)

Bases: base.output.BaseSink

A module that prints the results from other modules to a file or the standard output as a CSV.

Configuration:
  • outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name to force printing to the standard output

  • file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.

  • delimiter (String): The delimiter parameter of the csv.DictWriter. “TAB” means tabulator.

  • quotechar (String): The quotechar of the csv.DictWriter. Defaults to the double quote character (").

  • extrasaction (String): The extrasaction parameter of the csv.DictWriter. Defaults to “raise”.

  • restval (String): The restval parameter of the csv.DictWriter. Defaults to the empty string.

  • write_header (boolean): If True (default), writes the header of the CSV file.

  • quoting (int): The quoting parameter of the csv.DictWriter.

  • fieldnames: If present, use these field names instead of the fields in the dictionary. You can use this option to order the fields.

  • field_size_limit: maximum field size allowed by the parser. Default “sys.maxsize”. Lower the value to skip writing large inputs.

Current job section:
  • outfile (str): outfile can be defined in the job section if the outfile in the section is empty
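
Most of these options map directly onto csv.DictWriter from the standard library. The sketch below calls csv.DictWriter directly, not CSVSink, to show the effect of fieldnames, delimiter, extrasaction and restval. It writes to an in-memory buffer, and the sample rows are made up for the example.

```python
import csv
import io

rows = [
    {'greeting': 'Hello', 'language': 'English'},
    {'greeting': 'Hola', 'language': 'Spanish', 'note': 'dropped'},
    {'greeting': 'Salut'},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['language', 'greeting'],  # fixes the column order
    delimiter='\t',                       # what the "TAB" setting stands for
    quotechar='"',
    extrasaction='ignore',                # 'raise' would reject the extra 'note' key
    restval='',                           # value written for missing keys
    quoting=csv.QUOTE_MINIMAL,
    lineterminator='\n',
)
writer.writeheader()
writer.writerows(rows)
output = buffer.getvalue()
```

With extrasaction left at "raise", the second row would raise a ValueError because of its unexpected note key.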

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.output.JSONSink(config, section=None, local_config=None, from_module=None)

Bases: base.output.BaseSink

A module that prints the results from other modules to a file or standard output as a JSON object.

Configuration:
  • outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name: prints to standard output.

  • file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.

  • indent (str): Indentation value for the output. Default=None

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

class base.output.MirrorPath(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A basic module that yields the path.

run(path)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

base.templates module

Modules to manage mako templates.

class base.templates.TemplateSink(config, section=None, local_config=None, from_module=None)

Bases: base.output.BaseSink

A base.output.BaseSink that saves into a file or standard output, using a mako template.

Configuration:
  • template_dirs: A space separated list of directories to load templates from. Default: current path, rvt2 path

  • input_encoding: The encoding of the templates. Default: utf-8

  • template_file: A file with the template. Relative to ‘template_dirs’

  • template: The template as a string. This option is ignored if a template_file is provided.

  • skip_on_empty_data: If from_module doesn’t return anything and this is True, do not output anything

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path=None)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

base.threads module

class base.threads.Fork(config, section=None, local_config=None, from_module=None)

Bases: base.job.BaseModule

A module to send the data received from from_module up in the chain and to a job in a different thread.

Configuration:
  • secondary_job: The name of the job to run in the secondary thread. This job cannot be composite (only ‘modules’ allowed) and it will receive the data in the last module of the chain.
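
The general pattern of passing each item up the chain while also feeding it to a consumer in another thread can be sketched with queue and threading. This is a simplified illustration under assumed names (fork_sketch, secondary) and does not reflect how Fork is actually implemented.

```python
import queue
import threading

_DONE = object()   # sentinel that tells the secondary consumer to stop


def fork_sketch(source, secondary):
    """Yield every item of source and also hand it to `secondary` in another thread."""
    pending = queue.Queue()

    def feed():
        while True:
            item = pending.get()
            if item is _DONE:
                return
            yield item

    thread = threading.Thread(target=lambda: secondary(feed()))
    thread.start()
    try:
        for item in source:
            pending.put(item)   # copy for the secondary thread
            yield item          # original continues up the chain
    finally:
        pending.put(_DONE)
        thread.join()           # wait for the secondary consumer to finish


seen = []
passed_on = list(fork_sketch(iter([1, 2, 3]), seen.extend))
```

Both passed_on and seen end up holding the same three items.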

read_config()

Read options from the configuration section.

This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.

run(path)

Run the job on a path

Parameters

path (str) – the path to check.

Yields

If any, an iterable of elements with the output.

shutdown()

This function will be called at the end of the execution of a job.

The shutdown() function of the from_module is called recursively.

base.threads.run_job(*args, daemon=False, **kwargs)

Runs a job from the configuration in a different thread.

Returns

The new thread

base.threads.worker(*args, **kwargs)

The worker that actually runs a job in a thread.

This worker only consumes the generator returned by the job. It does nothing else.
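
A minimal sketch of this pattern with threading.Thread. The names consume, run_in_thread and fake_job are hypothetical, and the results list exists only so that the example can be inspected; the worker described above discards the items.

```python
import threading


def consume(job, results):
    """Exhaust the generator returned by a job."""
    for item in job:
        results.append(item)   # kept only for inspection in this example


def run_in_thread(job, results, daemon=False):
    """Start a thread that consumes the job and return the thread."""
    thread = threading.Thread(target=consume, args=(job, results), daemon=daemon)
    thread.start()
    return thread


def fake_job():
    """Hypothetical stand-in for the generator returned by a job."""
    yield from range(3)


collected = []
thread = run_in_thread(fake_job(), collected)
thread.join()   # the caller decides when to wait for the job
```

Because the new thread is returned, the caller can join it or let it run in the background.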

base.utils module

Utility functions for the rest of the system.

base.utils.check_directory(path, create=False, delete_exists=False, error_exists=False, error_missing=False)

Check if a directory exists.

Parameters
  • error_exists (Boolean) – If True and the directory exists, raise a RVTError

  • error_missing (Boolean) – If True and the directory does not exist, raise a RVTError

  • create (Boolean) – If True and the directory does not exist, create it

  • delete_exists (Boolean) – If True, delete the directory and create a new one.

Returns

True if the directory exists at the end of this function.
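
The interplay of the four flags can be sketched with os and shutil. The order in which the flags are evaluated is an assumption, and the built-in FileExistsError and FileNotFoundError stand in for RVTError.

```python
import os
import shutil
import tempfile


def check_directory_sketch(path, create=False, delete_exists=False,
                           error_exists=False, error_missing=False):
    """Hypothetical re-implementation of the documented flags."""
    exists = os.path.isdir(path)
    if exists and error_exists:
        raise FileExistsError(path)       # stand-in for RVTError
    if not exists and error_missing:
        raise FileNotFoundError(path)     # stand-in for RVTError
    if exists and delete_exists:
        shutil.rmtree(path)               # delete the directory...
        os.makedirs(path)                 # ...and create a new, empty one
    elif not exists and create:
        os.makedirs(path)
    return os.path.isdir(path)


base_dir = tempfile.mkdtemp()
target = os.path.join(base_dir, 'out')
missing = check_directory_sketch(target)               # directory is not created
created = check_directory_sketch(target, create=True)  # directory now exists
shutil.rmtree(base_dir)
```

Without create=True the function only reports whether the directory is there.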

base.utils.check_file(path, error_missing=False, error_exists=False, delete_exists=False, create_parent=False)

Check if a file exists, and optionally removes it.

Parameters
  • error_exists (Boolean) – If True and the file exists, raise a RVTError

  • error_missing (Boolean) – If True and the file does not exist, raise a RVTError

  • delete_exists (Boolean) – If True, delete the file if exists

  • create_parent (Boolean) – If True, create the parent directory

Raises

RVTError if the path is not a file, or the file does not exist and error_missing is set to True

Returns

True if the file exists at the end of this function.

base.utils.check_folder(path)

Check if a path is a folder and create it if it does not exist.

Equivalent to check_directory(path, create=True)

base.utils.generate_id(data=None)

Generate a unique ID for a piece of data. If data is None, returns a random identifier.

The identifier is created using:

uuid.uuid5(uuid.NAMESPACE_URL, 'file:///{}/{}?{}'.format(dirname, filename, embedded_path))

If the data already provides an identifier in a field _id, pop this field from data and return it.
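
The three cases can be sketched as follows. Reading dirname, filename and embedded_path from keys of data, using uuid.uuid4 for the random case and returning strings are all assumptions made for this sketch.

```python
import uuid


def generate_id_sketch(data=None):
    """Hypothetical re-implementation of the three documented cases."""
    if data is None:
        return str(uuid.uuid4())          # random identifier
    if '_id' in data:
        return data.pop('_id')            # reuse the existing id and remove it
    url = 'file:///{}/{}?{}'.format(
        data.get('dirname', ''),
        data.get('filename', ''),
        data.get('embedded_path', ''),
    )
    # uuid5 is deterministic: the same URL always gives the same identifier
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))


first = generate_id_sketch({'dirname': 'case/01', 'filename': 'mail.pst'})
second = generate_id_sketch({'dirname': 'case/01', 'filename': 'mail.pst'})
```

Because uuid5 is a name-based hash, first and second are identical, so re-processing the same file yields the same identifier.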

base.utils.relative_path(path, start)

Transform a path to be relative to a start path.

Todo

We don’t want to go outside the starting path. Check that.

Returns

path relative to start path.

>>> relative_path('/morgue/112234-casename/01/23', '/morgue/112234-casename')
'01/23'
>>> relative_path('/another/112234-casename/01/23', '/morgue/112234-casename')
'../../another/112234-casename/01/23'
>>> relative_path(None, '/morgue/11223344-casename') is None
True

base.utils.save_csv(data, config=None, **kwargs)

Save data in a CSV file. This is a convenient function to run a base.output.CSVSink module from inside another module.

Parameters
  • data – The data to be saved. It can be an iterable (such as a list, tuple or generator) or a base.job.BaseModule. In the latter case, the module is run and its output is saved.

  • config (base.config.Config) – The global configuration object, or None to use default configuration.

  • kwargs (dict) – The extra configuration for the base.output.CSVSink module. You’d want to set, at least, outfile.

base.utils.save_json(data, config=None, **kwargs)

Save data in a JSON file. This is a convenient function to run a base.output.JSONSink module from inside another module.

Parameters
  • data – The data to be saved. It can be an iterable (such as a list, tuple or generator) or a base.job.BaseModule. In the latter case, the module is run and its output is saved.

  • config (base.config.Config) – The global configuration object, or None to use default configuration.

  • kwargs (dict) – The extra configuration for the base.output.JSONSink module. You’d want to set, at least, outfile.