base package¶
The base classes and function helpers for the RVT2.
It includes the base RVT2 modules, the configuration manager, helper functions and some general jobs.
Submodules¶
base.commands module¶
Utility functions and modules to run commands.
-
class
base.commands.
Command
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Run a command before or after the execution of other modules.
- Configuration section:
- run_before
If True, run the command before the execution of from_module.
- run_after
If True, run the command after the execution of from_module.
- from_dir
Run the external command from this directory.
- cmd
The external command to run. It is a Python string template with two optional parameters: infile and outfile.
- infile
The infile parameter, if needed. Default: empty.
- outfile
The outfile parameter, if needed. Default: empty.
- delete_exists
If True, delete the outfile if it exists.
- stdout
If empty, do not overwrite stdout. If provided, save stdout to this filename.
- append
If True, append to the output file. If False and the file exists, remove it.
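As a minimal sketch of how such a command template could be expanded, assuming `cmd` uses `str.format` placeholders (the mechanism documented for `estimate_iterations`; this page does not confirm it for `Command`). The function name `build_command` and the sample commands are hypothetical.

```python
def build_command(cmd, infile='', outfile=''):
    """Expand the optional {infile} and {outfile} placeholders of a template."""
    # Assumption: the template is a str.format string. Both parameters
    # default to the empty string, as the configuration section describes.
    return cmd.format(infile=infile, outfile=outfile)


print(build_command('sort {infile} -o {outfile}', infile='in.txt', outfile='out.txt'))
print(build_command('date'))  # a template without placeholders is left untouched
```

Because both placeholders are optional, a template that uses neither of them passes through unchanged.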
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.commands.
RegexFilter
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to select lines that match a list of regex expressions. Yields a dict with the line, regex, tag and keyword_file as keys.
- Configuration:
keyword_file: The keyword file to use. One keyword per line. Empty lines are ignored. Format: ANNOTATION:::REGEX or REGEX. In the latter case, the annotation will be the regex.
keyword_list: A list of regular expressions to execute. Overrides keyword_file if not empty. Same format as keyword_file.
keyword_dir: Load keyword files from this directory.
cmd: Run this external command to perform a search.
from_dir: Run the external command from this directory. If None, run from the current directory.
encoding: The encoding used to decode the subprocess binary output.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶
- Parameters
path – Search regex in this file.
- Yields
For each line that matches, a dictionary where match is the matching line, regex is the regex that matched, tag is the tag of the regex and keyword_file is the file the regexes were read from, or None.
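The keyword format and the yielded dictionaries can be illustrated with a small standalone sketch. It is not the RVT2 implementation: the function names `parse_keywords` and `filter_lines` are hypothetical, and the use of `re.search` with case-sensitive matching is an assumption.

```python
import re


def parse_keywords(lines):
    """Turn keyword lines into (tag, compiled regex) pairs."""
    keywords = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # empty lines are ignored
        tag, separator, regex = line.partition(':::')
        if not separator:
            # Plain REGEX form: the annotation is the regex itself.
            tag, regex = line, line
        keywords.append((tag, re.compile(regex)))
    return keywords


def filter_lines(text_lines, keywords, keyword_file=None):
    """Yield one dictionary per (line, regex) pair that matches."""
    for text in text_lines:
        for tag, regex in keywords:
            if regex.search(text):  # assumption: search, not full match
                yield dict(match=text, regex=regex.pattern, tag=tag,
                           keyword_file=keyword_file)


keywords = parse_keywords(['EMAIL:::[a-z]+@[a-z]+\\.com', '', 'passw(or)?d'])
for hit in filter_lines(['contact bob@example.com', 'my password'], keywords):
    print(hit['tag'], '->', hit['match'])
```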
-
base.commands.
estimate_iterations
(path, cmd, from_dir=None, logger=logging)¶ Estimate the number of iterations using an external command.
- Parameters
path (str) – The path to use on the command.
from_dir (str) – If specified, run the external command from this directory.
cmd – The external command to run, as a string or an array. If cmd is a string, run the command as a shell command. It is a template that will be formatted as cmd.format(path=path).
- Returns
The estimated number of iterations as an integer, or float('inf') if the number of iterations cannot be estimated.
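The contract described above (format the template with the path, run it, fall back to `float('inf')`) can be sketched with the standard library alone. This is an illustration under assumptions, not the RVT2 source: the name `estimate_iterations_sketch` is hypothetical, and treating any failure or non-numeric output as "cannot be estimated" is a guess.

```python
import subprocess
import sys


def estimate_iterations_sketch(path, cmd, from_dir=None):
    """Run cmd formatted with path and parse its output as an integer."""
    try:
        if isinstance(cmd, str):
            # A string is run as a shell command.
            output = subprocess.check_output(
                cmd.format(path=path), shell=True, cwd=from_dir)
        else:
            # An array is run directly; each element is formatted.
            output = subprocess.check_output(
                [part.format(path=path) for part in cmd], cwd=from_dir)
        return int(output.decode('utf-8').strip())
    except (subprocess.CalledProcessError, OSError, ValueError):
        # Assumption: any failure means the number cannot be estimated.
        return float('inf')


# Hypothetical command: the Python interpreter prints the length of the path.
counter = [sys.executable, '-c', 'print(len("{path}"))']
print(estimate_iterations_sketch('abc', counter))
```

Returning infinity rather than raising lets a caller drive a progress bar without special-casing the unknown total.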
-
base.commands.
run_command
(cmd, stdout=None, stderr=None, logger=logging, from_dir=None)¶ Runs an external command using subprocess.
- Parameters
cmd (str) – The command to run, as a string or an array. If cmd is a string, run the command as a shell command.
stdout (file) – If provided, set the stdout to this stream
stderr (file) – If provided, set the stderr to this stream
logger (logging.Logger) – If provided, use this logger. If not, use the global logging system.
from_dir (str) – Run the external command from this directory. If None, run from current directory.
- Returns
If stdout is provided, returns None. If no stdout is provided, returns the decoded UTF-8 output.
-
base.commands.
yield_command
(cmd, stderr=None, logger=logging, from_dir=None)¶ Runs an external command using subprocess and yields the output line by line.
- Parameters
cmd – The command to run, as a string or an array. If cmd is a string, run the command as a shell command.
logger – If provided, use this logger. If not, use the global logging system.
from_dir – Run the external command from this directory. If None, run from the current directory.
- Yields
UTF-8 decoded lines from the output of the command.
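A line-by-line command runner of this kind can be sketched with `subprocess.Popen`. The function below is hypothetical and only mirrors the documented behaviour (shell mode for strings, UTF-8 decoding, one yielded item per output line); stripping the trailing newline is an assumption.

```python
import subprocess
import sys


def yield_command_sketch(cmd, from_dir=None):
    """Yield the UTF-8 decoded output lines of an external command."""
    shell = isinstance(cmd, str)  # strings are run as shell commands
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=shell,
                          cwd=from_dir) as process:
        for raw_line in process.stdout:
            # Assumption: line endings are stripped before yielding.
            yield raw_line.decode('utf-8').rstrip('\r\n')


command = [sys.executable, '-c', 'print("first"); print("second")']
for line in yield_command_sketch(command):
    print(line)
```

Reading from the pipe lazily means the caller sees the first lines while the command is still running, which matters for long searches.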
base.config module¶
Classes and helper functions to manage the global, local and job configuration.
Warning
Since this module is used to configure the logging system, it cannot log any message.
-
class
base.config.
ColoredFormatter
(fmt=None, datefmt=None, style='%')¶ Bases:
logging.Formatter
A formatter with colors for the logging system.
Based on ideas from: <https://stackoverflow.com/questions/384076/how-can-i-color-python-logging-output> and <https://github.com/borntyping/python-colorlog>
-
use_color
¶ If True, use color on the output. If False, this formatter behaves the same as a regular Formatter.
- Type
Boolean
-
format
(record)¶ Format the specified record as text.
The record’s attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.
-
use_color
= True
-
-
class
base.config.
Config
(filenames=None, config=None, job_name=None)¶ Bases:
object
Configuration of modules and jobs. It is a wrapper around a configparser.SafeConfigParser object.
- Parameters
filenames (array of str) – if not None, read configuration from these files
job_name (str) – The name of the job currently in execution.
config (configparser.SafeConfigParser) – the actual configuration object.
-
config
¶ the actual configuration object.
- Type
configparser.SafeConfigParser
-
copy
()¶ Returns a deep copy of this configuration object
-
get
(section, option, default=None)¶ Get a configuration value.
- Parameters
section (str) – name of the section.
option (str) – name of the option
default (Object) – if not None, the default value to return.
- Returns
The value of the option.
-
has_section
(section)¶ Return True if the configuration has a section
-
options
(section)¶ Return the options in a section
-
read
(path, pattern='**/*.cfg')¶ Read configuration from a file or directory. The configuration file is appended to the current configuration.
- Parameters
path (str) – The path of the single file or directory to read the configuration from.
pattern (regex) – If the path is a directory, use this pattern to select configuration files.
-
sections
()¶ Returns a list of sections in this config
-
set
(section, option, value)¶ Add a configuration to a section. If the section does not exist, add it.
-
store_get
(option, default, job_name=None)¶ Read and returns an option from the local store.
A local store can be used to save and retrieve options between runs, or to communicate between modules in the same job. Do not rely on these options to exist and always use a default value.
The local store is saved optionally in the file configured in section job_name, option localstore. Options are saved in a section named after the current job_name. Notice you can read from any job_name, but only save options on your own job_name.
- Parameters
option – the name of the option
default – the default value of the option. Do not rely on these options to exist and always use a default value.
job_name – if provided, read the option from this job_name.
-
store_set
(option=None, value=None, save=False)¶ Store an option in the local store.
A local store can be used to save and retrieve options between runs, or to communicate between modules in the same job. Do not rely on these options to exist and always use a default value.
The local store is optionally saved in the file configured in section job_name, option localstore. Options are saved in a section named after the current job_name. Notice you can read from any job_name, but only save options on your own job_name.
- Parameters
option (str) – the name of the option. If None, do not store an option. The local store can be saved if option=None and save=True.
value (str) – the value of the option. If None, remove the option.
save (Boolean) – whether the local store must be saved immediately in the file configured in section job_name, option localstore. If there is no file configured, do not save the local store. If the local store was not dirty, it is not saved.
-
class
base.config.
MyExtendedInterpolation
¶ Bases:
configparser.ExtendedInterpolation
Adds support to inheritance to the extended interpolator.
When getting the value of an option from a section, if the option is not defined in the current section, check whether the section has an inherits option. If it does, look for the option in the inherited section and then in the DEFAULTS section.
-
before_get
(parser, section, option, value, defaults)¶
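The lookup order described for this interpolator can be sketched with plain dictionaries. This is only an illustration of the order (own section, then the section named by inherits, then the defaults), not the actual `before_get` code; the function `lookup`, the section names, the `DEFAULT` key and the support for chained inheritance are all assumptions.

```python
def lookup(sections, section, option, defaults_name='DEFAULT'):
    """Resolve an option following 'inherits' links, then the defaults."""
    seen = set()
    current = section
    while current is not None and current not in seen:
        seen.add(current)  # guards against circular inheritance
        options = sections.get(current, {})
        if option in options:
            return options[option]
        current = options.get('inherits')  # assumption: chains are followed
    return sections.get(defaults_name, {}).get(option)


# Hypothetical configuration, written as a dictionary of sections.
sections = {
    'DEFAULT': {'encoding': 'utf-8'},
    'base_reader': {'delimiter': ';'},
    'my_reader': {'inherits': 'base_reader', 'quotechar': "'"},
}
print(lookup(sections, 'my_reader', 'quotechar'))  # defined in the section
print(lookup(sections, 'my_reader', 'delimiter'))  # found through inherits
print(lookup(sections, 'my_reader', 'encoding'))   # found in the defaults
```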
-
-
class
base.config.
TelegramHandler
(level=20, token='', chatids=[])¶ Bases:
logging.Handler
A logging handler to send messages to a list of telegram chatids
-
emit
(record)¶ Do whatever it takes to actually log the specified logging record.
This version is intended to be implemented by subclasses and so raises a NotImplementedError.
-
-
base.config.
check_server
(server)¶ Check whether a server can be reached.
>>> check_server(None)
False
>>> check_server('https://www.google.es')
False
>>> check_server('https://www.google.es:443')
True
- Parameters
server (str) – a URL to connect to a server, such as “http://localhost:9998”. A scheme, hostname and port must be provided. A malformed server will return False.
- Returns
True if the server is listening
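A reachability check with these rules (scheme, hostname and port all required, malformed input returns False) could look like the following sketch. It is not the RVT2 implementation: `check_server_sketch` and its `timeout` parameter are hypothetical, and a plain TCP connection is assumed to be what "listening" means.

```python
import socket
from urllib.parse import urlparse


def check_server_sketch(server, timeout=1.0):
    """Return True if a TCP connection to the server URL succeeds."""
    if not server:
        return False
    try:
        parsed = urlparse(server)
        host, port = parsed.hostname, parsed.port  # .port may raise ValueError
    except ValueError:
        return False
    if not parsed.scheme or not host or port is None:
        return False  # scheme, hostname and port are all mandatory
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(check_server_sketch(None))
print(check_server_sketch('https://www.google.es'))  # no explicit port
```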
-
base.config.
configure_logging
(config, basic=False)¶ Configure the logging system. Some variables can be configured from the [logging] section in the configuration.
- Parameters
config (config.Config) – the global configuration object.
basic (Boolean) – if True, configure a basic but colored logging system to the console.
Todo
We couldn’t find a way to configure log filename easily for each case using configuration files. Temporarily, the logging subsystem is configured using a dictionary and not a configuration file.
- Configuration:
console.level: The logging level for the console handler. Defaults to WARN.
file.level: The logging level for the file handler. Defaults to INFO.
file.logfile: The filename for the file handler. Defaults to rvt2.log.
telegram.level: The logging level for the telegram handler. Defaults to INFO.
telegram.token: the token for the telegram bot. Defaults to None (do not send messages)
telegram.chatids: a space-separated list of chatids to send messages to.
-
base.config.
parse_conf_array
(value)¶ Parses a value in an option and returns it as an array.
Values are separated using spaces or new lines. Double quotes can be used as quoting chars. Spaces can be escaped with a backslash.
>>> parse_conf_array('hello')
['hello']
>>> parse_conf_array('hello world')
['hello', 'world']
>>> parse_conf_array('hello\ world')
['hello world']
>>> parse_conf_array('"hello world" bye')
['hello world', 'bye']
>>> parse_conf_array('base.module.test{"param":"value1\ value2"}')
['base.module.test{"param":"value1 value2"}']
>>> parse_conf_array('base.module.test{"param":"value1\ value2"} base.module.test{"param":"value3\ value4"}')
['base.module.test{"param":"value1 value2"}', 'base.module.test{"param":"value3 value4"}']
>>> parse_conf_array(None)
[]
>>> parse_conf_array('')
[]
- Parameters
value (str) – The value to parse.
- Returns
An array of strings.
base.directory module¶
Modules to parse directories and subdirectories.
-
class
base.directory.
DirectoryClear
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Remove the file or directory specified by ‘target’. Useful when jobs that append results to a file are called again, since it avoids duplicated output.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
base.directory.
DirectoryFilter
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
The module gets a path to a directory and sends to from_module the path to all files inside this path. Optionally, the walker only manages a set of extensions, or excludes files using a regular expression.
- Module description:
path: the absolute path to a file or directory.
- from_module: mandatory. If path is a file, this module is transparent.
If path is a directory, list all the files in the subdirectories (filters might apply) and call from_module for each one of them.
yields: whatever from_module yields each time it is called.
- Configuration:
void_extension (Boolean): If True, files without an extension are always parsed even if a filter is set.
followlinks (Boolean): If True, follow symbolic links
filter: List of file categories to parse. If not provided, parse all files. Categories are section names to be read.
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd (String): The shell command to run to estimate the number of subdirectories in the path.
exclude_pattern: If the path of the files matches this pattern, exclude the file.
restartable: If True, use the local store to save the name of the last directory fully parsed. The parsing won’t continue until this directory is found.
Todo
Files in the last parsed directory might be parsed twice
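The directory walk described above can be approximated with `os.walk`. The sketch below is hypothetical and covers only part of the behaviour: a file path is passed through unchanged, a directory is walked, and `exclude_pattern` is assumed to be a regular expression searched in the full path. Category filters, `void_extension`, the progress bar and the restartable mode are not modelled.

```python
import os
import re


def walk_files(path, exclude_pattern=None, followlinks=False):
    """Yield the path of every file under path, skipping excluded ones."""
    if os.path.isfile(path):
        yield path  # a single file is passed through unchanged
        return
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    for root, dirs, files in os.walk(path, followlinks=followlinks):
        dirs.sort()  # deterministic order, an arbitrary choice of this sketch
        for name in sorted(files):
            full_path = os.path.join(root, name)
            if exclude is not None and exclude.search(full_path):
                continue
            yield full_path
```

A module chained after this walker would then be called once per yielded path, for example `for p in walk_files('/evidence', exclude_pattern=r'\.tmp$')` with a hypothetical `/evidence` directory.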
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Gets a path and calls to from_module for each file in the directory or subdirectories.
-
class
base.directory.
FileClassifier
(*args, **kwargs)¶ Bases:
base.job.BaseModule
Classifies a piece of data according to its content-type, extension or path.
This class can be used as a module or as a stand-alone object.
- Configuration section:
categories: list of categories to use. Categories are section names with extension and content type.
check_extension: When used as a module: if True, check the path extension; if False, check only content_type to decide the category.
Example
>>> import base.config
>>> import base.job
>>> c = base.config.Config(filenames=['conf/file_categories.cfg'])
>>> fc = FileClassifier(c, local_config=dict(categories='compressed office'))
>>> print(fc.classifyByExtension('.docx'))
office
>>> print(fc.classifyByContentType('application/x-compress'))
compressed
>>> print(fc.classify(dict(extension='.docx')))
office
>>> print(fc.classify(dict(extension='.docx', content_type='application/x-compress')))
compressed
>>> print(fc.classify(dict(path='filename')))
None
>>> print(fc.classify(dict(path='filename.docx')))
office
-
classify
(data)¶ Classifies a piece of data. Data is a dictionary that must include either content_type, extension or path.
-
classifyByContentType
(content_type)¶ Classifies a content type.
- Parameters
content_type – The content type to classify. For example, ‘application/x-msaccess’.
- Returns
The name of the category, or None
-
classifyByExtension
(extension)¶ Classifies an extension.
- Parameters
extension – The extension to classify. For example, ‘.docx’.
- Returns
The name of the category, or None
-
classifyByPath
(path)¶ Classifies a path. This method extracts the extension from the path and calls classifyByExtension.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Classifies all items sent by from_module
-
class
base.directory.
FileParser
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Call a job for each file in a path that matches a regex.
- Module description:
path: run from_module on this path. If the path matches a regex, call also to a configured jobname
from_module: optional. If None, not used.
yields: whatever from_module and jobname yield each time they are called.
- Configuration:
parsers: A list of regex and modules. First, the regular expression matching a filename; second, the jobname to run on this filename.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.directory.
GlobFilter
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
The module gets a glob pattern as a path, and runs from_module on all items matching the pattern.
See: https://docs.python.org/3.6/library/glob.html
- Module description:
path: a glob pattern. Run from_module on all items matching this pattern.
from_module: mandatory.
yields: whatever from_module yields each time it is called.
- Configuration:
recursive: Passes this parameter to glob.iglob: whether the path must run recursively or not.
ftype: either “file”, “directory” or “all”.
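How the two options might combine can be sketched with `glob.iglob` directly. The function `glob_items` is hypothetical, and the way `ftype` is applied (a filter on each match) is an assumption about the module's behaviour.

```python
import glob
import os


def glob_items(pattern, recursive=True, ftype='all'):
    """Yield the items matching a glob pattern, filtered by type."""
    for item in sorted(glob.iglob(pattern, recursive=recursive)):
        if ftype == 'file' and not os.path.isfile(item):
            continue
        if ftype == 'directory' and not os.path.isdir(item):
            continue
        yield item


# Hypothetical usage: every file below the current directory.
for item in glob_items(os.path.join('.', '**'), recursive=True, ftype='file'):
    print(item)
```

With `recursive=True`, the `**` component matches any number of nested directories, which is why the results are often fed to a directory walker only when directories are wanted as well.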
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Parses objects matching a glob pattern.
If the glob can match files and directories, you probably want to feed the results to a DirectoryFilter.
- Parameters
path (str) – the glob pattern. It will be recursive. See https://docs.python.org/3.6/library/glob.html
-
class
base.directory.
MirrorOptions
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Return the value of the local options
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
base.help module¶
Jobs to assist in the use of the RVT2: list available jobs, show help about a job or module.
-
class
base.help.
AvailableJobs
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to list all available jobs in the RVT2
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
class
base.help.
Help
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to show help about a job or module whose name is passed as the path of the module.
- Configuration section:
show_vars: List of variables in the section to show. If “ALL”, show all variables. If Empty, do not show context variables.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
base.input module¶
Some simple file readers to be used as input for other modules.
-
class
base.input.
AllLinesInFile
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Yields every line in a file as a string
- Configuration:
encoding (String): The encoding to use. Defaults to utf-8
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd (String): The shell command to run to estimate the number of lines in the file.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Read all lines from the path. from_module is ignored
-
class
base.input.
CSVReader
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Yields every line in a CSV file.
- Configuration:
encoding (String): The encoding to use. Defaults to “utf-8”
delimiter (String): The delimiter to use. Defaults to ;
quotechar (String): The quotechar. Defaults to “
restkey (String): The restkey of the DictReader. Defaults to “extra”.
restval (String): The restval of the DictReader. Defaults to the empty string.
content_type: The content_type to set, if fill_common_fields is set
fieldnames: A space separated list of header names. If None, use the first line. Warning: if provided, the first line will be considered data unless ignore_lines is set to >0
ignore_lines (int): Ignore this number of initial lines. If fieldnames is provided, the first line is also ignored.
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd (String): The shell command to run to estimate the number of lines in the file.
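Most of these options map directly onto the parameters of `csv.DictReader` in the standard library. The sketch below shows the documented defaults in action on an in-memory file; the function `read_csv_rows` is hypothetical, and `ignore_lines`, `content_type` and the progress options are left out.

```python
import csv
import io


def read_csv_rows(stream, fieldnames=None, delimiter=';', quotechar='"',
                  restkey='extra', restval=''):
    """Yield every row of a CSV stream as a dictionary."""
    reader = csv.DictReader(stream, fieldnames=fieldnames,
                            delimiter=delimiter, quotechar=quotechar,
                            restkey=restkey, restval=restval)
    for row in reader:
        yield dict(row)


sample = io.StringIO('greeting;language\nHello;English;surplus\nHola\n')
for row in read_csv_rows(sample):
    print(row)
```

Surplus values in a row are collected in a list under the `restkey` name, and missing values are filled with `restval`, so every yielded dictionary has at least the header fields.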
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Read CSV file in the path. from_module is ignored
-
class
base.input.
DummyReader
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A dummy reader that creates as many empty dictionaries as requested in _number_.
Use for debugging.
- Configuration:
number: Yield this many empty dictionaries
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ path and from_module are ignored
-
class
base.input.
ForAllLinesInFile
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Pass to from_module each line in a file as the path.
- Configuration:
encoding (String): The encoding to use. Defaults to utf-8
progress.disable (Boolean): If True, disable the progress bar.
progress.cmd (String): The shell command to run to estimate the number of lines in the file.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Read all lines from the path and pass them to from_module
-
class
base.input.
GeneratorReader
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Manages from_module as a generator, not a module, and yields its contents.
You can use this module to inject an array into another module down in the chain.
Example
Save a list in a CSV file:
data = [
    dict(greeting='Hello', language='English'),
    dict(greeting='Hola', language='Spanish')
]
base.output.CSVSink(
    config,
    from_module=GeneratorReader(config, from_module=data),
    local_config=dict(outfile='outfile.csv')
).run()
-
from_module
¶ Any generator-like object, such as a list. Yields its contents.
-
run
(path=None)¶ Path is ignored.
-
-
class
base.input.
JSONReader
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.input.AllLinesInFile
Load every line in a file as a JSON dictionary and yields it.
-
run
(path)¶ Read JSON file in the path. from_module is ignored
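Reading one JSON document per line can be sketched in a few lines of standard-library Python. The function `read_json_lines` is hypothetical, and skipping blank lines is an assumption not stated in this documentation.

```python
import io
import json


def read_json_lines(stream):
    """Yield one decoded JSON object per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:  # assumption: blank lines are skipped
            yield json.loads(line)


sample = io.StringIO('{"greeting": "Hello"}\n\n{"greeting": "Hola"}\n')
for document in read_json_lines(sample):
    print(document['greeting'])
```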
-
-
class
base.input.
SQLiteReader
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Returns the cursor of a query on a sqlite database.
Rows in the database are returned as dictionaries.
- Configuration:
query: The SQL query to run.
- Current job section:
query: If the query in the module section is empty, read the SQL query from the job section.
read_only: If True, open the database in read_only mode
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Read database from the path. from_module is ignored
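Returning rows as dictionaries and opening the database read-only are both available in the standard `sqlite3` module. The following is a sketch of one way to combine them, not the module's code: `query_as_dicts` is a hypothetical name and the use of a `mode=ro` URI is an assumption.

```python
import pathlib
import sqlite3


def query_as_dicts(path, query, read_only=True):
    """Run a query on a SQLite file and yield each row as a dictionary."""
    uri = pathlib.Path(path).resolve().as_uri()
    if read_only:
        uri += '?mode=ro'  # SQLite URI parameter for read-only access
    connection = sqlite3.connect(uri, uri=True)
    connection.row_factory = sqlite3.Row  # rows expose their column names
    try:
        for row in connection.execute(query):
            yield dict(row)
    finally:
        connection.close()
```

A call such as `list(query_as_dicts('history.sqlite', 'SELECT url FROM urls'))` (hypothetical file and table) would return a list of dictionaries keyed by column name.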
base.job module¶
Jobs and modules.
-
class
base.job.
BaseModule
(config, section=None, local_config=None, from_module=None)¶ Bases:
object
The base for all modules. Do not use this module directly, always extend it.
- Configuration:
stop_on_error: If True, stop the execution on an error.
logger_name: The name of the logger to use.
- Parameters
config (base.config.Config) – Global configuration for the application.
section (str) – the name of the configuration section for this module in the global configuration object. If None, use the classname.
local_config (dict) – local configuration for this module. This configuration overrides the values in the section in the global configuration.
from_module (base.job.BaseModule) – If in a chain, the next module in the chain, or None.
-
check_params
(path, check_from_module=False, check_path=False, check_path_exists=False)¶ Check the module is configured correctly. Extend this function to run your own tests.
- Parameters
path (str) – The path passed to the run() method.
check_from_module (boolean) – If True, check a from_module is defined.
check_path (boolean) – If True, check the path is not None.
check_path_exists (boolean) – If True, check the path exists.
- Raises
RVTError – if the tests are not passed.
-
logger
()¶ Get the logger for this parser.
Warning
Do not store the logger as an internal variable: the user may want to change the logger at any time.
-
myconfig
(option, default=None)¶ Get the value of a configuration for this module.
- Parameters
option (str) – the name of the option
default – the default value of the option
-
myflag
(option, default=False)¶ A convenience method for self.config.getboolean(self.section, option, False)
-
options
()¶ Return a dictionary with the available options to this job
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
set_default_config
(option, default=None)¶ Set the default value of a configuration option for this module.
- Parameters
option (str) – the name of the option
default (str) – the default value of the option. It MUST be a string.
-
shutdown
()¶ This function will be called at the end of the execution of a job.
The shutdown() function of the from_module is called recursively.
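The chaining pattern that these methods describe (each module pulls from its from_module, and shutdown propagates down the chain) can be sketched without RVT2 itself. All class names below are hypothetical stand-ins, and the real `BaseModule` takes a `config` object that this sketch omits.

```python
class SketchModule:
    """A stand-in for a module: local options plus an optional from_module."""

    def __init__(self, local_config=None, from_module=None):
        self.local_config = dict(local_config or {})
        self.from_module = from_module
        self.closed = False

    def myconfig(self, option, default=None):
        return self.local_config.get(option, default)

    def run(self, path=None):
        return iter(())  # the base class yields nothing

    def shutdown(self):
        self.closed = True
        if self.from_module is not None:
            self.from_module.shutdown()  # propagate down the chain


class Source(SketchModule):
    def run(self, path=None):
        for index in range(int(self.myconfig('number', 2))):
            yield {'path': path, 'index': index}


class Tagger(SketchModule):
    def run(self, path=None):
        for item in self.from_module.run(path):
            item['tag'] = self.myconfig('tag', 'untagged')
            yield item


chain = Tagger(local_config={'tag': 'demo'}, from_module=Source())
for item in chain.run('some/path'):
    print(item)
chain.shutdown()
```

Because `run` is a generator at every level, nothing is computed until the outermost module is iterated.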
-
class
base.job.
CascadeWrapper
(*args, **kwargs)¶ Bases:
base.job.BaseModule
-
run
(*args, **kwargs)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
-
exception
base.job.
RVTCritical
¶ Bases:
Exception
A special class for Exceptions inside the RVT. The rvt2 cannot continue.
-
exception
base.job.
RVTError
¶ Bases:
Exception
A special class for Exceptions inside the RVT. The module or job cannot continue.
-
base.job.
get_path_array
(job_name, myparams, extra_config, default_path, config)¶ Get the path array from, in order: 1. myparams (job_with_params); 2. extra_config; 3. the default_path parameter; 4. config.
-
base.job.
load_module
(config, confsection, from_module=None, extra_config=None)¶ Loads a module from a section name.
- Parameters
config (base.config.Config) – global configuration object to pass to the module.
confsection (str) – The section name, and optional local configuration. The section name is searched in the configuration. If the section is present and it has a “module” attribute, load the class “module”. If the section is not present or it doesn’t have a “module” attribute, try to load the section name as a class. Format: SECTIONNAME{‘OPTION’: ‘VALUE’, ‘OPTION2’: ‘VALUE2’}.
from_module (base.job.BaseModule) – pass this value as the from_module configuration. Default: None
extra_config (dict) – extra local configuration for the module. Default: None.
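The SECTIONNAME{...} format can be split into a name and a dictionary of local options with a short helper. This is a sketch under assumptions, not the parser used by `load_module`: the function `split_confsection` is hypothetical, and treating the braces as a Python/JSON-style literal that `ast.literal_eval` can read is a guess based on the examples in this documentation.

```python
import ast


def split_confsection(confsection):
    """Split 'NAME{...}' into the section name and its local configuration."""
    name, brace, rest = confsection.partition('{')
    # Assumption: the braces contain a dictionary literal with string values.
    local_config = ast.literal_eval(brace + rest) if brace else {}
    return name.strip(), local_config


print(split_confsection('base.module.test{"param": "value1 value2"}'))
print(split_confsection('base.input.CSVReader'))
```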
-
base.job.
parse_modules_chain
(job_name, myparams, config)¶ Parse a list of modules or jobs from a conf_name, taking into account local configuration.
- Parameters
job_name (str) – The name of the job. The modules or jobs will be loaded first from “jobs” and, if not found, from “modules”. The default parameters will be loaded from “default_params”. The chain string will be managed as a string template, where the “default_params” is applied. If the job does not have a “jobs” or “modules” configuration, just return the name of the job. The system will assume this name is a class name.
myparams (dict) – The local parameters, as returned by parse_modules_name()
config (base.config.Config) – The configuration object for the application.
- Returns
A list of modules to load.
-
base.job.
parse_modules_name
(input_name, default='True')¶ Parse a module or job name searching for local configurations.
- Parameters
input_name (str) – The name of a module or job with an optional configuration. The optional configuration is appended next to the job name, as pair name=value, or only name. See examples below.
default (str) – The default value of a param, when only the name is given.
- Returns
A tuple. The first member is the name of the module or job. The second is the local configuration, if any.
>>> parse_modules_name('funcname')
('funcname', OrderedDict())
>>> parse_modules_name('funcname ')
('funcname', OrderedDict())
>>> parse_modules_name('funcname greetings="good morning" name="Jim" morning')
('funcname', OrderedDict([('greetings', 'good morning'), ('name', 'Jim'), ('morning', 'True')]))
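One way to reproduce the behaviour shown in these examples is to tokenize the string with `shlex` and split each token on the first equals sign. This is a sketch, not necessarily how RVT2 implements it; the function name `parse_name_sketch` is hypothetical and an empty input is not handled.

```python
import shlex
from collections import OrderedDict


def parse_name_sketch(input_name, default='True'):
    """Split 'name key=value flag' into the name and its parameters."""
    tokens = shlex.split(input_name)  # honours double-quoted values
    params = OrderedDict()
    for token in tokens[1:]:
        key, separator, value = token.partition('=')
        # A bare name without '=' gets the default value.
        params[key] = value if separator else default
    return tokens[0], params


print(parse_name_sketch('funcname greetings="good morning" name="Jim" morning'))
```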
-
base.job.
run_job
(config, job_name_with_params, path=None, extra_config=None, from_module=None)¶ Runs a job from the configuration. This job has ‘jobs’, ‘modules’ or ‘cascade’.
- Parameters
config (base.config.Config) – The configuration object
job_name_with_params (str) – The name of the job to run. It must be a section in the configuration. This string will be parsed using parse_modules_name() and it may include additional parameters.
path (list of str) – Run the job on these paths.
extra_config (dict) – extra local configuration for all the modules in the job. Default: None
from_module (base.job.BaseModule) – use this as the from_module of the last module (only in single jobs)
- Returns
If the job is single (it has ‘modules’ or ‘cascade’), a generator with the result of the execution. If the job is composite (it has ‘jobs’), return an empty list, since the results of the individual jobs are probably unrelated. You MUST read each item from the returned generator.
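The requirement to read every item follows from how Python generators work: the body of a generator function does not run until the generator is iterated. The sketch below uses a hypothetical stand-in, `run_job_sketch`, to show that no work happens until the results are consumed.

```python
def run_job_sketch(paths, log):
    """A stand-in for a job: the work happens lazily, item by item."""
    for path in paths:
        log.append('processed ' + path)  # side effect of the job
        yield {'path': path}


log = []
results = run_job_sketch(['a.txt', 'b.txt'], log)
print(log)  # still empty: nothing has run yet

for _ in results:  # consuming the generator drives the job
    pass
print(log)
```

A job whose only purpose is its side effects (writing a file, indexing documents) therefore still needs a loop, or `list(...)`, around the returned generator.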
-
base.job.
run_single_job
(config, job_name_with_params, default_path=None, extra_config=None, from_module=None)¶ Runs a job from the configuration. This job has only ‘modules’, it does not include ‘jobs’.
- Parameters
config (base.config.Config) – The configuration object
job_name_with_params (str) – The name of the job to run. It must be a section in the configuration. This string will be parsed using parse_modules_name() and it may include additional parameters.
default_path (list of str) – Run the job on these paths. The order to read paths for a job is: 1. job_with_params (single path); 2. extra_config; 3. this parameter; 4. config.
extra_config (dict) – extra local configuration for all the modules in the job. Default: None
from_module (base.job.BaseModule) – use this as the from_module of the last module in the chain.
- Returns
A generator that yields each of the results of the execution.
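The lookup order described for default_path can be pictured as a chain of fallbacks. This is only a sketch under stated assumptions: the function name, the argument names and the `'path'` key are hypothetical and are not taken from the RVT2 source.

```python
def resolve_paths(params, extra_config, default_path, config_section):
    """Return the first available path source, in the documented order.

    All names here, including the 'path' key, are hypothetical.
    """
    if params.get('path'):
        return [params['path']]        # 1. single path in the job string
    if extra_config and extra_config.get('path'):
        return extra_config['path']    # 2. extra_config
    if default_path:
        return default_path            # 3. the default_path parameter
    return config_section.get('path')  # 4. the configuration
```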
base.mutations module¶
Modules to mutate data yielded by other modules: convert using specific converters, remove fields, set fields to default values…
-
class
base.mutations.
AddFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Get data from from_module, add some new fields loaded from configuration and yield again.
- Module description:
path: not used, passed to from_module.
from_module: Data is updated.
yields: The updated data.
- Configuration:
section: Section from configuration where new values are to be retrieved
fields: A dictionary of fields to be set. fields will be managed as a string template, passing the options from the configuration section as parameter.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.mutations.
Collapse
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Collapse different documents sent by from_module with a common field into just one document.
Warning: the collapse may take a lot of time and memory.
- Configuration section:
field: collapse documents using this field name as the common field.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
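A rough sketch of why collapsing is memory-hungry: no document can be emitted until the whole input has been read, because a later document may still share the common field. The merge rule below (later values overwrite earlier ones) is an assumption, as is the function name; the documentation does not state how conflicting fields are combined.

```python
from collections import OrderedDict


def collapse(documents, field):
    """Merge documents that share the same value in `field`."""
    merged = OrderedDict()
    for doc in documents:
        key = doc.get(field)
        # Every group is kept in memory until the input is exhausted.
        # Assumption: on conflict, the later document wins.
        merged.setdefault(key, {}).update(doc)
    yield from merged.values()
```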
-
class
base.mutations.
CommonFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Adds common fields for a document: path, filename, dirname, extension, content_type and _id if they don’t exist yet.
- Module description:
path: not used, passed to from_module.
from_module: mandatory. Copy the information sent by from_module and add fields if they don’t exist yet.
yields: the modified data.
- Configuration:
calculate_id: if True, calls base.utils.generate_id to generate an identifier in the _id field.
disabled: if True, do not add anything and just yield the result. Useful in configurable module chains
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
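The "add only if missing" behaviour can be illustrated with `dict.setdefault`. This sketch assumes the fields are derived from `path` with the standard library; the function name, the exact derivation and the format of `extension` (with a leading dot here) are assumptions, and the `_id` handling is left out.

```python
import mimetypes
import os


def add_common_fields(data):
    """Return a copy of data with path-derived fields filled in if missing."""
    result = dict(data)
    path = result.get('path', '')
    # setdefault never overwrites a field already sent by from_module.
    result.setdefault('dirname', os.path.dirname(path))
    result.setdefault('filename', os.path.basename(path))
    result.setdefault('extension', os.path.splitext(path)[1])
    result.setdefault('content_type', mimetypes.guess_type(path)[0])
    return result
```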
-
class
base.mutations.
DateFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Converts some fields into ISO date strings.
Fields might be:
An integer, or a string representing an integer: it is a UNIX timestamp.
A string: the module will use the dateutil package to parse it
If the field cannot be converted and stop_on_error is not set, the field is popped out from the data.
- Module description:
path: not used, passed to from_module.
from_module: mandatory. Get data and update fields.
yields: The modified data.
- Configuration:
fields: A space separated list of fields to check to convert
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ The path will be passed to the mandatory from_module
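The conversion rules above might look roughly like the following. To stay within the standard library, this sketch swaps in `datetime.fromisoformat` where the module is documented to use dateutil, and it assumes timestamps are interpreted as UTC; both choices, and the function names, are assumptions.

```python
from datetime import datetime, timezone


def to_iso(value):
    """Convert a UNIX timestamp or a date string into an ISO date string."""
    try:
        # Integers, or strings representing integers, are UNIX timestamps.
        return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    except (TypeError, ValueError):
        # Stand-in parser: the real module is documented to use dateutil.
        return datetime.fromisoformat(value).isoformat()


def convert_fields(data, fields, stop_on_error=False):
    """Convert the space separated `fields` of data in place."""
    for name in fields.split():
        if name not in data:
            continue
        try:
            data[name] = to_iso(data[name])
        except (TypeError, ValueError):
            if stop_on_error:
                raise
            data.pop(name)  # documented behaviour when conversion fails
    return data
```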
-
class
base.mutations.
ForEach
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Runs a job for each data yielded by from_module. The data is passed as params of the job.
- Module description:
path: not used, passed to from_module.
from_module: mandatory. The data is passed to run_job as its extra_config parameter.
yields: None
- Configuration:
run_job: The name of the job to run
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.mutations.
GetFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Get data from from_module, yield fields specified.
- Module description:
path: not used, passed to from_module.
from_module: Data dict.
yields: The updated dict data.
- Configuration:
section: Section from configuration where new values are to be retrieved
fields: A list of fields to be yielded.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.mutations.
RemoveFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Drops some fields from data.
- Module description:
path: not used, passed to from_module.
from_module: mandatory. Get data and remove fields.
yields: The modified data.
- Configuration:
fields: A space separated list of fields to drop
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
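As an illustration only, dropping a space separated list of fields from each document could be written as below. The function name is hypothetical, and whether RVT2 mutates the original dictionary or yields a copy is not stated; this sketch yields a copy.

```python
def remove_fields(documents, fields):
    """Yield each document without the fields named in `fields`."""
    names = fields.split()  # space separated list, as in the configuration
    for doc in documents:
        yield {key: value for key, value in doc.items() if key not in names}
```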
-
class
base.mutations.
SetFields
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
Get data from from_module, set or update some of its fields and yield again.
- Module description:
path: not used, passed to from_module.
from_module: mandatory. Data is updated.
yields: The updated data.
- Configuration:
presets: A dictionary of fields to be set, unless already set by data yielded by from_module.
fields: A dictionary of fields to be set. fields will be managed as a string template, passing the data yielded by from_module as parameter.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
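The difference between presets and fields can be sketched as follows: presets only fill gaps, while fields always overwrite and are rendered from the incoming data. The function name is hypothetical, and the use of `str.format` placeholders is an assumption, since the documentation only says "string template".

```python
def set_fields(documents, presets=None, fields=None):
    """Apply presets as defaults and fields as rendered templates."""
    for doc in documents:
        result = dict(presets or {})  # presets act as defaults
        result.update(doc)            # data from from_module wins over presets
        for name, template in (fields or {}).items():
            # Assumption: templates use str.format placeholders.
            result[name] = template.format(**doc)
        yield result
```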
base.output module¶
Print the results from other modules to the console or a file.
-
class
base.output.
BaseSink
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
An abstract module that prints the results from other modules to a file or standard output.
Do not use this module directly, but one of its extensions.
The from_module of a BaseSink object can be a base.job.BaseModule or an array. This lets you feed a sink directly from a list while reusing its common configuration.
Example
Save a list into a CSV file:
m = base.job.load_module(
    base.config.Config(),
    'base.output.CSVSink',
    extra_config=dict(outfile='outfile.csv'),
    from_module=[
        dict(greeting='Hello', language='English'),
        dict(greeting='Hola', language='Spanish')
    ]
)
list(m.run())
- Configuration:
outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name: prints to standard output.
file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.
- Current job section:
outfile (str): outfile can be defined in the job section if the outfile in the section is empty
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.output.
CSVSink
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.output.BaseSink
A module that prints the results from other modules to a file or the standard output as a CSV.
- Configuration:
outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name to force printing to the standard output
file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.
delimiter (String): The delimiter parameter of the csv.DictWriter. “TAB” means tabulator.
quotechar (String): The quotechar of the csv.DictWriter. Defaults to the double quote character (").
extrasaction (String): The extrasaction parameter of the csv.DictWriter. Defaults to “raise”.
restval (String): The restval parameter of the csv.DictWriter. Defaults to the empty string.
write_header (boolean): If True (default), writes the header of the CSV file.
quoting (int): The quoting parameter of the csv.DictWriter.
fieldnames: If present, use these field names instead of the fields in the dictionary. You can use this option to order the fields.
field_size_limit: maximum field size allowed by the parser. Default “sys.maxsize”. Lower the value to skip writing large inputs.
- Current job section:
outfile (str): outfile can be defined in the job section if the outfile in the section is empty
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
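Most of these options map directly onto parameters of the standard library's `csv.DictWriter`. The sketch below uses only the standard library, not the sink itself, to show what they do; the in-memory buffer, the sample rows and the `lineterminator` setting are choices made for the example.

```python
import csv
import io

rows = [
    dict(greeting='Hello', language='English'),
    dict(greeting='Hola', language='Spanish', extra='ignored'),
    dict(greeting='Kaixo'),
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=['language', 'greeting'],  # fixes the column order
    delimiter='\t',                       # what "TAB" stands for
    quotechar='"',
    extrasaction='ignore',                # "raise" would fail on 'extra'
    restval='',                           # value used for missing fields
    lineterminator='\n',
)
writer.writeheader()                      # the effect of write_header=True
writer.writerows(rows)
output = buffer.getvalue()
```

With extrasaction left at "raise", the second row would raise a `ValueError` because of its unexpected `extra` key.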
-
class
base.output.
JSONSink
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.output.BaseSink
A module that prints the results from other modules to a file or standard output as a JSON object.
- Configuration:
outfile (str): If provided, saved to this file (absolute path) instead of standard output. CONSOLE is a special name: prints to standard output.
file_exists (str): If outfile exists, APPEND (this is the default behaviour), OVERWRITE or throw an ERROR.
indent (str): Indentation value for the output. Default=None
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
class
base.output.
MirrorPath
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A basic module that yields the path.
-
run
(path)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
base.templates module¶
Modules to manage mako templates.
-
class
base.templates.
TemplateSink
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.output.BaseSink
A base.output.BaseSink that saves into a file or standard output, using a mako template.
- Configuration:
template_dirs: A space separated list of directories to load templates from. Default: current path, rvt2 path
input_encoding: The encoding of the templates. Default: utf-8
template_file: A file with the template. Relative to ‘template_dirs’
template: The template as a string. This option is ignored if a template_file is provided.
skip_on_empty_data: If from_module doesn’t return anything and this is True, do not output anything
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path=None)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
base.threads module¶
-
class
base.threads.
Fork
(config, section=None, local_config=None, from_module=None)¶ Bases:
base.job.BaseModule
A module to send the data received from from_module up in the chain and to a job in a different thread.
- Configuration:
secondary_job: The name of the job to run in the secondary thread. This job cannot be composite (only ‘modules’ allowed) and it will receive the data in the last module of the chain.
-
read_config
()¶ Read options from the configuration section.
This method should set default values for all available configuration options. The other module function will safely assume these options have correct values.
-
run
(path)¶ Run the job on a path
- Parameters
path (str) – the path to check.
- Yields
If any, an iterable of elements with the output.
-
shutdown
()¶ This function will be called at the end of the execution of a job.
The shutdown() function of the from_module is called recursively.
-
base.threads.
run_job
(*args, daemon=False, **kwargs)¶ Runs a job from the configuration in a different thread.
- Returns
The new thread
-
base.threads.
worker
(*args, **kwargs)¶ The worker that actually runs a job in a thread.
This worker only consumes the generator returned by the job. It does nothing else
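The relationship between the two functions can be pictured with the standard `threading` module. This is a sketch of the pattern, not the RVT2 implementation: `run_in_thread` is a hypothetical name, and the job is represented by an arbitrary generator.

```python
import threading


def worker(job):
    """Consume the generator returned by a job; the results are discarded."""
    for _ in job:
        pass


def run_in_thread(job, daemon=False):
    """Start a thread that drains `job` and return the new thread."""
    thread = threading.Thread(target=worker, args=(job,), daemon=daemon)
    thread.start()
    return thread
```

The caller can keep the returned thread and call `join()` on it to wait for the job to finish.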
base.utils module¶
Utility functions to the rest of the system.
-
base.utils.
check_directory
(path, create=False, delete_exists=False, error_exists=False, error_missing=False)¶ Check if a directory exists.
- Parameters
error_exists (Boolean) – If True and the directory exists, raise a RVTError
error_missing (Boolean) – If True and the directory does not exist, raise a RVTError
create (Boolean) – If True and the directory does not exist, create it
delete_exists (Boolean) – If True, delete the directory and create a new one.
- Returns
True if the directory exists at the end of this function.
-
base.utils.
check_file
(path, error_missing=False, error_exists=False, delete_exists=False, create_parent=False)¶ Check if a file exists, and optionally removes it.
- Parameters
error_exists (Boolean) – If True and the file exists, raise a RVTError
error_missing (Boolean) – If True and the file does not exist, raise a RVTError
delete_exists (Boolean) – If True, delete the file if exists
create_parent (Boolean) – If True, create the parent directory
- Raises
RVTError – if the path is not a file, or the file does not exist and error_missing is set to True
- Returns
True if the file exists at the end of this function.
-
base.utils.
check_folder
(path)¶ Check if a path is a folder and create it if it does not exist.
Equivalent to check_directory(path, create=True)
-
base.utils.
generate_id
(data=None)¶ Generate a unique ID for a piece of data. If data is None, returns a random identifier.
The identifier is created using:
uuid.uuid5(uuid.NAMESPACE_URL, 'file:///{}/{}?{}'.format(dirname, filename, embedded_path))
If the data already provides an identifier in a field _id, pop this field from data and return it.
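Because `uuid.uuid5` is name-based, the same directory, filename and embedded path always produce the same identifier. Here is a small sketch of the documented formula; the function name, the argument handling and the conversion to `str` are assumptions.

```python
import uuid


def make_id(dirname, filename, embedded_path=''):
    """Build a name-based (version 5) UUID from the documented URL pattern."""
    url = 'file:///{}/{}?{}'.format(dirname, filename, embedded_path)
    return str(uuid.uuid5(uuid.NAMESPACE_URL, url))
```

This determinism means re-processing the same file yields the same identifier, unlike the random identifier returned when data is None.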
-
base.utils.
relative_path
(path, start)¶ Transform a path to be relative to a start path.
Todo
We don’t want to go outside the starting path. Check that.
- Returns
path relative to start path.
>>> relative_path('/morgue/112234-casename/01/23', '/morgue/112234-casename')
'01/23'
>>> relative_path('/another/112234-casename/01/23', '/morgue/112234-casename')
'../../another/112234-casename/01/23'
>>> relative_path(None, '/morgue/11223344-casename') is None
True
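The documented results match what the standard library's `relpath` produces. Here is a sketch of an equivalent helper, assuming POSIX-style paths; the function name is hypothetical and the actual implementation may differ.

```python
import posixpath


def relative_path_sketch(path, start):
    """Return path relative to start, passing None through unchanged."""
    if path is None:
        return None
    return posixpath.relpath(path, start)
```

As the Todo notes, nothing here prevents the result from leaving the starting path: a result beginning with `..` points outside it.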
-
base.utils.
save_csv
(data, config=None, **kwargs)¶ Save data in a CSV file. This is a convenience function to run a base.output.CSVSink module from inside another module.
- Parameters
data – The data to be saved. It can be an iterable (such as a list, tuple or generator) or a base.job.BaseModule. In the latter case, the module is run and its output is saved.
config (base.config.Config) – The global configuration object, or None to use default configuration.
kwargs (dict) – The extra configuration for the base.output.CSVSink module. You’d want to set, at least, outfile.
-
base.utils.
save_json
(data, config=None, **kwargs)¶ Save data in a JSON file. This is a convenience function to run a base.output.JSONSink module from inside another module.
- Parameters
data – The data to be saved. It can be an iterable (such as a list, tuple or generator) or a base.job.BaseModule. In the latter case, the module is run and its output is saved.
config (base.config.Config) – The global configuration object, or None to use default configuration.
kwargs (dict) – The extra configuration for the base.output.JSONSink module. You’d want to set, at least, outfile.