Extractors

The Extractor stage processes the files generated by experiment jobs and turns them into a pandas DataFrame. Each result file must be assigned to exactly one extractor; the assignment is controlled by the file_regex field. The built-in extractors ship with reasonable defaults that can be adjusted for specific use cases.
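The "exactly one extractor per file" rule can be sketched as follows. This is a minimal illustration, not the actual doespy implementation; the extractor table below is an assumption built from the default file_regex values documented in this section:

```python
import re

# Hypothetical table: extractor name -> file_regex patterns
# (taken from the defaults documented below)
extractors = {
    "YamlExtractor": [r".*\.yaml$", r".*\.yml$"],
    "CsvExtractor": [r".*\.csv$"],
    "IgnoreExtractor": [r"^stdout.log$"],
}


def assign_extractor(filename):
    """Return the single extractor whose file_regex matches `filename`.

    Raises ValueError if zero or more than one extractor matches,
    mirroring the rule that each result file must be assigned to
    exactly one extractor.
    """
    matches = [
        name
        for name, patterns in extractors.items()
        if any(re.search(p, filename) for p in patterns)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"{filename}: matched {matches!r}, expected exactly one extractor"
        )
    return matches[0]
```

A file that matches no pattern (or several) is treated as an error, which is why unexpected files must be covered explicitly, e.g., by the IgnoreExtractor described below.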

YAML Files

pydantic model doespy.etl.steps.extractors.YamlExtractor[source]

The YamlExtractor reads result files as YAML.

The YAML file can contain either a single object (result) or a list of objects (results).

Example ETL Pipeline Design
$ETL$:
    extractors:
        YamlExtractor: {}         # with default file_regex
        YamlExtractor:            # with custom file_regex
            file_regex: [out.yml]
field file_regex: Union[str, List[str]] = ['.*\\.yaml$', '.*\\.yml$']

The regex list to match result files.
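Because a file may hold either one result or a list of results, the extracted content is normalized to a list of records before it enters the DataFrame. A sketch of that normalization (assumed behavior; the actual parsing would use a YAML loader such as PyYAML's yaml.safe_load, omitted here to keep the example self-contained):

```python
def normalize_results(parsed):
    """Wrap a single parsed result object into a one-element list.

    A YAML result file may contain one mapping (a single result) or a
    sequence of mappings (multiple results); downstream code always
    receives a list of dicts.
    """
    if isinstance(parsed, list):
        return parsed
    return [parsed]
```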

JSON Files

pydantic model doespy.etl.steps.extractors.JsonExtractor[source]

The JsonExtractor reads result files as JSON. The JSON file can contain either a single object (result) or a list of objects (results).

Example ETL Pipeline Design
$ETL$:
    extractors:
        JsonExtractor: {}         # with default file_regex
        JsonExtractor:            # with custom file_regex
            file_regex: [out.json]
field file_regex: Union[str, List[str]] = ['.*\\.json$']

The regex list to match result files.
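The same single-object-or-list convention applies to JSON. A sketch using only the standard library (an illustration of the documented behavior, not the doespy source):

```python
import json


def extract_json(text):
    """Parse the contents of a JSON result file into a list of results.

    The file may contain a single object or a list of objects; either
    way a list of dicts is returned.
    """
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```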

CSV Files

pydantic model doespy.etl.steps.extractors.CsvExtractor[source]

The CsvExtractor reads result files as CSV. The CSV file contains one result per line and by default starts with a header row; see has_header and fieldnames for CSV files without a header.

Example ETL Pipeline Design
$ETL$:
    extractors:
        CsvExtractor: {}         # with default params
        CsvExtractor:            # with custom params
            file_regex: [out.csv]
            delimiter: ;
            has_header: False
            fieldnames: [col1, col2, col3]
field file_regex: Union[str, List[str]] = ['.*\\.csv$']

The regex list to match result files.

field delimiter: str = ','

The separator between columns.

field has_header: bool = True

Indicates whether the first CSV row is a header.

field fieldnames: List[str] = None

The names of the CSV columns when has_header is set to False.
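How the four fields above interact can be sketched with Python's standard csv module (an assumed equivalent of the documented behavior, not the doespy source):

```python
import csv
import io


def extract_csv(text, delimiter=",", has_header=True, fieldnames=None):
    """Read CSV text into a list of row dicts.

    With has_header=True the first row supplies the column names;
    otherwise `fieldnames` must be provided, mirroring the fields
    documented above.
    """
    f = io.StringIO(text)
    if has_header:
        reader = csv.DictReader(f, delimiter=delimiter)
    else:
        if fieldnames is None:
            raise ValueError("fieldnames is required when has_header=False")
        reader = csv.DictReader(f, fieldnames=fieldnames, delimiter=delimiter)
    return [dict(row) for row in reader]
```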

Raising Attention to Errors

pydantic model doespy.etl.steps.extractors.ErrorExtractor[source]

The ErrorExtractor provides a mechanism to detect potential errors in an experiment job. For experiments with a large number of jobs, it is easy to overlook an error because there are many output folders and files, e.g., the stderr.log of each job.

The ErrorExtractor raises a warning if matching files are not empty.

Example ETL Pipeline Design
$ETL$:
    extractors:
        ErrorExtractor: {}         # checking stderr.log
        ErrorExtractor:            # checking custom files
            file_regex: [stderr.log, error.log]
field file_regex: Union[str, List[str]] = ['^stderr.log$']

The regex list to match result files.
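The check itself is simple: any matched file with content triggers a warning. A sketch of this behavior (an illustration under the stated "warn if non-empty" rule, not the doespy source):

```python
import os
import warnings


def check_error_file(path):
    """Emit a warning if the matched file is not empty.

    A non-empty stderr.log usually signals that a job hit an error
    worth inspecting; surfacing it as a warning makes the error hard
    to overlook across many job output folders.
    """
    if os.path.getsize(path) > 0:
        with open(path) as f:
            content = f.read().strip()
        warnings.warn(f"{path} is not empty: {content[:200]}")
```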

Ignoring Result Files

pydantic model doespy.etl.steps.extractors.IgnoreExtractor[source]

The IgnoreExtractor provides a mechanism to detect potential errors in an experiment job. For experiments with a large number of jobs, it is easy to overlook an error indicated by the presence of an unexpected file.

As a result, the ETL pipeline requires that every file in a job's results folder be matched by exactly one extractor.

The IgnoreExtractor can be used to ignore certain files on purpose, e.g., stdout.log.

Example ETL Pipeline Design
$ETL$:
    extractors:
        IgnoreExtractor: {}        # ignore stdout.log
        IgnoreExtractor:           # custom ignore list
            file_regex: [stdout.log, other.txt]
field file_regex: Union[str, List[str]] = ['^stdout.log$']

The regex list to match result files.
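In effect, ignored files satisfy the "every file is matched by exactly one extractor" rule while contributing no rows to the DataFrame. A sketch of that filtering (an illustration; the patterns below combine the default with a hypothetical custom entry):

```python
import re

# Hypothetical ignore list: the default pattern plus a custom entry
ignore_patterns = [r"^stdout.log$", r"other.txt"]


def keep_for_extraction(filenames):
    """Drop files matched by the IgnoreExtractor patterns.

    The remaining files must still each be matched by exactly one of
    the other extractors.
    """
    return [
        f for f in filenames
        if not any(re.search(p, f) for p in ignore_patterns)
    ]
```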