Extractors¶
The Extractor stage processes files generated by experiment jobs and creates a Pandas data frame. Each result file must be assigned to exactly one Extractor via its file_regex field. The provided extractors come with reasonable defaults that can be adjusted for specific use cases.
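The assignment rule above can be sketched as follows. This is an illustrative sketch, not the actual doespy implementation: the function name and the dict-of-pattern-lists representation are assumptions; only the default file_regex values are taken from the sections below.

```python
import re

def assign_extractor(filename, extractors):
    """Return the name of the single extractor whose file_regex list
    matches `filename`; raise if zero or multiple extractors match.
    (Illustrative sketch, not the actual doespy implementation.)"""
    matches = [
        name
        for name, patterns in extractors.items()
        if any(re.search(p, filename) for p in patterns)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"{filename} matched {len(matches)} extractors, expected exactly 1"
        )
    return matches[0]

# default file_regex values of three of the provided extractors
extractors = {
    "YamlExtractor": [r".*\.yaml$", r".*\.yml$"],
    "CsvExtractor": [r".*\.csv$"],
    "IgnoreExtractor": [r"^stdout.log$"],
}
```

A file such as `result.yml` is routed to the YamlExtractor, while a file matched by no extractor (or by several) is reported as an error.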
Yaml Files¶
- pydantic model doespy.etl.steps.extractors.YamlExtractor[source]¶
The YamlExtractor reads result files as YAML.
The YAML file can contain either a single object (result) or a list of objects (results).
$ETL$:
  extractors:
    YamlExtractor: {}  # with default file_regex
    YamlExtractor:     # with custom file_regex
      file_regex: [out.yml]
- field file_regex: Union[str, List[str]] = ['.*\\.yaml$', '.*\\.yml$']¶
Json Files¶
- pydantic model doespy.etl.steps.extractors.JsonExtractor[source]¶
The JsonExtractor reads result files as JSON. The JSON file can contain either a single object (result) or a list of objects (results).
$ETL$:
  extractors:
    JsonExtractor: {}  # with default file_regex
    JsonExtractor:     # with custom file_regex
      file_regex: [out.json]
- field file_regex: Union[str, List[str]] = ['.*\\.json$']¶
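The single-object vs. list-of-objects behavior described above can be sketched with the standard library. The function name is illustrative; it is not the library's actual API, only a sketch of the documented normalization.

```python
import json

def extract_json(path):
    """Read a JSON result file and return a list of row dicts:
    a single object yields one row, a list of objects yields one
    row per element. (Sketch of the behavior described above;
    the rows are later merged into a Pandas data frame.)"""
    with open(path) as f:
        data = json.load(f)
    # normalize: wrap a single result in a list of results
    return data if isinstance(data, list) else [data]
```

The same normalization applies to the YamlExtractor, with the YAML document parsed instead of JSON.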
Csv Files¶
- pydantic model doespy.etl.steps.extractors.CsvExtractor[source]¶
The CsvExtractor reads result files as CSV. The CSV file contains a result per line and by default starts with a header row; see has_header and fieldnames for CSV files without a header.

$ETL$:
  extractors:
    CsvExtractor: {}  # with default params
    CsvExtractor:     # with custom params
      file_regex: [out.csv]
      delimiter: ;
      has_header: False
      fieldnames: [col1, col2, col3]
- field file_regex: Union[str, List[str]] = ['.*\\.csv$']¶
The regex list to match result files.
- field delimiter: str = ','¶
The separator between columns.
- field has_header: bool = True¶
Indicates whether the first CSV row is a header or not.
- field fieldnames: List[str] = None¶
The names of the CSV columns if has_header is set to False.
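How the three fields interact can be sketched with the standard library's csv module. This is a sketch only, with an assumed function name; the real extractor additionally feeds the resulting rows into a Pandas data frame.

```python
import csv

def extract_csv(path, delimiter=",", has_header=True, fieldnames=None):
    """Read a CSV result file into a list of row dicts, mirroring
    the delimiter / has_header / fieldnames fields documented above.
    (Illustrative sketch, not the library's actual implementation.)"""
    with open(path, newline="") as f:
        reader = csv.DictReader(
            f,
            delimiter=delimiter,
            # DictReader takes the first row as the header when
            # fieldnames is None; otherwise the given names are used
            fieldnames=None if has_header else fieldnames,
        )
        return [dict(row) for row in reader]
```

With `has_header=False`, every line of the file becomes a result row keyed by the supplied fieldnames.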
Raising Attention to Errors¶
- pydantic model doespy.etl.steps.extractors.ErrorExtractor[source]¶
The ErrorExtractor provides a mechanism to detect potential errors in an experiment job. For experiments with a large number of jobs, it is easy to overlook an error because there are many output folders and files, e.g., the stderr.log of each job.
The ErrorExtractor raises a warning if matching files are not empty.
$ETL$:
  extractors:
    ErrorExtractor: {}  # checking stderr.log
    ErrorExtractor:     # checking custom files
      file_regex: [stderr.log, error.log]
- field file_regex: Union[str, List[str]] = ['^stderr.log$']¶
The regex list to match result files.
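The warn-if-non-empty behavior can be sketched as follows. The function name is hypothetical and the sketch omits the surrounding ETL machinery; it only illustrates the check described above.

```python
import os
import warnings

def check_error_file(path):
    """Raise a warning if a matched error file (e.g., stderr.log)
    is not empty, as the ErrorExtractor does. Error files contribute
    no rows to the data frame. (Illustrative sketch.)"""
    if os.path.getsize(path) > 0:
        with open(path) as f:
            content = f.read().strip()
        warnings.warn(f"found error in {path}: {content}")
    return []  # no result rows are produced
```

An empty stderr.log passes silently, so a warning in the ETL output is a reliable signal that some job wrote to its error log.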
Ignoring Result Files¶
- pydantic model doespy.etl.steps.extractors.IgnoreExtractor[source]¶
The IgnoreExtractor provides a mechanism to detect potential errors in an experiment job. For experiments with a large number of jobs, it is easy to overlook an error indicated by the presence of an unexpected file.
As a result, the ETL requires that every file in the results folder of the job must be matched by exactly one Extractor.
The IgnoreExtractor can be used to ignore certain files on purpose, e.g., stdout.log.
$ETL$:
  extractors:
    IgnoreExtractor: {}  # ignore stdout.log
    IgnoreExtractor:     # custom ignore list
      file_regex: [stdout.log, other.txt]
- field file_regex: Union[str, List[str]] = ['^stdout.log$']¶
The regex list to match result files.