ETL Processing¶
Warning
This is a work in progress. The ETL pipeline is currently under development and may change in the future. The documentation is not up to date.
Instructions for defining ETL pipelines and descriptions of the base components.
Dev Transformers¶
Factor-dependent transformers¶
We are often interested in output differences split out by the different factor values for each run. Instead of manually specifying what columns we are interested in, some transformers support the use of the $FACTORS$ tag in the ETL design files, which will automatically be expanded to the factor columns of the experiment.
The ETL pipeline provides the per-experiment factor information to the transformers in the dataframe as additional information. On the transformer side, the Transformer base class provides the _expand_factors helper to automatically expand $FACTORS$ into the right values. An example is provided below.
transformers:
  - name: GroupByAggTransformer
    data_columns:
      - accuracy
      - total_time
    groupby_columns:
      - exp_name
      - audit
      - $FACTORS$  # will be expanded to factor_columns of the experiment
The $FACTORS$ tag must be explicitly provided as a column value to be expanded to the experiment factors. For an example, see GroupByAggTransformer.
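To illustrate what this expansion amounts to, below is a minimal, self-contained sketch. It is not the actual doespy implementation; the real _expand_factors helper of the Transformer base class may have a different signature. The sketch replaces the $FACTORS$ placeholder in a list of column names with the factor columns of the experiment; the factor names batch_size and n_clients are purely illustrative.
from typing import List

def expand_factors(columns: List[str], factor_columns: List[str]) -> List[str]:
    """Replace the $FACTORS$ placeholder with the experiment's factor columns (illustrative)."""
    expanded: List[str] = []
    for col in columns:
        if col == "$FACTORS$":
            expanded.extend(factor_columns)  # splice in all factor columns
        else:
            expanded.append(col)
    return expanded

# the groupby_columns from the config above, expanded for an experiment
# with the (illustrative) factors batch_size and n_clients
groupby_columns = ["exp_name", "audit", "$FACTORS$"]
print(expand_factors(groupby_columns, factor_columns=["batch_size", "n_clients"]))
# -> ['exp_name', 'audit', 'batch_size', 'n_clients']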
Dev Super ETL¶
The DoE-Suite supports the definition of suite-transcending (super-suite) ETL pipelines that combine experiments from multiple suites, which we refer to as super ETL.
Pipeline configs are defined in doe-suite-config/super_etl and can be run similarly to regular ETL pipelines (TODO: this should eventually be invoked via the Makefile):
poetry run python src/super_etl.py --config pipeline
Custom output location¶
The default option is to place results in doe-suite-results/super_etl. This can be overridden using the output_path option, which specifies a base directory for outputs. In the following example, a pipeline named pipeline, defined in config.yml, outputs a file plot.pdf.
poetry run python src/super_etl.py --config config --output_path {paper_dir}
# (Over)writes: paper_dir/config/plot.pdf
Within the base directory, the creation of per-config and per-pipeline subdirectories is controlled with --output_dir_config_name_disabled and --output_dir_pipeline.
poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled
# (Over)writes: paper_dir/plot.pdf
poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_pipeline
# (Over)writes: paper_dir/config/pipeline/plot.pdf
poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled --output_dir_pipeline
# (Over)writes: paper_dir/pipeline/plot.pdf
The default is to create a directory for each config file, but not for each pipeline, since the output files generally carry the pipeline name. This corresponds to output_dir_config_name_disabled=False and output_dir_pipeline=False.
Config changes¶
There are two changes compared to a regular ETL pipeline. The first is the experiments key: experiments now contains a dict of suites, each with a list of the experiments of that suite to include. Note that the other keys of the pipeline (transformers, loaders, and extractors) stay the same and can therefore often be copy-pasted directly.
$ETL$:
  pipeline_name:
    experiments:
      suite_1: [ exp_1 ]
      suite_2: [ exp_2, exp_3 ]
    # ... transformers, loaders, extractors etc.
The second change is that the runs to load data from must be specified in the form of the suite_id. The suite_id can be specified per-suite or per-experiment. Specifying suite ids per-suite:
$SUITE_ID$:
  suite_1: 1648453067
  suite_2: 1651052734
Experiment-specific suite ids:
$SUITE_ID$:
  suite_1: 1648453067
  suite_2:
    exp_1: 1651052734
    exp_2: 1651052743
Use the $DEFAULT$ key to specify a default:
$SUITE_ID$:
  suite_1: 1648453067
  suite_2:
    $DEFAULT$: 1651052734
    exp_2: 1651052743
Full example:
$SUITE_ID$:
  suite_1: 1648453067
  suite_2: 1651052734

$ETL$:
  pipeline_name:
    experiments:
      suite_1: [ exp_1 ]
      suite_2: [ exp_2, exp_3 ]
    extractors:
      JsonExtractor: {}   # with default file_regex
      ErrorExtractor: {}  # if a non-empty file matching the default regex exists -> throw an error using the ErrorExtractor
      IgnoreExtractor: {} # since every file must be processed by an extractor, the IgnoreExtractor can be used to ignore certain files (e.g., stdout)
    transformers:
      - name: RepAggTransformer  # aggregate over all repetitions of a run and calc `mean`, `std`, etc.
        data_columns: [latency]  # the names of the dataframe columns that contain the measurements
    loaders:
      CsvSummaryLoader:          # write the transformed dataframe across the whole experiment as a csv file
        output_dir: "pipeline1"  # write results into an output dir
      DemoLatencyPlotLoader:     # create a plot based on a project-specific plot loader
        output_dir: "pipeline1"  # write results into an output dir
Jupyter Notebook support¶
The code snippet below can be used to run an ETL pipeline from a notebook for quick debugging and analysis. The ETL's loaders are skipped and instead the DataFrame that would have been processed by the loaders is returned. Simply call super_etl.run_multi_suite("pipeline.yml", return_df=True) to obtain the dataframe.
Full example:
%env DOES_PROJECT_DIR= # place correct dir here

import sys
import os

# make the doespy package importable from the notebook
sys.path.insert(0, os.path.abspath('doe-suite/doespy'))
display(sys.path)

import doespy.etl.super_etl as super_etl

# loaders are skipped; the dataframe that they would have processed is returned
df = super_etl.run_multi_suite("pipeline.yml", "etl_output", return_df=True)
# inspect df
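From there, the returned DataFrame can be explored with the usual pandas tooling. The calls below are generic and do not assume any particular columns beyond what the pipeline produced:
# inspect the returned dataframe (columns depend on the extractors/transformers of the pipeline)
print(df.shape)
print(df.columns.tolist())
display(df.head())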