ETL Processing

Warning

This is a work in progress. The ETL pipeline is currently under development and may change in the future. The documentation is not up to date.

Instructions for defining ETL pipelines, and descriptions of base components

Dev Transformers

Factor-dependent transformers

We are often interested in output differences split out by the different factor values for each run. Instead of manually specifying what columns we are interested in, some transformers support the use of the $FACTORS$ tag in the ETL design files, which will automatically be expanded to the factor columns of the experiment.

The ETL pipeline provides the per-experiment factor information to the transformers as additional information in the dataframe. On the transformer side, the Transformer base class provides the _expand_factors helper, which automatically expands $FACTORS$ into the right values. An example is provided below.

transformers:
- name: GroupByAggTransformer
  data_columns:
  - accuracy
  - total_time
  groupby_columns:
  - exp_name
  - audit
  - $FACTORS$ # will be expanded to factor_columns of the experiment

The $FACTORS$ tag must be explicitly provided as a column value to be expanded to the experiment factors. For an example, see GroupByAggTransformer.
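To make the expansion concrete, the snippet below sketches what $FACTORS$ expansion conceptually does. It is an illustration only, not the actual _expand_factors implementation in doespy, and the factor column names used in the usage example are hypothetical.

# Conceptual sketch of $FACTORS$ expansion (not the actual doespy helper):
# the tag in a configured column list is replaced by the experiment's factor columns.
def expand_factors(columns, factor_columns):
    expanded = []
    for col in columns:
        if col == "$FACTORS$":
            expanded.extend(factor_columns)  # splice in the experiment's factor columns
        else:
            expanded.append(col)
    return expanded

# e.g., for an experiment whose factor columns are ["n_clients", "payload_size"] (hypothetical names):
expand_factors(["exp_name", "audit", "$FACTORS$"], ["n_clients", "payload_size"])
# -> ["exp_name", "audit", "n_clients", "payload_size"]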

Dev Super ETL

Doe suite supports the definition of suite-transcending (super-suite) ETL pipelines that combine experiments from multiple suites, which we refer to as super ETL.

Pipeline configs are defined in doe-suite-config/super_etl and can be run similarly to a regular ETL pipeline:

poetry run python src/super_etl.py --config pipeline

Custom output location

The default is to place results in doe-suite-results/super_etl. This can be overridden using the output_path option to specify a base directory for outputs. In the following example, a pipeline named pipeline, defined in config.yml, outputs a file plot.pdf.

poetry run python src/super_etl.py --config config --output_path {paper_dir}
# (Over)writes: paper_dir/config/plot.pdf

Within the base directory, per-config-file and per-pipeline subdirectories can be controlled using the --output_dir_config_name_disabled and --output_dir_pipeline flags.

poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled
# (Over)writes: paper_dir/plot.pdf

poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_pipeline
# (Over)writes: paper_dir/config/pipeline/plot.pdf

poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled --output_dir_pipeline
# (Over)writes: paper_dir/pipeline/plot.pdf

The default is to create a directory for each config file, but not for each pipeline, since the output files generally carry the pipeline name. This corresponds to output_dir_config_name_disabled=False and output_dir_pipeline=False.

Config changes

There are two changes compared to a regular ETL pipeline. The first is the experiments key: experiments now contains a dict of suites, each with a list of the experiments of that suite to include. Note that the other keys of the pipeline (transformers, loaders, and extractors) stay the same and can therefore often simply be copy-pasted.

$ETL$:
  pipeline_name:
    experiments:
      suite_1: [ exp_1 ]
      suite_2: [ exp_2, exp_3 ]
    ...transformers, loaders, extractors etc.
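Conceptually, this mapping only selects which suite/experiment results are fed into the pipeline, roughly as in the illustrative check below (not the actual doespy code).

# Illustrative selection corresponding to the experiments mapping above.
experiments = {"suite_1": ["exp_1"], "suite_2": ["exp_2", "exp_3"]}

def is_selected(suite: str, exp_name: str) -> bool:
    return exp_name in experiments.get(suite, [])

is_selected("suite_2", "exp_3")  # -> True
is_selected("suite_1", "exp_2")  # -> False (exp_2 of suite_1 is not included)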

The second change is that the runs to load data from must be specified in the form of a suite_id. The suite_id can be specified per suite or per experiment. Specifying suite ids per suite:

$SUITE_ID$:
  suite_1: 1648453067
  suite_2: 1651052734

Specifying experiment-specific suite ids:

$SUITE_ID$:
  suite_1: 1648453067
  suite_2:
    exp_1: 1651052734
    exp_2: 1651052743

Use the $DEFAULT$ key to specify a default:

$SUITE_ID$:
  suite_1: 1648453067
  suite_2:
    $DEFAULT$: 1651052734
    exp_2: 1651052743
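Conceptually, the suite id for a given suite/experiment pair is resolved as sketched below (an illustration, not the actual doespy code): an experiment-specific entry takes precedence over $DEFAULT$, and a plain scalar value applies to all experiments of the suite.

# Illustrative resolution of $SUITE_ID$ entries (not the actual doespy implementation).
def resolve_suite_id(suite_ids, suite, experiment):
    entry = suite_ids[suite]
    if isinstance(entry, dict):
        # an experiment-specific id takes precedence over the $DEFAULT$ entry
        return entry.get(experiment, entry["$DEFAULT$"])
    return entry  # a scalar id applies to all experiments of the suite

suite_ids = {
    "suite_1": 1648453067,
    "suite_2": {"$DEFAULT$": 1651052734, "exp_2": 1651052743},
}
resolve_suite_id(suite_ids, "suite_2", "exp_1")  # -> 1651052734 (falls back to $DEFAULT$)
resolve_suite_id(suite_ids, "suite_2", "exp_2")  # -> 1651052743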

Full example:

$SUITE_ID$:
  suite_1: 1648453067
  suite_2: 1651052734

$ETL$:
  pipeline_name:
    experiments:
      suite_1: [ exp_1 ]
      suite_2: [ exp_2, exp_3 ]
    extractors:
      JsonExtractor: {} # with default file_regex
      ErrorExtractor: {} # if a non-empty file matching the default regex exists, the ErrorExtractor throws an error
      IgnoreExtractor: {} # every result file must be handled by some extractor, so the IgnoreExtractor can be used to ignore certain files (e.g., stdout)
    transformers:
      - name: RepAggTransformer # aggregate over all repetitions of a run and calc `mean`, `std`, etc.
        data_columns: [latency] # the names of the columns in the dataframe that contain the measurements
    loaders:
      CsvSummaryLoader: # write the transformed dataframe across the whole experiment as a csv file
        output_dir: "pipeline1" # write results into an output dir
      DemoLatencyPlotLoader: # create a plot with a project-specific plot loader
        output_dir: "pipeline1" # write results into an output dir
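For orientation, the sketch below shows what a project-specific plot loader along the lines of DemoLatencyPlotLoader might look like. It is a hypothetical sketch: the load() signature, the etl_info["suite_dir"] lookup, and the latency_mean column name are assumptions and may differ from the actual doespy loader interface.

# Hypothetical sketch of a project-specific plot loader; the signature and
# column names are assumptions, not the actual doespy API.
import os
import pandas as pd
import matplotlib.pyplot as plt

class DemoLatencyPlotLoader:  # in a real project this would subclass the doespy loader base class
    def load(self, df: pd.DataFrame, options: dict, etl_info: dict) -> None:
        # resolve the output directory from the loader options (e.g., output_dir: "pipeline1")
        output_dir = os.path.join(etl_info["suite_dir"], options.get("output_dir", "."))
        os.makedirs(output_dir, exist_ok=True)

        # plot the aggregated latency per experiment ("latency_mean" is an assumed column name)
        fig, ax = plt.subplots()
        df.plot.bar(x="exp_name", y="latency_mean", ax=ax)
        ax.set_ylabel("latency (mean over repetitions)")
        fig.savefig(os.path.join(output_dir, "latency.pdf"), bbox_inches="tight")
        plt.close(fig)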

Jupyter Notebook support

The code snippet below shows how to use an ETL pipeline in a notebook for quick debugging and analysis. The ETL's loaders are skipped and, instead, the DataFrame that would have been passed to the loaders is returned. Simply call super_etl.run_multi_suite("pipeline.yml", return_df=True) to obtain the dataframe.

Full example

%env DOES_PROJECT_DIR= # place correct dir here
import sys
import os
sys.path.insert(0, os.path.abspath('doe-suite/doespy'))
display(sys.path)
import doespy.etl.super_etl as super_etl  # adjust the import path to your project layout

df = super_etl.run_multi_suite("pipeline.yml", "etl_output", return_df=True)

# inspect df
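From there, standard pandas calls can be used to inspect the returned dataframe, for example:

df.head()       # first few rows of the combined dataframe
df.columns      # available columns (e.g., factor columns and measurement columns)
df.describe()   # summary statistics of the numeric columns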