ETL Processing
==============

.. warning::

   This is a work in progress. The ETL pipeline is currently under development and may change in the future. The documentation is not up to date.

Instructions for defining ETL pipelines, and descriptions of the base components.

Dev Transformers
----------------

Factor-dependent transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We are often interested in output differences split out by the different factor values of each run. Instead of manually specifying the columns of interest, some transformers support the ``$FACTORS$`` tag in the ETL design files, which is automatically expanded to the factor columns of the experiment.

The ETL pipeline provides the per-experiment factor information to the transformers as additional information alongside the dataframe. On the transformer side, the ``Transformer`` base class provides the ``_expand_factors`` helper to automatically expand ``$FACTORS$`` into the right column names. An example is provided below.

.. code:: yaml

   transformers:
     - name: GroupByAggTransformer
       data_columns:
         - accuracy
         - total_time
       groupby_columns:
         - exp_name
         - audit
         - $FACTORS$  # will be expanded to the factor_columns of the experiment

The ``$FACTORS$`` tag must be explicitly provided as a column value to be expanded to the experiment factors. For an example, see ``GroupByAggTransformer``.
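As a sketch of the transformer side, a custom transformer could call ``_expand_factors`` on its column list before grouping. This is only an illustrative example: the import path of ``Transformer``, the ``transform`` signature, and the exact arguments of ``_expand_factors`` are assumptions and may differ from the actual ``doespy`` interface; see ``GroupByAggTransformer`` in the doe-suite sources for the authoritative pattern.

.. code:: python

   # Minimal sketch (illustrative): the import path, the transform() signature,
   # and the _expand_factors arguments are assumptions; consult
   # GroupByAggTransformer in doespy for the real interface.
   import pandas as pd

   from doespy.etl.etl_base import Transformer  # assumed location of the base class


   class FactorMeanTransformer(Transformer):
       """Averages the data columns, grouped by the experiment's factor columns."""

       data_columns = ["accuracy", "total_time"]
       groupby_columns = ["exp_name", "$FACTORS$"]

       def transform(self, df: pd.DataFrame, options: dict) -> pd.DataFrame:
           # _expand_factors replaces the $FACTORS$ placeholder with the factor
           # columns of the experiment (provided alongside the dataframe)
           groupby_cols = self._expand_factors(df, self.groupby_columns)
           return df.groupby(groupby_cols)[self.data_columns].mean().reset_index()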
Dev Super ETL
=============

The DoE-Suite supports the definition of suite-transcending (super-suite) ETL pipelines that combine experiments from multiple suites, which we refer to as super ETL. Pipeline configs are defined in ``doe-suite-config/super_etl`` and can be run similarly to regular ETL, using:

# TODO [nku] needs to use makefile in commands

.. code:: bash

   poetry run python src/super_etl.py --config pipeline

Custom output location
----------------------

By default, results are placed in ``doe-suite-results/super_etl``. This may be overridden with the ``output_path`` option, which specifies a base directory for outputs. In the following example, a pipeline named ``pipeline``, defined in ``config.yml``, outputs a file ``plot.pdf``.

.. code:: bash

   poetry run python src/super_etl.py --config config --output_path {paper_dir}
   # (Over)writes: paper_dir/config/plot.pdf

Within the base directory, the per-config and per-pipeline subdirectories can be controlled with ``--output_dir_config_name_disabled`` and ``--output_dir_pipeline``.

.. code:: bash

   poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled
   # (Over)writes: paper_dir/plot.pdf

   poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_pipeline
   # (Over)writes: paper_dir/config/pipeline/plot.pdf

   poetry run python src/super_etl.py --config config --output_path {paper_dir} --output_dir_config_name_disabled --output_dir_pipeline
   # (Over)writes: paper_dir/pipeline/plot.pdf

The default is to create a directory per config file, but not per pipeline, since the output files generally carry the pipeline name. This corresponds to ``output_dir_config_name_disabled=False`` and ``output_dir_pipeline=False``.

Config changes
--------------

There are two changes compared to a regular ETL pipeline. The first is the ``experiments`` key: ``experiments`` now contains a dict of suites, each with a list of the experiments of that suite to include. Note that the other keys of the pipeline (``transformers``, ``loaders``, and ``extractors``) stay the same and can therefore often be copy-pasted directly.

.. code:: yaml

   $ETL$:
     pipeline_name:
       experiments:
         suite_1: [ exp_1 ]
         suite_2: [ exp_2, exp_3 ]
       # ... transformers, loaders, extractors etc.

The second change is that the runs to load data from must be specified in the form of the ``suite_id``. The ``suite_id`` can be specified per-suite and per-experiment.

Specifying suite ids per-suite:

.. code:: yaml

   $SUITE_ID$:
     suite_1: 1648453067
     suite_2: 1651052734

Experiment-specific suite ids:

.. code:: yaml

   $SUITE_ID$:
     suite_1: 1648453067
     suite_2:
       exp_1: 1651052734
       exp_2: 1651052743

Use the ``$DEFAULT$`` key to specify a default:

.. code:: yaml

   $SUITE_ID$:
     suite_1: 1648453067
     suite_2:
       $DEFAULT$: 1651052734
       exp_2: 1651052743

Full example:

.. code:: yaml

   $SUITE_ID$:
     suite_1: 1648453067
     suite_2: 1651052734

   $ETL$:
     pipeline_name:
       experiments:
         suite_1: [ exp_1 ]
         suite_2: [ exp_2, exp_3 ]
       extractors:
         JsonExtractor: {}   # with default file_regex
         ErrorExtractor: {}  # if a non-empty file matches the default regex -> throw an error using the ErrorExtractor
         IgnoreExtractor: {} # every file must be processed by an extractor, so the IgnoreExtractor can be used to ignore certain files (e.g., stdout)
       transformers:
         - name: RepAggTransformer   # aggregate over all repetitions of a run and calc `mean`, `std`, etc.
           data_columns: [latency]   # the names of the columns in the dataframe that contain the measurements
       loaders:
         CsvSummaryLoader:           # write the transformed dataframe across the whole experiment as a csv file
           output_dir: "pipeline1"   # write results into an output dir
         DemoLatencyPlotLoader:      # create a plot based on a project-specific plot loader
           output_dir: "pipeline1"   # write results into an output dir

Jupyter Notebook support
~~~~~~~~~~~~~~~~~~~~~~~~

The code snippet below can be used to run an ETL pipeline in a notebook for quick debugging and analysis. The ETL's loaders are skipped, and instead the DataFrame that would have been processed by the loaders is returned. Simply call ``super_etl.run_multi_suite("pipeline.yml", return_df=True)`` to obtain the dataframe.

Full example:

.. code:: python

   # place the correct project dir here
   %env DOES_PROJECT_DIR=

   import sys
   import os

   sys.path.insert(0, os.path.abspath('doe-suite/doespy'))
   display(sys.path)

   import doespy.etl.super_etl as super_etl  # module providing run_multi_suite

   df = super_etl.run_multi_suite("pipeline.yml", "etl_output", return_df=True)

   # inspect df
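The returned object is a regular pandas DataFrame, so it can be inspected directly in the notebook. The column names used below (``exp_name``, ``latency``) are only illustrative; the actual columns depend on the extractors and transformers of the pipeline.

.. code:: python

   # Illustrative only: the available columns depend on the configured pipeline.
   display(df.head())     # first few rows
   display(df.columns)    # available columns
   display(df.groupby("exp_name")["latency"].describe())  # per-experiment summary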