Transformers

The Transformer stage manipulates the raw Pandas results data frame created by the Extractor stage. Two syntaxes are available:

  • The stage can directly invoke functions defined on the data frame, see Pandas.DataFrameFunction.

  • The stage can invoke custom Transformer Classes, e.g., doespy.transformers.ConditionalTransformer.

Pandas DF Transformers

class Pandas.DataFrameFunction

Directly calls any function defined on a pandas data frame (see https://pandas.pydata.org/docs/reference/frame.html). The syntax differs from that of regular transformers: use df.* and replace * with the function name. The dictionary under df.* passes named arguments to the selected function.

Parameters:

**args – Named arguments passed to the function selected with df.*

Example ETL Pipeline Design
 $ETL$:
     transformers:
         # remove all cols except
         - df.filter: {items: ["exp_name", "x", "y"]}
         # add column to df
         - df.eval: {expr: "color = 'black'"}
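
The df.* syntax can be read as a by-name method call on the data frame. The following is a minimal sketch of that idea in plain pandas, not doespy's actual implementation: the helper apply_df_step is hypothetical, and the second step uses a numeric expression chosen for illustration.

```python
import pandas as pd


def apply_df_step(df, step):
    """Hypothetical helper: apply one {"df.<function>": {kwargs}} entry."""
    # Each step is a single-entry dict, e.g. {"df.filter": {"items": [...]}}.
    ((key, kwargs),) = step.items()
    prefix, _, func_name = key.partition(".")
    if prefix != "df" or not func_name:
        raise ValueError(f"not a df.* step: {key!r}")
    # Look up the DataFrame method by name and pass the dict as named arguments.
    return getattr(df, func_name)(**kwargs)


df = pd.DataFrame(
    {"exp_name": ["a", "b"], "x": [1, 2], "y": [3, 4], "unused": [0, 0]}
)

steps = [
    # remove all cols except
    {"df.filter": {"items": ["exp_name", "x", "y"]}},
    # add column to df (illustrative numeric expression)
    {"df.eval": {"expr": "total = x + y"}},
]

for step in steps:
    df = apply_df_step(df, step)
```

After both steps, df holds the columns exp_name, x, y, and total; the unused column has been dropped by df.filter.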

Conditional Replacement

pydantic model doespy.etl.steps.transformers.ConditionalTransformer

The ConditionalTransformer replaces the value in the dest column with a value from the value dict, if the value in the col column is equal to the key.

Example ETL Pipeline Design
 $ETL$:
     transformers:
       - name: ConditionalTransformer
         col: Country
         dest: Code
         value:
             Switzerland: CH
             Germany: DE

Example

Country      Code
-----------  ----
Germany
Switzerland
France

➡️

Country      Code
-----------  ----
Germany      DE
Switzerland  CH
France

field col: str [Required]

Name of condition column in data frame.

field dest: str [Required]

Name of destination column in data frame.

field value: Dict[Any, Any] [Required]

Dictionary of replacement rules: The dict key is the entry in the condition col and the value is the replacement used in the dest column.
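
The replacement rule described above can be approximated in plain pandas. This is a minimal sketch under the stated semantics, not the doespy source: the function name conditional_replace is hypothetical, and the empty Code cells of the example are modelled as empty strings.

```python
import pandas as pd


def conditional_replace(df, col, dest, value):
    """Hypothetical sketch: set `dest` wherever `col` equals a key of `value`."""
    out = df.copy()
    for key, replacement in value.items():
        # Rows whose condition column matches the key get the replacement.
        out.loc[out[col] == key, dest] = replacement
    return out


df = pd.DataFrame(
    {"Country": ["Germany", "Switzerland", "France"], "Code": ["", "", ""]}
)

result = conditional_replace(
    df,
    col="Country",
    dest="Code",
    value={"Switzerland": "CH", "Germany": "DE"},
)
```

Rows without a matching key, such as France, keep their existing dest value.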

Group By Aggregates

pydantic model doespy.etl.steps.transformers.GroupByAggTransformer

The GroupByAggTransformer performs a group by followed by a set of aggregate functions applied to the data_columns.

Example ETL Pipeline Design
 $ETL$:
     transformers:
       - name: GroupByAggTransformer
         groupby_columns: [Run, $FACTORS$]
         data_columns: [Lat]
         agg_functions: [mean]

Example

Run  Rep  $CMD$  Lat
---  ---  -----  ---
0    0    xyz    0.1
0    1    xyz    0.3
1    0    xyz    0.5
1    1    xyz    0.5

➡️

Run  Lat_mean
---  --------
0    0.2
1    0.5
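
The example can be reproduced with a plain pandas group by. This is a minimal sketch, not the doespy implementation: the function name group_by_agg is hypothetical, and the <column>_<function> naming is inferred from the Lat_mean column in the example.

```python
import pandas as pd


def group_by_agg(df, groupby_columns, data_columns, agg_functions):
    """Hypothetical sketch: group, aggregate, and flatten the column names."""
    grouped = df.groupby(groupby_columns)[data_columns].agg(agg_functions)
    # Aggregating with a list yields two-level labels such as ("Lat", "mean");
    # join them into single names such as "Lat_mean".
    grouped.columns = ["_".join(label) for label in grouped.columns]
    return grouped.reset_index()


df = pd.DataFrame(
    {
        "Run": [0, 0, 1, 1],
        "Rep": [0, 1, 0, 1],
        "$CMD$": ["xyz", "xyz", "xyz", "xyz"],
        "Lat": [0.1, 0.3, 0.5, 0.5],
    }
)

result = group_by_agg(
    df, groupby_columns=["Run"], data_columns=["Lat"], agg_functions=["mean"]
)
```

The result has one row per Run, with the repetitions collapsed into Lat_mean.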

field data_columns: List[str] [Required]

The columns that contain the data to aggregate, see agg_functions.

field groupby_columns: List[str] [Required]

The columns to group by. The list can contain the magic entry $FACTORS$, which expands to all factors of the experiment. For example, [exp_name, host_type, host_idx, $FACTORS$] would group by each run.
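
One way to picture the $FACTORS$ expansion is as a list substitution. The sketch below is an assumption about the behaviour described here, not doespy code: the helper expand_factors and the factor names payload_size and n_clients are hypothetical.

```python
def expand_factors(groupby_columns, factor_columns):
    """Hypothetical sketch: replace the $FACTORS$ entry with the factor names."""
    expanded = []
    for column in groupby_columns:
        if column == "$FACTORS$":
            # The magic entry stands in for every factor of the experiment.
            expanded.extend(factor_columns)
        else:
            expanded.append(column)
    return expanded


columns = expand_factors(
    ["exp_name", "host_type", "host_idx", "$FACTORS$"],
    factor_columns=["payload_size", "n_clients"],  # hypothetical factors
)
```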

field agg_functions: List[str] = ['mean', 'min', 'max', 'std', 'count']

List of aggregate functions to apply to data_columns.

field custom_tail_length: int = 5

“custom_tail” is a custom aggregation function that calculates the mean over the last custom_tail_length entries of a column.
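
A tail mean of this kind can be expressed as a callable passed to a pandas aggregation. This is a minimal sketch of the described behaviour, not the doespy source; the factory make_custom_tail is hypothetical.

```python
import pandas as pd


def make_custom_tail(custom_tail_length=5):
    """Hypothetical sketch: build an aggregate that averages the last entries."""

    def custom_tail(series):
        # Mean over the last `custom_tail_length` entries of the group.
        return series.tail(custom_tail_length).mean()

    return custom_tail


df = pd.DataFrame(
    {"Run": [0] * 7, "Lat": [9.0, 9.0, 1.0, 2.0, 3.0, 4.0, 5.0]}
)

# Only the last five Lat values (1.0 to 5.0) of the group enter the mean.
tail_mean = df.groupby("Run")["Lat"].agg(make_custom_tail(5))
```

An aggregate like this is useful when the first entries of a run are warm-up measurements that should not influence the summary.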