Data and Seed Parallelism

ExpOps can parallelize pipeline execution by duplicating parts of the process graph across data partitions or random seeds. Both features are configured per process in configs/project_config.yaml under model.parameters.pipeline.processes.

Data Parallelism

Use data parallelism when a process returns a dataframe that should be split and processed in parallel by downstream processes.

Configuration keys:

  • data_parallelism.size: Number of partitions (int) or explicit row counts (list of ints).
  • data_parallelism.data_name: Key in the process return dict that contains the dataframe.

Behavior:

  • The dataframe is split into partitions.
  • Downstream processes are duplicated per partition until a data aggregation boundary.
  • Partitioned processes are suffixed with _1, _2, ... in the expanded graph.
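To make the splitting rule concrete, here is a minimal Python sketch (not ExpOps code) of how an int `size` or an explicit list of row counts could partition a dataframe's rows. The `split_rows` helper is hypothetical, and the assumption that an int means near-equal partitions is inferred from the `[50,20,20]` example below:

```python
# Illustrative sketch only: split_rows is a hypothetical helper,
# not part of the ExpOps API.

def split_rows(rows, size):
    """Split rows into partitions per the data_parallelism.size rule."""
    if isinstance(size, int):
        # Assumption: an int means this many near-equal partitions.
        base, rem = divmod(len(rows), size)
        counts = [base + (1 if i < rem else 0) for i in range(size)]
    else:
        # A list gives explicit row counts per partition.
        counts = list(size)
    partitions, start = [], 0
    for count in counts:
        partitions.append(rows[start:start + count])
        start += count
    return partitions

rows = list(range(90))
print([len(p) for p in split_rows(rows, 3)])           # [30, 30, 30]
print([len(p) for p in split_rows(rows, [50, 20, 20])])  # [50, 20, 20]
```

Each resulting partition would then flow into its own duplicated downstream branch (`_1`, `_2`, ...).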

Example:

processes:
  - name: "feature_engineering"
    code_function: "define_feature_engineering_process"
    data_parallelism:
      size: 3 # or [50, 20, 20] for three partitions of 50, 20, and 20 rows
      data_name: df

Process return requirements:

  • The process must return a dict containing the key named by data_name.
  • The value can be a pandas DataFrame, a list of records, or a dict (it will be converted).
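A process body satisfying these requirements might look like the following sketch. The function name and signature are hypothetical; the essential point is that the returned dict contains the configured data_name key ("df" in the example above):

```python
# Hypothetical process body; the actual signature ExpOps expects may differ.
def feature_engineering_process(hyperparameters):
    # Build the data as a list of records; a pandas DataFrame or a
    # dict of columns would also be accepted and converted.
    records = [
        {"feature_a": 1.0, "feature_b": 0.5},
        {"feature_a": 2.0, "feature_b": 1.5},
    ]
    # The key must match data_parallelism.data_name ("df" here).
    return {"df": records}

result = feature_engineering_process({})
print(sorted(result.keys()))  # ['df']
```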

Seed Parallelism

Use seed parallelism to run downstream branches multiple times with deterministic seeds.

Configuration keys:

  • seed_parallelism.seeds: List of integer seeds; a single int is also accepted in YAML.

Behavior:

  • Downstream processes are duplicated once per seed until a seed aggregation boundary.
  • Seeded processes are suffixed with _seed{value} in the expanded graph.
  • The seed value is injected into the process hyperparameters as random_seed.

Note: ExpOps seeds common RNGs (Python, NumPy, PyTorch, TensorFlow) at task level, but it does not automatically set random_state/seed arguments for every library or model. If your library requires an explicit seed parameter, read hyperparameters["random_seed"] in your process code and pass it along yourself.
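For example, a training process could read the injected seed and hand it explicitly to a library that needs one. Everything in this sketch except the random_seed key is hypothetical:

```python
import random

# Hypothetical training process body. ExpOps injects the branch's seed
# into the hyperparameters dict under the key "random_seed".
def training_process(hyperparameters):
    seed = hyperparameters["random_seed"]
    rng = random.Random(seed)  # explicit, library-level seeding
    # e.g. model = SomeModel(random_state=seed) for libraries with a seed arg
    sample = [rng.random() for _ in range(3)]
    return {"seed": seed, "sample": sample}

# The same seed yields the same draws, so each _seed{value} branch
# is individually reproducible.
a = training_process({"random_seed": 21})
b = training_process({"random_seed": 21})
print(a["sample"] == b["sample"])  # True
```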

Example:

processes:
  - name: "train_model"
    code_function: "define_training_process"
    seed_parallelism:
      seeds: [21, 22, 23]

Aggregation Boundaries

To stop graph duplication and combine results, mark a process as an aggregation boundary:

processes:
  - name: "feature_engineering_generic"
    data_parallelism:
      size: 2
      data_name: df

  - name: "preprocess_linear_nn"
    seed_parallelism:
      seeds: [40, 41]

  - name: "ensemble_aggregation"
    data_parallelism:
      aggregation: true
    seed_parallelism:
      aggregation: true

Aggregation processes receive a merged data payload containing the partition/seed results keyed by the base process name.
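The exact payload layout is determined by ExpOps; as an assumption-laden sketch, an aggregation body that averages per-branch scores could look like this, where the branch results are assumed to arrive as a list under the base process name:

```python
# Hypothetical aggregation body. Assumed payload shape: the merged data
# groups the partition/seed results under the base process name.
def ensemble_aggregation(data):
    results = data["train_model"]  # assumed: list of branch results
    scores = [r["score"] for r in results]
    # Combine the branches, e.g. by averaging a metric.
    return {"mean_score": sum(scores) / len(scores)}

payload = {"train_model": [{"score": 0.8}, {"score": 0.9}, {"score": 1.0}]}
print(ensemble_aggregation(payload))  # mean of the three branch scores
```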