Data and Seed Parallelism

ExpOps can parallelize pipeline execution by duplicating parts of the process graph across data partitions or random seeds. Both features are configured per process in configs/project_config.yaml under model.parameters.pipeline.processes.

Data Parallelism

Use data parallelism when a process returns a dataframe that should be split and processed in parallel by downstream processes.

Configuration keys:

  • data_parallelism.size: Number of partitions (int) or explicit row counts (list of ints).
  • data_parallelism.data_name: Key in the process return dict that contains the dataframe.

Behavior:

  • The dataframe is split into partitions.
  • Downstream processes are duplicated per partition until a data aggregation boundary.
  • Partitioned processes are suffixed with _1, _2, ... in the expanded graph.
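To make the splitting rule concrete, here is a minimal Python sketch (not ExpOps code) of how an int `size` or an explicit list of row counts could partition a dataframe's rows. The `split_rows` helper is hypothetical, and the assumption that an int means near-equal partitions is inferred from the `[50,20,20]` example below:

```python
# Illustrative sketch only: split_rows is a hypothetical helper,
# not part of the ExpOps API.

def split_rows(rows, size):
    """Split rows into partitions per the data_parallelism.size rule."""
    if isinstance(size, int):
        # Assumption: an int means this many near-equal partitions.
        base, rem = divmod(len(rows), size)
        counts = [base + (1 if i < rem else 0) for i in range(size)]
    else:
        # A list gives explicit row counts per partition.
        counts = list(size)
    partitions, start = [], 0
    for count in counts:
        partitions.append(rows[start:start + count])
        start += count
    return partitions

rows = list(range(90))
print([len(p) for p in split_rows(rows, 3)])           # [30, 30, 30]
print([len(p) for p in split_rows(rows, [50, 20, 20])])  # [50, 20, 20]
```

Each resulting partition would then flow into its own duplicated downstream branch (`_1`, `_2`, ...).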

Example:

processes:
  - name: "feature_engineering"
    code_function: "define_feature_engineering_process"
    data_parallelism:
      size: 3 # or [50, 20, 20] for three partitions of 50, 20, and 20 rows
      data_name: df

Process return requirements:

  • The process must return a dict containing the key named by data_name.
  • The value can be a pandas DataFrame, a list of records, or a dict (it will be converted).
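A process body satisfying these requirements might look like the following sketch. The function name and signature are hypothetical; the essential point is that the returned dict contains the configured data_name key ("df" in the example above):

```python
# Hypothetical process body; the actual signature ExpOps expects may differ.
def feature_engineering_process(hyperparameters):
    # Build the data as a list of records; a pandas DataFrame or a
    # dict of columns would also be accepted and converted.
    records = [
        {"feature_a": 1.0, "feature_b": 0.5},
        {"feature_a": 2.0, "feature_b": 1.5},
    ]
    # The key must match data_parallelism.data_name ("df" here).
    return {"df": records}

result = feature_engineering_process({})
print(sorted(result.keys()))  # ['df']
```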

Seed Parallelism

Use seed parallelism to run downstream branches multiple times with deterministic seeds.

Configuration keys:

  • seed_parallelism.seeds: List of integer seeds; a single int is also accepted in YAML.

Behavior:

  • Downstream processes are duplicated once per seed until a seed aggregation boundary.
  • Seeded processes are suffixed with _seed{value} in the expanded graph.
  • The seed value is injected into the process hyperparameters as random_seed.

Note: ExpOps seeds common RNGs (Python, NumPy, PyTorch, TensorFlow) at task level, but it does not automatically set random_state/seed arguments for every library or model. If your library requires an explicit seed parameter, read hyperparameters["random_seed"] in your process code and pass it along yourself.
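For example, a training process could read the injected seed and hand it explicitly to a library that needs one. Everything in this sketch except the random_seed key is hypothetical:

```python
import random

# Hypothetical training process body. ExpOps injects the branch's seed
# into the hyperparameters dict under the key "random_seed".
def training_process(hyperparameters):
    seed = hyperparameters["random_seed"]
    rng = random.Random(seed)  # explicit, library-level seeding
    # e.g. model = SomeModel(random_state=seed) for libraries with a seed arg
    sample = [rng.random() for _ in range(3)]
    return {"seed": seed, "sample": sample}

# The same seed yields the same draws, so each _seed{value} branch
# is individually reproducible.
a = training_process({"random_seed": 21})
b = training_process({"random_seed": 21})
print(a["sample"] == b["sample"])  # True
```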

Example:

processes:
  - name: "train_model"
    code_function: "define_training_process"
    seed_parallelism:
      seeds: [21, 22, 23]

Aggregation Boundaries

To stop graph duplication and combine results, mark a process as an aggregation boundary:

processes:
  - name: "feature_engineering_generic"
    data_parallelism:
      size: 2
      data_name: df

  - name: "preprocess_linear_nn"
    seed_parallelism:
      seeds: [40, 41]

  - name: "ensemble_aggregation"
    data_parallelism:
      aggregation: true
    seed_parallelism:
      aggregation: true

Aggregation processes receive a merged data payload containing the partition/seed results keyed by the base process name.
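The exact payload layout is determined by ExpOps; as an assumption-laden sketch, an aggregation body that averages per-branch scores could look like this, where the branch results are assumed to arrive as a list under the base process name:

```python
# Hypothetical aggregation body. Assumed payload shape: the merged data
# groups the partition/seed results under the base process name.
def ensemble_aggregation(data):
    results = data["train_model"]  # assumed: list of branch results
    scores = [r["score"] for r in results]
    # Combine the branches, e.g. by averaging a metric.
    return {"mean_score": sum(scores) / len(scores)}

payload = {"train_model": [{"score": 0.8}, {"score": 0.9}, {"score": 1.0}]}
print(ensemble_aggregation(payload))  # mean of the three branch scores
```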