.. _pipeline_details:

================
Pipeline details
================

This section walks through each stage of the ``ffrprep`` pipeline
— BIDS validation, preprocessing, and analysis — covering the
nodes that make up each Nipype workflow and the file layout each
stage writes to disk.

Pipeline Overview
=================

The ``ffrprep`` pipeline consists of three main stages:

1. **BIDS Validation** - Checks that the dataset complies with the BIDS specification
2. **Preprocessing** - Re-references, filters, and epochs the EEG data
3. **Analysis** - Averages epochs into evoked responses; time-frequency representations and FFR metrics are computed when the reports are built

Each stage is implemented as a modular Nipype workflow. The CLI parallelizes
across (task, run) iterations within a subject via a ``ProcessPoolExecutor``
sized by ``--n_procs``; for cross-subject scaling on clusters, run one CLI
invocation per subject (e.g. via Slurm job arrays). See the
:ref:`Parallelization <parallelization>` section under *Usage* for details.
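
A minimal sketch of this dispatch pattern, assuming a hypothetical
``process_iteration`` worker in place of the real per-iteration code
(the actual CLI internals may differ). ``run_iterations`` and its
``executor_cls`` parameter are illustrative names, not part of
``ffrprep``:

.. code-block:: python

    from concurrent.futures import (
        ProcessPoolExecutor,
        ThreadPoolExecutor,  # drop-in swap that is handy when debugging
        as_completed,
    )


    def process_iteration(task, run):
        """Hypothetical stand-in for one (task, run) iteration."""
        return f"task-{task}_run-{run}"


    def run_iterations(iterations, n_procs, executor_cls=ProcessPoolExecutor):
        """Dispatch (task, run) pairs to a pool sized by ``n_procs``."""
        results = {}
        with executor_cls(max_workers=n_procs) as pool:
            futures = {
                pool.submit(process_iteration, task, run): (task, run)
                for task, run in iterations
            }
            for future in as_completed(futures):
                # .result() re-raises any exception raised in the worker.
                results[futures[future]] = future.result()
        return results

With a process pool, the worker has to be defined at module level so
it can be pickled, and the call should sit under an
``if __name__ == "__main__":`` guard.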


Stage 1: BIDS Validation
========================

The first stage checks that your input dataset follows the Brain Imaging Data Structure (BIDS) specification, which supports reproducibility and compatibility with other neuroimaging tools.

**Purpose:**
Verify dataset structure, file naming conventions, and required metadata files before processing begins.

**Implementation:**
The validation uses the ``bids-validator`` tool with a custom configuration that ignores warnings not relevant to EEG/FFR data.

**Key Functions:**

- `validate_input_dir() <https://sitek.github.io/ffrprep/generated/ffrprep.utils.validate_input_dir.html#ffrprep.utils.validate_input_dir>`_ - Main validation function
- Custom validator configuration for EEG-specific requirements

**Validation Steps:**

1. **Directory Structure Check**
   
   - Verifies presence of required BIDS directories (``sub-*/``, ``derivatives/``)
   - Checks for ``dataset_description.json`` and other required metadata files
   - Validates subject/session/task naming conventions

2. **EEG-Specific Validation**
   
   - Confirms presence of EEG data files (``.edf``, ``.bdf``, ``.vhdr``, ``.fif``, ``.set``)
   - Validates channel description files (``*_channels.tsv``)
   - Checks event files (``*_events.tsv``) for proper formatting
   - Verifies EEG-specific metadata in JSON sidecars

3. **Participant Selection**
   
   - Validates requested participant labels exist in dataset
   - Checks for required EEG data for specified participants
   - Reports any missing or incomplete data

**Error Handling:**
If validation fails, ``ffrprep`` provides detailed error messages indicating specific BIDS compliance issues and suggestions for resolution.

**Skip Option:**
Validation can be bypassed using ``--skip_bids_validation`` (not recommended for production analyses).
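
The participant-selection check can be pictured with a small
standard-library sketch. ``missing_participants`` is a hypothetical
helper written for illustration; it is not the ``ffrprep`` or
``bids-validator`` implementation:

.. code-block:: python

    from pathlib import Path


    def missing_participants(bids_root, labels):
        """Return requested labels that have no ``sub-<label>/`` directory."""
        root = Path(bids_root)
        present = {
            path.name[len("sub-"):]
            for path in root.glob("sub-*")
            if path.is_dir()
        }
        # Accept labels written with or without the "sub-" prefix.
        wanted = [
            label[len("sub-"):] if label.startswith("sub-") else label
            for label in labels
        ]
        return [label for label in wanted if label not in present]

An empty return value means every requested participant has a
directory; it says nothing yet about whether EEG files exist inside.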

Stage 2: Preprocessing
======================

The preprocessing stage converts raw EEG data into clean, epoched data suitable for FFR analysis. This stage implements standard electrophysiological preprocessing steps optimized for frequency-following responses.

**Purpose:**
Transform raw continuous EEG into clean, filtered, and epoched data while preserving FFR-relevant neural signals.

**Implementation:**
Implemented as a Nipype workflow (`create_preprocessing_workflow() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.create_preprocessing_workflow.html#ffrprep.preproc.create_preprocessing_workflow>`_) with the following nodes:

Preprocessing Workflow Nodes
----------------------------

**1. Data Loading Node**

*Function:* `load_data() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.load_data.html#ffrprep.preproc.load_data>`_

*Purpose:* Robustly load EEG data from BIDS datasets using PyBIDS and MNE-BIDS.

*Sub-steps:*
   - Create ``BIDSLayout`` object for dataset querying
   - Query for EEG files matching participant/session/task/run criteria
   - Try multiple file extensions (``.edf``, ``.bdf``, ``.vhdr``, ``.fif``, ``.set``)
   - Load data using ``mne_bids.read_raw_bids()``
   - Extract and validate channel information
   - Load associated event data and metadata

*Outputs:* Raw EEG data object, BIDS path information, original filename
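
The extension fallback can be sketched with plain ``pathlib``. This
is an illustration only: ``find_eeg_file`` is a hypothetical helper,
and the real node queries the dataset through ``BIDSLayout`` and
loads it with ``mne_bids.read_raw_bids()`` rather than building
paths by hand:

.. code-block:: python

    from pathlib import Path

    # Extensions tried in order, as listed in the sub-steps above.
    EXTENSIONS = (".edf", ".bdf", ".vhdr", ".fif", ".set")


    def find_eeg_file(bids_root, subject, task, run=None, session=None):
        """Return the first existing ``*_eeg.<ext>`` file for the entities."""
        folder = Path(bids_root) / f"sub-{subject}"
        entities = [f"sub-{subject}"]
        if session is not None:
            folder = folder / f"ses-{session}"
            entities.append(f"ses-{session}")
        entities.append(f"task-{task}")
        if run is not None:
            entities.append(f"run-{run}")
        stem = "_".join(entities) + "_eeg"
        for ext in EXTENSIONS:
            candidate = folder / "eeg" / (stem + ext)
            if candidate.exists():
                return candidate
        raise FileNotFoundError(f"no EEG file found for {stem}")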

**2. Re-referencing Node**

*Function:* `reference_data() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.reference_data.html#ffrprep.preproc.reference_data>`_

*Purpose:* Apply appropriate reference scheme to reduce common-mode noise and artifacts.

*Sub-steps:*
   - Parse reference channel specification (average, single channel, or channel list)
   - Validate reference channels exist in data
   - Apply re-referencing using MNE's ``set_eeg_reference()``
   - Update channel information and provenance

*Reference Options:*
   - **Average reference** (``--ref_channels average``): Uses all EEG channels
   - **Single channel** (``--ref_channels Cz``): References to one electrode
   - **Multiple channels** (``--ref_channels "Cz,Fz"``): Average of specified channels

*Outputs:* Re-referenced EEG data
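
A sketch of how a ``--ref_channels`` value might be parsed and
validated before re-referencing. ``parse_ref_channels`` is a
hypothetical helper, not the function ``ffrprep`` ships:

.. code-block:: python

    def parse_ref_channels(spec, available):
        """Turn a CLI string into a value usable as ``ref_channels``."""
        spec = spec.strip()
        if spec.lower() == "average":
            return "average"
        # "Cz" -> ["Cz"]; "Cz,Fz" -> ["Cz", "Fz"]
        channels = [name.strip() for name in spec.split(",") if name.strip()]
        if not channels:
            raise ValueError("no reference channels given")
        missing = [name for name in channels if name not in available]
        if missing:
            raise ValueError(f"reference channels not in data: {missing}")
        return channels

The returned value has the shape that MNE's
``raw.set_eeg_reference(ref_channels=...)`` accepts: the string
``"average"`` or a list of channel names.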

**3. Filtering Node**

*Function:* `filter_data() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.filter_data.html#ffrprep.preproc.filter_data>`_

*Purpose:* Apply temporal filtering to remove noise while preserving FFR signals.

*Sub-steps:*
   - Apply high-pass filter to remove slow drifts and DC offsets
   - Apply low-pass filter to remove high-frequency noise
   - Use zero-phase FIR filters to avoid phase distortion and latency shifts
   - Log filter parameters and transition bands

*Default Parameters:*
   - **High-pass:** 1.0 Hz (removes slow drifts, preserves FFR frequencies)
   - **Low-pass:** 40.0 Hz (removes EMG and high-frequency noise)
   - **Filter design:** Zero-phase FIR with automatic transition bandwidth

*Outputs:* Filtered EEG data
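
To illustrate what "zero-phase FIR" means, here is a NumPy-only
sketch. It is not the MNE filter design that the node relies on
(MNE chooses filter length and transition bands automatically); the
windowed-sinc design, the helper names, and the fixed ``n_taps``
argument are assumptions made for the example:

.. code-block:: python

    import numpy as np


    def bandpass_fir(sfreq, l_freq, h_freq, n_taps):
        """Windowed-sinc band-pass kernel with odd, symmetric length."""
        if n_taps % 2 == 0:
            n_taps += 1
        n = np.arange(n_taps) - (n_taps - 1) / 2

        def lowpass(cutoff):
            fc = cutoff / sfreq  # cutoff in cycles per sample
            kernel = 2 * fc * np.sinc(2 * fc * n) * np.hamming(n_taps)
            return kernel / kernel.sum()  # unit gain at DC

        # Band-pass as the difference of two low-pass kernels.
        return lowpass(h_freq) - lowpass(l_freq)


    def filter_zero_phase(data, kernel):
        """Centered convolution along the last axis.

        A symmetric kernel applied with ``mode="same"`` adds no delay,
        so peaks keep their latency. The input must be longer than
        the kernel.
        """
        return np.apply_along_axis(
            lambda x: np.convolve(x, kernel, mode="same"), -1, data
        )

With the documented defaults this would be called as
``filter_zero_phase(data, bandpass_fir(sfreq, 1.0, 40.0, n_taps))``;
samples within half a kernel length of either edge are affected by
zero padding.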

**4. Epoching Node**

*Function:* `epoch_data() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.epoch_data.html#ffrprep.preproc.epoch_data>`_

*Purpose:* Segment the filtered continuous EEG into time-locked
epochs around stimulus events, applying baseline correction and
amplitude-based rejection.

*Sub-steps:*
   - Load events from the BIDS ``*_events.tsv`` (or use annotations
     embedded in the raw recording when no sidecar is present)
   - Build ``mne.Epochs`` with the requested ``tmin`` / ``tmax``,
     baseline window, picks, and ``reject`` thresholds
   - Apply ``epochs.drop_bad()`` to materialize amplitude-based
     rejection; keep the resulting Epochs object as the workflow
     output

*Default Parameters (FFR-typical, all overridable from the CLI):*
   - **Epoch window:** ``--tmin -0.2`` to ``--tmax 0.6`` seconds
     around stimulus onset
   - **Baseline:** ``--baseline -0.2 0`` seconds (pre-stimulus)
   - **Rejection:** ``--reject-eeg 75e-6`` (75 µV peak-to-peak); pass
     ``--no-auto-reject`` to disable

*Outputs:* Epoched EEG data and the post-rejection drop log.
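
The logic of windowing, baseline correction, and peak-to-peak
rejection can be sketched in NumPy. This is a simplified stand-in
for what ``mne.Epochs`` and ``drop_bad()`` do, not the node's code;
``epoch_and_reject`` and the string codes in its drop log are
illustrative assumptions:

.. code-block:: python

    import numpy as np


    def epoch_and_reject(data, sfreq, onsets, tmin=-0.2, tmax=0.6,
                         baseline=(-0.2, 0.0), reject=75e-6):
        """Cut epochs from ``data`` (n_channels, n_times), in volts.

        ``onsets`` are stimulus onsets as sample indices.
        """
        start = int(round(tmin * sfreq))
        stop = int(round(tmax * sfreq)) + 1  # include the tmax sample
        times = np.arange(start, stop) / sfreq
        in_baseline = (times >= baseline[0]) & (times <= baseline[1])

        kept, drop_log = [], []
        for onset in onsets:
            lo, hi = onset + start, onset + stop
            if lo < 0 or hi > data.shape[1]:
                drop_log.append("TOO_SHORT")
                continue
            segment = data[:, lo:hi].copy()
            # Subtract each channel's mean over the baseline window.
            segment -= segment[:, in_baseline].mean(axis=1, keepdims=True)
            # Reject when any channel exceeds the peak-to-peak threshold.
            if reject is not None and np.ptp(segment, axis=1).max() > reject:
                drop_log.append("PTP")
                continue
            kept.append(segment)
            drop_log.append("")

        if kept:
            epochs = np.stack(kept)
        else:
            epochs = np.empty((0, data.shape[0], times.size))
        return epochs, times, drop_log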

**5. Save Preprocessing Outputs Node**

*Function:* `save_preprocessing_outputs() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.save_preprocessing_outputs.html#ffrprep.preproc.save_preprocessing_outputs>`_
(invoked through the internal ``save_preprocessing_node`` wrapper,
which fans the Epochs out by trial type before calling it).

*Purpose:* Persist the epoched data plus a self-describing BIDS
sidecar — one file per trial type by default, or a single
combined file under ``--no-split-by-trial-type``.

*Sub-steps:*
   - When ``--split-by-trial-type`` is on (the default), partition
     the input Epochs by ``event_id`` and write one
     ``_desc-preproc{Cond}_epo.fif`` per trial type with a matching
     sidecar carrying a ``Condition`` field. Under
     ``--no-split-by-trial-type``, write a single bare
     ``_desc-preproc_epo.fif`` instead.
   - Reconstruct a fresh ``EpochsArray`` from the data + events +
     ``event_id`` so trial-type metadata survives the save / load
     round-trip (and the upstream Epochs object's internal state
     doesn't leak through ``.save()``).
   - Write each sibling ``.json`` sidecar with ``EpochCount`` /
     ``EpochCountTotal`` / ``EpochCountRejected`` /
     ``RejectionThresholds`` / ``Filtering`` / ``SamplingFrequency``
     / ``EpochTmin`` / ``EpochTmax`` / ``Channels`` plus run /
     session / ``ConcatenatedRuns`` / ``Condition`` provenance.
   - Initialize the per-derivatives ``dataset_description.json`` if
     missing.

*Reporting* runs **after** the workflow drains, in the CLI rather
than as a workflow node — see :ref:`reporting <reporting>` below.

*Output Structure (default split-by-trial-type):* ::

    derivatives/ffrprep-preprocessing/
    ├── sub-XX/
    │   └── eeg/
    │       ├── sub-XX_task-YY_run-ZZ_desc-preprocPositive_epo.fif
    │       ├── sub-XX_task-YY_run-ZZ_desc-preprocPositive_epo.json
    │       ├── sub-XX_task-YY_run-ZZ_desc-preprocNegative_epo.fif
    │       ├── sub-XX_task-YY_run-ZZ_desc-preprocNegative_epo.json
    │       ├── sub-XX_preprocessing_report.html
    │       └── sub-XX_preprocessing.log

Under ``--no-split-by-trial-type`` the per-trial-type files are
replaced by a single ``_desc-preproc_epo.fif`` + its sidecar.

The sidecar JSON carries provenance, ``EpochCount`` /
``EpochCountTotal`` / ``EpochCountRejected``, ``RejectionThresholds``,
``Filtering`` (high-pass and low-pass cut-offs), sampling frequency,
run / session identifiers, and ``Condition`` (for per-trial-type
files only).

*Outputs:* File paths, processing metadata.
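
As a rough picture of the naming scheme and the epoch-count fields,
the sketch below writes one sidecar with the standard library. It is
a hypothetical helper, not ``save_preprocessing_outputs()``; in
particular, treating ``EpochCount`` as the kept count and
``EpochCountRejected`` as total minus kept is an assumption:

.. code-block:: python

    import json
    from pathlib import Path


    def write_epochs_sidecar(out_dir, prefix, n_total, n_kept,
                             condition=None, **extra):
        """Write ``<prefix>_desc-preproc{Cond}_epo.json`` and return its path."""
        desc = "preproc" + (condition or "")
        sidecar = {
            "EpochCount": n_kept,
            "EpochCountTotal": n_total,
            "EpochCountRejected": n_total - n_kept,
            **extra,  # e.g. SamplingFrequency, EpochTmin, EpochTmax
        }
        if condition is not None:
            # Only per-trial-type files carry a Condition field.
            sidecar["Condition"] = condition
        path = Path(out_dir) / f"{prefix}_desc-{desc}_epo.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(sidecar, indent=2))
        return path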

Stage 3: Analysis
=================

The analysis stage averages each preprocessed epochs group into a
structured set of evoked responses (per-trial-type + combined +
optional difference) and saves them to BIDS-derivatives.

**Purpose:**
Produce per-(task, run) evoked responses suitable for downstream
statistical analysis or visualization, persisted in MNE-readable
format with self-describing BIDS sidecars.

**Implementation:**
Implemented as a Nipype workflow (`create_analysis_workflow() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.create_analysis_workflow.html#ffrprep.preproc.create_analysis_workflow>`_)
with two nodes. The CLI worker collects all per-trial-type
``_desc-preproc{Cond}_epo.fif`` files for one (task, run) group
via ``_collect_analysis_groups``, stitches them back together with
``mne.concatenate_epochs`` (which preserves ``event_id``), and
passes the resulting ``Epochs`` directly into the workflow's
``inputnode``. The per-(task, run) granularity means each group
runs the workflow once regardless of how many trial types it
holds.

Analysis Workflow Nodes
-----------------------

**1. Build Analysis Payload Node**

*Function:* `build_analysis_payload() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.build_analysis_payload.html#ffrprep.preproc.build_analysis_payload>`_

*Purpose:* From a single Epochs object, produce a structured
``{by_type, combined, diff}`` payload covering every evoked the
analysis stage emits.

*Sub-steps:*
   - Average all events into a single combined Evoked via
     ``make_combined_evoked`` (always emitted).
   - When ``split_by_trial_type`` is True (the default), partition
     by ``event_id`` via ``make_evoked(by_event_type=True)`` into a
     per-trial-type dict.
   - When at least two trial types are present and either
     (a) there are exactly two types (auto-paired) or
     (b) the user passed ``--difference-pairs A:B [C:D …]``,
     compute one difference Evoked per pair via
     ``make_difference_evokeds`` (which wraps
     ``mne.combine_evoked([A, B], weights=[1, -1])``).

*Outputs:* dict with keys ``"by_type"`` (dict trial_type → Evoked),
``"combined"`` (Evoked), and optionally ``"diff"`` (dict (A, B) →
Evoked).
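
The payload structure can be illustrated on plain arrays. The sketch
below mirrors the sub-steps with NumPy means instead of MNE
``Evoked`` objects; ``resolve_pairs`` and ``build_payload`` are
hypothetical names, and the real ``build_analysis_payload()`` may
differ in detail:

.. code-block:: python

    import numpy as np


    def resolve_pairs(trial_types, requested=None):
        """Explicit "A:B" strings win; otherwise auto-pair exactly two types."""
        if requested:
            pairs = []
            for item in requested:
                a, sep, b = item.partition(":")
                if not sep or a not in trial_types or b not in trial_types:
                    raise ValueError(f"bad difference pair: {item!r}")
                pairs.append((a, b))
            return pairs
        return [tuple(trial_types)] if len(trial_types) == 2 else []


    def build_payload(data, labels, split_by_trial_type=True,
                      requested_pairs=None):
        """``data`` is (n_epochs, n_channels, n_times); one label per epoch."""
        labels = np.asarray(labels)
        payload = {"combined": data.mean(axis=0), "by_type": {}}
        if not split_by_trial_type:
            return payload
        trial_types = list(dict.fromkeys(labels.tolist()))  # first-seen order
        for name in trial_types:
            payload["by_type"][name] = data[labels == name].mean(axis=0)
        pairs = resolve_pairs(trial_types, requested_pairs)
        if pairs:
            # A minus B, the same sign convention as weights=[1, -1].
            payload["diff"] = {
                (a, b): payload["by_type"][a] - payload["by_type"][b]
                for a, b in pairs
            }
        return payload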

**2. Save Analysis Node**

*Function:* `save_analysis_outputs() <https://sitek.github.io/ffrprep/generated/ffrprep.preproc.save_analysis_outputs.html#ffrprep.preproc.save_analysis_outputs>`_

*Purpose:* Persist every Evoked in the payload plus a
self-describing sidecar per file.

*Sub-steps:*
   - For each entry in ``by_type``: write
     ``_desc-evoked{Cond}.fif`` + matching sidecar with
     ``Condition: <cond>``.
   - For ``combined``: write the bare ``_desc-evoked.fif`` + sidecar
     (no ``Condition`` field).
   - For each entry in ``diff``: write
     ``_desc-evokedDiff{A}Vs{B}.fif`` + sidecar carrying
     ``DifferenceOf: [A, B]``.
   - Every sidecar also carries ``AverageCount`` / ``Baseline`` /
     ``SamplingFrequency`` / ``Tmin`` / ``Tmax`` / ``Channels`` /
     ``TaskName`` / ``AnalysisType`` plus run / session /
     ``ConcatenatedRuns`` provenance.
   - Initialize the per-derivatives ``dataset_description.json`` if
     missing.

*Output Structure (default split-by-trial-type, 2-trial-type
dataset):* ::

    derivatives/ffrprep-analysis/
    ├── sub-XX/
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedPositive.fif
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedPositive.json
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedNegative.fif
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedNegative.json
    │   ├── sub-XX_task-YY_run-ZZ_desc-evoked.fif
    │   ├── sub-XX_task-YY_run-ZZ_desc-evoked.json
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedDiffPositiveVsNegative.fif
    │   ├── sub-XX_task-YY_run-ZZ_desc-evokedDiffPositiveVsNegative.json
    │   ├── sub-XX_analysis_report.html
    │   └── sub-XX_analysis.log

Under ``--no-split-by-trial-type``, only the combined
``_desc-evoked.fif`` + sidecar are emitted per (task, run).

*Outputs:* Saved file paths.
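
The file names in the tree above follow directly from the payload
keys. A small sketch of that mapping, using a hypothetical
``evoked_filenames`` helper:

.. code-block:: python

    def evoked_filenames(prefix, by_type=(), diff_pairs=()):
        """List the ``.fif`` names for the combined, per-type, and diff evokeds."""
        names = [f"{prefix}_desc-evoked.fif"]  # combined, always present
        names += [f"{prefix}_desc-evoked{cond}.fif" for cond in by_type]
        names += [
            f"{prefix}_desc-evokedDiff{a}Vs{b}.fif" for a, b in diff_pairs
        ]
        return names

Each ``.fif`` name has a sibling ``.json`` sidecar with the same
stem.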

.. _reporting:

Reporting (post-workflow)
-------------------------

The single-file HTML reports are built **in the CLI** after the
workflow drains, not as a workflow node. The CLI's
``_build_preproc_report`` and ``_build_analysis_report`` glob the
saved ``_desc-preproc*_epo.fif`` and ``_desc-evoked*.fif`` files
via ``_collect_analysis_groups`` / ``_collect_evoked_groups``,
group them by (task, run), and build sections via the
:py:mod:`ffrprep.reports` builders (``build_raw_section`` /
``build_epoch_section`` / ``build_evoked_section`` /
``build_phase_consistency_section`` / ``make_group``); the
subject-level HTML is rendered via ``build_subject_report`` /
``build_analysis_report``.

**Per (task, run) group layout in the analysis report:**

- one **Evoked section for each per-trial-type file** (e.g. Positive,
  Negative). Each carries:

  - waveform / PSD / TFR / autocorrelation / pitch-track figures
    (via :py:func:`ffrprep.reports.evoked_qa`);
  - scalar metrics: ``RMS SNR (100-200 ms)``, ``Mean power 90-110 Hz,
    100-200 ms``;
  - when the BIDS ``stim_file`` column is populated in
    ``events.tsv``, a ``Stim correlation (peak r)`` +
    ``Stim correlation (lag, ms)`` row plus a stim ↔ response
    cross-correlation lag plot (computed by
    ``_stim_correlation_data`` in ``ffrprep_cli.py``).

- one **combined Evoked section** (the across-events average) with
  the same plots + scalars, plus a **second** pair of rows / plot
  for the **envelope** correlation. The combined average serves as
  a proxy for the envelope-following (ENV) component of the FFR,
  so ``|hilbert(stim)|`` is the natural reference.

- one **difference Evoked section** per ``(A, B)`` pair (auto for
  the 2-trial-type case; opt-in via ``--difference-pairs``) with
  the standard plots + a single raw-waveform stim correlation row
  + plot (the difference serves as a proxy for temporal fine
  structure, TFS).

- when exactly two per-condition preproc files exist for the
  group, one **Phase Consistency section** combining
  :py:func:`ffrprep.analysis.compute_phase_consistency` with
  :py:func:`ffrprep.analysis.plot_phase_consistency` (or, when the
  caller passes ``mask=True``, the masked variant). Single-subject
  reports default to unmasked; group-level builders pass
  ``mask=True``. Uses seaborn's ``flare_r`` colormap; subplot
  titles surface the trial-type names plus auto-derived
  ``A + B`` / ``A − B`` for the sum and difference panels.
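
The stim correlation rows can be approximated as follows. This is a
sketch under assumptions: the function names, the search over
non-negative lags only, and the Pearson normalization are
illustrative choices and may not match what
``_stim_correlation_data`` does. Both inputs are 1-D arrays at the
same sampling rate:

.. code-block:: python

    import numpy as np


    def envelope(x):
        """Magnitude of the analytic signal, i.e. ``|hilbert(x)|``, via FFT."""
        n = x.size
        gain = np.zeros(n)
        gain[0] = 1.0
        if n % 2 == 0:
            gain[n // 2] = 1.0
            gain[1:n // 2] = 2.0
        else:
            gain[1:(n + 1) // 2] = 2.0
        # Zero the negative frequencies and double the positive ones.
        return np.abs(np.fft.ifft(np.fft.fft(x) * gain))


    def peak_xcorr(stim, response, sfreq, max_lag_ms=20.0):
        """Return (peak r, lag in ms); positive lag means the response trails."""
        max_lag = int(round(max_lag_ms * sfreq / 1000))
        best_r, best_lag = -np.inf, 0
        for lag in range(max_lag + 1):
            n = min(stim.size, response.size - lag)
            if n < 2:
                break
            r = np.corrcoef(stim[:n], response[lag:lag + n])[0, 1]
            if r > best_r:
                best_r, best_lag = r, lag
        return best_r, 1000 * best_lag / sfreq

For the envelope variant, the same search would be run with
``envelope(stim)`` as the reference.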

**Per (task, run) group layout in the preprocessing report:**

- one **Raw section** loaded from the original BIDS recording
  (with run concatenation when the preproc output sourced multiple
  runs).
- one **Epoched section for each per-trial-type file**. Each carries
  the standard epoch metadata plus a ``Mean trial-to-trial r`` row
  (via :py:func:`ffrprep.analysis.response_consistency`) when there
  are at least 10 trials.
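
One plausible reading of ``Mean trial-to-trial r`` is the average
Pearson correlation over all pairs of trials. The sketch below
implements that reading for a single channel; it is an assumption
about the metric, not a copy of
``ffrprep.analysis.response_consistency``:

.. code-block:: python

    import numpy as np


    def mean_trial_to_trial_r(trials, min_trials=10):
        """Mean pairwise Pearson r over ``trials`` of shape (n_trials, n_times).

        Returns ``None`` when there are fewer than ``min_trials`` trials.
        """
        trials = np.asarray(trials, dtype=float)
        if trials.shape[0] < min_trials:
            return None
        r = np.corrcoef(trials)  # (n_trials, n_trials) correlation matrix
        upper = np.triu_indices(trials.shape[0], k=1)  # each pair once
        return float(r[upper].mean())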

Figures and scalar metrics are **computed at report time** from
the loaded ``Epochs`` / ``Evoked`` objects — they are not
separately persisted to disk. To recompute them yourself, see the
*Working with Outputs in Python* section of the
:ref:`walkthrough`.

The single-file ``*_report.html`` embeds all figures as inline
base64 PNGs — there is no sibling ``figures/`` directory.
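
Inline embedding works by encoding the PNG bytes into a ``data:``
URI. A minimal sketch, with ``inline_png`` as a hypothetical helper
rather than the function the report builders use:

.. code-block:: python

    import base64
    import html


    def inline_png(png_bytes, alt=""):
        """Return an ``<img>`` tag that carries the PNG as a base64 data URI."""
        encoded = base64.b64encode(png_bytes).decode("ascii")
        safe_alt = html.escape(alt, quote=True)
        return f'<img alt="{safe_alt}" src="data:image/png;base64,{encoded}">'

With Matplotlib, the bytes can come from saving a figure into an
``io.BytesIO`` buffer via ``fig.savefig(buffer, format="png")`` and
reading ``buffer.getvalue()``.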

*File Formats:*
   - **MNE format (.fif):** Epoched / Evoked data, loadable in
     MNE-Python.
   - **WAV (under** ``<dataset>/stimuli/`` **):** stimulus audio
     referenced by ``events.tsv``'s ``stim_file`` column. Fetched
     via ``ffrprep-download example --with-stimuli``.
   - **JSON:** BIDS sidecar (human-readable, machine-parseable).
   - **HTML:** Self-contained single-file per-subject report.

Pipeline Integration and Quality Control
========================================

**Workflow Management:**

- Each stage implemented as a Nipype workflow for per-iteration
  dependency tracking
- Outer-loop parallelism: the CLI dispatches per-(task, run)
  iterations to a ``ProcessPoolExecutor`` sized by ``--n_procs``
- Fail-fast on any iteration error; the exception propagates to the
  CLI entry point
- Nipype caches per-iteration intermediates under ``work/``, and
  re-runs that already have a saved ``_desc-preproc_epo.fif`` skip
  re-executing the workflow

**Quality Control Checkpoints:**

- BIDS validation before processing
- Data quality assessment after loading
- Preprocessing quality metrics and reports
- Analysis validation and statistical checks

**Output Organization:**

- BIDS-compatible directory structure
- Per-file BIDS sidecars with run / session / condition provenance
- Standardized file formats for interoperability
- Version-controlled processing parameters

**Customization:**

- All preprocessing and analysis parameters are exposed as CLI
  flags (see :ref:`usage`).
- The Nipype workflows can be imported and reused programmatically
  from ``ffrprep.preproc`` (``create_preprocessing_workflow`` /
  ``create_analysis_workflow``).
- Section builders in ``ffrprep.reports`` accept ``extra_summary``
  and ``extra_figures`` kwargs so downstream code can fold
  caller-computed scalars or figures into a section's table or
  figure gallery.