Layers

Primitives

`RotorLayer`

Bases: CliffordModule

Learnable versor layer with universal grade parameterization.

For grade=2 (default): learns R = exp(-B/2) and applies the isometry x' = RxR~. For grade=k: learns a grade-k element V and applies the versor product x' = hat(V) x V^{-1}, where hat denotes grade involution.

Preserves origin. For grade=2, also preserves lengths and angles (isometry).

The exp strategy (closed-form vs. decomposition) is controlled by `algebra.exp_policy`; see `core.runtime.decomposition.ExpPolicy`.
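To make the grade=2 case concrete, here is a minimal sketch of the rotor sandwich x' = RxR~ in the 2D algebra Cl(2,0), written in plain Python rather than with the library's tensors. The helper names (`gp`, `reverse`, `rotor_from_angle`, `sandwich`) and the component ordering [1, e1, e2, e12] are assumptions made for this illustration; they are not part of the documented API.

```python
import math


def gp(a, b):
    """Geometric product in Cl(2,0); components ordered [1, e1, e2, e12]."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return [
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,  # scalar part (e12^2 = -1)
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,  # e1 part
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,  # e2 part
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,  # e12 part
    ]


def reverse(a):
    """Reversion: flips the sign of the grade-2 component."""
    return [a[0], a[1], a[2], -a[3]]


def rotor_from_angle(theta):
    """R = exp(-B/2) for B = theta * e12, in closed form."""
    return [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]


def sandwich(R, x):
    """Apply x' = R x R~."""
    return gp(gp(R, x), reverse(R))


R = rotor_from_angle(math.pi / 2)
rotated = sandwich(R, [0.0, 1.0, 0.0, 0.0])  # rotate the basis vector e1
length = math.hypot(rotated[1], rotated[2])
```

With this sign convention, B = θ e12 rotates e1 towards e2 by θ, and the vector's length is unchanged, which is the isometry property stated above.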

Attributes:

Name	Type	Description
`channels`	`int`	Number of versors.
`grade`	`int`	Grade of the learnable parameter. Default 2 (bivector → rotor).
`grade_weights`	`Parameter`	Learnable grade-k coefficients [channels, num_grade_elements].
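The second dimension of `grade_weights` is the number of grade-k basis blades, which is the binomial coefficient C(n, k) for an algebra with n basis vectors. The sketch below counts those blades and scatters compact coefficients into a dense multivector of size 2^n. It assumes a bitmask blade ordering (blade `i` contains the basis vectors whose bits are set in `i`); that ordering and the helper names are assumptions for illustration, and the library's actual layout may differ.

```python
from math import comb


def grade_indices_bitmask(n, k):
    """Indices of grade-k blades among the 2**n basis blades.

    Assumes blade i is the product of the basis vectors whose bits are
    set in i. This is a common convention, not a documented guarantee.
    """
    return [i for i in range(2 ** n) if bin(i).count("1") == k]


def scatter_to_dense(weights, indices, dim):
    """Place compact grade-k coefficients into a dense multivector."""
    dense = [0.0] * dim
    for w, i in zip(weights, indices):
        dense[i] = w
    return dense


n, k = 3, 2
idx = grade_indices_bitmask(n, k)
num_grade_elements = len(idx)  # equals comb(n, k)
dense = scatter_to_dense([0.1, 0.2, 0.3], idx, 2 ** n)
```

For n = 3 and grade 2 there are three bivector coefficients per channel, so a layer with `channels=16` would hold a [16, 3] parameter under these assumptions.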

Source code in layers/primitives/rotor.py

class RotorLayer(CliffordModule):
    """Learnable versor layer with universal grade parameterization.

    For grade=2 (default): learns R = exp(-B/2) and applies the isometry x' = RxR~.
    For grade=k: learns a grade-k element V and applies the versor product
    x' = hat(V) x V^{-1}, where hat denotes grade involution.

    Preserves origin. For grade=2, also preserves lengths and angles (isometry).

    The exp strategy (closed-form vs decomposition) is controlled by
    ``algebra.exp_policy`` -- see :class:`core.runtime.decomposition.ExpPolicy`.

    Attributes:
        channels (int): Number of versors.
        grade (int): Grade of the learnable parameter. Default 2 (bivector → rotor).
        grade_weights (nn.Parameter): Learnable grade-k coefficients [channels, num_grade_elements].
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        grade: int = 2,
        *,
        input_grades=None,
        output_grades=None,
        input_layout: GradeLayout = None,
        output_layout: GradeLayout = None,
        compact_output: bool = True,
    ):
        """Initialize the versor layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Number of features.
            grade (int): Grade of the learnable parameter.
                grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
                grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group.
                grade=k: general grade-k versor product.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.grade = int(grade)
        self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
        self.output_storage = (
            resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
            if output_layout is not None or output_grades is not None
            else self.input_storage
        )
        self.input_layout = self.input_storage.layout
        self.output_layout = self.output_storage.layout
        self.compact_output = bool(compact_output)

        self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
        self.num_grade_elements = self.grade_indices.numel()
        self.parameter_layout = algebra.layout((self.grade,))

        self.grade_weights = nn.Parameter(torch.Tensor(self.channels, self.num_grade_elements))
        if self.grade == 2:
            tag_manifold(self.grade_weights, MANIFOLD_SPIN)

        # Versor cache for eval mode
        self._cached_V_left = None
        self._cached_V_right = None

        self.reset_parameters()

    # --- Backward-compat aliases (grade == 2 usage) ---

    @property
    def bivector_indices(self):
        return self.grade_indices

    @property
    def num_bivectors(self):
        return self.num_grade_elements

    @property
    def bivector_weights(self):
        return self.grade_weights

    # ---------------------------------------------------

    def reset_parameters(self):
        """Initialize with near-identity transform (small weights)."""
        nn.init.normal_(self.grade_weights, std=0.01)

    def _build_grade_element(self, device, dtype):
        """Scatter grade_weights into full multivector dimension [channels, dim]."""
        weights = self.grade_weights.to(device=device, dtype=dtype)
        return self.parameter_layout.dense(weights)

    def _compute_versors(self, device, dtype):
        """Compute left and right factors for per_channel_sandwich.

        For grade=2: left = R = exp(-B/2), right = R~ (reverse).
        For grade=k: left = hat(V) (grade involution), right = V^{-1} (blade inverse).
          V is L2-normalized per channel before inversion so that blade_inverse
          remains exact (norm_sq is purely scalar for unit-norm grade-k elements).

        Returns:
            Tuple[Tensor, Tensor]: (V_left [C, dim], V_right [C, dim])
        """
        weights = self.grade_weights.to(device=device, dtype=dtype)
        return dense_versor_factors(
            self.algebra,
            weights,
            grade=self.grade,
            parameter_layout=self.parameter_layout,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

        Caches versors during eval mode for faster inference.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Transformed input [Batch, Channels, Dim].
        """
        cache = (
            (self._cached_V_left, self._cached_V_right)
            if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
            else None
        )
        out, next_cache = self.algebra.versor_action(
            x,
            self.grade_weights,
            grade=self.grade,
            input_layout=self.input_layout,
            output_layout=self.output_layout,
            parameter_layout=self.parameter_layout,
            compact_output=self.compact_output,
            channels=self.channels,
            name="RotorLayer input",
            dense_cache=cache,
            cache_dense=not self.training,
            return_cache=True,
        )
        if not self.training and next_cache is not None:
            self._cached_V_left, self._cached_V_right = next_cache
        return out

    def train(self, mode: bool = True):
        """Invalidate versor cache when switching to train mode."""
        if mode:
            self._cached_V_left = None
            self._cached_V_right = None
        return super().train(mode)

    def prune_bivectors(self, threshold: float = 1e-4) -> int:
        """Zero out grade weights whose absolute value is below threshold.

        Args:
            threshold (float): Cutoff magnitude.

        Returns:
            int: Number of pruned parameters.
        """
        with torch.no_grad():
            mask = torch.abs(self.grade_weights) >= threshold
            num_pruned = (~mask).sum().item()
            self.grade_weights.data.mul_(mask.to(dtype=self.grade_weights.dtype))
        return num_pruned

    def sparsity_loss(self) -> torch.Tensor:
        """Compute L1 sparsity regularization on grade weights."""
        return torch.norm(self.grade_weights, p=1)

`__init__(algebra, channels, grade=2, *, input_grades=None, output_grades=None, input_layout=None, output_layout=None, compact_output=True)`

Initialize the versor layer.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`channels`	`int`	Number of features.	required
`grade`	`int`	Grade of the learnable parameter. grade=2 (default): bivectors → rotors via exp(-B/2), Spin group; grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group; grade=k: general grade-k versor product.	`2`
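The grade=1 case can be checked by hand: for a vector n, grade involution gives hat(n) = -n and the inverse is n / |n|², so hat(n) x n^{-1} reflects x in the hyperplane orthogonal to n. A minimal plain-Python sketch in Cl(2,0) follows. The helper names and the [1, e1, e2, e12] ordering are assumptions for illustration, and the sketch divides by |n|² directly instead of normalizing n first as the layer's internals describe.

```python
def gp(a, b):
    """Geometric product in Cl(2,0); components ordered [1, e1, e2, e12]."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return [
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,
    ]


def grade_involution(a):
    """hat(a): flips the sign of odd-grade components."""
    return [a[0], -a[1], -a[2], a[3]]


def vector_inverse(n):
    """Inverse of a pure vector: n / |n|^2."""
    norm_sq = n[1] ** 2 + n[2] ** 2
    return [0.0, n[1] / norm_sq, n[2] / norm_sq, 0.0]


def reflect(n, x):
    """Versor product hat(n) x n^{-1} for a grade-1 element n."""
    return gp(gp(grade_involution(n), x), vector_inverse(n))


n = [0.0, 2.0, 0.0, 0.0]  # unnormalized vector along e1
x = [0.0, 3.0, 4.0, 0.0]  # 3*e1 + 4*e2
reflected = reflect(n, x)
```

The component of x along n changes sign while the orthogonal component is kept, and the result does not depend on the length of n.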

Source code in layers/primitives/rotor.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    grade: int = 2,
    *,
    input_grades=None,
    output_grades=None,
    input_layout: GradeLayout = None,
    output_layout: GradeLayout = None,
    compact_output: bool = True,
):
    """Initialize the versor layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Number of features.
        grade (int): Grade of the learnable parameter.
            grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
            grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group.
            grade=k: general grade-k versor product.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.grade = int(grade)
    self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
    self.output_storage = (
        resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
        if output_layout is not None or output_grades is not None
        else self.input_storage
    )
    self.input_layout = self.input_storage.layout
    self.output_layout = self.output_storage.layout
    self.compact_output = bool(compact_output)

    self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
    self.num_grade_elements = self.grade_indices.numel()
    self.parameter_layout = algebra.layout((self.grade,))

    self.grade_weights = nn.Parameter(torch.Tensor(self.channels, self.num_grade_elements))
    if self.grade == 2:
        tag_manifold(self.grade_weights, MANIFOLD_SPIN)

    # Versor cache for eval mode
    self._cached_V_left = None
    self._cached_V_right = None

    self.reset_parameters()

`reset_parameters()`

Initialize with near-identity transform (small weights).

Source code in layers/primitives/rotor.py

def reset_parameters(self):
    """Initialize with near-identity transform (small weights)."""
    nn.init.normal_(self.grade_weights, std=0.01)

`forward(x)`

Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

Caches versors during eval mode for faster inference.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input [Batch, Channels, Dim].	required

Returns:

Type	Description
`Tensor`	Transformed input [Batch, Channels, Dim].

Source code in layers/primitives/rotor.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

    Caches versors during eval mode for faster inference.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Transformed input [Batch, Channels, Dim].
    """
    cache = (
        (self._cached_V_left, self._cached_V_right)
        if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
        else None
    )
    out, next_cache = self.algebra.versor_action(
        x,
        self.grade_weights,
        grade=self.grade,
        input_layout=self.input_layout,
        output_layout=self.output_layout,
        parameter_layout=self.parameter_layout,
        compact_output=self.compact_output,
        channels=self.channels,
        name="RotorLayer input",
        dense_cache=cache,
        cache_dense=not self.training,
        return_cache=True,
    )
    if not self.training and next_cache is not None:
        self._cached_V_left, self._cached_V_right = next_cache
    return out

`train(mode=True)`

Invalidate versor cache when switching to train mode.
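The caching behaviour described for `forward` and `train` follows a simple pattern: build the expensive factors once in eval mode, reuse them on later calls, and drop them when training resumes. The stand-in class below is a hypothetical plain-Python illustration of that pattern (the class, its `builds` counter, and the scalar "factor" are all invented for this sketch; it does not use the library).

```python
class CachedTransform:
    """Toy module that mirrors the eval-mode caching pattern."""

    def __init__(self, weight):
        self.weight = weight
        self.training = True
        self._cache = None
        self.builds = 0  # how many times the factor was recomputed

    def _build(self):
        self.builds += 1
        return self.weight * 2.0  # stand-in for the versor computation

    def forward(self, x):
        if not self.training and self._cache is not None:
            factor = self._cache  # reuse cached factor in eval mode
        else:
            factor = self._build()
            if not self.training:
                self._cache = factor  # only cache while in eval mode
        return factor * x

    def train(self, mode=True):
        if mode:
            self._cache = None  # invalidate when returning to train mode
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


m = CachedTransform(1.0).eval()
first = m.forward(3.0)
second = m.forward(3.0)      # served from the cache
builds_in_eval = m.builds
m.weight = 5.0               # weights changed while still in eval mode
stale = m.forward(1.0)       # still uses the old cached factor
fresh = m.train().forward(1.0)
```

One consequence worth noting: because the cache is only cleared on a switch to train mode, changing the weights while the module stays in eval mode leaves a stale cache in this sketch. Judging by the listing, the same may apply to the layer after in-place edits such as pruning, so calling `train()` and then `eval()` again afterwards is a cautious way to refresh it.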

Source code in layers/primitives/rotor.py

def train(self, mode: bool = True):
    """Invalidate versor cache when switching to train mode."""
    if mode:
        self._cached_V_left = None
        self._cached_V_right = None
    return super().train(mode)

`prune_bivectors(threshold=0.0001)`

Zero out grade weights whose absolute value is below threshold.

Parameters:

Name	Type	Description	Default
`threshold`	`float`	Cutoff magnitude.	`0.0001`

Returns:

Type	Description
`int`	Number of pruned parameters.
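The comparison in the source keeps entries with |w| >= threshold and counts everything else, including entries that are already zero. The plain-Python sketch below reproduces that rule on nested lists; the function name `prune` and the sample values are assumptions for illustration.

```python
def prune(weights, threshold=1e-4):
    """Zero entries with |w| < threshold in place and return their count."""
    num_pruned = 0
    for row in weights:
        for j, w in enumerate(row):
            if abs(w) < threshold:
                row[j] = 0.0
                num_pruned += 1
    return num_pruned


weights = [[0.5, -5e-5, 1e-4], [0.0, -0.2, 9e-5]]
count = prune(weights)        # counts -5e-5, 0.0 and 9e-5
count_again = prune(weights)  # zeros are counted again on a second call
```

Because zeros fall below any positive threshold, the returned count is the number of entries currently below the cutoff, not the number newly zeroed by this call.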

Source code in layers/primitives/rotor.py

def prune_bivectors(self, threshold: float = 1e-4) -> int:
    """Zero out grade weights whose absolute value is below threshold.

    Args:
        threshold (float): Cutoff magnitude.

    Returns:
        int: Number of pruned parameters.
    """
    with torch.no_grad():
        mask = torch.abs(self.grade_weights) >= threshold
        num_pruned = (~mask).sum().item()
        self.grade_weights.data.mul_(mask.to(dtype=self.grade_weights.dtype))
    return num_pruned

`sparsity_loss()`

Compute L1 sparsity regularization on grade weights.
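The L1 norm used here is the sum of absolute values over all grade weights. A typical way to use such a term is to add it to the task loss with a small coefficient; the sketch below shows the arithmetic in plain Python. The names `l1_norm`, `total_loss`, and `lambda_sparse` are hypothetical and not part of the documented API.

```python
def l1_norm(weights):
    """Sum of absolute values over a nested list of weights."""
    return sum(abs(w) for row in weights for w in row)


def total_loss(task_loss, weights, lambda_sparse=1e-3):
    """Task loss plus a weighted L1 sparsity penalty."""
    return task_loss + lambda_sparse * l1_norm(weights)


grade_weights = [[0.5, -0.25], [0.0, 1.0]]
penalty = l1_norm(grade_weights)
loss = total_loss(2.0, grade_weights, lambda_sparse=0.1)
```

A larger coefficient pushes more weights towards zero, which pairs naturally with a later call to `prune_bivectors`.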

Source code in layers/primitives/rotor.py

def sparsity_loss(self) -> torch.Tensor:
    """Compute L1 sparsity regularization on grade weights."""
    return torch.norm(self.grade_weights, p=1)

`MultiRotorLayer`

Bases: CliffordModule

Multi-versor layer with weighted superposition: x' = sum_k w_k hat(V_k) x V_k^{-1}.

For grade=2 (default): each V_k = exp(-B_k/2) is a rotor, reducing to x' = sum_k w_k R_k x R~_k. For grade=k: each V_k is a grade-k versor applied via the general versor product.

The exp strategy is controlled by algebra.exp_policy.
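A weighted superposition of rotors is linear in x but, unlike a single rotor, is not an isometry in general. The sketch below illustrates this for one channel in Cl(2,0) using plain Python: two rotors with opposite angles and equal weights average out the perpendicular component and shrink the vector. All helper names and the [1, e1, e2, e12] ordering are assumptions for this illustration.

```python
import math


def gp(a, b):
    """Geometric product in Cl(2,0); components ordered [1, e1, e2, e12]."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return [
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,
    ]


def rotor(theta):
    """R = exp(-B/2) for B = theta * e12."""
    return [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]


def sandwich(R, x):
    """x' = R x R~ (reversion flips the e12 sign)."""
    R_rev = [R[0], R[1], R[2], -R[3]]
    return gp(gp(R, x), R_rev)


def multi_rotor(x, thetas, weights):
    """x' = sum_k w_k R_k x R~_k for a single channel."""
    out = [0.0] * 4
    for theta, w in zip(thetas, weights):
        rotated = sandwich(rotor(theta), x)
        out = [o + w * r for o, r in zip(out, rotated)]
    return out


x = [0.0, 1.0, 0.0, 0.0]  # the basis vector e1
mixed = multi_rotor(x, thetas=[math.pi / 3, -math.pi / 3], weights=[0.5, 0.5])
```

Here the result is cos(π/3) e1, so the output has half the input length. In the layer itself the mixing weights have shape [channels, num_rotors], giving each channel its own combination of the shared rotors.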

Attributes:

Name	Type	Description
`channels`	`int`	Input features.
`num_rotors`	`int`	Number of overlapping versors.
`grade`	`int`	Grade of the learnable parameters. Default 2 (rotors).
`rotor_grade_weights`	`Parameter`	Grade-k coefficients [num_rotors, num_grade_elements].
`weights`	`Parameter`	Mixing weights [channels, num_rotors].

Source code in layers/primitives/multi_rotor.py

class MultiRotorLayer(CliffordModule):
    """Multi-versor layer with weighted superposition: x' = sum_k w_k hat(V_k) x V_k^{-1}.

    For grade=2 (default): each V_k = exp(-B_k/2) is a rotor, reducing to
    x' = sum_k w_k R_k x R~_k.
    For grade=k: each V_k is a grade-k versor applied via the general versor product.

    The exp strategy is controlled by ``algebra.exp_policy``.

    Attributes:
        channels (int): Input features.
        num_rotors (int): Number of overlapping versors.
        grade (int): Grade of the learnable parameters. Default 2 (rotors).
        rotor_grade_weights (nn.Parameter): Grade-k coefficients [num_rotors, num_grade_elements].
        weights (nn.Parameter): Mixing weights [channels, num_rotors].
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_rotors: int = 8,
        grade: int = 2,
        *,
        input_grades=None,
        output_grades=None,
        input_layout: GradeLayout = None,
        output_layout: GradeLayout = None,
        compact_output: bool = True,
    ):
        """Initialize Multi-Versor Layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Input features.
            num_rotors (int): Number of parallel versor heads.
            grade (int): Grade of the learnable parameter.
                grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
                grade=k: general grade-k versor product.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.num_rotors = require_positive_int(num_rotors, "num_rotors")
        self.grade = int(grade)
        self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
        self.output_storage = (
            resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
            if output_layout is not None or output_grades is not None
            else self.input_storage
        )
        self.input_layout = self.input_storage.layout
        self.output_layout = self.output_storage.layout
        self.compact_output = bool(compact_output)

        self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
        self.num_grade_elements = self.grade_indices.numel()
        self.parameter_layout = algebra.layout((self.grade,))

        self.rotor_grade_weights = nn.Parameter(torch.Tensor(self.num_rotors, self.num_grade_elements))
        if self.grade == 2:
            tag_manifold(self.rotor_grade_weights, MANIFOLD_SPIN)

        # Mixing weights (Euclidean — intentionally untagged)
        self.weights = nn.Parameter(torch.Tensor(self.channels, self.num_rotors))

        # Versor cache for eval mode
        self._cached_V_left = None
        self._cached_V_right = None

        self.reset_parameters()

    # --- Backward-compat aliases (grade == 2 usage) ---

    @property
    def bivector_indices(self):
        return self.grade_indices

    @property
    def num_bivectors(self):
        return self.num_grade_elements

    @property
    def rotor_bivectors(self):
        return self.rotor_grade_weights

    # ---------------------------------------------------

    def reset_parameters(self):
        """Initialize with small transforms and Xavier-uniform mixing weights."""
        nn.init.normal_(self.rotor_grade_weights, std=0.01)
        nn.init.xavier_uniform_(self.weights)

    def _compute_versors(self, device, dtype):
        """Compute left and right factors for all K versors.

        For grade=2: left = R_k = exp(-B_k/2), right = R~_k.
        For grade=k: left = hat(V_k), right = V_k^{-1}.

        Returns:
            Tuple[Tensor, Tensor]: (V_left [K, dim], V_right [K, dim])
        """
        weights = self.rotor_grade_weights.to(device=device, dtype=dtype)
        return dense_versor_factors(
            self.algebra,
            weights,
            grade=self.grade,
            parameter_layout=self.parameter_layout,
        )

    def forward(self, x: torch.Tensor, return_invariants: bool = False) -> torch.Tensor:
        """Apply weighted multi-versor superposition.

        Caches versors during eval mode for faster inference.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].
            return_invariants (bool): If True, returns per-grade norms instead of output.

        Returns:
            torch.Tensor: Transformed output [Batch, Channels, Dim], or
                per-grade norms of the output if return_invariants is True.
        """
        cache = (
            (self._cached_V_left, self._cached_V_right)
            if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
            else None
        )
        out, next_cache = self.algebra.multi_versor_action(
            x,
            self.rotor_grade_weights,
            self.weights,
            grade=self.grade,
            input_layout=self.input_layout,
            output_layout=self.output_layout,
            parameter_layout=self.parameter_layout,
            compact_output=self.compact_output,
            channels=self.channels,
            name="MultiRotorLayer input",
            dense_cache=cache,
            cache_dense=not self.training,
            return_cache=True,
        )
        if not self.training and next_cache is not None:
            self._cached_V_left, self._cached_V_right = next_cache

        if return_invariants:
            return self.algebra.grade_norms(out, layout=self.output_layout)

        return out

    def train(self, mode: bool = True):
        """Invalidate versor cache when switching to train mode."""
        if mode:
            self._cached_V_left = None
            self._cached_V_right = None
        return super().train(mode)

    def sparsity_loss(self) -> torch.Tensor:
        """Compute L1 sparsity loss for versor weights and mixing weights."""
        return torch.norm(self.rotor_grade_weights, p=1) + torch.norm(self.weights, p=1)

`__init__(algebra, channels, num_rotors=8, grade=2, *, input_grades=None, output_grades=None, input_layout=None, output_layout=None, compact_output=True)`

Initialize Multi-Versor Layer.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`channels`	`int`	Input features.	required
`num_rotors`	`int`	Number of parallel versor heads.	`8`
`grade`	`int`	Grade of the learnable parameter. grade=2 (default): bivectors → rotors via exp(-B/2), Spin group; grade=k: general grade-k versor product.	`2`

Source code in layers/primitives/multi_rotor.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_rotors: int = 8,
    grade: int = 2,
    *,
    input_grades=None,
    output_grades=None,
    input_layout: GradeLayout = None,
    output_layout: GradeLayout = None,
    compact_output: bool = True,
):
    """Initialize Multi-Versor Layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input features.
        num_rotors (int): Number of parallel versor heads.
        grade (int): Grade of the learnable parameter.
            grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
            grade=k: general grade-k versor product.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.num_rotors = require_positive_int(num_rotors, "num_rotors")
    self.grade = int(grade)
    self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
    self.output_storage = (
        resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
        if output_layout is not None or output_grades is not None
        else self.input_storage
    )
    self.input_layout = self.input_storage.layout
    self.output_layout = self.output_storage.layout
    self.compact_output = bool(compact_output)

    self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
    self.num_grade_elements = self.grade_indices.numel()
    self.parameter_layout = algebra.layout((self.grade,))

    self.rotor_grade_weights = nn.Parameter(torch.Tensor(self.num_rotors, self.num_grade_elements))
    if self.grade == 2:
        tag_manifold(self.rotor_grade_weights, MANIFOLD_SPIN)

    # Mixing weights (Euclidean — intentionally untagged)
    self.weights = nn.Parameter(torch.Tensor(self.channels, self.num_rotors))

    # Versor cache for eval mode
    self._cached_V_left = None
    self._cached_V_right = None

    self.reset_parameters()

`reset_parameters()`

Initialize with small transforms and Xavier-uniform mixing weights.

Source code in layers/primitives/multi_rotor.py

def reset_parameters(self):
    """Initialize with small transforms and Xavier-uniform mixing weights."""
    nn.init.normal_(self.rotor_grade_weights, std=0.01)
    nn.init.xavier_uniform_(self.weights)

`forward(x, return_invariants=False)`

Apply weighted multi-versor superposition.

Caches versors during eval mode for faster inference.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input [Batch, Channels, Dim].	required
`return_invariants`	`bool`	If True, returns per-grade norms instead of output.	`False`

Returns:

Type	Description
`Tensor`	Transformed output [Batch, Channels, Dim], or per-grade norms of the output if return_invariants is True.

Source code in layers/primitives/multi_rotor.py

def forward(self, x: torch.Tensor, return_invariants: bool = False) -> torch.Tensor:
    """Apply weighted multi-versor superposition.

    Caches versors during eval mode for faster inference.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].
        return_invariants (bool): If True, returns per-grade norms instead of output.

    Returns:
        torch.Tensor: Transformed output [Batch, Channels, Dim], or
            per-grade norms of the output if return_invariants is True.
    """
    cache = (
        (self._cached_V_left, self._cached_V_right)
        if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
        else None
    )
    out, next_cache = self.algebra.multi_versor_action(
        x,
        self.rotor_grade_weights,
        self.weights,
        grade=self.grade,
        input_layout=self.input_layout,
        output_layout=self.output_layout,
        parameter_layout=self.parameter_layout,
        compact_output=self.compact_output,
        channels=self.channels,
        name="MultiRotorLayer input",
        dense_cache=cache,
        cache_dense=not self.training,
        return_cache=True,
    )
    if not self.training and next_cache is not None:
        self._cached_V_left, self._cached_V_right = next_cache

    if return_invariants:
        return self.algebra.grade_norms(out, layout=self.output_layout)

    return out

`train(mode=True)`

Invalidate versor cache when switching to train mode.

Source code in layers/primitives/multi_rotor.py

def train(self, mode: bool = True):
    """Invalidate versor cache when switching to train mode."""
    if mode:
        self._cached_V_left = None
        self._cached_V_right = None
    return super().train(mode)

`sparsity_loss()`

Compute L1 sparsity loss for versor weights and mixing weights.

Source code in layers/primitives/multi_rotor.py

def sparsity_loss(self) -> torch.Tensor:
    """Compute L1 sparsity loss for versor weights and mixing weights."""
    return torch.norm(self.rotor_grade_weights, p=1) + torch.norm(self.weights, p=1)

`CliffordLinear`

Bases: CliffordModule

Fully connected layer with optional rotor-based backend.

Can use either:

- Traditional scalar weight matrix (default, backward compatible)
- Rotor-based transformation (new, parameter-efficient via RotorGadget)

The traditional backend uses O(in_channels x out_channels) parameters, while the rotor backend uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number of basis vectors.
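A quick back-of-the-envelope comparison makes the difference in scale visible. In the sketch below, the traditional count follows the shapes in the class listing (a weight of [Out, In] plus a bias of [Out, Dim]); for the rotor backend only the leading term quoted above is computed, since the exact RotorGadget parameter count is not shown here. The function names are hypothetical.

```python
def traditional_param_count(in_channels, out_channels, lane_dim):
    """Scalar weight matrix [Out, In] plus bias multivectors [Out, Dim]."""
    return out_channels * in_channels + out_channels * lane_dim


def bivector_count(n):
    """Number of grade-2 basis blades for n basis vectors: n(n-1)/2."""
    return n * (n - 1) // 2


def rotor_leading_term(num_rotor_pairs, n):
    """Leading term num_rotor_pairs * n(n-1)/2 from the O(...) estimate."""
    return num_rotor_pairs * bivector_count(n)


n = 3                 # basis vectors, so a dense multivector has 2**n entries
dense_dim = 2 ** n
trad = traditional_param_count(64, 64, dense_dim)
rot = rotor_leading_term(4, n)
```

The traditional count grows with the product of channel widths, whereas the rotor estimate depends only on the number of rotor pairs and the algebra size; constant factors and any bias or aggregation parameters in the rotor backend are not included in this estimate.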

Attributes:

Name	Type	Description
`in_channels`	`int`	Input features.
`out_channels`	`int`	Output features.
`backend`	`str`	'traditional' or 'rotor'
`weight`	`Parameter \| None`	Weights [Out, In] (traditional backend only).
`bias`	`Parameter \| None`	Bias multivector [Out, Dim] (traditional backend only).
`gadget`	`Module \| None`	Rotor transformation (rotor backend only).
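For the traditional backend, the forward pass in the listing below contracts the scalar weight matrix over the channel axis and then adds the bias, i.e. out[o, d] = sum_i W[o, i] x[i, d] + b[o, d]. The plain-Python sketch below spells out that contraction for a single sample without a batch axis; the function name `channel_mix` and the sample values are assumptions for illustration.

```python
def channel_mix(weight, bias, x):
    """Compute out[o][d] = sum_i weight[o][i] * x[i][d] + bias[o][d]."""
    out_channels, in_channels = len(weight), len(weight[0])
    dim = len(x[0])
    return [
        [
            sum(weight[o][i] * x[i][d] for i in range(in_channels)) + bias[o][d]
            for d in range(dim)
        ]
        for o in range(out_channels)
    ]


weight = [[1.0, 2.0]]            # [Out=1, In=2]
bias = [[0.0, 0.0, 0.5, 0.0]]    # [Out=1, Dim=4]
x = [
    [1.0, 0.0, 0.0, 0.0],        # channel 0
    [0.0, 1.0, 1.0, 0.0],        # channel 1
]
y = channel_mix(weight, bias, x)
```

Each weight is a single scalar per (output, input) channel pair, so blade components are mixed across channels only and never across blades.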

Source code in layers/primitives/linear.py

class CliffordLinear(CliffordModule):
    """Fully connected layer with optional rotor-based backend.

    Can use either:
    - Traditional scalar weight matrix (default, backward compatible)
    - Rotor-based transformation (new, parameter efficient via RotorGadget)

    The traditional backend uses O(in_channels x out_channels) parameters,
    while the rotor backend uses O(num_rotor_pairs x n(n-1)/2) parameters
    where n is the number of basis vectors.

    Attributes:
        in_channels (int): Input features.
        out_channels (int): Output features.
        backend (str): 'traditional' or 'rotor'
        weight (torch.nn.Parameter | None): Weights [Out, In] (traditional backend only).
        bias (torch.nn.Parameter | None): Bias multivector [Out, Dim] (traditional backend only).
        gadget (nn.Module | None): Rotor transformation (rotor backend only).
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        in_channels: int,
        out_channels: int,
        backend: Literal["traditional", "rotor"] = "traditional",
        num_rotor_pairs: int = 4,
        aggregation: Literal["mean", "sum", "learned"] = "mean",
        shuffle: Literal["none", "fixed", "random"] = "none",
        grades=None,
        layout: GradeLayout = None,
    ):
        """Initialize Clifford Linear.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            in_channels (int): Input size.
            out_channels (int): Output size.
            backend (str): 'traditional' for standard linear layer,
                          'rotor' for rotor-based transformation
            num_rotor_pairs (int): Number of rotor pairs (rotor backend only)
            aggregation (str): Aggregation method (rotor backend only)
            shuffle (str): Input channel shuffle strategy (rotor backend only):
                - 'none': No shuffle (default)
                - 'fixed': Fixed random permutation
                - 'random': Random permutation each forward pass
        """
        super().__init__(algebra)
        self.in_channels = require_positive_int(in_channels, "in_channels")
        self.out_channels = require_positive_int(out_channels, "out_channels")
        self.backend = require_choice(backend, "backend", ("traditional", "rotor"))
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        if self.backend == "traditional":
            self.weight = nn.Parameter(torch.Tensor(self.out_channels, self.in_channels))
            self.bias = nn.Parameter(torch.Tensor(self.out_channels, self.lane_dim))
            self.reset_parameters()
            self.gadget = None

        elif self.backend == "rotor":
            if self.layout is not None:
                raise ValueError(
                    "CliffordLinear rotor backend is dense-only; use traditional backend for compact lanes."
                )
            from .rotor_gadget import RotorGadget

            self.gadget = RotorGadget(
                algebra=algebra,
                in_channels=self.in_channels,
                out_channels=self.out_channels,
                num_rotor_pairs=num_rotor_pairs,
                aggregation=aggregation,
                shuffle=shuffle,
                bias=True,  # Include bias in rotor gadget
            )
            self.weight = None
            self.bias = None

    def reset_parameters(self):
        """Initialize weights with Xavier uniform and zero bias."""
        if self.backend == "traditional":
            nn.init.xavier_uniform_(self.weight)
            nn.init.zeros_(self.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply channel-mixing linear transformation.

        Args:
            x (torch.Tensor): Input [Batch, In, Dim].

        Returns:
            torch.Tensor: Output [Batch, Out, Dim].
        """
        self.storage.validate_input(
            x,
            channels=self.in_channels,
            name="CliffordLinear input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )

        if self.backend == "traditional":
            out = torch.einsum("oi,...id->...od", self.weight, x)
            bias_shape = (1,) * (x.ndim - 2) + (self.out_channels, self.lane_dim)
            out = out + self.bias.view(bias_shape)
            return out
        return self.gadget(x)

    def extra_repr(self) -> str:
        """String representation for debugging.

        Returns:
            str: Layer parameters description
        """
        parts = [f"in_channels={self.in_channels}", f"out_channels={self.out_channels}", f"backend={self.backend}"]
        if self.layout is not None:
            parts.append(f"grades={self.layout.grades}")
        return ", ".join(parts)

`__init__(algebra, in_channels, out_channels, backend='traditional', num_rotor_pairs=4, aggregation='mean', shuffle='none', grades=None, layout=None)`

Initialize Clifford Linear.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`in_channels`	`int`	Input size.	required
`out_channels`	`int`	Output size.	required
`backend`	`str`	'traditional' for standard linear layer, 'rotor' for rotor-based transformation	`'traditional'`
`num_rotor_pairs`	`int`	Number of rotor pairs (rotor backend only)	`4`
`aggregation`	`str`	Aggregation method (rotor backend only)	`'mean'`
`shuffle`	`str`	Input channel shuffle strategy (rotor backend only). 'none': no shuffle (default); 'fixed': fixed random permutation; 'random': random permutation on each forward pass.	`'none'`

Source code in layers/primitives/linear.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    in_channels: int,
    out_channels: int,
    backend: Literal["traditional", "rotor"] = "traditional",
    num_rotor_pairs: int = 4,
    aggregation: Literal["mean", "sum", "learned"] = "mean",
    shuffle: Literal["none", "fixed", "random"] = "none",
    grades=None,
    layout: GradeLayout = None,
):
    """Initialize Clifford Linear.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        in_channels (int): Input size.
        out_channels (int): Output size.
        backend (str): 'traditional' for standard linear layer,
                      'rotor' for rotor-based transformation
        num_rotor_pairs (int): Number of rotor pairs (rotor backend only)
        aggregation (str): Aggregation method (rotor backend only)
        shuffle (str): Input channel shuffle strategy (rotor backend only):
            - 'none': No shuffle (default)
            - 'fixed': Fixed random permutation
            - 'random': Random permutation each forward pass
    """
    super().__init__(algebra)
    self.in_channels = require_positive_int(in_channels, "in_channels")
    self.out_channels = require_positive_int(out_channels, "out_channels")
    self.backend = require_choice(backend, "backend", ("traditional", "rotor"))
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    if self.backend == "traditional":
        self.weight = nn.Parameter(torch.Tensor(self.out_channels, self.in_channels))
        self.bias = nn.Parameter(torch.Tensor(self.out_channels, self.lane_dim))
        self.reset_parameters()
        self.gadget = None

    elif self.backend == "rotor":
        if self.layout is not None:
            raise ValueError(
                "CliffordLinear rotor backend is dense-only; use traditional backend for compact lanes."
            )
        from .rotor_gadget import RotorGadget

        self.gadget = RotorGadget(
            algebra=algebra,
            in_channels=self.in_channels,
            out_channels=self.out_channels,
            num_rotor_pairs=num_rotor_pairs,
            aggregation=aggregation,
            shuffle=shuffle,
            bias=True,  # Include bias in rotor gadget
        )
        self.weight = None
        self.bias = None

`reset_parameters()` ¶

Initialize weights with Xavier uniform and zero bias.

Source code in layers/primitives/linear.py

def reset_parameters(self):
    """Initialize weights with Xavier uniform and zero bias."""
    if self.backend == "traditional":
        nn.init.xavier_uniform_(self.weight)
        nn.init.zeros_(self.bias)

`forward(x)` ¶

Apply channel-mixing linear transformation.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input [Batch, In, Dim].	required

Returns:

Type	Description
`Tensor`	Output [Batch, Out, Dim].
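
As a rough illustration of the traditional backend, the einsum `oi,...id->...od` mixes channels with one scalar weight per (output, input) pair and leaves the blade axis untouched. The pure-Python sketch below mirrors that contraction for a single sample. `channel_mix_linear` is a hypothetical helper name used only for this example; it is not part of the library.

```python
def channel_mix_linear(weight, bias, x):
    """Pure-Python analogue of einsum("oi,id->od", weight, x) + bias.

    weight: [out_channels][in_channels] scalar mixing weights
    bias:   [out_channels][dim] per-blade bias
    x:      [in_channels][dim] one sample of multivector channels
    """
    out_channels = len(weight)
    in_channels = len(x)
    dim = len(x[0])
    out = []
    for o in range(out_channels):
        row = []
        for d in range(dim):
            # Each blade component d is mixed independently across channels.
            acc = sum(weight[o][i] * x[i][d] for i in range(in_channels))
            row.append(acc + bias[o][d])
        out.append(row)
    return out


# Two input channels, one output channel, four blade components.
x = [[1.0, 2.0, 0.0, 0.0],
     [0.0, 1.0, 3.0, 0.0]]
weight = [[0.5, 2.0]]
bias = [[0.0, 0.0, 0.0, 1.0]]
y = channel_mix_linear(weight, bias, x)
print(y)  # [[0.5, 3.0, 6.0, 1.0]]
```

Because the weight carries no blade index, the same channel mixture is applied to every component of the multivector; only the bias differs per blade.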

Source code in layers/primitives/linear.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply channel-mixing linear transformation.

    Args:
        x (torch.Tensor): Input [Batch, In, Dim].

    Returns:
        torch.Tensor: Output [Batch, Out, Dim].
    """
    self.storage.validate_input(
        x,
        channels=self.in_channels,
        name="CliffordLinear input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )

    if self.backend == "traditional":
        out = torch.einsum("oi,...id->...od", self.weight, x)
        bias_shape = (1,) * (x.ndim - 2) + (self.out_channels, self.lane_dim)
        out = out + self.bias.view(bias_shape)
        return out
    return self.gadget(x)

`extra_repr()` ¶

String representation for debugging.

Returns:

Type	Description
`str`	Layer parameters description

Source code in layers/primitives/linear.py

def extra_repr(self) -> str:
    """String representation for debugging.

    Returns:
        str: Layer parameters description
    """
    parts = [f"in_channels={self.in_channels}", f"out_channels={self.out_channels}", f"backend={self.backend}"]
    if self.layout is not None:
        parts.append(f"grades={self.layout.grades}")
    return ", ".join(parts)

`RotorGadget` ¶

Bases: CliffordModule

Rotor-based linear transformation (Generalized Rotor Gadget).

Replaces standard linear layers with parameter-efficient rotor-sandwich transformations. Instead of using O(in_channels x out_channels) parameters, this uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number of basis vectors in the Clifford algebra.
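
To make the parameter comparison concrete, the sketch below counts the weights of a dense channel-mixing matrix against the bivector coefficients of the gadget. The factor of 2 reflects the separate left and right rotor parameters in the source listing; bias and optional learned aggregation weights are left out on both sides. The function names are hypothetical.

```python
from math import comb


def dense_weight_params(in_channels: int, out_channels: int) -> int:
    """Weight count of a standard channel-mixing matrix (bias excluded)."""
    return in_channels * out_channels


def rotor_gadget_params(n: int, num_rotor_pairs: int) -> int:
    """Bivector coefficient count for the left and right rotors combined."""
    num_bivectors = comb(n, 2)  # n(n-1)/2 independent bivector components
    return 2 * num_rotor_pairs * num_bivectors


dense = dense_weight_params(64, 64)
rotor = rotor_gadget_params(n=3, num_rotor_pairs=4)
print(dense, rotor)  # 4096 24
```

Note that the rotor count does not depend on the channel counts at all, which is where the savings come from for wide layers.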

Architecture:

1. Partition input channels into blocks.
2. For each rotor pair (i, j), apply the rotor sandwich r_ij . x_i . s_ij.H.
3. Pool/aggregate results to output channels.

The transformation is: psi(x) = r.x.s.H where r, s are rotors (bivector exponentials).
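
The two-sided product can be checked by hand in the smallest algebra that has a bivector. The sketch below is a self-contained toy for Cl(2,0) with components ordered [1, e1, e2, e12]; it does not use the library, all function names are hypothetical, and it reads `.H` as reversion, which is how the source listing builds the right-hand factor.

```python
import math


def gp(a, b):
    """Geometric product in Cl(2,0); components ordered [1, e1, e2, e12]."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return [
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,
    ]


def rotor(theta):
    """R = exp(-B/2) for B = theta * e12: cos(theta/2) - sin(theta/2) e12."""
    return [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]


def reverse(m):
    """Reversion flips the sign of the grade-2 part only."""
    return [m[0], m[1], m[2], -m[3]]


def two_sided(r, x, s):
    """psi(x) = r * x * reverse(s)."""
    return gp(gp(r, x), reverse(s))


one = [1.0, 0.0, 0.0, 0.0]
e1 = [0.0, 1.0, 0.0, 0.0]
quarter = rotor(math.pi / 2)

# Same rotor on both sides: an ordinary rotation that sends e1 to e2.
rotated = two_sided(quarter, e1, quarter)

# Same rotor on both sides leaves a scalar untouched...
same_scalar = two_sided(quarter, one, quarter)

# ...while distinct left and right rotors give the scalar a bivector part.
mixed_scalar = two_sided(quarter, one, rotor(0.0))
```

With r equal to s the map is the familiar grade-preserving rotation. With independent r and s it stays linear in x but can move content between grades, which is what gives the gadget more freedom than a single rotor.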

Attributes:

Name	Type	Description
`algebra`	`AlgebraLike`	CliffordAlgebra instance
`in_channels`	`int`	Number of input channels
`out_channels`	`int`	Number of output channels
`num_rotor_pairs`	`int`	Number of rotor pairs to use
`aggregation`	`str`	Aggregation method ('mean', 'sum', or 'learned')

Source code in layers/primitives/rotor_gadget.py

class RotorGadget(CliffordModule):
    """Rotor-based linear transformation (Generalized Rotor Gadget).

    Replaces standard linear layers with parameter-efficient rotor-sandwich
    transformations. Instead of using O(in_channels x out_channels) parameters,
    this uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number
    of basis vectors in the Clifford algebra.

    Architecture:
        1. Partition input channels into blocks
        2. For each rotor pair (i, j):
           - Apply rotor sandwich: r_ij . x_i . s_ij.H
        3. Pool/aggregate results to output channels

    The transformation is: psi(x) = r.x.s.H where r, s are rotors (bivector exponentials).

    Attributes:
        algebra: CliffordAlgebra instance
        in_channels: Number of input channels
        out_channels: Number of output channels
        num_rotor_pairs: Number of rotor pairs to use
        aggregation: Aggregation method ('mean', 'sum', or 'learned')
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        in_channels: int,
        out_channels: int,
        num_rotor_pairs: int = 4,
        aggregation: Literal["mean", "sum", "learned"] = "mean",
        shuffle: Literal["none", "fixed", "random"] = "none",
        bias: bool = False,
    ):
        """Initialize rotor gadget layer.

        Args:
            algebra: CliffordAlgebra instance
            in_channels: Number of input channels
            out_channels: Number of output channels
            num_rotor_pairs: Number of rotor pairs (higher = more expressive)
            aggregation: How to pool rotor outputs ('mean', 'sum', 'learned')
            shuffle: Input channel shuffle strategy:
                - 'none': No shuffle, sequential block assignment (default)
                - 'fixed': Random permutation at initialization (fixed during training)
                - 'random': Random permutation each forward pass (regularization)
            bias: Whether to include bias term (applied after transformation)
        """
        super().__init__(algebra)
        if not hasattr(algebra, "per_channel_sandwich"):
            raise ValueError("RotorGadget is dense-only and requires CliffordAlgebra.")

        self.in_channels = require_positive_int(in_channels, "in_channels")
        self.out_channels = require_positive_int(out_channels, "out_channels")
        self.num_rotor_pairs = require_positive_int(num_rotor_pairs, "num_rotor_pairs")
        self.aggregation = require_choice(aggregation, "aggregation", ("mean", "sum", "learned"))
        self.shuffle = require_choice(shuffle, "shuffle", ("none", "fixed", "random"))

        if algebra.num_grades <= 2:
            raise ValueError("Algebra has no bivectors. RotorGadget requires at least one bivector for rotation.")
        self.register_buffer("bivector_indices", grade_indices(algebra, 2, name="bivector grade"))
        self.num_bivectors = self.bivector_indices.numel()

        # Rotor parameters: bivector coefficients for exponential map
        # Left rotors: [num_rotor_pairs, num_bivectors]
        self.bivector_left = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
        tag_manifold(self.bivector_left, MANIFOLD_SPIN)
        # Right rotors: [num_rotor_pairs, num_bivectors]
        self.bivector_right = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
        tag_manifold(self.bivector_right, MANIFOLD_SPIN)

        # Channel routing: block diagonal partitioning (paper style)
        # Each rotor pair processes a subset of input channels
        self._setup_channel_routing()

        # Aggregation weights (if learned)
        if self.aggregation == "learned":
            self.agg_weights = nn.Parameter(torch.ones(self.num_rotor_pairs, self.out_channels) / self.num_rotor_pairs)
        else:
            self.register_buffer("agg_weights", None)

        # Optional bias
        if bias:
            self.bias = nn.Parameter(torch.zeros(self.out_channels, algebra.dim))
        else:
            self.register_buffer("bias", None)

        # Rotor cache for eval mode
        self._cached_rotors = None

    def _setup_channel_routing(self):
        """Set up block diagonal channel routing with optional shuffle.

        Partitions input and output channels into blocks, where each rotor
        pair operates on a specific block. Optionally shuffles input channels
        before routing for regularization.
        """
        in_assignment = torch.div(
            torch.arange(self.in_channels) * self.num_rotor_pairs,
            self.in_channels,
            rounding_mode="floor",
        ).clamp_max(self.num_rotor_pairs - 1)
        out_assignment = torch.div(
            torch.arange(self.out_channels) * self.num_rotor_pairs,
            self.out_channels,
            rounding_mode="floor",
        ).clamp_max(self.num_rotor_pairs - 1)

        in_indices = []
        out_indices = []
        for i in range(self.num_rotor_pairs):
            in_members = (in_assignment == i).nonzero(as_tuple=False).flatten()
            out_members = (out_assignment == i).nonzero(as_tuple=False).flatten()
            if in_members.numel() == 0:
                in_indices.append((self.in_channels, self.in_channels))
            else:
                in_indices.append((int(in_members[0]), int(in_members[-1]) + 1))
            if out_members.numel() == 0:
                out_indices.append((self.out_channels, self.out_channels))
            else:
                out_indices.append((int(out_members[0]), int(out_members[-1]) + 1))

        self.in_indices = in_indices
        self.out_indices = out_indices

        ch2pair = in_assignment.to(dtype=torch.long)
        self.register_buffer("_ch2pair", ch2pair)
        self.register_buffer("_channel_mix_mean", channel_mix(self.in_channels, self.out_channels, normalize=True))
        self.register_buffer("_channel_mix_sum", channel_mix(self.in_channels, self.out_channels, normalize=False))
        self.register_buffer("_pair_mean", pair_mean(ch2pair, self.num_rotor_pairs))

        # Set up channel shuffle permutation
        if self.shuffle == "fixed":
            # Create fixed random permutation at initialization
            perm = torch.randperm(self.in_channels)
            self.register_buffer("channel_permutation", perm)
        elif self.shuffle == "random":
            # Random shuffle each forward pass - no fixed permutation
            self.register_buffer("channel_permutation", None)
        else:  # 'none'
            # No shuffle - identity permutation
            self.register_buffer("channel_permutation", None)

    def _bivector_to_multivector(self, bivector_coeffs: torch.Tensor) -> torch.Tensor:
        """Convert bivector coefficients to full multivector via vectorized scatter.

        Args:
            bivector_coeffs: Tensor of shape [..., num_bivectors]

        Returns:
            Multivector tensor of shape [..., algebra.dim]
        """
        return dense_from_indices(bivector_coeffs, self.bivector_indices, self.algebra.dim)

    def _compute_rotors(self, device=None, dtype=None):
        """Compute rotor multivectors from bivector parameters.

        Returns:
            Tuple of (left_rotors, right_rotors_reversed) where each is
            a tensor of shape [num_rotor_pairs, algebra.dim]
        """
        left = self.bivector_left
        right = self.bivector_right
        if device is not None or dtype is not None:
            left = left.to(device=device, dtype=dtype)
            right = right.to(device=device, dtype=dtype)

        # Convert bivector parameters to multivectors
        B_left = self._bivector_to_multivector(left)  # [pairs, dim]
        B_right = self._bivector_to_multivector(right)  # [pairs, dim]

        # Compute rotors via exponential map: R = exp(-0.5 * B)
        R_left = self.algebra.exp(-0.5 * B_left)  # [pairs, dim]
        R_right = self.algebra.exp(-0.5 * B_right)  # [pairs, dim]

        # Compute reverse of right rotors for sandwich product
        R_right_rev = self.algebra.reverse(R_right)  # [pairs, dim]

        return R_left, R_right_rev

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply rotor-based transformation.

        Uses batched geometric products - all rotor pairs are applied in
        parallel via a single pair of GP calls.

        Args:
            x: Input tensor of shape [Batch, In_Channels, Dim]

        Returns:
            Output tensor of shape [Batch, Out_Channels, Dim]
        """
        check_multivector(x, self.algebra, "RotorGadget input")
        check_channels(x, self.in_channels, "RotorGadget input")

        # Apply input channel shuffle if enabled
        if self.shuffle == "fixed":
            x = x.index_select(-2, self.channel_permutation)
        elif self.shuffle == "random":
            perm = torch.randperm(self.in_channels, device=x.device)
            x = x.index_select(-2, perm)

        # Compute rotors (cached in eval mode)
        if not self.training and cache_matches(self._cached_rotors, x):
            R_left, R_right_rev = self._cached_rotors
        else:
            R_left, R_right_rev = self._compute_rotors(x.device, x.dtype)
            if not self.training:
                self._cached_rotors = (R_left, R_right_rev)

        ch2pair = self._ch2pair.to(device=R_left.device)
        R_left_by_channel = R_left[ch2pair]
        R_right_by_channel = R_right_rev[ch2pair]
        concat_out = self.algebra.per_channel_sandwich(R_left_by_channel, x, R_right_by_channel)

        # Map to output channels
        out = self._aggregate_to_output_channels(concat_out)

        if self.bias is not None:
            bias_shape = (1,) * (out.ndim - 2) + (self.out_channels, self.algebra.dim)
            out = out + self.bias.view(bias_shape)

        return out

    def _aggregate_to_output_channels(self, x: torch.Tensor) -> torch.Tensor:
        """Aggregate rotor pair outputs to match output channel count.

        Args:
            x: Concatenated outputs from rotor pairs [B, total_channels, dim]

        Returns:
            Aggregated output [B, out_channels, dim]
        """
        if self.aggregation == "learned":
            pair_values = torch.einsum("ki,...id->...kd", self._pair_mean.to(device=x.device, dtype=x.dtype), x)
            return torch.einsum("ko,...kd->...od", self.agg_weights.to(device=x.device, dtype=x.dtype), pair_values)

        mix = self._channel_mix_sum if self.aggregation == "sum" else self._channel_mix_mean
        return torch.einsum("oi,...id->...od", mix.to(device=x.device, dtype=x.dtype), x)

    def train(self, mode: bool = True):
        """Override to invalidate rotor cache when switching to train mode."""
        if mode:
            self._cached_rotors = None
        return super().train(mode)

    def extra_repr(self) -> str:
        """String representation for debugging."""
        return (
            f"in_channels={self.in_channels}, "
            f"out_channels={self.out_channels}, "
            f"num_rotor_pairs={self.num_rotor_pairs}, "
            f"aggregation={self.aggregation}, "
            f"shuffle={self.shuffle}, "
            f"bias={self.bias is not None}"
        )

`__init__(algebra, in_channels, out_channels, num_rotor_pairs=4, aggregation='mean', shuffle='none', bias=False)` ¶

Initialize rotor gadget layer.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	CliffordAlgebra instance	required
`in_channels`	`int`	Number of input channels	required
`out_channels`	`int`	Number of output channels	required
`num_rotor_pairs`	`int`	Number of rotor pairs (higher = more expressive)	`4`
`aggregation`	`Literal['mean', 'sum', 'learned']`	How to pool rotor outputs ('mean', 'sum', 'learned')	`'mean'`
`shuffle`	`Literal['none', 'fixed', 'random']`	Input channel shuffle strategy: - 'none': No shuffle, sequential block assignment (default) - 'fixed': Random permutation at initialization (fixed during training) - 'random': Random permutation each forward pass (regularization)	`'none'`
`bias`	`bool`	Whether to include bias term (applied after transformation)	`False`
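
The sequential block assignment and the shuffle options can be pictured with plain lists. The sketch below reproduces the proportional-floor rule from the routing setup in the class listing (channel c goes to pair floor(c * num_rotor_pairs / num_channels), clamped to the last pair) and shows what a channel permutation does to the input order. Both helper names are hypothetical.

```python
def block_assignment(num_channels: int, num_rotor_pairs: int) -> list[int]:
    """Assign each channel index to a rotor pair by proportional flooring."""
    last = num_rotor_pairs - 1
    return [min(c * num_rotor_pairs // num_channels, last) for c in range(num_channels)]


def apply_permutation(channels: list, perm: list[int]) -> list:
    """Reorder channels the way an index-select along the channel axis would."""
    return [channels[p] for p in perm]


assignment = block_assignment(10, 4)
print(assignment)  # [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]

# With fewer channels than pairs, some pairs receive no input channel.
sparse = block_assignment(2, 4)
print(sparse)  # [0, 2]

# A permutation changes which channels land in which block.
shuffled = apply_permutation(["a", "b", "c"], [2, 0, 1])
print(shuffled)  # ['c', 'a', 'b']
```

With `shuffle='fixed'` one permutation is drawn at construction and reused; with `shuffle='random'` a new one is drawn on every forward call, so the channel-to-rotor pairing changes from call to call.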

Source code in layers/primitives/rotor_gadget.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    in_channels: int,
    out_channels: int,
    num_rotor_pairs: int = 4,
    aggregation: Literal["mean", "sum", "learned"] = "mean",
    shuffle: Literal["none", "fixed", "random"] = "none",
    bias: bool = False,
):
    """Initialize rotor gadget layer.

    Args:
        algebra: CliffordAlgebra instance
        in_channels: Number of input channels
        out_channels: Number of output channels
        num_rotor_pairs: Number of rotor pairs (higher = more expressive)
        aggregation: How to pool rotor outputs ('mean', 'sum', 'learned')
        shuffle: Input channel shuffle strategy:
            - 'none': No shuffle, sequential block assignment (default)
            - 'fixed': Random permutation at initialization (fixed during training)
            - 'random': Random permutation each forward pass (regularization)
        bias: Whether to include bias term (applied after transformation)
    """
    super().__init__(algebra)
    if not hasattr(algebra, "per_channel_sandwich"):
        raise ValueError("RotorGadget is dense-only and requires CliffordAlgebra.")

    self.in_channels = require_positive_int(in_channels, "in_channels")
    self.out_channels = require_positive_int(out_channels, "out_channels")
    self.num_rotor_pairs = require_positive_int(num_rotor_pairs, "num_rotor_pairs")
    self.aggregation = require_choice(aggregation, "aggregation", ("mean", "sum", "learned"))
    self.shuffle = require_choice(shuffle, "shuffle", ("none", "fixed", "random"))

    if algebra.num_grades <= 2:
        raise ValueError("Algebra has no bivectors. RotorGadget requires at least one bivector for rotation.")
    self.register_buffer("bivector_indices", grade_indices(algebra, 2, name="bivector grade"))
    self.num_bivectors = self.bivector_indices.numel()

    # Rotor parameters: bivector coefficients for exponential map
    # Left rotors: [num_rotor_pairs, num_bivectors]
    self.bivector_left = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
    tag_manifold(self.bivector_left, MANIFOLD_SPIN)
    # Right rotors: [num_rotor_pairs, num_bivectors]
    self.bivector_right = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
    tag_manifold(self.bivector_right, MANIFOLD_SPIN)

    # Channel routing: block diagonal partitioning (paper style)
    # Each rotor pair processes a subset of input channels
    self._setup_channel_routing()

    # Aggregation weights (if learned)
    if self.aggregation == "learned":
        self.agg_weights = nn.Parameter(torch.ones(self.num_rotor_pairs, self.out_channels) / self.num_rotor_pairs)
    else:
        self.register_buffer("agg_weights", None)

    # Optional bias
    if bias:
        self.bias = nn.Parameter(torch.zeros(self.out_channels, algebra.dim))
    else:
        self.register_buffer("bias", None)

    # Rotor cache for eval mode
    self._cached_rotors = None

`forward(x)` ¶

Apply rotor-based transformation.

Uses batched geometric products: all rotor pairs are applied in parallel via a single pair of geometric-product (GP) calls.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input tensor of shape [Batch, In_Channels, Dim]	required

Returns:

Type	Description
`Tensor`	Output tensor of shape [Batch, Out_Channels, Dim]

Source code in layers/primitives/rotor_gadget.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply rotor-based transformation.

    Uses batched geometric products - all rotor pairs are applied in
    parallel via a single pair of GP calls.

    Args:
        x: Input tensor of shape [Batch, In_Channels, Dim]

    Returns:
        Output tensor of shape [Batch, Out_Channels, Dim]
    """
    check_multivector(x, self.algebra, "RotorGadget input")
    check_channels(x, self.in_channels, "RotorGadget input")

    # Apply input channel shuffle if enabled
    if self.shuffle == "fixed":
        x = x.index_select(-2, self.channel_permutation)
    elif self.shuffle == "random":
        perm = torch.randperm(self.in_channels, device=x.device)
        x = x.index_select(-2, perm)

    # Compute rotors (cached in eval mode)
    if not self.training and cache_matches(self._cached_rotors, x):
        R_left, R_right_rev = self._cached_rotors
    else:
        R_left, R_right_rev = self._compute_rotors(x.device, x.dtype)
        if not self.training:
            self._cached_rotors = (R_left, R_right_rev)

    ch2pair = self._ch2pair.to(device=R_left.device)
    R_left_by_channel = R_left[ch2pair]
    R_right_by_channel = R_right_rev[ch2pair]
    concat_out = self.algebra.per_channel_sandwich(R_left_by_channel, x, R_right_by_channel)

    # Map to output channels
    out = self._aggregate_to_output_channels(concat_out)

    if self.bias is not None:
        bias_shape = (1,) * (out.ndim - 2) + (self.out_channels, self.algebra.dim)
        out = out + self.bias.view(bias_shape)

    return out

`train(mode=True)` ¶

Override to invalidate rotor cache when switching to train mode.
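
The cache pattern behind this override can be sketched without any tensor library. The toy class below is a stand-in, not the library class: `param` plays the role of the bivector parameters and `_compute` the role of the exponential map. It also omits the device and dtype check that the real forward pass performs through `cache_matches`.

```python
class CachedRotorToy:
    """Toy stand-in for the eval-mode cache pattern."""

    def __init__(self):
        self.training = True
        self.param = 1.0          # stands in for the bivector parameters
        self._cached = None
        self.compute_calls = 0

    def _compute(self):
        self.compute_calls += 1
        return self.param * 2.0   # stands in for the exponential map

    def forward(self):
        if not self.training and self._cached is not None:
            return self._cached
        value = self._compute()
        if not self.training:
            self._cached = value  # only cache while parameters are frozen
        return value

    def train(self, mode: bool = True):
        if mode:
            self._cached = None   # parameters may change again: drop the cache
        self.training = mode
        return self


demo = CachedRotorToy()
demo.train(False)
demo.forward()
demo.forward()
print(demo.compute_calls)  # 1
```

In this toy, a parameter edited while still in eval mode would keep returning the stale cached value until `train(True)` is called, so a round trip through training mode is the safe way to refresh it.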

Source code in layers/primitives/rotor_gadget.py

def train(self, mode: bool = True):
    """Override to invalidate rotor cache when switching to train mode."""
    if mode:
        self._cached_rotors = None
    return super().train(mode)

`extra_repr()` ¶

String representation for debugging.

Source code in layers/primitives/rotor_gadget.py

def extra_repr(self) -> str:
    """String representation for debugging."""
    return (
        f"in_channels={self.in_channels}, "
        f"out_channels={self.out_channels}, "
        f"num_rotor_pairs={self.num_rotor_pairs}, "
        f"aggregation={self.aggregation}, "
        f"shuffle={self.shuffle}, "
        f"bias={self.bias is not None}"
    )

`CliffordLayerNorm` ¶

Bases: CliffordModule

Geometric LayerNorm that preserves direction and recovers scale.

Normalizes the multivector to unit norm (preserving geometric direction), then injects the original log-magnitude into the scalar (grade-0) part via a learnable gate.

Attributes:

Name	Type	Description
`weight`	`Parameter`	Per-channel direction scale [C].
`bias`	`Parameter`	Per-channel scalar bias [C].
`norm_scale`	`Parameter`	Per-channel gate for log-magnitude injection into grade-0. Initialized to zero so the layer starts identical to the old (scale-discarding) behaviour.
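
For a single multivector the computation reduces to a few lines. The sketch below is a scalar-parameter toy, assuming a dense layout with the scalar coefficient stored at index 0; the function name and the default values are illustrative only.

```python
import math


def clifford_layer_norm(x, weight=1.0, bias=0.0, norm_scale=0.0, eps=1e-6):
    """Normalize one multivector (list of blade coefficients, scalar at index 0)."""
    # The norm is floored at eps so a zero multivector does not divide by zero.
    norm = max(math.sqrt(sum(c * c for c in x)), eps)
    out = [weight * c / norm for c in x]
    # Bias and the log-magnitude term touch only the scalar (grade-0) slot.
    out[0] += bias + norm_scale * math.log1p(norm)
    return out


x = [3.0, 4.0, 0.0, 0.0]                         # norm is 5
plain = clifford_layer_norm(x)                   # gate closed: direction only
gated = clifford_layer_norm(x, norm_scale=1.0)   # scalar part gains log(1 + 5)
```

With `norm_scale` at its initial value of zero the output is just the unit-norm direction, so the magnitude of the input is discarded. Once the gate opens, the magnitude is reintroduced in compressed form through `log1p`.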

Source code in layers/primitives/normalization.py

class CliffordLayerNorm(CliffordModule):
    """Geometric LayerNorm that preserves direction and recovers scale.

    Normalizes the multivector to unit norm (preserving geometric direction),
    then injects the original log-magnitude into the scalar (grade-0) part
    via a learnable gate.

    Attributes:
        weight (nn.Parameter): Per-channel direction scale [C].
        bias (nn.Parameter): Per-channel scalar bias [C].
        norm_scale (nn.Parameter): Per-channel gate for log-magnitude
            injection into grade-0.  Initialized to zero so the layer
            starts identical to the old (scale-discarding) behaviour.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        eps: float = 1e-6,
        recover: bool = True,
        *,
        grades=None,
        layout: GradeLayout = None,
    ):
        """Sets up normalization.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Features.
            eps (float): Stability term.
            recover (bool): Whether to inject original scale into the scalar part.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        if eps <= 0:
            raise ValueError(f"eps must be positive, got {eps}")
        self.eps = eps
        self.recover = recover
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        self.weight = nn.Parameter(torch.ones(self.channels))
        self.bias = nn.Parameter(torch.zeros(self.channels))
        self.register_buffer("scalar_mask", self.storage.scalar_mask())
        if recover:
            self.norm_scale = nn.Parameter(torch.zeros(self.channels))
        else:
            self.register_buffer("norm_scale", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Normalizes energy, preserves direction, optionally recovers scale in grade-0.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Normalized input.
        """
        self.storage.validate_input(
            x,
            channels=self.channels,
            name="CliffordLayerNorm input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )
        channel_shape = (1,) * (x.ndim - 2) + (self.channels, 1)

        norm = x.norm(dim=-1, keepdim=True).clamp_min(self.eps)
        x_normalized = x / norm
        out = x_normalized * self.weight.view(channel_shape)

        g0 = self.scalar_mask
        if g0.device != x.device or g0.dtype != x.dtype:
            g0 = g0.to(device=x.device, dtype=x.dtype)
        out = out + self.bias.view(channel_shape) * g0

        if self.recover:
            log_norm = torch.log1p(norm)
            out = out + self.norm_scale.view(channel_shape) * log_norm * g0

        return out

`__init__(algebra, channels, eps=1e-06, recover=True, *, grades=None, layout=None)` ¶

Sets up normalization.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`channels`	`int`	Number of channels (features).	required
`eps`	`float`	Positive lower bound applied to the norm before division, for numerical stability.	`1e-06`
`recover`	`bool`	Whether to inject original scale into the scalar part.	`True`

Source code in layers/primitives/normalization.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    eps: float = 1e-6,
    recover: bool = True,
    *,
    grades=None,
    layout: GradeLayout = None,
):
    """Sets up normalization.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Features.
        eps (float): Stability term.
        recover (bool): Whether to inject original scale into the scalar part.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    if eps <= 0:
        raise ValueError(f"eps must be positive, got {eps}")
    self.eps = eps
    self.recover = recover
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    self.weight = nn.Parameter(torch.ones(self.channels))
    self.bias = nn.Parameter(torch.zeros(self.channels))
    self.register_buffer("scalar_mask", self.storage.scalar_mask())
    if recover:
        self.norm_scale = nn.Parameter(torch.zeros(self.channels))
    else:
        self.register_buffer("norm_scale", None)

`forward(x)` ¶

Normalizes energy, preserves direction, optionally recovers scale in grade-0.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input [Batch, Channels, Dim].	required

Returns:

Type	Description
`Tensor`	Normalized tensor with the same shape as the input.

Source code in layers/primitives/normalization.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Normalizes energy, preserves direction, optionally recovers scale in grade-0.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Normalized input.
    """
    self.storage.validate_input(
        x,
        channels=self.channels,
        name="CliffordLayerNorm input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )
    channel_shape = (1,) * (x.ndim - 2) + (self.channels, 1)

    norm = x.norm(dim=-1, keepdim=True).clamp_min(self.eps)
    x_normalized = x / norm
    out = x_normalized * self.weight.view(channel_shape)

    g0 = self.scalar_mask
    if g0.device != x.device or g0.dtype != x.dtype:
        g0 = g0.to(device=x.device, dtype=x.dtype)
    out = out + self.bias.view(channel_shape) * g0

    if self.recover:
        log_norm = torch.log1p(norm)
        out = out + self.norm_scale.view(channel_shape) * log_norm * g0

    return out

`BladeSelector` ¶

Bases: CliffordModule

Blade selector that filters insignificant components.

Learns a per-channel, per-blade gate that reweights components and suppresses less relevant ones.

Attributes:

Name	Type	Description
`weights`	`Parameter`	Gate logits [Channels, Dim].

Source code in layers/primitives/projection.py

class BladeSelector(CliffordModule):
    """Blade Selector. Filters insignificant components.

    Learns to weigh geometric grades, suppressing less relevant ones.

    Attributes:
        weights (nn.Parameter): Gate logits [Channels, Dim].
    """

    def __init__(self, algebra: CliffordAlgebra, channels: int, *, grades=None, layout: GradeLayout = None):
        """Sets up the selector.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Input features.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        self.weights = nn.Parameter(torch.Tensor(self.channels, self.lane_dim))

        self.reset_parameters()

    def reset_parameters(self):
        """Initialize logits so the selector starts as pass-through."""
        nn.init.zeros_(self.weights)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Gates the grades.

        The gate is ``2 * sigmoid(weights)`` so zero logits preserve the input.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Filtered input.
        """
        self.storage.validate_input(
            x,
            channels=self.channels,
            name="BladeSelector input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )
        gate_shape = (1,) * (x.ndim - 2) + (self.channels, self.lane_dim)
        gate = 2.0 * torch.sigmoid(self.weights).view(gate_shape)
        return x * gate

`__init__(algebra, channels, *, grades=None, layout=None)` ¶

Sets up the selector.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`channels`	`int`	Number of input channels.	required

Source code in layers/primitives/projection.py

def __init__(self, algebra: CliffordAlgebra, channels: int, *, grades=None, layout: GradeLayout = None):
    """Sets up the selector.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input features.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    self.weights = nn.Parameter(torch.Tensor(self.channels, self.lane_dim))

    self.reset_parameters()

`reset_parameters()` ¶

Initialize logits so the selector starts as pass-through.

Source code in layers/primitives/projection.py

def reset_parameters(self):
    """Initialize logits so the selector starts as pass-through."""
    nn.init.zeros_(self.weights)

`forward(x)` ¶

Gates the grades.

The gate is 2 * sigmoid(weights) so zero logits preserve the input.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input [Batch, Channels, Dim].	required

Returns:

Type	Description
`Tensor`	Filtered input.
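
The gating rule can be checked with a few scalar values. The sketch below uses only the standard library and hypothetical helper names (`sigmoid`, `gate`); it illustrates the formula `2 * sigmoid(weights)` and is not the layer's own code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(logit: float) -> float:
    """Gate value 2 * sigmoid(logit), which lies in the open interval (0, 2)."""
    return 2.0 * sigmoid(logit)


identity_gate = gate(0.0)   # zero logit: the input passes through unchanged
suppressed = gate(-4.0)     # strongly negative logit: component is damped
amplified = gate(4.0)       # strongly positive logit: component is nearly doubled
```

Because the gate is centred on 1 rather than 0.5, zero-initialized logits leave the input untouched, which matches the pass-through initialization in `reset_parameters()`.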

Source code in layers/primitives/projection.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Gates the grades.

    The gate is ``2 * sigmoid(weights)`` so zero logits preserve the input.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Filtered input.
    """
    self.storage.validate_input(
        x,
        channels=self.channels,
        name="BladeSelector input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )
    gate_shape = (1,) * (x.ndim - 2) + (self.channels, self.lane_dim)
    gate = 2.0 * torch.sigmoid(self.weights).view(gate_shape)
    return x * gate

Blocks¶

`GeometricProductAttention` ¶

Bases: CliffordModule

Multi-head attention using geometric product scoring.

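Standard attention: $\mathrm{score}(Q, K) = \langle Q, K \rangle / \sqrt{d}$ (scalar only)

GA attention:

$\mathrm{product} = Q_c \cdot \mathrm{reverse}(K_c)$ (geometric product per head-channel)

$\mathrm{score} = \left( \langle \mathrm{product} \rangle_0 + \lambda \, \lVert \langle \mathrm{product} \rangle_2 \rVert_F \right) / \sqrt{H_c \cdot \mathrm{dim}}$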

The grade-0 (scalar) part measures alignment (like dot product). The grade-2 (bivector) part measures relative orientation - novel.

Memory: materializing the naive pairwise tensor [B, H, L, L, H_c, D] is too expensive, so the layer chunks over L_q in blocks of BLOCK_SIZE to bound peak VRAM.

Attributes:

Name	Type	Description
`num_heads`	`int`	Number of attention heads.
`head_channels`	`int`	Channels per head.
`causal`	`bool`	If True, apply autoregressive causal mask.
`bivector_weight`	`float`	lambda_ - weight of bivector score component.
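
To make the score concrete, here is a minimal worked sketch in plain Python. It assumes a Euclidean Cl(2,0) algebra with basis blades indexed by bitmask (1, e1, e2, e12), and all helper names (`reorder_sign`, `geometric_product`, `ga_score`) are hypothetical. It evaluates the score through an explicit geometric product, whereas the layer itself uses precomputed sign tables.

```python
import math

N = 2          # assumed: Cl(2,0), Euclidean signature
D = 2 ** N     # 4 basis blades in bitmask order: 1, e1, e2, e12


def grade(blade: int) -> int:
    """Grade of a basis blade = number of set bits in its bitmask."""
    return bin(blade).count("1")


def reorder_sign(a: int, b: int) -> int:
    """Sign from sorting the basis vectors of blade a followed by blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def reverse(x):
    """Reversion: the grade-k part picks up the sign (-1)^(k(k-1)/2)."""
    return [(-1) ** (grade(i) * (grade(i) - 1) // 2) * v for i, v in enumerate(x)]


def geometric_product(x, y):
    """Dense geometric product; e_i * e_i = +1 in the assumed signature."""
    out = [0.0] * D
    for a, xa in enumerate(x):
        for b, yb in enumerate(y):
            out[a ^ b] += reorder_sign(a, b) * xa * yb
    return out


def ga_score(q_channels, k_channels, lam=0.5):
    """Scalar part plus lam times the Frobenius norm of the bivector part."""
    products = [geometric_product(q, reverse(k)) for q, k in zip(q_channels, k_channels)]
    s0 = sum(p[0] for p in products)
    s2 = math.sqrt(sum(p[r] ** 2 for p in products for r in range(D) if grade(r) == 2))
    return (s0 + lam * s2) / math.sqrt(len(q_channels) * D)


e1 = [0.0, 1.0, 0.0, 0.0]
e2 = [0.0, 0.0, 1.0, 0.0]
aligned = ga_score([e1], [e1])      # only the grade-0 term contributes
orthogonal = ga_score([e1], [e2])   # only the grade-2 term contributes
```

With one channel and `lam=0.5`, parallel vectors score through the scalar part alone, while orthogonal vectors still receive a nonzero score through the bivector norm. A plain dot product would give zero for the orthogonal pair.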

Source code in layers/blocks/attention.py

class GeometricProductAttention(CliffordModule):
    """Multi-head attention using geometric product scoring.

    Standard attention: score(Q, K) = <Q, K> / sqrt(d)  (scalar only)

    GA attention:
        product = Q_c * reverse(K_c)    (geometric product per head-channel)
        score   = (<product>_0 + lambda_ * ||<product>_2||_F) / sqrt(H_c * dim)

    The grade-0 (scalar) part measures alignment (like dot product).
    The grade-2 (bivector) part measures relative orientation - novel.

    Memory: naive [B, H, L, L, H_c, D] is too large. We chunk over L_q
    in blocks of BLOCK_SIZE to bound peak VRAM.

    Attributes:
        num_heads (int): Number of attention heads.
        head_channels (int): Channels per head.
        causal (bool): If True, apply autoregressive causal mask.
        bivector_weight (float): lambda_ - weight of bivector score component.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_heads: int,
        causal: bool = True,
        bivector_weight: float = 0.5,
        dropout: float = 0.0,
        score_blade_chunk_size: int = _G2_BLADE_CHUNK_SIZE,
        score_precompute_limit: int = _SCORE_PRECOMPUTE_LIMIT,
    ):
        """Sets up geometric product attention.

        Args:
            algebra: Clifford algebra instance.
            channels: Total number of multivector channels.
            num_heads: Number of attention heads.
            causal: Apply causal mask for autoregressive generation.
            bivector_weight: lambda_ weight on bivector score component.
            dropout: Dropout rate on attention weights.
            score_blade_chunk_size: Grade-2 output blades processed per dense
                chunk when exact dense scoring is used.
            score_precompute_limit: Maximum temporary ``K_g2`` elements allowed
                before exact dense scoring switches to chunked grade-2 blades.
        """
        super().__init__(algebra)
        assert channels % num_heads == 0, f"channels ({channels}) must be divisible by num_heads ({num_heads})"

        self.channels = channels
        self.num_heads = num_heads
        self.head_channels = channels // num_heads
        self.causal = causal
        self.bivector_weight = bivector_weight
        self.score_blade_chunk_size = max(1, int(score_blade_chunk_size))
        self.score_precompute_limit = max(0, int(score_precompute_limit))

        # Q, K, V projections operate on [B*L, channels, dim]
        self.q_proj = CliffordLinear(algebra, channels, channels)
        self.k_proj = CliffordLinear(algebra, channels, channels)
        self.v_proj = CliffordLinear(algebra, channels, channels)
        self.out_proj = CliffordLinear(algebra, channels, channels)

        self.attn_dropout = nn.Dropout(dropout) if dropout > 0.0 else None

        # Precompute bilinear score routes (replaces pairwise geometric product)
        self._precompute_score_tables()

    def _precompute_score_tables(self):
        """Precompute exact dense attention score routes."""
        alg = self.algebra
        D = alg.dim

        if not hasattr(alg, "gp_signs") or not hasattr(alg, "rev_signs"):
            raise ValueError("GeometricProductAttention currently requires dense CliffordAlgebra inputs.")

        # Grade-0 metric: metric_rev[a] = gp_signs[a, 0] * rev_signs[a]
        # gp_signs[a, 0] is the sign when A[a] * B[a] contributes to output blade 0
        metric_rev = alg.gp_signs[:, 0].float() * alg.rev_signs.float()
        self.register_buffer("_metric_rev", metric_rev)  # [D]

        g2_blades = [i for i in range(D) if bin(i).count("1") == 2]
        self.n_g2 = len(g2_blades)
        self.register_buffer("_g2_blades", torch.tensor(g2_blades, dtype=torch.long, device=alg.device))
        self.register_buffer("_basis_indices", torch.arange(D, dtype=torch.long, device=alg.device))

    def _compute_score(
        self,
        q_head: torch.Tensor,
        k_head: torch.Tensor,
    ) -> torch.Tensor:
        """Compute GA attention scores for one query block."""
        return self._compute_score_dense(q_head, k_head)

    def _compute_score_dense(self, q_head: torch.Tensor, k_head: torch.Tensor) -> torch.Tensor:
        """Exact dense score with automatic full/prechunked grade-2 routing."""
        B, H, Lq, Hc, D = q_head.shape
        Lk = k_head.shape[2]
        n_g2 = self.n_g2

        # == Grade-0 score ====================================================
        # <Q * rev(K)>_0 = Sum_c Sum_d  Q[c,d] * K[c,d] * metric_rev[d]
        # Implemented as a batched matrix multiply: [B,H,Lq,Hc*D] @ [B,H,Hc*D,Lk]
        q_weighted = q_head * self._metric_rev  # [B, H, Lq, Hc, D]
        q_flat = q_weighted.reshape(B, H, Lq, Hc * D)  # [B, H, Lq, Hc*D]
        k_flat = k_head.reshape(B, H, Lk, Hc * D)  # [B, H, Lk, Hc*D]
        score_g0 = torch.matmul(q_flat, k_flat.transpose(-2, -1))  # [B, H, Lq, Lk]

        # == Grade-2 score ====================================================
        # ||<Q * rev(K)>_2||_F = sqrt(Sum_c Sum_r (Sum_d Q[c,d]*k_g2[j,c,r,d])^2)
        if n_g2 > 0:
            q_2d = q_head.permute(0, 1, 3, 2, 4).reshape(B * H * Hc, Lq, D)

            full_k_g2_elements = B * H * Lk * Hc * n_g2 * D
            if full_k_g2_elements <= self.score_precompute_limit:
                score_g2_sq = self._dense_score_g2_precomputed(q_2d, k_head, B, H, Hc, Lq, Lk, D, n_g2)
            else:
                k_2d = k_head.permute(0, 1, 3, 2, 4).reshape(B * H * Hc, Lk, D)
                score_g2_sq = self._dense_score_g2_chunked(q_2d, k_2d, B, H, Hc, Lq, Lk, D, n_g2)
            score_g2 = score_g2_sq.sqrt()
        else:
            score_g2 = torch.zeros_like(score_g0)

        # Combined score
        scale = math.sqrt(self.head_channels * self.algebra.dim)
        return (score_g0 + self.bivector_weight * score_g2) / scale

    def _dense_score_g2_precomputed(self, q_2d, k_head, B, H, Hc, Lq, Lk, D, n_g2):
        """Dense grade-2 score using one full shifted-key materialization."""
        r_vals = self._g2_blades
        b_idx = self._basis_indices.unsqueeze(0) ^ r_vals.unsqueeze(1)
        rev_b = self.algebra.rev_signs[b_idx].to(dtype=k_head.dtype)
        gp_ar = self.algebra.gp_signs[:, r_vals].T.to(dtype=k_head.dtype)
        g2_sign = rev_b * gp_ar

        k_g2 = k_head[..., b_idx] * g2_sign
        k_g2_2d = k_g2.permute(0, 1, 3, 2, 4, 5).reshape(B * H * Hc, Lk * n_g2, D)
        comp = torch.bmm(q_2d, k_g2_2d.transpose(-2, -1))
        comp_sq = comp.reshape(B * H * Hc, Lq, Lk, n_g2).pow(2).sum(-1)
        return comp_sq.reshape(B, H, Hc, Lq, Lk).sum(2)

    def _dense_score_g2_chunked(self, q_2d, k_2d, B, H, Hc, Lq, Lk, D, n_g2):
        """Dense grade-2 score using bounded output-blade chunks."""
        score_g2_sq = q_2d.new_zeros(B, H, Lq, Lk)
        for start in range(0, n_g2, self.score_blade_chunk_size):
            end = min(start + self.score_blade_chunk_size, n_g2)
            r_vals = self._g2_blades[start:end]
            b_idx = self._basis_indices.unsqueeze(0) ^ r_vals.unsqueeze(1)
            rev_b = self.algebra.rev_signs[b_idx].to(dtype=k_2d.dtype)
            gp_ar = self.algebra.gp_signs[:, r_vals].T.to(dtype=k_2d.dtype)
            g2_sign = rev_b * gp_ar

            k_shifted = torch.index_select(k_2d, -1, b_idx.reshape(-1))
            k_shifted = k_shifted * g2_sign.reshape(-1)
            k_g2_2d = k_shifted.reshape(B * H * Hc, Lk * (end - start), D)
            comp = torch.bmm(q_2d, k_g2_2d.transpose(-2, -1))
            comp_sq = comp.reshape(B * H * Hc, Lq, Lk, end - start).pow(2).sum(-1)
            score_g2_sq = score_g2_sq + comp_sq.reshape(B, H, Hc, Lq, Lk).sum(2)
        return score_g2_sq

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor = None) -> torch.Tensor:
        """Computes geometric product attention.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded (ignored).

        Returns:
            Output multivectors [B, L, C, D].
        """
        B, L, C, D = x.shape

        # Project Q, K, V (CliffordLinear expects [B, C, D])
        x_flat = x.reshape(B * L, C, D)
        Q = self.q_proj(x_flat).reshape(B, L, C, D)
        K = self.k_proj(x_flat).reshape(B, L, C, D)
        V = self.v_proj(x_flat).reshape(B, L, C, D)

        H = self.num_heads
        Hc = self.head_channels

        # Reshape to [B, H, L, Hc, D]
        Q = Q.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)  # [B, H, L, Hc, D]
        K = K.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)
        V = V.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)

        # Build causal mask once [L, L]
        if self.causal:
            causal_mask = torch.triu(
                torch.ones(L, L, device=x.device, dtype=torch.bool), diagonal=1
            )  # True = masked (future)
        else:
            causal_mask = None

        # Chunked attention over query positions to bound memory
        output_chunks = []
        for q_start in range(0, L, _BLOCK_SIZE):
            q_end = min(q_start + _BLOCK_SIZE, L)

            Q_block = Q[:, :, q_start:q_end]  # [B, H, Lq, Hc, D]

            # Compute scores: [B, H, Lq, L]
            scores = self._compute_score(Q_block, K)

            # Apply causal mask
            if causal_mask is not None:
                mask_block = causal_mask[q_start:q_end, :]  # [Lq, L]
                scores = scores.masked_fill(mask_block.unsqueeze(0).unsqueeze(0), float("-inf"))

            # Apply key padding mask: True = padded -> -inf
            if key_padding_mask is not None:
                # key_padding_mask: [B, L] -> [B, 1, 1, L]
                scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))

            # Softmax + dropout
            attn_weights = F.softmax(scores, dim=-1)  # [B, H, Lq, L]
            if self.attn_dropout is not None:
                attn_weights = self.attn_dropout(attn_weights)

            # Aggregate values: sum_k attn[b,h,i,k] * V[b,h,k,Hc,D]
            # attn_weights: [B, H, Lq, L]
            # V:            [B, H, L,  Hc, D]
            # out:          [B, H, Lq, Hc, D]
            out_block = torch.einsum("bhij,bhjcd->bhicd", attn_weights, V)
            output_chunks.append(out_block)

        # Reassemble: [B, H, L, Hc, D]
        output = torch.cat(output_chunks, dim=2)

        # Merge heads back: [B, L, C, D]
        output = output.permute(0, 2, 1, 3, 4).reshape(B, L, C, D)

        # Output projection
        output = self.out_proj(output.reshape(B * L, C, D)).reshape(B, L, C, D)

        return output

`__init__(algebra, channels, num_heads, causal=True, bivector_weight=0.5, dropout=0.0, score_blade_chunk_size=_G2_BLADE_CHUNK_SIZE, score_precompute_limit=_SCORE_PRECOMPUTE_LIMIT)` ¶

Sets up geometric product attention.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	Clifford algebra instance.	required
`channels`	`int`	Total number of multivector channels.	required
`num_heads`	`int`	Number of attention heads.	required
`causal`	`bool`	Apply causal mask for autoregressive generation.	`True`
`bivector_weight`	`float`	lambda_ weight on bivector score component.	`0.5`
`dropout`	`float`	Dropout rate on attention weights.	`0.0`
`score_blade_chunk_size`	`int`	Grade-2 output blades processed per dense chunk when exact dense scoring is used.	`_G2_BLADE_CHUNK_SIZE`
`score_precompute_limit`	`int`	Maximum temporary `K_g2` elements allowed before exact dense scoring switches to chunked grade-2 blades.	`_SCORE_PRECOMPUTE_LIMIT`
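
The routing controlled by `score_precompute_limit` comes down to an element count. The sketch below reproduces that arithmetic in plain Python, assuming an n-dimensional algebra with `D = 2**n` blades and `C(n, 2)` grade-2 blades. The limit used here is an arbitrary illustrative value, not the library default, and the helper names are hypothetical.

```python
import math


def full_k_g2_elements(B: int, H: int, Lk: int, Hc: int, n: int) -> int:
    """Size of the temporary shifted-key tensor B * H * Lk * Hc * n_g2 * D."""
    D = 2 ** n               # number of basis blades
    n_g2 = math.comb(n, 2)   # blades with exactly two bits set
    return B * H * Lk * Hc * n_g2 * D


def uses_precomputed_path(B: int, H: int, Lk: int, Hc: int, n: int, limit: int) -> bool:
    """True when the full materialization fits within the limit."""
    return full_k_g2_elements(B, H, Lk, Hc, n) <= limit


def num_blade_chunks(n: int, chunk_size: int) -> int:
    """Loop iterations of the chunked path over grade-2 output blades."""
    return math.ceil(math.comb(n, 2) / max(1, chunk_size))


LIMIT = 1_000_000  # hypothetical limit, for illustration only
small = full_k_g2_elements(B=8, H=4, Lk=128, Hc=4, n=3)
large = full_k_g2_elements(B=8, H=4, Lk=128, Hc=4, n=5)
```

For the same batch and sequence sizes, moving from a 3-dimensional to a 5-dimensional algebra multiplies the temporary tensor by more than ten, which is the situation where the chunked path becomes useful.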

Source code in layers/blocks/attention.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_heads: int,
    causal: bool = True,
    bivector_weight: float = 0.5,
    dropout: float = 0.0,
    score_blade_chunk_size: int = _G2_BLADE_CHUNK_SIZE,
    score_precompute_limit: int = _SCORE_PRECOMPUTE_LIMIT,
):
    """Sets up geometric product attention.

    Args:
        algebra: Clifford algebra instance.
        channels: Total number of multivector channels.
        num_heads: Number of attention heads.
        causal: Apply causal mask for autoregressive generation.
        bivector_weight: lambda_ weight on bivector score component.
        dropout: Dropout rate on attention weights.
        score_blade_chunk_size: Grade-2 output blades processed per dense
            chunk when exact dense scoring is used.
        score_precompute_limit: Maximum temporary ``K_g2`` elements allowed
            before exact dense scoring switches to chunked grade-2 blades.
    """
    super().__init__(algebra)
    assert channels % num_heads == 0, f"channels ({channels}) must be divisible by num_heads ({num_heads})"

    self.channels = channels
    self.num_heads = num_heads
    self.head_channels = channels // num_heads
    self.causal = causal
    self.bivector_weight = bivector_weight
    self.score_blade_chunk_size = max(1, int(score_blade_chunk_size))
    self.score_precompute_limit = max(0, int(score_precompute_limit))

    # Q, K, V projections operate on [B*L, channels, dim]
    self.q_proj = CliffordLinear(algebra, channels, channels)
    self.k_proj = CliffordLinear(algebra, channels, channels)
    self.v_proj = CliffordLinear(algebra, channels, channels)
    self.out_proj = CliffordLinear(algebra, channels, channels)

    self.attn_dropout = nn.Dropout(dropout) if dropout > 0.0 else None

    # Precompute bilinear score routes (replaces pairwise geometric product)
    self._precompute_score_tables()

`forward(x, key_padding_mask=None)` ¶

Computes geometric product attention.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input multivectors [B, L, C, D].	required
`key_padding_mask`	`Tensor`	Optional [B, L] bool mask where True = padded (ignored).	`None`

Returns:

Type	Description
`Tensor`	Output multivectors [B, L, C, D].
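
Chunking over query positions does not change the result, because softmax normalizes each query row independently. The following plain-Python sketch illustrates this with scalar values and a hypothetical block size. It is a simplification of the layer, which aggregates multivector values.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, q_offset=0, causal=True):
    """Attention for a block of query rows starting at position q_offset."""
    out = []
    for i, row in enumerate(scores):
        q_pos = q_offset + i
        # Future keys (j > q_pos) are masked with -inf before the softmax.
        masked = [s if (not causal or j <= q_pos) else float("-inf")
                  for j, s in enumerate(row)]
        weights = softmax(masked)
        out.append(sum(w * v for w, v in zip(weights, values)))
    return out


L = 5
scores = [[0.1 * (i + 1) * (j + 1) for j in range(L)] for i in range(L)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

full = attend(scores, values)

BLOCK = 2  # hypothetical block size
chunked = []
for start in range(0, L, BLOCK):
    chunked += attend(scores[start:start + BLOCK], values, q_offset=start)
```

Only the query axis is split. Every block still sees all keys, so peak memory scales with the block size rather than with the full sequence length on that axis.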

Source code in layers/blocks/attention.py

def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor = None) -> torch.Tensor:
    """Computes geometric product attention.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded (ignored).

    Returns:
        Output multivectors [B, L, C, D].
    """
    B, L, C, D = x.shape

    # Project Q, K, V (CliffordLinear expects [B, C, D])
    x_flat = x.reshape(B * L, C, D)
    Q = self.q_proj(x_flat).reshape(B, L, C, D)
    K = self.k_proj(x_flat).reshape(B, L, C, D)
    V = self.v_proj(x_flat).reshape(B, L, C, D)

    H = self.num_heads
    Hc = self.head_channels

    # Reshape to [B, H, L, Hc, D]
    Q = Q.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)  # [B, H, L, Hc, D]
    K = K.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)
    V = V.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)

    # Build causal mask once [L, L]
    if self.causal:
        causal_mask = torch.triu(
            torch.ones(L, L, device=x.device, dtype=torch.bool), diagonal=1
        )  # True = masked (future)
    else:
        causal_mask = None

    # Chunked attention over query positions to bound memory
    output_chunks = []
    for q_start in range(0, L, _BLOCK_SIZE):
        q_end = min(q_start + _BLOCK_SIZE, L)

        Q_block = Q[:, :, q_start:q_end]  # [B, H, Lq, Hc, D]

        # Compute scores: [B, H, Lq, L]
        scores = self._compute_score(Q_block, K)

        # Apply causal mask
        if causal_mask is not None:
            mask_block = causal_mask[q_start:q_end, :]  # [Lq, L]
            scores = scores.masked_fill(mask_block.unsqueeze(0).unsqueeze(0), float("-inf"))

        # Apply key padding mask: True = padded -> -inf
        if key_padding_mask is not None:
            # key_padding_mask: [B, L] -> [B, 1, 1, L]
            scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))

        # Softmax + dropout
        attn_weights = F.softmax(scores, dim=-1)  # [B, H, Lq, L]
        if self.attn_dropout is not None:
            attn_weights = self.attn_dropout(attn_weights)

        # Aggregate values: sum_k attn[b,h,i,k] * V[b,h,k,Hc,D]
        # attn_weights: [B, H, Lq, L]
        # V:            [B, H, L,  Hc, D]
        # out:          [B, H, Lq, Hc, D]
        out_block = torch.einsum("bhij,bhjcd->bhicd", attn_weights, V)
        output_chunks.append(out_block)

    # Reassemble: [B, H, L, Hc, D]
    output = torch.cat(output_chunks, dim=2)

    # Merge heads back: [B, L, C, D]
    output = output.permute(0, 2, 1, 3, 4).reshape(B, L, C, D)

    # Output projection
    output = self.out_proj(output.reshape(B * L, C, D)).reshape(B, L, C, D)

    return output

`MultiRotorFFN` ¶

Bases: CliffordModule

Embedded Geometric Toolbox - Feed-Forward Network via rotor superposition.

Standard transformers use: Linear -> GELU -> Linear. This replaces that with:

CliffordLinear(expand) -> CliffordLayerNorm
    -> MultiRotorLayer(K rotors) -> GeometricGELU
    -> CliffordLinear(contract) -> BladeSelector

The expand step lifts x into a toolbox subspace of ffn_mult * channels channels. MultiRotorLayer applies K parallel rotors, each exploring a different rotation plane; this superposition is the nonlinearity, not just a scalar gate. The contract step projects back to the original channel count.

Designed as a standalone module so it can be reused in other tasks (md17, pdbbind, etc.) beyond the language model.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`channels`	`int`	Input/output channel count.	required
`ffn_mult`	`int`	Expansion factor (ffn_channels = channels * ffn_mult).	`4`
`num_rotors`	`int`	Number of parallel rotors K in the toolbox.	`8`
`use_rotor_backend`	`bool`	Use RotorGadget backend for CliffordLinear.	`False`

Input/Output shape: [B, C, D] where D = algebra.dim.
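
As a quick reference, the shape flow through the pipeline can be traced without building any layers. This sketch uses a hypothetical helper, `multi_rotor_ffn_shapes`, and only mirrors the shape comments in the source listing.

```python
def multi_rotor_ffn_shapes(batch: int, channels: int, dim: int, ffn_mult: int = 4):
    """Return (stage, shape) pairs for the expand -> ... -> gate pipeline."""
    ffn_channels = channels * ffn_mult
    wide = (batch, ffn_channels, dim)    # expanded toolbox width
    narrow = (batch, channels, dim)      # original channel count
    return [
        ("expand", wide),
        ("norm", wide),
        ("toolbox", wide),
        ("act", wide),
        ("contract", narrow),
        ("gate", narrow),
    ]


trace = multi_rotor_ffn_shapes(batch=2, channels=16, dim=8)
```

Only the channel axis changes width. The blade axis `D` stays fixed at `algebra.dim` throughout.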

Source code in layers/blocks/multi_rotor_ffn.py

class MultiRotorFFN(CliffordModule):
    """Embedded Geometric Toolbox - Feed-Forward Network via rotor superposition.

    Standard transformers use: Linear -> GELU -> Linear.
    This replaces that with:

        CliffordLinear(expand) -> CliffordLayerNorm
            -> MultiRotorLayer(K rotors) -> GeometricGELU
            -> CliffordLinear(contract) -> BladeSelector

    The expand step lifts x into a ``ffn_mult x channels`` toolbox subspace.
    ``MultiRotorLayer`` applies K parallel rotors, each exploring a different
    rotation plane - this IS the nonlinearity, not just a scalar gate.
    The contract step projects back to the original channel count.

    Designed as a standalone module so it can be reused in other tasks
    (md17, pdbbind, etc.) beyond the language model.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input/output channel count.
        ffn_mult (int): Expansion factor (ffn_channels = channels * ffn_mult).
        num_rotors (int): Number of parallel rotors K in the toolbox.
        use_rotor_backend (bool): Use RotorGadget backend for CliffordLinear.

    Input/Output shape: ``[B, C, D]`` where D = algebra.dim.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        ffn_mult: int = 4,
        num_rotors: int = 8,
        use_rotor_backend: bool = False,
    ):
        super().__init__(algebra)
        self.channels = channels
        ffn_channels = channels * ffn_mult
        backend = "rotor" if use_rotor_backend else "traditional"

        self.expand = CliffordLinear(algebra, channels, ffn_channels, backend=backend)
        self.norm = CliffordLayerNorm(algebra, ffn_channels)
        self.toolbox = MultiRotorLayer(algebra, ffn_channels, num_rotors)
        self.act = GeometricGELU(algebra, channels=ffn_channels)
        self.contract = CliffordLinear(algebra, ffn_channels, channels, backend=backend)
        self.gate = BladeSelector(algebra, channels)

    def forward(self, x) -> torch.Tensor:
        """Applies the geometric toolbox FFN.

        Args:
            x (torch.Tensor): Input ``[B, C, D]``.

        Returns:
            torch.Tensor: Output ``[B, C, D]``.
        """
        h = self.expand(x)  # [B, ffn_channels, D]
        h = self.norm(h)  # [B, ffn_channels, D]
        h = self.toolbox(h)  # [B, ffn_channels, D]  - K-rotor superposition
        h = self.act(h)  # [B, ffn_channels, D]
        h = self.contract(h)  # [B, channels, D]
        h = self.gate(h)  # [B, channels, D]      - per-blade gating
        return h

`forward(x)` ¶

Applies the geometric toolbox FFN.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input `[B, C, D]`.	required

Returns:

Type	Description
`Tensor`	Output `[B, C, D]`.

Source code in layers/blocks/multi_rotor_ffn.py

def forward(self, x) -> torch.Tensor:
    """Applies the geometric toolbox FFN.

    Args:
        x (torch.Tensor): Input ``[B, C, D]``.

    Returns:
        torch.Tensor: Output ``[B, C, D]``.
    """
    h = self.expand(x)  # [B, ffn_channels, D]
    h = self.norm(h)  # [B, ffn_channels, D]
    h = self.toolbox(h)  # [B, ffn_channels, D]  - K-rotor superposition
    h = self.act(h)  # [B, ffn_channels, D]
    h = self.contract(h)  # [B, channels, D]
    h = self.gate(h)  # [B, channels, D]      - per-blade gating
    return h

`GeometricTransformerBlock` ¶

Bases: CliffordModule

Modular Geometric Transformer block.

Architecture:

1. Pre-norm
2. Geometric Attention (Standard or Entropy-Gated)
3. Residual connection
4. Pre-norm
5. Multi-Rotor FFN
6. Residual connection
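
The six steps form two pre-norm residual sub-layers. The sketch below shows that wiring with toy callables on plain floats. The stand-ins are hypothetical, and the real sub-layers operate on `[B, L, C, D]` multivector tensors.

```python
def pre_norm_block(x, norm1, attn, norm2, ffn):
    """Two pre-norm residual sub-layers: attention, then feed-forward."""
    x = x + attn(norm1(x))   # steps 1-3: norm, attention, residual
    x = x + ffn(norm2(x))    # steps 4-6: norm, FFN, residual
    return x


def identity(v):
    """Stand-in for a normalization layer."""
    return v


result = pre_norm_block(
    1.0,
    norm1=identity,
    attn=lambda v: 2.0 * v,   # toy attention
    norm2=identity,
    ffn=lambda v: v + 1.0,    # toy feed-forward
)
```

Normalization is applied only on the branch input, so the residual path carries the unnormalized signal from one sub-layer to the next.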

Source code in layers/blocks/transformer.py

class GeometricTransformerBlock(CliffordModule):
    """Modular Geometric Transformer block.

    Architecture:
    1. Pre-norm
    2. Geometric Attention (Standard or Entropy-Gated)
    3. Residual connection
    4. Pre-norm
    5. Multi-Rotor FFN
    6. Residual connection
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_heads: int = 4,
        num_rotors: int = 8,
        dropout: float = 0.1,
        use_entropy_gating: bool = False,
        eta: float = 1.5,
        H_base: float = 0.5,
    ):
        """Initializes the Geometric Transformer Block.

        Args:
            algebra: Clifford algebra instance.
            channels: Total multivector channels.
            num_heads: Number of attention heads.
            num_rotors: Number of rotors in the FFN.
            dropout: Dropout rate.
            use_entropy_gating: If True, uses EntropyGatedAttention.
            eta: Gating multiplier for entropy attention.
            H_base: Base entropy threshold.
        """
        super().__init__(algebra)
        self.use_entropy_gating = use_entropy_gating
        self.norm1 = CliffordLayerNorm(algebra, channels)

        if use_entropy_gating:
            self.attn = EntropyGatedAttention(algebra, channels, num_heads, eta=eta, H_base=H_base)
        else:
            self.attn = GeometricProductAttention(algebra, channels, num_heads, causal=False, dropout=dropout)

        self.norm2 = CliffordLayerNorm(algebra, channels)

        # MultiRotorFFN is imported locally from multi_rotor_ffn.py
        from .multi_rotor_ffn import MultiRotorFFN

        self.ffn = MultiRotorFFN(algebra, channels, num_rotors=num_rotors)

    def forward(
        self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_state: bool = False
    ) -> torch.Tensor:
        """Forward pass through the transformer block.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded.
            return_state: If True, returns intermediate entropy/gating states.

        Returns:
            Processed multivectors [B, L, C, D] (and optionally intermediate states).
        """
        B, L, C, D = x.shape

        # 1. Attention path
        res = x
        x_n = self.norm1(x.reshape(B * L, C, D)).reshape(B, L, C, D)

        if self.use_entropy_gating and return_state:
            attn_out, H, lambda_dyn = self.attn(x_n, key_padding_mask=key_padding_mask, return_gating=True)
        else:
            attn_out = self.attn(x_n, key_padding_mask=key_padding_mask)
            H, lambda_dyn = None, None

        x = res + attn_out

        # 2. FFN path
        res = x
        x_n = self.norm2(x.reshape(B * L, C, D)).reshape(B, L, C, D)
        f_out = self.ffn(x_n.reshape(B * L, C, D)).reshape(B, L, C, D)
        x = res + f_out

        if return_state:
            return x, H, lambda_dyn
        return x

`__init__(algebra, channels, num_heads=4, num_rotors=8, dropout=0.1, use_entropy_gating=False, eta=1.5, H_base=0.5)` ¶

Initializes the Geometric Transformer Block.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	Clifford algebra instance.	required
`channels`	`int`	Total multivector channels.	required
`num_heads`	`int`	Number of attention heads.	`4`
`num_rotors`	`int`	Number of rotors in the FFN.	`8`
`dropout`	`float`	Dropout rate.	`0.1`
`use_entropy_gating`	`bool`	If True, uses EntropyGatedAttention.	`False`
`eta`	`float`	Gating multiplier for entropy attention.	`1.5`
`H_base`	`float`	Base entropy threshold.	`0.5`

Source code in layers/blocks/transformer.py

def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_heads: int = 4,
    num_rotors: int = 8,
    dropout: float = 0.1,
    use_entropy_gating: bool = False,
    eta: float = 1.5,
    H_base: float = 0.5,
):
    """Initializes the Geometric Transformer Block.

    Args:
        algebra: Clifford algebra instance.
        channels: Total multivector channels.
        num_heads: Number of attention heads.
        num_rotors: Number of rotors in the FFN.
        dropout: Dropout rate.
        use_entropy_gating: If True, uses EntropyGatedAttention.
        eta: Gating multiplier for entropy attention.
        H_base: Base entropy threshold.
    """
    super().__init__(algebra)
    self.use_entropy_gating = use_entropy_gating
    self.norm1 = CliffordLayerNorm(algebra, channels)

    if use_entropy_gating:
        self.attn = EntropyGatedAttention(algebra, channels, num_heads, eta=eta, H_base=H_base)
    else:
        self.attn = GeometricProductAttention(algebra, channels, num_heads, causal=False, dropout=dropout)

    self.norm2 = CliffordLayerNorm(algebra, channels)

    # MultiRotorFFN is imported locally from multi_rotor_ffn.py
    from .multi_rotor_ffn import MultiRotorFFN

    self.ffn = MultiRotorFFN(algebra, channels, num_rotors=num_rotors)

`forward(x, key_padding_mask=None, return_state=False)` ¶

Forward pass through the transformer block.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input multivectors [B, L, C, D].	required
`key_padding_mask`	`Tensor`	Optional [B, L] bool mask where True = padded.	`None`
`return_state`	`bool`	If True, returns intermediate entropy/gating states.	`False`

Returns:

Type	Description
`Tensor`	Processed multivectors [B, L, C, D] (and optionally intermediate states).

Source code in layers/blocks/transformer.py

def forward(
    self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_state: bool = False
) -> torch.Tensor:
    """Forward pass through the transformer block.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded.
        return_state: If True, returns intermediate entropy/gating states.

    Returns:
        Processed multivectors [B, L, C, D] (and optionally intermediate states).
    """
    B, L, C, D = x.shape

    # 1. Attention path
    res = x
    x_n = self.norm1(x.reshape(B * L, C, D)).reshape(B, L, C, D)

    if self.use_entropy_gating and return_state:
        attn_out, H, lambda_dyn = self.attn(x_n, key_padding_mask=key_padding_mask, return_gating=True)
    else:
        attn_out = self.attn(x_n, key_padding_mask=key_padding_mask)
        H, lambda_dyn = None, None

    x = res + attn_out

    # 2. FFN path
    res = x
    x_n = self.norm2(x.reshape(B * L, C, D)).reshape(B, L, C, D)
    f_out = self.ffn(x_n.reshape(B * L, C, D)).reshape(B, L, C, D)
    x = res + f_out

    if return_state:
        return x, H, lambda_dyn
    return x

Adapters¶

`MultivectorEmbedding` ¶

Bases: CliffordModule

Token embedding as multivectors.

Each token maps to a [channels, dim] multivector. Only the grade-1 (vector) subspace is initialized and all other grades start at zero, so semantic content starts as directed quantities before rotors act on them.

Attributes:

Name	Type	Description
`vocab_size`	`int`	Number of tokens.
`channels`	`int`	Number of multivector channels.
`embedding`	`Embedding`	Underlying embedding table.
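
The grade-1 initialization relies on bitmask blade indexing: a basis blade has grade 1 exactly when its index has a single bit set. The sketch below lists which columns of the flat `channels * dim` embedding table are non-zero at initialization. The helper name `grade1_flat_indices` is hypothetical.

```python
def grade1_flat_indices(channels: int, dim: int):
    """Columns of the flat embedding table that hold grade-1 components."""
    grade1 = [i for i in range(dim) if bin(i).count("1") == 1]
    return [ch * dim + idx for ch in range(channels) for idx in grade1]


# A 3-dimensional algebra has dim = 8 blades; indices 1, 2, 4 are e1, e2, e3.
cols = grade1_flat_indices(channels=2, dim=8)
```

Each channel contributes the same blade pattern, offset by `ch * dim`, so three of every eight columns are initialized in this example.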

Source code in layers/adapters/embedding.py

class MultivectorEmbedding(CliffordModule):
    """Token embedding as multivectors.

    Each token maps to a [channels, dim] multivector. Initializes
    content in grade-1 (vector) subspace only - semantic content
    starts as directed quantities before rotors act on them.

    Attributes:
        vocab_size (int): Number of tokens.
        channels (int): Number of multivector channels.
        embedding (nn.Embedding): Underlying embedding table.
    """

    def __init__(self, algebra: CliffordAlgebra, vocab_size: int, channels: int):
        """Sets up the multivector embedding.

        Args:
            algebra: Clifford algebra instance.
            vocab_size: Vocabulary size.
            channels: Number of multivector channels per token.
        """
        super().__init__(algebra)
        self.vocab_size = vocab_size
        self.channels = channels

        # Single flat embedding: vocab_size -> channels * dim
        self.embedding = nn.Embedding(vocab_size, channels * algebra.dim)
        self._init_grade1()

    def _init_grade1(self):
        """Initializes only grade-1 components; zeros out all others."""
        with torch.no_grad():
            dim = self.algebra.dim
            channels = self.channels

            # Build grade-1 mask (indices with exactly 1 bit set)
            grade1_flat = []
            for i in range(dim):
                if bin(i).count("1") == 1:
                    grade1_flat.append(i)

            # Zero everything
            self.embedding.weight.zero_()

            # Fill grade-1 slots with small normal values
            for ch in range(channels):
                for idx in grade1_flat:
                    flat_idx = ch * dim + idx
                    self.embedding.weight[:, flat_idx].normal_(std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Maps token ids to multivector embeddings.

        Args:
            token_ids: Token indices [B, L].

        Returns:
            Multivector embeddings [B, L, channels, dim].
        """
        B, L = token_ids.shape
        flat = self.embedding(token_ids)  # [B, L, channels * dim]
        return flat.reshape(B, L, self.channels, self.algebra.dim)
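
The grade-1 selection in `_init_grade1` relies on a bitmask convention: a basis blade's grade equals the number of set bits in its index. A minimal, dependency-free sketch of that index arithmetic follows. The helper names `grade_indices` and `grade1_flat_columns` are hypothetical, and it assumes `algebra.dim == 2 ** n` for an algebra with `n` basis vectors.

```python
def grade_indices(n: int, grade: int) -> list[int]:
    """Blade indices of the given grade, assuming bitmask-ordered blades."""
    return [i for i in range(2 ** n) if bin(i).count("1") == grade]


def grade1_flat_columns(n: int, channels: int) -> list[int]:
    """Columns of the flat [vocab_size, channels * dim] table that hold grade-1 slots."""
    dim = 2 ** n  # assumption: algebra.dim == 2 ** n
    return [ch * dim + idx for ch in range(channels) for idx in grade_indices(n, 1)]


# 3 basis vectors -> dim = 8; the vector blades sit at indices 1, 2 and 4.
print(grade_indices(3, 1))        # [1, 2, 4]
# With 2 channels the second channel is offset by dim = 8.
print(grade1_flat_columns(3, 2))  # [1, 2, 4, 9, 10, 12]
```

Every other column of the embedding table stays at zero after initialization, which is what keeps the initial embeddings purely vector-valued.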

`__init__(algebra, vocab_size, channels)` ¶

Sets up the multivector embedding.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	Clifford algebra instance.	required
`vocab_size`	`int`	Vocabulary size.	required
`channels`	`int`	Number of multivector channels per token.	required

Source code in layers/adapters/embedding.py

def __init__(self, algebra: CliffordAlgebra, vocab_size: int, channels: int):
    """Sets up the multivector embedding.

    Args:
        algebra: Clifford algebra instance.
        vocab_size: Vocabulary size.
        channels: Number of multivector channels per token.
    """
    super().__init__(algebra)
    self.vocab_size = vocab_size
    self.channels = channels

    # Single flat embedding: vocab_size -> channels * dim
    self.embedding = nn.Embedding(vocab_size, channels * algebra.dim)
    self._init_grade1()

`forward(token_ids)` ¶

Maps token ids to multivector embeddings.

Parameters:

Name	Type	Description	Default
`token_ids`	`Tensor`	Token indices [B, L].	required

Returns:

Type	Description
`Tensor`	Multivector embeddings [B, L, channels, dim].

Source code in layers/adapters/embedding.py

def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
    """Maps token ids to multivector embeddings.

    Args:
        token_ids: Token indices [B, L].

    Returns:
        Multivector embeddings [B, L, channels, dim].
    """
    B, L = token_ids.shape
    flat = self.embedding(token_ids)  # [B, L, channels * dim]
    return flat.reshape(B, L, self.channels, self.algebra.dim)

`MotherEmbedding` ¶

Bases: CliffordModule

Embeds local feature groups into a canonical Mother Algebra with Procrustes Alignment.

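Applies a fixed alignment matrix (R_fixed, a rotor proxy that defaults to the identity) to the input features to rotate them into a shared reference frame before they are projected into multivector channels, effectively aligning disparate geometric manifolds.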

Source code in layers/adapters/mother.py

class MotherEmbedding(CliffordModule):
    """Embeds local feature groups into a canonical Mother Algebra with Procrustes Alignment.

    Uses fixed rotors (R_fixed) to rotate individual channel vectors into a shared
    reference frame, effectively aligning disparate geometric manifolds.
    """

    def __init__(self, algebra: CliffordAlgebra, input_dim: int, channels: int, U: float = 0.0, V: torch.Tensor = None):
        """Initializes the Mother Embedding.

        Args:
            algebra: Clifford algebra instance.
            input_dim: Dimension of the input features.
            channels: Number of multivector channels.
            U: Geometric uncertainty index for manifold suppression.
            V: Fixed rotor proxy for Procrustes alignment (input_dim x input_dim).
        """
        super().__init__(algebra)
        self.channels = channels

        # Procrustes Alignment Matrix (Fixed Rotor Proxy)
        if V is None:
            V = torch.eye(input_dim)
        self.register_buffer("R_fixed", V)

        # Up-cast to Mother Algebra multivector channels
        self.linear = nn.Linear(input_dim, channels * algebra.dim)
        self.norm = CliffordLayerNorm(algebra, channels)

        # Pre-condition LayerNorm scale with Uncertainty Index
        with torch.no_grad():
            if hasattr(self.norm, "weight"):
                # Suppress highly uncertain (twisted) manifolds initially
                scale = 1.0 / (1.0 + U)
                self.norm.weight.data.fill_(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Projects input into the aligned mother manifold.

        Args:
            x: Input features [B, input_dim].

        Returns:
            Aligned multivectors [B, channels, dim].
        """
        # 1. Apply Geometric Procrustes Alignment
        if self.R_fixed is not None:
            x = x @ self.R_fixed.T

        # 2. Mother Projection
        c = self.linear(x).view(-1, self.channels, self.algebra.dim)
        return self.norm(c)
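
The two numeric ingredients of `MotherEmbedding` can be illustrated without the library: the alignment step `x @ R_fixed.T` and the initial norm scale `1 / (1 + U)`. The sketch below is only an illustration under stated assumptions. The helpers `align` and `initial_norm_scale` are hypothetical names, and `V` is assumed to be a 2D rotation matrix, although the layer itself accepts any `input_dim x input_dim` tensor.

```python
import math


def align(x: list[float], V: list[list[float]]) -> list[float]:
    """Row-vector form of x @ V.T, i.e. out[j] = sum_k x[k] * V[j][k]."""
    return [sum(xk * vjk for xk, vjk in zip(x, row)) for row in V]


def initial_norm_scale(U: float) -> float:
    """Initial LayerNorm weight used to damp uncertain inputs."""
    return 1.0 / (1.0 + U)


# Assumption: V is a rotation by 90 degrees (an orthogonal matrix).
theta = math.pi / 2
V = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

x = [1.0, 0.0]
y = align(x, V)  # approximately [0.0, 1.0]; the length of x is preserved

for U in (0.0, 1.0, 3.0):
    print(U, initial_norm_scale(U))  # 1.0, 0.5, 0.25
```

With an orthogonal `V` the alignment changes only the orientation of the input, while larger values of `U` shrink the initial normalization weight toward zero.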

`__init__(algebra, input_dim, channels, U=0.0, V=None)` ¶

Initializes the Mother Embedding.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	Clifford algebra instance.	required
`input_dim`	`int`	Dimension of the input features.	required
`channels`	`int`	Number of multivector channels.	required
`U`	`float`	Geometric uncertainty index for manifold suppression.	`0.0`
`V`	`Tensor`	Fixed rotor proxy for Procrustes alignment (input_dim x input_dim).	`None`

Source code in layers/adapters/mother.py

def __init__(self, algebra: CliffordAlgebra, input_dim: int, channels: int, U: float = 0.0, V: torch.Tensor = None):
    """Initializes the Mother Embedding.

    Args:
        algebra: Clifford algebra instance.
        input_dim: Dimension of the input features.
        channels: Number of multivector channels.
        U: Geometric uncertainty index for manifold suppression.
        V: Fixed rotor proxy for Procrustes alignment (input_dim x input_dim).
    """
    super().__init__(algebra)
    self.channels = channels

    # Procrustes Alignment Matrix (Fixed Rotor Proxy)
    if V is None:
        V = torch.eye(input_dim)
    self.register_buffer("R_fixed", V)

    # Up-cast to Mother Algebra multivector channels
    self.linear = nn.Linear(input_dim, channels * algebra.dim)
    self.norm = CliffordLayerNorm(algebra, channels)

    # Pre-condition LayerNorm scale with Uncertainty Index
    with torch.no_grad():
        if hasattr(self.norm, "weight"):
            # Suppress highly uncertain (twisted) manifolds initially
            scale = 1.0 / (1.0 + U)
            self.norm.weight.data.fill_(scale)

`forward(x)` ¶

Projects input into the aligned mother manifold.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input features [B, input_dim].	required

Returns:

Type	Description
`Tensor`	Aligned multivectors [B, channels, dim].

Source code in layers/adapters/mother.py

def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Projects input into the aligned mother manifold.

    Args:
        x: Input features [B, input_dim].

    Returns:
        Aligned multivectors [B, channels, dim].
    """
    # 1. Apply Geometric Procrustes Alignment
    if self.R_fixed is not None:
        x = x @ self.R_fixed.T

    # 2. Mother Projection
    c = self.linear(x).view(-1, self.channels, self.algebra.dim)
    return self.norm(c)

`EntropyGatedAttention` ¶

Bases: CliffordModule

Dynamic geometric attention governed by bivector information entropy.

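The Shannon entropy H of the bivector energy distribution over each sequence sets a per-sequence gate lambda_dyn = eta * sigmoid(H - H_base), which scales the grade-2 (bivector) components before attention. The intent is that segments with high bivector entropy (disordered phase states) are "stiffened" or suppressed, allowing only coherent, synchronized states to propagate.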

Source code in layers/adapters/mother.py

class EntropyGatedAttention(CliffordModule):
    """Dynamic geometric attention governed by bivector information entropy.

    Segments with high bivector entropy (disordered phase states) are "stiffened"
    or suppressed, allowing only coherent, synchronized states to propagate.
    """

    def __init__(self, algebra: CliffordAlgebra, channels: int, num_heads: int, eta: float = 1.0, H_base: float = 0.5):
        """Initializes Entropy-Gated Attention.

        Args:
            algebra: Clifford algebra instance.
            channels: Total multivector channels.
            num_heads: Number of attention heads.
            eta: Gating multiplier.
            H_base: Base entropy threshold.
        """
        super().__init__(algebra)
        self.channels = channels
        self.eta = eta
        self.H_base = H_base
        self.base_attention = GeometricProductAttention(algebra, channels, num_heads, causal=False)

        # Cache bivector indices and float mask for compile-friendly gating
        mask = self.algebra.grade_masks[2]
        self.register_buffer("g2_idx", mask.nonzero(as_tuple=True)[0])
        self.register_buffer("_g2_float_mask", mask.float())

    def forward(
        self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_gating: bool = False
    ) -> torch.Tensor:
        """Applies entropy-gated geometric attention.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded.
            return_gating: If True, returns entropy and gating values.

        Returns:
            Attended multivectors [B, L, C, D].
        """
        # 1. Calculate Information Entropy of Bivector Energy
        # Summing across multivector components (g2_idx) and across channels (dim 2)
        # x: [B, L, C, D]
        g2_energy = (x[..., self.g2_idx] ** 2).sum(dim=(-1, -2))  # [B, L]

        # Mask padded positions before entropy calc
        if key_padding_mask is not None:
            g2_energy = g2_energy.masked_fill(key_padding_mask, 0.0)

        # Normalize to probability distribution over sequence
        p = g2_energy / (g2_energy.sum(dim=1, keepdim=True) + 1e-8)

        # Shannon Entropy H per batch [B]
        H = -(p * torch.log(p + 1e-8)).sum(dim=1)

        # 2. Base-Adjusted Gating Function
        lambda_dyn = self.eta * torch.sigmoid(H - self.H_base)  # [B]

        # 3. Apply dynamic geometric stiffness
        # Scale the rotational components (bivectors)
        lambda_view = lambda_dyn.view(-1, 1, 1, 1)

        g2_mask = self._g2_float_mask.to(dtype=x.dtype)
        scale = 1.0 + (lambda_view - 1.0) * g2_mask  # [B, 1, 1, D]
        x_gated = x * scale

        out = self.base_attention(x_gated, key_padding_mask=key_padding_mask)

        if return_gating:
            return out, H, lambda_dyn
        return out
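
The gating arithmetic in `forward` can be traced by hand for a single sequence. The following is a minimal pure-Python sketch, not the layer itself. The helpers `entropy_gate` and `gated_scale` are hypothetical names, and the 4-component mask assumes a 2D algebra whose only bivector sits at the last index.

```python
import math


def entropy_gate(g2_energy, eta=1.0, H_base=0.5, eps=1e-8):
    """Return (H, lambda_dyn) for one sequence of per-position bivector energies."""
    total = sum(g2_energy) + eps
    p = [e / total for e in g2_energy]                 # distribution over positions
    H = -sum(pi * math.log(pi + eps) for pi in p)      # Shannon entropy (nats)
    lam = eta / (1.0 + math.exp(-(H - H_base)))        # eta * sigmoid(H - H_base)
    return H, lam


def gated_scale(lam, g2_mask):
    """Per-component scale: lam on bivector slots, 1.0 everywhere else."""
    return [1.0 + (lam - 1.0) * m for m in g2_mask]


# Energy spread evenly over 4 positions: H is close to ln(4).
H_flat, lam_flat = entropy_gate([1.0, 1.0, 1.0, 1.0])
# Energy concentrated at one position: H is close to 0.
H_peak, lam_peak = entropy_gate([4.0, 0.0, 0.0, 0.0])

# Assumed mask for a 2D algebra: [scalar, e1, e2, e12].
scale = gated_scale(lam_peak, [0.0, 0.0, 0.0, 1.0])
```

Under this formula `lambda_dyn` grows monotonically with `H`, so with the default `eta = 1.0` the bivector scale stays below 1 and is smallest for sharply peaked energy distributions.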

`__init__(algebra, channels, num_heads, eta=1.0, H_base=0.5)` ¶

Initializes Entropy-Gated Attention.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	Clifford algebra instance.	required
`channels`	`int`	Total multivector channels.	required
`num_heads`	`int`	Number of attention heads.	required
`eta`	`float`	Gating multiplier.	`1.0`
`H_base`	`float`	Base entropy threshold.	`0.5`

Source code in layers/adapters/mother.py

def __init__(self, algebra: CliffordAlgebra, channels: int, num_heads: int, eta: float = 1.0, H_base: float = 0.5):
    """Initializes Entropy-Gated Attention.

    Args:
        algebra: Clifford algebra instance.
        channels: Total multivector channels.
        num_heads: Number of attention heads.
        eta: Gating multiplier.
        H_base: Base entropy threshold.
    """
    super().__init__(algebra)
    self.channels = channels
    self.eta = eta
    self.H_base = H_base
    self.base_attention = GeometricProductAttention(algebra, channels, num_heads, causal=False)

    # Cache bivector indices and float mask for compile-friendly gating
    mask = self.algebra.grade_masks[2]
    self.register_buffer("g2_idx", mask.nonzero(as_tuple=True)[0])
    self.register_buffer("_g2_float_mask", mask.float())

`forward(x, key_padding_mask=None, return_gating=False)` ¶

Applies entropy-gated geometric attention.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Input multivectors [B, L, C, D].	required
`key_padding_mask`	`Tensor`	Optional [B, L] bool mask where True = padded.	`None`
`return_gating`	`bool`	If True, returns entropy and gating values.	`False`

Returns:

Type	Description
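`Tensor`	Attended multivectors [B, L, C, D]. If return_gating is True, a tuple (out, H, lambda_dyn) is returned instead, where H and lambda_dyn have shape [B].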

Source code in layers/adapters/mother.py

def forward(
    self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_gating: bool = False
) -> torch.Tensor:
    """Applies entropy-gated geometric attention.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded.
        return_gating: If True, returns entropy and gating values.

    Returns:
        Attended multivectors [B, L, C, D].
    """
    # 1. Calculate Information Entropy of Bivector Energy
    # Summing across multivector components (g2_idx) and across channels (dim 2)
    # x: [B, L, C, D]
    g2_energy = (x[..., self.g2_idx] ** 2).sum(dim=(-1, -2))  # [B, L]

    # Mask padded positions before entropy calc
    if key_padding_mask is not None:
        g2_energy = g2_energy.masked_fill(key_padding_mask, 0.0)

    # Normalize to probability distribution over sequence
    p = g2_energy / (g2_energy.sum(dim=1, keepdim=True) + 1e-8)

    # Shannon Entropy H per batch [B]
    H = -(p * torch.log(p + 1e-8)).sum(dim=1)

    # 2. Base-Adjusted Gating Function
    lambda_dyn = self.eta * torch.sigmoid(H - self.H_base)  # [B]

    # 3. Apply dynamic geometric stiffness
    # Scale the rotational components (bivectors)
    lambda_view = lambda_dyn.view(-1, 1, 1, 1)

    g2_mask = self._g2_float_mask.to(dtype=x.dtype)
    scale = 1.0 + (lambda_view - 1.0) * g2_mask  # [B, 1, 1, D]
    x_gated = x * scale

    out = self.base_attention(x_gated, key_padding_mask=key_padding_mask)

    if return_gating:
        return out, H, lambda_dyn
    return out

Optional dependency

`CliffordGraphConv` requires `torch-geometric`. Install with `uv sync --extra md17`.

`CliffordGraphConv` ¶

Bases: CliffordModule

Geometric graph convolution. Performs message passing on multivector node features.

Neighbor features are first aggregated according to the graph topology and then transformed: H' = Aggregate(H) * W + Bias.

Attributes:

Name	Type	Description
`linear`	`CliffordLinear`	The transformation.

Source code in layers/adapters/gnn.py

class CliffordGraphConv(CliffordModule):
    """Geometric Graph Conv. Performs message passing using multivector features.

    Aggregates features based on graph topology.
    H' = Aggregate(H) * W + Bias.

    Attributes:
        linear (CliffordLinear): The transformation.
    """

    def __init__(self, algebra: CliffordAlgebra, in_channels: int, out_channels: int):
        """Sets up the GNN layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            in_channels (int): Input features.
            out_channels (int): Output features.
        """
        super().__init__(algebra)
        self.linear = CliffordLinear(algebra, in_channels, out_channels)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """Aggregates and transforms node features using geometric operations.

        Args:
            x (torch.Tensor): Node features.
            adj (torch.Tensor): Adjacency matrix.

        Returns:
            torch.Tensor: Updated features.
        """
        # 1. Aggregate
        N, C, D = x.shape
        x_flat = x.view(N, -1)

        # Sparse aggregation
        x_agg_flat = torch.mm(adj, x_flat)
        x_agg = x_agg_flat.view(N, C, D)

        # 2. Transform
        out = self.linear(x_agg)

        return out
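
The aggregation step flattens each node's `[C, D]` multivector block, multiplies by the adjacency matrix, and restores the shape. A dependency-free sketch of just that step is shown below. The helper `aggregate` is a hypothetical name, nested lists stand in for tensors, and the learned `CliffordLinear` transform is left out.

```python
def aggregate(adj, x):
    """Sum neighbor features: adj is N x N, x is N x C x D (nested lists)."""
    n, c, d = len(x), len(x[0]), len(x[0][0])
    # Flatten each node's [C, D] block to a row of length C * D.
    flat = [[v for channel in node for v in channel] for node in x]
    # Dense equivalent of torch.mm(adj, x_flat).
    agg = [
        [sum(adj[i][j] * flat[j][k] for j in range(n)) for k in range(c * d)]
        for i in range(n)
    ]
    # Restore the [N, C, D] layout.
    return [[row[ch * d:(ch + 1) * d] for ch in range(c)] for row in agg]


# Path graph 0 - 1 - 2 without self-loops.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Three nodes, one channel, two components each.
x = [[[1.0, 0.0]], [[0.0, 1.0]], [[2.0, 2.0]]]

out = aggregate(adj, x)
print(out)  # [[[0.0, 1.0]], [[3.0, 2.0]], [[0.0, 1.0]]]
```

Because the adjacency matrix here has no self-loops and is not normalized, each node receives the plain sum of its neighbors' features. Adding the identity or row-normalizing `adj` before the call would change that to self-inclusive or mean aggregation.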

`__init__(algebra, in_channels, out_channels)` ¶

Sets up the GNN layer.

Parameters:

Name	Type	Description	Default
`algebra`	`CliffordAlgebra`	The algebra instance.	required
`in_channels`	`int`	Input features.	required
`out_channels`	`int`	Output features.	required

Source code in layers/adapters/gnn.py

def __init__(self, algebra: CliffordAlgebra, in_channels: int, out_channels: int):
    """Sets up the GNN layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        in_channels (int): Input features.
        out_channels (int): Output features.
    """
    super().__init__(algebra)
    self.linear = CliffordLinear(algebra, in_channels, out_channels)

`forward(x, adj)` ¶

Aggregates and transforms node features using geometric operations.

Parameters:

Name	Type	Description	Default
`x`	`Tensor`	Node features [N, C, D].	required
`adj`	`Tensor`	Adjacency matrix [N, N].	required

Returns:

Type	Description
`Tensor`	Updated features.

Source code in layers/adapters/gnn.py

def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """Aggregates and transforms node features using geometric operations.

    Args:
        x (torch.Tensor): Node features.
        adj (torch.Tensor): Adjacency matrix.

    Returns:
        torch.Tensor: Updated features.
    """
    # 1. Aggregate
    N, C, D = x.shape
    x_flat = x.view(N, -1)

    # Sparse aggregation
    x_agg_flat = torch.mm(adj, x_flat)
    x_agg = x_agg_flat.view(N, C, D)

    # 2. Transform
    out = self.linear(x_agg)

    return out
