Layers

Primitives

RotorLayer

Bases: CliffordModule

Learnable versor layer with universal grade parameterization.

For grade=2 (default): learns R = exp(-B/2) and applies the isometry x' = RxR~. For grade=k: learns a grade-k element V and applies the versor product x' = hat(V) x V^{-1}, where hat denotes grade involution.

Preserves origin. For grade=2, also preserves lengths and angles (isometry).

The exp strategy (closed-form vs. decomposition) is controlled by algebra.exp_policy; see core.runtime.decomposition.ExpPolicy.
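
To make the grade=2 case concrete, the following is a minimal, library-independent sketch of the sandwich x' = RxR~ in Cl(2,0). The helpers `blade_sign`, `gp`, and `reverse` are hypothetical stand-ins written for this illustration (they are not the library's API), and the bitmask basis ordering [1, e1, e2, e12] is an assumption.

```python
import math

def blade_sign(a: int, b: int) -> float:
    """Reordering sign for the product of basis blades a and b (bitmasks), Euclidean metric."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps % 2 else 1.0

def gp(x, y):
    """Geometric product of two dense multivectors (lists of length 2**n)."""
    out = [0.0] * len(x)
    for a, xa in enumerate(x):
        for b, yb in enumerate(y):
            out[a ^ b] += blade_sign(a, b) * xa * yb
    return out

def reverse(x):
    """Reversion: a grade-k component picks up the sign (-1)**(k*(k-1)//2)."""
    out = []
    for idx, coeff in enumerate(x):
        k = bin(idx).count("1")
        out.append(coeff * (-1.0) ** (k * (k - 1) // 2))
    return out

# Assumed basis order (bitmask): [1, e1, e2, e12]
theta = math.pi / 2
# R = exp(-B/2) for B = theta * e12, using e12**2 = -1
R = [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]
x = [0.0, 1.0, 0.0, 0.0]  # the vector e1

x_rot = gp(gp(R, x), reverse(R))  # e1 is mapped to e2, up to rounding
norm_before = math.sqrt(sum(c * c for c in x))
norm_after = math.sqrt(sum(c * c for c in x_rot))
```

With this sign convention a quarter-turn bivector sends e1 to e2 and leaves the length unchanged, which is the isometry property stated above.
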

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| channels | int | Number of versors. |
| grade | int | Grade of the learnable parameter. Default 2 (bivector → rotor). |
| grade_weights | Parameter | Learnable grade-k coefficients [channels, num_grade_elements]. |

Source code in layers/primitives/rotor.py
class RotorLayer(CliffordModule):
    """Learnable versor layer with universal grade parameterization.

    For grade=2 (default): learns R = exp(-B/2) and applies the isometry x' = RxR~.
    For grade=k: learns a grade-k element V and applies the versor product
    x' = hat(V) x V^{-1}, where hat denotes grade involution.

    Preserves origin. For grade=2, also preserves lengths and angles (isometry).

    The exp strategy (closed-form vs decomposition) is controlled by
    ``algebra.exp_policy`` -- see :class:`core.runtime.decomposition.ExpPolicy`.

    Attributes:
        channels (int): Number of versors.
        grade (int): Grade of the learnable parameter. Default 2 (bivector → rotor).
        grade_weights (nn.Parameter): Learnable grade-k coefficients [channels, num_grade_elements].
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        grade: int = 2,
        *,
        input_grades=None,
        output_grades=None,
        input_layout: GradeLayout = None,
        output_layout: GradeLayout = None,
        compact_output: bool = True,
    ):
        """Initialize the versor layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Number of features.
            grade (int): Grade of the learnable parameter.
                grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
                grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group.
                grade=k: general grade-k versor product.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.grade = int(grade)
        self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
        self.output_storage = (
            resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
            if output_layout is not None or output_grades is not None
            else self.input_storage
        )
        self.input_layout = self.input_storage.layout
        self.output_layout = self.output_storage.layout
        self.compact_output = bool(compact_output)

        self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
        self.num_grade_elements = self.grade_indices.numel()
        self.parameter_layout = algebra.layout((self.grade,))

        self.grade_weights = nn.Parameter(torch.Tensor(self.channels, self.num_grade_elements))
        if self.grade == 2:
            tag_manifold(self.grade_weights, MANIFOLD_SPIN)

        # Versor cache for eval mode
        self._cached_V_left = None
        self._cached_V_right = None

        self.reset_parameters()

    # --- Backward-compat aliases (grade == 2 usage) ---

    @property
    def bivector_indices(self):
        return self.grade_indices

    @property
    def num_bivectors(self):
        return self.num_grade_elements

    @property
    def bivector_weights(self):
        return self.grade_weights

    # ---------------------------------------------------

    def reset_parameters(self):
        """Initialize with near-identity transform (small weights)."""
        nn.init.normal_(self.grade_weights, std=0.01)

    def _build_grade_element(self, device, dtype):
        """Scatter grade_weights into full multivector dimension [channels, dim]."""
        weights = self.grade_weights.to(device=device, dtype=dtype)
        return self.parameter_layout.dense(weights)

    def _compute_versors(self, device, dtype):
        """Compute left and right factors for per_channel_sandwich.

        For grade=2: left = R = exp(-B/2), right = R~ (reverse).
        For grade=k: left = hat(V) (grade involution), right = V^{-1} (blade inverse).
          V is L2-normalized per channel before inversion so that blade_inverse
          remains exact (norm_sq is purely scalar for unit-norm grade-k elements).

        Returns:
            Tuple[Tensor, Tensor]: (V_left [C, dim], V_right [C, dim])
        """
        weights = self.grade_weights.to(device=device, dtype=dtype)
        return dense_versor_factors(
            self.algebra,
            weights,
            grade=self.grade,
            parameter_layout=self.parameter_layout,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

        Caches versors during eval mode for faster inference.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Transformed input [Batch, Channels, Dim].
        """
        cache = (
            (self._cached_V_left, self._cached_V_right)
            if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
            else None
        )
        out, next_cache = self.algebra.versor_action(
            x,
            self.grade_weights,
            grade=self.grade,
            input_layout=self.input_layout,
            output_layout=self.output_layout,
            parameter_layout=self.parameter_layout,
            compact_output=self.compact_output,
            channels=self.channels,
            name="RotorLayer input",
            dense_cache=cache,
            cache_dense=not self.training,
            return_cache=True,
        )
        if not self.training and next_cache is not None:
            self._cached_V_left, self._cached_V_right = next_cache
        return out

    def train(self, mode: bool = True):
        """Invalidate versor cache when switching to train mode."""
        if mode:
            self._cached_V_left = None
            self._cached_V_right = None
        return super().train(mode)

    def prune_bivectors(self, threshold: float = 1e-4) -> int:
        """Zero out grade weights below threshold.

        Args:
            threshold (float): Cutoff magnitude.

        Returns:
            int: Number of pruned parameters.
        """
        with torch.no_grad():
            mask = torch.abs(self.grade_weights) >= threshold
            num_pruned = (~mask).sum().item()
            self.grade_weights.data.mul_(mask.to(dtype=self.grade_weights.dtype))
        return num_pruned

    def sparsity_loss(self) -> torch.Tensor:
        """Compute L1 sparsity regularization on grade weights."""
        return torch.norm(self.grade_weights, p=1)

__init__(algebra, channels, grade=2, *, input_grades=None, output_grades=None, input_layout=None, output_layout=None, compact_output=True)

Initialize the versor layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | The algebra instance. | required |
| channels | int | Number of features. | required |
| grade | int | Grade of the learnable parameter. grade=2 (default): bivectors → rotors via exp(-B/2), Spin group; grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group; grade=k: general grade-k versor product. | 2 |
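
For grade=1 the versor product hat(n) x n^{-1} is a reflection in the hyperplane orthogonal to n. The sketch below checks this in Cl(3,0) against the Householder formula x - 2 (x·n)/(n·n) n. The helpers `blade_sign`, `gp`, and `embed` are hypothetical and written only for this example; dividing by n·n plays the role of the normalization that the library applies before inversion.

```python
def blade_sign(a: int, b: int) -> float:
    """Reordering sign for the product of basis blades a and b (bitmasks), Euclidean metric."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps % 2 else 1.0

def gp(x, y):
    """Geometric product of two dense multivectors (lists of length 2**n)."""
    out = [0.0] * len(x)
    for a, xa in enumerate(x):
        for b, yb in enumerate(y):
            out[a ^ b] += blade_sign(a, b) * xa * yb
    return out

def embed(v):
    """Place a 3-vector on the grade-1 slots (bitmasks 1, 2, 4) of a dense Cl(3,0) multivector."""
    mv = [0.0] * 8
    mv[1], mv[2], mv[4] = v
    return mv

n = (1.0, 2.0, 2.0)
x = (3.0, -1.0, 0.5)

n_sq = sum(c * c for c in n)              # n.n
hat_n = [-c for c in embed(n)]            # grade involution negates odd grades
n_inv = [c / n_sq for c in embed(n)]      # inverse of a vector: n / (n.n)

y = gp(gp(hat_n, embed(x)), n_inv)        # versor product hat(n) x n^{-1}

x_dot_n = sum(a * b for a, b in zip(x, n))
householder = [xi - 2.0 * x_dot_n / n_sq * ni for xi, ni in zip(x, n)]
```

The grade-1 components of `y` agree with `householder`, and every other component vanishes up to rounding.
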
Source code in layers/primitives/rotor.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    grade: int = 2,
    *,
    input_grades=None,
    output_grades=None,
    input_layout: GradeLayout = None,
    output_layout: GradeLayout = None,
    compact_output: bool = True,
):
    """Initialize the versor layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Number of features.
        grade (int): Grade of the learnable parameter.
            grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
            grade=1: vectors → reflections via hat(n) x n^{-1}, Pin group.
            grade=k: general grade-k versor product.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.grade = int(grade)
    self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
    self.output_storage = (
        resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
        if output_layout is not None or output_grades is not None
        else self.input_storage
    )
    self.input_layout = self.input_storage.layout
    self.output_layout = self.output_storage.layout
    self.compact_output = bool(compact_output)

    self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
    self.num_grade_elements = self.grade_indices.numel()
    self.parameter_layout = algebra.layout((self.grade,))

    self.grade_weights = nn.Parameter(torch.Tensor(self.channels, self.num_grade_elements))
    if self.grade == 2:
        tag_manifold(self.grade_weights, MANIFOLD_SPIN)

    # Versor cache for eval mode
    self._cached_V_left = None
    self._cached_V_right = None

    self.reset_parameters()

reset_parameters()

Initialize with near-identity transform (small weights).

Source code in layers/primitives/rotor.py
def reset_parameters(self):
    """Initialize with near-identity transform (small weights)."""
    nn.init.normal_(self.grade_weights, std=0.01)
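
As a rough illustration of why small weights give a near-identity transform in the default grade=2 case, the sketch below evaluates R = exp(-B/2) for a single bivector coefficient equal to one standard deviation (0.01) in Cl(2,0). The closed form cos(θ/2) - sin(θ/2) e12 is assumed here; it holds whenever the bivector squares to a negative scalar.

```python
import math

std = 0.01          # the std used by reset_parameters
theta = std         # a typical bivector coefficient, B = theta * e12

# Scalar and e12 components of R = exp(-B/2) = cos(theta/2) - sin(theta/2) e12
R = (math.cos(theta / 2), -math.sin(theta / 2))

# Euclidean distance between R and the identity rotor (1, 0)
deviation = math.hypot(R[0] - 1.0, R[1])
```

The deviation is about θ/2, so a freshly initialized layer rotates its input by roughly 0.01 radians. For other grades the normalized versor is not close to the identity, so the "near-identity" description is best read as applying to grade=2.
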

forward(x)

Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

Caches versors during eval mode for faster inference.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input [Batch, Channels, Dim]. | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Transformed input [Batch, Channels, Dim]. |

Source code in layers/primitives/rotor.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply versor product x' = hat(V) x V^{-1} (= RxR~ for grade=2).

    Caches versors during eval mode for faster inference.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Transformed input [Batch, Channels, Dim].
    """
    cache = (
        (self._cached_V_left, self._cached_V_right)
        if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
        else None
    )
    out, next_cache = self.algebra.versor_action(
        x,
        self.grade_weights,
        grade=self.grade,
        input_layout=self.input_layout,
        output_layout=self.output_layout,
        parameter_layout=self.parameter_layout,
        compact_output=self.compact_output,
        channels=self.channels,
        name="RotorLayer input",
        dense_cache=cache,
        cache_dense=not self.training,
        return_cache=True,
    )
    if not self.training and next_cache is not None:
        self._cached_V_left, self._cached_V_right = next_cache
    return out

train(mode=True)

Invalidate versor cache when switching to train mode.
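
The eval-mode cache and its invalidation can be sketched without any dependencies. `CachedTransform` below is a hypothetical stand-in that only mimics the `training` flag and the `train()` override; the placeholder `_build` replaces the real versor computation, which the library delegates to `algebra.versor_action`.

```python
class CachedTransform:
    """Dependency-free stand-in for a module with a `training` flag and an eval-mode cache."""

    def __init__(self, weight: float):
        self.weight = weight
        self.training = True
        self._cache = None
        self.builds = 0  # how many times the expensive factor was rebuilt

    def _build(self) -> float:
        self.builds += 1
        return 2.0 * self.weight  # placeholder for computing the versor factors

    def forward(self, x: float) -> float:
        if not self.training and self._cache is not None:
            factor = self._cache          # reuse cached factor in eval mode
        else:
            factor = self._build()
            if not self.training:
                self._cache = factor      # only cache while in eval mode
        return factor * x

    def train(self, mode: bool = True):
        if mode:
            self._cache = None            # weights may change, so drop the cache
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

layer = CachedTransform(weight=1.5)
layer.eval()
layer.forward(1.0)
layer.forward(1.0)
builds_in_eval = layer.builds   # the factor was built once and then reused

layer.weight = 3.0              # weight edited while still in eval mode
stale = layer.forward(1.0)      # still uses the factor cached for weight=1.5

layer.train()
layer.eval()
fresh = layer.forward(1.0)      # cache was cleared by train(), so the factor is rebuilt
```

As the listing reads, only `train(True)` clears the cache. If that is the whole mechanism, weights changed while the layer stays in eval mode (for example by pruning or loading a checkpoint) would keep using the old versors until `train()` is called once.
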

Source code in layers/primitives/rotor.py
def train(self, mode: bool = True):
    """Invalidate versor cache when switching to train mode."""
    if mode:
        self._cached_V_left = None
        self._cached_V_right = None
    return super().train(mode)

prune_bivectors(threshold=0.0001)

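Zero out grade weights whose absolute value is below threshold.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| threshold | float | Cutoff magnitude. | 0.0001 |

Returns:

| Type | Description |
| --- | --- |
| int | Number of pruned parameters, counted as all entries below the threshold (including any that were already zero). |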
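
The thresholding rule can be mirrored on plain nested lists. This is a sketch of the same masking logic, not the library method: the function name `prune` and the list-of-lists weight layout are assumptions for illustration.

```python
def prune(weights, threshold=1e-4):
    """Zero entries with |w| < threshold in place and return how many entries fall below it."""
    below = 0
    for row in weights:
        for j, w in enumerate(row):
            if abs(w) < threshold:   # entries equal to the threshold are kept
                row[j] = 0.0
                below += 1
    return below

weights = [
    [0.5, 5e-5, 0.0],
    [-2e-5, -0.3, 1e-4],
]
first = prune(weights)    # 5e-5, 0.0 and -2e-5 are below the cutoff
second = prune(weights)   # the same three slots are counted again
```

Because the count covers every entry below the cutoff, a second call reports the same number even though nothing changes.
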

Source code in layers/primitives/rotor.py
def prune_bivectors(self, threshold: float = 1e-4) -> int:
    """Zero out grade weights below threshold.

    Args:
        threshold (float): Cutoff magnitude.

    Returns:
        int: Number of pruned parameters.
    """
    with torch.no_grad():
        mask = torch.abs(self.grade_weights) >= threshold
        num_pruned = (~mask).sum().item()
        self.grade_weights.data.mul_(mask.to(dtype=self.grade_weights.dtype))
    return num_pruned

sparsity_loss()

Compute L1 sparsity regularization on grade weights.
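
An L1 penalty is the sum of absolute values, and its subgradient is the sign of each weight, so a gradient step shrinks every nonzero weight by the same amount. The sketch below shows one such step on plain lists; the names `l1_norm`, `sign`, and the step size `lr_times_lambda` are hypothetical.

```python
def l1_norm(weights):
    """Sum of absolute values over all entries (what an L1 norm computes)."""
    return sum(abs(w) for row in weights for w in row)

def sign(w):
    """Subgradient of |w|: -1, 0 or +1."""
    return (w > 0) - (w < 0)

weights = [[0.3, -0.2], [0.0, 0.05]]
lr_times_lambda = 0.01  # learning rate times regularization strength (assumed values)

before = l1_norm(weights)
stepped = [[w - lr_times_lambda * sign(w) for w in row] for row in weights]
after = l1_norm(stepped)
```

Each of the three nonzero weights moves 0.01 toward zero, which is why an L1 term pairs naturally with a later call to prune_bivectors.
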

Source code in layers/primitives/rotor.py
def sparsity_loss(self) -> torch.Tensor:
    """Compute L1 sparsity regularization on grade weights."""
    return torch.norm(self.grade_weights, p=1)

MultiRotorLayer

Bases: CliffordModule

Multi-versor layer with weighted superposition: x' = sum_k w_k hat(V_k) x V_k^{-1}.

For grade=2 (default): each V_k = exp(-B_k/2) is a rotor, reducing to x' = sum_k w_k R_k x R~_k. For grade=k: each V_k is a grade-k versor applied via the general versor product.

The exp strategy is controlled by algebra.exp_policy.
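
A small sketch of the weighted superposition, under the assumption that we work in Cl(2,0), where the rotor sandwich on a vector is a plane rotation and can be modeled by complex multiplication. The function `superpose` is hypothetical and stands in for x' = sum_k w_k R_k x R~_k.

```python
import cmath
import math

def superpose(x: complex, angles, weights) -> complex:
    """Weighted sum of plane rotations of x, one rotation per (weight, angle) pair."""
    return sum(w * cmath.exp(1j * a) * x for w, a in zip(weights, angles))

x = 1 + 0j

# A single rotor with weight 1 is a pure rotation and keeps the length.
single = superpose(x, [math.pi / 3], [1.0])

# Two rotors half a turn apart with equal weights cancel each other.
mixed = superpose(x, [0.0, math.pi], [0.5, 0.5])
```

The second case shows that a weighted sum of isometries is in general not an isometry, so the multi-versor layer can change lengths even when every individual versor preserves them.
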

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| channels | int | Input features. |
| num_rotors | int | Number of overlapping versors. |
| grade | int | Grade of the learnable parameters. Default 2 (rotors). |
| rotor_grade_weights | Parameter | Grade-k coefficients [num_rotors, num_grade_elements]. |
| weights | Parameter | Mixing weights [channels, num_rotors]. |

Source code in layers/primitives/multi_rotor.py
class MultiRotorLayer(CliffordModule):
    """Multi-versor layer with weighted superposition: x' = sum_k w_k hat(V_k) x V_k^{-1}.

    For grade=2 (default): each V_k = exp(-B_k/2) is a rotor, reducing to
    x' = sum_k w_k R_k x R~_k.
    For grade=k: each V_k is a grade-k versor applied via the general versor product.

    The exp strategy is controlled by ``algebra.exp_policy``.

    Attributes:
        channels (int): Input features.
        num_rotors (int): Number of overlapping versors.
        grade (int): Grade of the learnable parameters. Default 2 (rotors).
        rotor_grade_weights (nn.Parameter): Grade-k coefficients [num_rotors, num_grade_elements].
        weights (nn.Parameter): Mixing weights [channels, num_rotors].
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_rotors: int = 8,
        grade: int = 2,
        *,
        input_grades=None,
        output_grades=None,
        input_layout: GradeLayout = None,
        output_layout: GradeLayout = None,
        compact_output: bool = True,
    ):
        """Initialize Multi-Versor Layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Input features.
            num_rotors (int): Number of parallel versor heads.
            grade (int): Grade of the learnable parameter.
                grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
                grade=k: general grade-k versor product.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.num_rotors = require_positive_int(num_rotors, "num_rotors")
        self.grade = int(grade)
        self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
        self.output_storage = (
            resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
            if output_layout is not None or output_grades is not None
            else self.input_storage
        )
        self.input_layout = self.input_storage.layout
        self.output_layout = self.output_storage.layout
        self.compact_output = bool(compact_output)

        self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
        self.num_grade_elements = self.grade_indices.numel()
        self.parameter_layout = algebra.layout((self.grade,))

        self.rotor_grade_weights = nn.Parameter(torch.Tensor(self.num_rotors, self.num_grade_elements))
        if self.grade == 2:
            tag_manifold(self.rotor_grade_weights, MANIFOLD_SPIN)

        # Mixing weights (Euclidean — intentionally untagged)
        self.weights = nn.Parameter(torch.Tensor(self.channels, self.num_rotors))

        # Versor cache for eval mode
        self._cached_V_left = None
        self._cached_V_right = None

        self.reset_parameters()

    # --- Backward-compat aliases (grade == 2 usage) ---

    @property
    def bivector_indices(self):
        return self.grade_indices

    @property
    def num_bivectors(self):
        return self.num_grade_elements

    @property
    def rotor_bivectors(self):
        return self.rotor_grade_weights

    # ---------------------------------------------------

    def reset_parameters(self):
        """Initialize with small transforms and uniform mixing weights."""
        nn.init.normal_(self.rotor_grade_weights, std=0.01)
        nn.init.xavier_uniform_(self.weights)

    def _compute_versors(self, device, dtype):
        """Compute left and right factors for all K versors.

        For grade=2: left = R_k = exp(-B_k/2), right = R~_k.
        For grade=k: left = hat(V_k), right = V_k^{-1}.

        Returns:
            Tuple[Tensor, Tensor]: (V_left [K, dim], V_right [K, dim])
        """
        weights = self.rotor_grade_weights.to(device=device, dtype=dtype)
        return dense_versor_factors(
            self.algebra,
            weights,
            grade=self.grade,
            parameter_layout=self.parameter_layout,
        )

    def forward(self, x: torch.Tensor, return_invariants: bool = False) -> torch.Tensor:
        """Apply weighted multi-versor superposition.

        Caches versors during eval mode for faster inference.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].
            return_invariants (bool): If True, returns per-grade norms instead of output.

        Returns:
            torch.Tensor: Transformed output [Batch, Channels, Dim].
        """
        cache = (
            (self._cached_V_left, self._cached_V_right)
            if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
            else None
        )
        out, next_cache = self.algebra.multi_versor_action(
            x,
            self.rotor_grade_weights,
            self.weights,
            grade=self.grade,
            input_layout=self.input_layout,
            output_layout=self.output_layout,
            parameter_layout=self.parameter_layout,
            compact_output=self.compact_output,
            channels=self.channels,
            name="MultiRotorLayer input",
            dense_cache=cache,
            cache_dense=not self.training,
            return_cache=True,
        )
        if not self.training and next_cache is not None:
            self._cached_V_left, self._cached_V_right = next_cache

        if return_invariants:
            return self.algebra.grade_norms(out, layout=self.output_layout)

        return out

    def train(self, mode: bool = True):
        """Invalidate versor cache when switching to train mode."""
        if mode:
            self._cached_V_left = None
            self._cached_V_right = None
        return super().train(mode)

    def sparsity_loss(self) -> torch.Tensor:
        """Compute L1 sparsity loss for versor weights and mixing weights."""
        return torch.norm(self.rotor_grade_weights, p=1) + torch.norm(self.weights, p=1)

__init__(algebra, channels, num_rotors=8, grade=2, *, input_grades=None, output_grades=None, input_layout=None, output_layout=None, compact_output=True)

Initialize Multi-Versor Layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | The algebra instance. | required |
| channels | int | Input features. | required |
| num_rotors | int | Number of parallel versor heads. | 8 |
| grade | int | Grade of the learnable parameter. grade=2 (default): bivectors → rotors via exp(-B/2), Spin group; grade=k: general grade-k versor product. | 2 |
Source code in layers/primitives/multi_rotor.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_rotors: int = 8,
    grade: int = 2,
    *,
    input_grades=None,
    output_grades=None,
    input_layout: GradeLayout = None,
    output_layout: GradeLayout = None,
    compact_output: bool = True,
):
    """Initialize Multi-Versor Layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input features.
        num_rotors (int): Number of parallel versor heads.
        grade (int): Grade of the learnable parameter.
            grade=2 (default): bivectors → rotors via exp(-B/2), Spin group.
            grade=k: general grade-k versor product.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.num_rotors = require_positive_int(num_rotors, "num_rotors")
    self.grade = int(grade)
    self.input_storage = resolve_layer_storage(algebra, layout=input_layout, grades=input_grades)
    self.output_storage = (
        resolve_layer_storage(algebra, layout=output_layout, grades=output_grades)
        if output_layout is not None or output_grades is not None
        else self.input_storage
    )
    self.input_layout = self.input_storage.layout
    self.output_layout = self.output_storage.layout
    self.compact_output = bool(compact_output)

    self.register_buffer("grade_indices", grade_indices(algebra, self.grade))
    self.num_grade_elements = self.grade_indices.numel()
    self.parameter_layout = algebra.layout((self.grade,))

    self.rotor_grade_weights = nn.Parameter(torch.Tensor(self.num_rotors, self.num_grade_elements))
    if self.grade == 2:
        tag_manifold(self.rotor_grade_weights, MANIFOLD_SPIN)

    # Mixing weights (Euclidean — intentionally untagged)
    self.weights = nn.Parameter(torch.Tensor(self.channels, self.num_rotors))

    # Versor cache for eval mode
    self._cached_V_left = None
    self._cached_V_right = None

    self.reset_parameters()

reset_parameters()

Initialize with small transforms and Xavier-uniform mixing weights.

Source code in layers/primitives/multi_rotor.py
def reset_parameters(self):
    """Initialize with small transforms and uniform mixing weights."""
    nn.init.normal_(self.rotor_grade_weights, std=0.01)
    nn.init.xavier_uniform_(self.weights)

forward(x, return_invariants=False)

Apply weighted multi-versor superposition.

Caches versors during eval mode for faster inference.

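Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input [Batch, Channels, Dim]. | required |
| return_invariants | bool | If True, returns per-grade norms instead of output. | False |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Transformed output [Batch, Channels, Dim], or the per-grade norms of that output when return_invariants is True. |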
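
To give a sense of what "per-grade norms" means, here is a minimal sketch for a single dense multivector. It assumes a bitmask basis ordering (the grade of a component is the number of set bits in its index) and a plain Euclidean coefficient norm per grade. The library's `algebra.grade_norms` may order the basis or define the norm differently, so `grade_norms` below is a hypothetical stand-in.

```python
import math

def grade_norms(mv, n):
    """Return one Euclidean norm per grade 0..n for a dense multivector of length 2**n."""
    squared = [0.0] * (n + 1)
    for idx, coeff in enumerate(mv):
        grade = bin(idx).count("1")   # number of basis vectors in this blade
        squared[grade] += coeff * coeff
    return [math.sqrt(s) for s in squared]

# Cl(3,0), assumed order: [1, e1, e2, e12, e3, e13, e23, e123]
mv = [1.0, 3.0, 4.0, 0.0, 0.0, 0.0, 2.0, 0.0]
norms = grade_norms(mv, n=3)
```

Under these assumptions the result has one entry per grade (n + 1 values) rather than one per basis blade, so the last dimension of the returned tensor would differ from Dim.
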

Source code in layers/primitives/multi_rotor.py
def forward(self, x: torch.Tensor, return_invariants: bool = False) -> torch.Tensor:
    """Apply weighted multi-versor superposition.

    Caches versors during eval mode for faster inference.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].
        return_invariants (bool): If True, returns per-grade norms instead of output.

    Returns:
        torch.Tensor: Transformed output [Batch, Channels, Dim].
    """
    cache = (
        (self._cached_V_left, self._cached_V_right)
        if not self.training and self._cached_V_left is not None and self._cached_V_right is not None
        else None
    )
    out, next_cache = self.algebra.multi_versor_action(
        x,
        self.rotor_grade_weights,
        self.weights,
        grade=self.grade,
        input_layout=self.input_layout,
        output_layout=self.output_layout,
        parameter_layout=self.parameter_layout,
        compact_output=self.compact_output,
        channels=self.channels,
        name="MultiRotorLayer input",
        dense_cache=cache,
        cache_dense=not self.training,
        return_cache=True,
    )
    if not self.training and next_cache is not None:
        self._cached_V_left, self._cached_V_right = next_cache

    if return_invariants:
        return self.algebra.grade_norms(out, layout=self.output_layout)

    return out

train(mode=True)

Invalidate versor cache when switching to train mode.

Source code in layers/primitives/multi_rotor.py
def train(self, mode: bool = True):
    """Invalidate versor cache when switching to train mode."""
    if mode:
        self._cached_V_left = None
        self._cached_V_right = None
    return super().train(mode)

sparsity_loss()

Compute L1 sparsity loss for versor weights and mixing weights.

Source code in layers/primitives/multi_rotor.py
def sparsity_loss(self) -> torch.Tensor:
    """Compute L1 sparsity loss for versor weights and mixing weights."""
    return torch.norm(self.rotor_grade_weights, p=1) + torch.norm(self.weights, p=1)

CliffordLinear

Bases: CliffordModule

Fully connected layer with optional rotor-based backend.

Can use either:

- Traditional scalar weight matrix (default, backward compatible)
- Rotor-based transformation (new, parameter efficient via RotorGadget)

The traditional backend uses O(in_channels x out_channels) parameters, while the rotor backend uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number of basis vectors.
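
A back-of-the-envelope comparison of the two parameter counts. The traditional count follows the attribute shapes documented for this class (weight [Out, In] plus bias [Out, Dim]) and assumes a dense layout with Dim = 2**n. The rotor figure is only the leading-order term from the O(...) expression; the actual RotorGadget may add constant factors (for example a bias or aggregation weights), so both function names here are hypothetical helpers.

```python
from math import comb

def traditional_params(in_channels: int, out_channels: int, n: int) -> int:
    """Weight [Out, In] plus bias [Out, Dim], with Dim = 2**n for a dense layout."""
    return out_channels * in_channels + out_channels * 2 ** n

def rotor_params_leading(num_rotor_pairs: int, n: int) -> int:
    """Leading-order term: one set of n(n-1)/2 bivector coefficients per rotor pair."""
    return num_rotor_pairs * comb(n, 2)

dense = traditional_params(in_channels=64, out_channels=64, n=3)
rotor = rotor_params_leading(num_rotor_pairs=4, n=3)
```

For 64 input and output channels in a three-dimensional algebra, the traditional backend stores a few thousand scalars, while the leading rotor term is in the tens and does not grow with the channel counts.
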

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| in_channels | int | Input features. |
| out_channels | int | Output features. |
| backend | str | 'traditional' or 'rotor'. |
| weight | Parameter \| None | Weights [Out, In] (traditional backend only). |
| bias | Parameter \| None | Bias multivector [Out, Dim] (traditional backend only). |
| gadget | Module \| None | Rotor transformation (rotor backend only). |
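
With the traditional backend, the weight [Out, In] mixes channels with scalar coefficients and the bias [Out, Dim] is added per output channel. The sketch below reproduces that contraction for one unbatched sample using nested lists; `clifford_linear` is a hypothetical helper, not the layer itself.

```python
def clifford_linear(x, weight, bias):
    """out[o][d] = sum_i weight[o][i] * x[i][d] + bias[o][d] for one unbatched sample."""
    in_channels = len(x)
    dim = len(x[0])
    return [
        [
            sum(weight[o][i] * x[i][d] for i in range(in_channels)) + bias[o][d]
            for d in range(dim)
        ]
        for o in range(len(weight))
    ]

# Two input channels, each a dense multivector with 4 components
x = [
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 3.0],
]
weight = [[1.0, 1.0], [2.0, -1.0], [0.0, 0.5]]   # [Out=3, In=2]
bias = [[0.0] * 4 for _ in range(3)]             # [Out=3, Dim=4]

out = clifford_linear(x, weight, bias)
```

Because each weight is a scalar applied to every component of a channel, a component that is zero in all input channels (index 2 here) stays zero in every output channel unless the bias fills it.
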

Source code in layers/primitives/linear.py
class CliffordLinear(CliffordModule):
    """Fully connected layer with optional rotor-based backend.

    Can use either:
    - Traditional scalar weight matrix (default, backward compatible)
    - Rotor-based transformation (new, parameter efficient via RotorGadget)

    The traditional backend uses O(in_channels x out_channels) parameters,
    while the rotor backend uses O(num_rotor_pairs x n(n-1)/2) parameters
    where n is the number of basis vectors.

    Attributes:
        in_channels (int): Input features.
        out_channels (int): Output features.
        backend (str): 'traditional' or 'rotor'
        weight (torch.nn.Parameter | None): Weights [Out, In] (traditional backend only).
        bias (torch.nn.Parameter | None): Bias multivector [Out, Dim] (traditional backend only).
        gadget (nn.Module | None): Rotor transformation (rotor backend only).
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        in_channels: int,
        out_channels: int,
        backend: Literal["traditional", "rotor"] = "traditional",
        num_rotor_pairs: int = 4,
        aggregation: Literal["mean", "sum", "learned"] = "mean",
        shuffle: Literal["none", "fixed", "random"] = "none",
        grades=None,
        layout: GradeLayout = None,
    ):
        """Initialize Clifford Linear.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            in_channels (int): Input size.
            out_channels (int): Output size.
            backend (str): 'traditional' for standard linear layer,
                          'rotor' for rotor-based transformation
            num_rotor_pairs (int): Number of rotor pairs (rotor backend only)
            aggregation (str): Aggregation method (rotor backend only)
            shuffle (str): Input channel shuffle strategy (rotor backend only):
                - 'none': No shuffle (default)
                - 'fixed': Fixed random permutation
                - 'random': Random permutation each forward pass
        """
        super().__init__(algebra)
        self.in_channels = require_positive_int(in_channels, "in_channels")
        self.out_channels = require_positive_int(out_channels, "out_channels")
        self.backend = require_choice(backend, "backend", ("traditional", "rotor"))
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        if self.backend == "traditional":
            self.weight = nn.Parameter(torch.Tensor(self.out_channels, self.in_channels))
            self.bias = nn.Parameter(torch.Tensor(self.out_channels, self.lane_dim))
            self.reset_parameters()
            self.gadget = None

        elif self.backend == "rotor":
            if self.layout is not None:
                raise ValueError(
                    "CliffordLinear rotor backend is dense-only; use traditional backend for compact lanes."
                )
            from .rotor_gadget import RotorGadget

            self.gadget = RotorGadget(
                algebra=algebra,
                in_channels=self.in_channels,
                out_channels=self.out_channels,
                num_rotor_pairs=num_rotor_pairs,
                aggregation=aggregation,
                shuffle=shuffle,
                bias=True,  # Include bias in rotor gadget
            )
            self.weight = None
            self.bias = None

    def reset_parameters(self):
        """Initialize weights with Xavier uniform and zero bias."""
        if self.backend == "traditional":
            nn.init.xavier_uniform_(self.weight)
            nn.init.zeros_(self.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply channel-mixing linear transformation.

        Args:
            x (torch.Tensor): Input [Batch, In, Dim].

        Returns:
            torch.Tensor: Output [Batch, Out, Dim].
        """
        self.storage.validate_input(
            x,
            channels=self.in_channels,
            name="CliffordLinear input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )

        if self.backend == "traditional":
            out = torch.einsum("oi,...id->...od", self.weight, x)
            bias_shape = (1,) * (x.ndim - 2) + (self.out_channels, self.lane_dim)
            out = out + self.bias.view(bias_shape)
            return out
        return self.gadget(x)

    def extra_repr(self) -> str:
        """String representation for debugging.

        Returns:
            str: Layer parameters description
        """
        parts = [f"in_channels={self.in_channels}", f"out_channels={self.out_channels}", f"backend={self.backend}"]
        if self.layout is not None:
            parts.append(f"grades={self.layout.grades}")
        return ", ".join(parts)

__init__(algebra, in_channels, out_channels, backend='traditional', num_rotor_pairs=4, aggregation='mean', shuffle='none', grades=None, layout=None)

Initialize Clifford Linear.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | The algebra instance. | required |
| in_channels | int | Input size. | required |
| out_channels | int | Output size. | required |
| backend | str | 'traditional' for standard linear layer, 'rotor' for rotor-based transformation. | 'traditional' |
| num_rotor_pairs | int | Number of rotor pairs (rotor backend only). | 4 |
| aggregation | str | Aggregation method (rotor backend only). | 'mean' |
| shuffle | str | Input channel shuffle strategy (rotor backend only): 'none' (no shuffle, default), 'fixed' (fixed random permutation), 'random' (random permutation each forward pass). | 'none' |
Source code in layers/primitives/linear.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    in_channels: int,
    out_channels: int,
    backend: Literal["traditional", "rotor"] = "traditional",
    num_rotor_pairs: int = 4,
    aggregation: Literal["mean", "sum", "learned"] = "mean",
    shuffle: Literal["none", "fixed", "random"] = "none",
    grades=None,
    layout: GradeLayout = None,
):
    """Initialize Clifford Linear.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        in_channels (int): Input size.
        out_channels (int): Output size.
        backend (str): 'traditional' for standard linear layer,
                      'rotor' for rotor-based transformation
        num_rotor_pairs (int): Number of rotor pairs (rotor backend only)
        aggregation (str): Aggregation method (rotor backend only)
        shuffle (str): Input channel shuffle strategy (rotor backend only):
            - 'none': No shuffle (default)
            - 'fixed': Fixed random permutation
            - 'random': Random permutation each forward pass
    """
    super().__init__(algebra)
    self.in_channels = require_positive_int(in_channels, "in_channels")
    self.out_channels = require_positive_int(out_channels, "out_channels")
    self.backend = require_choice(backend, "backend", ("traditional", "rotor"))
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    if self.backend == "traditional":
        self.weight = nn.Parameter(torch.Tensor(self.out_channels, self.in_channels))
        self.bias = nn.Parameter(torch.Tensor(self.out_channels, self.lane_dim))
        self.reset_parameters()
        self.gadget = None

    elif self.backend == "rotor":
        if self.layout is not None:
            raise ValueError(
                "CliffordLinear rotor backend is dense-only; use traditional backend for compact lanes."
            )
        from .rotor_gadget import RotorGadget

        self.gadget = RotorGadget(
            algebra=algebra,
            in_channels=self.in_channels,
            out_channels=self.out_channels,
            num_rotor_pairs=num_rotor_pairs,
            aggregation=aggregation,
            shuffle=shuffle,
            bias=True,  # Include bias in rotor gadget
        )
        self.weight = None
        self.bias = None

reset_parameters()

Initialize weights with Xavier uniform and zero bias.

Source code in layers/primitives/linear.py
def reset_parameters(self):
    """Initialize weights with Xavier uniform and zero bias."""
    if self.backend == "traditional":
        nn.init.xavier_uniform_(self.weight)
        nn.init.zeros_(self.bias)

forward(x)

Apply channel-mixing linear transformation.

Parameters:

Name Type Description Default
x Tensor

Input [Batch, In, Dim].

required

Returns:

Type Description
Tensor

torch.Tensor: Output [Batch, Out, Dim].

Source code in layers/primitives/linear.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply channel-mixing linear transformation.

    Args:
        x (torch.Tensor): Input [Batch, In, Dim].

    Returns:
        torch.Tensor: Output [Batch, Out, Dim].
    """
    self.storage.validate_input(
        x,
        channels=self.in_channels,
        name="CliffordLinear input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )

    if self.backend == "traditional":
        out = torch.einsum("oi,...id->...od", self.weight, x)
        bias_shape = (1,) * (x.ndim - 2) + (self.out_channels, self.lane_dim)
        out = out + self.bias.view(bias_shape)
        return out
    return self.gadget(x)
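
The traditional backend's einsum `"oi,...id->...od"` mixes channels only: every blade coefficient `d` of a channel is scaled by the same scalar weight, so lanes never mix with each other. The sketch below spells this out for a single sample in plain Python, without PyTorch. The function name `clifford_linear_traditional` and the nested-list layout are illustrative assumptions, not part of the library.

```python
def clifford_linear_traditional(x, weight, bias):
    """Channel-mixing sketch for one sample (hypothetical helper).

    x:      [in_channels][lane_dim]
    weight: [out_channels][in_channels]
    bias:   [out_channels][lane_dim]
    """
    in_channels = len(x)
    lane_dim = len(x[0])
    out = []
    for o, row in enumerate(weight):
        # The same scalar row[i] multiplies every blade coefficient d of channel i.
        mixed = [sum(row[i] * x[i][d] for i in range(in_channels)) for d in range(lane_dim)]
        # The bias is added per output channel and per lane.
        out.append([m + b for m, b in zip(mixed, bias[o])])
    return out


x = [[1.0, 2.0, 0.0, 0.0], [0.0, 1.0, 0.0, 3.0]]  # 2 channels, 4 coefficients each
weight = [[1.0, 1.0], [2.0, 0.0], [0.0, -1.0]]    # 3 output channels
bias = [[0.0] * 4 for _ in range(3)]
y = clifford_linear_traditional(x, weight, bias)
```

Because the weight carries no lane index, the parameter count of this backend does not grow with the algebra dimension, apart from the bias.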

extra_repr()

String representation for debugging.

Returns:

Name Type Description
str str

Layer parameters description

Source code in layers/primitives/linear.py
def extra_repr(self) -> str:
    """String representation for debugging.

    Returns:
        str: Layer parameters description
    """
    parts = [f"in_channels={self.in_channels}", f"out_channels={self.out_channels}", f"backend={self.backend}"]
    if self.layout is not None:
        parts.append(f"grades={self.layout.grades}")
    return ", ".join(parts)

RotorGadget

Bases: CliffordModule

Rotor-based linear transformation (Generalized Rotor Gadget).

Replaces standard linear layers with parameter-efficient rotor-sandwich transformations. Instead of using O(in_channels x out_channels) parameters, this uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number of basis vectors in the Clifford algebra.
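
To make the parameter comparison concrete, the sketch below counts the dense channel-mixing weight against the bivector parameters of the gadget. It assumes, as in the source shown on this page, that each rotor pair stores one left and one right bivector with n(n-1)/2 coefficients each. The helper names are hypothetical, and the optional bias and learned aggregation weights are left out of the count.

```python
def dense_weight_params(in_channels, out_channels):
    """Size of a dense [out_channels, in_channels] weight matrix."""
    return in_channels * out_channels


def rotor_gadget_bivector_params(num_rotor_pairs, n):
    """Bivector coefficients for left and right rotors of every pair."""
    num_bivectors = n * (n - 1) // 2  # number of grade-2 basis blades
    return 2 * num_rotor_pairs * num_bivectors


dense = dense_weight_params(64, 64)
gadget = rotor_gadget_bivector_params(num_rotor_pairs=4, n=3)
```

Note that the gadget's count does not depend on the channel counts at all, only on the number of pairs and the algebra.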

Architecture:
  1. Partition input channels into blocks
  2. For each rotor pair (i, j), apply the rotor sandwich: r_ij . x_i . s_ij.H
  3. Pool/aggregate results to output channels

The transformation is: psi(x) = r.x.s.H where r, s are rotors (bivector exponentials).
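
As a worked illustration of the sandwich r.x.s.H, the sketch below hard-codes the geometric product of the 2D Euclidean algebra Cl(2,0) in plain Python. The component ordering (1, e1, e2, e12) and all function names are assumptions made for this example; the library's own blade ordering and `exp` implementation may differ. With r = s the sandwich is an ordinary rotation, while distinct left and right rotors give the more general two-sided map that the gadget learns.

```python
import math


def gp(a, b):
    """Geometric product in Cl(2,0), components ordered (1, e1, e2, e12)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 + a1 * b1 + a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 - a2 * b3 + a3 * b2,
        a0 * b2 + a2 * b0 + a1 * b3 - a3 * b1,
        a0 * b3 + a3 * b0 + a1 * b2 - a2 * b1,
    )


def reverse(a):
    """Reversion flips the sign of the grade-2 part."""
    return (a[0], a[1], a[2], -a[3])


def rotor(theta):
    """exp(-B/2) for the bivector B = theta * e12 (closed form, since e12^2 = -1)."""
    return (math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2))


def sandwich(r, x, s):
    """Two-sided product r * x * reverse(s)."""
    return gp(gp(r, x), reverse(s))


e1 = (0.0, 1.0, 0.0, 0.0)
r = rotor(math.pi / 2)
rotated = sandwich(r, e1, r)                # r = s: e1 is rotated onto e2
rotated_half = sandwich(r, e1, rotor(0.0))  # s = identity: only the left rotor acts
```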

Attributes:

Name Type Description
algebra AlgebraLike

CliffordAlgebra instance

in_channels

Number of input channels

out_channels

Number of output channels

num_rotor_pairs

Number of rotor pairs to use

aggregation

Aggregation method ('mean', 'sum', or 'learned')

Source code in layers/primitives/rotor_gadget.py
class RotorGadget(CliffordModule):
    """Rotor-based linear transformation (Generalized Rotor Gadget).

    Replaces standard linear layers with parameter-efficient rotor-sandwich
    transformations. Instead of using O(in_channels x out_channels) parameters,
    this uses O(num_rotor_pairs x n(n-1)/2) parameters where n is the number
    of basis vectors in the Clifford algebra.

    Architecture:
        1. Partition input channels into blocks
        2. For each rotor pair (i, j):
           - Apply rotor sandwich: r_ij . x_i . s_ij.H
        3. Pool/aggregate results to output channels

    The transformation is: psi(x) = r.x.s.H where r, s are rotors (bivector exponentials).

    Attributes:
        algebra: CliffordAlgebra instance
        in_channels: Number of input channels
        out_channels: Number of output channels
        num_rotor_pairs: Number of rotor pairs to use
        aggregation: Aggregation method ('mean', 'sum', or 'learned')
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        in_channels: int,
        out_channels: int,
        num_rotor_pairs: int = 4,
        aggregation: Literal["mean", "sum", "learned"] = "mean",
        shuffle: Literal["none", "fixed", "random"] = "none",
        bias: bool = False,
    ):
        """Initialize rotor gadget layer.

        Args:
            algebra: CliffordAlgebra instance
            in_channels: Number of input channels
            out_channels: Number of output channels
            num_rotor_pairs: Number of rotor pairs (higher = more expressive)
            aggregation: How to pool rotor outputs ('mean', 'sum', 'learned')
            shuffle: Input channel shuffle strategy:
                - 'none': No shuffle, sequential block assignment (default)
                - 'fixed': Random permutation at initialization (fixed during training)
                - 'random': Random permutation each forward pass (regularization)
            bias: Whether to include bias term (applied after transformation)
        """
        super().__init__(algebra)
        if not hasattr(algebra, "per_channel_sandwich"):
            raise ValueError("RotorGadget is dense-only and requires CliffordAlgebra.")

        self.in_channels = require_positive_int(in_channels, "in_channels")
        self.out_channels = require_positive_int(out_channels, "out_channels")
        self.num_rotor_pairs = require_positive_int(num_rotor_pairs, "num_rotor_pairs")
        self.aggregation = require_choice(aggregation, "aggregation", ("mean", "sum", "learned"))
        self.shuffle = require_choice(shuffle, "shuffle", ("none", "fixed", "random"))

        if algebra.num_grades <= 2:
            raise ValueError("Algebra has no bivectors. RotorGadget requires at least one bivector for rotation.")
        self.register_buffer("bivector_indices", grade_indices(algebra, 2, name="bivector grade"))
        self.num_bivectors = self.bivector_indices.numel()

        # Rotor parameters: bivector coefficients for exponential map
        # Left rotors: [num_rotor_pairs, num_bivectors]
        self.bivector_left = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
        tag_manifold(self.bivector_left, MANIFOLD_SPIN)
        # Right rotors: [num_rotor_pairs, num_bivectors]
        self.bivector_right = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
        tag_manifold(self.bivector_right, MANIFOLD_SPIN)

        # Channel routing: block diagonal partitioning (paper style)
        # Each rotor pair processes a subset of input channels
        self._setup_channel_routing()

        # Aggregation weights (if learned)
        if self.aggregation == "learned":
            self.agg_weights = nn.Parameter(torch.ones(self.num_rotor_pairs, self.out_channels) / self.num_rotor_pairs)
        else:
            self.register_buffer("agg_weights", None)

        # Optional bias
        if bias:
            self.bias = nn.Parameter(torch.zeros(self.out_channels, algebra.dim))
        else:
            self.register_buffer("bias", None)

        # Rotor cache for eval mode
        self._cached_rotors = None

    def _setup_channel_routing(self):
        """Set up block diagonal channel routing with optional shuffle.

        Partitions input and output channels into blocks, where each rotor
        pair operates on a specific block. Optionally shuffles input channels
        before routing for regularization.
        """
        in_assignment = torch.div(
            torch.arange(self.in_channels) * self.num_rotor_pairs,
            self.in_channels,
            rounding_mode="floor",
        ).clamp_max(self.num_rotor_pairs - 1)
        out_assignment = torch.div(
            torch.arange(self.out_channels) * self.num_rotor_pairs,
            self.out_channels,
            rounding_mode="floor",
        ).clamp_max(self.num_rotor_pairs - 1)

        in_indices = []
        out_indices = []
        for i in range(self.num_rotor_pairs):
            in_members = (in_assignment == i).nonzero(as_tuple=False).flatten()
            out_members = (out_assignment == i).nonzero(as_tuple=False).flatten()
            if in_members.numel() == 0:
                in_indices.append((self.in_channels, self.in_channels))
            else:
                in_indices.append((int(in_members[0]), int(in_members[-1]) + 1))
            if out_members.numel() == 0:
                out_indices.append((self.out_channels, self.out_channels))
            else:
                out_indices.append((int(out_members[0]), int(out_members[-1]) + 1))

        self.in_indices = in_indices
        self.out_indices = out_indices

        ch2pair = in_assignment.to(dtype=torch.long)
        self.register_buffer("_ch2pair", ch2pair)
        self.register_buffer("_channel_mix_mean", channel_mix(self.in_channels, self.out_channels, normalize=True))
        self.register_buffer("_channel_mix_sum", channel_mix(self.in_channels, self.out_channels, normalize=False))
        self.register_buffer("_pair_mean", pair_mean(ch2pair, self.num_rotor_pairs))

        # Set up channel shuffle permutation
        if self.shuffle == "fixed":
            # Create fixed random permutation at initialization
            perm = torch.randperm(self.in_channels)
            self.register_buffer("channel_permutation", perm)
        elif self.shuffle == "random":
            # Random shuffle each forward pass - no fixed permutation
            self.register_buffer("channel_permutation", None)
        else:  # 'none'
            # No shuffle - identity permutation
            self.register_buffer("channel_permutation", None)

    def _bivector_to_multivector(self, bivector_coeffs: torch.Tensor) -> torch.Tensor:
        """Convert bivector coefficients to full multivector via vectorized scatter.

        Args:
            bivector_coeffs: Tensor of shape [..., num_bivectors]

        Returns:
            Multivector tensor of shape [..., algebra.dim]
        """
        return dense_from_indices(bivector_coeffs, self.bivector_indices, self.algebra.dim)

    def _compute_rotors(self, device=None, dtype=None):
        """Compute rotor multivectors from bivector parameters.

        Returns:
            Tuple of (left_rotors, right_rotors_reversed) where each is
            a tensor of shape [num_rotor_pairs, algebra.dim]
        """
        left = self.bivector_left
        right = self.bivector_right
        if device is not None or dtype is not None:
            left = left.to(device=device, dtype=dtype)
            right = right.to(device=device, dtype=dtype)

        # Convert bivector parameters to multivectors
        B_left = self._bivector_to_multivector(left)  # [pairs, dim]
        B_right = self._bivector_to_multivector(right)  # [pairs, dim]

        # Compute rotors via exponential map: R = exp(-0.5 * B)
        R_left = self.algebra.exp(-0.5 * B_left)  # [pairs, dim]
        R_right = self.algebra.exp(-0.5 * B_right)  # [pairs, dim]

        # Compute reverse of right rotors for sandwich product
        R_right_rev = self.algebra.reverse(R_right)  # [pairs, dim]

        return R_left, R_right_rev

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Apply rotor-based transformation.

        Uses batched geometric products - all rotor pairs are applied in
        parallel via a single pair of GP calls.

        Args:
            x: Input tensor of shape [Batch, In_Channels, Dim]

        Returns:
            Output tensor of shape [Batch, Out_Channels, Dim]
        """
        check_multivector(x, self.algebra, "RotorGadget input")
        check_channels(x, self.in_channels, "RotorGadget input")

        # Apply input channel shuffle if enabled
        if self.shuffle == "fixed":
            x = x.index_select(-2, self.channel_permutation)
        elif self.shuffle == "random":
            perm = torch.randperm(self.in_channels, device=x.device)
            x = x.index_select(-2, perm)

        # Compute rotors (cached in eval mode)
        if not self.training and cache_matches(self._cached_rotors, x):
            R_left, R_right_rev = self._cached_rotors
        else:
            R_left, R_right_rev = self._compute_rotors(x.device, x.dtype)
            if not self.training:
                self._cached_rotors = (R_left, R_right_rev)

        ch2pair = self._ch2pair.to(device=R_left.device)
        R_left_by_channel = R_left[ch2pair]
        R_right_by_channel = R_right_rev[ch2pair]
        concat_out = self.algebra.per_channel_sandwich(R_left_by_channel, x, R_right_by_channel)

        # Map to output channels
        out = self._aggregate_to_output_channels(concat_out)

        if self.bias is not None:
            bias_shape = (1,) * (out.ndim - 2) + (self.out_channels, self.algebra.dim)
            out = out + self.bias.view(bias_shape)

        return out

    def _aggregate_to_output_channels(self, x: torch.Tensor) -> torch.Tensor:
        """Aggregate rotor pair outputs to match output channel count.

        Args:
            x: Concatenated outputs from rotor pairs [B, total_channels, dim]

        Returns:
            Aggregated output [B, out_channels, dim]
        """
        if self.aggregation == "learned":
            pair_values = torch.einsum("ki,...id->...kd", self._pair_mean.to(device=x.device, dtype=x.dtype), x)
            return torch.einsum("ko,...kd->...od", self.agg_weights.to(device=x.device, dtype=x.dtype), pair_values)

        mix = self._channel_mix_sum if self.aggregation == "sum" else self._channel_mix_mean
        return torch.einsum("oi,...id->...od", mix.to(device=x.device, dtype=x.dtype), x)

    def train(self, mode: bool = True):
        """Override to invalidate rotor cache when switching to train mode."""
        if mode:
            self._cached_rotors = None
        return super().train(mode)

    def extra_repr(self) -> str:
        """String representation for debugging."""
        return (
            f"in_channels={self.in_channels}, "
            f"out_channels={self.out_channels}, "
            f"num_rotor_pairs={self.num_rotor_pairs}, "
            f"aggregation={self.aggregation}, "
            f"shuffle={self.shuffle}, "
            f"bias={self.bias is not None}"
        )
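
The block assignment computed in `_setup_channel_routing` (floor division followed by a clamp) can be reproduced in plain Python. This is a minimal sketch of that formula only; `block_assignment` is a hypothetical name, and the shuffle and aggregation buffers are not modeled.

```python
def block_assignment(num_channels, num_rotor_pairs):
    """Map each channel index to a rotor pair: floor(i * pairs / channels), clamped."""
    return [
        min(i * num_rotor_pairs // num_channels, num_rotor_pairs - 1)
        for i in range(num_channels)
    ]


even = block_assignment(8, 4)     # channels divide evenly into pairs
uneven = block_assignment(10, 4)  # block sizes differ by at most one
sparse = block_assignment(2, 4)   # fewer channels than pairs: some pairs stay empty
```

When there are fewer channels than rotor pairs, some pairs receive no channels, which is why the source records an empty index range for them.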

__init__(algebra, in_channels, out_channels, num_rotor_pairs=4, aggregation='mean', shuffle='none', bias=False)

Initialize rotor gadget layer.

Parameters:

Name Type Description Default
algebra CliffordAlgebra

CliffordAlgebra instance

required
in_channels int

Number of input channels

required
out_channels int

Number of output channels

required
num_rotor_pairs int

Number of rotor pairs (higher = more expressive)

4
aggregation Literal['mean', 'sum', 'learned']

How to pool rotor outputs ('mean', 'sum', 'learned')

'mean'
shuffle Literal['none', 'fixed', 'random']

Input channel shuffle strategy: 'none' applies no shuffle and assigns blocks sequentially (default); 'fixed' applies one random permutation chosen at initialization and kept fixed during training; 'random' draws a new permutation on every forward pass (regularization).

'none'
bias bool

Whether to include bias term (applied after transformation)

False
Source code in layers/primitives/rotor_gadget.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    in_channels: int,
    out_channels: int,
    num_rotor_pairs: int = 4,
    aggregation: Literal["mean", "sum", "learned"] = "mean",
    shuffle: Literal["none", "fixed", "random"] = "none",
    bias: bool = False,
):
    """Initialize rotor gadget layer.

    Args:
        algebra: CliffordAlgebra instance
        in_channels: Number of input channels
        out_channels: Number of output channels
        num_rotor_pairs: Number of rotor pairs (higher = more expressive)
        aggregation: How to pool rotor outputs ('mean', 'sum', 'learned')
        shuffle: Input channel shuffle strategy:
            - 'none': No shuffle, sequential block assignment (default)
            - 'fixed': Random permutation at initialization (fixed during training)
            - 'random': Random permutation each forward pass (regularization)
        bias: Whether to include bias term (applied after transformation)
    """
    super().__init__(algebra)
    if not hasattr(algebra, "per_channel_sandwich"):
        raise ValueError("RotorGadget is dense-only and requires CliffordAlgebra.")

    self.in_channels = require_positive_int(in_channels, "in_channels")
    self.out_channels = require_positive_int(out_channels, "out_channels")
    self.num_rotor_pairs = require_positive_int(num_rotor_pairs, "num_rotor_pairs")
    self.aggregation = require_choice(aggregation, "aggregation", ("mean", "sum", "learned"))
    self.shuffle = require_choice(shuffle, "shuffle", ("none", "fixed", "random"))

    if algebra.num_grades <= 2:
        raise ValueError("Algebra has no bivectors. RotorGadget requires at least one bivector for rotation.")
    self.register_buffer("bivector_indices", grade_indices(algebra, 2, name="bivector grade"))
    self.num_bivectors = self.bivector_indices.numel()

    # Rotor parameters: bivector coefficients for exponential map
    # Left rotors: [num_rotor_pairs, num_bivectors]
    self.bivector_left = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
    tag_manifold(self.bivector_left, MANIFOLD_SPIN)
    # Right rotors: [num_rotor_pairs, num_bivectors]
    self.bivector_right = nn.Parameter(torch.randn(self.num_rotor_pairs, self.num_bivectors) * 0.1)
    tag_manifold(self.bivector_right, MANIFOLD_SPIN)

    # Channel routing: block diagonal partitioning (paper style)
    # Each rotor pair processes a subset of input channels
    self._setup_channel_routing()

    # Aggregation weights (if learned)
    if self.aggregation == "learned":
        self.agg_weights = nn.Parameter(torch.ones(self.num_rotor_pairs, self.out_channels) / self.num_rotor_pairs)
    else:
        self.register_buffer("agg_weights", None)

    # Optional bias
    if bias:
        self.bias = nn.Parameter(torch.zeros(self.out_channels, algebra.dim))
    else:
        self.register_buffer("bias", None)

    # Rotor cache for eval mode
    self._cached_rotors = None

forward(x)

Apply rotor-based transformation.

Uses batched geometric products - all rotor pairs are applied in parallel via a single pair of GP calls.

Parameters:

Name Type Description Default
x Tensor

Input tensor of shape [Batch, In_Channels, Dim]

required

Returns:

Type Description
Tensor

Output tensor of shape [Batch, Out_Channels, Dim]

Source code in layers/primitives/rotor_gadget.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Apply rotor-based transformation.

    Uses batched geometric products - all rotor pairs are applied in
    parallel via a single pair of GP calls.

    Args:
        x: Input tensor of shape [Batch, In_Channels, Dim]

    Returns:
        Output tensor of shape [Batch, Out_Channels, Dim]
    """
    check_multivector(x, self.algebra, "RotorGadget input")
    check_channels(x, self.in_channels, "RotorGadget input")

    # Apply input channel shuffle if enabled
    if self.shuffle == "fixed":
        x = x.index_select(-2, self.channel_permutation)
    elif self.shuffle == "random":
        perm = torch.randperm(self.in_channels, device=x.device)
        x = x.index_select(-2, perm)

    # Compute rotors (cached in eval mode)
    if not self.training and cache_matches(self._cached_rotors, x):
        R_left, R_right_rev = self._cached_rotors
    else:
        R_left, R_right_rev = self._compute_rotors(x.device, x.dtype)
        if not self.training:
            self._cached_rotors = (R_left, R_right_rev)

    ch2pair = self._ch2pair.to(device=R_left.device)
    R_left_by_channel = R_left[ch2pair]
    R_right_by_channel = R_right_rev[ch2pair]
    concat_out = self.algebra.per_channel_sandwich(R_left_by_channel, x, R_right_by_channel)

    # Map to output channels
    out = self._aggregate_to_output_channels(concat_out)

    if self.bias is not None:
        bias_shape = (1,) * (out.ndim - 2) + (self.out_channels, self.algebra.dim)
        out = out + self.bias.view(bias_shape)

    return out

train(mode=True)

Override to invalidate rotor cache when switching to train mode.

Source code in layers/primitives/rotor_gadget.py
def train(self, mode: bool = True):
    """Override to invalidate rotor cache when switching to train mode."""
    if mode:
        self._cached_rotors = None
    return super().train(mode)
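
The caching behaviour can be illustrated without PyTorch. The class below is a hypothetical stand-in that mirrors the control flow shown above: rotors are recomputed on every call in training mode, computed once and reused in eval mode, and the cache is dropped when training mode is re-enabled. The device and dtype check done by `cache_matches` is not modeled here.

```python
class CachedRotorSketch:
    """Toy model of the eval-mode rotor cache (not the library class)."""

    def __init__(self):
        self.training = True
        self._cached = None
        self.compute_calls = 0

    def _compute(self):
        # Placeholder for the exponential map from bivector parameters.
        self.compute_calls += 1
        return ("R_left", "R_right_rev")

    def rotors(self):
        if not self.training and self._cached is not None:
            return self._cached
        value = self._compute()
        if not self.training:
            self._cached = value
        return value

    def train(self, mode=True):
        if mode:
            self._cached = None  # parameters may change again, so drop the cache
        self.training = mode
        return self


m = CachedRotorSketch()
m.rotors(); m.rotors()   # training mode: computed twice
m.train(False)
m.rotors(); m.rotors()   # eval mode: computed once, then served from the cache
calls_after_eval = m.compute_calls
m.train(True)
cache_after_train = m._cached
```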

extra_repr()

String representation for debugging.

Source code in layers/primitives/rotor_gadget.py
def extra_repr(self) -> str:
    """String representation for debugging."""
    return (
        f"in_channels={self.in_channels}, "
        f"out_channels={self.out_channels}, "
        f"num_rotor_pairs={self.num_rotor_pairs}, "
        f"aggregation={self.aggregation}, "
        f"shuffle={self.shuffle}, "
        f"bias={self.bias is not None}"
    )

CliffordLayerNorm

Bases: CliffordModule

Geometric LayerNorm that preserves direction and recovers scale.

Normalizes the multivector to unit norm (preserving geometric direction), then injects the original log-magnitude into the scalar (grade-0) part via a learnable gate.

Attributes:

Name Type Description
weight Parameter

Per-channel direction scale [C].

bias Parameter

Per-channel scalar bias [C].

norm_scale Parameter

Per-channel gate for log-magnitude injection into grade-0. Initialized to zero so the layer starts identical to the old (scale-discarding) behaviour.

Source code in layers/primitives/normalization.py
class CliffordLayerNorm(CliffordModule):
    """Geometric LayerNorm that preserves direction and recovers scale.

    Normalizes the multivector to unit norm (preserving geometric direction),
    then injects the original log-magnitude into the scalar (grade-0) part
    via a learnable gate.

    Attributes:
        weight (nn.Parameter): Per-channel direction scale [C].
        bias (nn.Parameter): Per-channel scalar bias [C].
        norm_scale (nn.Parameter): Per-channel gate for log-magnitude
            injection into grade-0.  Initialized to zero so the layer
            starts identical to the old (scale-discarding) behaviour.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        eps: float = 1e-6,
        recover: bool = True,
        *,
        grades=None,
        layout: GradeLayout = None,
    ):
        """Sets up normalization.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Features.
            eps (float): Stability term.
            recover (bool): Whether to inject original scale into the scalar part.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        if eps <= 0:
            raise ValueError(f"eps must be positive, got {eps}")
        self.eps = eps
        self.recover = recover
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        self.weight = nn.Parameter(torch.ones(self.channels))
        self.bias = nn.Parameter(torch.zeros(self.channels))
        self.register_buffer("scalar_mask", self.storage.scalar_mask())
        if recover:
            self.norm_scale = nn.Parameter(torch.zeros(self.channels))
        else:
            self.register_buffer("norm_scale", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Normalizes energy, preserves direction, optionally recovers scale in grade-0.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Normalized input.
        """
        self.storage.validate_input(
            x,
            channels=self.channels,
            name="CliffordLayerNorm input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )
        channel_shape = (1,) * (x.ndim - 2) + (self.channels, 1)

        norm = x.norm(dim=-1, keepdim=True).clamp_min(self.eps)
        x_normalized = x / norm
        out = x_normalized * self.weight.view(channel_shape)

        g0 = self.scalar_mask
        if g0.device != x.device or g0.dtype != x.dtype:
            g0 = g0.to(device=x.device, dtype=x.dtype)
        out = out + self.bias.view(channel_shape) * g0

        if self.recover:
            log_norm = torch.log1p(norm)
            out = out + self.norm_scale.view(channel_shape) * log_norm * g0

        return out

__init__(algebra, channels, eps=1e-06, recover=True, *, grades=None, layout=None)

Sets up normalization.

Parameters:

Name Type Description Default
algebra CliffordAlgebra

The algebra instance.

required
channels int

Number of channels.

required
eps float

Lower bound applied to the norm before division, for numerical stability. Must be positive.

1e-06
recover bool

Whether to inject original scale into the scalar part.

True
Source code in layers/primitives/normalization.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    eps: float = 1e-6,
    recover: bool = True,
    *,
    grades=None,
    layout: GradeLayout = None,
):
    """Sets up normalization.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Features.
        eps (float): Stability term.
        recover (bool): Whether to inject original scale into the scalar part.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    if eps <= 0:
        raise ValueError(f"eps must be positive, got {eps}")
    self.eps = eps
    self.recover = recover
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    self.weight = nn.Parameter(torch.ones(self.channels))
    self.bias = nn.Parameter(torch.zeros(self.channels))
    self.register_buffer("scalar_mask", self.storage.scalar_mask())
    if recover:
        self.norm_scale = nn.Parameter(torch.zeros(self.channels))
    else:
        self.register_buffer("norm_scale", None)

forward(x)

Normalizes energy, preserves direction, optionally recovers scale in grade-0.

Parameters:

Name Type Description Default
x Tensor

Input [Batch, Channels, Dim].

required

Returns:

Type Description
Tensor

torch.Tensor: Normalized input.

Source code in layers/primitives/normalization.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Normalizes energy, preserves direction, optionally recovers scale in grade-0.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Normalized input.
    """
    self.storage.validate_input(
        x,
        channels=self.channels,
        name="CliffordLayerNorm input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )
    channel_shape = (1,) * (x.ndim - 2) + (self.channels, 1)

    norm = x.norm(dim=-1, keepdim=True).clamp_min(self.eps)
    x_normalized = x / norm
    out = x_normalized * self.weight.view(channel_shape)

    g0 = self.scalar_mask
    if g0.device != x.device or g0.dtype != x.dtype:
        g0 = g0.to(device=x.device, dtype=x.dtype)
    out = out + self.bias.view(channel_shape) * g0

    if self.recover:
        log_norm = torch.log1p(norm)
        out = out + self.norm_scale.view(channel_shape) * log_norm * g0

    return out
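
For a single multivector in one channel, the forward pass reduces to a few lines of arithmetic. The sketch below is a plain-Python approximation that assumes the scalar (grade-0) coefficient is stored at index 0; the library uses `scalar_mask` for this, and the function name `clifford_layer_norm` is hypothetical.

```python
import math


def clifford_layer_norm(x, weight=1.0, bias=0.0, norm_scale=0.0, eps=1e-6):
    """Normalize one multivector and optionally re-inject its log-magnitude."""
    norm = max(math.sqrt(sum(c * c for c in x)), eps)  # clamp avoids division by zero
    out = [weight * c / norm for c in x]               # unit direction, rescaled
    # Bias and the gated log-magnitude only touch the scalar part (index 0 here).
    out[0] += bias + norm_scale * math.log1p(norm)
    return out


x = [3.0, 4.0, 0.0, 0.0]
plain = clifford_layer_norm(x)                       # norm_scale = 0: scale is discarded
recovered = clifford_layer_norm(x, norm_scale=1.0)   # scalar part now encodes the magnitude
scaled = clifford_layer_norm([10.0 * c for c in x])  # same direction, 10x magnitude
```

With `norm_scale` at its initial value of zero, rescaling the input leaves the output essentially unchanged; a non-zero gate lets the magnitude reach later layers through the scalar part.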

BladeSelector

Bases: CliffordModule

Learnable per-component gate.

Holds one gate logit per channel and per stored component. Each coefficient is scaled by a factor in (0, 2), so less relevant components can be suppressed and useful ones amplified.

Attributes:

Name Type Description
weights Parameter

Gate logits [Channels, Dim].

Source code in layers/primitives/projection.py
class BladeSelector(CliffordModule):
    """Learnable per-component gate.

    Holds one gate logit per channel and per stored component. Each
    coefficient is scaled by a factor in (0, 2), so less relevant
    components can be suppressed and useful ones amplified.

    Attributes:
        weights (nn.Parameter): Gate logits [Channels, Dim].
    """

    def __init__(self, algebra: CliffordAlgebra, channels: int, *, grades=None, layout: GradeLayout = None):
        """Sets up the selector.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            channels (int): Input features.
        """
        super().__init__(algebra)
        self.channels = require_positive_int(channels, "channels")
        self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
        self.layout = self.storage.layout
        self.lane_dim = self.storage.lane_dim

        self.weights = nn.Parameter(torch.Tensor(self.channels, self.lane_dim))

        self.reset_parameters()

    def reset_parameters(self):
        """Initialize logits so the selector starts as pass-through."""
        nn.init.zeros_(self.weights)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Gates the grades.

        The gate is ``2 * sigmoid(weights)`` so zero logits preserve the input.

        Args:
            x (torch.Tensor): Input [Batch, Channels, Dim].

        Returns:
            torch.Tensor: Filtered input.
        """
        self.storage.validate_input(
            x,
            channels=self.channels,
            name="BladeSelector input",
            allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
        )
        gate_shape = (1,) * (x.ndim - 2) + (self.channels, self.lane_dim)
        gate = 2.0 * torch.sigmoid(self.weights).view(gate_shape)
        return x * gate

__init__(algebra, channels, *, grades=None, layout=None)

Sets up the selector.

Parameters:

Name Type Description Default
algebra CliffordAlgebra

The algebra instance.

required
channels int

Input features.

required
Source code in layers/primitives/projection.py
def __init__(self, algebra: CliffordAlgebra, channels: int, *, grades=None, layout: GradeLayout = None):
    """Sets up the selector.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input features.
    """
    super().__init__(algebra)
    self.channels = require_positive_int(channels, "channels")
    self.storage = resolve_layer_storage(algebra, layout=layout, grades=grades)
    self.layout = self.storage.layout
    self.lane_dim = self.storage.lane_dim

    self.weights = nn.Parameter(torch.Tensor(self.channels, self.lane_dim))

    self.reset_parameters()

reset_parameters()

Initialize logits so the selector starts as pass-through.

Source code in layers/primitives/projection.py
def reset_parameters(self):
    """Initialize logits so the selector starts as pass-through."""
    nn.init.zeros_(self.weights)

forward(x)

Gates the grades.

The gate is 2 * sigmoid(weights) so zero logits preserve the input.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input [Batch, Channels, Dim]. | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Filtered input. |

Source code in layers/primitives/projection.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Gates the grades.

    The gate is ``2 * sigmoid(weights)`` so zero logits preserve the input.

    Args:
        x (torch.Tensor): Input [Batch, Channels, Dim].

    Returns:
        torch.Tensor: Filtered input.
    """
    self.storage.validate_input(
        x,
        channels=self.channels,
        name="BladeSelector input",
        allow_dense=self.layout is None or self.layout.dim == self.algebra.dim,
    )
    gate_shape = (1,) * (x.ndim - 2) + (self.channels, self.lane_dim)
    gate = 2.0 * torch.sigmoid(self.weights).view(gate_shape)
    return x * gate
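
The pass-through initialization can be checked in isolation. The following is a minimal sketch of the gate arithmetic only, assuming a dense `[Batch, Channels, Dim]` input; the sizes are illustrative and the storage and validation helpers of the real layer are left out.

```python
import torch

channels, lane_dim = 3, 8                    # assumed sizes (e.g. an 8-blade algebra)
weights = torch.zeros(channels, lane_dim)    # logits as left by reset_parameters()

x = torch.randn(4, channels, lane_dim)       # [Batch, Channels, Dim]

# Same broadcasting rule as forward(): prepend singleton axes for the batch dims.
gate_shape = (1,) * (x.ndim - 2) + (channels, lane_dim)
gate = 2.0 * torch.sigmoid(weights).view(gate_shape)

y = x * gate
passthrough = torch.allclose(y, x)           # sigmoid(0) = 0.5, so the gate is exactly 1

# After training, a lane's gain can move anywhere in the open interval (0, 2).
trained = weights.clone()
trained[0, 1] = -20.0                        # strongly negative logit: lane suppressed
trained[0, 2] = 20.0                         # strongly positive logit: gain approaches 2
gate_trained = 2.0 * torch.sigmoid(trained)
```

Because the gate is bounded by 2 rather than 1, the selector can both attenuate and mildly amplify individual lanes while still starting from the identity.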

Blocks

GeometricProductAttention

Bases: CliffordModule

Multi-head attention using geometric product scoring.

Standard attention: score(Q, K) = <Q, K> / sqrt(d) (scalar only)

GA attention:

    product = Q_c * reverse(K_c)    (geometric product per head-channel)
    score   = (<product>_0 + lambda_ * ||<product>_2||_F) / sqrt(H_c * dim)

The grade-0 (scalar) part measures alignment (like a dot product). The grade-2 (bivector) part measures relative orientation, which scalar-only attention does not capture.
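
To make the two score components concrete, here is a small pure-Python worked example in Cl(2,0). It is a sketch under stated assumptions: a Euclidean signature, bitmask blade ordering `[1, e1, e2, e12]`, a single head-channel, and helper names (`reorder_sign`, `geometric_product`) that are illustrative rather than part of the library.

```python
import math

def reorder_sign(a: int, b: int) -> int:
    """Sign picked up when the basis vectors of blade a then blade b are sorted."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps & 1 else 1

def reverse(mv):
    """Reverse: a grade-k blade is multiplied by (-1)^(k(k-1)/2)."""
    return [c * (-1) ** (bin(i).count("1") * (bin(i).count("1") - 1) // 2)
            for i, c in enumerate(mv)]

def geometric_product(x, y):
    """Geometric product, assuming every basis vector squares to +1."""
    out = [0.0] * len(x)
    for a, xa in enumerate(x):
        for b, yb in enumerate(y):
            out[a ^ b] += reorder_sign(a, b) * xa * yb
    return out

# Two grade-1 multivectors: q = 3 e1 + 4 e2, k = 1 e1 + 2 e2
q = [0.0, 3.0, 4.0, 0.0]
k = [0.0, 1.0, 2.0, 0.0]

product = geometric_product(q, reverse(k))
scalar_part = product[0]            # q . k = 3*1 + 4*2
bivector_norm = abs(product[3])     # |q ^ k| = |3*2 - 4*1|

lambda_ = 0.5                       # default bivector_weight
scale = math.sqrt(1 * 4)            # H_c = 1 channel, dim = 4 blades
score = (scalar_part + lambda_ * bivector_norm) / scale
```

For two vectors the scalar part is the ordinary dot product and the bivector part is the signed area they span, so parallel vectors score through grade 0 and perpendicular ones through grade 2.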

Memory: materializing the pairwise product naively requires a [B, H, L, L, H_c, D] tensor, which is too large. The query axis L_q is therefore processed in blocks of _BLOCK_SIZE to bound peak VRAM.
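
A back-of-the-envelope calculation shows the scaling. All sizes below are assumptions chosen for round numbers, and `block` is a hypothetical stand-in because the actual value of `_BLOCK_SIZE` is not shown on this page. The estimate covers only the pairwise-product shape named above, not the temporaries of the real scoring path.

```python
# Assumed problem size (illustrative only)
B, H, L, Hc, D = 8, 4, 1024, 8, 16
block = 64                 # hypothetical query block size
bytes_per_float = 4        # float32

# Full pairwise product: every query position against every key position
naive_elems = B * H * L * L * Hc * D

# Same tensor restricted to one block of query positions
block_elems = B * H * block * L * Hc * D

naive_gib = naive_elems * bytes_per_float / 2**30
block_gib = block_elems * bytes_per_float / 2**30
reduction = naive_elems // block_elems     # equals L / block
```

The saving is linear in `L / block`, which is why chunking over the query axis alone is enough to keep the peak bounded as the sequence grows.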

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| num_heads | int | Number of attention heads. |
| head_channels | int | Channels per head. |
| causal | bool | If True, apply autoregressive causal mask. |
| bivector_weight | float | lambda_, the weight of the bivector score component. |

Source code in layers/blocks/attention.py
class GeometricProductAttention(CliffordModule):
    """Multi-head attention using geometric product scoring.

    Standard attention: score(Q, K) = <Q, K> / sqrt(d)  (scalar only)

    GA attention:
        product = Q_c * reverse(K_c)    (geometric product per head-channel)
        score   = (<product>_0 + lambda_ * ||<product>_2||_F) / sqrt(H_c * dim)

    The grade-0 (scalar) part measures alignment (like dot product).
    The grade-2 (bivector) part measures relative orientation - novel.

    Memory: naive [B, H, L, L, H_c, D] is too large. We chunk over L_q
    in blocks of BLOCK_SIZE to bound peak VRAM.

    Attributes:
        num_heads (int): Number of attention heads.
        head_channels (int): Channels per head.
        causal (bool): If True, apply autoregressive causal mask.
        bivector_weight (float): lambda_ - weight of bivector score component.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_heads: int,
        causal: bool = True,
        bivector_weight: float = 0.5,
        dropout: float = 0.0,
        score_blade_chunk_size: int = _G2_BLADE_CHUNK_SIZE,
        score_precompute_limit: int = _SCORE_PRECOMPUTE_LIMIT,
    ):
        """Sets up geometric product attention.

        Args:
            algebra: Clifford algebra instance.
            channels: Total number of multivector channels.
            num_heads: Number of attention heads.
            causal: Apply causal mask for autoregressive generation.
            bivector_weight: lambda_ weight on bivector score component.
            dropout: Dropout rate on attention weights.
            score_blade_chunk_size: Grade-2 output blades processed per dense
                chunk when exact dense scoring is used.
            score_precompute_limit: Maximum temporary ``K_g2`` elements allowed
                before exact dense scoring switches to chunked grade-2 blades.
        """
        super().__init__(algebra)
        assert channels % num_heads == 0, f"channels ({channels}) must be divisible by num_heads ({num_heads})"

        self.channels = channels
        self.num_heads = num_heads
        self.head_channels = channels // num_heads
        self.causal = causal
        self.bivector_weight = bivector_weight
        self.score_blade_chunk_size = max(1, int(score_blade_chunk_size))
        self.score_precompute_limit = max(0, int(score_precompute_limit))

        # Q, K, V projections operate on [B*L, channels, dim]
        self.q_proj = CliffordLinear(algebra, channels, channels)
        self.k_proj = CliffordLinear(algebra, channels, channels)
        self.v_proj = CliffordLinear(algebra, channels, channels)
        self.out_proj = CliffordLinear(algebra, channels, channels)

        self.attn_dropout = nn.Dropout(dropout) if dropout > 0.0 else None

        # Precompute bilinear score routes (replaces pairwise geometric product)
        self._precompute_score_tables()

    def _precompute_score_tables(self):
        """Precompute exact dense attention score routes."""
        alg = self.algebra
        D = alg.dim

        if not hasattr(alg, "gp_signs") or not hasattr(alg, "rev_signs"):
            raise ValueError("GeometricProductAttention currently requires dense CliffordAlgebra inputs.")

        # Grade-0 metric: metric_rev[a] = gp_signs[a, 0] * rev_signs[a]
        # gp_signs[a, 0] is the sign when A[a] * B[a] contributes to output blade 0
        metric_rev = alg.gp_signs[:, 0].float() * alg.rev_signs.float()
        self.register_buffer("_metric_rev", metric_rev)  # [D]

        g2_blades = [i for i in range(D) if bin(i).count("1") == 2]
        self.n_g2 = len(g2_blades)
        self.register_buffer("_g2_blades", torch.tensor(g2_blades, dtype=torch.long, device=alg.device))
        self.register_buffer("_basis_indices", torch.arange(D, dtype=torch.long, device=alg.device))

    def _compute_score(
        self,
        q_head: torch.Tensor,
        k_head: torch.Tensor,
    ) -> torch.Tensor:
        """Compute GA attention scores for one query block."""
        return self._compute_score_dense(q_head, k_head)

    def _compute_score_dense(self, q_head: torch.Tensor, k_head: torch.Tensor) -> torch.Tensor:
        """Exact dense score with automatic precomputed/chunked grade-2 routing."""
        B, H, Lq, Hc, D = q_head.shape
        Lk = k_head.shape[2]
        n_g2 = self.n_g2

        # == Grade-0 score ====================================================
        # <Q * rev(K)>_0 = Sum_c Sum_d  Q[c,d] * K[c,d] * metric_rev[d]
        # Implemented as a batched matrix multiply: [B,H,Lq,Hc*D] @ [B,H,Hc*D,Lk]
        q_weighted = q_head * self._metric_rev  # [B, H, Lq, Hc, D]
        q_flat = q_weighted.reshape(B, H, Lq, Hc * D)  # [B, H, Lq, Hc*D]
        k_flat = k_head.reshape(B, H, Lk, Hc * D)  # [B, H, Lk, Hc*D]
        score_g0 = torch.matmul(q_flat, k_flat.transpose(-2, -1))  # [B, H, Lq, Lk]

        # == Grade-2 score ====================================================
        # ||<Q * rev(K)>_2||_F = sqrt(Sum_c Sum_r (Sum_d Q[c,d]*k_g2[j,c,r,d])^2)
        if n_g2 > 0:
            q_2d = q_head.permute(0, 1, 3, 2, 4).reshape(B * H * Hc, Lq, D)

            full_k_g2_elements = B * H * Lk * Hc * n_g2 * D
            if full_k_g2_elements <= self.score_precompute_limit:
                score_g2_sq = self._dense_score_g2_precomputed(q_2d, k_head, B, H, Hc, Lq, Lk, D, n_g2)
            else:
                k_2d = k_head.permute(0, 1, 3, 2, 4).reshape(B * H * Hc, Lk, D)
                score_g2_sq = self._dense_score_g2_chunked(q_2d, k_2d, B, H, Hc, Lq, Lk, D, n_g2)
            score_g2 = score_g2_sq.sqrt()
        else:
            score_g2 = torch.zeros_like(score_g0)

        # Combined score
        scale = math.sqrt(self.head_channels * self.algebra.dim)
        return (score_g0 + self.bivector_weight * score_g2) / scale

    def _dense_score_g2_precomputed(self, q_2d, k_head, B, H, Hc, Lq, Lk, D, n_g2):
        """Dense grade-2 score using one full shifted-key materialization."""
        r_vals = self._g2_blades
        b_idx = self._basis_indices.unsqueeze(0) ^ r_vals.unsqueeze(1)
        rev_b = self.algebra.rev_signs[b_idx].to(dtype=k_head.dtype)
        gp_ar = self.algebra.gp_signs[:, r_vals].T.to(dtype=k_head.dtype)
        g2_sign = rev_b * gp_ar

        k_g2 = k_head[..., b_idx] * g2_sign
        k_g2_2d = k_g2.permute(0, 1, 3, 2, 4, 5).reshape(B * H * Hc, Lk * n_g2, D)
        comp = torch.bmm(q_2d, k_g2_2d.transpose(-2, -1))
        comp_sq = comp.reshape(B * H * Hc, Lq, Lk, n_g2).pow(2).sum(-1)
        return comp_sq.reshape(B, H, Hc, Lq, Lk).sum(2)

    def _dense_score_g2_chunked(self, q_2d, k_2d, B, H, Hc, Lq, Lk, D, n_g2):
        """Dense grade-2 score using bounded output-blade chunks."""
        score_g2_sq = q_2d.new_zeros(B, H, Lq, Lk)
        for start in range(0, n_g2, self.score_blade_chunk_size):
            end = min(start + self.score_blade_chunk_size, n_g2)
            r_vals = self._g2_blades[start:end]
            b_idx = self._basis_indices.unsqueeze(0) ^ r_vals.unsqueeze(1)
            rev_b = self.algebra.rev_signs[b_idx].to(dtype=k_2d.dtype)
            gp_ar = self.algebra.gp_signs[:, r_vals].T.to(dtype=k_2d.dtype)
            g2_sign = rev_b * gp_ar

            k_shifted = torch.index_select(k_2d, -1, b_idx.reshape(-1))
            k_shifted = k_shifted * g2_sign.reshape(-1)
            k_g2_2d = k_shifted.reshape(B * H * Hc, Lk * (end - start), D)
            comp = torch.bmm(q_2d, k_g2_2d.transpose(-2, -1))
            comp_sq = comp.reshape(B * H * Hc, Lq, Lk, end - start).pow(2).sum(-1)
            score_g2_sq = score_g2_sq + comp_sq.reshape(B, H, Hc, Lq, Lk).sum(2)
        return score_g2_sq

    def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor = None) -> torch.Tensor:
        """Computes geometric product attention.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded (ignored).

        Returns:
            Output multivectors [B, L, C, D].
        """
        B, L, C, D = x.shape

        # Project Q, K, V (CliffordLinear expects [B, C, D])
        x_flat = x.reshape(B * L, C, D)
        Q = self.q_proj(x_flat).reshape(B, L, C, D)
        K = self.k_proj(x_flat).reshape(B, L, C, D)
        V = self.v_proj(x_flat).reshape(B, L, C, D)

        H = self.num_heads
        Hc = self.head_channels

        # Reshape to [B, H, L, Hc, D]
        Q = Q.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)  # [B, H, L, Hc, D]
        K = K.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)
        V = V.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)

        # Build causal mask once [L, L]
        if self.causal:
            causal_mask = torch.triu(
                torch.ones(L, L, device=x.device, dtype=torch.bool), diagonal=1
            )  # True = masked (future)
        else:
            causal_mask = None

        # Chunked attention over query positions to bound memory
        output_chunks = []
        for q_start in range(0, L, _BLOCK_SIZE):
            q_end = min(q_start + _BLOCK_SIZE, L)

            Q_block = Q[:, :, q_start:q_end]  # [B, H, Lq, Hc, D]

            # Compute scores: [B, H, Lq, L]
            scores = self._compute_score(Q_block, K)

            # Apply causal mask
            if causal_mask is not None:
                mask_block = causal_mask[q_start:q_end, :]  # [Lq, L]
                scores = scores.masked_fill(mask_block.unsqueeze(0).unsqueeze(0), float("-inf"))

            # Apply key padding mask: True = padded -> -inf
            if key_padding_mask is not None:
                # key_padding_mask: [B, L] -> [B, 1, 1, L]
                scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))

            # Softmax + dropout
            attn_weights = F.softmax(scores, dim=-1)  # [B, H, Lq, L]
            if self.attn_dropout is not None:
                attn_weights = self.attn_dropout(attn_weights)

            # Aggregate values: sum_k attn[b,h,i,k] * V[b,h,k,Hc,D]
            # attn_weights: [B, H, Lq, L]
            # V:            [B, H, L,  Hc, D]
            # out:          [B, H, Lq, Hc, D]
            out_block = torch.einsum("bhij,bhjcd->bhicd", attn_weights, V)
            output_chunks.append(out_block)

        # Reassemble: [B, H, L, Hc, D]
        output = torch.cat(output_chunks, dim=2)

        # Merge heads back: [B, L, C, D]
        output = output.permute(0, 2, 1, 3, 4).reshape(B, L, C, D)

        # Output projection
        output = self.out_proj(output.reshape(B * L, C, D)).reshape(B, L, C, D)

        return output
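
The index bookkeeping behind `_g2_blades` and the `basis_indices ^ r_vals` lookup can be sketched without tensors. This assumes the bitmask blade indexing that the popcount test in the source implies; the variable names are illustrative.

```python
n = 3                  # assumed: three basis vectors
D = 2 ** n             # 8 basis blades, indexed by bitmask

def grade(i: int) -> int:
    """Grade of a blade = number of basis vectors in it = set bits of its index."""
    return bin(i).count("1")

# Same selection rule as in _precompute_score_tables()
g2_blades = [i for i in range(D) if grade(i) == 2]

# Blade a times blade b lands on blade a ^ b. For a fixed output blade r,
# the key blade that pairs with query blade a is therefore b = a ^ r.
partners = {r: [a ^ r for a in range(D)] for r in g2_blades}

# Each partner list is a permutation of all blade indices, so gathering the
# key along it (times a sign table) turns the product into a plain dot product.
is_permutation = all(sorted(p) == list(range(D)) for p in partners.values())
```

This is why the grade-2 score can be computed with a gather followed by `bmm` instead of a full geometric product per query-key pair.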

__init__(algebra, channels, num_heads, causal=True, bivector_weight=0.5, dropout=0.0, score_blade_chunk_size=_G2_BLADE_CHUNK_SIZE, score_precompute_limit=_SCORE_PRECOMPUTE_LIMIT)

Sets up geometric product attention.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | Clifford algebra instance. | required |
| channels | int | Total number of multivector channels. | required |
| num_heads | int | Number of attention heads. | required |
| causal | bool | Apply causal mask for autoregressive generation. | True |
| bivector_weight | float | lambda_ weight on bivector score component. | 0.5 |
| dropout | float | Dropout rate on attention weights. | 0.0 |
| score_blade_chunk_size | int | Grade-2 output blades processed per dense chunk when exact dense scoring is used. | _G2_BLADE_CHUNK_SIZE |
| score_precompute_limit | int | Maximum temporary K_g2 elements allowed before exact dense scoring switches to chunked grade-2 blades. | _SCORE_PRECOMPUTE_LIMIT |

Source code in layers/blocks/attention.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_heads: int,
    causal: bool = True,
    bivector_weight: float = 0.5,
    dropout: float = 0.0,
    score_blade_chunk_size: int = _G2_BLADE_CHUNK_SIZE,
    score_precompute_limit: int = _SCORE_PRECOMPUTE_LIMIT,
):
    """Sets up geometric product attention.

    Args:
        algebra: Clifford algebra instance.
        channels: Total number of multivector channels.
        num_heads: Number of attention heads.
        causal: Apply causal mask for autoregressive generation.
        bivector_weight: lambda_ weight on bivector score component.
        dropout: Dropout rate on attention weights.
        score_blade_chunk_size: Grade-2 output blades processed per dense
            chunk when exact dense scoring is used.
        score_precompute_limit: Maximum temporary ``K_g2`` elements allowed
            before exact dense scoring switches to chunked grade-2 blades.
    """
    super().__init__(algebra)
    assert channels % num_heads == 0, f"channels ({channels}) must be divisible by num_heads ({num_heads})"

    self.channels = channels
    self.num_heads = num_heads
    self.head_channels = channels // num_heads
    self.causal = causal
    self.bivector_weight = bivector_weight
    self.score_blade_chunk_size = max(1, int(score_blade_chunk_size))
    self.score_precompute_limit = max(0, int(score_precompute_limit))

    # Q, K, V projections operate on [B*L, channels, dim]
    self.q_proj = CliffordLinear(algebra, channels, channels)
    self.k_proj = CliffordLinear(algebra, channels, channels)
    self.v_proj = CliffordLinear(algebra, channels, channels)
    self.out_proj = CliffordLinear(algebra, channels, channels)

    self.attn_dropout = nn.Dropout(dropout) if dropout > 0.0 else None

    # Precompute bilinear score routes (replaces pairwise geometric product)
    self._precompute_score_tables()

forward(x, key_padding_mask=None)

Computes geometric product attention.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input multivectors [B, L, C, D]. | required |
| key_padding_mask | Tensor | Optional [B, L] bool mask where True = padded (ignored). | None |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Output multivectors [B, L, C, D]. |

Source code in layers/blocks/attention.py
def forward(self, x: torch.Tensor, key_padding_mask: torch.Tensor = None) -> torch.Tensor:
    """Computes geometric product attention.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded (ignored).

    Returns:
        Output multivectors [B, L, C, D].
    """
    B, L, C, D = x.shape

    # Project Q, K, V (CliffordLinear expects [B, C, D])
    x_flat = x.reshape(B * L, C, D)
    Q = self.q_proj(x_flat).reshape(B, L, C, D)
    K = self.k_proj(x_flat).reshape(B, L, C, D)
    V = self.v_proj(x_flat).reshape(B, L, C, D)

    H = self.num_heads
    Hc = self.head_channels

    # Reshape to [B, H, L, Hc, D]
    Q = Q.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)  # [B, H, L, Hc, D]
    K = K.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)
    V = V.reshape(B, L, H, Hc, D).permute(0, 2, 1, 3, 4)

    # Build causal mask once [L, L]
    if self.causal:
        causal_mask = torch.triu(
            torch.ones(L, L, device=x.device, dtype=torch.bool), diagonal=1
        )  # True = masked (future)
    else:
        causal_mask = None

    # Chunked attention over query positions to bound memory
    output_chunks = []
    for q_start in range(0, L, _BLOCK_SIZE):
        q_end = min(q_start + _BLOCK_SIZE, L)

        Q_block = Q[:, :, q_start:q_end]  # [B, H, Lq, Hc, D]

        # Compute scores: [B, H, Lq, L]
        scores = self._compute_score(Q_block, K)

        # Apply causal mask
        if causal_mask is not None:
            mask_block = causal_mask[q_start:q_end, :]  # [Lq, L]
            scores = scores.masked_fill(mask_block.unsqueeze(0).unsqueeze(0), float("-inf"))

        # Apply key padding mask: True = padded -> -inf
        if key_padding_mask is not None:
            # key_padding_mask: [B, L] -> [B, 1, 1, L]
            scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))

        # Softmax + dropout
        attn_weights = F.softmax(scores, dim=-1)  # [B, H, Lq, L]
        if self.attn_dropout is not None:
            attn_weights = self.attn_dropout(attn_weights)

        # Aggregate values: sum_k attn[b,h,i,k] * V[b,h,k,Hc,D]
        # attn_weights: [B, H, Lq, L]
        # V:            [B, H, L,  Hc, D]
        # out:          [B, H, Lq, Hc, D]
        out_block = torch.einsum("bhij,bhjcd->bhicd", attn_weights, V)
        output_chunks.append(out_block)

    # Reassemble: [B, H, L, Hc, D]
    output = torch.cat(output_chunks, dim=2)

    # Merge heads back: [B, L, C, D]
    output = output.permute(0, 2, 1, 3, 4).reshape(B, L, C, D)

    # Output projection
    output = self.out_proj(output.reshape(B * L, C, D)).reshape(B, L, C, D)

    return output
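
The query-block loop above gives the same result as unchunked attention because softmax normalizes each query row independently. The sketch below checks this with a plain scaled dot-product score as a stand-in for the geometric score; `block_size`, the tensor sizes, and the `attend` helper are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, E = 2, 3, 10, 4            # assumed sizes; E is a flat feature size
block_size = 4                      # hypothetical stand-in for _BLOCK_SIZE

q = torch.randn(B, H, L, E)
k = torch.randn(B, H, L, E)
v = torch.randn(B, H, L, E)

causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # True = future
key_padding_mask = torch.zeros(B, L, dtype=torch.bool)
key_padding_mask[1, -2:] = True     # last two keys of sample 1 are padding

def attend(q_rows, mask_rows):
    """Masked softmax attention for a slice of query rows against all keys."""
    scores = q_rows @ k.transpose(-2, -1) / E ** 0.5                 # [B, H, Lq, L]
    scores = scores.masked_fill(mask_rows, float("-inf"))            # [Lq, L] broadcasts
    scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v                             # [B, H, Lq, E]

full = attend(q, causal_mask)

chunks = [
    attend(q[:, :, s:s + block_size], causal_mask[s:s + block_size])
    for s in range(0, L, block_size)
]
chunked = torch.cat(chunks, dim=2)

max_diff = (full - chunked).abs().max().item()
```

Note that a query row whose keys are all masked would produce NaN after softmax; in this example key 0 is always visible, so the situation does not arise.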

MultiRotorFFN

Bases: CliffordModule

Embedded Geometric Toolbox - Feed-Forward Network via rotor superposition.

Standard transformers use: Linear -> GELU -> Linear. This replaces that with:

CliffordLinear(expand) -> CliffordLayerNorm
    -> MultiRotorLayer(K rotors) -> GeometricGELU
    -> CliffordLinear(contract) -> BladeSelector

The expand step lifts x into a toolbox subspace with ffn_mult * channels channels. MultiRotorLayer then applies K parallel rotors, each acting in a different rotation plane; this rotor superposition is itself the nonlinearity, not just a scalar gate. The contract step projects back to the original channel count.

Designed as a standalone module so it can be reused in other tasks (md17, pdbbind, etc.) beyond the language model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | The algebra instance. | required |
| channels | int | Input/output channel count. | required |
| ffn_mult | int | Expansion factor (ffn_channels = channels * ffn_mult). | 4 |
| num_rotors | int | Number of parallel rotors K in the toolbox. | 8 |
| use_rotor_backend | bool | Use RotorGadget backend for CliffordLinear. | False |

Input/Output shape: [B, C, D] where D = algebra.dim.

Source code in layers/blocks/multi_rotor_ffn.py
class MultiRotorFFN(CliffordModule):
    """Embedded Geometric Toolbox - Feed-Forward Network via rotor superposition.

    Standard transformers use: Linear -> GELU -> Linear.
    This replaces that with:

        CliffordLinear(expand) -> CliffordLayerNorm
            -> MultiRotorLayer(K rotors) -> GeometricGELU
            -> CliffordLinear(contract) -> BladeSelector

    The expand step lifts x into a ``ffn_mult x channels`` toolbox subspace.
    ``MultiRotorLayer`` applies K parallel rotors, each exploring a different
    rotation plane - this IS the nonlinearity, not just a scalar gate.
    The contract step projects back to the original channel count.

    Designed as a standalone module so it can be reused in other tasks
    (md17, pdbbind, etc.) beyond the language model.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        channels (int): Input/output channel count.
        ffn_mult (int): Expansion factor (ffn_channels = channels * ffn_mult).
        num_rotors (int): Number of parallel rotors K in the toolbox.
        use_rotor_backend (bool): Use RotorGadget backend for CliffordLinear.

    Input/Output shape: ``[B, C, D]`` where D = algebra.dim.
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        ffn_mult: int = 4,
        num_rotors: int = 8,
        use_rotor_backend: bool = False,
    ):
        super().__init__(algebra)
        self.channels = channels
        ffn_channels = channels * ffn_mult
        backend = "rotor" if use_rotor_backend else "traditional"

        self.expand = CliffordLinear(algebra, channels, ffn_channels, backend=backend)
        self.norm = CliffordLayerNorm(algebra, ffn_channels)
        self.toolbox = MultiRotorLayer(algebra, ffn_channels, num_rotors)
        self.act = GeometricGELU(algebra, channels=ffn_channels)
        self.contract = CliffordLinear(algebra, ffn_channels, channels, backend=backend)
        self.gate = BladeSelector(algebra, channels)

    def forward(self, x) -> torch.Tensor:
        """Applies the geometric toolbox FFN.

        Args:
            x (torch.Tensor): Input ``[B, C, D]``.

        Returns:
            torch.Tensor: Output ``[B, C, D]``.
        """
        h = self.expand(x)  # [B, ffn_channels, D]
        h = self.norm(h)  # [B, ffn_channels, D]
        h = self.toolbox(h)  # [B, ffn_channels, D]  - K-rotor superposition
        h = self.act(h)  # [B, ffn_channels, D]
        h = self.contract(h)  # [B, channels, D]
        h = self.gate(h)  # [B, channels, D]      - per-blade gating
        return h

forward(x)

Applies the geometric toolbox FFN.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input [B, C, D]. | required |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Output [B, C, D]. |

Source code in layers/blocks/multi_rotor_ffn.py
def forward(self, x) -> torch.Tensor:
    """Applies the geometric toolbox FFN.

    Args:
        x (torch.Tensor): Input ``[B, C, D]``.

    Returns:
        torch.Tensor: Output ``[B, C, D]``.
    """
    h = self.expand(x)  # [B, ffn_channels, D]
    h = self.norm(h)  # [B, ffn_channels, D]
    h = self.toolbox(h)  # [B, ffn_channels, D]  - K-rotor superposition
    h = self.act(h)  # [B, ffn_channels, D]
    h = self.contract(h)  # [B, channels, D]
    h = self.gate(h)  # [B, channels, D]      - per-blade gating
    return h
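
The expand and contract steps only change the channel axis; the blade axis `D` is carried through untouched. The sketch below traces those shapes with a hypothetical `ChannelMix` module standing in for `CliffordLinear`. It is not the library's implementation, and the shape-preserving middle stages (norm, rotors, activation) are omitted.

```python
import torch
import torch.nn as nn

class ChannelMix(nn.Module):
    """Hypothetical stand-in for CliffordLinear: mixes channels, leaves blades alone."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        init = torch.randn(out_channels, in_channels) / in_channels ** 0.5
        self.weight = nn.Parameter(init)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C_in, D] -> [B, C_out, D]
        return torch.einsum("oc,bcd->bod", self.weight, x)

B, C, D, ffn_mult = 2, 4, 8, 4              # assumed sizes; ffn_mult matches the default
expand = ChannelMix(C, C * ffn_mult)
contract = ChannelMix(C * ffn_mult, C)

x = torch.randn(B, C, D)
h = expand(x)                               # [B, ffn_channels, D]
hidden_shape = tuple(h.shape)
y = contract(h)                             # [B, channels, D]
```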

GeometricTransformerBlock

Bases: CliffordModule

Modular Geometric Transformer block.

Architecture:

1. Pre-norm
2. Geometric Attention (Standard or Entropy-Gated)
3. Residual connection
4. Pre-norm
5. Multi-Rotor FFN
6. Residual connection

Source code in layers/blocks/transformer.py
class GeometricTransformerBlock(CliffordModule):
    """Modular Geometric Transformer block.

    Architecture:
    1. Pre-norm
    2. Geometric Attention (Standard or Entropy-Gated)
    3. Residual connection
    4. Pre-norm
    5. Multi-Rotor FFN
    6. Residual connection
    """

    def __init__(
        self,
        algebra: CliffordAlgebra,
        channels: int,
        num_heads: int = 4,
        num_rotors: int = 8,
        dropout: float = 0.1,
        use_entropy_gating: bool = False,
        eta: float = 1.5,
        H_base: float = 0.5,
    ):
        """Initializes the Geometric Transformer Block.

        Args:
            algebra: Clifford algebra instance.
            channels: Total multivector channels.
            num_heads: Number of attention heads.
            num_rotors: Number of rotors in the FFN.
            dropout: Dropout rate.
            use_entropy_gating: If True, uses EntropyGatedAttention.
            eta: Gating multiplier for entropy attention.
            H_base: Base entropy threshold.
        """
        super().__init__(algebra)
        self.use_entropy_gating = use_entropy_gating
        self.norm1 = CliffordLayerNorm(algebra, channels)

        if use_entropy_gating:
            self.attn = EntropyGatedAttention(algebra, channels, num_heads, eta=eta, H_base=H_base)
        else:
            self.attn = GeometricProductAttention(algebra, channels, num_heads, causal=False, dropout=dropout)

        self.norm2 = CliffordLayerNorm(algebra, channels)

        # Local import: MultiRotorFFN is defined in multi_rotor_ffn.py
        from .multi_rotor_ffn import MultiRotorFFN

        self.ffn = MultiRotorFFN(algebra, channels, num_rotors=num_rotors)

    def forward(
        self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_state: bool = False
    ) -> torch.Tensor:
        """Forward pass through the transformer block.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded.
            return_state: If True, returns intermediate entropy/gating states.

        Returns:
            Processed multivectors [B, L, C, D]. If return_state is True, a tuple
            (x, H, lambda_dyn) is returned instead; H and lambda_dyn are None
            unless entropy gating is enabled.
        """
        B, L, C, D = x.shape

        # 1. Attention path
        res = x
        x_n = self.norm1(x.reshape(B * L, C, D)).reshape(B, L, C, D)

        if self.use_entropy_gating and return_state:
            attn_out, H, lambda_dyn = self.attn(x_n, key_padding_mask=key_padding_mask, return_gating=True)
        else:
            attn_out = self.attn(x_n, key_padding_mask=key_padding_mask)
            H, lambda_dyn = None, None

        x = res + attn_out

        # 2. FFN path
        res = x
        x_n = self.norm2(x.reshape(B * L, C, D)).reshape(B, L, C, D)
        f_out = self.ffn(x_n.reshape(B * L, C, D)).reshape(B, L, C, D)
        x = res + f_out

        if return_state:
            return x, H, lambda_dyn
        return x

__init__(algebra, channels, num_heads=4, num_rotors=8, dropout=0.1, use_entropy_gating=False, eta=1.5, H_base=0.5)

Initializes the Geometric Transformer Block.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| algebra | CliffordAlgebra | Clifford algebra instance. | required |
| channels | int | Total multivector channels. | required |
| num_heads | int | Number of attention heads. | 4 |
| num_rotors | int | Number of rotors in the FFN. | 8 |
| dropout | float | Dropout rate. | 0.1 |
| use_entropy_gating | bool | If True, uses EntropyGatedAttention. | False |
| eta | float | Gating multiplier for entropy attention. | 1.5 |
| H_base | float | Base entropy threshold. | 0.5 |

Source code in layers/blocks/transformer.py
def __init__(
    self,
    algebra: CliffordAlgebra,
    channels: int,
    num_heads: int = 4,
    num_rotors: int = 8,
    dropout: float = 0.1,
    use_entropy_gating: bool = False,
    eta: float = 1.5,
    H_base: float = 0.5,
):
    """Initializes the Geometric Transformer Block.

    Args:
        algebra: Clifford algebra instance.
        channels: Total multivector channels.
        num_heads: Number of attention heads.
        num_rotors: Number of rotors in the FFN.
        dropout: Dropout rate.
        use_entropy_gating: If True, uses EntropyGatedAttention.
        eta: Gating multiplier for entropy attention.
        H_base: Base entropy threshold.
    """
    super().__init__(algebra)
    self.use_entropy_gating = use_entropy_gating
    self.norm1 = CliffordLayerNorm(algebra, channels)

    if use_entropy_gating:
        self.attn = EntropyGatedAttention(algebra, channels, num_heads, eta=eta, H_base=H_base)
    else:
        self.attn = GeometricProductAttention(algebra, channels, num_heads, causal=False, dropout=dropout)

    self.norm2 = CliffordLayerNorm(algebra, channels)

    # Local import: MultiRotorFFN is defined in multi_rotor_ffn.py
    from .multi_rotor_ffn import MultiRotorFFN

    self.ffn = MultiRotorFFN(algebra, channels, num_rotors=num_rotors)

forward(x, key_padding_mask=None, return_state=False)

Forward pass through the transformer block.

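Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | Tensor | Input multivectors [B, L, C, D]. | required |
| key_padding_mask | Tensor | Optional [B, L] bool mask where True = padded. | None |
| return_state | bool | If True, returns intermediate entropy/gating states. | False |

Returns:

| Type | Description |
| --- | --- |
| Tensor | Processed multivectors [B, L, C, D]. If return_state is True, a tuple (x, H, lambda_dyn) is returned instead; H and lambda_dyn are None unless entropy gating is enabled. |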

Source code in layers/blocks/transformer.py
def forward(
    self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_state: bool = False
) -> torch.Tensor:
    """Forward pass through the transformer block.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded.
        return_state: If True, returns intermediate entropy/gating states.

    Returns:
        Processed multivectors [B, L, C, D]. If return_state is True, a tuple
        (x, H, lambda_dyn) is returned instead; H and lambda_dyn are None
        unless entropy gating is enabled.
    """
    B, L, C, D = x.shape

    # 1. Attention path
    res = x
    x_n = self.norm1(x.reshape(B * L, C, D)).reshape(B, L, C, D)

    if self.use_entropy_gating and return_state:
        attn_out, H, lambda_dyn = self.attn(x_n, key_padding_mask=key_padding_mask, return_gating=True)
    else:
        attn_out = self.attn(x_n, key_padding_mask=key_padding_mask)
        H, lambda_dyn = None, None

    x = res + attn_out

    # 2. FFN path
    res = x
    x_n = self.norm2(x.reshape(B * L, C, D)).reshape(B, L, C, D)
    f_out = self.ffn(x_n.reshape(B * L, C, D)).reshape(B, L, C, D)
    x = res + f_out

    if return_state:
        return x, H, lambda_dyn
    return x
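
The pre-norm residual wiring and the `[B, L, C, D]` to `[B * L, C, D]` reshapes can be exercised on their own. The skeleton below is a sketch with placeholder sublayers (`nn.Identity` and a hypothetical `Zero` module), not the library's block. With sublayers that output zeros, both residual paths reduce to the identity.

```python
import torch
import torch.nn as nn

class PreNormResidualBlock(nn.Module):
    """Hypothetical skeleton of the block; sublayers are injected placeholders."""

    def __init__(self, norm1, attn, norm2, ffn):
        super().__init__()
        self.norm1, self.attn, self.norm2, self.ffn = norm1, attn, norm2, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, C, D = x.shape
        # Norm and FFN act per token, so L is folded into the batch axis for them.
        x_n = self.norm1(x.reshape(B * L, C, D)).reshape(B, L, C, D)
        x = x + self.attn(x_n)                        # attention mixes across L
        x_n = self.norm2(x.reshape(B * L, C, D))
        x = x + self.ffn(x_n).reshape(B, L, C, D)     # FFN is position-wise
        return x

class Zero(nn.Module):
    """Placeholder sublayer that contributes nothing to the residual stream."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.zeros_like(x)

block = PreNormResidualBlock(nn.Identity(), Zero(), nn.Identity(), Zero())
x = torch.randn(2, 5, 3, 8)                           # [B, L, C, D], assumed sizes
y = block(x)
```

When calling the real block with `return_state=True`, the result should be unpacked as a three-element tuple, since the source returns `x, H, lambda_dyn` in that case.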

Adapters

MultivectorEmbedding

Bases: CliffordModule

Token embedding as multivectors.

Each token maps to a [channels, dim] multivector. Initializes content in grade-1 (vector) subspace only - semantic content starts as directed quantities before rotors act on them.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| vocab_size | int | Number of tokens. |
| channels | int | Number of multivector channels. |
| embedding | Embedding | Underlying embedding table. |

Source code in layers/adapters/embedding.py
class MultivectorEmbedding(CliffordModule):
    """Token embedding as multivectors.

    Each token maps to a [channels, dim] multivector. Initializes
    content in grade-1 (vector) subspace only - semantic content
    starts as directed quantities before rotors act on them.

    Attributes:
        vocab_size (int): Number of tokens.
        channels (int): Number of multivector channels.
        embedding (nn.Embedding): Underlying embedding table.
    """

    def __init__(self, algebra: CliffordAlgebra, vocab_size: int, channels: int):
        """Sets up the multivector embedding.

        Args:
            algebra: Clifford algebra instance.
            vocab_size: Vocabulary size.
            channels: Number of multivector channels per token.
        """
        super().__init__(algebra)
        self.vocab_size = vocab_size
        self.channels = channels

        # Single flat embedding: vocab_size -> channels * dim
        self.embedding = nn.Embedding(vocab_size, channels * algebra.dim)
        self._init_grade1()

    def _init_grade1(self):
        """Initializes only grade-1 components; zeros out all others."""
        with torch.no_grad():
            dim = self.algebra.dim
            channels = self.channels

            # Build grade-1 mask (indices with exactly 1 bit set)
            grade1_flat = []
            for i in range(dim):
                if bin(i).count("1") == 1:
                    grade1_flat.append(i)

            # Zero everything
            self.embedding.weight.zero_()

            # Fill grade-1 slots with small normal values
            for ch in range(channels):
                for idx in grade1_flat:
                    flat_idx = ch * dim + idx
                    self.embedding.weight[:, flat_idx].normal_(std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Maps token ids to multivector embeddings.

        Args:
            token_ids: Token indices [B, L].

        Returns:
            Multivector embeddings [B, L, channels, dim].
        """
        B, L = token_ids.shape
        flat = self.embedding(token_ids)  # [B, L, channels * dim]
        return flat.reshape(B, L, self.channels, self.algebra.dim)
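
The grade-1 selection in `_init_grade1` relies on basis blades being indexed by bitmask, so that a blade's grade equals the number of set bits in its index. A minimal sketch of that selection in plain Python, assuming this bitmask ordering and a hypothetical 3-dimensional algebra with `dim = 2**3 = 8` blades. The helper name `grade_indices` is illustrative and not part of the library:

```python
def grade_indices(dim: int, grade: int) -> list[int]:
    """Return basis-blade indices whose popcount equals `grade`."""
    return [i for i in range(dim) if bin(i).count("1") == grade]


dim = 8  # assumed: 2**3 blades for a 3-dimensional algebra
grade1 = grade_indices(dim, 1)  # vector blades
grade2 = grade_indices(dim, 2)  # bivector blades
print(grade1, grade2)
```

Under these assumptions only three of the eight components per channel receive non-zero initial values.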

__init__(algebra, vocab_size, channels)

Sets up the multivector embedding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| algebra | CliffordAlgebra | Clifford algebra instance. | required |
| vocab_size | int | Vocabulary size. | required |
| channels | int | Number of multivector channels per token. | required |

Source code in layers/adapters/embedding.py
def __init__(self, algebra: CliffordAlgebra, vocab_size: int, channels: int):
    """Sets up the multivector embedding.

    Args:
        algebra: Clifford algebra instance.
        vocab_size: Vocabulary size.
        channels: Number of multivector channels per token.
    """
    super().__init__(algebra)
    self.vocab_size = vocab_size
    self.channels = channels

    # Single flat embedding: vocab_size -> channels * dim
    self.embedding = nn.Embedding(vocab_size, channels * algebra.dim)
    self._init_grade1()

forward(token_ids)

Maps token ids to multivector embeddings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| token_ids | Tensor | Token indices [B, L]. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Multivector embeddings [B, L, channels, dim]. |

Source code in layers/adapters/embedding.py
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
    """Maps token ids to multivector embeddings.

    Args:
        token_ids: Token indices [B, L].

    Returns:
        Multivector embeddings [B, L, channels, dim].
    """
    B, L = token_ids.shape
    flat = self.embedding(token_ids)  # [B, L, channels * dim]
    return flat.reshape(B, L, self.channels, self.algebra.dim)
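
Because `forward` reshapes each flat row of length `channels * dim` into `[channels, dim]`, the flat column `ch * dim + idx` ends up at position `[ch, idx]`. A sketch of that row-major layout in plain Python, using nested lists in place of tensors and hypothetical sizes:

```python
channels, dim = 2, 8  # hypothetical sizes

# One flat embedding row; mark the grade-1 slots (popcount == 1) with 1.0
flat = [0.0] * (channels * dim)
for ch in range(channels):
    for idx in range(dim):
        if bin(idx).count("1") == 1:
            flat[ch * dim + idx] = 1.0

# Row-major reshape to [channels, dim], mirroring what reshape does per token
mv = [flat[ch * dim:(ch + 1) * dim] for ch in range(channels)]

for row in mv:
    print(row)
```

Every channel shows the same pattern, with the marked entries at component indices 1, 2 and 4.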

MotherEmbedding

Bases: CliffordModule

Embeds local feature groups into a canonical Mother Algebra with Procrustes Alignment.

Applies a fixed alignment matrix (R_fixed, serving as a rotor proxy) to the input features, rotating them into a shared reference frame before the linear up-cast to multivector channels. This aligns disparate geometric manifolds.

Source code in layers/adapters/mother.py
class MotherEmbedding(CliffordModule):
    """Embeds local feature groups into a canonical Mother Algebra with Procrustes Alignment.

    Applies a fixed alignment matrix (R_fixed, serving as a rotor proxy) to the
    input features, rotating them into a shared reference frame before the linear
    up-cast to multivector channels. This aligns disparate geometric manifolds.
    """

    def __init__(self, algebra: CliffordAlgebra, input_dim: int, channels: int, U: float = 0.0, V: torch.Tensor = None):
        """Initializes the Mother Embedding.

        Args:
            algebra: Clifford algebra instance.
            input_dim: Dimension of the input features.
            channels: Number of multivector channels.
            U: Geometric uncertainty index for manifold suppression.
            V: Fixed rotor proxy for Procrustes alignment (input_dim x input_dim).
        """
        super().__init__(algebra)
        self.channels = channels

        # Procrustes Alignment Matrix (Fixed Rotor Proxy)
        if V is None:
            V = torch.eye(input_dim)
        self.register_buffer("R_fixed", V)

        # Up-cast to Mother Algebra multivector channels
        self.linear = nn.Linear(input_dim, channels * algebra.dim)
        self.norm = CliffordLayerNorm(algebra, channels)

        # Pre-condition LayerNorm scale with Uncertainty Index
        with torch.no_grad():
            if hasattr(self.norm, "weight"):
                # Suppress highly uncertain (twisted) manifolds initially
                scale = 1.0 / (1.0 + U)
                self.norm.weight.data.fill_(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Projects input into the aligned mother manifold.

        Args:
            x: Input features [B, input_dim].

        Returns:
            Aligned multivectors [B, channels, dim].
        """
        # 1. Apply Geometric Procrustes Alignment
        if self.R_fixed is not None:
            x = x @ self.R_fixed.T

        # 2. Mother Projection
        c = self.linear(x).view(-1, self.channels, self.algebra.dim)
        return self.norm(c)
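
The alignment step `x @ self.R_fixed.T` is an ordinary matrix product applied before the linear up-cast. When `V` is orthogonal, as in orthogonal Procrustes, the step rotates each input vector without changing its length, and the default identity leaves the input untouched. A sketch in plain Python, with a hypothetical 2×2 rotation standing in for `V` and an illustrative helper name `apply_alignment`:

```python
import math


def apply_alignment(x: list[list[float]], V: list[list[float]]) -> list[list[float]]:
    """Compute x @ V.T for row-vector inputs x of shape [B, input_dim]."""
    return [[sum(xi * vj for xi, vj in zip(row, v_row)) for v_row in V] for row in x]


theta = math.pi / 2  # hypothetical 90-degree rotation
V = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]

x = [[1.0, 0.0], [3.0, 4.0]]
aligned = apply_alignment(x, V)
identity = apply_alignment(x, [[1.0, 0.0], [0.0, 1.0]])
print(aligned)
```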

__init__(algebra, input_dim, channels, U=0.0, V=None)

Initializes the Mother Embedding.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| algebra | CliffordAlgebra | Clifford algebra instance. | required |
| input_dim | int | Dimension of the input features. | required |
| channels | int | Number of multivector channels. | required |
| U | float | Geometric uncertainty index for manifold suppression. | 0.0 |
| V | Tensor | Fixed rotor proxy for Procrustes alignment (input_dim x input_dim). | None |

Source code in layers/adapters/mother.py
def __init__(self, algebra: CliffordAlgebra, input_dim: int, channels: int, U: float = 0.0, V: torch.Tensor = None):
    """Initializes the Mother Embedding.

    Args:
        algebra: Clifford algebra instance.
        input_dim: Dimension of the input features.
        channels: Number of multivector channels.
        U: Geometric uncertainty index for manifold suppression.
        V: Fixed rotor proxy for Procrustes alignment (input_dim x input_dim).
    """
    super().__init__(algebra)
    self.channels = channels

    # Procrustes Alignment Matrix (Fixed Rotor Proxy)
    if V is None:
        V = torch.eye(input_dim)
    self.register_buffer("R_fixed", V)

    # Up-cast to Mother Algebra multivector channels
    self.linear = nn.Linear(input_dim, channels * algebra.dim)
    self.norm = CliffordLayerNorm(algebra, channels)

    # Pre-condition LayerNorm scale with Uncertainty Index
    with torch.no_grad():
        if hasattr(self.norm, "weight"):
            # Suppress highly uncertain (twisted) manifolds initially
            scale = 1.0 / (1.0 + U)
            self.norm.weight.data.fill_(scale)
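
The initial norm scale is `1 / (1 + U)`, so `U = 0` leaves the scale at 1 and larger uncertainty values shrink it toward 0. The constructor applies it only when the norm layer exposes a `weight` attribute. A small sketch of the mapping (the function name `initial_scale` is illustrative):

```python
def initial_scale(U: float) -> float:
    """Initial norm weight for an uncertainty index U >= 0."""
    return 1.0 / (1.0 + U)


for U in (0.0, 1.0, 3.0):
    print(U, initial_scale(U))
```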

forward(x)

Projects input into the aligned mother manifold.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input features [B, input_dim]. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Aligned multivectors [B, channels, dim]. |

Source code in layers/adapters/mother.py
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """Projects input into the aligned mother manifold.

    Args:
        x: Input features [B, input_dim].

    Returns:
        Aligned multivectors [B, channels, dim].
    """
    # 1. Apply Geometric Procrustes Alignment
    if self.R_fixed is not None:
        x = x @ self.R_fixed.T

    # 2. Mother Projection
    c = self.linear(x).view(-1, self.channels, self.algebra.dim)
    return self.norm(c)

EntropyGatedAttention

Bases: CliffordModule

Dynamic geometric attention governed by bivector information entropy.

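For each batch item, the Shannon entropy H of the bivector energy distribution over the sequence sets a gate lambda = eta * sigmoid(H - H_base). The gate scales the bivector (grade-2) components at every position before the base attention runs, acting as a dynamic stiffness on rotational content. With eta = 1.0 the gate lies in (0, 1) and grows with H.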

Source code in layers/adapters/mother.py
class EntropyGatedAttention(CliffordModule):
    """Dynamic geometric attention governed by bivector information entropy.

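    For each batch item, the Shannon entropy H of the bivector energy distribution
    over the sequence sets a gate lambda = eta * sigmoid(H - H_base). The gate
    scales the bivector (grade-2) components at every position before the base
    attention runs, acting as a dynamic stiffness on rotational content.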
    """

    def __init__(self, algebra: CliffordAlgebra, channels: int, num_heads: int, eta: float = 1.0, H_base: float = 0.5):
        """Initializes Entropy-Gated Attention.

        Args:
            algebra: Clifford algebra instance.
            channels: Total multivector channels.
            num_heads: Number of attention heads.
            eta: Gating multiplier.
            H_base: Base entropy threshold.
        """
        super().__init__(algebra)
        self.channels = channels
        self.eta = eta
        self.H_base = H_base
        self.base_attention = GeometricProductAttention(algebra, channels, num_heads, causal=False)

        # Cache bivector indices and float mask for compile-friendly gating
        mask = self.algebra.grade_masks[2]
        self.register_buffer("g2_idx", mask.nonzero(as_tuple=True)[0])
        self.register_buffer("_g2_float_mask", mask.float())

    def forward(
        self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_gating: bool = False
    ) -> torch.Tensor:
        """Applies entropy-gated geometric attention.

        Args:
            x: Input multivectors [B, L, C, D].
            key_padding_mask: Optional [B, L] bool mask where True = padded.
            return_gating: If True, returns entropy and gating values.

        Returns:
            Attended multivectors [B, L, C, D]. If return_gating is True,
            a tuple (out, H, lambda_dyn) with H and lambda_dyn of shape [B].
        """
        # 1. Calculate Information Entropy of Bivector Energy
        # Summing across multivector components (g2_idx) and across channels (dim 2)
        # x: [B, L, C, D]
        g2_energy = (x[..., self.g2_idx] ** 2).sum(dim=(-1, -2))  # [B, L]

        # Mask padded positions before entropy calc
        if key_padding_mask is not None:
            g2_energy = g2_energy.masked_fill(key_padding_mask, 0.0)

        # Normalize to probability distribution over sequence
        p = g2_energy / (g2_energy.sum(dim=1, keepdim=True) + 1e-8)

        # Shannon Entropy H per batch [B]
        H = -(p * torch.log(p + 1e-8)).sum(dim=1)

        # 2. Base-Adjusted Gating Function
        lambda_dyn = self.eta * torch.sigmoid(H - self.H_base)  # [B]

        # 3. Apply dynamic geometric stiffness
        # Scale the rotational components (bivectors)
        lambda_view = lambda_dyn.view(-1, 1, 1, 1)

        g2_mask = self._g2_float_mask.to(dtype=x.dtype)
        scale = 1.0 + (lambda_view - 1.0) * g2_mask  # [B, 1, 1, D]
        x_gated = x * scale

        out = self.base_attention(x_gated, key_padding_mask=key_padding_mask)

        if return_gating:
            return out, H, lambda_dyn
        return out
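
The gating arithmetic in `forward` can be traced by hand for a single sequence. The sketch below reimplements it in plain Python, with the same `1e-8` stabilizers, for two hypothetical energy profiles. The helper name `entropy_gate` is illustrative. As written, the gate increases with entropy: evenly spread energy yields the larger `lambda_dyn`, and energy concentrated at one position yields the smaller one. Padded positions would enter as zero energy and contribute nothing to either sum.

```python
import math


def entropy_gate(energy: list[float], eta: float = 1.0, H_base: float = 0.5):
    """Return (H, lambda_dyn) for one sequence of per-position bivector energies."""
    total = sum(energy) + 1e-8
    p = [e / total for e in energy]
    H = -sum(pi * math.log(pi + 1e-8) for pi in p)
    lam = eta / (1.0 + math.exp(-(H - H_base)))  # eta * sigmoid(H - H_base)
    return H, lam


H_uniform, lam_uniform = entropy_gate([1.0, 1.0, 1.0, 1.0])  # energy spread evenly
H_peaked, lam_peaked = entropy_gate([1.0, 0.0, 0.0, 0.0])    # energy at one position
print(H_uniform, lam_uniform)
print(H_peaked, lam_peaked)
```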

__init__(algebra, channels, num_heads, eta=1.0, H_base=0.5)

Initializes Entropy-Gated Attention.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| algebra | CliffordAlgebra | Clifford algebra instance. | required |
| channels | int | Total multivector channels. | required |
| num_heads | int | Number of attention heads. | required |
| eta | float | Gating multiplier. | 1.0 |
| H_base | float | Base entropy threshold. | 0.5 |

Source code in layers/adapters/mother.py
def __init__(self, algebra: CliffordAlgebra, channels: int, num_heads: int, eta: float = 1.0, H_base: float = 0.5):
    """Initializes Entropy-Gated Attention.

    Args:
        algebra: Clifford algebra instance.
        channels: Total multivector channels.
        num_heads: Number of attention heads.
        eta: Gating multiplier.
        H_base: Base entropy threshold.
    """
    super().__init__(algebra)
    self.channels = channels
    self.eta = eta
    self.H_base = H_base
    self.base_attention = GeometricProductAttention(algebra, channels, num_heads, causal=False)

    # Cache bivector indices and float mask for compile-friendly gating
    mask = self.algebra.grade_masks[2]
    self.register_buffer("g2_idx", mask.nonzero(as_tuple=True)[0])
    self.register_buffer("_g2_float_mask", mask.float())

forward(x, key_padding_mask=None, return_gating=False)

Applies entropy-gated geometric attention.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Input multivectors [B, L, C, D]. | required |
| key_padding_mask | Tensor | Optional [B, L] bool mask where True = padded. | None |
| return_gating | bool | If True, returns entropy and gating values. | False |

Returns:

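| Type | Description |
|---|---|
| Tensor | Attended multivectors [B, L, C, D]. If return_gating is True, a tuple (out, H, lambda_dyn) is returned instead, with H and lambda_dyn of shape [B]. |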

Source code in layers/adapters/mother.py
def forward(
    self, x: torch.Tensor, key_padding_mask: torch.Tensor = None, return_gating: bool = False
) -> torch.Tensor:
    """Applies entropy-gated geometric attention.

    Args:
        x: Input multivectors [B, L, C, D].
        key_padding_mask: Optional [B, L] bool mask where True = padded.
        return_gating: If True, returns entropy and gating values.

    Returns:
        Attended multivectors [B, L, C, D]. If return_gating is True,
        a tuple (out, H, lambda_dyn) with H and lambda_dyn of shape [B].
    """
    # 1. Calculate Information Entropy of Bivector Energy
    # Summing across multivector components (g2_idx) and across channels (dim 2)
    # x: [B, L, C, D]
    g2_energy = (x[..., self.g2_idx] ** 2).sum(dim=(-1, -2))  # [B, L]

    # Mask padded positions before entropy calc
    if key_padding_mask is not None:
        g2_energy = g2_energy.masked_fill(key_padding_mask, 0.0)

    # Normalize to probability distribution over sequence
    p = g2_energy / (g2_energy.sum(dim=1, keepdim=True) + 1e-8)

    # Shannon Entropy H per batch [B]
    H = -(p * torch.log(p + 1e-8)).sum(dim=1)

    # 2. Base-Adjusted Gating Function
    lambda_dyn = self.eta * torch.sigmoid(H - self.H_base)  # [B]

    # 3. Apply dynamic geometric stiffness
    # Scale the rotational components (bivectors)
    lambda_view = lambda_dyn.view(-1, 1, 1, 1)

    g2_mask = self._g2_float_mask.to(dtype=x.dtype)
    scale = 1.0 + (lambda_view - 1.0) * g2_mask  # [B, 1, 1, D]
    x_gated = x * scale

    out = self.base_attention(x_gated, key_padding_mask=key_padding_mask)

    if return_gating:
        return out, H, lambda_dyn
    return out
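
The expression `1.0 + (lambda_view - 1.0) * g2_mask` builds a per-component scale that equals the gate on bivector components and 1 everywhere else, so only grade-2 content is rescaled. A sketch in plain Python, assuming bitmask blade ordering (grade = number of set bits) and a hypothetical `dim = 8`. The helper name `bivector_scale` is illustrative:

```python
def bivector_scale(dim: int, lam: float) -> list[float]:
    """Per-component scale: lam on grade-2 blades, 1.0 everywhere else."""
    g2_mask = [1.0 if bin(i).count("1") == 2 else 0.0 for i in range(dim)]
    return [1.0 + (lam - 1.0) * m for m in g2_mask]


scale = bivector_scale(dim=8, lam=0.5)
x = [2.0] * 8  # one multivector with every component set to 2
x_gated = [xi * s for xi, s in zip(x, scale)]
print(x_gated)
```

Writing the scale as a blend with a float mask avoids indexed assignment, which matches the "compile-friendly" comment in the constructor.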

Optional dependency

CliffordGraphConv requires torch-geometric. Install with uv sync --extra md17.

CliffordGraphConv

Bases: CliffordModule

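Geometric graph convolution. Performs message passing on multivector node features.

Aggregates node features over the graph topology, then applies a Clifford linear map: H' = Aggregate(H) * W + Bias, where Aggregate(H) is the product of the adjacency matrix with the flattened node features.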

Attributes:

| Name | Type | Description |
|---|---|---|
| linear | CliffordLinear | The transformation. |

Source code in layers/adapters/gnn.py
class CliffordGraphConv(CliffordModule):
    """Geometric Graph Conv. Performs message passing using multivector features.

    Aggregates features based on graph topology.
    H' = Aggregate(H) * W + Bias.

    Attributes:
        linear (CliffordLinear): The transformation.
    """

    def __init__(self, algebra: CliffordAlgebra, in_channels: int, out_channels: int):
        """Sets up the GNN layer.

        Args:
            algebra (CliffordAlgebra): The algebra instance.
            in_channels (int): Input features.
            out_channels (int): Output features.
        """
        super().__init__(algebra)
        self.linear = CliffordLinear(algebra, in_channels, out_channels)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        """Aggregates and transforms node features using geometric operations.

        Args:
            x (torch.Tensor): Node features [N, C, D].
            adj (torch.Tensor): Adjacency matrix [N, N].

        Returns:
            torch.Tensor: Updated features.
        """
        # 1. Aggregate
        N, C, D = x.shape
        x_flat = x.view(N, -1)

        # Sparse aggregation
        x_agg_flat = torch.mm(adj, x_flat)
        x_agg = x_agg_flat.view(N, C, D)

        # 2. Transform
        out = self.linear(x_agg)

        return out
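
The aggregation step flattens each node's `[C, D]` features, multiplies by the adjacency matrix, and restores the shape, so every component of every channel is mixed across neighbors in the same way. A sketch in plain Python with nested lists in place of tensors. The helper name `aggregate` and the path graph are hypothetical:

```python
def aggregate(adj: list[list[float]], x: list[list[list[float]]]) -> list[list[list[float]]]:
    """Compute adj @ x over the node axis for features of shape [N, C, D]."""
    N, C, D = len(x), len(x[0]), len(x[0][0])
    # Flatten [N, C, D] -> [N, C*D]
    x_flat = [[v for channel in node for v in channel] for node in x]
    # Dense matrix product adj [N, N] @ x_flat [N, C*D]
    agg_flat = [
        [sum(adj[i][j] * x_flat[j][k] for j in range(N)) for k in range(C * D)]
        for i in range(N)
    ]
    # Restore [N, C, D]
    return [[row[c * D:(c + 1) * D] for c in range(C)] for row in agg_flat]


# Hypothetical path graph 0 - 1 - 2 with C = 1 channel and D = 2 components
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
x = [[[1.0, 0.0]], [[0.0, 2.0]], [[3.0, 0.0]]]
aggregated = aggregate(adj, x)
print(aggregated)
```

With an unnormalized 0/1 adjacency, each node receives the sum of its neighbors' features. The middle node here gets the sum of both endpoints.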

__init__(algebra, in_channels, out_channels)

Sets up the GNN layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| algebra | CliffordAlgebra | The algebra instance. | required |
| in_channels | int | Input features. | required |
| out_channels | int | Output features. | required |

Source code in layers/adapters/gnn.py
def __init__(self, algebra: CliffordAlgebra, in_channels: int, out_channels: int):
    """Sets up the GNN layer.

    Args:
        algebra (CliffordAlgebra): The algebra instance.
        in_channels (int): Input features.
        out_channels (int): Output features.
    """
    super().__init__(algebra)
    self.linear = CliffordLinear(algebra, in_channels, out_channels)

forward(x, adj)

Aggregates and transforms node features using geometric operations.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| x | Tensor | Node features [N, C, D]. | required |
| adj | Tensor | Adjacency matrix [N, N]. | required |

Returns:

| Type | Description |
|---|---|
| Tensor | Updated features. |

Source code in layers/adapters/gnn.py
def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """Aggregates and transforms node features using geometric operations.

    Args:
        x (torch.Tensor): Node features [N, C, D].
        adj (torch.Tensor): Adjacency matrix [N, N].

    Returns:
        torch.Tensor: Updated features.
    """
    # 1. Aggregate
    N, C, D = x.shape
    x_flat = x.view(N, -1)

    # Sparse aggregation
    x_agg_flat = torch.mm(adj, x_flat)
    x_agg = x_agg_flat.view(N, C, D)

    # 2. Transform
    out = self.linear(x_agg)

    return out
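
`forward` uses `adj` exactly as given, so whether the aggregation is a sum or a mean depends on how the caller builds the matrix. One common choice, shown here purely as an assumption and not as something the library requires, is to add self-loops and row-normalize so that each node averages over itself and its neighbors. The helper name is hypothetical:

```python
def row_normalize_with_self_loops(adj: list[list[float]]) -> list[list[float]]:
    """Add self-loops, then divide each row by its sum (mean aggregation)."""
    n = len(adj)
    with_loops = [
        [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)
    ]
    return [[v / sum(row) for v in row] for row in with_loops]


# Hypothetical path graph 0 - 1 - 2
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
normalized = row_normalize_with_self_loops(adj)
for row in normalized:
    print(row)
```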