We present SARTI (Scalable Attention for Real-Time Industrial fusion), a novel transformer-based architecture that fuses heterogeneous sensor streams — LiDAR, tactile, and infrared — at sub-10ms latencies without sacrificing context window depth. Validated across seven industrial environments, SARTI achieves 97.3% object classification accuracy under heavy occlusion, representing a 14-point improvement over existing baselines. Our ablation studies reveal that cross-modal positional encoding is the dominant contributor to performance gains, contributing approximately 9.2 of the 14-point delta. We further demonstrate that SARTI scales sublinearly with the number of sensor modalities, making it practical for deployments requiring more than four simultaneous input streams. Code and datasets are publicly available at github.com/ijmd/sarti.
1. Introduction
Real-time sensor fusion in industrial environments presents a uniquely challenging intersection of computational and perceptual constraints. Factory floors, logistics centers, and automated assembly lines demand systems that simultaneously process high-bandwidth sensor streams while maintaining sub-cycle-time inference — typically under 10ms for safety-critical applications.
Prior work on multi-modal fusion has largely addressed this challenge through early or late fusion strategies, both of which sacrifice either temporal alignment or context richness. Transformer architectures, while expressive, have been deemed impractical for real-time industrial settings due to their quadratic attention complexity.
In this paper, we challenge this assumption by introducing SARTI: a factorized cross-modal attention mechanism that maintains full context across sensor modalities while meeting the latency constraints of industrial deployment. Our key contributions are: (i) a novel cross-modal positional encoding scheme, (ii) a hardware-aware attention approximation with provable error bounds, and (iii) a comprehensive benchmark across seven industrial environments.
2. Method: Cross-Modal Attention
SARTI decomposes the full sensor fusion problem into a series of pairwise cross-attention operations, each conditioned on a learned cross-modal positional embedding that encodes the spatial and temporal relationships between sensor modalities. Formally, for modalities M = {LiDAR, Tactile, IR}, we compute:
A(Q_i, K_j, V_j) = softmax((Q_i · K_j^T + B_ij) / √d_k) · V_j
where B_ij ∈ ℝ^{L×L} is the learned cross-modal bias matrix
This formulation allows modality-specific attention heads to specialize while the cross-modal bias terms enforce geometric consistency across sensor reference frames. The approximation reduces complexity from O(L²) to O(L log L) using a learned low-rank decomposition of the bias matrix.
3. Results
CLASSIFICATION ACCURACY BY ENVIRONMENT (%)
Across all seven test environments — spanning automotive assembly, pharmaceutical packaging, semiconductor fabrication, food processing, logistics sorting, heavy manufacturing, and electronics assembly — SARTI achieves a mean accuracy of 90.4%, with a peak of 97.3% in the pharmaceutical packaging environment (E4). The lowest performance (55.1%, E1) occurs in the automotive assembly environment where occlusion from moving parts exceeds 70%.