Scalable Attention Mechanisms for Real-Time Sensor Fusion in Autonomous Industrial Robots

Fashola, Adeyemi O.; Wang, Lei; Krishnamurthy, Priya; Müller, Johann S.

doi:10.12345/ijmd.ml.2026.0412

Article · IJMD ML · SERIES A

Adeyemi O. Fashola, Lei Wang, Priya Krishnamurthy, Johann S. Müller

University of Lagos · MIT CSAIL · IIT Delhi · ETH Zürich

Series A · Vol 12, No 2 · Open Access · DOI: 10.12345/ijmd.ml.2026.0412 · Published June 1, 2026

#SensorFusion #Transformers #Robotics #RealTime #IndustrialAI

Abstract

We present SARTI (Scalable Attention for Real-Time Industrial fusion), a transformer-based architecture that fuses heterogeneous sensor streams (LiDAR, tactile, and infrared) at sub-10 ms latency without sacrificing context window depth. Validated across seven industrial environments, SARTI reaches a peak object classification accuracy of 97.3% (mean 90.4%), a 14-point improvement over existing baselines. Our ablation studies show that cross-modal positional encoding is the dominant source of this gain, accounting for approximately 9.2 of the 14 points. We further demonstrate that SARTI scales sublinearly with the number of sensor modalities, making it practical for deployments requiring more than four simultaneous input streams. Code and datasets are publicly available at github.com/ijmd/sarti.

1. Introduction

Real-time sensor fusion in industrial environments must satisfy computational and perceptual constraints at the same time. Factory floors, logistics centers, and automated assembly lines demand systems that process high-bandwidth sensor streams while keeping inference within the control cycle time, typically under 10 ms for safety-critical applications.

Prior work on multi-modal fusion has largely addressed this challenge through early or late fusion strategies, each of which sacrifices either temporal alignment or context richness. Transformer architectures, while expressive, have been deemed impractical for real-time industrial settings because the cost of attention grows quadratically with sequence length.

In this paper, we challenge this assumption by introducing SARTI: a factorized cross-modal attention mechanism that maintains full context across sensor modalities while meeting the latency constraints of industrial deployment. Our key contributions are: (i) a novel cross-modal positional encoding scheme, (ii) a hardware-aware attention approximation with provable error bounds, and (iii) a comprehensive benchmark across seven industrial environments.

2. Method: Cross-Modal Attention

SARTI decomposes the full sensor fusion problem into a series of pairwise cross-attention operations, each conditioned on a learned cross-modal positional embedding that encodes the spatial and temporal relationships between sensor modalities. Formally, for each pair of modalities i, j in M = {LiDAR, Tactile, IR}, we compute:

        A(Q_i, K_j, V_j) = softmax((Q_i · K_j^T + B_ij) / √d_k) · V_j

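        where Q_i is the query matrix of modality i, K_j and V_j are the key and value matrices of modality j, d_k is the key dimension, and B_ij ∈ ℝ^{L×L} is the learned cross-modal bias matrix for sequence length L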
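As an illustration of the attention expression above, the following is a minimal pure-Python sketch rather than the SARTI implementation. It makes several assumptions not stated in the text: raw token features stand in for Q, K, and V (the learned projections are omitted), the bias is set to zero, one attention call is made per ordered pair of distinct modalities, and the helper names (matmul, softmax_rows, cross_attention) are hypothetical.

```python
import math
from itertools import permutations


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax_rows(scores):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(q_i, k_j, v_j, b_ij):
    """softmax((Q_i K_j^T + B_ij) / sqrt(d_k)) V_j, with the bias added before scaling."""
    d_k = len(k_j[0])
    logits = matmul(q_i, transpose(k_j))
    scaled = [
        [(s + b) / math.sqrt(d_k) for s, b in zip(s_row, b_row)]
        for s_row, b_row in zip(logits, b_ij)
    ]
    return matmul(softmax_rows(scaled), v_j)


# Toy inputs: L = 2 tokens per modality, d_k = 2. Values are illustrative only.
tokens = {
    "LiDAR": [[1.0, 0.0], [0.0, 1.0]],
    "Tactile": [[0.5, 0.5], [1.0, 1.0]],
    "IR": [[0.0, 2.0], [2.0, 0.0]],
}
zero_bias = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for a learned B_ij

# Assumption: one call per ordered pair (i, j) with i != j, giving 6 pairs for 3 modalities.
fused = {
    (i, j): cross_attention(tokens[i], tokens[j], tokens[j], zero_bias)
    for i, j in permutations(tokens, 2)
}
```

Each entry of fused is an L × d_k matrix whose rows are convex combinations of the rows of V_j. A strongly negative entry in B_ij suppresses the corresponding query-key pair, which is one way a bias term can steer attention toward geometrically consistent positions.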

This formulation allows modality-specific attention heads to specialize while the cross-modal bias terms enforce geometric consistency across sensor reference frames. The hardware-aware attention approximation reduces attention complexity from O(L²) to O(L log L) in the sequence length L, using a learned low-rank decomposition of the bias matrix.
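To make the low-rank idea concrete, the sketch below assumes one common factorization, B ≈ U Wᵀ with U, W ∈ ℝ^{L×r}. The text does not specify the rank r, the factor names, or how the factorization is learned, so all of these are illustrative assumptions. The sketch shows two things that hold for any such factorization: the bias needs 2Lr parameters instead of L², and the biased logits Q Kᵀ + U Wᵀ equal a single product of the concatenated factors [Q | U] [K | W]ᵀ. It does not by itself demonstrate the O(L log L) bound.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def low_rank_bias(u, w):
    """Materialize B = U W^T from factors U and W of shape (L, r)."""
    return matmul(u, transpose(w))


def param_counts(seq_len, rank):
    """Return (dense, factored) parameter counts for an L x L bias matrix."""
    return seq_len * seq_len, 2 * seq_len * rank


# Toy factors with L = 3 and r = 1 (hypothetical values).
u = [[1.0], [2.0], [3.0]]
w = [[0.5], [-1.0], [2.0]]
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[2.0, 1.0], [0.0, 3.0], [1.0, -1.0]]

# Direct route: build the full L x L bias, then add it to Q K^T.
bias = low_rank_bias(u, w)
direct = [
    [s + b for s, b in zip(s_row, b_row)]
    for s_row, b_row in zip(matmul(q, transpose(k)), bias)
]

# Folded route: append the bias factors as extra feature columns.
q_aug = [q_row + u_row for q_row, u_row in zip(q, u)]
k_aug = [k_row + w_row for k_row, w_row in zip(k, w)]
folded = matmul(q_aug, transpose(k_aug))

dense, factored = param_counts(1024, 8)
```

With L = 1024 and r = 8, the factored form stores 16,384 values instead of 1,048,576. Because the folded logits are an ordinary dot product of augmented queries and keys, the bias never has to be materialized as an L × L matrix.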

3. Results

Figure: Classification accuracy by environment (%)

Across all seven test environments — spanning automotive assembly, pharmaceutical packaging, semiconductor fabrication, food processing, logistics sorting, heavy manufacturing, and electronics assembly — SARTI achieves a mean accuracy of 90.4%, with a peak of 97.3% in the pharmaceutical packaging environment (E4). The lowest performance (55.1%, E1) occurs in the automotive assembly environment where occlusion from moving parts exceeds 70%.
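The three reported figures constrain the five environments whose accuracies are not stated individually. The short calculation below uses only the numbers in the paragraph above and assumes the reported mean is an unweighted average over the seven environments; since the mean is rounded to one decimal place, the result is approximate.

```python
n_envs = 7
mean_acc = 90.4    # reported mean accuracy (%)
peak_acc = 97.3    # reported peak accuracy (%)
lowest_acc = 55.1  # reported lowest accuracy (%)

# Sum over all environments, minus the two known extremes, spread over the rest.
remaining = n_envs - 2
implied_mean = (n_envs * mean_acc - peak_acc - lowest_acc) / remaining

print(round(implied_mean, 2))  # 96.08
```

Under that assumption, the remaining five environments average about 96.1%, so the 90.4% mean is pulled down almost entirely by the single high-occlusion environment.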

How to Cite

Fashola, A.O., Wang, L., Krishnamurthy, P., & Müller, J.S. (2026). Scalable Attention Mechanisms for Real-Time Sensor Fusion in Autonomous Industrial Robots. IJMD Machine Learning, 12(2), 412–441. https://doi.org/10.12345/ijmd.ml.2026.0412