Seminar: Graduate Seminar
Holistic Multi-Modal Alignment for Open-Set 3D Recognition
Open-set 3D recognition requires models to generalize to rare or unseen categories, a critical capability for robots and autonomous platforms operating in cluttered, real-world environments. Recent approaches typically address this by distilling vision-language knowledge into 3D encoders, aligning each point cloud with a single image or caption. However, this often anchors representations to partial views and fails to capture a holistic understanding of 3D objects. Current state-of-the-art methods usually reach high performance by scaling capacity, relying on large 2D ViT-based backbones with 200M to over 2B parameters. While effective, these models are computationally heavy and incur significant training and inference latency.
In this talk, we present HOLA, a framework designed to capture a more complete understanding of 3D shapes by aligning each point cloud with multiple images and textual descriptions simultaneously. We introduce a decoupled multi-positive contrastive loss that jointly aligns a 3D instance with its matched multi-view signals while separating positive aggregation from negative competition. This formulation prevents “spotlight crowding”, a phenomenon in which many positives share the same softmax with the negatives, and thereby preserves a hardness-aware focus on the challenging negative samples that are crucial for learning discriminative features. Complementing this loss, we apply a lightweight text adapter specifically to web captions in order to reduce the domain gap between noisy, large-scale unsupervised text and curated annotations. Our results demonstrate that HOLA achieves state-of-the-art open-vocabulary performance on long-tail benchmarks such as Objaverse-LVIS. Remarkably, these results are obtained with a compact 32M-parameter backbone, substantially smaller than the largest state-of-the-art models, while maintaining high frame rates and low inference latency. This combination of accuracy and efficiency makes HOLA a strong candidate for real-world deployment under tight computational and memory constraints.
The speaker is an M.Sc. student under the supervision of Prof. Ayellet Tal.
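The abstract does not give the exact form of the decoupled multi-positive loss, so the following is only a minimal sketch of one plausible reading, not HOLA's actual formulation. It assumes that each positive (an image or caption matched to the 3D instance) is scored in its own softmax against the negatives only, and that the per-positive losses are then averaged. The function names, the similarity inputs, and the temperature value of 0.07 are all hypothetical. A coupled baseline, in which all positives and negatives share one softmax, is included for contrast.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def decoupled_multi_positive_loss(pos_sims, neg_sims, temperature=0.07):
    """Assumed decoupled form: each positive competes only with the negatives.

    pos_sims: similarities between one 3D instance and its matched views/captions.
    neg_sims: similarities between the same 3D instance and unmatched samples.
    """
    neg_logits = [s / temperature for s in neg_sims]
    losses = []
    for s in pos_sims:
        pos_logit = s / temperature
        # -log( exp(pos) / (exp(pos) + sum_n exp(neg_n)) )
        losses.append(_logsumexp([pos_logit] + neg_logits) - pos_logit)
    # Positive aggregation happens here, outside the softmax.
    return sum(losses) / len(losses)


def coupled_multi_positive_loss(pos_sims, neg_sims, temperature=0.07):
    """Baseline for contrast: positives and negatives share one softmax."""
    logits = [s / temperature for s in list(pos_sims) + list(neg_sims)]
    log_denominator = _logsumexp(logits)
    losses = [log_denominator - s / temperature for s in pos_sims]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Four well-aligned positives and two easy negatives (illustrative values).
    pos = [0.9, 0.9, 0.9, 0.9]
    neg = [0.2, 0.1]
    print("decoupled:", decoupled_multi_positive_loss(pos, neg))
    print("coupled:  ", coupled_multi_positive_loss(pos, neg))
```

With several equally good positives, the coupled baseline cannot fall below roughly log(number of positives), because the positives compete with each other in the shared denominator. In the sketched decoupled form the loss depends only on how each positive compares with the negatives, which is one way to read the "spotlight crowding" argument in the abstract.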

