CrossPlace: Cross-modal Place Recognition between Omnidirectional Cameras and LiDAR via a Unified Feature Space

1Institute for Engineering Research (I3E)
2Valencian Graduate School and Research Network for Artificial Intelligence (valgrAI)

General architecture of the CrossPlace method. The LiDAR point cloud is converted to an image through a spherical projection. Depending on the information projected onto each image pixel, (a) the intensity image, (b) the range image and (c) the range image segmented by MinkUNet34C [1] are obtained. Likewise, the fisheye images are transformed into an equirectangular image. This image is used to compute (d) the intensity image through grayscale conversion, (e) the depth image estimated by Depth Anything V2 Large [2] and (f) the semantic image obtained through SegFormer [3]. Each type of image is embedded by an independent fine-tuned CosPlace model [4] with shared weights between sensor modalities. The final CrossPlace embedding is the concatenation of the intensity, depth and semantic embeddings.
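The spherical projection mentioned in the caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `spherical_projection`, the 64 x 1024 image size and the vertical field of view (+2.0 to -24.8 degrees, typical of a 64-beam sensor) are all assumptions chosen for illustration.

```python
import numpy as np

def spherical_projection(points, intensity, height=64, width=1024,
                         fov_up_deg=2.0, fov_down_deg=-24.8):
    """Project an (N, 3) point cloud onto range and intensity images.

    Image size and field of view are assumed values, not taken from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    intensity = np.asarray(intensity, dtype=np.float64)

    fov_up = np.radians(fov_up_deg)
    fov = fov_up - np.radians(fov_down_deg)  # total vertical field of view

    # Range of every point; drop points at the sensor origin.
    rng = np.linalg.norm(points, axis=1)
    valid = rng > 1e-6
    points, intensity, rng = points[valid], intensity[valid], rng[valid]

    yaw = np.arctan2(points[:, 1], points[:, 0])  # azimuth in [-pi, pi]
    pitch = np.arcsin(points[:, 2] / rng)         # elevation

    # Map angles to pixel coordinates (column u, row v).
    u = 0.5 * (1.0 - yaw / np.pi) * width
    v = (fov_up - pitch) / fov * height
    u = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, height - 1).astype(np.int64)

    # Write far points first so that the nearest point per pixel is kept.
    order = np.argsort(rng)[::-1]
    range_img = np.zeros((height, width), dtype=np.float32)
    intensity_img = np.zeros((height, width), dtype=np.float32)
    range_img[v[order], u[order]] = rng[order]
    intensity_img[v[order], u[order]] = intensity[order]
    return range_img, intensity_img
```

Pixels that receive no point stay at zero. When several points fall on the same pixel, the sketch keeps the closest one, which is a common convention for range images but is not stated in the source.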

Abstract

This paper presents CrossPlace, an innovative method for cross-modal place recognition between heterogeneous sensor modalities, particularly between fisheye cameras and LiDAR. Place recognition is the fundamental capability of mobile robots to determine their most likely location within a database, based on sensory input queries. In cross-modal place recognition, the goal is to localize using a different sensor from the one originally used to construct the database. The core contribution of this paper is a unified feature space that integrates intensity, depth and semantic information. Both the database entries and the queries are obtained by embedding sensor readings through the same CrossPlace model, ensuring a consistent representation across modalities. Consequently, a database constructed from LiDAR can be queried with fisheye images, and vice versa, using a single shared architecture. Furthermore, a comprehensive data transformation and preprocessing pipeline is presented. Specifically, CrossPlace consists of three independent branches, which process intensity, depth and semantic information, respectively. Each branch consists of a CosPlace model for image embedding with shared weights across sensor modalities. Late fusion through concatenation of the intensity, depth and semantic embeddings provides the best overall performance. We conduct an exhaustive evaluation on the KITTI-360 dataset, where CrossPlace surpasses state-of-the-art techniques across all metrics, establishing a new standard for cross-modal place recognition in urban and highway environments. The results demonstrate the effectiveness of our unified approach for place recognition across different sensor modalities while maintaining robust performance under various operating environments.
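The late fusion and database query described in the abstract can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names `fuse` and `retrieve`, the per-branch L2 normalization and the Euclidean nearest-neighbor search are assumptions, and the per-branch embeddings are assumed to come from already trained CosPlace models.

```python
import numpy as np

def fuse(intensity_emb, depth_emb, semantic_emb):
    """Late fusion: concatenate the three branch embeddings per sample.

    Each branch is L2-normalized first (an assumption) so that no branch
    dominates the distance computation.
    """
    parts = []
    for emb in (intensity_emb, depth_emb, semantic_emb):
        emb = np.asarray(emb, dtype=np.float32)
        norm = np.linalg.norm(emb, axis=-1, keepdims=True)
        parts.append(emb / np.maximum(norm, 1e-12))
    return np.concatenate(parts, axis=-1)

def retrieve(query, database, k=1):
    """Return the indices of the k database descriptors closest to the query."""
    dists = np.linalg.norm(database - query[None, :], axis=1)
    return np.argsort(dists)[:k]

# Hypothetical usage: the database comes from one sensor, the query from the other.
# db = fuse(lidar_intensity_embs, lidar_depth_embs, lidar_semantic_embs)
# q = fuse(cam_intensity_emb, cam_depth_emb, cam_semantic_emb)
# best_match = retrieve(q, db, k=1)[0]
```

Because both sensors are embedded into the same feature space, the same `retrieve` call would serve a fisheye query against a LiDAR database and the reverse case.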

CrossPlace Evaluation Results

Highway Environment - Sequence 03

2D-3D Modality

3D-2D Modality

CrossPlace Evaluation Results

Urban Environment - Sequence 00

2D-3D Modality

3D-2D Modality

CrossPlace Further Test Results

Highway Environment - Sequence 07

2D-3D Modality

3D-2D Modality

CrossPlace Further Test Results

Urban Environment - Sequence 18

2D-3D Modality

3D-2D Modality

Acknowledgements

The Ministry of Science, Innovation and Universities (Spain) has funded this work through FPU23/00587 (M. Alfaro) and FPU21/04969 (J.J. Cabrera). This work is part of the projects PID2023-149575OB-I00, funded by MICIU/AEI/10.13039/501100011033 and by FEDER UE, and CIPROM/2024/8, funded by Generalitat Valenciana.
