Deploying: Network Modification



Although CRN is already capable of real-time operation, my goal was to optimize it further, reducing computation and memory requirements as much as possible while keeping the network lightweight and GPU-friendly, with potential deployment on embedded platforms such as the NVIDIA Jetson Orin or Qualcomm RB5 in mind.



CRN architecture


The main modifications made to the network included:

  1. BEV Grid and Depth Bin Size
    Given that the testing environment required a perception range of up to 100m, I extended the perception range in the forward direction while keeping a 50m range to the sides. Additionally, since the environment lacked small objects and had wider object spacing than urban scenarios, I increased the BEV grid cell size and the bin size of the depth distribution, reducing both memory usage and computational load (see the configuration sketch after this list).
  2. Backbone Layers
    I minimized the number of backbone layers in both the perspective-view and bird’s-eye-view feature extractors to reduce the overall parameter count.
  3. Number of Keyframes
    The original paper used four keyframes to incorporate information from the past three seconds. To compensate for the ego vehicle’s motion between keyframes, the BEV feature maps had to be warped back to time t.
    Given the large displacement between keyframes caused by high-speed driving (about 30m/s), there was minimal overlap between the feature maps, which could reduce their usefulness or even hurt performance. Therefore, I used only the current frame and the previous frame from one second earlier.
  4. Transformer Blocks
    I replaced the transformer layer used for fusing the camera and radar feature maps with a convolutional layer. While the transformer layer has a wider receptive field and demonstrated more robust performance, I could not guarantee that training would remain stable with an insufficient amount of data. Therefore, I opted for a convolutional layer that I expected to train reliably, albeit with slightly lower performance (a minimal fusion sketch follows this list).
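To make the scale of these changes concrete, here is a minimal configuration sketch. The keys (point_cloud_range, bev_grid_size, depth_bins, num_keyframes) and all numeric values are illustrative assumptions; they do not reflect the exact CRN config or the values I used.

```python
# Illustrative configs only: keys and values are assumptions, not CRN's real config.

nuscenes_style_cfg = dict(
    point_cloud_range=[-51.2, -51.2, -5.0, 51.2, 51.2, 3.0],  # square +/-51.2 m BEV extent
    bev_grid_size=0.8,              # metres per BEV cell
    depth_bins=(2.0, 58.0, 0.5),    # (min, max, bin size) in metres
    num_keyframes=4,                # current frame + past keyframes
)

modified_cfg = dict(
    point_cloud_range=[-50.0, 0.0, -5.0, 50.0, 100.0, 3.0],   # 100 m forward, 50 m to each side
    bev_grid_size=1.6,              # coarser BEV cells -> smaller BEV feature map
    depth_bins=(2.0, 102.0, 1.0),   # longer range but coarser bins -> similar bin count
    num_keyframes=2,                # current frame + one frame from ~1 s ago
)

def bev_map_shape(cfg):
    """BEV feature-map size implied by the range and cell size."""
    x_min, y_min, _, x_max, y_max, _ = cfg["point_cloud_range"]
    cell = cfg["bev_grid_size"]
    return int((x_max - x_min) / cell), int((y_max - y_min) / cell)

print(bev_map_shape(nuscenes_style_cfg))  # (128, 128) cells
print(bev_map_shape(modified_cfg))        # (62, 62) cells, roughly 4x fewer
```

Coarsening the grid shrinks the BEV feature map quadratically in the cell size, while coarsening the depth bins reduces the depth distribution linearly, which is where most of the savings come from.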
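For the fusion change, the block below sketches the general idea of swapping an attention-based camera-radar fusion layer for plain convolutions over concatenated BEV features. The class name, channel sizes, and layer choices are my own assumptions for illustration, not the actual CRN implementation.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fuse camera and radar BEV features by concatenation plus convolutions.

    A simple stand-in for a cross-attention fusion block: only a local
    receptive field, but straightforward to train on a small dataset.
    """
    def __init__(self, cam_channels=80, radar_channels=80, out_channels=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_channels + radar_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, radar_bev):
        # Both inputs are (B, C, H, W) feature maps on the same BEV grid.
        return self.fuse(torch.cat([cam_bev, radar_bev], dim=1))

# Example: fused = ConvFusion()(torch.rand(1, 80, 128, 128), torch.rand(1, 80, 128, 128))
```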

BEV feature map warping, Figure from UniFormer
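As a rough illustration of the warping step, the sketch below resamples a past BEV feature map into the current ego frame with grid_sample, assuming a 2D rigid transform from current-frame to previous-frame coordinates is available; the function signature and coordinate conventions are assumptions, not CRN's actual code.

```python
import torch
import torch.nn.functional as F

def warp_bev(prev_bev, curr_to_prev, pc_range):
    """Warp a past BEV feature map into the current ego frame.

    prev_bev:      (B, C, H, W) BEV features in the previous ego frame.
    curr_to_prev:  (B, 3, 3) homogeneous 2D rigid transform mapping
                   current-frame (x, y) coordinates into the previous frame.
    pc_range:      (x_min, y_min, x_max, y_max) metric extent of the BEV grid.
    """
    B, C, H, W = prev_bev.shape
    x_min, y_min, x_max, y_max = pc_range

    # Metric (x, y) coordinate of every BEV grid location in the current frame.
    ys = torch.linspace(y_min, y_max, H, device=prev_bev.device)
    xs = torch.linspace(x_min, x_max, W, device=prev_bev.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    ones = torch.ones_like(xx)
    pts = torch.stack([xx, yy, ones], dim=-1).reshape(1, -1, 3)  # (1, H*W, 3)

    # Transform current-frame coordinates into the previous ego frame.
    pts_prev = pts.expand(B, -1, -1) @ curr_to_prev.transpose(1, 2)  # (B, H*W, 3)

    # Normalise to [-1, 1] for grid_sample (x -> width, y -> height).
    gx = 2 * (pts_prev[..., 0] - x_min) / (x_max - x_min) - 1
    gy = 2 * (pts_prev[..., 1] - y_min) / (y_max - y_min) - 1
    grid = torch.stack([gx, gy], dim=-1).reshape(B, H, W, 2)

    # Cells that fall outside the previous map are zero-padded.
    return F.grid_sample(prev_bev, grid, align_corners=True, padding_mode="zeros")
```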


Domain Gap

I also took the domain gap between nuScenes and my test environment into account for training. Given the limited amount of training data, minimizing domain-related issues would make training more effective.


We can see the domain gap in both the camera images and the radar points


When filtering radar points, I adjusted the radar RCS (Radar Cross Section), SNR (Signal-to-Noise Ratio), and false alarm rate thresholds to align them as closely as possible with the characteristics of the data in nuScenes. Referring to the nuScenes radar data format, the ARS430 provides data similar to the ARS408. However, I have yet to conduct an in-depth analysis of potential challenges arising from the much higher driving speeds in my environment compared to nuScenes.
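As an illustration of this kind of filtering, the sketch below thresholds detections on RCS, SNR, and a false-alarm probability. The field names and threshold values are placeholders; in practice they would have to match the actual ARS430 output and the statistics of the nuScenes radar points.

```python
import numpy as np

def filter_radar_points(points, rcs_min=-10.0, snr_min=5.0, false_alarm_max=0.25):
    """Keep radar detections whose attributes pass simple thresholds.

    `points` is assumed to be a structured array with per-detection fields
    'rcs' (dBsm), 'snr' (dB) and 'false_alarm' (probability); the field names
    and default thresholds are placeholders, not values from the project.
    """
    mask = (
        (points["rcs"] >= rcs_min)
        & (points["snr"] >= snr_min)
        & (points["false_alarm"] <= false_alarm_max)
    )
    return points[mask]

# Example with dummy detections.
dtype = [("x", "f4"), ("y", "f4"), ("rcs", "f4"), ("snr", "f4"), ("false_alarm", "f4")]
pts = np.array(
    [(10.0, 1.0, 5.0, 12.0, 0.1),     # kept
     (35.0, -2.0, -20.0, 3.0, 0.6)],  # dropped: weak, noisy, likely false alarm
    dtype=dtype,
)
print(filter_radar_points(pts))
```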


Regarding the camera images, I faced a more significant domain shift when estimating depth from the image (possibly due to differences in camera intrinsics and mounting position), and I needed more data to address this issue. Fortunately, training the image-based depth prediction was achievable without additional annotation cost, since it could be supervised by lidar points projected into the image (see the sketch below).
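The sketch below shows how such sparse depth supervision can be generated by projecting lidar points into the image with the camera extrinsics and intrinsics; the matrix conventions and names are assumptions rather than the exact pipeline used here.

```python
import numpy as np

def lidar_to_sparse_depth(points_lidar, T_cam_lidar, K, image_shape):
    """Project lidar points into the image to build a sparse depth target.

    points_lidar: (N, 3) lidar points in the lidar frame.
    T_cam_lidar:  (4, 4) extrinsic transform from lidar frame to camera frame.
    K:            (3, 3) camera intrinsic matrix.
    image_shape:  (H, W) of the camera image.
    Returns an (H, W) depth map that is zero where no lidar point projects.
    """
    H, W = image_shape
    pts = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    pts_cam = (T_cam_lidar @ pts.T).T[:, :3]          # points in the camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]            # keep points in front of the camera

    uv = (K @ pts_cam.T).T                            # perspective projection
    depth = uv[:, 2]
    u = np.round(uv[:, 0] / depth).astype(int)
    v = np.round(uv[:, 1] / depth).astype(int)

    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth_map = np.zeros((H, W), dtype=np.float32)
    depth_map[v[inside], u[inside]] = depth[inside]   # sparse ground-truth depth
    return depth_map
```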

I trained the network under two scenarios: one with 3D annotations, allowing the entire network to be trained end-to-end, and another without 3D annotations, training only the depth network. Due to limited GPU resources and time constraints, I was unable to verify through ablation tests how much the additional depth network training improved performance.
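For the second scenario, a minimal sketch of restricting training to the depth network is shown below; the depth_net parameter prefix is hypothetical and would need to match the actual module names in the CRN implementation.

```python
import torch.nn as nn

def set_depth_only_training(model: nn.Module, depth_prefix: str = "depth_net"):
    """Freeze all parameters except those of the depth sub-network.

    `depth_prefix` is a hypothetical module name; in practice it must match
    the parameter names of the actual model implementation.
    """
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(depth_prefix)

# With 3D annotations: train everything end-to-end (all requires_grad=True).
# Without 3D annotations: call set_depth_only_training(model) and optimise only
# the depth loss computed against lidar points projected into the image.
```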