Labeling: Auto Labeling
Offboard detector prototype
I was eager to reduce the time spent on the repetitive labeling process as much as possible. Consequently, I put in extra effort to develop an efficient offboard 3D LiDAR detector as an auto labeler. To achieve this, I needed to address several key considerations:
- Which pre-trained model to use
- How to preprocess the input
- How to post-process the output
Pre-trained Model
Given that I didn’t have a large amount of labeled data, using a pre-trained model was highly desirable. While pre-training a model on the DDAD dataset would have been ideal considering the highway environments, I decided to use a pre-trained model from mmdetection3D, which meant choosing among models trained on KITTI, nuScenes, or Waymo.
Among those options, Waymo had the advantage of a longer detection range (up to 75m) and LiDAR characteristics similar to ours. The only problem was its coarse-grained classes (Car, Pedestrian, Cyclist), since I wanted to use fine-grained classes (car, truck, bus, trailer, and construction vehicle) following nuScenes. To address this, I copied the detection head of the ‘Car’ class and replicated it five times to correspond to the five desired classes. The detection heads, originally pre-trained to detect the ‘Car’ class, were then fine-tuned to separately detect the ‘Car’ sub-classes, as sketched below.
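As a rough illustration, the head replication can be done directly on the checkpoint. The snippet below is a minimal sketch assuming an mmdetection3d CenterPoint-style checkpoint whose task branches are indexed submodules (`task_heads.0` being the ‘Car’ branch); the file names are placeholders, not my actual setup.

```python
# Sketch: initialize five single-class task heads from the 'Car' head,
# assuming mmdetection3d CenterPoint-style checkpoint keys such as
# 'pts_bbox_head.task_heads.0.reg.0.weight' (index 0 == 'Car').
import torch

FINE_CLASSES = ['car', 'truck', 'bus', 'trailer', 'construction_vehicle']

def replicate_car_head(state_dict, num_copies=len(FINE_CLASSES)):
    new_state = dict(state_dict)
    for key, value in state_dict.items():
        if '.task_heads.0.' in key:
            # Copy the 'Car' branch weights into branches 1..4.
            for i in range(1, num_copies):
                new_key = key.replace('.task_heads.0.', f'.task_heads.{i}.')
                new_state[new_key] = value.clone()
    return new_state

ckpt = torch.load('waymo_centerpoint.pth', map_location='cpu')  # placeholder path
ckpt['state_dict'] = replicate_car_head(ckpt['state_dict'])
torch.save(ckpt, 'waymo_centerpoint_5cls_init.pth')
```

The replicated heads then only need fine-tuning to tell the vehicle sub-classes apart, rather than learning vehicle detection from scratch.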
Preprocessing
I decided to label only the forward data, since the MVP model would use only the forward camera and radar. Additionally, I aimed to annotate objects up to 100m to demonstrate the superior detection range of camera-radar compared to LiDAR. While annotating out to an even longer distance would have been ideal, the performance limitations of our LiDAR made this challenging, so I settled for 100m.
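In mmdetection3d terms, this forward-only, 100m setup amounts to a point cloud range plus the matching range filters. The fragment below is a hedged sketch; the lateral and vertical bounds are illustrative assumptions, not our exact values.

```python
# Assumed convention: [x_min, y_min, z_min, x_max, y_max, z_max],
# with x pointing forward; only the forward half-space is kept.
point_cloud_range = [0.0, -40.0, -3.0, 100.0, 40.0, 3.0]

train_pipeline_fragment = [
    dict(type='PointsRangeFilter', point_cloud_range=point_cloud_range),
    dict(type='ObjectRangeFilter', point_cloud_range=point_cloud_range),
]
```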
As an ‘offboard’ detector, there was no constraint requiring the use of only past frames, and I didn’t need to prioritize computational efficiency or real-time processing. Thus, I used the previous 3 sweeps and the following (future) 3 sweeps of the keyframe, totaling 7 frames. While more sweeps or a wider frame interval might have been more effective, this configuration already worked well, so I didn’t invest more effort into this aspect.
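A minimal sketch of this aggregation, assuming each sweep carries a global ego pose and a timestamp (the dict layout and function name are hypothetical): past and future sweeps are warped into the keyframe’s coordinate frame and tagged with a relative-time channel.

```python
import numpy as np

def aggregate_sweeps(key_frame, past_sweeps, future_sweeps):
    """key_frame/sweeps: dicts with 'points' (N, 4: x, y, z, intensity),
    'pose' (4x4 global ego pose), and 'timestamp' (seconds)."""
    key_pose_inv = np.linalg.inv(key_frame['pose'])
    merged = []
    for sweep in past_sweeps + [key_frame] + future_sweeps:
        # Transform sweep points into the keyframe's ego frame.
        rel = key_pose_inv @ sweep['pose']
        xyz = sweep['points'][:, :3] @ rel[:3, :3].T + rel[:3, 3]
        # Relative timestamp lets the network tell the sweeps apart.
        dt = np.full((len(xyz), 1), sweep['timestamp'] - key_frame['timestamp'])
        merged.append(np.hstack([xyz, sweep['points'][:, 3:4], dt]))
    return np.concatenate(merged, axis=0)  # (M, 5): x, y, z, intensity, dt
```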
When training the network, I paid special attention to rotation augmentation: most vehicles on the highway travel in the same direction without large heading angle changes, so I wanted to ensure the model could handle different heading angles through strong rotation augmentation. I also didn’t use vertical (front-back) flip augmentation, since situations with vehicles approaching from the opposite direction would never occur in our data.
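In mmdetection3d pipeline style, these choices might look like the fragment below; the exact rotation range and other magnitudes are assumptions on my part, not the values I actually used.

```python
train_aug_fragment = [
    dict(
        type='GlobalRotScaleTrans',
        rot_range=[-0.785, 0.785],      # strong rotation: up to +/- 45 degrees
        scale_ratio_range=[0.95, 1.05],
        translation_std=[0.5, 0.5, 0.2]),
    dict(
        type='RandomFlip3D',
        flip_ratio_bev_horizontal=0.5,  # left-right flip kept
        flip_ratio_bev_vertical=0.0),   # no front-back flip: no oncoming traffic
]
```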
Post-processing
Finally, I put some effort into post-processing the detection results. I applied multiple rotation augmentations ([-3, -2, …, +2, +3] * pi/8) and horizontal (left-right) flip augmentation to the LiDAR point input, resulting in 14 predictions from a single input. After generating the predictions, I applied the inverse augmentation to each one, then combined them by averaging the attribute maps (refer to the mmdet3D CenterPoint TTA; I modified the rotation augmentation based on their codebase). In the future, I plan to explore object-level test-time augmentation methods similar to Weighted Box Fusion (WBF), which combine predictions while taking each prediction’s confidence into account.
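To make the inverse-augmentation step concrete, here is a simplified, box-level sketch of the 14-way TTA (7 rotations x 2 flips). Note that the actual approach averages CenterPoint attribute maps before decoding, as described above; `model` and the (K, 7) box layout here are placeholders, not the mmdet3D API.

```python
import numpy as np

ANGLES = [k * np.pi / 8 for k in range(-3, 4)]  # -3..+3 times pi/8, 7 values

def rotate_z(points, angle):
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def tta_predict(model, points):
    preds = []
    for angle in ANGLES:
        for flip in (False, True):
            aug = rotate_z(points[:, :3], angle)
            if flip:
                aug = aug * np.array([1.0, -1.0, 1.0])   # left-right flip
            boxes = model(np.hstack([aug, points[:, 3:]]))  # (K, 7): x,y,z,l,w,h,yaw
            # Invert the augmentation on the boxes, in reverse order.
            if flip:
                boxes[:, 1] *= -1.0
                boxes[:, 6] *= -1.0           # mirror the heading
            boxes[:, :3] = rotate_z(boxes[:, :3], -angle)
            boxes[:, 6] -= angle              # undo the rotation on the heading
            preds.append(boxes)
    return preds  # 14 box sets, ready to be fused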
Results
(Semi-)auto-labeled scenes; colors denote classes
The offboard 3D LiDAR detector trained with 400 frames performed quite well. I manually corrected the pseudo-labels generated by this model and repeated this process a few times, resulting in a total of 1600 frames. The significance of this achievement lies not only in the quantity of data but also in the ability to automate annotation with minimal effort.
The initial 400 frames took a few days to label, but by the final cycle, the required time had significantly decreased.
Future Directions
At this point, I had completed developing the offboard 3D LiDAR detector for auto labeling.
However, the following directions are worth exploring:
- Transition from frame-by-frame detection to tracklets: sequential data provide far more information than single-frame data.
- Ensure that object sizes remain consistent within the same tracklet: feeding consistent data to the model is essential.
- Smooth heading angle changes between adjacent frames through fitting: this might be important for downstream tracking or prediction tasks (see the sketch below this list).
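As an exploratory sketch of the last two points: sizes could be unified per tracklet with a per-dimension median, and headings smoothed by unwrapping the angle sequence and fitting it over time. The low-order polynomial fit here is an assumption about how I might do it, not an implemented method.

```python
import numpy as np

def consistent_size(sizes):
    # One (l, w, h) per tracklet, e.g. the per-dimension median of detections.
    return np.median(np.asarray(sizes), axis=0)

def smooth_tracklet_headings(timestamps, headings, deg=3):
    # Unwrap to remove 2*pi jumps, fit a low-order polynomial over time,
    # then wrap the smoothed headings back to (-pi, pi].
    unwrapped = np.unwrap(np.asarray(headings))
    coeffs = np.polyfit(timestamps, unwrapped, deg=min(deg, len(headings) - 1))
    smoothed = np.polyval(coeffs, timestamps)
    return (smoothed + np.pi) % (2 * np.pi) - np.pi
```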