An End-to-End Transformer Model for 3D Object Detection
2021, link
# Intro
- Many detection models work directly on point clouds
- turning unordered set of inputs (point cloud) into unordered set of outputs (bbx)
- e.g. VoteNet with encoder (PointNet++) and decoder architecture
- effective, but required years of careful development: hand-encoding inductive biases, tuning radii, and designing special 3D operators and loss functions
- recently, set-to-set encoder-decoder models have also emerged as a competitive approach in 2D object detection (e.g. DETR)
- Central question: since transformers are permutation-invariant (good for set-to-set problems), can we build a 3D object detector with them w/o hand-designed inductive biases?
# Related Work
## Grid-based 3D Architectures
- convert irregular point clouds into 3D/2D grids (voxels) then apply convnets
## Point Cloud Architectures
- use PointNet++ operations (downsampling + shared MLPs + max-pooling) on point clouds directly (sketched after this list)
- construct graphs over points (e.g. DGCNN, PointWeb)
- continuous point convolutions (e.g. PointConv, KPConv)
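
A minimal PyTorch sketch of the downsample + shared-MLP + max-pool idea behind these point-based ops. Random sampling (as a stand-in for farthest-point sampling), the k = 16 neighborhoods, and the channel widths are illustrative assumptions, not any specific paper's settings.

```python
import torch
import torch.nn as nn

class SetAggregation(nn.Module):
    """Downsample a point cloud and pool a shared MLP over local neighborhoods."""
    def __init__(self, n_out=1024, k=16, d_out=256):
        super().__init__()
        self.n_out, self.k = n_out, k
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, d_out),
        )

    def forward(self, xyz):  # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # random subset of points, standing in for farthest-point sampling
        idx = torch.stack([torch.randperm(N, device=xyz.device)[: self.n_out]
                           for _ in range(B)])
        centers = torch.gather(xyz, 1, idx.unsqueeze(-1).expand(-1, -1, 3))       # (B, N', 3)
        # k nearest neighbors of each sampled center
        nn_idx = torch.cdist(centers, xyz).topk(self.k, largest=False).indices    # (B, N', k)
        neighbors = torch.gather(
            xyz.unsqueeze(1).expand(-1, self.n_out, -1, -1), 2,
            nn_idx.unsqueeze(-1).expand(-1, -1, -1, 3))                           # (B, N', k, 3)
        local = neighbors - centers.unsqueeze(2)       # coordinates relative to the center
        feats = self.mlp(local).max(dim=2).values      # shared MLP + max-pool over neighborhood
        return centers, feats                          # (B, N', 3), (B, N', d_out)

xyz = torch.rand(2, 4096, 3)
centers, feats = SetAggregation()(xyz)   # (2, 1024, 3), (2, 1024, 256)
```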
# 3DETR: Encoder-decoder Transformer
- Input: $N$ points where each point is associated with $x,y,z$ coordinates
- downsample input to $N'$ points with features via PointNet-style set-aggregation ops
- pass $N'$ features through the transformer encoder, then the decoder, to output bbx
## Encoder
- Input: points $(N,3+C)$
- SA (set aggregation): $(N,3+C) \to (N',d)$ where $d=256$
- via $MLP(64,128,256)$
- Attn: $(N',d)\to(N',d)$, applied multiple times (see sketch below)
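
Rough shape-level sketch of the encoder using PyTorch's stock transformer layers. Layer count, head count, and feed-forward width are assumptions, not the paper's exact hyper-parameters.

```python
import torch
import torch.nn as nn

d = 256
enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, dim_feedforward=128,
                                       batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

feats = torch.randn(2, 1024, d)   # (batch, N' = 1024, d) set-aggregated point features
enc_feats = encoder(feats)        # repeated self-attention: (2, 1024, d) -> (2, 1024, d)
```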
## Decoder
- Frame detection as a set prediction problem.
- i.e. predict a set of boxes w/o order
- Parallel decoder takes in the $N'$ point features and a set of $B$ query embeddings $\{\mathbf{q}^e_1,\ldots,\mathbf{q}^e_B\}$ to produce $B$ features for bbx.
- $\mathbf{q}^e$ represent locations in 3D space around which the final 3D bounding boxes are predicted.
- positional embeddings are used to encode these query locations (see sketch below)
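
Hedged decoder sketch: query embeddings come from positional encodings of seed locations sampled from the scene and cross-attend to the $N'$ encoder features; small MLP heads then turn each decoder output into box parameters. The random seed sampling (in place of farthest-point sampling), the MLP positional encoder (the paper uses Fourier embeddings), the layer counts, and the head layout are all illustrative assumptions.

```python
import torch
import torch.nn as nn

d, n_queries, num_classes = 256, 128, 18    # illustrative values

pos_embed = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))  # stand-in for Fourier PE
dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, dim_feedforward=256,
                                       batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=8)
center_head = nn.Linear(d, 3)               # box center offset from the query location
size_head = nn.Linear(d, 3)                 # box size (a real model would keep this positive)
cls_head = nn.Linear(d, num_classes + 1)    # object classes + "not an object"

xyz = torch.rand(2, 4096, 3)                # raw point cloud (batch, N, 3)
enc_feats = torch.randn(2, 1024, d)         # encoder output (batch, N', d)

# B query locations sampled from the scene, turned into query embeddings q^e
seed_idx = torch.randint(0, 4096, (2, n_queries))
seeds = torch.gather(xyz, 1, seed_idx.unsqueeze(-1).expand(-1, -1, 3))   # (2, B, 3)
queries = pos_embed(seeds)                                               # (2, B, d)

box_feats = decoder(queries, enc_feats)     # cross-attention over encoder features: (2, B, d)
boxes = {"center": seeds + center_head(box_feats),
         "size":   size_head(box_feats),
         "logits": cls_head(box_feats)}     # one predicted box per query
```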