## 🔍 Ideas for Further Work

Our motivation was to glue together strong expert pre-trained models (detection, ReID, motion, pose, etc.) using a learned module instead of SORT-like heuristics (e.g. ByteTrack, DeepSORT, BoT-SORT).
This modular design contrasts with end-to-end (E2E) methods (MOTR, MOTIP, etc.), which aim to learn detection, re-identification, and motion jointly, but often require large-scale training data, are computationally intensive, and struggle in real-world applications.

While CAMELTrack provides a strong foundation, there is room for improvement.
The authors will not pursue these directions further, so we encourage others to explore and build on this work.
Feel free to open an issue or contact the authors with any suggestions or questions regarding these ideas.
### Suggested Research Directions
<details>
<summary>1. Self-Supervised Video Pre-Training</summary>
Self-supervised pre-training on large-scale video datasets is a promising path to improve temporal reasoning and generalization in MOT, particularly for end-to-end (E2E) methods that struggle without massive annotated data. Tasks like future frame prediction could naturally teach models about object motion and identity preservation—central to tracking—without requiring manual supervision.
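
As a starting point, here is a minimal sketch of such a pretext task (all module names, shapes, and design choices below are illustrative assumptions, not part of CAMELTrack): a small encoder embeds each frame, and the model is trained to predict the embedding of the next frame.

```python
# Minimal future-frame-prediction pretext task (hypothetical sketch,
# not part of the CAMELTrack codebase).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramePredictor(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Tiny CNN standing in for any frame backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
        # Temporal model that summarizes past frames.
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)
        ctx, _ = self.temporal(feats[:, :-1])   # encode frames 0 .. T-2
        pred = self.head(ctx[:, -1])            # predict last frame's embedding
        return F.mse_loss(pred, feats[:, -1].detach())  # stop-gradient target

loss = FramePredictor()(torch.randn(2, 8, 3, 64, 64))
loss.backward()
```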
</details>
<details>
<summary>2. Better Training Strategies</summary>
Our ablation studies show that data augmentation is crucial to reach state-of-the-art performance, but we only implemented basic strategies. There is clear room for improvement here.
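
As an illustration, association-level augmentations could simulate common tracker failure modes directly on tracklet inputs. A minimal sketch, assuming a simple (N, 4) xywh box layout (an assumption for illustration, not CAMELTrack's actual data format):

```python
# Hypothetical tracklet-level augmentations for association training.
# The (N, 4) xywh layout is an assumption for illustration only.
import torch

def augment_tracklet(boxes, drop_p=0.2, jitter=0.05):
    """boxes: (N, 4) xywh detections along one tracklet, oldest first."""
    # Simulate missed detections by dropping random past observations.
    keep = torch.rand(len(boxes)) > drop_p
    keep[-1] = True  # always keep the most recent detection
    boxes = boxes[keep]
    # Simulate localization noise with jitter proportional to box size.
    wh = boxes[:, 2:].repeat(1, 2)  # (w, h, w, h) scales per coordinate
    return boxes + torch.randn_like(boxes) * jitter * wh

aug = augment_tracklet(torch.rand(10, 4) * 100)
print(aug.shape)
```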
</details>
<details>
<summary>3. Cross-Domain Tracking</summary>
Study how CAMELTrack behaves in cross-domain settings by training it on one domain (e.g. DanceTrack) and evaluating it on another (e.g. SportsMOT), while keeping the CAMEL association module fixed. The idea is to replace only the off-the-shelf components (detector, ReID, etc.) with counterparts trained on the target domain. We believe that, unlike end-to-end methods, which learn all components jointly, CAMEL’s modular design may allow easier adaptation to new domains without retraining the learned association module.
</details>
<details>
<summary>4. Additional Cues</summary>
Extend CAMELTrack with domain-specific or general cues. Examples include jersey numbers for sports, license plates for vehicles, segmentation masks, monocular depth, or learned motion models. The architecture can naturally handle additional input modalities.
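
A minimal sketch of what adding a cue could look like: each cue gets its own small encoder mapping raw features to tokens of a shared dimension, which a transformer-based association module can then attend over. All names and dimensions below are illustrative assumptions, not CAMELTrack's actual API.

```python
# Hypothetical per-cue encoders producing tokens of a shared dimension.
import torch
import torch.nn as nn

class CueEncoder(nn.Module):
    def __init__(self, in_dim, token_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, token_dim), nn.ReLU(),
                                  nn.Linear(token_dim, token_dim))

    def forward(self, x):    # x: (num_detections, in_dim)
        return self.proj(x)  # (num_detections, token_dim)

# Existing cues (bbox, ReID) plus a hypothetical monocular-depth cue.
encoders = nn.ModuleDict({
    "bbox": CueEncoder(4), "reid": CueEncoder(2048), "depth": CueEncoder(1),
})
cues = {"bbox": torch.randn(5, 4), "reid": torch.randn(5, 2048),
        "depth": torch.randn(5, 1)}
# One token per cue per detection, ready for an association transformer.
tokens = torch.stack([encoders[k](v) for k, v in cues.items()], dim=1)
print(tokens.shape)  # (5 detections, 3 cues, 256)
```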
</details>
<details>
<summary>5. Alternative Designs</summary>
CAMELTrack aims to be simple and free of complex or handcrafted architectural design. Future work could, however, explore different architectures or custom training objectives.
</details>
<details>
<summary>6. Bridge the Gap with Detection-by-Tracking Methods</summary>
End-to-end methods like MOTR or SAM2 follow the detection-by-tracking paradigm, meaning they can use past information from their memory to help re-detect occluded targets in the current frame. CAMELTrack, like other tracking-by-detection methods, cannot currently do this, since detection is performed independently at each frame. A possible extension would be to replace CAMEL’s YOLO module with a dedicated DETR-like detector, prompted with CAMEL’s track tokens from the previous frame to help re-detect previously tracked targets.
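
A rough sketch of the prompting idea, assuming a generic DETR-style decoder (the shapes and module choices below are illustrative, not a worked-out design):

```python
# Hypothetical sketch: concatenate previous-frame track tokens with
# learned object queries before a DETR-like decoder, so known targets
# can be re-detected while new objects are still discovered.
import torch
import torch.nn as nn

dim, num_obj_queries, num_tracks = 256, 100, 12
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6,
)
object_queries = nn.Parameter(torch.randn(1, num_obj_queries, dim))

image_memory = torch.randn(1, 900, dim)          # encoder features (stand-in)
track_tokens = torch.randn(1, num_tracks, dim)   # CAMEL tokens from frame t-1

queries = torch.cat([track_tokens, object_queries], dim=1)
hidden = decoder(tgt=queries, memory=image_memory)
# First num_tracks outputs re-detect existing tracks; the remaining ones
# behave like standard DETR object queries for new objects.
track_out, new_out = hidden[:, :num_tracks], hidden[:, num_tracks:]
print(track_out.shape, new_out.shape)
```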
</details>
<details>
<summary>7. Latent Space Tracking with Detection Tokens</summary>
CAMELTrack currently relies on bounding box coordinates and image crops from YOLO. A promising direction would be to operate directly in the latent space of modern detectors like DETR, using their detection tokens as inputs to the association module. These tokens carry rich contextual information, including appearance, object relationships, and scene context, that is lost when detections are reduced to spatial boxes alone. Leveraging this richer representation could help resolve ambiguities, such as overlapping targets, more effectively. This approach could complement rather than replace dedicated ReID models, which still provide stronger appearance cues thanks to their high-resolution input crops and their training on difficult ReID-specific datasets with hard triplet sampling.
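
A toy sketch of associating directly in this latent space (the tokens below are random stand-ins; a real system would take them from a DETR decoder and would likely learn the similarity rather than use raw cosine scores):

```python
# Hypothetical sketch: score previous-frame track tokens against current
# detection tokens in latent space, then match with the Hungarian algorithm.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

track_tokens = torch.randn(7, 256)  # tracks from frame t-1 (stand-in)
det_tokens = torch.randn(9, 256)    # DETR decoder tokens, frame t (stand-in)

# Cosine similarity between every track and every detection.
sim = F.normalize(track_tokens, dim=-1) @ F.normalize(det_tokens, dim=-1).T
rows, cols = linear_sum_assignment(-sim.numpy())  # maximize total similarity
print(list(zip(rows.tolist(), cols.tolist())))
```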
</details>
<details>
<summary>8. Learned Tracklet Management</summary>
CAMELTrack currently focuses on frame-to-frame association but lacks an explicit mechanism for managing tracklet lifecycles. Future work could extend CAMEL to handle higher-level decisions such as when to pause a tracklet, when to resume it, or when to initialize a new one. Incorporating learned or rule-based tracklet management could improve robustness in scenarios involving occlusions, missed detections, false positives, or re-entries.
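
A minimal rule-based sketch of what such lifecycle management could look like (the states and thresholds below are illustrative assumptions, not CAMELTrack behavior):

```python
# Hypothetical tracklet lifecycle: pause after a few consecutive misses,
# resume on re-match, and terminate after a long gap.
from dataclasses import dataclass

@dataclass
class Tracklet:
    track_id: int
    misses: int = 0
    state: str = "active"

def update_lifecycle(t, matched, pause_after=3, kill_after=30):
    if matched:
        t.misses, t.state = 0, "active"      # resume on re-match
    else:
        t.misses += 1
        if t.misses >= kill_after:
            t.state = "terminated"           # drop the track for good
        elif t.misses >= pause_after:
            t.state = "paused"               # keep alive for re-entries
    return t

t = Tracklet(track_id=1)
for matched in [True, False, False, False, True]:
    print(update_lifecycle(t, matched).state)
```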
</details>
## 🖋 Citation
If you use this repository for your research or wish to refer to our contributions, please use the following BibTeX entries: