Recycling Waste Object Detection(2nd)

Project Overview


Competition Overview

Development of an object detection model for classifying recyclable items. The goal of this competition is to improve the detection performance of an object detection model across 10 categories of waste.


Development Environment


Leaderboard

After testing a range of models and many hypotheses, we finished in 2nd place with an mAP50 of 0.7482. We tried several models, from the recent SOTA model Co-DETR to ATSS, Faster R-CNN, DyHead, and YOLO, and combined them in a final ensemble. Notably, adding the YOLO model to the ensemble raised the score even though YOLO scored lower on its own. This showed us that ensembling a diverse set of models plays an important role in improving performance.
We also made active use of Git Issues, Pull Requests, and a Project Table to track progress while collaborating as a team. These tools helped with communication and division of work, which contributed to the final result.


Timeline

Pasted_image_20241113223533.png


Project Procedure and Methods

To detect waste in photos, we experimented with and optimized a variety of object detection models using MMDetection, an open-source object detection library. By tuning and combining several detectors for the recycling classification task, we placed 2nd with an mAP50 of 0.7482 in the "Object Detection for Recycling Item Classification" competition hosted by Naver Boostcamp.

The main approaches I tried to improve waste detection performance were:

  1. Experiments with various object detection models
    I tested several models and tuned each model's hyperparameters for the recycling classification task.

  2. Model combination and ensembling
    To maximize performance, I ensembled several models, and combining diverse models improved the score considerably.

  3. Use of MMDetection
    MMDetection made it easy to apply different model architectures and features, which streamlined the experiments.


Data Processing and EDA

1-1 Data Split

To evaluate models objectively during training, we split the data into training and validation sets. Because the class distribution of the training data is imbalanced, we used Stratified Group K-Fold so that each fold has a similar class distribution.
Pasted_image_20241113224501.png
We ran experiments and measured performance on Fold 1, then applied the resulting settings to all folds and produced the final submission as a 5-fold ensemble. Stratified Group K-Fold made evaluation more consistent and helped the models generalize.

This approach reflected the data distribution well and contributed to the final score.
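As a rough illustration of the idea behind the split, the sketch below assigns images to folds greedily so that every fold receives a similar share of each class while all boxes from one image stay in the same fold. It is a simplified stand-in, not scikit-learn's `StratifiedGroupKFold` and not the project's actual split script; `assign_folds` and the `(image_id, class_id)` input format are assumptions made for the example.

```python
from collections import Counter, defaultdict


def assign_folds(annotations, n_splits=5):
    """Greedily assign image ids to folds with balanced per-class box counts.

    annotations: list of (image_id, class_id) pairs, one per bounding box.
    Returns a dict mapping image_id -> fold index in [0, n_splits).
    """
    # Count boxes per class for every image (the "group").
    per_image = defaultdict(Counter)
    for image_id, class_id in annotations:
        per_image[image_id][class_id] += 1

    classes = sorted({c for _, c in annotations})
    total = Counter(c for _, c in annotations)
    fold_counts = [Counter() for _ in range(n_splits)]
    folds = {}

    # Place images with the most boxes first; they are the hardest to balance.
    order = sorted(per_image, key=lambda i: (-sum(per_image[i].values()), i))
    for image_id in order:
        best_fold, best_cost = 0, None
        for f in range(n_splits):
            # Cost: squared deviation of every fold's class share from the
            # ideal 1/n_splits if this image were added to fold f.
            cost = 0.0
            for c in classes:
                for g in range(n_splits):
                    extra = per_image[image_id][c] if g == f else 0
                    share = (fold_counts[g][c] + extra) / total[c]
                    cost += (share - 1.0 / n_splits) ** 2
            if best_cost is None or cost < best_cost:
                best_fold, best_cost = f, cost
        folds[image_id] = best_fold
        fold_counts[best_fold].update(per_image[image_id])
    return folds
```

Calling `assign_folds(pairs, n_splits=5)` and training on four folds while validating on the fifth gives one of the five models for a 5-fold ensemble.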

1-2 Anchor box

Based on this analysis, we set anchor box ratios suited to the dataset, which improved detection performance and reduced the number of unnecessary anchors. Adjusting the anchors to better match the size and shape of the objects also made training more efficient.

This experience underlined how much anchor settings matter in object detection, and I plan to keep tuning anchors to the data distribution in future work.
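The analysis behind the anchor ratios is not shown in this README. One plausible way to do it, given here as an assumption rather than the method actually used, is to look at the distribution of height/width ratios of the ground-truth boxes and take a few quantiles as candidate anchor ratios. `suggest_anchor_ratios` and its input format are hypothetical.

```python
import math


def suggest_anchor_ratios(boxes, quantiles=(0.1, 0.5, 0.9)):
    """Return height/width ratios at the given quantiles of the box distribution.

    boxes: list of (width, height) pairs in pixels.
    """
    # Work in log space so that 1:2 and 2:1 boxes are treated symmetrically.
    logs = sorted(math.log(h / w) for w, h in boxes if w > 0 and h > 0)
    if not logs:
        raise ValueError("no valid boxes")
    ratios = []
    for q in quantiles:
        # Nearest-rank quantile on the sorted log-ratios.
        idx = min(len(logs) - 1, max(0, round(q * (len(logs) - 1))))
        ratios.append(round(math.exp(logs[idx]), 2))
    return ratios


# Example: one wide box, one square box, one tall box.
print(suggest_anchor_ratios([(100, 50), (100, 100), (50, 100)]))
```

The resulting values could then be tried as the `ratios` of an anchor generator and compared against the default on a validation fold.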

Visualizing Model Prediction Bounding Boxes and PR Curves

Pasted_image_20241113224744.png
To evaluate the model intuitively, we visualized predicted bounding boxes, ground-truth bounding boxes, and PR (precision-recall) curves on the validation set. This let us inspect performance visually and identify the model's weaknesses.

The visualizations showed that localization was weaker for **small objects (General trash)** and classification was weaker for **overlapping objects (Plastic)**.

This analysis pinpointed specific weaknesses and informed our strategy for later improvements. I plan to keep using visual analysis to evaluate and tune models from multiple angles.
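For reference, the quantities behind a PR curve can be computed as in the minimal sketch below, for one class on one image. The matching rule (each prediction, in descending score order, claims the unmatched ground-truth box with the highest IoU of at least 0.5) and the all-point interpolation used for AP are simplifying assumptions; evaluation tools differ in the details. mAP50 is the mean of this per-class AP at an IoU threshold of 0.5.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pr_curve(preds, gts, iou_thr=0.5):
    """Return (recall, precision) points, one per prediction.

    preds: list of (score, box); gts: list of ground-truth boxes.
    """
    if not gts:
        return []
    matched = set()
    tp = fp = 0
    points = []
    for _, box in sorted(preds, key=lambda p: -p[0]):
        # Find the best still-unmatched ground-truth box for this prediction.
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
        points.append((tp / len(gts), tp / (tp + fp)))
    return points


def average_precision(points):
    """Area under the PR curve using the precision envelope."""
    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: best precision at this recall or higher.
        p_interp = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * p_interp
        prev_recall = recall
    return ap
```

Plotting the `(recall, precision)` points per class makes it easy to see which classes lose precision early, which is the kind of weakness described above.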


MMDetection

Baseline model exploration

In the initial baseline experiments, we compared several object detection models using the MMDetection library. All runs used the Fold 1 split to establish each model's initial performance.

The table below compares Faster R-CNN, Cascade R-CNN, ATSS, UniverseNet, RetinaNet, and VFNet. We used these results to analyze the differences between models and as a basis for model selection.

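| Model | Backbone | Neck | Optimizer | lr | Epoch | Test mAP50 |
| --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN | ResNet50 | FPN | SGD | 0.02 | 12 | 0.3734 |
| Cascade R-CNN | SwinL | FPN | SGD | 0.02 | 20 | 0.5161 |
| ATSS | SwinL | FPN | SGD | 0.02 | 20 | 0.5015 |
| UniverseNet | SwinL | FPN | AdamW | 0.0001 | 20 | 0.5545 |
| RetinaNet | SwinL | FPN | AdamW | 0.0001 | 20 | 0.5438 |
| VFNet | SwinL | FPN | AdamW | 0.0001 | 20 | 0.5623 |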

Backbone exploration

We compared ResNet50, Swin Transformer Small, and Swin Transformer Large as the backbone of Cascade R-CNN.

Switching the backbone to a Swin Transformer gave a meaningful improvement. The Swin Transformer backbones were fine-tuned from weights pre-trained on ImageNet-22K with 384x384 images.

The results for each backbone are shown below:

| Model | Backbone | Test mAP50 |
| --- | --- | --- |
| Cascade R-CNN | ResNet50 | 0.3613 |
| Cascade R-CNN | SwinS | 0.4628 |
| Cascade R-CNN | SwinL | 0.5161 |

Results

Pasted_image_20241113225944.png
To improve detection performance, we tried different model architectures and backbone/neck combinations, and varied hyperparameters and data augmentation while tracking the change in performance. We summarized the effect of each combination on performance.

This experiment showed how much the search for the right combination matters, and confirmed that choosing a backbone and augmentations that fit the data is an effective way to improve performance.


YOLO

**YOLO (You Only Look Once)** is a widely used 1-stage object detector that finds the objects in an image in a single forward pass. Where traditional detectors work in multiple stages, YOLO handles them at once. This greatly improves speed and efficiency, making it well suited to real-time detection.

Baseline model exploration

| Model | Image Size | FLOPs | Validation mAP40 |
| --- | --- | --- | --- |
| YOLOv5x6 | 1280 | 209.8 | 55.0 |
| YOLOv11x | 640 | 194.9 | 54.7 |

When choosing a YOLO model, we considered YOLOv5 and YOLOv11. YOLOv11x is the newer model, and it is fast with relatively low FLOPs. However, this competition prioritized performance, so we focused on accuracy rather than speed or model size.

We therefore chose YOLOv5x6. YOLOv5x6 has about 2.5 times as many parameters as YOLOv11x, so it is slower, but it was more accurate. We also judged YOLOv5x6 to be the better fit with the image size set to 1024x1024.

In short, the choice put accuracy ahead of speed.

Hyperparameter tuning

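| Model | Image size | lr | momentum | decay | Step size | anchor box | test mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv5x6 | 1280 | 0.1 | 0.937 | 0.005 | 3 | original | 0.4770 |
| YOLOv5x6 | 1280 | 0.01 | 0.937 | 0.0005 | 3 | original | 0.5015 |
| YOLOv5x6 | 1280 | 0.01 | 0.937 | 0.0005 | 3 | anchor box tuning | 0.5303 |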

Co-DINO

Baseline model exploration

| Model | Backbone | Pre-training Dataset | Fine-Tuning Dataset (Dataset split) | validation mAP50 | test mAP50 |
| --- | --- | --- | --- | --- | --- |
| Co-Deformable-DETR | R50 | COCO | Train set | - | 0.3341 |
| Co-DINO | Swin-T | COCO | Train-Validation split | 0.4220 | - |
| Co-DINO | Swin-L | COCO | Train-Validation split | 0.7170 | 0.7071 |
| Co-DINO | Swin-L | COCO | Train set | - | 0.7190 |
| Co-DINO | Swin-L | COCO | 5-fold CV | - | 0.7283 |

Of all the models we tried, Co-DINO performed best. Co-DINO uses contrastive denoising to obtain effective anchor boxes, which substantially improved detection performance.

In the backbone comparison, Swin-L again gave the highest performance among the Transformer-based backbones, consistent with the earlier experiments. We started from weights pre-trained on COCO and compared performance while changing how the training data was split.

In particular, splitting the dataset with K-fold cross-validation, training one model per fold, and ensembling them improved the score (test mAP50 of 0.7283 with 5-fold CV versus 0.7190 when training on the full train set). Based on this, we adopted the approach of splitting the dataset into folds and ensembling. In the end, we selected Co-DINO as our baseline model and obtained our best results with it.

Hyperparameter tuning

Because of the competition's time limits, the roughly 36 hours that Co-DINO Swin-L needed for 12 epochs of training was a problem. We therefore took the hyperparameters recommended in the original paper as our starting point.

The settings used for training were as follows:

| Model | Backbone | Fine-Tuning Dataset | Input image size | validation mAP50 | test mAP50 |
| --- | --- | --- | --- | --- | --- |
| Co-DINO | Swin-T | Train-validation split | (1024, 1024) | 0.4160 | - |
| Co-DINO | Swin-T | Train-validation split | (1280, 1280) | 0.4220 | - |
| Co-DINO | Swin-T | Train-validation split | (1536, 1536) | 0.0720 | - |
| Co-DINO | Swin-L | Train set | (512, 512) | - | 0.6686 |
| Co-DINO | Swin-L | Train set | (1280, 1280) | - | 0.7190 |

Earlier experiments had shown that larger input images, with their higher resolution, let the model detect more objects. Applying this to Co-DINO across several input sizes, we found that shrinking images below their original size loses information and lowers performance. However, increasing the input size beyond 1280x1280 actually reduced performance.

We suspect that above a certain size, the backbone's window size and the overall architecture prevent the model from extracting features properly. This appears to reflect limits on the feature map size the model can handle and on computational efficiency, and it confirmed that choosing an appropriate image size has a significant effect on performance.


Final Performance

| Model | Backbone | Pretraining Dataset | Fine-Tuning Dataset | Input image size | validation mAP50 | test mAP50 |
| --- | --- | --- | --- | --- | --- | --- |
| Co-Deformable-DETR | R50 | COCO | Train set | (400~800) (multi-size) | - | 0.3341 |
| Co-DINO | Swin-T | COCO | Train-validation split | (1024, 1024) | 0.4160 | - |
| Co-DINO | Swin-T | COCO | Train-validation split | (1280, 1280) | 0.4220 | - |
| Co-DINO | Swin-T | COCO | Train-validation split | (1536, 1536) | 0.0720 | - |
| Co-DINO | Swin-L | COCO | Train-validation split | (1280, 1280) | 0.7170 | 0.7071 |
| Co-DINO | Swin-L | COCO | Train set | (512, 512) | - | 0.6686 |
| Co-DINO | Swin-L | COCO | Train set | (1280, 1280) | - | 0.7190 |
| Co-DINO | Swin-L | COCO | 5-fold CV | (1280, 1280) | - | 0.7283 |

Ensemble

Four ensemble techniques are commonly used in object detection: NMS, soft-NMS, NMW, and WBF. We evaluated each technique on simple base models first, then built the final submission with the one that performed best.

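| Ensemble method | Models included (of ATSS, Cascade RCNN, UniverseNet, Co-DINO, YOLOv5x6) | test mAP50 |
| --- | --- | --- |
| WBF | o o | 0.7055 |
| NMW | o o | 0.7118 |
| NMW | o o o | 0.6948 |
| WBF | o o | 0.7198 |
| NMS | o o | 0.7217 |
| NMW | o o o | 0.7327 |
| NMW | o o o o (ATSS, Cascade RCNN, Co-DINO, YOLOv5x6) | 0.7553 |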

The best result came from ensembling ATSS, Cascade RCNN, Co-DINO, and YOLOv5x6 with NMW, which put us in 2nd place. Interestingly, YOLOv5x6 scored lower than the other models on its own, yet adding it to the ensemble improved the score substantially. YOLOv5x6 tends to predict bounding boxes only for objects it can identify confidently, so it complemented the other models in the ensemble. This result shows how ensembling can combine the strengths of individual models.
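To make the NMW (non-maximum weighted) idea concrete, here is a minimal single-class sketch. Like NMS, it walks through the pooled boxes in descending score order, but instead of discarding the boxes that overlap the top one, it averages their coordinates with weights of score times IoU with the top box. This is a simplified illustration written for this README, not the implementation used in the competition; the function name `nmw`, the 0.5 threshold, and the choice to keep the top box's score are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nmw(boxes, scores, iou_thr=0.5):
    """Simplified single-class non-maximum weighted fusion.

    boxes: list of (x1, y1, x2, y2) pooled from all models.
    scores: confidence for each box, in the same order.
    Returns a list of (fused_box, score) pairs.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    used = [False] * len(boxes)
    fused = []
    for i in order:
        if used[i]:
            continue
        # Cluster: the current top box plus every unused box overlapping it.
        cluster = [i] + [
            j for j in order
            if j != i and not used[j] and iou(boxes[i], boxes[j]) >= iou_thr
        ]
        for j in cluster:
            used[j] = True
        # Weight = score * IoU with the top box (the top box itself has IoU 1).
        weights = [scores[i]] + [
            scores[j] * iou(boxes[i], boxes[j]) for j in cluster[1:]
        ]
        total = sum(weights)
        if total > 0:
            box = tuple(
                sum(w * boxes[j][k] for w, j in zip(weights, cluster)) / total
                for k in range(4)
            )
        else:
            box = tuple(boxes[i])
        fused.append((box, scores[i]))
    return fused
```

In a multi-class setting the same function would be applied per class. A model that only emits confident boxes, as described for YOLOv5x6 above, mainly pulls the fused coordinates of boxes the other models already found.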


Personal Retrospective

Learning goals before the project

What did I do to achieve my learning goals, and how?

How did I improve the model?

What did I achieve as a result of my actions, and what did I learn?

What limitations did I face, and what do I regret?

Based on these limitations and lessons, what will I try in the next project?