[2024-03-08 14:13:22] INFO - super_gradients.training.utils.sg_trainer_utils - TRAINING PARAMETERS:
- Mode: Single GPU
- Number of GPUs: 1 (1 available on the machine)
- Full dataset size: 2500 (len(train_set))
- Batch size per GPU: 64 (batch_size)
- Batch Accumulate: 1 (batch_accumulate)
- Total batch size: 64 (num_gpus * batch_size)
- Effective Batch size: 64 (num_gpus * batch_size * batch_accumulate)
- Iterations per epoch: 39 (len(train_loader))
- Gradient updates per epoch: 39 (len(train_loader) / batch_accumulate)
- Model: YoloNAS_M (51.13M parameters, 51.13M optimized)
- Learning Rates and Weight Decays:
- default: (51.13M parameters). LR: 0.0005 (51.13M parameters) WD: 0.0, (72.21K parameters), WD: 0.0001, (51.06M parameters)
[2024-03-08 14:13:22] INFO - super_gradients.training.sg_trainer.sg_trainer - Started training for 100 epochs (0/99)
[2024-03-08 14:15:41] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:15:41] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.004820541944354773
[2024-03-08 14:18:00] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:18:00] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.22781015932559967
[2024-03-08 14:20:19] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:20:19] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.4800339639186859
[2024-03-08 14:22:39] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:22:39] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.5306538343429565
[2024-03-08 14:24:59] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:24:59] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.5745893120765686
[2024-03-08 14:27:21] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:27:21] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.6019154191017151
[2024-03-08 14:29:39] INFO - super_gradients.common.sg_loggers.base_sg_logger - Checkpoint saved in runs/train3/RUN_20240308_141307_360962/ckpt_best.pth
[2024-03-08 14:29:39] INFO - super_gradients.training.sg_trainer.sg_trainer - Best checkpoint overriden: validation mAP@0.50:0.95: 0.6488515138626099
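For reference, below is a minimal sketch of a Trainer setup that would produce parameters like the ones logged above (YoloNAS_M, batch size 64, initial LR 0.0005, 100 epochs, mAP@0.50:0.95 as the watched metric). This follows the standard super_gradients fine-tuning recipe, not the exact script for this run; CLASSES and the dataset paths are hypothetical placeholders.

from super_gradients.training import Trainer, models
from super_gradients.training.losses import PPYoloELoss
from super_gradients.training.metrics import DetectionMetrics_050_095
from super_gradients.training.models.detection_models.pp_yolo_e import PPYoloEPostPredictionCallback
from super_gradients.training.dataloaders.dataloaders import (
    coco_detection_yolo_format_train,
    coco_detection_yolo_format_val,
)

CLASSES = ["cat", "dog"]  # hypothetical class list for the custom dataset

trainer = Trainer(experiment_name="train3", ckpt_root_dir="runs")
model = models.get("yolo_nas_m", num_classes=len(CLASSES), pretrained_weights="coco")

# YOLO-format directory layout is an assumption; adjust to the actual dataset.
train_loader = coco_detection_yolo_format_train(
    dataset_params={
        "data_dir": "datasets/my_data",
        "images_dir": "images/train",
        "labels_dir": "labels/train",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 64, "num_workers": 2},  # 2500 / 64 -> 39 iters/epoch
)
valid_loader = coco_detection_yolo_format_val(
    dataset_params={
        "data_dir": "datasets/my_data",
        "images_dir": "images/val",
        "labels_dir": "labels/val",
        "classes": CLASSES,
    },
    dataloader_params={"batch_size": 64, "num_workers": 2},
)

trainer.train(
    model=model,
    training_params={
        "max_epochs": 100,
        "initial_lr": 5e-4,  # LR: 0.0005 in the log
        "lr_mode": "cosine",
        "optimizer": "Adam",
        "optimizer_params": {"weight_decay": 1e-4},
        # explains the WD 0.0 (bias/bn params) vs WD 0.0001 split in the log
        "zero_weight_decay_on_bias_and_bn": True,
        "loss": PPYoloELoss(use_static_assigner=False, num_classes=len(CLASSES), reg_max=16),
        "valid_metrics_list": [
            DetectionMetrics_050_095(
                score_thres=0.1,
                top_k_predictions=300,
                num_cls=len(CLASSES),
                normalize_targets=True,
                post_prediction_callback=PPYoloEPostPredictionCallback(
                    score_threshold=0.01, nms_top_k=1000, max_predictions=300, nms_threshold=0.7
                ),
            )
        ],
        "metric_to_watch": "mAP@0.50:0.95",  # matches the checkpoint logs above
    },
    train_loader=train_loader,
    valid_loader=valid_loader,
)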
When training YOLO-NAS, the basic logging files are created under the path passed as ckpt_root_dir.
While browsing through them I noticed TensorBoard log files as well, so I went looking for the relevant pages in the docs.
https://docs.deci.ai/super-gradients/latest/documentation/source/experiment_monitoring.html
https://docs.deci.ai/super-gradients/latest/documentation/source/logs.html#i-tensorboard-logging
https://docs.deci.ai/super-gradients/latest/docstring/common/sg_loggers.html
# Log file path
<ckpt_root_dir>/<experiment_name>/<run_dir>/events.out.tfevents.<unique_id>
# Launch TensorBoard
tensorboard --logdir <ckpt_root_dir>/<experiment_name>/<run_dir>
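Beyond the TensorBoard UI, the event file can also be read programmatically. A rough sketch using TensorBoard's EventAccumulator; the run directory below is the one from this training run, and the scalar tag name is an assumption (check the printed tag list for the exact key super_gradients used).

from tensorboard.backend.event_processing import event_accumulator

# Point at a specific run directory containing the events.out.tfevents file
ea = event_accumulator.EventAccumulator("runs/train3/RUN_20240308_141307_360962")
ea.Reload()  # load the events from disk

print(ea.Tags()["scalars"])  # list all logged scalar tags

# Tag name is hypothetical; substitute one from the printed list
for event in ea.Scalars("Valid_mAP@0.50:0.95"):
    print(event.step, event.value)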
The loss is still high, and precision and recall come out at abnormal values.
Is the training data wrong, or did I make a mistake in the trainer settings? What is the actual problem?
At this point I suspect the dataset rather than the trainer parameters, so that is the part I should inspect and shore up first.
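As a first pass at ruling the dataset in or out, a quick sanity check over one batch from train_loader. This is a rough sketch: the exact target layout depends on the configured transforms, and here it is assumed to be (N, 6) rows of [batch_idx, class_id, cx, cy, w, h] with normalized coordinates.

import torch

NUM_CLASSES = len(CLASSES)  # assumes CLASSES from the training sketch above

for images, targets in train_loader:
    # NaN/Inf anywhere in the labels is an immediate red flag
    assert torch.isfinite(targets).all(), "non-finite values in targets"

    class_ids = targets[:, 1]
    boxes = targets[:, 2:6]

    # class ids must fall inside [0, NUM_CLASSES)
    assert class_ids.min() >= 0 and class_ids.max() < NUM_CLASSES, "class id out of range"

    # if coordinates are normalized, boxes should stay in [0, 1]; pixel-scale
    # values here would mean the label format doesn't match what the loss expects
    if boxes.min() < 0 or boxes.max() > 1:
        print("suspicious box values:", boxes.min().item(), boxes.max().item())
    break  # one batch is enough for a first look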