BSc Thesis · University of Debrecen · 2026

JEPA-RobustViT
Baseline Results Dashboard

Author: Asfand Yar
Supervisor: Dr. Bogacsovics Gergő
External: Sergio Correa
Updated: June 2026

Supervised ViT: ✓ Complete
DINO: ✓ Complete
MAE: ✓ Complete
I-JEPA: ✓ Complete
TTA: ⏳ In Progress
01 Experimental Setup
Backbone: ViT-B/16
Embed Dim: 768
Patches: 196 (14×14)
Blocks: 12 transformer
Attn Heads: 12
Backbone weights: Frozen
Head: Linear(768, C)
Optimizer: Adam, lr=1e-3
Scheduler: CosineAnnealingLR
Epochs: 10
Batch Size: 256
Seeds: 0, 1, 2
I-JEPA Pretrain: 100 epochs · Komondor
GPU: A100-SXM4-40GB
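
The setup above describes a linear probe: a frozen backbone with a single trainable linear head. The following is a minimal PyTorch sketch of that configuration, not the thesis code. Several things are assumptions: C = 9 (the PathMNIST class count), `T_max` equal to the number of epochs, the class name `LinearProbe`, and the small placeholder module standing in for the pretrained ViT-B/16 encoder.

```python
import torch
from torch import nn

EMBED_DIM = 768    # ViT-B/16 embedding dimension (from the setup above)
NUM_CLASSES = 9    # assumption: C = 9, the number of PathMNIST classes
EPOCHS = 10


class LinearProbe(nn.Module):
    """Frozen feature extractor followed by a trainable linear head."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False              # backbone stays frozen
        self.head = nn.Linear(embed_dim, num_classes)

    def train(self, mode: bool = True):
        super().train(mode)
        self.backbone.eval()                     # keep frozen layers in eval mode
        return self

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                    # no gradients through the backbone
            feats = self.backbone(x)
        return self.head(feats)


# Placeholder backbone (hypothetical): maps an image batch to 768-d features.
# In practice this would be the pretrained ViT-B/16 encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 28 * 28, EMBED_DIM))
model = LinearProbe(backbone, EMBED_DIM, NUM_CLASSES)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
# Assumption: the scheduler is stepped once per epoch over all 10 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

num_trainable = sum(p.numel() for p in trainable)
print(num_trainable)  # 768 * 9 + 9 = 6921
```

With C = 9 the head has 768 × 9 weights plus 9 biases, which agrees with the 6,921 trainable parameters reported for the supervised probe.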
02 Linear Probe Results by Method
Supervised ViT
Mean Test Accuracy: 80.90% (± 0.17% across 3 seeds)
Mean Test ECE: 0.0138 (well calibrated on source)
Pretraining: ImageNet (supervised labels)
Trainable Params: 6,921 (head only)
Per-Seed Results — Supervised ViT-B/16
Seed | Best Val Acc | Test Accuracy | Test ECE
0 | 84.66% | 81.14% | 0.0128
1 | 84.51% | 80.78% | 0.0110
2 | 84.30% | 80.77% | 0.0175
Mean ± Std | 84.49 ± 0.15% | 80.90 ± 0.17% | 0.0138 ± 0.0033
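
The dashboard does not say how Test ECE is computed. A common definition is the expected calibration error over equal-width confidence bins: the sample-weighted average gap between accuracy and mean confidence within each bin. The sketch below illustrates that definition only; the bin count (15) and the function name are assumptions, not the thesis implementation.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width bins over (0, 1].

    confidences: max softmax probability per sample
    correct:     1 if the prediction was right, else 0
    n_bins:      assumption, the dashboard does not state a bin count
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap       # weight by fraction of samples in bin
    return float(ece)


# Four predictions at 95% confidence, three of them correct:
# accuracy 0.75 vs confidence 0.95, so the gap (and the ECE) is about 0.20.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
print(round(overconfident, 2))
```

An ECE near 0.01, as reported on source, means confidence and accuracy differ by about one percentage point on average.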
DINO
Mean Test Accuracy: 91.80% (± 0.33% across 3 seeds)
Mean Test ECE: 0.0111 (well calibrated on source)
vs Supervised: +10.90 pp (higher source accuracy)
Pretraining: DINO SSL (contrastive, no labels)
Per-Seed Results — DINO ViT-B/16
Seed | Best Val Acc | Test Accuracy | Test ECE
0 | 95.08% | 91.31% | 0.0151
1 | 94.75% | 92.01% | 0.0099
2 | 95.16% | 92.08% | 0.0084
Mean ± Std | 95.00 ± 0.17% | 91.80 ± 0.33% | 0.0111 ± 0.0028
MAE
Mean Test Accuracy: 83.40% (± 0.03% across 3 seeds)
Mean Test ECE: 0.0881 (higher than DINO/Supervised)
BloodMNIST Shift: 0.06% (essentially zero, collapse)
Pretraining: MAE SSL (pixel reconstruction)
Per-Seed Results — MAE ViT-B/16
Seed | Best Val Acc | Test Accuracy | Test ECE
0 | 76.76% | 83.44% | 0.0876
1 | 76.77% | 83.37% | 0.0884
2 | 76.95% | 83.38% | 0.0882
Mean ± Std | 76.83 ± 0.08% | 83.40 ± 0.03% | 0.0881 ± 0.0004
I-JEPA (ours)
Mean Test Accuracy: 88.25% (± 0.86% across 3 seeds)
Mean Test ECE: 0.0132 (second-lowest source ECE, after DINO)
Pretraining: I-JEPA SSL (100 epochs from scratch)
vs Supervised: +7.35 pp (higher source accuracy)

I-JEPA pretrained from scratch on PathMNIST. Unlike the DINO and MAE baselines, which use ImageNet-pretrained weights, this I-JEPA backbone was trained from scratch for 100 epochs on the Komondor supercomputer, predicting representations of masked regions in embedding space. It reaches 88.25% source accuracy, second only to DINO, with low calibration error (ECE 0.0132, also second only to DINO's 0.0111) and without MAE's calibration penalty (ECE 0.0881). Under domain shift it avoids the catastrophic collapse seen with MAE, although it does not surpass DINO on any target domain.

Per-Seed Results — I-JEPA ViT-B/16 (source: PathMNIST)
Seed | Pretrain Loss | Test Accuracy | Test ECE
0 | 0.0681 | 87.26% | 0.0172
1 | 0.0602 | 88.73% | 0.0093
2 | 0.0570 | 88.76% | 0.0130
Mean ± Std | 0.0618 | 88.25 ± 0.86% | 0.0132 ± 0.0040
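
With only three seeds, the "± Std" figures depend noticeably on whether the population (divide by n) or sample (divide by n − 1) standard deviation is used. The sketch below recomputes both from the per-seed test accuracies listed above. It is a check on the rounded table values only and says nothing about how the dashboard was actually generated.

```python
from statistics import mean, pstdev, stdev

# Per-seed test accuracies (%) copied from the tables above.
test_acc = {
    "Supervised": [81.14, 80.78, 80.77],
    "I-JEPA": [87.26, 88.73, 88.76],
}


def summarize(values):
    """Return (mean, population std, sample std), each rounded to 2 decimals."""
    return (
        round(mean(values), 2),
        round(pstdev(values), 2),   # divides by n
        round(stdev(values), 2),    # divides by n - 1
    )


summary = {name: summarize(vals) for name, vals in test_acc.items()}
for name, (m, pop, samp) in summary.items():
    print(f"{name}: mean={m:.2f}  population std={pop:.2f}  sample std={samp:.2f}")
```

From these rounded inputs, the reported ± 0.17% for Supervised matches the population estimate, while the reported ± 0.86% for I-JEPA matches the sample estimate. Using one convention for every row would make the spreads directly comparable.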
03 Full Baseline Comparison

Key finding: All four pretraining paradigms, namely supervised, contrastive SSL (DINO), reconstructive SSL (MAE), and predictive SSL (I-JEPA), fail to remain robust under cross-modal medical domain shift. Higher source accuracy does not translate to robustness: DINO leads on source (91.80%) yet retains only 7.0% of that accuracy on DermaMNIST. MAE collapses to 0.06% on BloodMNIST. I-JEPA avoids MAE's collapse and is well calibrated on source (ECE 0.0132, second only to DINO's 0.0111), but it does not beat the supervised or DINO baselines under shift. No pretraining objective alone solves the problem, which motivates test-time adaptation as the core contribution.

Method Comparison — All Domains (Mean across 3 seeds)
Method | Source (PathMNIST) | → DermaMNIST | → BloodMNIST | → RetinaMNIST | Src ECE
Supervised ViT | 80.90 ± 0.17% | 5.31 ± 0.15% | 17.78 ± 0.24% | 10.58 ± 0.29% | 0.0138
DINO ViT-B/16 | 91.80 ± 0.33% | 6.46 ± 0.37% | 18.45 ± 0.16% | 11.50 ± 0.00% | 0.0111
MAE ViT-B/16 | 83.40 ± 0.03% | 5.74 ± 0.00% | 0.06 ± 0.00% | 2.33 ± 0.14% | 0.0881
I-JEPA (ours) | 88.25 ± 0.86% | 6.20 ± 1.20% | 11.16 ± 4.76% | 10.00 ± 3.27% | 0.0132
I-JEPA + TTA (proposed) | Test-time adaptation evaluation in progress
→ DermaMNIST (skin lesions)
Supervised: 80.90% → 5.31% (6.6% retained)
DINO: 91.80% → 6.46% (7.0% retained)
MAE: 83.40% → 5.74% (6.9% retained)
I-JEPA: 88.25% → 6.20% (7.0% retained)
→ BloodMNIST (blood cells)
Supervised: 80.90% → 17.78% (22% retained)
DINO: 91.80% → 18.45% (20% retained)
MAE: 83.40% → 0.06% (0.1% retained)
I-JEPA: 88.25% → 11.16% (12.6% retained)
→ RetinaMNIST (retinal fundus)
Supervised: 80.90% → 10.58% (13% retained)
DINO: 91.80% → 11.50% (13% retained)
MAE: 83.40% → 2.33% (2.8% retained)
I-JEPA: 88.25% → 10.00% (11.3% retained)
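
The percentages in parentheses above are consistent with a simple retention ratio: target accuracy divided by source accuracy. A minimal sketch of that arithmetic, using the mean accuracies from the comparison table (the function name `retention` is illustrative):

```python
# Mean accuracies (%) from the method comparison table.
source_acc = {"Supervised": 80.90, "DINO": 91.80, "MAE": 83.40, "I-JEPA": 88.25}
blood_acc = {"Supervised": 17.78, "DINO": 18.45, "MAE": 0.06, "I-JEPA": 11.16}


def retention(source: float, target: float) -> float:
    """Share of source accuracy that survives on the target domain, in percent."""
    return 100.0 * target / source


blood_retention = {
    method: round(retention(source_acc[method], blood_acc[method]), 1)
    for method in source_acc
}
for method, pct in blood_retention.items():
    print(f"{method}: {pct}% of source accuracy retained on BloodMNIST")
```

Retention normalizes for the different source accuracies, so it shows how much each method loses under shift rather than where it started.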
04 Key Observations
All four paradigms fail under shift
Supervised, contrastive SSL, reconstructive SSL, and predictive SSL all lose robustness under cross-modal medical domain shift regardless of source accuracy. No pretraining paradigm solves this alone.
I-JEPA avoids catastrophic collapse
I-JEPA never collapses catastrophically. It avoids MAE's 0.06% BloodMNIST failure and stays within about 1.5 points of supervised and DINO on DermaMNIST and RetinaMNIST. On source its ECE (0.0132) is lower than MAE's and the supervised baseline's, second only to DINO (0.0111).
Better source accuracy ≠ robustness
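DINO beats the supervised baseline on source by 10.9 points yet retains only 7.0% of its source accuracy on DermaMNIST. I-JEPA reaches 88.25% on source but still drops to between 6.20% and 11.16% on the target domains. Across these four methods, source performance does not predict robustness under shift.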
I-JEPA shift variance is high
BloodMNIST accuracy ranges from 5.79% to 14.85% across seeds (± 4.76%). Predictive SSL features transfer inconsistently under severe modality shift, which is itself an informative result.
Calibration collapses under shift for all methods
Source ECE is low (0.011–0.088), but all methods become severely overconfident under shift, with ECE above 0.5. Calibration failure is as dangerous as accuracy failure in clinical deployment.
Motivation for test-time adaptation
Since no pretraining objective produces robustness on its own, lightweight entropy-based LayerNorm adaptation at inference becomes the central contribution, with the goal of recovering accuracy without target labels.
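
The dashboard does not specify the adaptation procedure beyond "entropy-based LayerNorm adaptation". One plausible reading is entropy minimization in the style of Tent, applied to LayerNorm affine parameters instead of BatchNorm. The PyTorch sketch below illustrates that idea only: the optimizer, learning rate, single-step update, helper names, and the toy model standing in for the ViT are all assumptions.

```python
import torch
from torch import nn


def configure_layernorm_adaptation(model: nn.Module):
    """Freeze all weights except LayerNorm scale/shift; return the adaptable ones."""
    for p in model.parameters():
        p.requires_grad = False
    params = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm) and module.elementwise_affine:
            for p in module.parameters():        # weight (scale) and bias (shift)
                p.requires_grad = True
                params.append(p)
    return params


def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average Shannon entropy of the softmax predictions in a batch."""
    log_probs = torch.log_softmax(logits, dim=1)
    return -(log_probs.exp() * log_probs).sum(dim=1).mean()


def adapt_on_batch(model: nn.Module, optimizer, x: torch.Tensor):
    """One unsupervised update on an unlabeled target batch."""
    logits = model(x)
    loss = mean_entropy(logits)                  # no target labels involved
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach(), loss.item()


# Toy stand-in for the ViT plus linear head (hypothetical, for illustration only).
torch.manual_seed(0)
toy_model = nn.Sequential(
    nn.Linear(16, 32), nn.LayerNorm(32), nn.ReLU(), nn.Linear(32, 9)
)
ln_params = configure_layernorm_adaptation(toy_model)
tta_optimizer = torch.optim.Adam(ln_params, lr=1e-3)   # assumed optimizer and lr
target_batch = torch.randn(64, 16)                     # unlabeled target data
logits, entropy = adapt_on_batch(toy_model, tta_optimizer, target_batch)
```

Restricting updates to LayerNorm scale and shift keeps the number of adapted parameters small, which is what makes this kind of adaptation lightweight at inference time.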
05 Planned Comparisons
Method | Type | Status
Supervised ViT-B/16 | Supervised pretraining | ✓ Complete
DINO ViT-B/16 | Contrastive SSL | ✓ Complete
MAE ViT-B/16 | Reconstructive SSL | ✓ Complete
I-JEPA ViT-B/16 (ours) | Predictive SSL | ✓ Complete
Supervised + TTA | Supervised + LayerNorm adaptation | ⏳ In Progress
DINO + TTA | Contrastive SSL + adaptation | ⬜ Planned
MAE + TTA | Reconstructive SSL + adaptation | ⬜ Planned
I-JEPA + TTA (proposed) | Predictive SSL + adaptation | ⏳ In Progress