JEPA-RobustViT — Baseline Results

01 Experimental Setup

Backbone

ViT-B/16

Embed Dim

768

Patches

196 (14×14)

Blocks

12 transformer

Attn Heads

12

Backbone Weights

Frozen

Head

Linear(768, C)

Optimizer

Adam, lr=1e-3

Scheduler

CosineAnnealingLR

Epochs

10

Batch Size

256

Seeds

0, 1, 2

I-JEPA Pretrain

100 epochs · Komondor

GPU

A100-SXM4-40GB
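
As a sanity check on this setup, the sketch below computes two quantities it implies: the trainable parameter count of the Linear(768, C) head and the learning rate at each epoch under cosine annealing. The class count C = 9 (PathMNIST's nine classes), a schedule length `t_max` equal to the 10 training epochs, and a minimum learning rate of 0 are assumptions not stated in the setup; the function names are hypothetical.

```python
import math


def linear_head_params(in_features: int, num_classes: int) -> int:
    """Weights plus biases of a single fully connected layer."""
    return in_features * num_classes + num_classes


def cosine_annealing_lr(base_lr: float, epoch: int, t_max: int,
                        eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


EMBED_DIM = 768
NUM_CLASSES = 9   # assumption: PathMNIST has nine classes
EPOCHS = 10       # assumption: the schedule spans all 10 epochs
BASE_LR = 1e-3

print(linear_head_params(EMBED_DIM, NUM_CLASSES))  # 6921

schedule = [cosine_annealing_lr(BASE_LR, e, EPOCHS) for e in range(EPOCHS)]
print([f"{lr:.2e}" for lr in schedule])
```

With nine classes the head has 768 × 9 weights plus 9 biases, which agrees with the 6,921 trainable parameters reported in the results.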

02 Linear Probe Results by Method

Supervised ViT

DINO

MAE

I-JEPA (ours)

Mean Test Accuracy

80.90%

± 0.17% across 3 seeds

Mean Test ECE

0.0138

Well calibrated on source

Pretraining

ImageNet

Supervised labels

Trainable Params

6,921

Head only

Per-Seed Results — Supervised ViT-B/16

Seed	Best Val Acc	Test Accuracy	Test ECE
0	84.66%	81.14%	0.0128
1	84.51%	80.78%	0.0110
2	84.30%	80.77%	0.0175
Mean ± Std	84.49 ± 0.15%	80.90 ± 0.17%	0.0138 ± 0.0033

Mean Test Accuracy

91.80%

± 0.33% across 3 seeds

Mean Test ECE

0.0111

Well calibrated on source

vs Supervised

+10.90 pp

Higher source accuracy

Pretraining

DINO SSL

Contrastive, no labels

Per-Seed Results — DINO ViT-B/16

Seed	Best Val Acc	Test Accuracy	Test ECE
0	95.08%	91.31%	0.0151
1	94.75%	92.01%	0.0099
2	95.16%	92.08%	0.0084
Mean ± Std	95.00 ± 0.17%	91.80 ± 0.33%	0.0111 ± 0.0028

Mean Test Accuracy

83.40%

± 0.03% across 3 seeds

Mean Test ECE

0.0881

Higher than DINO/Supervised

BloodMNIST Accuracy Under Shift

0.06%

Essentially zero (collapse)

Pretraining

MAE SSL

Pixel reconstruction

Per-Seed Results — MAE ViT-B/16

Seed	Best Val Acc	Test Accuracy	Test ECE
0	76.76%	83.44%	0.0876
1	76.77%	83.37%	0.0884
2	76.95%	83.38%	0.0882
Mean ± Std	76.83 ± 0.08%	83.40 ± 0.03%	0.0881 ± 0.0004

Mean Test Accuracy

88.25%

± 0.86% across 3 seeds

Mean Test ECE

0.0132

Second-lowest SSL ECE, after DINO (0.0111)

Pretraining

I-JEPA SSL

100 epochs from scratch

vs Supervised

+7.35 pp

Higher source accuracy

I-JEPA pretrained from scratch on PathMNIST. Unlike the DINO and MAE baselines, which use ImageNet-pretrained weights, this I-JEPA backbone was trained from scratch for 100 epochs on the Komondor supercomputer, predicting representations of masked regions in embedding space rather than pixels. It reaches 88.25% source accuracy with low calibration error (ECE 0.0132). On both metrics it is second only to DINO (91.80%, ECE 0.0111), and it avoids MAE's calibration penalty (ECE 0.0881). Under domain shift it avoids the catastrophic collapse that MAE shows on BloodMNIST (0.06%), although its accuracy still falls to roughly 6% to 11% on the target domains.

Per-Seed Results — I-JEPA ViT-B/16 (source: PathMNIST)

Seed	Pretrain Loss	Test Accuracy	Test ECE
0	0.0681	87.26%	0.0172
1	0.0602	88.73%	0.0093
2	0.0570	88.76%	0.0130
Mean ± Std	0.0618	88.25 ± 0.86%	0.0132 ± 0.0040
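
The summary row can be reproduced from the per-seed values. The sketch below does so for the I-JEPA test accuracy using the sample standard deviation (n − 1 denominator), which is the convention that matches the ± 0.86% reported in this table; whether the other tables use the same convention is not stated. The helper name `mean_std` is hypothetical.

```python
from statistics import mean, stdev

# Per-seed I-JEPA test accuracies (%) copied from the table above.
test_acc = [87.26, 88.73, 88.76]


def mean_std(values):
    """Return (mean, sample standard deviation with an n - 1 denominator)."""
    return mean(values), stdev(values)


m, s = mean_std(test_acc)
print(f"{m:.2f} ± {s:.2f}%")  # 88.25 ± 0.86%
```

Swapping `stdev` for `statistics.pstdev` gives the population standard deviation, which is noticeably smaller with only three seeds.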

03 Full Baseline Comparison

Key finding: Four pretraining paradigms, namely supervised, contrastive SSL (DINO), reconstructive SSL (MAE), and predictive SSL (I-JEPA), all fail to remain robust under cross-modal medical domain shift. Higher source accuracy does not translate to robustness: DINO leads on source (91.80%) yet retains only 7.0% of that accuracy on DermaMNIST and at most 20% on any target. MAE collapses to 0.06% on BloodMNIST. I-JEPA avoids that collapse and is well calibrated on source (ECE 0.0132, second to DINO's 0.0111), but it does not beat the baselines under shift. No pretraining objective alone solves the problem, which motivates test-time adaptation as the core contribution.

Method Comparison — All Domains (Mean across 3 seeds)

Method	Source	→ DermaMNIST	→ BloodMNIST	→ RetinaMNIST	Src ECE
Supervised ViT	80.90 ± 0.17%	5.31 ± 0.15%	17.78 ± 0.24%	10.58 ± 0.29%	0.0138
DINO ViT-B/16	91.80 ± 0.33%	6.46 ± 0.37%	18.45 ± 0.16%	11.50 ± 0.00%	0.0111
MAE ViT-B/16	83.40 ± 0.03%	5.74 ± 0.00%	0.06 ± 0.00%	2.33 ± 0.14%	0.0881
I-JEPA (ours)	88.25 ± 0.86%	6.20 ± 1.20%	11.16 ± 4.76%	10.00 ± 3.27%	0.0132
I-JEPA + TTA (proposed)	Test-time adaptation evaluation in progress				—

→ DermaMNIST (skin lesions)

Supervised: 80.90% → 5.31% (6.6% retained)

DINO: 91.80% → 6.46% (7.0% retained)

MAE: 83.40% → 5.74% (6.9% retained)

I-JEPA: 88.25% → 6.20% (7.0% retained)

→ BloodMNIST (blood cells)

Supervised: 80.90% → 17.78% (22% retained)

DINO: 91.80% → 18.45% (20% retained)

MAE: 83.40% → 0.06% (0.1% retained)

I-JEPA: 88.25% → 11.16% (12.6% retained)

→ RetinaMNIST (retinal fundus)

Supervised: 80.90% → 10.58% (13% retained)

DINO: 91.80% → 11.50% (13% retained)

MAE: 83.40% → 2.33% (2.8% retained)

I-JEPA: 88.25% → 10.00% (11.3% retained)
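
The percentages in parentheses in the three per-target lists above are consistent with target accuracy divided by source accuracy. The sketch below recomputes them from the mean accuracies in the comparison table; the name `retention` and the dictionary layout are illustrative choices, not part of the original evaluation code.

```python
# Mean accuracies (%) copied from the method comparison table.
source = {"Supervised": 80.90, "DINO": 91.80, "MAE": 83.40, "I-JEPA": 88.25}
target = {
    "DermaMNIST": {"Supervised": 5.31, "DINO": 6.46, "MAE": 5.74, "I-JEPA": 6.20},
    "BloodMNIST": {"Supervised": 17.78, "DINO": 18.45, "MAE": 0.06, "I-JEPA": 11.16},
    "RetinaMNIST": {"Supervised": 10.58, "DINO": 11.50, "MAE": 2.33, "I-JEPA": 10.00},
}


def retention(src_acc: float, tgt_acc: float) -> float:
    """Share of source accuracy kept on the target domain, in percent."""
    return 100.0 * tgt_acc / src_acc


for domain, accs in target.items():
    for method, acc in accs.items():
        print(f"{method:>10} -> {domain}: {retention(source[method], acc):.1f}%")
```

Retention normalizes away differences in source accuracy, which is why DINO and I-JEPA both show 7.0% on DermaMNIST despite different absolute numbers.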

04 Key Observations

All four paradigms fail under shift

Supervised, contrastive SSL, reconstructive SSL, and predictive SSL all lose robustness under cross-modal medical domain shift regardless of source accuracy. No pretraining paradigm solves this alone.

I-JEPA avoids catastrophic collapse

I-JEPA never collapses catastrophically. It avoids MAE's 0.06% BloodMNIST failure and roughly matches supervised and DINO on Derma and Retina, while being well calibrated on source (ECE 0.0132, second among SSL methods to DINO's 0.0111).

Better source accuracy ≠ robustness

DINO leads the supervised baseline on source by over 10 points yet retains only 7.0% of its source accuracy on DermaMNIST. I-JEPA reaches 88.25% on source but still drops to between 6.20% and 11.16% on the targets. Source performance does not predict shift robustness for any method.

I-JEPA shift variance is high

I-JEPA's BloodMNIST accuracy ranges from 5.79% to 14.85% across seeds (± 4.76%). Predictive SSL features transfer inconsistently under severe modality shift, which is itself an informative result.

ECE collapses under shift for all methods

Source ECE is low (0.011–0.088) but all methods become severely overconfident under shift, with ECE above 0.5. Calibration failure is as dangerous as accuracy failure in clinical deployment.
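
For reference, a minimal sketch of how expected calibration error is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The bin count of 15 is an assumption (the binning used for the numbers above is not stated), and the inputs are toy values, not data from these experiments.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-bin ECE: sum over bins of bin weight * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Toy illustration: 75% confidence with 3 of 4 correct is perfectly calibrated;
# 95% confidence with 1 of 4 correct is heavily overconfident.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
print(calibrated, overconfident)
```

The second case shows the failure pattern described above: confidence stays high while accuracy falls, so the gap between them dominates the score.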

Motivation for test-time adaptation

Since no pretraining objective produces robustness on its own, lightweight entropy-based LayerNorm adaptation at inference becomes the central contribution, with the goal of recovering accuracy without target labels.
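
One way such adaptation could look, in the spirit of entropy-minimization methods such as Tent, is sketched below in PyTorch: all weights are frozen except the LayerNorm scale and shift, and each unlabeled target batch drives one gradient step that lowers prediction entropy. This is an illustration under stated assumptions, not the project's implementation: the helper names, the SGD optimizer, the learning rate, and the toy stand-in model are all hypothetical.

```python
import torch
import torch.nn as nn


def collect_layernorm_params(model: nn.Module):
    """Freeze everything, then re-enable only LayerNorm scale and shift."""
    for p in model.parameters():
        p.requires_grad_(False)
    params = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad_(True)
                params.append(p)
    return params


def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean Shannon entropy of the softmax predictions over a batch."""
    log_probs = logits.log_softmax(dim=1)
    return -(log_probs.exp() * log_probs).sum(dim=1).mean()


def adapt_step(model: nn.Module, batch: torch.Tensor, optimizer):
    """One unsupervised update on an unlabeled target batch."""
    logits = model(batch)
    loss = mean_entropy(logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach(), loss.item()


# Toy stand-in for backbone + head; a real run would wrap the ViT-B/16 model.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 768), nn.LayerNorm(768), nn.Linear(768, 9))
optimizer = torch.optim.SGD(collect_layernorm_params(model), lr=1e-3)  # assumed settings
target_batch = torch.randn(16, 32)  # placeholder for unlabeled target images

_, loss = adapt_step(model, target_batch, optimizer)
print(f"entropy before update: {loss:.4f}")
```

Restricting updates to LayerNorm parameters keeps the number of adapted weights small, which is what makes the approach lightweight at inference time.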

05 Planned Comparisons

Method	Type	Status
Supervised ViT-B/16	Supervised pretraining	✓ Complete
DINO ViT-B/16	Contrastive SSL	✓ Complete
MAE ViT-B/16	Reconstructive SSL	✓ Complete
I-JEPA ViT-B/16 (ours)	Predictive SSL	✓ Complete
Supervised + TTA	Supervised + LayerNorm adaptation	⏳ In Progress
DINO + TTA	Contrastive SSL + adaptation	⬜ Planned
MAE + TTA	Reconstructive SSL + adaptation	⬜ Planned
I-JEPA + TTA (proposed)	Predictive SSL + adaptation	⏳ In Progress
