PCA of image features for 30 classes of the CUB Bird dataset. Distilling a large pretrained teacher (top left) to train a small task-specific student model (top right) yields better clustering of the representations than simply finetuning the student on the task (bottom right). Distillation can be further improved by a Mixup-inspired, class-agnostic data augmentation based on Stable Diffusion (grey features in the teacher plot).
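
As a rough illustration of how such a plot can be produced, the sketch below projects penultimate-layer features of a pretrained backbone onto their first two principal components and colors the points by class. The backbone (torchvision's ResNet-50), the image folder path, and the restriction to the first 30 class indices are assumptions for illustration; the paper's actual teacher, student, and feature layers are not specified here.

```python
# Minimal sketch: PCA visualization of image features, colored by class.
# Assumptions (not from the paper): a torchvision ResNet-50 backbone as the
# feature extractor, and CUB images laid out as <root>/<class_name>/<img>.jpg.
import torch
import torchvision.models as models
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader, Subset
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained backbone with the classification head replaced by an identity,
# so the forward pass returns penultimate-layer features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval().to(device)

transform = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical dataset path; keep only the first 30 classes, as in the figure.
dataset = ImageFolder("data/CUB_200_2011/images", transform=transform)
keep = set(range(30))
indices = [i for i, (_, y) in enumerate(dataset.samples) if y in keep]
loader = DataLoader(Subset(dataset, indices), batch_size=64, num_workers=4)

# Extract features for every image in the 30-class subset.
feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu())
        labels.append(y)
feats = torch.cat(feats).numpy()
labels = torch.cat(labels).numpy()

# Project onto the first two principal components and scatter by class.
xy = PCA(n_components=2).fit_transform(feats)
plt.figure(figsize=(6, 6))
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab20", s=8)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("PCA of backbone features (30 CUB classes)")
plt.tight_layout()
plt.savefig("pca_features.png")
```

Running the same script with the finetuned student's features in place of the teacher's would reproduce the qualitative comparison the figure makes: tighter, better-separated class clusters indicate representations that preserve more of the teacher's structure.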