At What Level of Categorization Do Neural Networks Capture Ventral Stream Representations?
Abstract
Artificial neural networks trained on large-scale object classification tasks exhibit high representational similarity to the human brain. This similarity is typically attributed to training with hundreds or thousands of object categories. In this study, we investigate an alternative question: Can coarse-grained categorization alone achieve similar brain alignment? Using the same dataset (ImageNet/Tiny-ImageNet), we construct coarse classification labels from the principal components of representations extracted from the penultimate layer of a trained AlexNet. We experiment with varying levels of granularity (2, 4, 8, and 16 categories) and analyze how representational similarity analysis (RSA) scores evolve throughout training and across regions of the ventral stream (early, mid, and higher visual areas), using fMRI responses to natural scenes. Surprisingly, we find that even broad, coarse-grained classification is sufficient to achieve RSA scores comparable to those obtained from networks trained on fine-grained object categories. Additionally, we perform a cross-decomposition analysis to further investigate the latent dimensions shared between these networks and the brain. Our findings suggest that high-level ventral stream representations may be driven more by global structure than by specific object categories, providing new insight into the nature of neural encoding in artificial and biological vision systems.
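As a rough illustration of the pipeline sketched in the abstract, the snippet below derives 2, 4, 8, or 16 coarse labels by binarizing the signs of the top principal components of penultimate-layer features, and computes a simple RSA score between two representations. This is a minimal sketch, not the paper's exact procedure: the sign-binarization rule, the function names (coarse_labels, rsa_score), the correlation-distance RDMs with a Spearman comparison, and the random arrays standing in for AlexNet features and fMRI responses are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def coarse_labels(features, n_pcs):
    """Hypothetical construction: binarize the signs of the top n_pcs
    principal components of penultimate-layer features, giving
    2**n_pcs coarse categories (1 PC -> 2, 2 -> 4, 3 -> 8, 4 -> 16)."""
    scores = PCA(n_components=n_pcs).fit_transform(features)
    bits = (scores > 0).astype(int)          # sign of each PC per image
    # Read each row's sign pattern as a binary code -> integer label.
    return bits @ (2 ** np.arange(n_pcs))

def rsa_score(feats_a, feats_b):
    """One common RSA score: Spearman correlation between the upper
    triangles of two correlation-distance RDMs (assumed metric)."""
    rdm_a = pdist(feats_a, metric="correlation")  # condensed upper triangle
    rdm_b = pdist(feats_b, metric="correlation")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho

# Random stand-ins for real data (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
net_feats = rng.standard_normal((100, 4096))  # e.g. penultimate AlexNet layer
brain_resp = rng.standard_normal((100, 500))  # e.g. voxel response patterns
labels16 = coarse_labels(net_feats, n_pcs=4)  # 16 coarse categories
print(np.bincount(labels16))                  # class sizes
print(rsa_score(net_feats, brain_resp))      # model-brain RSA score
```

In such a setup, retraining the network on labels16 in place of the original fine-grained labels, and tracking rsa_score against fMRI responses over training, would correspond to the comparison the abstract describes.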