Abstract

Real-world social perception depends on continuously integrating information from both vision and language. Understanding how the human brain achieves this integration is essential for developing AI that aligns with human multimodal processing. However, most prior neuroimaging work has examined vision and language separately, leaving open critical questions about how these distinct social signals are combined in the human brain. To address this gap, we investigate how rich social visual and verbal semantic signals are processed simultaneously by combining controlled and naturalistic fMRI paradigms with AI models. Focusing on the superior temporal sulcus (STS), a region previously shown to be sensitive to both visual social and language signals, we first localized visual social interaction perception and language regions in each participant (n=20+) using controlled social and language stimuli from prior work. We show for the first time that social interaction and language voxels in the STS are largely non-overlapping. We then investigate how these regions process a 45-minute naturalistic movie by combining vision (AlexNet) and language (sBERT) deep neural network embeddings with a voxel-wise encoding approach. We find that social interaction-selective regions are best described by vision model embeddings of the movie frames and, to a lesser extent, by language model embeddings of the spoken content. Surprisingly, language regions are described equally well by language and vision model embeddings, despite the lack of correlation between these features in the movie. Both the social and language regions are best explained by the later layers of the vision model, suggesting sensitivity to high-level visual information. These results suggest that social interaction- and language-selective brain regions respond not only to spoken language content but also to semantic information in the visual scene. In follow-up work investigating these multimodal regions, we compared multimodal deep neural network embeddings to the same naturalistic fMRI responses, revealing critical differences between vision-language alignment in modern AI systems and the human brain, particularly in how visual and linguistic information is integrated in social regions. Together, this work highlights how combining controlled and naturalistic approaches to visual cognition with AI models can help build an understanding of multimodal social processing in naturalistic contexts. At the same time, it underscores the limitations of current multimodal AI systems in predicting brain responses to naturalistic stimuli, calling for new approaches to modeling the simultaneous visual and linguistic experience of daily human life.
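
To make the voxel-wise encoding step concrete, the sketch below shows one common way frame-level vision embeddings and utterance-level language embeddings can be regressed onto voxel responses with cross-validated ridge regression. The array shapes, variable names, fold structure, and regularization grid are illustrative assumptions, not the exact pipeline used in this work.

```python
# Minimal sketch of a voxel-wise encoding analysis (illustrative assumptions throughout).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Hypothetical inputs (random placeholders stand in for real data):
#   vision_feats:   (n_TRs, d_vision)   e.g. AlexNet activations per frame, averaged within each fMRI TR
#   language_feats: (n_TRs, d_language) e.g. sentence-embedding features of the dialogue, aligned to the same TRs
#   voxel_resp:     (n_TRs, n_voxels)   BOLD responses in the localized STS voxels
n_TRs, d_vision, d_language, n_voxels = 1350, 256, 384, 500
vision_feats = rng.standard_normal((n_TRs, d_vision))
language_feats = rng.standard_normal((n_TRs, d_language))
voxel_resp = rng.standard_normal((n_TRs, n_voxels))

def encoding_performance(features, responses, n_splits=5):
    """Fit a ridge encoding model and return per-voxel prediction correlations,
    cross-validated over contiguous folds of the movie."""
    scores = np.zeros((n_splits, responses.shape[1]))
    folds = KFold(n_splits=n_splits, shuffle=False)  # contiguous folds preserve temporal structure
    for i, (train, test) in enumerate(folds.split(features)):
        model = RidgeCV(alphas=np.logspace(-2, 4, 7))
        model.fit(features[train], responses[train])
        pred = model.predict(features[test])
        # Pearson correlation between predicted and observed responses, per voxel
        p = (pred - pred.mean(0)) / pred.std(0)
        o = (responses[test] - responses[test].mean(0)) / responses[test].std(0)
        scores[i] = (p * o).mean(0)
    return scores.mean(0)

vision_r = encoding_performance(vision_feats, voxel_resp)
language_r = encoding_performance(language_feats, voxel_resp)
print(f"mean vision r = {vision_r.mean():.3f}, mean language r = {language_r.mean():.3f}")
```

Comparing the per-voxel prediction correlations for the vision versus language feature spaces, within independently localized social interaction and language regions, is the kind of analysis that supports the model comparisons summarized above.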