Acoustic Scene Classification (ASC) has typically been addressed by feeding raw audio features to deep neural networks. However, such audio-based approaches have consistently shown poor model generalization across different recording devices. In fact, device-specific transfer functions and nonlinear dynamic range compression strongly affect spectro-temporal features, causing a deviation from the learned data distribution known as domain shift. In this paper, we present an alternative ASC paradigm that replaces classic end-to-end audio-based training with an intermediate event-based representation of the acoustic scenes, obtained from large-scale pretrained models. Performance evaluation on the TAU Urban Acoustic Scenes 2020 Mobile Development dataset shows that the proposed event-based approach is up to 160% more robust than corresponding audio-based methods when faced with mismatched recording devices.
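To make the described pipeline concrete, the following is a minimal sketch, not the paper's implementation, of an event-based ASC system: a frozen pretrained audio tagger maps a waveform to event posteriors, and a small trainable head maps those posteriors to scene labels. The tagger below is a placeholder standing in for a large-scale pretrained model (e.g., one trained on AudioSet); all layer sizes, class counts, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_EVENT_CLASSES = 527   # e.g., AudioSet ontology size (assumption)
NUM_SCENE_CLASSES = 10    # TAU Urban Acoustic Scenes defines 10 scene labels


class PretrainedEventTagger(nn.Module):
    """Placeholder for a frozen, large-scale pretrained audio tagging model."""

    def __init__(self):
        super().__init__()
        # Stand-in layer; a real tagger would be loaded from a checkpoint
        # and kept frozen so only the scene classifier is trained.
        self.proj = nn.Linear(16000, NUM_EVENT_CLASSES)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> event probabilities: (batch, events)
        return torch.sigmoid(self.proj(waveform))


class EventBasedSceneClassifier(nn.Module):
    """Small classifier mapping event probabilities to acoustic scene labels."""

    def __init__(self):
        super().__init__()
        self.tagger = PretrainedEventTagger()
        for p in self.tagger.parameters():
            p.requires_grad = False  # intermediate representation stays fixed
        self.head = nn.Sequential(
            nn.Linear(NUM_EVENT_CLASSES, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_SCENE_CLASSES),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        events = self.tagger(waveform)   # intermediate event-based representation
        return self.head(events)         # scene logits


if __name__ == "__main__":
    model = EventBasedSceneClassifier()
    dummy = torch.randn(4, 16000)        # e.g., 4 one-second clips at 16 kHz
    print(model(dummy).shape)            # -> torch.Size([4, 10])
```

The intent of this design is that the event posteriors, rather than device-colored spectro-temporal features, serve as the input to the scene classifier, which is why the representation is expected to be less sensitive to recording-device mismatch.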