Cybersecurity worries intensify as Big Data, the Internet of Things, and 5G technologies develop. Based on code reuse technologies, malware creators are producing new malware quickly, and new malware is continually endangering the effectiveness of existing detection methods. We propose a vision transformer-based approach for malware picture identification because, in contrast to CNN, Transformer’s self-attentive process is not constrained by local interactions and can simultaneously compute long-range mine relationships. We use ViT-B/16 weights pre-trained on the ImageNet21k dataset to improve model generalization capability and fine-tune them for the malware image classification task. This work demonstrates that (i) a pure attention mechanism applies to malware recognition, and (ii) the Transformer can be used instead of traditional CNN for malware image recognition. We train and assess our models using the MalImg dataset and the BIG2015 dataset in this paper. Our experimental evaluation found that the recognition accuracy of transfer learning-based ViT for MalImg samples and BIG2015 samples is 99.14% and 98.22%, respectively. This study shows that training ViT models using transfer learning can perform better than CNN in malware family classification.