Existing methods for real-world driver drowsiness detection perform well in general, but their performance degrades significantly when the face is occluded, the lighting is dim, or the driver's head pose changes. To address these problems, this paper proposes a two-stream Spatial-Temporal Transformer Network (2s-STTN) for the driver drowsiness detection task. A spatial-temporal graph of facial landmarks is first extracted from the video, and 2s-STTN then produces the detection result. Within the model, a Spatial Self-Attention module learns the embeddings of different facial landmarks, while a Temporal Self-Attention module learns the correlations between facial landmarks across frames. Differently activated facial landmarks are separated and identified with class activation mapping: each stream attends to a different set of activated facial features, extracts spatial or temporal features, and fuses the facial information, thereby improving overall performance. As a result, 2s-STTN can not only mine the long-term dependencies of driver behavior from video, but also exploit the drowsiness cues provided by the unoccluded facial landmarks when part of the face is blocked. Experiments comparing the proposed model with other methods demonstrate that it achieves strong performance in driver drowsiness detection.
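The paper does not include an implementation, but the factorized design described above, spatial self-attention over landmarks within each frame followed by temporal self-attention over frames for each landmark, can be illustrated with a minimal sketch. The code below assumes PyTorch; the class name `SpatialTemporalBlock`, the tensor layout `(batch, frames, landmarks, dim)`, and the 68-landmark input are illustrative assumptions, not the authors' actual 2s-STTN architecture.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Illustrative sketch (not the authors' code): spatial self-attention
    across facial landmarks within each frame, then temporal self-attention
    across frames for each landmark, with pre-norm residual connections."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, landmarks, dim)
        b, t, v, c = x.shape

        # Spatial self-attention: landmarks attend to each other per frame.
        s = x.reshape(b * t, v, c)
        n = self.norm1(s)
        s = s + self.spatial_attn(n, n, n)[0]

        # Temporal self-attention: each landmark attends across frames.
        u = s.reshape(b, t, v, c).permute(0, 2, 1, 3).reshape(b * v, t, c)
        n = self.norm2(u)
        u = u + self.temporal_attn(n, n, n)[0]

        # Restore the (batch, frames, landmarks, dim) layout.
        return u.reshape(b, v, t, c).permute(0, 2, 1, 3)

# Example usage with assumed dimensions: 30 frames, 68 facial landmarks,
# each landmark embedded into a 64-dimensional feature vector.
block = SpatialTemporalBlock(dim=64)
clip = torch.randn(2, 30, 68, 64)
out = block(clip)  # same shape: (2, 30, 68, 64)
```

One block per stream is shown here for brevity; in a two-stream arrangement such as the one the paper describes, each stream would process a different subset of activated facial features before the streams' outputs are fused.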