IFAST: Weakly Supervised Interpretable
Face Anti-Spoofing from Single-Shot
Binocular NIR Images

arXiv Preprint (Under Review)
1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences, 3The Chinese University of Hong Kong

Abstract

Single-shot face anti-spoofing (FAS) is a key technique for securing face recognition systems, as it requires only static images as input. However, single-shot FAS remains a challenging and under-explored problem for two main reasons: 1) on the data side, learning FAS from RGB images is largely context-dependent, and single-shot images without additional annotations carry limited semantic information; 2) on the model side, existing single-shot FAS models cannot provide proper evidence for their decisions, and FAS methods based on depth estimation require expensive per-pixel annotations. To address these issues, we construct and release a large binocular NIR image dataset (BNI-FAS), which contains more than 300,000 real face and plane attack images, and propose an Interpretable FAS Transformer (IFAST) that requires only weak supervision to produce interpretable predictions. IFAST produces pixel-wise disparity maps with the proposed disparity estimation Transformer built on Dynamic Matching Attention (DMA) blocks. In addition, a well-designed confidence map generator cooperates with the proposed dual-teacher distillation module to obtain the final discriminant results. Comprehensive experiments show that IFAST achieves state-of-the-art results on BNI-FAS, demonstrating the effectiveness of single-shot FAS based on binocular NIR images.

New Paradigm of Face Anti-Spoofing

Illustration of face anti-spoofing (FAS) from single-shot binocular near-infrared (NIR) images. (a) Multi-shot FAS. (b) Single-shot depth-based FAS. (c) Our single-shot binocular disparity-based FAS. Points of the same color mark the corresponding maximum attention weight points between the left and right images, highlighting our disparity-based approach.
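To make the disparity-based idea concrete, below is a minimal sketch (not the paper's DMA implementation) of how matching attention between rectified left and right feature maps can yield a disparity map: for each left-image pixel, attention is computed over the same row of the right image, and the column of the maximum attention weight is taken as the matched point, so the column offset is the disparity. The function name and shapes are illustrative assumptions.

import torch

def disparity_from_matching_attention(feat_left, feat_right):
    """feat_left, feat_right: (B, C, H, W) feature maps from a shared encoder."""
    B, C, H, W = feat_left.shape
    # Rectified stereo: correspondences lie on the same scanline, so attention is
    # computed between left and right pixels of the same row.
    q = feat_left.permute(0, 2, 3, 1)                       # (B, H, W, C)
    k = feat_right.permute(0, 2, 1, 3)                      # (B, H, C, W)
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)          # (B, H, W, W)
    # Column index of the maximum attention weight = matched right-image column.
    matched_col = attn.argmax(dim=-1).float()               # (B, H, W)
    left_col = torch.arange(W, device=feat_left.device).view(1, 1, W).float()
    return left_col - matched_col                           # per-pixel disparity

# Toy usage with random features:
d = disparity_from_matching_attention(torch.randn(1, 64, 32, 48),
                                      torch.randn(1, 64, 32, 48))
print(d.shape)  # torch.Size([1, 32, 48])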

Interpretable FAS Transformer

Training pipeline of the proposed Interpretable FAS Transformer (IFAST). The input of IFAST is a binocular NIR face image pair (left and right images). First, the disparity estimation Transformer produces an estimated disparity map. Then, the disparity map and the left image are fed to the confidence map generator. The resulting confidence maps for the real face and plane attack classes are used to perform the FAS classification. The proposed dual-teacher distillation module supports the weakly supervised training.
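The following is a hedged sketch of this training-time data flow, not the authors' released code. The module names (IFASTSketch, disparity_net, confidence_net), the stand-in convolutional bodies, and the exact loss forms are illustrative placeholders for the disparity estimation Transformer, the confidence map generator, and the dual-teacher distillation described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IFASTSketch(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Stand-in for the disparity estimation Transformer (with DMA blocks):
        # takes the concatenated left/right NIR images, outputs a disparity map.
        self.disparity_net = nn.Sequential(
            nn.Conv2d(2, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 3, padding=1))
        # Stand-in for the confidence map generator: takes disparity + left image,
        # outputs two confidence maps (real face vs. plane attack).
        self.confidence_net = nn.Sequential(
            nn.Conv2d(2, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 2, 3, padding=1))

    def forward(self, left, right):
        disparity = self.disparity_net(torch.cat([left, right], dim=1))   # (B,1,H,W)
        conf = self.confidence_net(torch.cat([disparity, left], dim=1))   # (B,2,H,W)
        # Image-level decision: pool the per-pixel confidence maps into class logits.
        logits = conf.mean(dim=(2, 3))                                     # (B, 2)
        return disparity, conf, logits

def training_step(model, left, right, label, teacher_disp_a, teacher_disp_b):
    disparity, _, logits = model(left, right)
    cls_loss = F.cross_entropy(logits, label)        # weak (image-level) supervision
    # Dual-teacher distillation: regress toward two teachers' disparity predictions
    # instead of relying on expensive per-pixel ground-truth annotations.
    distill_loss = (F.l1_loss(disparity, teacher_disp_a) +
                    F.l1_loss(disparity, teacher_disp_b))
    return cls_loss + distill_loss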

Benchmark Experiments

Comparison of FAS methods under different test settings. Real and Attack denote the positive and negative samples in the test set, respectively. The best and second-best results are in bold and underlined, respectively. The sign "*" denotes a fully supervised method.

Visualization Results

Visual comparison of the disparity maps predicted by different methods for estimating the depth of real faces. (a) The left image. (b) StereoNet. (c) PSMNet. (d) GwcNet. (e) STTR. (f) PASMNet. (g) Dual-Net. (h) BM. (i) SGBM. (j) IFAST. The comparison shows that our IFAST produces disparity maps with higher visibility.

BibTeX


    @misc{huang2023ifast,
      title={IFAST: Weakly Supervised Interpretable Face Anti-spoofing from Single-shot Binocular NIR Images}, 
      author={Jiancheng Huang and Donghao Zhou and Shifeng Chen},
      year={2023},
      eprint={2309.17399},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
    }