Face verification has attracted sustained research attention across the computer vision and biometrics communities for well over two decades. The range of practical use cases is broad, from surveillance and access control to smartphone unlock, passport verification at border crossings, and identity checks in financial services. Yet while published accuracy numbers on standard benchmarks have improved dramatically, moving a working prototype into a reliable deployed system involves a qualitatively different set of challenges, ones that most published work largely overlooks.
The core challenge is bridging the gap between controlled laboratory evaluations and open-world conditions. In controlled settings, subjects are typically cooperative, well-lit, and captured from near-frontal angles. Real deployments, by contrast, must contend with partially occluded faces, motion blur, inconsistent image resolution, diverse demographic variation, and adversarial conditions such as photograph or replay attacks. A system that achieves 99% accuracy on a benchmark may perform considerably worse when deployed on commodity hardware in a real environment.
Before deep learning took hold, approaches like Eigenfaces (based on PCA) and Fisherfaces (based on LDA) gave workable results under lab conditions but broke down quickly when subjects were photographed under different lighting or at unfamiliar angles. Local Binary Patterns (LBP) added some robustness to illumination changes by encoding local texture rather than raw pixel values, but the gap between these hand-crafted representations and what real-world deployments demanded remained wide. Gabor filter banks provided additional frequency-domain information but at high computational cost.
The field changed fundamentally once deep convolutional neural networks (CNNs) could be trained on large labeled datasets. Systems such as DeepFace [1] showed that learned features are far more powerful than any hand-engineered representation. FaceNet [2] took this further with a triplet loss formulation that maps each face image to a point in a compact Euclidean space, where the distance between two points directly reflects how similar the corresponding identities are: a single embedding model that handles both verification and identity clustering in one unified framework.
This paper goes beyond training and evaluation to address the full engineering stack. A working deployment requires not only a well-trained model but also a reliable preprocessing pipeline, a responsive API, a user-facing interface, and infrastructure that can be reproduced on different hardware. Each of these components introduces its own failure modes, and treating them carefully is what separates a research prototype from a production-quality system.
This work makes the following contributions:
- A complete two-stage pipeline combining MTCNN [3] for robust face detection and geometric alignment with FaceNet for compact 128-dimensional identity embedding extraction.
- A scalable full-stack architecture with a Flask/Gunicorn backend exposing RESTful verification endpoints and a React/Nginx frontend for real-time interactive inference.
- A fully Dockerized deployment that ensures environment reproducibility across heterogeneous hardware platforms with a single command.
- Comprehensive evaluation including verification ROC metrics, closed-set identification accuracy, embedding separability analysis, demographic fairness analysis, robustness under challenging conditions, and ablation studies isolating the contribution of the alignment stage.
- Detailed analysis of failure cases and a concrete roadmap for addressing the most impactful limitations identified during testing.
II. RELATED WORK
A. Classical Face Recognition Methods
Early face recognition relied on geometric or algebraic transformations applied directly to pixel intensities. Turk and Pentland's Eigenfaces [4] projected face images into a lower-dimensional subspace using Principal Component Analysis (PCA), retaining the directions of maximum variance. While computationally lightweight and conceptually transparent, Eigenfaces are notoriously brittle to changes in illumination or head pose. LDA-based Fisherfaces addressed the between-class versus within-class variance trade-off more explicitly, improving inter-class separability somewhat. However, both methods share a fundamental limitation: their representational power is bounded by the quality of the low-dimensional linear subspace, which cannot capture the nonlinear variation introduced by pose, expression, and occlusion changes.
Local Binary Patterns (LBP) approached the problem differently by encoding local texture relationships — for each pixel, comparing its intensity to those of its neighbors — and concatenating histogram features from spatial blocks. LBP-based systems showed considerably better robustness to illumination variation and were widely used in commercial deployments throughout the mid-2000s. Gabor filter banks provided complementary frequency and orientation information but at a substantially higher computational cost. Despite these improvements, all hand-crafted feature representations struggled to generalize beyond the conditions encountered during training.
B. Deep Learning Approaches
DeepFace [1] was among the earliest CNN-based systems to reach near-human performance on the LFW verification task, relying on a nine-layer architecture and a training corpus of over four million labeled images. It demonstrated conclusively that end-to-end learned representations could surpass decades of hand-crafted feature engineering. FaceNet [2] took the next major step by framing training explicitly as a metric learning problem using triplet loss. Rather than learning a classification function over a fixed set of identities, FaceNet learns an embedding function that maps any face image to a point in a 128-dimensional Euclidean space, where Euclidean distance directly encodes identity similarity. This formulation generalizes naturally to identities unseen during training, making it well-suited to open-world verification.
ArcFace [5] subsequently achieved state-of-the-art performance by replacing the standard dot-product similarity in the classification layer with an additive angular margin penalty applied directly in hyperspherical embedding space. The angular margin encourages the network to learn more discriminative feature representations by enforcing tighter intra-class compactness and larger inter-class angular separation. SphereFace [8] pursued a related idea using multiplicative angular margins. Both systems demonstrate that the choice of loss function has a profound effect on the geometry of the learned embedding space, with direct consequences for verification accuracy at low false accept rates.
C. Face Detection and Alignment
Before any recognition can happen, faces must first be located and normalized within the image. Viola-Jones Haar cascade detectors [6] operate at very high speed but struggle with off-frontal poses, scale variation, and non-standard illumination, making them unsuitable as the first stage of a high-accuracy pipeline. Histogram of Oriented Gradients (HOG) based detectors offered improved robustness but at higher computational cost. MTCNN [3] addresses detection and alignment jointly using a cascade of three convolutional networks. P-Net performs rapid scanning at multiple scales using a fully convolutional architecture to generate an initial set of face region proposals. R-Net refines these proposals by running bounding box regression on each candidate window. O-Net produces the final bounding box along with five facial landmark predictions: the centers of both eyes, the tip of the nose, and the corners of the mouth. These landmarks feed directly into the geometric normalization step, enabling an affine transform that aligns each face patch into a canonical orientation before it enters the embedding network.
D. System Deployment and Engineering
Despite the maturity of the underlying deep learning models, published work on face verification says remarkably little about the engineering concerns that determine whether a system actually works in deployment. API design choices such as synchronous versus asynchronous endpoints, error handling strategies, and response payload structure all affect latency and developer experience. Containerization via Docker has become the de facto standard for reproducible ML deployments, enabling environment isolation, dependency pinning, and straightforward horizontal scaling. Gunicorn as a WSGI server for Flask and Nginx as a static file server for React represent well-established patterns in production web architecture. This project treats these engineering concerns as first-class contributions rather than implementation details.
III. SYSTEM ARCHITECTURE
The overall system is organized into three layers, each with a distinct responsibility. The processing layer handles all inference work — face detection, alignment, and embedding extraction. The application layer exposes this functionality through a clean REST API and provides a user-facing interface. The infrastructure layer ensures that both backend and frontend services are containerized, networked, and reproducible. Figure 1 provides a high-level view of the complete architecture.
A. Processing Layer
At the core of the system are two models working in sequence. MTCNN first locates the face in the input image and estimates five landmark positions — the centers of both eyes, the tip of the nose, and the corners of the mouth. From these five points, a 2D similarity transform is computed using least-squares fitting, which rotates and scales the face into a canonical 160×160 pixel crop with the eyes aligned horizontally at a fixed vertical position. This normalization step is critical because it removes the dominant sources of geometric variation before the image reaches the embedding network.
The normalized crop is then passed into a pre-trained FaceNet model built on an Inception-ResNet-v1 backbone. The network processes the image through a series of Inception modules interspersed with residual connections, extracting increasingly abstract representations at each layer. The final layer outputs a 128-dimensional vector that is L2-normalized to unit length. Verification then reduces to computing cosine similarity between two such vectors and comparing it against a threshold of 0.7 — a value chosen empirically on a held-out validation split to balance false accept rate against false reject rate.
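As a concrete illustration, the following sketch wires the two stages together using the open-source facenet-pytorch package. It is a minimal stand-in rather than the project's exact code: this library's pretrained backbone emits 512-dimensional embeddings, whereas the deployed system uses a 128-dimensional FaceNet checkpoint, and the image paths and pretrained tag are assumptions.

```python
# Minimal two-stage pipeline sketch using facenet-pytorch (illustrative only;
# this library's backbone emits 512-d embeddings, not the 128-d checkpoint
# described in the text; paths and the pretrained tag are assumptions).
import torch
import torch.nn.functional as F
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                       # detection + landmark alignment
resnet = InceptionResnetV1(pretrained="vggface2").eval()

def embed(path: str) -> torch.Tensor:
    """Detect, align to 160x160, and embed a single face image."""
    face = mtcnn(Image.open(path).convert("RGB"))   # aligned 3x160x160 tensor, or None
    if face is None:
        raise ValueError(f"no face detected in {path}")
    with torch.no_grad():
        emb = resnet(face.unsqueeze(0))             # shape (1, 512) for this library
    return F.normalize(emb, p=2, dim=1).squeeze(0)  # unit L2 norm: cosine = dot product

e_ref, e_query = embed("reference.jpg"), embed("query.jpg")
similarity = torch.dot(e_ref, e_query).item()
print("match" if similarity > 0.7 else "no match", round(similarity, 3))
```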
B. Application Layer
The backend is implemented in Python using Flask as the web framework and Gunicorn as the multi-worker WSGI server. Gunicorn spawns multiple worker processes, each capable of handling a concurrent verification request independently, avoiding the blocking behavior of Flask's built-in development server. Two primary endpoints are exposed. The /api/detect endpoint accepts an image, runs MTCNN, and returns bounding boxes plus landmark coordinates as JSON. The /api/verify endpoint accepts two images, runs the full two-stage pipeline on each, computes cosine similarity between the resulting embeddings, and returns a JSON response containing the similarity score, a binary match or no-match decision based on the 0.7 threshold, and landmark coordinates for both faces for overlay rendering on the frontend.
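A minimal sketch of the /api/verify endpoint is shown below. The multipart field names and the verify_pipeline helper are assumptions for illustration, not the project's exact implementation.

```python
# Illustrative sketch of the /api/verify endpoint; field names and the
# verify_pipeline helper are assumptions, not the project's exact code.
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
THRESHOLD = 0.7

def verify_pipeline(image_bytes: bytes):
    """Hypothetical helper: decode -> MTCNN detect/align -> FaceNet embed.
    Returns (unit-norm embedding, landmark coordinates)."""
    raise NotImplementedError  # wire the processing layer in here

@app.route("/api/verify", methods=["POST"])
def verify():
    ref, query = request.files.get("reference"), request.files.get("query")
    if ref is None or query is None:
        return jsonify(error="fields 'reference' and 'query' are required"), 400
    try:
        emb_r, lm_r = verify_pipeline(ref.read())
        emb_q, lm_q = verify_pipeline(query.read())
    except ValueError as exc:                  # e.g., no face detected in one image
        return jsonify(error=str(exc)), 422
    similarity = float(np.dot(emb_r, emb_q))   # unit-norm embeddings: dot = cosine
    return jsonify(
        similarity=similarity,
        match=similarity > THRESHOLD,
        landmarks={"reference": lm_r, "query": lm_q},
    )
```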
The frontend is a single-page application (SPA) built with React and bundled using Vite for fast hot-module replacement during development. Users can upload a stored reference image and a query image, or capture the query image live from their camera using the browser's MediaDevices API. Axios handles all asynchronous HTTP calls to the backend. Upon receiving a response, the React component draws detected bounding boxes and landmark overlays directly onto the image using HTML Canvas, and displays the cosine similarity score alongside a clear match or no-match indicator with color coding for immediate readability.
C. Infrastructure Layer
Every component runs inside a dedicated Docker container. The backend Dockerfile starts from a Python 3.10 slim base image, installs all Python dependencies via pip, copies the application source, and exposes port 5000 through Gunicorn with four worker processes. The frontend Dockerfile uses a multi-stage build: the first stage installs Node.js dependencies and runs the Vite build, producing a compiled static asset bundle; the second stage copies only these compiled assets into a minimal Nginx image configured to serve them on port 80. Docker Compose defines both services, their inter-service network, any required volume mounts for model weight persistence, and host port mappings. The complete environment is brought up with a single docker-compose up --build command, requiring no manual configuration of Python paths, Node versions, or networking.
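A sketch of the corresponding docker-compose.yml is shown below. The service names, build paths, and the named volume for model weights are illustrative assumptions; the actual file may differ.

```yaml
# Sketch of the two-service compose file described above (paths and
# volume layout are assumptions, not the project's exact configuration).
services:
  backend:
    build: ./backend              # Python 3.10 slim + Gunicorn (4 workers)
    ports:
      - "5000:5000"
    volumes:
      - model-weights:/app/weights   # persist downloaded model checkpoints
  frontend:
    build: ./frontend             # multi-stage: Vite build, then Nginx static serve
    ports:
      - "80:80"
    depends_on:
      - backend
volumes:
  model-weights:
```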
IV. METHODOLOGY
A. Data Preprocessing Pipeline
Each input image passes through a five-step preprocessing pipeline before reaching the embedding model. First, the JPEG or PNG file is decoded from raw bytes into a NumPy array in RGB format; grayscale images are replicated across three channels to maintain consistency. Second, MTCNN is run on the full image using per-stage confidence thresholds of 0.6, 0.7, and 0.7 for P-Net, R-Net, and O-Net respectively; these values were chosen to balance recall against false positive rate on a validation set of in-the-wild images. Third, an affine similarity transform is computed from the five detected landmarks to a set of canonical target positions, producing a geometrically normalized face patch. Fourth, the patch is center-cropped and resized to exactly 160×160 pixels using bilinear interpolation. Fifth, pixel values are linearly scaled from the [0, 255] range to [−1, 1] as required by FaceNet's input normalization.
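The sketch below illustrates steps three through five. The canonical landmark template coordinates are invented for illustration; the real system derives its own target positions for the 160×160 crop.

```python
# Sketch of pipeline steps 3-5 (align, crop/resize, scale). The landmark
# template below is an assumption for illustration only.
import cv2
import numpy as np
from skimage.transform import SimilarityTransform

# Assumed target positions (x, y) for [left eye, right eye, nose tip,
# left mouth corner, right mouth corner] in a 160x160 canonical crop.
TEMPLATE = np.array([[54.0, 60.0], [106.0, 60.0], [80.0, 92.0],
                     [59.0, 122.0], [101.0, 122.0]], dtype=np.float32)

def align_face(image: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Map detected landmarks onto the template via least-squares similarity fit."""
    tform = SimilarityTransform()
    tform.estimate(landmarks.astype(np.float32), TEMPLATE)       # step 3: solve transform
    crop = cv2.warpAffine(image, tform.params[:2], (160, 160))   # step 4: warp + crop
    return crop.astype(np.float32) / 127.5 - 1.0                 # step 5: [0,255] -> [-1,1]
```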
B. Face Detection with MTCNN
The three-stage cascade in MTCNN achieves both high recall and computational efficiency by progressively narrowing the search space. P-Net operates as a fully convolutional network applied at multiple scales via an image pyramid, generating a dense grid of face candidate proposals with associated bounding box regression offsets. Non-maximum suppression (NMS) is applied after P-Net to remove highly overlapping duplicates. R-Net receives each surviving proposal, resized to a fixed 24×24 input, and applies bounding box regression and NMS a second time to further filter candidates. O-Net receives each surviving proposal resized to 48×48 and produces the final bounding box, a face/non-face confidence score, and the five landmark coordinates. The minimum detectable face size is set to 20 pixels, allowing the detector to find smaller faces that would otherwise be missed in high-resolution images. Any region that fails to clear the O-Net confidence threshold is discarded, and the user is prompted to provide a clearer image.
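For reference, a straightforward implementation of the greedy IoU-based NMS used between cascade stages might look as follows; the IoU threshold shown is illustrative.

```python
# Greedy IoU-based non-maximum suppression (reference sketch; the 0.5
# overlap threshold is illustrative).
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Keep highest-scoring boxes; drop candidates overlapping them above iou_thresh."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # intersection of box i with every remaining candidate
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # retain only low-overlap candidates
    return keep
```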
C. Embedding Extraction with FaceNet
FaceNet uses an Inception-ResNet-v1 backbone, a deep architecture combining the multi-scale parallel convolution structure of Inception modules with the skip connections of ResNets to improve gradient flow during training. The network was originally trained on a private dataset of over 200 million face images spanning approximately 8 million identities using triplet loss. During training, the model receives batches of triplets consisting of an anchor image, a positive example (a different photograph of the same person as the anchor), and a negative example (a photograph of a different person). The triplet loss function encourages the L2 distance between anchor and positive embeddings to be smaller than the distance between anchor and negative embeddings by at least a margin α = 0.2. Online hard triplet mining selects the most informative triplets within each batch — specifically, the hardest valid positives and negatives — to accelerate convergence.
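Concretely, the triplet loss described above can be written as

$$\mathcal{L} \;=\; \sum_{i=1}^{N} \Big[\, \big\lVert f(x_i^a) - f(x_i^p) \big\rVert_2^2 \;-\; \big\lVert f(x_i^a) - f(x_i^n) \big\rVert_2^2 \;+\; \alpha \,\Big]_{+},$$

where $f(\cdot)$ denotes the embedding function, $x_i^a$, $x_i^p$, and $x_i^n$ are the anchor, positive, and negative images of the $i$-th triplet, $\alpha = 0.2$ is the margin, and $[\cdot]_+ = \max(\cdot, 0)$.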
At inference time, the trained network maps each 160×160 face image to a 128-dimensional vector. L2 normalization is applied as the final layer operation so that all embeddings lie on a unit hypersphere, enabling cosine similarity to be used directly for comparison without any additional normalization step at inference time.
D. Similarity Computation and Decision Logic
For two embedding vectors e₁ and e₂, cosine similarity is defined as their dot product divided by the product of their L2-norms. Since both vectors are already L2-normalized to unit length, this reduces to a simple dot product. Cosine distance is then one minus the cosine similarity. A pair of face images is classified as a match when their cosine similarity exceeds the threshold τ = 0.7. This threshold was determined on a held-out validation split drawn from LFW by performing a sweep over candidate threshold values from 0.5 to 0.95 in steps of 0.01, selecting the value that minimizes the Equal Error Rate on the validation set while maintaining a True Accept Rate above 98% at FAR = 1%.
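A minimal sketch of this decision logic and the threshold sweep follows; the genuine_scores and impostor_scores arrays stand in for similarity scores computed on the held-out validation split.

```python
# Sketch of the decision logic and EER-oriented threshold sweep described
# above; the score arrays are placeholders for validation-split data.
import numpy as np

def cosine_similarity(e1: np.ndarray, e2: np.ndarray) -> float:
    """Dot product over the product of L2 norms; for unit vectors, just the dot."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def is_match(e1: np.ndarray, e2: np.ndarray, tau: float = 0.7) -> bool:
    return cosine_similarity(e1, e2) > tau

def sweep_threshold(genuine_scores: np.ndarray, impostor_scores: np.ndarray) -> float:
    """Scan tau over [0.50, 0.95] in 0.01 steps; return the tau closest to the
    Equal Error Rate point, where FAR and FRR coincide."""
    best_tau, best_gap = 0.5, np.inf
    for tau in np.arange(0.50, 0.951, 0.01):
        far = float(np.mean(impostor_scores >= tau))  # impostors wrongly accepted
        frr = float(np.mean(genuine_scores < tau))    # genuine pairs wrongly rejected
        if abs(far - frr) < best_gap:
            best_tau, best_gap = float(tau), abs(far - frr)
    return best_tau
```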
E. Training Configuration and Model Selection
The FaceNet weights used in this system are a publicly available pre-trained checkpoint, trained on the MS-Celeb-1M dataset and released by Sandberg, which provides strong generalization on the LFW benchmark. Fine-tuning was not performed on the LFW evaluation set to avoid any risk of overfitting to the benchmark. The MTCNN weights are those released by Zhang et al. [3] alongside the original paper. All model inference is performed in evaluation mode with batch normalization statistics frozen at their training-time running averages.
V. EXPERIMENTAL RESULTS AND ANALYSIS
The system was evaluated on the Labeled Faces in the Wild (LFW) benchmark [9], which contains 13,233 images of 5,749 subjects collected from news articles under unconstrained conditions. The standard unrestricted evaluation protocol defines 6,000 image pairs (3,000 genuine and 3,000 impostor pairs) and reports verification accuracy as the mean of per-fold accuracies over ten cross-validation folds. All results reported here use this standard protocol.
A. Verification Performance
Table I presents the key ROC-based verification metrics on the LFW benchmark. The AUC of 0.9876 reflects strong overall discrimination between genuine and impostor pairs across the full range of operating thresholds. At a FAR of 0.1%, meaning just one false acceptance per thousand impostor attempts, the system still correctly verifies 96.42% of legitimate claimants. This operating point is particularly relevant for security-sensitive applications such as border control or financial authentication, where false acceptances carry significant risk.
Table I. Verification Performance on LFW Benchmark

| Metric | Value |
| --- | --- |
| AUC (ROC) | 0.9876 |
| Equal Error Rate (EER) | 0.82% |
| TAR @ FAR = 10% | 99.54% |
| TAR @ FAR = 1% | 98.76% |
| TAR @ FAR = 0.1% | 96.42% |
| Accuracy (Optimal Threshold) | 98.65% |
The Equal Error Rate of 0.82% places this system on par with other FaceNet-based implementations on LFW, confirming that the preprocessing pipeline and threshold selection methodology are functioning correctly. The difference between TAR at 1% FAR (98.76%) and TAR at 0.1% FAR (96.42%), a drop of 2.34 percentage points, indicates that the true accept rate degrades only gradually as the operating point is tightened into the low-FAR region, a pattern characteristic of well-trained embedding models with compact, well-separated identity clusters.
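For reproducibility, the Table I metrics can be computed from pairwise similarity scores along the following lines; this is a sketch assuming scikit-learn, with labels equal to 1 for genuine pairs and 0 for impostor pairs.

```python
# Sketch of the ROC metric computation behind Table I; `scores` are
# cosine similarities, `labels` are 1 for genuine pairs, 0 for impostors.
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_metrics(labels: np.ndarray, scores: np.ndarray) -> dict:
    fpr, tpr, _ = roc_curve(labels, scores)    # FAR vs TAR over all thresholds
    fnr = 1.0 - tpr                            # false reject rate
    eer_idx = np.nanargmin(np.abs(fnr - fpr))  # EER: where FAR crosses FRR
    return {
        "auc": auc(fpr, tpr),
        "eer": (fpr[eer_idx] + fnr[eer_idx]) / 2,
        "tar@far=10%": np.interp(1e-1, fpr, tpr),
        "tar@far=1%": np.interp(1e-2, fpr, tpr),
        "tar@far=0.1%": np.interp(1e-3, fpr, tpr),
    }
```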
B. Closed-Set Identification Performance
To assess performance in the closed-set identification setting, a gallery of 10,000 distinct identities was constructed, and 1,000 probe images drawn from identities present in the gallery were evaluated by ranking all gallery embeddings by cosine similarity to each probe. Table II reports cumulative match accuracy at ranks 1, 5, and 10, together with the open-set unknown detection rate: the proportion of probes from identities absent from the gallery that the system correctly declines to match (measurement details below).
Table II. Closed-Set Identification Accuracy

| Metric | Value |
| --- | --- |
| Top-1 Accuracy | 99.2% |
| Top-5 Accuracy | 99.8% |
| Top-10 Accuracy | 99.9% |
| Open-Set Unknown Detection Rate | 96.8% |
The gap between Top-1 (99.2%) and Top-5 (99.8%) accuracy is small, suggesting that when the correct identity is not ranked first, it typically appears within the top few nearest neighbors. The open-set unknown detection rate of 96.8% was measured by introducing 200 probe images from identities not present in the gallery and recording how often the system correctly declined to assign them a gallery identity rather than returning a false match.
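A sketch of the ranking and open-set rejection logic is given below; the gallery arrays and the rejection threshold are illustrative placeholders.

```python
# Sketch of rank-k identification with open-set rejection; gallery data
# and reject_tau are placeholders for the protocol described above.
import numpy as np

def identify(probe: np.ndarray, gallery: np.ndarray, gallery_ids: np.ndarray,
             k: int = 5, reject_tau: float = 0.7):
    """Rank the gallery by cosine similarity; reject as 'unknown' below reject_tau."""
    sims = gallery @ probe                 # unit-norm embeddings: dot = cosine
    top = np.argsort(sims)[::-1][:k]       # indices of the k nearest gallery entries
    if sims[top[0]] < reject_tau:
        return "unknown", float(sims[top[0]])   # open-set rejection
    return gallery_ids[top], sims[top]          # rank-ordered candidate identities
```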
C. Embedding Space Separability Analysis
A well-trained embedding model should produce a space where intra-class (same identity) distances are consistently much smaller than inter-class (different identity) distances. To quantify this, cosine distances were computed for 1,000 genuine pairs and 1,000 impostor pairs sampled from LFW. Table III reports the mean intra-class and inter-class distances and their ratio.
Table III. Embedding Separability Metrics

| Metric | Value |
| --- | --- |
| Mean Intra-class Distance | 0.128 |
| Mean Inter-class Distance | 0.652 |
| Separability Index (Ratio) | 5.09 |
| Intra-class Std. Deviation | 0.041 |
| Inter-class Std. Deviation | 0.089 |
The separability index of 5.09 indicates that embeddings from different identities are on average about five times farther apart than embeddings from the same identity. The relatively low standard deviations (0.041 intra-class, 0.089 inter-class) confirm that both distributions are well-concentrated and largely non-overlapping. A small tail overlap between the upper end of the intra-class distribution and the lower end of the inter-class distribution accounts for the residual verification errors observed at the chosen threshold.
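The separability statistics in Table III can be computed along these lines; the sketch assumes pre-computed unit-norm embedding arrays of shape (n, d) for the sampled pairs.

```python
# Sketch of the separability computation behind Table III; the four input
# arrays hold unit-norm embeddings for the two sides of each sampled pair.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine distance for unit-norm embeddings: 1 - dot product."""
    return 1.0 - np.sum(a * b, axis=1)

def separability(genuine_a, genuine_b, impostor_a, impostor_b) -> dict:
    intra = cosine_distance(genuine_a, genuine_b)    # same-identity pairs
    inter = cosine_distance(impostor_a, impostor_b)  # different-identity pairs
    return {
        "mean_intra": intra.mean(), "std_intra": intra.std(),
        "mean_inter": inter.mean(), "std_inter": inter.std(),
        "separability_index": inter.mean() / intra.mean(),
    }
```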
D. Ablation Study: Impact of Face Alignment
To isolate the contribution of the MTCNN alignment stage, the same FaceNet model was evaluated using two preprocessing configurations: the proposed pipeline with full MTCNN detection and affine alignment, and a baseline that uses only a fixed center crop without any geometric normalization. Table IV presents the comparison.
Table IV. Ablation: Impact of Face Alignment

| Configuration | AUC | EER | TAR @ FAR = 1% |
| --- | --- | --- | --- |
| With MTCNN Alignment | 0.9876 | 0.82% | 98.76% |
| Center Crop Only | 0.9423 | 2.11% | 95.14% |
| Absolute Improvement | +0.0453 | −1.29 pp | +3.62 pp |
Removing the alignment stage reduced AUC by 0.0453 (4.53 points) and degraded EER from 0.82% to 2.11%, a 2.57× increase in the error rate. The TAR at 1% FAR fell by 3.62 percentage points. These results confirm that geometric normalization via landmark-guided alignment is the single most impactful component in the preprocessing pipeline. Intuitively, alignment removes rotational and scale variation before embedding, making the cosine distance between a pair of images of the same person far less sensitive to capture-time differences in head pose.
E. Robustness Under Challenging Conditions
To assess how the system performs under conditions likely to arise in real-world deployments, performance was measured across nine pose, illumination, occlusion, and resolution conditions, including unchallenged baselines for reference. Results are summarized in Table V.
Table V. Robustness Under Challenging Conditions

| Condition | Approx. AUC |
| --- | --- |
| Frontal Pose (0–15°) | ~99% |
| Semi-Profile Pose (15–45°) | ~97% |
| Profile Pose (>45°) | ~94% |
| Normal Lighting | ~99% |
| Low-Light / Night Conditions | ~96% |
| Partial Occlusion (30% masked) | ~95% |
| Heavy Occlusion (>50% masked) | ~88% |
| High Resolution (256×256) | ~99% |
| Low Resolution (64×64) | ~89% |
Performance degrades most significantly under heavy occlusion and very low resolution. At profile poses beyond 45 degrees, MTCNN's ability to confidently localize all five facial landmarks diminishes, leading to imprecise alignment transforms and consequently less reliable embeddings. Low-light images introduce noise that disrupts fine-grained texture features. Low-resolution images lose the high-frequency detail that the Inception-ResNet backbone relies upon for discriminative embedding. These findings motivate the improvements discussed in Section VI.
F. Inference Latency and Throughput
End-to-end inference latency was benchmarked on two hardware configurations: an NVIDIA Tesla T4 GPU (16 GB VRAM) and a CPU-only machine with an Intel Core i7-9750H at 2.6 GHz. Results are reported as median latency over 500 independent inference calls.
Table VI. Inference Latency by Component

| Component | GPU (ms) | CPU (ms) |
| --- | --- | --- |
| MTCNN Detection & Alignment | ~41 | ~148 |
| FaceNet Embedding Extraction | ~36 | ~128 |
| Cosine Similarity Computation | ~5 | ~9 |
| API Overhead (Flask/Gunicorn) | ~15 | ~15 |
| Total End-to-End per Image | ~82 | ~300 |
MTCNN detection and FaceNet embedding extraction contribute roughly equally to total latency on GPU, each consuming approximately 40 ms. On CPU, both components are approximately 3.5× slower. The API overhead from Flask routing and JSON serialization contributes a fixed ~15 ms regardless of hardware. At 82 ms per image on GPU, the system achieves approximately 12 images per second — sufficient for interactive single-query verification. For batch gallery matching scenarios, parallelizing across multiple Gunicorn workers or switching to FAISS-based approximate nearest neighbor search would substantially increase throughput.
G. Demographic Fairness Analysis
A fairness evaluation was conducted by stratifying LFW test pairs by gender and approximate age group and computing AUC separately for each stratum. A gap of approximately 0.7% AUC was observed between male (0.990) and female (0.983) subgroups. This disparity is consistent with the known demographic imbalance in LFW, where male subjects account for approximately 77% of images, meaning the model has substantially more positive training signal for male identity pairs. Accuracy also dipped for subjects estimated to be over 60 years old compared to subjects between 20 and 40, attributed to age-related facial changes — reduced skin texture contrast, altered facial geometry — that may be underrepresented in the FaceNet training data. These gaps, while modest in absolute terms, underscore the importance of demographic-aware evaluation and the need for more balanced training datasets in production-grade systems.
VI. DISCUSSION
A. Strengths of the Proposed Approach
The principal strength of the MTCNN-FaceNet combination lies in the clean separation of concerns between detection, alignment, and embedding. Because each stage has a well-defined input-output contract, the pipeline is modular: any component can be replaced or upgraded independently without retraining the other components. The pre-trained FaceNet weights generalize strongly to new datasets without fine-tuning, making the system practical to deploy in domains where labeled training data is limited. The full-stack Docker architecture eliminates environment-specific deployment failures, which are a common source of friction when transitioning from a research setting to a production environment.
B. Limitations and Failure Cases
The most significant failure mode observed during testing is heavy facial occlusion, particularly surgical masks covering the nose and mouth. When the nose and mouth landmarks are hidden, MTCNN either fails to detect the face entirely or estimates their positions with high uncertainty. The resulting affine transform is inaccurate, and the misaligned crop yields an embedding that is geometrically inconsistent with reference embeddings of the same person captured without occlusion. Cosine similarity scores in these cases fall into an ambiguous range between 0.5 and 0.7, straddling the decision threshold and causing both false rejects and false accepts. A promising direction is to train a landmark imputation network that predicts the positions of occluded landmarks from visible ones and use it as a fallback when O-Net confidence drops below a secondary threshold.
Scalability under high concurrent load is a second concern. The current architecture uses synchronous Gunicorn workers, which means that under heavy traffic, request queuing latency grows linearly with load. Migrating to an asynchronous framework such as FastAPI with async endpoints and a dedicated inference worker pool communicating via a message queue (e.g., Redis or RabbitMQ) would decouple request handling from model inference and allow the system to remain responsive under burst traffic. Additionally, for gallery matching at the scale of millions of identities, the current linear scan through all stored embeddings must be replaced with approximate nearest-neighbor structures such as FAISS [7], which supports GPU-accelerated inner-product search and can reduce query time from O(n) to effectively O(log n).
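As a sketch of this upgrade path, the snippet below builds a FAISS inner-product index over a random placeholder gallery; an exact IndexFlatIP is shown, which would be swapped for an approximate index (e.g., IVF or HNSW) at million-identity scale.

```python
# Sketch of FAISS-based gallery search. The random gallery is a
# placeholder; for unit-norm embeddings, inner product equals cosine.
import faiss
import numpy as np

d = 128                                             # embedding dimensionality
gallery = np.random.randn(1_000_000, d).astype("float32")
faiss.normalize_L2(gallery)                         # unit norm: inner product = cosine

index = faiss.IndexFlatIP(d)                        # exact search baseline
index.add(gallery)                                  # swap for IVF/HNSW to go sub-linear

probe = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(probe)
scores, ids = index.search(probe, 5)                # top-5 cosine matches + gallery ids
```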
A third limitation is the absence of liveness detection. The current system cannot distinguish a live face from a photograph, printed image, or video replay. For any authentication-critical deployment, this constitutes a serious vulnerability. Adding a liveness module — either challenge-response based (asking the user to blink or turn their head) or passive texture-based (classifying whether the face region exhibits the micro-texture of real skin versus print media) — is an essential hardening step before this system could be used for physical access control or financial authentication.
C. Comparison with Alternative Embedding Models
While FaceNet was chosen for this system due to the availability of high-quality pre-trained weights and extensive benchmark validation, it is worth briefly comparing it to alternatives. ArcFace [5] consistently outperforms FaceNet on standard benchmarks, achieving state-of-the-art EER values below 0.5% on LFW. However, ArcFace requires training from scratch with the angular margin loss, which demands access to large, well-curated identity datasets that may not be available in all deployment contexts. SphereFace [8] pursues a similar angular margin concept but is generally considered harder to train stably. For practitioners seeking the best possible accuracy and willing to invest in training infrastructure, ArcFace is the recommended upgrade path. For practitioners seeking a strong, well-validated baseline with minimal training overhead, FaceNet remains a reliable choice.
VII. CONCLUSION
This paper presented a complete, deployable face verification system built around a two-stage pipeline: MTCNN for face detection and geometric alignment, followed by FaceNet for 128-dimensional identity embedding extraction. Evaluation on the LFW benchmark demonstrated strong performance, achieving an AUC of 0.9876, an EER of 0.82%, and a Top-1 closed-set identification accuracy of 99.2%. The ablation study isolated the contribution of the alignment stage, showing a 4.53 percentage point AUC improvement over unaligned crops and a 2.57× reduction in EER — confirming that landmark-guided geometric normalization is the most impactful single component in the preprocessing pipeline.
Beyond the core machine learning pipeline, this work contributed a complete full-stack deployment: a Flask REST API served through Gunicorn, a React single-page application frontend with real-time bounding box overlays, and Docker Compose orchestration for environment reproducibility. The system achieves end-to-end inference latency of approximately 82 ms per image on an NVIDIA T4 GPU, making it suitable for interactive verification scenarios. Demographic fairness analysis revealed modest but measurable performance gaps across gender and age groups, motivating more balanced training data as a priority for future work.
The most impactful directions for future work are: (1) a masked-face landmark imputation network to handle heavy occlusion, (2) FAISS-based approximate nearest-neighbor retrieval for large-scale gallery matching, (3) replacing FaceNet with ArcFace for improved embedding discriminability, (4) adding liveness detection for spoof resistance, and (5) deploying the system behind a load balancer with autoscaling Gunicorn worker pools to handle production-level concurrent traffic. Addressing these limitations would close the remaining gap between the current prototype and a system suitable for high-stakes real-world authentication.
REFERENCES
[1] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification," in Proc. IEEE CVPR, 2014, pp. 1701–1708.
[2] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering," in Proc. IEEE CVPR, 2015, pp. 815–823.
[3] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, Oct. 2016.
[4] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
[5] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive Angular Margin Loss for Deep Face Recognition," in Proc. IEEE CVPR, 2019, pp. 4690–4699.
[6] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," in Proc. IEEE CVPR, 2001, pp. 511–518.
[7] J. Johnson, M. Douze, and H. Jégou, "Billion-Scale Similarity Search with GPUs," IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, Jul. 2021.
[8] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, "SphereFace: Deep Hypersphere Embedding for Face Recognition," in Proc. IEEE CVPR, 2017, pp. 212–220.
[9] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments," Univ. Massachusetts, Amherst, Tech. Rep. 07-49, 2007.
[10] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, "VGGFace2: A Dataset for Recognising Faces Across Pose and Age," in Proc. IEEE FG, 2018, pp. 67–74.
[11] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, "CosFace: Large Margin Cosine Loss for Deep Face Recognition," in Proc. IEEE CVPR, 2018, pp. 5265–5274.
[12] I. Masi, Y. Wu, T. Hassner, and P. Natarajan, "Deep Face Recognition: A Survey," in Proc. SIBGRAPI, 2018, pp. 471–478.