Dr. Khatri Mahavidyalaya, Chandrapur, Maharashtra, India 442401
The rapid decentralization of computational intelligence from cloud environments to edge devices has become a cornerstone of modern residential security. However, existing surveillance paradigms struggle with high false-positive rates due to environmental noise and a lack of semantic understanding regarding object persistence and human intent. This paper presents a novel, real-time edge-based framework that integrates multi-stage deep learning modules to provide comprehensive home security. Our approach utilizes a lightweight YOLO-based person detection backbone, an embedding-based face authentication module, and a spatial-temporal behavioral anomaly scoring mechanism. To address asset protection, we introduce an object-state comparison module that monitors high-value items using instance segmentation and persistence modeling. To satisfy the stringent resource constraints of edge hardware, we employ structured pruning and INT8 quantization, achieving significant throughput gains on the NVIDIA Jetson platform. Furthermore, we integrate eXplainable AI (XAI) techniques, specifically Grad-CAM and SHAP, to enhance the interpretability of automated security decisions. Evaluations on a custom residential dataset demonstrate that the proposed framework achieves 96.5% intrusion detection accuracy and substantially lower latency than cloud-centric models (38 ms versus 1450 ms for the cloud-based CNN baseline), while maintaining privacy through local data processing.
1. Problem Statement
The transition toward intelligent home monitoring is hindered by the limitations of traditional motion-triggered systems, which lack the discriminative capacity to distinguish between benign movement (e.g., pets, swaying curtains) and malicious activities [1]. While cloud-based deep learning solutions offer high accuracy, they introduce unacceptable latency—often exceeding the critical window for intrusion response—and raise substantial privacy concerns regarding the transmission of private household video streams to third-party servers [2], [8]. Moreover, a significant gap exists in the detection of "Object Removal." Standard intrusion systems focus on boundary crossing but fail to detect when a specific high-value asset is displaced or stolen by an individual who may have initially bypassed detection. There is a lack of unified frameworks that can simultaneously verify identity, track object persistence, and analyze behavioral patterns locally on resource-constrained edge devices [14]. This research addresses these challenges by proposing a resource-efficient, multi-tasking architecture capable of performing real-time semantic analysis at the network edge.
2. Literature Review and Research Gaps
The field of intelligent surveillance has evolved through three distinct phases: traditional background modeling, cloud-based deep learning, and current edge-centric architectures.
2.1. Motion Detection and Object Tracking
Early systems relied on Gaussian Mixture Models (GMM) and frame differencing. While computationally inexpensive, these methods are highly sensitive to illumination fluctuations and lack semantic depth [9]. Subsequent advancements in Convolutional Neural Networks (CNNs) significantly improved detection accuracy but were initially too heavy for real-time deployment on embedded systems [11].
2.2. Edge Intelligence and Model Compression
To enable edge deployment, research has shifted toward model compression. Techniques such as weight pruning and quantization-aware training have been shown to reduce the footprint of models like YOLO and MobileNet without drastic accuracy degradation [4]. However, most studies focus on generic object detection rather than the specific temporal logic required for intrusion and theft detection [18].
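To make the footprint reduction concrete, the following is a minimal sketch of the INT8 mapping that underlies such quantization schemes. It assumes symmetric, per-tensor quantization of a small list of weights; the function names and the toy weight values are illustrative and are not taken from the cited work or from the framework's actual pipeline.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: the largest magnitude maps to 127; an all-zero tensor uses scale 1.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [v * scale for v in q]


# Toy weights (hypothetical values, for illustration only).
weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale, which is why accuracy often degrades only modestly.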
2.3. Action Recognition and Anomaly Detection
Action recognition using CNN-LSTM architectures has shown promise in identifying suspicious behavior [13]. However, these models often suffer from high computational latency, making them impractical for low-power edge gateways. Furthermore, the "black-box" nature of these models poses a challenge for residential users who require justification for triggered alarms [5].
2.4. Identified Research Gaps
3. Proposed Methodology
The proposed framework follows a hierarchical execution strategy designed to minimize power consumption by activating high-complexity modules only when necessary.
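The hierarchical idea can be sketched as a gated per-frame routine in which each heavier stage runs only if the cheaper stage before it requires it. This is an illustration under assumptions: the stage order (person detection, then face authentication, then behavior scoring), the callable names, and the `alert_threshold` value are hypothetical and are not the authors' implementation.

```python
def process_frame(frame, detect_person, is_authorized, behavior_score,
                  alert_threshold=0.5):
    """Run cheap stages first; call heavier stages only when needed."""
    box = detect_person(frame)            # cheap, always-on stage
    if box is None:
        return {"stage": "idle", "alert": False}
    if is_authorized(frame, box):         # runs only when a person is present
        return {"stage": "authorized", "alert": False}
    score = behavior_score(frame, box)    # heaviest stage, runs last
    return {"stage": "behavior", "alert": score >= alert_threshold,
            "score": score}


# Toy stand-ins so the sketch runs without any model.
empty = process_frame({}, lambda f: None,
                      lambda f, b: False, lambda f, b: 0.0)
stranger = process_frame({}, lambda f: (0, 0, 10, 10),
                         lambda f, b: False, lambda f, b: 0.9)
```

With this structure, an empty room costs only one detector call per frame, which is where the power saving would come from.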
3.1. Edge Processing Pipeline
The system operates via a continuous "Sensing and Trigger" loop:
3.2. Model Optimization for Edge Devices
We implement a three-tier optimization strategy:
4. System Architecture
4.1. Person Detection Module (PDM)
The PDM is the "always-on" component. It utilizes a modified YOLOv8-Nano architecture optimized for indoor aspect ratios. It outputs the bounding box $B_p$ of each detected person.
4.2. Face Authentication Module (FAM)
Once $B_p$ is established, the FAM extracts the facial region. It uses a lightweight Siamese network to generate a 128-dimensional embedding vector. This vector is compared against a local database of authorized embeddings $V_{auth}$.
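The comparison step can be sketched as a nearest-match search over the authorized embeddings. The text does not state which similarity measure or decision threshold is used, so cosine similarity and the threshold of 0.8 below are assumptions, as are the function names. Short 4-dimensional vectors stand in for the 128-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def authenticate(embedding, authorized, threshold=0.8):
    """Return (name, score) for the best match, or (None, score) if below threshold."""
    best_name, best_score = None, -1.0
    for name, reference in authorized.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Hypothetical local database of authorized embeddings.
authorized = {"resident_a": [1.0, 0.0, 0.0, 0.0],
              "resident_b": [0.0, 1.0, 0.0, 0.0]}
known, known_score = authenticate([0.9, 0.1, 0.0, 0.0], authorized)
unknown, unknown_score = authenticate([0.5, 0.5, 0.5, 0.5], authorized)
```

In practice the threshold would be tuned on validation data to balance false accepts against false rejects.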
4.3. Object-State Comparison Module (OSCM)
The OSCM focuses on "Regions of Interest" (RoI) where high-value objects (e.g., safes, laptops) are located. It employs a persistence counter for each object $O_j$.
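One way to read the persistence counter is as a count of consecutive frames in which a monitored object is absent from its RoI. The sketch below follows that reading. The reset-on-sighting rule, the `max_missing` limit, and the class name are assumptions made for illustration, since the counter's exact semantics are not given here.

```python
class PersistenceCounter:
    """Tracks, per object, how many consecutive frames it has been missing."""

    def __init__(self, max_missing=30):
        # Assumption: 30 frames is about one second at a 30 FPS input.
        self.max_missing = max_missing
        self.missing = {}

    def update(self, object_id, visible):
        """Record one frame; return True once the object counts as removed."""
        if visible:
            self.missing[object_id] = 0
        else:
            self.missing[object_id] = self.missing.get(object_id, 0) + 1
        return self.missing[object_id] >= self.max_missing


# A laptop that is seen once, then missing for three frames (limit set to 3).
counter = PersistenceCounter(max_missing=3)
observations = [True, False, False, False]
removed_flags = [counter.update("laptop", seen) for seen in observations]
```

Requiring several consecutive misses keeps a brief occlusion, such as a person walking in front of the object, from triggering a removal alert.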
4.4. Behavior-Aware Anomaly Detection Module (BAAD)
The BAAD module analyzes the sequence of centroids $C = \{c_1, c_2, \ldots, c_t\}$ over a temporal window $W$. It uses a lightweight GRU to predict the next likely position; significant deviations from "Routine Pathing" increase the anomaly score [17].
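The predict-then-compare logic can be illustrated without a trained network. In the sketch below a constant-velocity extrapolation stands in for the GRU predictor, and the prediction error is mapped to a score in [0, 1). Both the stand-in predictor and the exponential mapping with its `scale` parameter are assumptions, not the model described above.

```python
import math


def predict_next(centroids):
    """Constant-velocity stand-in for the GRU: extrapolate the last step."""
    (x1, y1), (x2, y2) = centroids[-2], centroids[-1]
    return (2 * x2 - x1, 2 * y2 - y1)


def anomaly_score(centroids, observed, scale=50.0):
    """Map prediction error in pixels to [0, 1); larger deviation scores higher."""
    px, py = predict_next(centroids)
    error = math.hypot(observed[0] - px, observed[1] - py)
    return 1.0 - math.exp(-error / scale)


# A subject walking steadily to the right, 10 pixels per frame.
track = [(0, 0), (10, 0), (20, 0)]
on_path = anomaly_score(track, (30, 0))       # matches the prediction
small_drift = anomaly_score(track, (30, 10))  # slight deviation
sharp_turn = anomaly_score(track, (30, 100))  # large deviation
```

A learned predictor would replace `predict_next`, while the scoring step could stay the same.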
4.5. Alert and Logging Subsystem
To ensure privacy, the subsystem converts detections into encrypted JSON metadata. Raw video is stored in a rolling 24-hour buffer locally, with only XAI-stamped frames uploaded to the user upon a confirmed alert.
5. Mathematical Modeling
5.1. Intrusion Detection Logic
Let $Z$ be the spatial domain of a restricted zone. The intrusion probability $P_{int}$ at time $t$ is modeled as:
where 1
5.2. Object Persistence Modeling
The state of a monitored object $O_j$ is defined by its existence probability $E_j$. We model the "Removal" event through a temporal decay function:
where Mref
5.3. Behavioral Anomaly Scoring
The behavioral score Ψ is calculated based on the entropy of the path H(C) and the velocity V of the subject:
where
5.4. Performance Evaluation Metrics
We evaluate the framework using the F1-score and the Latency-Accuracy Trade-off (LAT):
$$\mathrm{LAT} = \frac{\text{Precision} \cdot \text{Recall}}{\text{Inference Latency (ms)}}$$
This metric emphasizes the necessity for both high detection rates and rapid edge execution [18].
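As a worked example, both metrics can be computed directly, reading LAT as the product of precision and recall divided by inference latency in milliseconds. The precision, recall, and latency values below are hypothetical and chosen only to show the arithmetic; they are not results from the evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def lat(precision, recall, latency_ms):
    """Latency-Accuracy Trade-off: (precision * recall) / latency in ms."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return precision * recall / latency_ms


# Hypothetical detector: same accuracy, evaluated at two latencies.
precision, recall = 0.9, 0.8
f1 = f1_score(precision, recall)
lat_edge = lat(precision, recall, latency_ms=40.0)     # edge-like latency
lat_cloud = lat(precision, recall, latency_ms=1440.0)  # cloud-like latency
```

With identical precision and recall, the 40 ms configuration scores 36 times higher than the 1440 ms one, which shows how strongly LAT rewards fast inference.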
6. Experimental Setup
6.1. Edge Configuration
All experiments were conducted on an NVIDIA Jetson Orin Nano (8GB RAM) utilizing TensorRT 8.6 and CUDA 11.4. The camera input was a 1080p stream at 30 FPS.
6.2. Custom Dataset Assumptions
We utilized a custom residential dataset consisting of 12,000 frames. The dataset incorporates:
7. Comparative Analysis
The framework was benchmarked against four baseline architectures:
| Architecture | Inference Latency | Intrusion Accuracy | False Alarm Rate (FAR) |
|---|---|---|---|
| Cloud-based CNN [2] | 1450 ms | 94.2% | 8.5% |
| YOLOv8-Nano (Vanilla) [4] | 22 ms | 87.1% | 12.4% |
| GMM Motion Detection [9] | 8 ms | 41.5% | 35.0% |
| CNN-LSTM Action Recog. [13] | 185 ms | 90.8% | 7.2% |
| Proposed Framework | 38 ms | | |

Gajanan Pimpalkar*, Nikhil Singh, A Real-Time Edge-Based Deep Learning Framework for Intelligent Home Intrusion and Object Removal Detection, Int. J. Sci. R. Tech., 2026, 3 (3), 256-260. https://doi.org/10.5281/zenodo.19015502