
Edge AI · Thermal-Aware Dynamic Scaling · Senior Thesis Project

Adaptive Edge AI Controller

A hardware-aware control layer for NVIDIA Jetson Orin NX that uses real-time thermal telemetry, FOPDT prediction, and fuzzy logic to dynamically scale YOLOv8 inference parameters, preventing thermal shutdowns.

NVIDIA Jetson Orin NX · YOLOv8 · Fuzzy Logic · FOPDT Control · TensorRT · CUDA · Telemetry
[Figure: Adaptive Edge AI Controller system interface]

Running YOLO and real-time vision workloads on Jetson-class edge devices is not only a peak-performance problem; it is a long-duration reliability problem. In security cameras, robotic platforms, and unattended field systems expected to operate continuously, sustained inference load gradually accumulates heat, risking thermal throttling, performance degradation, and sudden device crashes.

Adaptive Edge AI Controller introduces a software control layer that sits above the inference loop. By querying hardware temperature sensors in real time and predicting near-future thermal pressure via a First-Order Plus Dead Time (FOPDT) model, it dynamically manages inference resolution (imgsz) and frame-processing ratios (percentage) using a fuzzy-logic controller. The result is a self-regulating system that maintains stable operation without triggering emergency shutdowns.

Demo Duration: 130 minutes
Continuous closed-loop run on Jetson hardware

Telemetry Samples: 3,900
Synchronized CSV logs of temperatures, load, and FPS

GPU Temp Range: 52.5 °C – 81.8 °C
Maintained safely below the 85 °C emergency threshold

Average GPU Temp: 74.17 °C
Stabilized within warning and critical operating bands

Role: Edge AI System Design, Control Modeling, Telemetry Logging, Performance Optimization
Challenge: Avoid thermal throttling and system crashes during continuous real-time object detection
Core Tech: Python, YOLOv8, scikit-fuzzy, FOPDT predictor, Jetson sysfs API, OpenCV GStreamer

Closed-loop thermal adaptation in action.

Watch the controller manage a real-time YOLOv8 human detection pipeline on Jetson hardware for over 2 hours, dynamically switching between operating modes to maintain thermal equilibrium.

Results Demo

Real-time YOLOv8 pipeline demonstrating live parameter scaling based on device telemetry.

[Figure: Heads-Up Display overlay rendering on-screen telemetry]
Heads-Up Display (HUD)

On-screen overlay showing measured vs. predicted temperature curves, current mode (Safe, Warning, Critical), active frame ratios, and YOLO input size during active inference.

Thermal instability in unattended Edge AI nodes.

Real-time object detectors continuously stress GPU, CPU, memory, and hardware decoders. On compact edge systems, this sustained load generates significant heat. If unchecked, the device enters thermal throttling, creating erratic latency spikes and frame drops. In severe cases, the operating system locks up entirely or performs a hard shutdown, requiring manual physical intervention on site.

Developer Forum Reports & Field Evidence

| Source | Reported Problem | Implication |
| --- | --- | --- |
| NVIDIA Developer Forums | GPU temperature climbs to 70 °C within 10 minutes of running YOLOv8 on Jetson Nano, causing stability concerns for outdoor deployments. | Long-duration field operations require temperature-trend-aware scaling rather than static configurations. |
| NVIDIA Jetson Forum | Embedded system shuts down completely after 15–20 minutes of real-time object detection from RTSP stream due to heatsink overheating. | Model optimization is insufficient on its own; application-level thermal safety must be built into the runtime. |
| Ultralytics Community | Pipeline freezes completely within 1–2 hours under multi-camera YOLO inference, necessitating physical power cycling. | Unattended remote nodes (e.g., security cameras, outdoor sensors) require proactive self-healing mechanisms. |
| NVIDIA Developer Forums | DeepStream pipeline experiences frame delays and thermal throttling at 68–70 °C on newer Jetson Orin Nano Super hardware. | Sustained inference stresses the thermal envelope of even latest-generation compact edge AI hardware. |

Self-regulation vs. hardware-level failure.

Without the controller, the system passively waits for operating-system throttling or system lockup. With the controller, the application proactively adapts its compute needs to maintain stability.

Thermal Behavior

Proactive vs. Passive
Without Controller

GPU temperature rises unchecked, leading to severe thermal stress and eventual hardware safety shutoffs.

With Controller

Workload is actively scaled back as temperatures cross thresholds, keeping the system below the 85°C emergency line.

Throughput (FPS)

Stable vs. Erratic
Without Controller

Starts high, then suffers erratic drops and latency spikes as OS-level DVFS throttling lowers clock frequencies.

With Controller

Controlled trade-off. Workload levels are adjusted smoothly, maintaining predictable and stable pipeline frame rates.

System Uptime

Continuous vs. Bounded
Without Controller

High risk of system lockups, kernel freezes, or sudden shutdowns after 15 to 60 minutes of heavy YOLO execution.

With Controller

Continuous 130+ minute operation verified under high ambient load without a single thermal freeze or crash event.

Performance Loss

Deterministic vs. Random
Without Controller

Unpredictable and arbitrary. The operating system decides which system components to throttle and when.

With Controller

Managed degradation. The application controls which dimensions (resolution vs. frame count) are sacrificed to keep running.

Operator Needs

Autonomous vs. Manual
Without Controller

High. Requires manual intervention to reset the frozen field devices or cool them down physically.

With Controller

Zero. The system self-regulates in real time, automatically restoring full parameters when the device cools down.

Multi-stage thermal-region traversal.

Safe Region

GPU Temp < 70 °C

Full-performance mode. YOLO inference runs at maximum quality (imgsz=640) and processing ratio (percentage=1.0) to prioritize detection accuracy.

Warning Region

GPU Temp 70 °C – 80 °C

Gradual scaling mode. The fuzzy-logic controller dynamically scales down input resolution (imgsz) and limits processed frames to prevent heat buildup.

Critical Region

GPU Temp 80 °C – 85 °C

Aggressive scaling mode. The controller enforces strict workload reduction (processed frame ratio drops sharply) to arrest the upward temperature trend.

Emergency Region

GPU Temp >= 85 °C

Safety override mode. Bypasses the fuzzy loop to force immediate, hard-coded limits (imgsz=320, percentage=0.25) to protect the device from thermal damage.
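The four regions above can be sketched as a small classifier plus a guard that overrides the fuzzy output. This is a minimal illustration, not the project's actual code: the names `classify_region`, `Decision`, and `apply_safety_guard` are hypothetical, and treating each boundary temperature as part of the hotter region is an assumption (the text only fixes the `>= 85 °C` case).

```python
from dataclasses import dataclass

# Thresholds from the region descriptions above (°C).
WARNING_C = 70.0
CRITICAL_C = 80.0
EMERGENCY_C = 85.0


@dataclass(frozen=True)
class Decision:
    imgsz: int
    percentage: float


# Values quoted in the text for the safe and emergency regions.
FULL_PERFORMANCE = Decision(imgsz=640, percentage=1.0)
EMERGENCY_LIMITS = Decision(imgsz=320, percentage=0.25)


def classify_region(gpu_temp_c: float) -> str:
    """Map a GPU temperature to an operating region.

    Assumption: a boundary value belongs to the hotter region.
    """
    if gpu_temp_c >= EMERGENCY_C:
        return "emergency"
    if gpu_temp_c >= CRITICAL_C:
        return "critical"
    if gpu_temp_c >= WARNING_C:
        return "warning"
    return "safe"


def apply_safety_guard(gpu_temp_c: float, fuzzy_decision: Decision) -> Decision:
    """Bypass the fuzzy output in the safe and emergency regions."""
    region = classify_region(gpu_temp_c)
    if region == "emergency":
        return EMERGENCY_LIMITS
    if region == "safe":
        return FULL_PERFORMANCE
    return fuzzy_decision
```

In the warning and critical regions the fuzzy decision passes through unchanged, so the guard only acts at the two ends of the temperature range.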

A closed-loop software control cycle.

The inference loop queries the latest control decisions before each frame. The controller runs as a parallel thread, updating metrics and resolving fuzzy parameters periodically.
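One way to picture the split between the controller thread and the inference loop is a lock-protected holder for the latest decision. The sketch below is an assumption about the structure, not the project's implementation; `SharedControlState`, `ControlDecision`, and `controller_loop` are hypothetical names, and `read_temp` and `decide` stand in for the telemetry and fuzzy stages.

```python
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ControlDecision:
    imgsz: int = 640
    percentage: float = 1.0
    mode: str = "safe"


class SharedControlState:
    """Thread-safe holder for the most recent controller decision."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._decision = ControlDecision()

    def publish(self, decision: ControlDecision) -> None:
        with self._lock:
            self._decision = decision

    def latest(self) -> ControlDecision:
        with self._lock:
            return self._decision


def controller_loop(
    state: SharedControlState,
    read_temp: Callable[[], float],
    decide: Callable[[float], ControlDecision],
    stop_event: threading.Event,
    period_s: float = 1.0,
) -> None:
    """Periodically read telemetry and publish a new decision until stopped."""
    while not stop_event.is_set():
        state.publish(decide(read_temp()))
        # Event.wait doubles as an interruptible sleep.
        stop_event.wait(period_s)
```

The inference loop would then call `state.latest()` before each frame and never block on sensor reads or fuzzy evaluation.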

01
Sysfs Telemetry

Directly queries NVIDIA Tegra sysfs nodes to extract GPU temperature, CPU utilization, and GPU load at sub-second intervals.
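As a rough illustration of the temperature part of this step, the standard Linux thermal sysfs layout exposes each zone as a directory with a `type` file and a `temp` file in millidegrees Celsius. The helper name `read_zone_temp_c` is hypothetical, and matching the GPU zone by the substring `"gpu"` is an assumption, since zone names differ between JetPack releases.

```python
from pathlib import Path
from typing import Optional


def read_zone_temp_c(
    zone_name_hint: str,
    base: Path = Path("/sys/class/thermal"),
) -> Optional[float]:
    """Return the temperature (°C) of the first zone whose type matches the hint."""
    for zone in sorted(base.glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            if zone_name_hint.lower() in zone_type.lower():
                # The kernel reports millidegrees Celsius.
                return int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            # Some zones are unreadable while their sensor is powered down.
            continue
    return None
```

A call such as `read_zone_temp_c("gpu")` returns `None` when no matching zone is readable, which lets the caller decide whether to hold the previous value or fall back to a conservative mode.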

02
Telemetry Ring Buffer

Stores recent samples to smooth sensor noise and compute short-horizon temperature derivatives (dT/dt) for trend analysis.
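A minimal sketch of such a buffer, assuming timestamped temperature samples and a least-squares slope as the dT/dt estimate (the class name and the default window length are hypothetical, and the project may use a different smoothing method):

```python
from collections import deque
from typing import Deque, Optional, Tuple


class TelemetryRingBuffer:
    """Fixed-size window of (time_s, temp_c) samples."""

    def __init__(self, maxlen: int = 30) -> None:
        # deque with maxlen drops the oldest sample automatically.
        self._samples: Deque[Tuple[float, float]] = deque(maxlen=maxlen)

    def __len__(self) -> int:
        return len(self._samples)

    def push(self, time_s: float, temp_c: float) -> None:
        self._samples.append((time_s, temp_c))

    def smoothed_temp(self) -> Optional[float]:
        """Mean temperature over the window, or None when empty."""
        if not self._samples:
            return None
        return sum(temp for _, temp in self._samples) / len(self._samples)

    def slope_c_per_s(self) -> Optional[float]:
        """Least-squares slope dT/dt over the window, in °C per second."""
        n = len(self._samples)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self._samples) / n
        mean_y = sum(y for _, y in self._samples) / n
        sxx = sum((t - mean_t) ** 2 for t, _ in self._samples)
        if sxx == 0.0:
            return None
        sxy = sum((t - mean_t) * (y - mean_y) for t, y in self._samples)
        return sxy / sxx
```

Fitting a line over the whole window is less sensitive to single-sample sensor noise than differencing the two most recent readings.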

03
FOPDT Predictor

Estimates near-future thermal pressure by modeling the Jetson device as a First-Order Plus Dead Time dynamic process.
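A FOPDT process obeys τ·dT/dt + T = T_base + K·u(t − θ): the temperature relaxes exponentially toward a steady state set by the actuator value, but only after the dead time θ has elapsed. The sketch below predicts the temperature a given horizon ahead under that model. It is an illustration under assumptions: `FopdtModel` and `base_temp_c` (the temperature at u = 0) are hypothetical, and the previous actuator value is assumed to have been held long enough that its own dead time has passed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FopdtModel:
    gain: float         # K, °C per unit of actuator
    tau_s: float        # time constant, seconds
    theta_s: float      # dead time, seconds
    base_temp_c: float  # hypothetical temperature at u = 0

    def steady_state(self, u: float) -> float:
        return self.base_temp_c + self.gain * u

    def _relax(self, temp_c: float, target_c: float, dt_s: float) -> float:
        """First-order exponential approach from temp_c toward target_c."""
        return target_c + (temp_c - target_c) * math.exp(-dt_s / self.tau_s)

    def predict(self, temp_now_c: float, u_old: float, u_new: float,
                horizon_s: float) -> float:
        """Temperature horizon_s after the actuator changes from u_old to u_new."""
        # During the dead time the process still responds to the old input.
        delay = min(horizon_s, self.theta_s)
        temp = self._relax(temp_now_c, self.steady_state(u_old), delay)
        # Afterwards it relaxes toward the new steady state.
        return self._relax(temp, self.steady_state(u_new), horizon_s - delay)
```

Comparing the predicted value against the region thresholds, instead of the measured one, lets a controller act before the threshold is actually crossed.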

04
Fuzzy Logic Engine

Translates thermal error and rate of change into continuous workload adjustment factors using Mamdani-style inference rules.
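The project uses scikit-fuzzy for this stage. To show the Mamdani mechanics without that dependency, here is a pure-Python sketch: triangular memberships, min for AND, max for OR and aggregation, and centroid defuzzification. Every breakpoint, the three rules, and the choice of error as temperature minus 70 °C are illustrative assumptions, not the project's tuned rule base.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet a, c and peak b.

    When a == b or b == c the set is a shoulder that saturates at 1.
    """
    if x <= a:
        return 1.0 if a == b else 0.0
    if x >= c:
        return 1.0 if b == c else 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def workload_factor(error_c: float, rate_c_per_s: float) -> float:
    """Mamdani inference returning a workload factor in [0.25, 1.0]."""
    # Fuzzify the inputs (assumed universes: 0..15 °C and -0.1..0.1 °C/s).
    e_low = tri(error_c, 0.0, 0.0, 7.5)
    e_med = tri(error_c, 2.5, 7.5, 12.5)
    e_high = tri(error_c, 7.5, 15.0, 15.0)
    falling = tri(rate_c_per_s, -0.1, -0.1, 0.0)
    steady = tri(rate_c_per_s, -0.05, 0.0, 0.05)
    rising = tri(rate_c_per_s, 0.0, 0.1, 0.1)

    # Rule strengths: AND = min, OR = max.
    fire_high = min(e_low, max(falling, steady))
    fire_med = max(e_med, min(e_low, rising))
    fire_low = max(e_high, min(e_med, rising))

    # Clip each output set by its rule strength, aggregate with max,
    # then take the centroid over a sampled output universe.
    num = den = 0.0
    steps = 75
    for i in range(steps + 1):
        w = 0.25 + 0.75 * i / steps
        mu = max(
            min(fire_low, tri(w, 0.25, 0.25, 0.6)),
            min(fire_med, tri(w, 0.4, 0.625, 0.85)),
            min(fire_high, tri(w, 0.65, 1.0, 1.0)),
        )
        num += w * mu
        den += mu
    return num / den if den > 0.0 else 1.0
```

Because the output is a centroid, it changes smoothly as the inputs move between sets, which is what allows gradual scaling and avoids abrupt mode switches.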

05
Safety Guard

Monitors the hardware thresholds and enforces hard override values in case of critical temperature spikes (>= 85 °C).

06
Workload Actuation

Dynamically updates the YOLOv8 pipeline, adjusting input image size (imgsz) and processed frame ratio on the fly.

[Figure: System Architecture Map]
[Figure: Closed-Loop Control Mechanism]
[Figure: Technologies & Control Layers]

Empirical validation of control effectiveness.

Telemetry recorded during a 130-minute test run demonstrates the controller successfully managing temperatures, keeping the system inside a stable thermal envelope.

[Figure: 130-minute closed-loop GPU temperature response curve]
Sustained GPU temperature profile over 130 minutes (3,900 samples). The controller keeps the temperature from crossing the 85 °C emergency threshold and pulls it back down whenever it enters the warning and critical bands.
[Figure: Operating mode distribution chart]
Operating mode distribution: Safe (18.6%), Warning (71.4%), and Critical (10.0%) mode shares during the 130-minute experiment.
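With 3,900 samples over 130 minutes, the log holds on average one row every 2 seconds (130 × 60 / 3,900). Mode shares like those above can be derived from such a log by counting rows per mode. The sketch assumes a CSV with a `mode` column; that column name and both function names are hypothetical, since the actual log schema is not shown here.

```python
import csv
from collections import Counter
from typing import Dict, Iterable, Mapping


def mode_shares(
    rows: Iterable[Mapping[str, str]],
    mode_column: str = "mode",
) -> Dict[str, float]:
    """Percentage of rows spent in each operating mode."""
    counts = Counter(row[mode_column] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: 100.0 * n / total for mode, n in counts.items()}


def mode_shares_from_csv(path: str, mode_column: str = "mode") -> Dict[str, float]:
    """Convenience wrapper that reads the telemetry log from disk."""
    with open(path, newline="") as handle:
        return mode_shares(csv.DictReader(handle), mode_column)
```

Row counts translate directly into time shares only when the sampling period is constant, which is assumed here.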

Experiment A: Processed Frame Ratio Throttling (Percentage Lever)

This experiment isolates the processed-frame ratio lever, scaling it from 1.0 (all frames inferred) to 0.25 (1 in 4 frames inferred) at a fixed resolution of 640px.

| Control State | Average GPU Temp | Average FPS | Average GPU Load | Average CPU Load |
| --- | --- | --- | --- | --- |
| percentage=1.0 (baseline) | 56.47 °C | 24.06 | 72.88% | 19.10% |
| percentage=0.25 (scaled) | 54.63 °C | 8.34 | 26.39% | 10.70% |
[Figure: Thermal response curve for FPS experiment] Thermal relief effect after percentage reduction
[Figure: GPU load response for FPS experiment] GPU utilization drop (from ~90% down to ~40–70%)
[Figure: CPU load response for FPS experiment] CPU utilization drop under frame gating
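The percentage lever in Experiment A can be pictured as a frame gate that decides, frame by frame, whether to run inference. The accumulator approach below is one common way to spread the processed frames evenly; the class name `FrameGate` is hypothetical and the project's gating logic may differ.

```python
class FrameGate:
    """Decide per frame whether to run inference, given a target ratio."""

    def __init__(self, percentage: float = 1.0) -> None:
        self._credit = 0.0
        self.percentage = percentage

    @property
    def percentage(self) -> float:
        return self._percentage

    @percentage.setter
    def percentage(self, value: float) -> None:
        # Clamp so a bad controller output cannot stall or overrun the gate.
        self._percentage = min(1.0, max(0.0, value))

    def should_process(self) -> bool:
        """Accumulate credit per frame; spend one unit to process a frame."""
        self._credit += self._percentage
        if self._credit >= 1.0:
            self._credit -= 1.0
            return True
        return False
```

With `percentage=0.25` the gate passes every fourth frame, which matches the "1 in 4 frames inferred" setting described above. Skipped frames can still be displayed with the previous detections.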

Experiment B: Input Resolution Scaling (Image-Size Lever)

This experiment isolates the resolution control lever, scaling input width between 640px and 320px while keeping the frame-processing ratio fixed at 1.0.

| Control State | Average GPU Temp | Average FPS | Average GPU Load | Average CPU Load |
| --- | --- | --- | --- | --- |
| resolution=640 (baseline) | 57.15 °C | 23.75 | 71.03% | 18.98% |
| resolution=320 (scaled) | 55.99 °C | 25.87 | 67.04% | 19.11% |
[Figure: Thermal response curve for resolution scaling] Milder temperature decline under resolution scaling
[Figure: GPU load response for resolution scaling] Modest decrease in peak GPU utilization spikes
[Figure: CPU load response for resolution scaling] Consistent CPU load across resolution settings
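When the resolution lever is driven by a continuous factor, intermediate sizes need to stay valid YOLO inputs. YOLOv8 detection models expect `imgsz` to be a multiple of the model stride (32), so a sketch of the mapping might snap to that grid. The linear mapping from factor to size and the function name `scale_to_imgsz` are assumptions for illustration.

```python
def scale_to_imgsz(
    factor: float,
    min_imgsz: int = 320,
    max_imgsz: int = 640,
    stride: int = 32,
) -> int:
    """Map a workload factor in [0, 1] to an input size on the stride grid."""
    factor = min(1.0, max(0.0, factor))
    raw = min_imgsz + factor * (max_imgsz - min_imgsz)
    # Snap to the nearest multiple of the stride, then clamp to the range.
    snapped = int(round(raw / stride)) * stride
    return max(min_imgsz, min(max_imgsz, snapped))
```

The two endpoints reproduce the 640 px and 320 px settings measured in Experiment B.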

Offline First-Order Plus Dead Time Model Fits

The thermal response parameters were identified offline by applying step changes to each actuator and fitting a First-Order Plus Dead Time transfer function to the measured response, yielding the process gain (K), time constant (Tau), and dead time (Theta).

| Step Experiment | Process Gain (K) | Time Constant (Tau) | Dead Time (Theta) | RMSE | R² |
| --- | --- | --- | --- | --- | --- |
| FPS percentage step | 3.96 °C/u | 119.1 s | 0.0 s | 0.154 °C | 0.936 |
| Image-size step | 2.50 °C/u | 82.3 s | 5.2 s | 0.076 °C | 0.963 |
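For a FOPDT model the response to an actuator step Δu is ΔT(t) = K·Δu·(1 − e^(−(t − Theta)/Tau)) for t > Theta and zero before that, so 63.2% of the final change is reached at Theta + Tau: about 119.1 s for the FPS percentage fit and 87.5 s for the image-size fit. The sketch below evaluates that response and the two fit metrics from the table, using their standard definitions. The function names are hypothetical and the fitting procedure itself is not shown.

```python
import math
from typing import Sequence


def fopdt_step_response(t_s: float, gain: float, tau_s: float,
                        theta_s: float, delta_u: float = 1.0) -> float:
    """Temperature change t_s seconds after an actuator step of size delta_u."""
    if t_s <= theta_s:
        return 0.0  # nothing happens during the dead time
    return gain * delta_u * (1.0 - math.exp(-(t_s - theta_s) / tau_s))


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between measurement and model."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("sequences must be non-empty and of equal length")
    mean = sum(measured) / len(measured)
    ss_tot = sum((m - mean) ** 2 for m in measured)
    if ss_tot == 0.0:
        raise ValueError("R² is undefined for constant measurements")
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return 1.0 - ss_res / ss_tot
```

Evaluating `fopdt_step_response` on the logged timestamps and passing the result to `rmse` and `r_squared` alongside the measured temperature change is one way to reproduce metrics of this kind.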

Practical deployment scenarios.

Smart Surveillance

Security cameras expected to run continuous target identification in outdoor enclosures under direct sunlight.

Robotics & UAVs

Compact mobile systems with strict battery limits, small physical footprints, and limited airflow cooling.

Traffic Analytics

Street-level camera nodes tracking vehicle and pedestrian volumes continuously without physical access.

Remote Edge Sensors

Unattended environmental and industrial monitoring stations where physical maintenance is costly or impossible.

I designed the closed-loop thermal experiments and implemented the real-time YOLOv8 GStreamer camera capture pipeline. I built the telemetry logger to sample Jetson Orin NX sysfs nodes, coded the FOPDT thermal predictor, and developed the Mamdani fuzzy-logic rules. I validated the control levers empirically, demonstrating that application-level self-regulation preserved system uptime throughout the 130-minute run.

Python · Ultralytics YOLOv8 · PyTorch · TensorRT · scikit-fuzzy · NVIDIA Jetson Orin NX · sysfs Telemetry · OpenCV GStreamer · Fuzzy Logic · FOPDT Modeling · Ring Buffering · CSV Logging · psutil