Edge AI · Thermal-Aware Dynamic Scaling · Senior Thesis Project

Adaptive Edge AI Controller

A hardware-aware control layer for NVIDIA Jetson Orin NX that uses real-time thermal telemetry, FOPDT prediction, and fuzzy logic to dynamically scale YOLOv8 inference parameters, preventing thermal shutdowns.

NVIDIA Jetson Orin NX · YOLOv8 · Fuzzy Logic · FOPDT Control · TensorRT · CUDA · Telemetry


Adaptive Edge AI Controller system interface visual

Overview

Running YOLO and other real-time vision workloads on Jetson-class edge devices is not only a peak-performance problem; it is also a long-duration reliability problem. In security cameras, robotic platforms, and unattended field systems that must operate continuously, sustained inference load causes heat to build up gradually, risking thermal throttling, performance degradation, and sudden device crashes.

Adaptive Edge AI Controller introduces a software control layer that sits above the inference loop. By querying hardware temperature sensors in real time and predicting near-future thermal pressure via a First-Order Plus Dead Time (FOPDT) model, it dynamically manages inference resolution (imgsz) and frame-processing ratios (percentage) using a fuzzy-logic controller. The result is a self-regulating system that maintains stable operation without triggering emergency shutdowns.

Demo Duration 130 Mins

Continuous closed-loop run on Jetson hardware

Telemetry Samples 3,900

Synchronized CSV logs of temperatures, load, and FPS

GPU Temp Range 52.5°C - 81.8°C

Maintained safely below the 85°C emergency threshold

Average GPU Temp 74.17°C

Stabilized within warning and critical operating bands

Role Edge AI System Design, Control Modeling, Telemetry Logging, Performance Optimization

Challenge Avoid thermal throttling and system crashes during continuous real-time object detection

Core Tech Python, YOLOv8, scikit-fuzzy, FOPDT predictor, Jetson sysfs API, OpenCV GStreamer

Demonstration Video

Closed-loop thermal adaptation in action.

Watch the controller manage a real-time YOLOv8 human detection pipeline on Jetson hardware for over 2 hours, dynamically switching between operating modes to maintain thermal equilibrium.

Results Demo

Real-time YOLOv8 pipeline demonstrating live parameter scaling based on device telemetry.

Heads-Up Display overlay rendering on-screen telemetry

Heads-Up Display (HUD)

On-screen overlay showing measured vs. predicted temperature curves, current mode (Safe, Warning, Critical), active frame ratios, and YOLO input size during active inference.

The Problem

Thermal instability in unattended Edge AI nodes.

Real-time object detectors continuously stress GPU, CPU, memory, and hardware decoders. On compact edge systems, this sustained load generates significant heat. If unchecked, the device enters thermal throttling, creating erratic latency spikes and frame drops. In severe cases, the operating system locks up entirely or performs a hard shutdown, requiring manual physical intervention on site.

Developer Forum Reports & Field Evidence

Source	Reported Problem	Implication
NVIDIA Developer Forums	GPU temperature climbs to 70 °C within 10 minutes of running YOLOv8 on Jetson Nano, causing stability concerns for outdoor deployments.	Long-duration field operations require temperature-trend-aware scaling rather than static configurations.
NVIDIA Jetson Forum	Embedded system shuts down completely after 15–20 minutes of real-time object detection from RTSP stream due to heatsink overheating.	Model optimization is insufficient on its own; application-level thermal safety must be built into the runtime.
Ultralytics Community	Pipeline freezes completely within 1–2 hours under multi-camera YOLO inference, necessitating physical power cycling.	Unattended remote nodes (e.g., security cameras, outdoor sensors) require proactive self-healing mechanisms.
NVIDIA Developer Forums	DeepStream pipeline experiences frame delays and thermal throttling at 68–70 °C on newer Jetson Orin Nano Super hardware.	Sustained inference stresses the thermal envelope of even latest-generation compact edge AI hardware.

A/B Evaluation

Self-regulation vs. hardware-level failure.

Without the controller, the system passively waits for operating-system throttling or system lockup. With the controller, the application proactively adapts its compute needs to maintain stability.

Dimension	Without Controller	With Controller
Thermal Behavior	GPU temperature rises unchecked, leading to severe thermal stress and eventual hardware safety shutoffs.	Workload is actively scaled back as temperatures cross thresholds, keeping the system below the 85°C emergency line.
Throughput (FPS)	Starts high, then suffers erratic drops and latency spikes as OS-level clock throttling (DVFS) lowers clock frequencies.	Controlled trade-off. Workload levels are adjusted smoothly, maintaining predictable and stable pipeline frame rates.
System Uptime	High risk of system lockups, kernel freezes, or sudden shutdowns after 15 to 60 minutes of heavy YOLO execution.	Continuous 130+ minute operation verified under high ambient load without a single thermal freeze or crash event.
Performance Loss	Unpredictable and arbitrary. The operating system decides which system components to throttle and when.	Managed degradation. The application controls which dimensions (resolution vs. frame count) are sacrificed to keep running.
Operator Needs	High. Requires manual intervention to reset frozen field devices or cool them down physically.	Zero. The system self-regulates in real time, automatically restoring full parameters when the device cools down.

Operating Modes

Multi-stage thermal-region traversal.

Safe Region

Full-performance mode. YOLO inference runs at maximum quality (imgsz=640) and processing ratio (percentage=1.0) to prioritize detection accuracy.

Warning Region

Gradual scaling mode. The fuzzy-logic controller dynamically scales down input resolution (imgsz) and limits processed frames to prevent heat buildup.

Critical Region

Aggressive scaling mode. The controller enforces strict workload reduction (processed frame ratio drops sharply) to arrest the upward temperature trend.

Emergency Region

Safety override mode. Bypasses the fuzzy loop to force immediate, hard-coded limits (imgsz=320, percentage=0.25) to protect the device from thermal damage.
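
As a rough illustration of how a measured GPU temperature could be mapped onto these four regions, here is a minimal sketch. Only the 85 °C emergency threshold comes from this page; the WARNING_C and CRITICAL_C boundaries and the names Mode and classify_mode are hypothetical placeholders, not the project's actual values.

```python
from enum import Enum

EMERGENCY_C = 85.0  # emergency threshold stated on this page
CRITICAL_C = 78.0   # ASSUMPTION: illustrative boundary only
WARNING_C = 65.0    # ASSUMPTION: illustrative boundary only


class Mode(Enum):
    SAFE = "Safe"
    WARNING = "Warning"
    CRITICAL = "Critical"
    EMERGENCY = "Emergency"


def classify_mode(gpu_temp_c: float) -> Mode:
    """Return the operating region for a GPU temperature in degrees Celsius."""
    if gpu_temp_c >= EMERGENCY_C:
        return Mode.EMERGENCY
    if gpu_temp_c >= CRITICAL_C:
        return Mode.CRITICAL
    if gpu_temp_c >= WARNING_C:
        return Mode.WARNING
    return Mode.SAFE
```

A real controller would probably add hysteresis around each boundary so the mode does not flicker when the temperature hovers near a threshold; that is left out here for brevity.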

Control Architecture

A closed-loop software control cycle.

The inference loop queries the latest control decisions before each frame. The controller runs as a parallel thread, updating metrics and resolving fuzzy parameters periodically.
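
One way to arrange this split between the inference loop and the controller thread is a small lock-protected store that the controller writes and the inference loop reads. The sketch below is an assumption about the structure, not the project's code; ControlDecision, DecisionStore, and controller_loop are hypothetical names.

```python
import threading
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ControlDecision:
    imgsz: int = 640          # YOLO input size
    percentage: float = 1.0   # fraction of frames to run inference on


class DecisionStore:
    """Thread-safe holder for the most recent control decision."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._decision = ControlDecision()

    def publish(self, decision: ControlDecision) -> None:
        with self._lock:
            self._decision = decision

    def latest(self) -> ControlDecision:
        with self._lock:
            return self._decision


def controller_loop(store: DecisionStore, stop: threading.Event,
                    period_s: float,
                    decide: Callable[[], ControlDecision]) -> None:
    """Periodically compute a decision and publish it until stop is set."""
    while not stop.is_set():
        store.publish(decide())
        stop.wait(period_s)  # sleeps, but wakes early when stop is set
```

The controller would be started with threading.Thread(target=controller_loop, args=(store, stop, period, decide), daemon=True), and the inference loop would call store.latest() before each frame. Because the decision object is immutable, the reader never sees a half-updated pair of values.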

Sysfs Telemetry

Directly queries NVIDIA Tegra sysfs nodes to extract GPU temperature, CPU utilization, and GPU load at sub-second intervals.
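
For illustration, a temperature read through the standard Linux thermal sysfs interface might look like the sketch below. The kernel exposes each zone under /sys/class/thermal/thermal_zone*/ with a type file and a temp file in millidegrees Celsius. Matching the zone by the substring "gpu" is an assumption, since zone names differ between Jetson models and software releases, and read_zone_temp_c is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional

THERMAL_ROOT = Path("/sys/class/thermal")


def read_zone_temp_c(name_hint: str = "gpu",
                     root: Path = THERMAL_ROOT) -> Optional[float]:
    """Return the temperature (deg C) of the first zone whose type contains name_hint.

    Returns None when no matching zone can be read.
    """
    for zone in sorted(root.glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            if name_hint.lower() in zone_type.lower():
                # sysfs reports millidegrees Celsius as an integer string
                return int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # unreadable or malformed zone: try the next one
    return None
```

Passing root as a parameter keeps the function testable on a development machine that has no Jetson thermal zones.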

Telemetry Ring Buffer

Stores recent samples to smooth sensor noise and compute short-horizon temperature derivatives (dT/dt) for trend analysis.
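
A ring buffer of this kind can be sketched with a bounded deque. The version below smooths with a plain mean and estimates dT/dt as a least-squares slope over the buffered samples; both choices, the default capacity, and the TelemetryRing name are assumptions made for the example.

```python
from collections import deque


class TelemetryRing:
    """Fixed-size buffer of (timestamp_s, temp_c) samples."""

    def __init__(self, capacity: int = 30) -> None:
        self._samples = deque(maxlen=capacity)  # oldest sample drops automatically

    def __len__(self) -> int:
        return len(self._samples)

    def push(self, t_s: float, temp_c: float) -> None:
        self._samples.append((t_s, temp_c))

    def mean_temp(self) -> float:
        """Smoothed temperature: arithmetic mean of the buffered samples."""
        if not self._samples:
            raise ValueError("no samples buffered")
        return sum(temp for _, temp in self._samples) / len(self._samples)

    def slope_c_per_s(self) -> float:
        """Least-squares estimate of dT/dt over the buffered window."""
        n = len(self._samples)
        if n < 2:
            return 0.0
        t_mean = sum(t for t, _ in self._samples) / n
        temp_mean = sum(temp for _, temp in self._samples) / n
        num = sum((t - t_mean) * (temp - temp_mean) for t, temp in self._samples)
        den = sum((t - t_mean) ** 2 for t, _ in self._samples)
        return num / den if den > 0 else 0.0
```

Fitting a line over the whole window is less sensitive to single-sample sensor noise than differencing the two most recent readings.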

FOPDT Predictor

Estimates near-future thermal pressure by modeling the Jetson device as a First-Order Plus Dead Time dynamic process.
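
To make the idea concrete, the sketch below evaluates the standard FOPDT step response to predict temperature a few seconds ahead. It assumes the device was near thermal steady state before the actuator changed by du; the function names are hypothetical, and the gain, time constant, and dead time would come from an offline identification step.

```python
import math


def fopdt_step_delta(t_s: float, gain: float, tau_s: float,
                     theta_s: float, du: float) -> float:
    """Temperature change t_s seconds after a step of size du in the actuator.

    Nothing happens during the dead time theta_s; afterwards the change
    approaches gain * du exponentially with time constant tau_s.
    """
    if t_s <= theta_s:
        return 0.0
    return gain * du * (1.0 - math.exp(-(t_s - theta_s) / tau_s))


def predict_temp(temp_now_c: float, horizon_s: float, gain: float,
                 tau_s: float, theta_s: float, du: float) -> float:
    """Predicted temperature horizon_s seconds after applying the step du now.

    ASSUMPTION: the system was at steady state before the step.
    """
    return temp_now_c + fopdt_step_delta(horizon_s, gain, tau_s, theta_s, du)
```

A negative du (a workload reduction) yields a predicted temperature drop, and one time constant after the dead time the response has covered about 63% of its final change.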

Fuzzy Logic Engine

Translates thermal error and rate of change into continuous workload adjustment factors using Mamdani-style inference rules.
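
The following is a hand-rolled, dependency-free sketch of Mamdani inference (min for AND, max for aggregation, centroid defuzzification). It is not the scikit-fuzzy API and not the project's rule base: the membership ranges, the five rules, and the single output "workload scale" in [0.25, 1.0] are all illustrative assumptions.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership with feet a and c and peak b (a <= b <= c)."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def fuzzy_workload_scale(error_c: float, rate_c_per_s: float) -> float:
    """Map thermal error (T - target) and dT/dt to a workload scale factor."""
    e = clamp(error_c, -10.0, 10.0)          # ASSUMED universe, deg C
    r = clamp(rate_c_per_s, -0.1, 0.1)       # ASSUMED universe, deg C/s

    cool, near, hot = tri(e, -10, -10, 0), tri(e, -5, 0, 5), tri(e, 0, 10, 10)
    falling = tri(r, -0.1, -0.1, 0)
    steady = tri(r, -0.05, 0, 0.05)
    rising = tri(r, 0, 0.1, 0.1)

    # Rule firing strengths (AND = min, rules sharing an output combined by max)
    w_low = max(hot, min(near, rising))      # hot, or near target and rising
    w_med = min(near, steady)                # near target and steady
    w_high = max(cool, min(near, falling))   # cool, or near target and falling

    # Clip each output set at its rule strength, aggregate, take the centroid
    num = den = 0.0
    for i in range(101):
        y = 0.25 + 0.75 * i / 100
        mu = max(
            min(w_low, tri(y, 0.25, 0.25, 0.6)),
            min(w_med, tri(y, 0.4, 0.625, 0.85)),
            min(w_high, tri(y, 0.65, 1.0, 1.0)),
        )
        num += y * mu
        den += mu
    return num / den if den > 0 else 0.25
```

The resulting scale factor would still need to be mapped onto the two actuators (imgsz and percentage); how the project splits it between them is not specified here.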

Safety Guard

Monitors the hardware thresholds and enforces hard override values in case of critical temperature spikes (>= 85 °C).
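
The override itself can be very small. The sketch below uses the 85 °C threshold and the hard-coded limits (imgsz=320, percentage=0.25) given on this page; treating a failed sensor read (None) as an emergency is an added fail-safe assumption, and apply_safety_guard is a hypothetical name.

```python
from typing import Optional, Tuple

EMERGENCY_C = 85.0           # emergency threshold from this page
EMERGENCY_IMGSZ = 320        # hard-coded override values from this page
EMERGENCY_PERCENTAGE = 0.25


def apply_safety_guard(gpu_temp_c: Optional[float], imgsz: int,
                       percentage: float) -> Tuple[int, float]:
    """Return the proposed settings, or the hard override when in emergency.

    ASSUMPTION: a missing reading (None) is treated as an emergency so the
    system fails toward the low-heat configuration.
    """
    if gpu_temp_c is None or gpu_temp_c >= EMERGENCY_C:
        return EMERGENCY_IMGSZ, EMERGENCY_PERCENTAGE
    return imgsz, percentage
```

Running this check after the fuzzy stage, as the last step before actuation, means no upstream logic can raise the workload while the device is at or above the threshold.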

Workload Actuation

Dynamically updates the YOLOv8 pipeline, adjusting input image size (imgsz) and processed frame ratio on the fly.
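
The two levers could be applied along these lines. The accumulator-based frame gate and the snapping of the input size to a multiple of 32 (YOLOv8's largest stride) are assumptions about one reasonable implementation; FrameGate and snap_imgsz are hypothetical names.

```python
class FrameGate:
    """Decides which frames get inference so that roughly `percentage` of them run."""

    def __init__(self, percentage: float = 1.0) -> None:
        self.percentage = percentage  # may be updated between frames
        self._acc = 0.0

    def should_process(self) -> bool:
        # Accumulate the ratio; fire whenever a whole frame's worth has built up
        self._acc += self.percentage
        if self._acc >= 1.0:
            self._acc -= 1.0
            return True
        return False


def snap_imgsz(requested: float, lo: int = 320, hi: int = 640,
               stride: int = 32) -> int:
    """Round a requested input size to a stride multiple within [lo, hi]."""
    snapped = int(round(requested / stride)) * stride
    return max(lo, min(hi, snapped))
```

With percentage=0.25 the gate passes one frame in four, matching the scaled state in Experiment A. The snapped size would then be passed as the imgsz argument of the detector call for each processed frame.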

System Architecture Diagram — System Architecture Map

Closed-Loop Control Mechanism Diagram — Closed-Loop Control Loop

Technologies and Layers Diagram — Technologies & Control Layers

Telemetry Analysis

Empirical validation of control effectiveness.

Telemetry recorded during a 130-minute test run demonstrates the controller successfully managing temperatures, keeping the system inside a stable thermal envelope.

130-minute closed-loop GPU temperature response curve — Sustained GPU temperature profile over 130 minutes (3,900 samples). The controller prevents crossing the emergency 85°C threshold, forcing temperature drops when entering warning and critical bands.

Operating mode distribution chart — Operating mode distribution: Safe (18.6%), Warning (71.4%), and Critical (10.0%) mode shares during the 130-minute experiment.

Experiment A: Processed Frame Ratio Throttling (Percentage Lever)

This experiment isolates the processed-frame ratio lever, scaling it from 1.0 (all frames inferred) to 0.25 (1 in 4 frames inferred) at a fixed resolution of 640px.

Control State	Average GPU Temp	Average FPS	Average GPU Load	Average CPU Load
percentage=1.0 (baseline)	56.47 °C	24.06	72.88%	19.10%
percentage=0.25 (scaled)	54.63 °C	8.34	26.39%	10.70%

Thermal response curve for FPS experiment — Thermal relief effect after percentage reduction

GPU load response for FPS experiment — GPU utilization drop (from ~90% down to ~40-70%)

CPU load response for FPS experiment — CPU utilization drop under frame gating

Experiment B: Input Resolution Scaling (Image-Size Lever)

This experiment isolates the resolution control lever, scaling input width between 640px and 320px while keeping the frame-processing ratio fixed at 1.0.

Control State	Average GPU Temp	Average FPS	Average GPU Load	Average CPU Load
resolution=640 (baseline)	57.15 °C	23.75	71.03%	18.98%
resolution=320 (scaled)	55.99 °C	25.87	67.04%	19.11%

Thermal response curve for Resolution scaling — Milder temperature decline under resolution scaling

GPU load response for Resolution scaling — Modest decrease in peak GPU utilization spikes

CPU load response for Resolution scaling — Consistent CPU load across resolution settings
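
To compare the two levers side by side, the short script below recomputes the differences directly from the averages reported in the Experiment A and Experiment B tables. The dictionary layout and the summarize helper are only a convenience for this example.

```python
# Averages copied from the Experiment A and B tables: (baseline, scaled)
ROWS = {
    "percentage 1.0 -> 0.25": {
        "temp_c": (56.47, 54.63), "fps": (24.06, 8.34), "gpu_load": (72.88, 26.39),
    },
    "imgsz 640 -> 320": {
        "temp_c": (57.15, 55.99), "fps": (23.75, 25.87), "gpu_load": (71.03, 67.04),
    },
}


def summarize(row: dict) -> dict:
    """Temperature drop (deg C), FPS change (%), GPU load drop (percentage points)."""
    t0, t1 = row["temp_c"]
    f0, f1 = row["fps"]
    g0, g1 = row["gpu_load"]
    return {
        "temp_drop_c": t0 - t1,
        "fps_change_pct": 100.0 * (f1 - f0) / f0,
        "gpu_load_drop_pts": g0 - g1,
    }


for name, row in ROWS.items():
    s = summarize(row)
    print(f"{name}: temp -{s['temp_drop_c']:.2f} C, "
          f"FPS {s['fps_change_pct']:+.1f}%, "
          f"GPU load -{s['gpu_load_drop_pts']:.2f} pts")
```

On these averages the percentage lever gives the larger temperature drop (1.84 °C against 1.16 °C) but costs most of the frame rate, while the image-size lever gives a smaller drop with no loss in average FPS.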

Offline First-Order Plus Dead Time Model Fits

The thermal response parameters were identified offline by applying step changes to each actuator and fitting a First-Order Plus Dead Time transfer function (gain K, time constant Tau, and dead time Theta) to the recorded temperature response.

Step Experiment	Process Gain (K)	Time Constant (Tau)	Dead Time (Theta)	RMSE	R²
FPS percentage step	3.96 °C/u	119.1 s	0.0 s	0.154 °C	0.936
Image-size step	2.50 °C/u	82.3 s	5.2 s	0.076 °C	0.963
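
For reference, the textbook FOPDT model that these three parameters describe is written out below, with Δu denoting the size of the actuator step and ΔT the resulting temperature change. This is the standard form of the model, not a formula taken from the project code.

```latex
G(s) = \frac{\Delta T(s)}{\Delta u(s)} = \frac{K\, e^{-\theta s}}{\tau s + 1}

\Delta T(t) =
\begin{cases}
0, & t < \theta \\[4pt]
K\,\Delta u \left(1 - e^{-(t-\theta)/\tau}\right), & t \ge \theta
\end{cases}
```

In this form the response reaches about 63.2% of its final value at t = θ + τ. With the fitted values in the table, that is 119.1 s after the percentage step and 5.2 + 82.3 = 87.5 s after the image-size step.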

Target Systems

Practical deployment scenarios.

Smart Surveillance

Security cameras expected to run continuous target identification in outdoor enclosures under direct sunlight.

Robotics & UAVs

Compact mobile systems with strict battery limits, small physical footprints, and limited airflow cooling.

Traffic Analytics

Street-level camera nodes tracking vehicle and pedestrian volumes continuously without physical access.

Remote Edge Sensors

Unattended environmental and industrial monitoring stations where physical maintenance is costly or impossible.

My Contribution

I designed the closed-loop thermal experiments and implemented the real-time YOLOv8 GStreamer camera capture pipeline. I built the telemetry logger to sample Jetson Orin NX sysfs nodes, coded the FOPDT thermal predictor, and developed the Mamdani fuzzy-logic rules. I validated the control levers empirically, proving that application-level self-regulation preserves system uptime.

Technology Stack

Python · Ultralytics YOLOv8 · PyTorch · TensorRT · scikit-fuzzy · NVIDIA Jetson Orin NX · sysfs Telemetry · OpenCV GStreamer · Fuzzy Logic · FOPDT Modeling · Ring Buffering · CSV Logging · psutil
