New Method Boosts AI Vision on Low-Power Devices
Researchers have developed a novel technique, SharpZO, that significantly improves the performance of vision-language models (VLMs) on memory-constrained devices. The approach addresses a key challenge in deploying advanced AI on edge hardware: it enables efficient fine-tuning without computationally intensive backpropagation.
Key Takeaways
- SharpZO enables efficient fine-tuning of vision-language models on memory-constrained devices.
- It utilizes a two-stage optimization process combining evolutionary strategies and zeroth-order estimation.
- The method achieves significant accuracy improvements and faster convergence compared to existing forward-only techniques.
- SharpZO demonstrates robustness to distribution shifts and is suitable for parameter-efficient fine-tuning.
The Challenge of Fine-Tuning on Edge Devices
Vision-language models (VLMs) excel at a wide range of computer vision tasks. However, the standard method for updating them, backpropagation, demands more compute and memory than resource-limited edge devices can supply. This has driven the exploration of alternative fine-tuning strategies that rely solely on forward passes, drastically reducing resource needs. Zeroth-order (ZO) estimation is one such approach, but existing ZO-based methods lag behind backpropagation in both accuracy and convergence speed.
A primary hurdle for ZO methods is the high variance of their gradient estimates, which makes updates noisy and inconsistent. This instability can prevent models from converging and may leave them stuck in suboptimal regions of the loss landscape. The loss landscape can be visualized as a high-dimensional surface that maps each parameter setting to the model's error. Backpropagation computes the gradient (slope) exactly to guide parameters downhill, while ZO estimates it by sampling the loss at nearby points.
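The sampling idea can be sketched with a simple two-point estimator. This is a generic illustration of zeroth-order gradient estimation, not SharpZO's own estimator, and the function names are ours:

```python
import numpy as np

def zo_gradient(loss_fn, theta, mu=1e-3, n_samples=64, rng=None):
    """Two-point zeroth-order gradient estimate: probe the loss at
    theta +/- mu*u along random directions u and average the resulting
    finite-difference slopes. Needs only forward passes of loss_fn."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = rng.standard_normal(theta.shape)
        slope = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)
        grad += slope * u
    return grad / n_samples

# Toy check on a quadratic bowl, where the true gradient is 2 * theta.
theta = np.array([1.0, -2.0, 0.5])
g = zo_gradient(lambda t: float(np.sum(t ** 2)), theta, n_samples=512)
```

With only a few samples the estimate is noisy — exactly the high-variance problem described above; averaging over more directions reduces the variance at the cost of more forward passes.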
Introducing SharpZO: A Hybrid Approach
Presented at NeurIPS 2025, SharpZO is a hybrid sharpness-aware zeroth-order optimization approach designed for fine-tuning VLMs using only forward passes. It employs a two-stage optimization process:
- Global Exploration Stage: This stage uses an evolutionary strategy, specifically a sharpness-aware covariance matrix adaptation evolution strategy (CMA-ES), to smooth the loss landscape and construct a robust initialization. CMA-ES maintains a Gaussian search distribution over parameter vectors, iteratively updating the distribution's mean and covariance matrix toward lower-loss regions. SharpZO modifies CMA-ES by adding a term to the loss function that accounts for the worst loss in a neighborhood of the current parameters, further smoothing the landscape.
- Local Search Stage: After global exploration, a modified sparse ZO algorithm performs a refined local search. Traditional sparse ZO reduces the dimensionality of the gradient estimate by discarding its low-magnitude entries. SharpZO goes further by normalizing the gradient vector using its mean and standard deviation, which also helps smooth the landscape.
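The two stage-specific ingredients can be sketched in toy NumPy code. This is a rough illustration under our own assumptions (illustrative function names and hyperparameters, random perturbations in place of the paper's exact formulation), not the released implementation:

```python
import numpy as np

def sharpness_aware_fitness(loss_fn, theta, rho=0.05, n_probe=4, rng=None):
    """Stage 1 fitness (illustrative): penalize the loss with the worst
    loss found among a few random perturbations of radius rho, so the
    evolutionary search prefers flat regions of the landscape."""
    rng = rng or np.random.default_rng(0)
    worst = -np.inf
    for _ in range(n_probe):
        eps = rng.standard_normal(theta.shape)
        eps *= rho / np.linalg.norm(eps)  # project onto the rho-sphere
        worst = max(worst, loss_fn(theta + eps))
    return loss_fn(theta) + worst

def sparse_normalized_zo_step(loss_fn, theta, lr=0.1, mu=1e-3, keep=0.5, rng=None):
    """Stage 2 update (illustrative): a two-point ZO estimate that keeps
    only the largest-magnitude entries, then standardizes the survivors
    by their mean and standard deviation before the descent step."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(theta.shape)
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu) * u
    k = max(1, int(round(keep * g.size)))       # number of entries to keep
    mask = np.zeros_like(g)
    mask[np.argsort(np.abs(g))[-k:]] = 1.0      # keep largest-magnitude entries
    kept = g[mask.astype(bool)]
    g = (g - kept.mean()) / (kept.std() + 1e-8) * mask
    return theta - lr * g
```

In this sketch, the sharpness penalty makes flat minima score better than sharp ones during the evolutionary warmup, and the sparse, standardized update keeps the local search cheap and well-scaled.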
Performance and Implications
In experiments across 11 diverse downstream tasks using CLIP models, SharpZO delivered significant gains. It improved accuracy by up to 7% on average over forward-only methods such as ZIP and BlackVIP. On several tasks its performance approached that of CoOp, a first-order method that requires backpropagation. SharpZO also converged substantially faster, reaching target accuracy on ImageNet in 15.3 minutes, versus 19 minutes for ZIP and 170 minutes for BlackVIP.
Beyond accuracy and speed, SharpZO reduces the memory footprint because it never needs to store gradients. The method also proved robust to distribution shifts, outperforming baselines on out-of-distribution tasks such as sketch recognition and adversarial images. While currently optimized for prompt tuning (a form of parameter-efficient fine-tuning), scaling to full-model fine-tuning and controlling the computational cost of the CMA-ES warmup stage in high-dimensional settings remain areas for future research. This work is a collaboration between Amazon and UCSB.