To enable effective exploration, undirected stochasticity (e.g., Gaussian noise) is often injected into the policy parameterization, but its effectiveness collapses as system dimensionality grows.
Existing online RL methods for high-dimensional control often resort to dimensionality reduction to mitigate this issue, which sacrifices the flexibility and robustness of complex systems.
We propose Q-guided Flow Exploration (Qflex), which
• samples from a probability flow guided by the state-action value function Q (sketched below)
• achieves directed exploration with policy improvement validity
• preserves full system flexibility by exploring over the native high-dimensional action space
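Concretely, Qflex draws actions by integrating a probability-flow ODE whose velocity field transports a source distribution toward high-Q regions. A minimal sketch of the sampling process, in notation of our own choosing (the summary above does not fix any symbols):

    da_t/dt = v_theta(s, a_t, t),   a_0 ~ p_0(· | s),   executed action = a_1,

where v_theta is the learned velocity field and p_0 is the learnable source distribution described in the actor-critic implementation below.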
We provide an actor-critic implementation of Qflex, where we
• maintain a learnable source distribution to facilitate informative initialization
• construct the Q-guided flow from finite-step gradient ascent on the learned Q-network
• learn the policy in a simulation-free manner via flow matching (see the sketch after this list)
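The following is a minimal, hedged sketch of the training step these bullets describe, not the authors' implementation: q_net, vel_net, and all hyperparameters are illustrative assumptions, and the straight-line (rectified-flow-style) interpolation is one concrete way to instantiate the flow-matching target.

import torch

def q_guided_targets(q_net, s, a0, n_steps=5, step_size=0.1):
    # Refine source actions by finite-step gradient ascent on the learned Q-network.
    a = a0
    for _ in range(n_steps):
        a = a.detach().requires_grad_(True)
        q = q_net(s, a).sum()                 # scalar so autograd.grad applies directly
        grad = torch.autograd.grad(q, a)[0]   # dQ/da
        a = a + step_size * grad              # ascend the value landscape
    return a.detach()

def flow_matching_loss(vel_net, q_net, s, a_src):
    # Simulation-free flow matching: regress the velocity field onto the
    # displacement from source actions to their Q-improved counterparts.
    a_tgt = q_guided_targets(q_net, s, a_src)
    t = torch.rand(a_src.shape[0], 1)         # random interpolation times in [0, 1]
    a_t = (1 - t) * a_src + t * a_tgt         # linear interpolant between endpoints
    v_target = a_tgt - a_src                  # constant velocity along that path
    v_pred = vel_net(s, a_t, t)
    return ((v_pred - v_target) ** 2).mean()

In a standard actor-critic loop, q_net would be trained with the usual TD targets, a_src would be drawn from the learnable source distribution, and actions at execution time would be obtained by integrating vel_net from that source.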
Qflex outperforms representative online RL baselines over a wide range of benchmarks for high-dimensional continuous control.
Qflex preserves the flexibility of the system, enabling running and ballet dancing on a 700-muscle full-body musculoskeletal system.
Qflex consistently generates better samples than Gaussian-based sampling, achieving greater policy improvement.
We propose a scalable reinforcement learning method, Qflex, which is capable of
• exploring the original high-dimensional action space with a policy improvement guarantee
• achieving efficient learning in continuous control over various high-dimensional dynamical systems
• preserving the system's flexibility to facilitate agile and complex full-body movement control
@inproceedings{
wei2026scalable,
title={Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow},
author={Wei, Yunyue and Zuo, Chenhui and Sui, Yanan},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={http://arxiv.org/abs/2601.19707}
}