Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow

Yunyue Wei,   Chenhui Zuo,   Yanan Sui
Tsinghua University     
International Conference on Learning Representations (ICLR), 2026

Figure 1. Exploration behavior across increasing dimensionality.

Gaussian-based exploration: undirected exploration whose exploratory reach collapses as system dimensionality increases.
Q-guided flow exploration (ours): directed exploration toward high-value modes in high dimensions by following a value-guided flow.

In this paper, we aim to develop a scalable and efficient online reinforcement learning (RL) method for continuous control of high-dimensional dynamical systems. Such systems often present challenges that significantly hinder efficient learning:
• High-dimensionality: the size of the state-action space grows rapidly with dimension, leading to pronounced “curse-of-dimensionality” effects.
• Over-actuation: with far more actuators than degrees of freedom, multiple action sequences can yield indistinguishable kinematics but different internal forces and costs.

These challenges make effective exploration crucial for both learning efficiency and control performance.

Vanishing Effectiveness of Undirected Exploration


Figure 2. Gaussian-based exploration vanishes as action dimensionality increases.

The gray polyline depicts a planar kinematic chain with |A| degrees of freedom. The orange background (darker is higher) visualizes the state–action value Q. Green contours show the end-effector distribution induced by an undirected Gaussian proposal over joint angles, whose exploratory reach collapses as |A| increases.

To enable effective exploration, undirected stochasticity (e.g., Gaussian noise) is often injected into the policy parameterization, but its effectiveness collapses as system dimensionality grows.
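As a minimal numerical illustration of this collapse (a toy setup, not the paper's kinematic-chain experiment), consider a quadratic value landscape Q(a) = -||a - a*||². Under an isotropic Gaussian perturbation of scale σ, the expected change in value is -σ²|A|, so the fraction of proposals that improve Q shrinks rapidly with dimension. The quadratic Q, step scale, and sample count below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1       # illustrative exploration scale
n_samples = 10000

for dim in (2, 20, 200, 2000):
    a = np.zeros(dim)                 # current action
    a_star = np.zeros(dim)
    a_star[0] = 1.0                   # toy optimum at unit distance
    Q = lambda x: -np.sum((x - a_star) ** 2, axis=-1)    # toy quadratic value

    eps = rng.normal(0.0, sigma, size=(n_samples, dim))  # undirected Gaussian proposals
    delta = Q(a + eps) - Q(a)                            # change in value per proposal

    print(f"|A|={dim:5d}  P(improve)={np.mean(delta > 0):.3f}  mean dQ={np.mean(delta):+.3f}")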

Existing online RL methods for high-dimensional control often use dimension reduction to mitigate this issue, which sacrifices the flexibility and robustness of complex systems.

Scalable Exploration via Value-Guided Flow


Figure 3. Scalable exploration achieved by Q-guided flow.

Red streamlines/contours depict Q-guided probability flows that transport probability mass from the Gaussian proposal toward high-value modes, sustaining directed exploration in high dimensions.

We propose Q-guided Flow Exploration (Qflex), which
• samples from a probability flow guided by the state-action value function Q (a minimal sketch follows this list)
• achieves directed exploration with policy improvement validity
• preserves full system flexibility by exploring over the native high-dimensional action space
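A minimal sketch of the sampling idea in the first bullet, under illustrative assumptions: draw an action from a Gaussian source and transport it toward high-value regions by integrating a few Euler steps along the action gradient of a learned Q-network. The step size, step count, and network interface below are assumptions, not the paper's exact construction.

import torch

def q_guided_sample(q_net, state, action_dim, mu, sigma, steps=8, step_size=0.05):
    """Draw an action from a Gaussian source, then follow grad_a Q(s, a) for a few
    Euler steps to transport probability mass toward high-value modes.
    Illustrative sketch: the step schedule and Q-network interface are assumptions."""
    a = mu + sigma * torch.randn(action_dim)      # undirected Gaussian proposal
    for _ in range(steps):
        a = a.detach().requires_grad_(True)
        q = q_net(state, a)                       # scalar state-action value
        (grad_a,) = torch.autograd.grad(q, a)     # direction of value increase
        a = a + step_size * grad_a                # move along the Q-guided flow
    return a.detach()

In the actor-critic instantiation described below, this gradient-ascent transport serves as the target that the policy learns to reproduce via flow matching, rather than being executed with critic gradients at every environment step.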

We provide an actor-critic implementation of Qflex, where we
• maintain a learnable source distribution to facilitate informative initialization
• construct the Q-guided flow from finite-step gradient ascent on the learned Q-network
• learn the policy in a simulation-free manner via flow matching (see the sketch after this list)
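A sketch of one actor update implied by these bullets, under illustrative assumptions: finite-step gradient ascent on the learned Q-network turns a source sample a0 into a higher-value target a1, and the policy's velocity field is regressed toward the straight-line flow between them (standard conditional flow matching with a linear interpolant). The interpolant, network shapes, and hyperparameters are assumptions rather than the paper's exact design.

import torch

def flow_matching_actor_loss(vel_net, q_net, state, a0, ascent_steps=5, ascent_lr=0.05):
    """One flow-matching actor update (illustrative sketch).

    a0: batch of actions from the learnable Gaussian source distribution.
    vel_net(state, a_t, t): predicts the flow velocity at interpolation time t.
    """
    # Build Q-guided targets by finite-step gradient ascent on the critic.
    a1 = a0.detach()
    for _ in range(ascent_steps):
        a1 = a1.detach().requires_grad_(True)
        q = q_net(state, a1).sum()                 # sum over the batch for autograd
        (grad_a,) = torch.autograd.grad(q, a1)
        a1 = a1 + ascent_lr * grad_a
    a1 = a1.detach()

    # Conditional flow matching with an assumed linear interpolant:
    # a_t = (1 - t) * a0 + t * a1, with target velocity a1 - a0.
    t = torch.rand(a0.shape[0], 1)
    a_t = (1.0 - t) * a0 + t * a1
    v_pred = vel_net(state, a_t, t)
    return ((v_pred - (a1 - a0)) ** 2).mean()

Here “simulation-free” refers to the flow-matching objective itself: the velocity field is fit by regression on interpolated points, without integrating the flow ODE during training.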

Experimental Results

Control over high-dimensional systems

Qflex outperforms representative online RL baselines over a wide range of benchmarks for high-dimensional continuous control.


Figure 4. Performance on high-dimensional control benchmarks.

(a) Morphologies and state-action dimensions of the evaluated benchmarks. (b) Learning curves of the algorithms. Results show mean performance with one standard deviation over 5 independent runs. Baselines in the second row are run only on musculoskeletal benchmarks.

Full-body human musculoskeletal motion control

Qflex preserves full system flexibility, enabling running and ballet dancing on a 700-muscle full-body musculoskeletal system.


Figure 5. Steady whole-body running gait learned by Qflex.

The grid plot on the right shows muscle activations during the movement.


Figure 6. Ballet routine with single-leg spins and balance learned by Qflex.

The grid plot on the right shows muscle activations during the movement.

Algorithm analysis

Qflex consistently generates better samples than Gaussian-based sampling, achieving stronger policy improvement.


Figure 7. Sample quality of Qflex compared to the source Gaussian policy during training.

Over-actuated musculoskeletal control tasks are denoted by dash-dotted lines.

Conclusion

We propose a scalable reinforcement learning method, Qflex, which is capable of
• exploring the original high-dimensional action space with a policy improvement guarantee
• achieving efficient learning in continuous control over various high-dimensional dynamical systems
• preserving system flexibility to facilitate agile and complex full-body movement control

BibTeX

@inproceedings{wei2026scalable,
  title     = {Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow},
  author    = {Wei, Yunyue and Zuo, Chenhui and Sui, Yanan},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {http://arxiv.org/abs/2601.19707}
}