Multi-task learning is a very challenging problem in reinforcement learning. While training multiple tasks jointly allows the policies to share parameters across different tasks, the optimization problem becomes non-trivial: it is unclear which parameters in the network should be reused across tasks, and the gradients from different tasks may interfere with each other. Thus, instead of naively sharing parameters across tasks, we introduce an explicit modularization technique on the policy representation to alleviate this optimization issue. Given a base policy network, we design a routing network that estimates different routing strategies to reconfigure the base network for each task. Moreover, instead of creating a concrete route for each task, our task-specific policy is represented by a soft combination of all possible routes. We name this approach soft modularization. We conduct experiments on multiple robotic manipulation tasks in simulation and show that our method improves sample efficiency by a large margin while still achieving performance on par with individual policies trained for each task.
Our framework contains a base policy network with multiple modules (left part of the left figure) and a routing network (right part of the left figure). The base policy network has L layers, and each layer contains n modules. The routing network predicts L-1 layers of probabilities that weight the connections between different modules in the base policy network. The soft combination of the modules is used to predict the action.
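The forward pass described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `soft_modular_forward`, the use of a single linear map plus ReLU for each module, the softmax normalization over source modules, the summation of the last layer's module outputs, and the output projection `out_weight` are all assumptions made for illustration. The routing logits are passed in directly here, whereas in the framework they would be predicted by the routing network.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_modular_forward(obs_feat, module_weights, routing_logits, out_weight):
    """Softly route a feature vector through L layers of n modules.

    obs_feat:       (d,) input feature vector.
    module_weights: list of L arrays, each (n, d, d); one weight matrix
                    per module (hypothetical linear + ReLU modules).
    routing_logits: list of L-1 arrays, each (n, n); entry [i, j] scores
                    the connection from module j in one layer to module i
                    in the next (assumed to come from a routing network).
    out_weight:     (d, action_dim) hypothetical output projection.
    """
    # First layer: every module processes the same input feature.
    h = np.maximum(np.einsum("d,nde->ne", obs_feat, module_weights[0]), 0.0)

    for layer, logits in enumerate(routing_logits, start=1):
        # Each row becomes a probability distribution over source modules.
        probs = softmax(logits, axis=1)              # (n, n)
        # Input to each module is a weighted sum of previous-layer outputs.
        mixed = probs @ h                            # (n, d)
        h = np.maximum(
            np.einsum("nd,nde->ne", mixed, module_weights[layer]), 0.0
        )

    # Combine the last layer's module outputs and project to an action.
    return h.sum(axis=0) @ out_weight


rng = np.random.default_rng(0)
L, n, d, action_dim = 4, 3, 8, 2
weights = [rng.normal(size=(n, d, d)) * 0.1 for _ in range(L)]
logits = [rng.normal(size=(n, n)) for _ in range(L - 1)]
out_w = rng.normal(size=(d, action_dim))
action = soft_modular_forward(rng.normal(size=d), weights, logits, out_w)
print(action.shape)
```

Because every connection receives a nonzero probability, no module is ever hard-selected or dropped; different tasks differ only in how strongly each connection is weighted, and the whole computation stays differentiable.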
Sampled observations and the corresponding routing visualizations.
For each column, we visualize the routing networks for two different tasks that share similar routing. Even though the tasks are different, they can still share similar module connections. This shows that our soft modularization method allows the reuse of skills across different manipulation tasks.
We highlight the shared parts with blue boxes. The pairs of tasks are: (a) Close Drawer and Insert Peg; (b) Push and Close Window; (c) Reach and Pick Place; (d) Open Door and Open Drawer.
We extract the probabilities predicted by the routing network for different tasks and visualize them with t-SNE. The routing probabilities from different tasks are grouped into different clusters.
In addition, we notice that tasks sharing similar structures (e.g., drawer-open-v1 and drawer-close-v1, window-open-v1 and window-close-v1) are close to each other in the t-SNE plot.
Ruihan Yang, Huazhe Xu, Yi Wu, Xiaolong Wang. Multi-Task Reinforcement Learning with Soft Modularization
(hosted on arXiv)