Abstract

Recent studies have indicated that visual distractions can impair the performance of RL agents when observations in the evaluation environment differ significantly from those in the training environment. This issue is even more critical in the visual offline RL paradigm, where the collected dataset can differ drastically from the testing environment. In this work, we investigate an adversarial algorithm to address the problem of visual distraction in offline RL settings. Our adversarial approach trains agents to learn features that are more robust to visual distractions. Furthermore, we propose a dataset that complements the V-D4RL[1] distraction dataset by extending it to more locomotion tasks. We empirically demonstrate that our method surpasses state-of-the-art baselines on tasks from both V-D4RL and the proposed dataset when evaluated under random visual distractions.

Problem

[Figure: scheme]

We hypothesise that the subpar performance of the current state-of-the-art baselines can be attributed to visual distractions that are present during evaluation but absent from the offline RL training dataset. These unseen visual distractions confuse the encoder, causing it to produce less robust latent features z. Consequently, the actor-critic backbone, which relies on the latent features z, cannot learn a robust policy or accurate state-action value estimates. Baseline agents can achieve expert-level performance in an environment whose visual distractions also appear in the training dataset. However, when we evaluated the same agents in an environment with random visual distractions, their performance diminished significantly.
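
To make this failure mode concrete, the sketch below shows one hypothetical way to probe it: encoding the same underlying states with and without distractions and measuring how far the latent features drift apart. The `encoder`, `clean_obs`, and `distracted_obs` names are placeholders for illustration, not part of any released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def latent_drift(encoder: torch.nn.Module,
                 clean_obs: torch.Tensor,
                 distracted_obs: torch.Tensor) -> float:
    """Mean cosine distance between latent features of paired observations.

    `clean_obs` and `distracted_obs` are assumed to show the same underlying
    states, differing only in the visual distractions.
    """
    z_clean = encoder(clean_obs)        # latent features z from clean frames
    z_dist = encoder(distracted_obs)    # latent features z from distracted frames
    cos = F.cosine_similarity(z_clean.flatten(1), z_dist.flatten(1), dim=-1)
    return (1.0 - cos).mean().item()    # 0 = identical features, larger = more drift
```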

Proposed Method

[Figure: scheme]

Our proposed method is based on DrQv2+BC[1,2] and incorporates two components: (i) a domain discriminator that is trained adversarially against the encoder, and (ii) DropBlock[3] layers added to the encoder, both of which promote robustness against unseen visual distractions. An overview of this architecture is presented in the figure above. Our agent is trained on datasets from two domains: a normal observation domain and a visually distracted domain. Intuitively, domain-specific visual distractions would allow the discriminator to accurately classify which domain a latent feature comes from. We aim to induce the opposite and train the visual encoder so that its latent features are indistinguishable to the domain discriminator. We achieve this with a gradient reversal layer[4] that inverts the gradients from the discriminator when updating the encoder.
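
The sketch below illustrates how these two components could be wired together in PyTorch: a gradient reversal layer, a DropBlock-regularised convolutional encoder, and a domain discriminator. The network sizes, DropBlock parameters, and the reversal coefficient `lambd` are illustrative assumptions rather than the exact values used in our implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DropBlock2d

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class DistractionRobustEncoder(nn.Module):
    """Convolutional encoder with DropBlock regularisation (layer sizes are illustrative)."""
    def __init__(self, in_channels=9, feature_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            DropBlock2d(p=0.1, block_size=3),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            DropBlock2d(p=0.1, block_size=3),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(feature_dim)

    def forward(self, obs):
        h = self.conv(obs / 255.0 - 0.5)      # assumes stacked uint8 frames
        return self.proj(h.flatten(1))        # latent features z

class DomainDiscriminator(nn.Module):
    """Predicts whether a latent feature came from the clean or the distracted domain."""
    def __init__(self, feature_dim=50, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, z, lambd=1.0):
        # Gradient reversal: the discriminator minimises its classification loss,
        # while the encoder receives inverted gradients and is pushed toward
        # domain-invariant latent features.
        return self.net(grad_reverse(z, lambd))
```

A single adversarial update could then look as follows, with domain labels 0 for clean and 1 for distracted observations (again a sketch; the domain loss would be added to the usual DrQv2+BC objectives):

```python
# `encoder`, `discriminator`, `clean_obs`, and `distracted_obs` are placeholders.
z = encoder(torch.cat([clean_obs, distracted_obs]))
labels = torch.cat([torch.zeros(len(clean_obs)),
                    torch.ones(len(distracted_obs))]).long()
domain_loss = nn.functional.cross_entropy(discriminator(z, lambd=0.1), labels)
# Backpropagating domain_loss updates the discriminator normally, but the
# reversed gradients train the encoder to make the two domains indistinguishable.
```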

Proposed Dataset

V-D4RL[1] provides a starting benchmark for visual distractions in offline RL settings. However, its distracted observations were collected for only one task. Extending V-D4RL therefore yields a more comprehensive benchmark for offline continuous control with visual distractions. We collected distracted datasets for four additional tasks, each with three difficulty levels.

[Figure: dataset]
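
For concreteness, the snippet below sketches the kind of episode-collection loop used to build such distracted datasets for each task and difficulty level. The `make_distracted_env` factory, the gym-style environment API, and the npz field names are assumptions for illustration and only loosely follow the DrQ-v2/V-D4RL episode format.

```python
import numpy as np

def collect_episodes(make_distracted_env, policy, task, difficulty, n_episodes, out_dir):
    """Roll out a behaviour policy and store each episode as a compressed npz file."""
    env = make_distracted_env(task, difficulty)   # e.g. difficulty in {"easy", "medium", "hard"}
    for ep in range(n_episodes):
        obs, done = env.reset(), False
        frames, actions, rewards = [obs], [], []
        while not done:
            action = policy(obs)                  # behaviour policy used to generate the data
            obs, reward, done, _ = env.step(action)
            frames.append(obs)
            actions.append(action)
            rewards.append(reward)
        np.savez_compressed(
            f"{out_dir}/{task}_{difficulty}_{ep:06d}.npz",
            observation=np.stack(frames),         # T+1 stacked frames (includes the reset frame)
            action=np.stack(actions),
            reward=np.array(rewards, dtype=np.float32),
        )
```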


Experiment results

Overall, the experimental results demonstrate that the proposed method consistently outperforms the other model-free offline RL baselines in most tasks, with DrQv2+BC and IQL[5] being the second-best algorithms overall. Interestingly, training directly on more challenging datasets does not yield better performance, even on difficult tasks, than training on simpler datasets and evaluating on the more difficult ones. We hypothesise that this is because the harder datasets do not contain sufficient variation in distractions, so agents struggle to learn to deal with them, whereas training on the easier datasets allows the agent to infer some domain invariance across distractions.

[Figure: results1]


[Figure: results2]


References

[1] Cong Lu et al., Challenges and opportunities in offline reinforcement learning from visual observations. In TMLR, 2023
[2] Danijar Hafner et al., Mastering Atari with discrete world models. In ICLR, 2021
[3] Golnaz Ghiasi et al., DropBlock: A regularization method for convolutional networks. In NeurIPS, 2018
[4] Yaroslav Ganin and Victor Lempitsky, Unsupervised domain adaptation by backpropagation. In ICML, 2015
[5] Ilya Kostrikov et al., Offline reinforcement learning with implicit Q-learning. In ICLR, 2022