About Me

I am currently an Applied Research Scientist on the Trust and Safety team at the Alberta Machine Intelligence Institute (Amii). Before Amii, I completed my PhD at Toronto Metropolitan University under the supervision of Nariman Farsad and Isaac Woungang. My PhD work examined how to improve the capabilities of multi-task reinforcement learning agents in scenarios where a single policy must accomplish multiple tasks. Previously, I completed my MSc in Computer Science at Brock University under the supervision of Beatrice Ombuki-Berman, and I received my BSc in Computer Science from Trent University.

Research

My current research focuses on the reliability, robustness, and trustworthiness of reinforcement learning. This includes both using RL to perform automated safety testing (e.g. automated red-teaming) and doing RL research that looks specifically at performance beyond episodic returns.

Past Experience

Before joining Amii, I was a part-time lecturer in the Computer Science Department at Brock University. Previously, I was an intern at Royal Bank of Canada working on supporting their technical infrastructure using AIOps methods. Upon the completion of my MSc, I was the Lead Machine Learning Developer at Castle Ridge Asset Management.

Publications

* Indicates equal contributions.
Position: AI Agent Safety is a Reinforcement Learning Problem
Reginald McLean, Tabitha Edith Lee, Montaser Mohammedalamen, Kevin Roice, Glen Berseth, Patrick M. Pilarski, Marlos C. Machado, Alyssa Lefaivre Škopac, Benjamin Rosman
Accepted at Second Workshop on Agents in the Wild: Safety, Security, and Beyond, ICML 2026.
With the rapid advancement and deployment of Agentic AI, our scientific understanding of capabilities and limitations has not kept pace, leading to cases where AI agents cause harm. We argue that many of these safety limitations are not novel problems. Instead, the safety challenges currently facing AI agents can be seen as instances of problems the reinforcement learning (RL) community has studied rigorously for decades. The core of this argument concerns the problem formulation of AI agents. AI agents are designed to solve sequential decision-making problems: problems with long-term objectives in which actions have delayed consequences. To model these types of problems, the problem is set up such that the agent receives observations and feedback on its progress, and then takes actions. This is precisely the formulation of the RL problem. In this paper, we formalize the problem equivalence, which we then leverage to argue that AI Agent safety is a reinforcement learning problem: the failure modes currently observed in deployed AI agents are structural instances of problems RL has formalized for decades, and the RL safety literature provides principled tools to diagnose and address them. We conclude with a call for deliberate collaboration between the RL and AI agent research communities: AI agent researchers gain access to principled frameworks, while RL researchers gain a class of real-world problems that could expose fundamental gaps in current RL benchmarks and theory.
A Systematic Investigation of The RL-Jailbreaker in LLMs
Montaser Mohammedalamen, Kevin Roice, Reginald McLean, Alyssa Lefaivre Škopac
Accepted at Second Workshop on Agents in the Wild: Safety, Security, and Beyond, ICML 2026.
The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardening. Adversarial jailbreaking, the strategic manipulation of models to elicit harmful output, remains a primary threat to safe deployment. While Reinforcement Learning (RL) frames jailbreaking as a multi-step attack through sequential optimization, a mechanistic understanding of why the framework succeeds remains incomplete. To fill this gap, we present the first systematic decomposition of RL jailbreaking. We deconstruct the framework into problem formalization (reward function, action space, episode length) and algorithmic measures (RL algorithm, training data, reward shaping) to identify the structural determinants of adversarial success. Our results reveal that the RL-jailbreaker successfully compromised all targeted models and safeguards. Through this first-of-its-kind analysis, we demonstrate that environment formalization, specifically dense rewards and extended episode lengths, is the primary driver of jailbreaking success. This work provides a tool for improving RL-jailbreaker efficiency and, ultimately, for hardening generative models against RL-based attacks.
Meta-World+: An Improved, Standardized, RL Benchmark
Reginald McLean*, Evangelos Chatzaroulas*, Luc McCutcheon, Frank Röder, Tianhe Yu, Zhanpeng He, K.R. Zentner, Ryan Julian, J.K. Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Accepted at Conference on Neural Information Processing Systems 2025.
Accepted at International Conference on Machine Learning (ICML) 2025, Workshop on CODEML: Championing Open-source DEvelopment in Machine Learning (Spotlight, Oral).
Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction, however, there have been numerous undocumented changes that inhibit a fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging the past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release an open-source version of Meta-World that has full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks that are included in a task set.
Multi-Task Reinforcement Learning Enables Parameter Scaling
Reginald McLean*, Evangelos Chatzaroulas*, J.K. Terry, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Accepted at Reinforcement Learning Conference, 2025.
Outstanding Paper on Scientific Understanding in Reinforcement Learning.
Multi-task reinforcement learning (MTRL) aims to endow a single agent with the ability to perform well on multiple tasks. Recent works have focused on developing novel sophisticated architectures to improve performance, often resulting in larger models; it is unclear, however, whether the performance gains are a consequence of the architecture design itself or the extra parameters. We argue that gains are mostly due to scale by demonstrating that naively scaling up a simple MTRL baseline to match parameter counts outperforms the more sophisticated architectures, and these gains benefit most from scaling the critic over the actor. Additionally, we explore the training stability advantages that come with task diversity, demonstrating that increasing the number of tasks can help mitigate plasticity loss. Our findings suggest that MTRL's simultaneous training across multiple tasks provides a natural framework for beneficial parameter scaling in reinforcement learning, challenging the need for complex architectural innovations.
Overcoming State and Action Space Disparities in Multi-Domain, Multi-Task Reinforcement Learning
Reginald McLean, Kai Yuan, Isaac Woungang, Nariman Farsad, Pablo Samuel Castro
Accepted at Morphology-Aware Policy and Design Learning Workshop @ CoRL 2024 (Spotlight, Oral).
Current multi-task reinforcement learning (MTRL) methods have the ability to perform a large number of tasks with a single policy. However, when attempting to interact with a new domain, the MTRL agent would need to be re-trained due to differences in domain dynamics and structure. Because of these limitations, we are forced to train multiple policies even though tasks may have shared dynamics, which requires more samples and is thus sample inefficient. In this work, we explore the ability of MTRL agents to learn in various domains with various dynamics by simultaneously learning in multiple domains, without the need to fine-tune extra policies. In doing so, we find that an MTRL agent trained in multiple domains achieves an increase in sample efficiency of up to 70% while maintaining the overall success rate of the MTRL agent.
Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Minttu Alakuijala, Reginald McLean, Isaac Woungang, Nariman Farsad, Samuel Kaski, Pekka Marttinen, Kai Yuan
Accepted at Transactions on Machine Learning Research
Accepted at Workshop on Language and Robot Learning: Language as an Interface @ CoRL 2024
Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
Swarm Based Algorithms for Neural Network Training
Reginald McLean, Beatrice Ombuki-Berman, Andries P. Engelbrecht
Accepted at 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
The purpose of this paper is to compare the abilities and deficiencies of various swarm based algorithms for training artificial neural networks. This paper uses seven algorithms, seven regression problems, sixteen classification problems, and four bounded activation functions to compare algorithms with regard to loss, accuracy, hidden unit saturation, and overfitting. It was found that particle swarm optimization was the top algorithm for regression problems based on loss, while the firefly algorithm was the top algorithm for classification problems when examining accuracy and loss. The ant colony optimization and artificial bee colony algorithms caused the least amount of hidden unit saturation, with the bacterial foraging optimization algorithm producing the least amount of overfitting.

reginald k mclean at gmail dot com

Department of Computer Science
Toronto Metropolitan University
Toronto, Ontario
Canada