Cooperative Multi-Agent Deep RL with MATD3
This project explores cooperative multi-agent reinforcement learning (MARL) using the Multi-Agent Twin Delayed Deep Deterministic Policy Gradient (MATD3) algorithm.
The goal was to compare two learning strategies:
- Independent Learners – each agent learns with only its local observations/rewards.
- Fully Observable Critic – agents share a central critic with access to the global state.
The experiments were conducted in the Simple Spread and Simple Speaker Listener environments from the PettingZoo MPE suite.
Simulation Environments
Simple Spread
- 3 agents, 3 landmarks
- Agents must spread to cover all landmarks while avoiding collisions.
- Reward = coverage term (negative sum of each landmark's distance to its nearest agent) – collision penalties.
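As a rough illustration of this reward structure, the sketch below scores a configuration by the distance from each landmark to its closest agent and subtracts a fixed penalty per colliding pair. The function name `spread_reward`, the collision radius, and the penalty value are assumptions made for this example; the PettingZoo implementation may differ in detail.

```python
import math
from itertools import combinations

def spread_reward(agents, landmarks, collision_radius=0.3, collision_penalty=1.0):
    """Hypothetical shared reward for a Simple Spread-like task (illustrative only)."""
    # Coverage term: for every landmark, the distance to its nearest agent.
    coverage = sum(min(math.dist(a, l) for a in agents) for l in landmarks)
    # Collision term: number of agent pairs closer than collision_radius (assumed value).
    collisions = sum(
        1 for a, b in combinations(agents, 2) if math.dist(a, b) < collision_radius
    )
    return -coverage - collision_penalty * collisions

landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Each agent sits exactly on its own landmark: no distance cost, no collisions.
spread_out = spread_reward([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], landmarks)

# All agents crowd around the first landmark: two landmarks are uncovered
# and every pair of agents collides.
crowded = spread_reward([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)], landmarks)
```

The spread-out configuration scores higher than the crowded one, which is the behavior the agents are meant to learn.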
Simple Speaker Listener
- 2 agents: one speaker, one listener
- The speaker sees the goal landmark but cannot move.
- The listener can move but cannot see the goal; it relies on the speaker’s communication.
MATD3 Algorithm
MATD3 extends TD3 to multi-agent settings with:
- Actor-Critic architecture (separate networks for policy and value estimation).
- Twin critics (the smaller of two Q-estimates is used in the learning target) to mitigate Q-value overestimation.
- Target Networks + Delayed Policy Updates for training stability.
- Action Noise for better exploration.
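The components listed above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard TD3 building blocks for a single agent, not the project's or AgileRL's implementation; all function names and default values (`gamma`, `tau`, `policy_delay`, `noise_std`) are assumptions chosen for the example.

```python
import random

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Twin-critic target: bootstrap from the smaller of the two target-critic values."""
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)

def soft_update(target_params, online_params, tau=0.005):
    """Target networks: move target parameters slowly toward the online ones."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

def should_update_actor(step, policy_delay=2):
    """Delayed policy updates: update the actor only every policy_delay critic steps."""
    return step % policy_delay == 0

def explore(action, noise_std=0.1, low=-1.0, high=1.0, rng=random):
    """Action noise: perturb each action component with Gaussian noise, then clip."""
    return [max(low, min(high, a + rng.gauss(0.0, noise_std))) for a in action]

# The two target critics disagree; the more pessimistic estimate (-12.0) is used.
y = td3_target(reward=-1.0, done=0.0, q1_next=-10.0, q2_next=-12.0)
```

In the multi-agent case, each agent would hold its own actor and pair of critics, and only the critic inputs change between the two strategies compared in this project.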
MATD3 Pseudocode
Actor-Critic Diagram
Independent Learners vs Fully Observable Critic
- Independent Learners: each agent has its own critic, based only on its local observations.
  - Pros: scalable, decentralized.
  - Cons: less coordination.
- Fully Observable Critic: a centralized critic sees the entire environment state and all agent actions.
  - Pros: better coordination and performance.
  - Cons: slower training, heavier computation.
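One concrete way to see the difference is the size of each critic's input. The sketch below assumes the global state is approximated by concatenating all agents' observations, and uses illustrative dimensions for a three-agent task (18-dimensional observations, 5-dimensional actions); the helper `critic_input_dim` is hypothetical, and the real dimensions should be read from the environment's observation and action spaces.

```python
def critic_input_dim(obs_dims, act_dims, agent, centralized):
    """Size of the input vector fed to one agent's critic (illustrative helper)."""
    if centralized:
        # Fully observable critic: every agent's observation and action, concatenated.
        return sum(obs_dims.values()) + sum(act_dims.values())
    # Independent learner: only the agent's own observation and action.
    return obs_dims[agent] + act_dims[agent]

# Assumed, illustrative dimensions for a three-agent task.
obs_dims = {"agent_0": 18, "agent_1": 18, "agent_2": 18}
act_dims = {"agent_0": 5, "agent_1": 5, "agent_2": 5}

independent_dim = critic_input_dim(obs_dims, act_dims, "agent_0", centralized=False)
centralized_dim = critic_input_dim(obs_dims, act_dims, "agent_0", centralized=True)
```

With these numbers the centralized critic's input is three times larger than the independent one (69 versus 23 values), and it grows linearly with the number of agents, which is one source of the heavier computation noted above.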
Experimental Setup
- Framework: AgileRL
- Training episodes: 6000
- Optimization: Evolutionary Hyperparameter Search
- Metrics: episodic reward, learning stability, task completion
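For readers unfamiliar with evolutionary hyperparameter search, the following is a generic sketch of one generation: keep the best member, fill the rest of the population by tournament selection, and mutate the selected hyperparameters. It does not use AgileRL's API; the function names, the single `lr` hyperparameter, and the fitness values are made up for illustration.

```python
import random

def tournament_select(population, fitnesses, k, rng):
    """Draw k members at random and return a copy of the fittest one."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return dict(population[best])

def mutate(hparams, rng, scale=1.2):
    """Hypothetical mutation: scale the learning rate up or down by a fixed factor."""
    child = dict(hparams)
    child["lr"] *= scale if rng.random() < 0.5 else 1.0 / scale
    return child

def next_generation(population, fitnesses, rng, k=2):
    """Elitism plus tournament selection and mutation for the remaining slots."""
    best_index = max(range(len(population)), key=lambda i: fitnesses[i])
    elite = dict(population[best_index])  # the best member survives unchanged
    children = [
        mutate(tournament_select(population, fitnesses, k, rng), rng)
        for _ in range(len(population) - 1)
    ]
    return [elite] + children

rng = random.Random(0)
population = [{"lr": 1e-4}, {"lr": 3e-4}, {"lr": 1e-3}, {"lr": 3e-3}]
fitnesses = [-60.0, -45.0, -50.0, -70.0]  # made-up mean episodic rewards
new_population = next_generation(population, fitnesses, rng)
```

In a training loop, each member would train for a number of episodes, be evaluated to obtain its fitness, and then the population would be replaced by `new_population` before the next round.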
Results
Simple Spread
Independent Learners – Training Curve
Independent Learners – Learned Behavior
Fully Observable Critic – Training Curve
Fully Observable Critic – Learned Behavior
Simple Speaker Listener
Independent Learners – Training Curve
Independent Learners – Learned Behavior
Fully Observable Critic – Training Curve
Fully Observable Critic – Learned Behavior
Final Rewards
| Environment | Independent Learners | Fully Observable Critic |
|---|---|---|
| Simple Spread | -48.83 | -41.05 |
| Simple Speaker Listener | -53.12 | -35.55 |
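To make the gap in the table easier to read, the short script below computes the absolute gain of the Fully Observable Critic over Independent Learners, and the gain relative to the magnitude of the Independent Learners reward. The values are copied from the table; the choice of normalization is an assumption of this example, not a metric reported by the project.

```python
# Final rewards from the table: (Independent Learners, Fully Observable Critic).
final_rewards = {
    "Simple Spread": (-48.83, -41.05),
    "Simple Speaker Listener": (-53.12, -35.55),
}

def improvement(independent, centralized):
    """Absolute gain and gain as a percentage of the independent reward's magnitude."""
    gain = centralized - independent
    return gain, 100.0 * gain / abs(independent)

summary = {env: improvement(*rewards) for env, rewards in final_rewards.items()}

for env, (gain, pct) in summary.items():
    print(f"{env}: +{gain:.2f} reward ({pct:.1f}% of the independent reward's magnitude)")
```

This gives a gain of about 7.78 (15.9%) in Simple Spread and about 17.57 (33.1%) in Simple Speaker Listener, so the centralized critic helps most in the task that depends on communication.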
Conclusions
- The Fully Observable Critic reached a higher final reward than Independent Learners in both environments (-41.05 vs. -48.83 in Simple Spread, -35.55 vs. -53.12 in Simple Speaker Listener), but required longer training time.
- Independent Learners trained faster, but showed higher variance and weaker coordination.
This project highlights the trade-off between decentralized efficiency and centralized performance in cooperative MARL.
Repository
👉 Source code and experiments: SOAS-MADRL on GitHub
https://github.com/marcmonfort/SOAS-MADRL