I have been reading “A Brief History of Intelligence” by Max Bennett, which takes the reader on a fascinating journey through the history of human intelligence and the physiology that has developed over millions of years.
One of the beautiful concepts highlighted in the book is the temporal difference signal, and the story of how a collaboration between AI researchers and neuroscientists led to a better understanding of dopamine and also to improved AI algorithms.
Temporal Difference (TD) and AI
A temporal difference (TD) signal is the difference between predicted future rewards and the rewards actually received. In reinforcement learning it is used to update value estimates, enabling agents to learn incrementally from experience and improve their decision-making over time.
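To make the definition concrete, here is a minimal sketch of the standard one-step TD error in Python (my own illustration, not from the book; the function name, variable names, and discount factor are assumptions chosen for the example):

```python
def td_error(reward, value_next, value_current, gamma=0.99):
    """One-step temporal difference error: the reward just received, plus the
    discounted estimate of future reward from the new state, minus what the
    old state's estimate had predicted."""
    return reward + gamma * value_next - value_current

# Positive error: things went better than predicted, so reinforce.
# Negative error: things went worse than predicted, so discourage.
```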
The concept of reinforcement learning has been around since the 1950s. However, it wasn’t until 1984 that Richard Sutton’s PhD dissertation laid one of the intellectual cornerstones of the reinforcement learning revolution.
The key challenge with reinforcement learning was this: when training an AI to perform a task, it is not sufficient to simply reinforce recent moves, and it is virtually impossible to reinforce every possible sequence of moves (there are far too many combinations) that could lead to the optimal outcome.
Bennett illustrates this challenge using the example of training an AI to play checkers, and the logic extends to any other multi-step game. The actual result of any multi-step game is the outcome of many individual steps, and it is incredibly difficult to figure out how good any particular move was based on the final outcome of the game (side note: readers in the investment profession will recognize the similarity with investment decision-making).
For example, you might play a bad move toward the end of the game and still win because of a really good move you played early on. If you train an AI to choose its next move based on actual outcomes, it may assign a high value to a poor late move even though an earlier move was responsible for the win. The alternative, evaluating the outcome of every possible game, is not technically feasible: as Bennett points out in his book, there are 10^120 possible games of chess, more than the number of atoms in the universe.
Richard Sutton’s key insight was to reinforce learning based on predicted (expected) rewards, not on actual outcomes.
Instead of trying to train a system on outcomes, Sutton proposed training it to predict the probability of winning at each step of the game. His model separates the system into an actor (which makes the actual moves) and a critic (which predicts the likelihood of winning at each step). The actor gets rewarded (or penalized) not at the end of the game but at each step, based on the updated probabilities and expectations coming from the critic.
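As a rough illustration of that actor/critic split, here is a toy sketch of my own (not Sutton’s original formulation or anything from the book): the agent walks along a short corridor and only the rightmost position counts as a win. The critic keeps a value estimate per state, the actor keeps softmax action preferences, and both are nudged by the TD error at every single step rather than by the final result alone. The task, constants, and update rules are assumptions chosen for brevity.

```python
import math
import random

N_STATES = 6        # corridor positions 0..5; position 5 counts as a "win"
GAMMA = 0.95        # discount factor
ALPHA_CRITIC = 0.1  # critic learning rate
ALPHA_ACTOR = 0.1   # actor learning rate

values = [0.0] * N_STATES                      # critic: predicted future reward per state
prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: preferences for (left, right)

def choose_action(state):
    """Sample an action from a softmax over the actor's preferences."""
    exps = [math.exp(p) for p in prefs[state]]
    r = random.random() * sum(exps)
    return 0 if r < exps[0] else 1

for episode in range(2000):
    state = 0
    while state != N_STATES - 1:
        action = choose_action(state)                         # 0 = left, 1 = right
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0

        # TD error: (reward + discounted prediction for the next state)
        # minus the prediction for the state we just left.
        bootstrap = 0.0 if next_state == N_STATES - 1 else values[next_state]
        td = reward + GAMMA * bootstrap - values[state]

        values[state] += ALPHA_CRITIC * td        # critic refines its prediction
        prefs[state][action] += ALPHA_ACTOR * td  # actor is reinforced step by step

        state = next_state

print([round(v, 2) for v in values])  # estimates rise toward the winning end
```

After training, the learned values rise smoothly toward the winning end of the corridor and the actor’s preferences tilt toward moving right, even though only the final step ever pays out a reward directly.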
Sutton found that the interplay between actual moves and predicted outcomes led to a continual recalibration of expectations, and that by optimizing against those expectations, AI agents could effectively improve their likelihood of achieving the end result.
Gerald Tesauro, an engineer at IBM, used Sutton’s work to create a practical implementation. His system, TD-Gammon, was by 1994 playing backgammon as well as some of the top human players, and it laid the foundation for using TD learning to train AI systems on human tasks.
However, the most fascinating result of TD research was not in the world of AI, but rather back in the physical world of the brain.
Temporal Difference (TD) and Dopamine
Sutton, a psychologist by training, was inspired by the physical world but hadn’t been able to make an explicit connection between his idea and the brain. One of his students, Peter Dayan, and a postdoc, Read Montague, both at the Salk Institute in San Diego, were convinced that brains implement some form of TD learning and began searching for evidence in neuroscience data.
In neuroscience, it was established that the neurotransmitter dopamine was related to reinforcement and reward. However, research had also shown that dopamine was less about pleasure/liking and more about wanting.
In the 1990s, Dayan and Montague came across research on the relationship between dopamine and reinforcement, including data from the 1980s generated by a German neuroscientist named Wolfram Schultz. Schultz had used a set of simple reward/response experiments with monkeys to measure the activity of individual dopamine neurons under various scenarios.
Schultz’s work demonstrated that dopamine was not a signal for reward or pleasure tied to outcomes; rather, dopamine was triggered by the anticipation of outcomes. However, Schultz and others in neuroscience were not sure how to interpret the data.
When Dayan and Montague looked at Schultz’s data through a TD-learning lens, they realized that the dopamine responses aligned exactly with a temporal difference signal. The collaboration between AI scientists and neuroscientists led to a monumental insight: dopamine is not a signal for reward but for reinforcement.
Brains reinforce behaviors based on changes in predicted future rewards, not actual rewards.
In 1997, Wolfram Schultz, Peter Dayan, and P. Read Montague published their findings in the groundbreaking paper "A Neural Substrate of Prediction and Reward."
The core concept presented in the paper is the idea of "prediction error" in the brain’s reward system. When an outcome is better than expected, dopaminergic neurons increase their firing rate, signaling a positive prediction error. Conversely, if the outcome is worse than expected, these neurons decrease their activity, indicating a negative prediction error. This discrepancy between expected and actual outcomes drives learning and adaptation, allowing organisms to better predict and respond to future events.
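To see how a prediction error can migrate from the reward to the cue, here is a toy TD(0) simulation of a Schultz-style cue-then-reward trial (my own sketch, loosely in the spirit of the models discussed in the paper, not the authors’ code; the trial length, learning rate, and reward size are made-up illustrative values). Before learning, the error spikes when the reward arrives; after learning, it appears at the cue, and withholding the reward produces a negative error, the dopamine dip.

```python
ALPHA = 0.1    # learning rate
T = 5          # within-trial steps: 0 = pre-cue, 1 = cue, 2-3 = delay, 4 = reward time
V = [0.0] * T  # predicted future reward at each point in the trial

def run_trial(reward_delivered=True, learn=True):
    """One trial of TD(0); returns the prediction error at each time step."""
    errors = []
    for t in range(T):
        at_end = (t == T - 1)
        reward = 1.0 if (at_end and reward_delivered) else 0.0
        v_next = 0.0 if at_end else V[t + 1]   # the trial ends after the reward step
        delta = reward + v_next - V[t]         # prediction error (no discounting)
        if learn and t > 0:                    # the pre-cue state stays at 0 because
            V[t] += ALPHA * delta              # the cue's arrival cannot be predicted
        errors.append(round(delta, 2))
    return errors

print("before learning:", run_trial(learn=False))         # error appears at the reward
for _ in range(300):
    run_trial()                                            # learn that the cue predicts reward
print("after learning: ", run_trial(learn=False))          # error has moved to the cue
print("reward omitted: ", run_trial(False, learn=False))   # negative error at reward time
```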
A World of (Temporal) Difference: Biology, AI, Investing, and Philosophy
The work of Dayan, Montague, Schultz, and many others has profound implications for understanding not only basic brain functions but also complex behaviors and disorders related to the reward system, such as addiction and depression. Such an understanding might never have emerged without the collaboration across AI, psychology, and neuroscience.
TD learning, with its focus on expectations versus actual outcomes, created a great leap forward in the AI world, improving reinforcement learning. Over time, by updating its policy based on TD errors, an AI agent learns the optimal path. This process mirrors the brain's use of dopaminergic prediction errors to refine expectations and behaviors. Variations of this concept power many modern AI approaches, from self-driving cars to video generation to complex optimization problems.
TD also offers a valuable lesson for investment systems. In highly uncertain environments, the objective of a system is to maximize its ability to perform across multiple scenarios. Natural systems evolved to do this by separating predicted from actual outcomes and by assigning credit and making predictions based on expectations, not outcomes. Investors may be better served by taking a similar approach: instead of fixating on the outcome and seeking immediate results, focusing on maximizing expectations may end up increasing the probability of success.
Conceptually, TD also bears a resemblance to ideas in philosophy. The ancient Bhagavad Gita, for example, expresses the view that individuals should be unattached to outcomes and should instead focus on the process and the goal. Perhaps that is the ultimate expression of temporal difference: stay focused on the process, update the process based on expected probabilities, and remain unattached to the outcome.
Further Reading:
A Brief History of Intelligence by Max Bennett
Schultz, Wolfram, Peter Dayan, and P. Read Montague. "A Neural Substrate of Prediction and Reward." Science, March 1997.
Note: ChatGPT helped with editing and research