RL and the Problem of Intelligence

 

Artificial Intelligence (AI) aims to reproduce, in software, the salient features of human intelligence. The idea of building an artificial intelligence is old: the first computers were less than a decade old when discussions began on the possibility of implementing intelligence in a computer program. The task may still seem monumental. Even if you are willing to accept that human intelligence

"can be so precisely described,
that a machine can be made to simulate it" (John McCarthy), 

our understanding of the human mind is still so limited that it may appear a fool's errand to hope to reproduce it at this stage. Yet fully understanding and reproducing the messiness of human minds (and brains) may not be a requirement for building intelligence.

Planes were not built by copying the way birds fly: they need neither artificial feathers nor flapping wings to propel themselves. Instead, by understanding the physical principles of fluid dynamics, engineers were able to design a completely different system that nonetheless achieves the same objective. Pulling off the same feat for intelligence may, however, turn out to be significantly harder.

Compared to other engineering projects, building an artificial intelligence poses unique challenges. When designing the first planes, there were clear and broadly agreed metrics for evaluating progress: the distance covered in a flight, the speed of the plane, or its fuel consumption. Researchers in AI, by contrast, don't just disagree on how to build intelligence; they disagree on what intelligence is in the first place.

Disagreement on the very objective of a venture wouldn't normally inspire much confidence in its likelihood of success, but we might want to be a bit more lenient and see it not as a sign of incompetence, but as a sign of the breathtaking ambition of the project: after all, AI does not aim at unlocking a single narrow technology (e.g. flight), but at constructing systems capable of solving problems as many and as varied as humans do. Even if we do cut AI researchers some slack, however, we must recognise this as a fundamental issue. A clear framework for defining the problem of intelligence, and a metric for measuring progress towards building it, is therefore what we seek in this post.

That there may be any controversy on the nature of intelligence may appear surprising at first; after all, we all have an intuitive understanding of what we mean by intelligence. For instance, you probably think you can recognise intelligent behaviour (or the lack thereof) when you see it. Yet turning this intuitive understanding into something measurable and quantitative is trickier than you may expect.

Part of the issue is that, in its informal usage, intelligence is an umbrella term under which we fold a variety of distinct desirable skills and capabilities: learning, planning, logic, probabilistic reasoning, language, or social prowess. Each of these skills has been hailed, at some point or another, as the true essence of intelligence; but this confuses specific solutions with the problem of intelligence itself: any or all of the above may be required for intelligent behaviour, depending on the setting.

In this post, we discuss reinforcement learning as a leading way of formalising the problem of intelligence in a truly general form, such that any of the classic problems of AI (learning, planning, and so on) may be understood in a single, unified context.

Reinforcement Learning

In reinforcement learning (RL), an adaptive system, referred to as the agent, must learn to act in the universe it is embedded in, referred to as the environment, in order to achieve some goal. RL puts few constraints on the specific form of either agent or environment.

An agent may be, for instance, a chess-playing program, which interacts with its environment by observing the pieces on a digital representation of the board and choosing how to move them; in this case the environment includes, at least, the board and the program's opponent (whether a human player or another program). Alternatively, an agent may be a physically embodied robot that must navigate its surroundings and use its actuators to perform some task in the real world (such as vacuuming a house); you may then think of the entire universe, or whatever portion of it the agent has access to, as its environment.
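
To make this interaction concrete, here is a minimal sketch of the agent-environment loop in Python. The environment and agent below (CoinGuessEnvironment and RandomAgent) are illustrative toys of our own invention, not part of any library: the only point is the shape of the loop, in which the agent observes, acts, and receives feedback that it tries to maximise over time.

import random

class CoinGuessEnvironment:
    # Toy environment (our own invention): the agent tries to guess a hidden coin flip.
    def reset(self):
        self.coin = random.choice([0, 1])
        return 0  # the initial observation carries no information in this toy

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0  # scalar feedback on the last action
        self.coin = random.choice([0, 1])             # the environment moves to a new state
        return 0, reward                              # next observation, feedback

class RandomAgent:
    # Toy agent: acts uniformly at random; a learning agent would adapt its choices using the feedback.
    def act(self, observation):
        return random.choice([0, 1])

env, agent = CoinGuessEnvironment(), RandomAgent()
observation = env.reset()
total_reward = 0.0
for _ in range(1000):
    action = agent.act(observation)         # the agent acts on what it observes
    observation, reward = env.step(action)  # the environment responds
    total_reward += reward                  # the agent's goal: make this as large as possible
print(f"Total reward over 1000 steps: {total_reward}")

A chess program or a vacuuming robot fits the same loop; only the contents of the observations, actions, and feedback change.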

Intelligence is, in this framework, the capability of an agent to achieve complex goals in diverse and potentially very complex environments. This definition implicitly induces a partial ordering: we may consider an agent more intelligent if it can learn to achieve a broader set of goals in a broader set of environments. According to this definition, for instance, Deep Blue (the first artificial agent to beat a reigning world champion at chess) was more intelligent, in its limited domain, than any of its predecessors, as it could win against a broader set of chess opponents (including Garry Kasparov!).

Importantly, even a partial ordering provides a path towards measuring progress, and one that is agnostic to the specific solution method implemented by any one agent. It is worth noting, however, that not all types of intelligence will be directly comparable: one agent may perform better in one class of environments, and a different agent may be best in some other class. For instance, Deep Blue is neither more nor less intelligent than TD-Gammon (the first artificial agent to beat professional human backgammon players), as each was effective in a different domain, and exclusively in that domain. The more recent AlphaZero agent could instead be considered more intelligent than Deep Blue, as it mastered not only chess but also other board games such as shogi and Go.
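
One way to make this ordering precise (a sketch of our own, not a formal definition from the literature) is through set inclusion over the goal-environment pairs an agent can master:

\[
A \succeq B \iff \{(g, e) : A \text{ achieves goal } g \text{ in environment } e\} \supseteq \{(g, e) : B \text{ achieves goal } g \text{ in environment } e\}.
\]

Set inclusion is only a partial order: when neither set contains the other, as with Deep Blue and TD-Gammon, the two agents are simply incomparable.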

Reinforcement learning also puts few constraints on the type of solutions that may be considered for the problem of intelligence. Learning, logic, planning, probabilistic reasoning, language, social skills, and other classic topics in AI research are all understood in this framework as specific enablers of goal-oriented behaviour; each of them may or may not be required by an agent, depending on the type of environment and goal.

An Extraordinary Hypothesis

Reinforcement learning is, by construction, a fairly flexible framework; where it does make stronger assumptions is in its notion of a goal. More precisely, RL assumes that goals can be expressed in a particular form, namely

that all of what we mean by goals and purposes can be well thought of as maximization of
the expected value of the cumulative sum of a received scalar signal (Rich Sutton).

This has been referred to in the literature as the Reward Hypothesis, or the Reinforcement Learning Hypothesis, and is, of course, an extraordinary claim. The hypothesis can be backed up formally in the context of Markov Decision Processes and other simplified settings. Under suitable assumptions, for instance, it can be shown that for any behaviour there is a reward function that is maximised by that behaviour, and that it is therefore possible to specify any goal-oriented behaviour through a single scalar reward. Yet whether this is the most appropriate or convenient way of representing goals in order to build an artificial intelligence is a separate question, and one that only time and effort can answer.
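
In the notation of Sutton and Barto's textbook (assuming the standard discrete-time setting, with a scalar reward received at each time step and a discount factor), the quantity being maximised is the expected return:

\[
\mathbb{E}\left[G_t\right] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}\right], \qquad 0 \le \gamma \le 1,
\]

where the discount factor \(\gamma\) trades off immediate against long-term reward. The reward hypothesis is then the claim that choosing the right reward signal is, in principle, all that is needed to specify any goal.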

Many find the identification of intelligence with reward maximisation problematic: surely human intelligence demonstrates collaborative and altruistic behaviours that go beyond individual reward optimisation. A minimal rebuttal is that cooperation can also be shown to emerge from the maximisation of a narrow, egoistic reward signal; this can be demonstrated theoretically through game theory and, empirically, through careful optimisation experiments. In some sense, it is also shown by evolution: a relentless and single-minded process of natural selection has nonetheless resulted in human beings, endowed with empathy and a capacity for altruistic behaviour.

Yet such an answer often does not feel fully satisfying. Most of us consider ourselves motivated by more than "reward" alone: we behave well towards others not just because it is convenient. Perhaps part of the reluctance to consider reward maximisation as a paradigm for intelligence is also due to the choice of word itself: the name reward gives it a materialistic and selfish feel. This, however, need not be the case.

The notion of reward in RL is actually fairly flexible. The reward hypothesis only posits that any intelligent behaviour can be specified and learned in terms of the maximisation of a scalar signal, but this signal may encode any altruistic feature we feel motivates our (or, more generally, human) behaviour. Maximising your own happiness is a possible reward, but so is maximising the total happiness of all humankind. As they are different rewards, they imply different kinds of goal-oriented behaviour, but both are fully compatible with the reinforcement learning problem.
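
As a purely illustrative sketch (the WorldState type and its happiness field are hypothetical), the two goals above differ only in the reward function handed to the same learning machinery:

from dataclasses import dataclass

@dataclass
class WorldState:
    happiness: dict  # hypothetical toy state: the happiness of each individual, keyed by name

def egoistic_reward(state: WorldState, agent_id: str) -> float:
    # Reward the agent only for its own happiness.
    return state.happiness[agent_id]

def altruistic_reward(state: WorldState, agent_id: str) -> float:
    # Reward the agent for the total happiness of everyone, itself included.
    return sum(state.happiness.values())

state = WorldState(happiness={"me": 0.3, "alice": 0.9, "bob": 0.5})
print(egoistic_reward(state, "me"), altruistic_reward(state, "me"))

An agent maximising the first signal and one maximising the second will behave very differently, but both are solving the same reinforcement learning problem.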

References

We are heavily indebted, for the ideas and concepts expressed in this essay, to the books, blog posts, and papers listed below:

Reinforcement Learning: An Introduction, Rich Sutton and Andrew Barto, 2018

The Reward Hypothesis, Rich Sutton

Reward is Enough, David Silver et al., 2021