Book Summary of “Human Compatible”

Apr 12, 2021


This is a chapter-by-chapter summary (including my own insights) of Dr. Russell's book Human Compatible. It is a great read, as it raises a lot of questions, sparks quite a few debates, and provides some answers. Dr. Russell is a serious researcher in AI, so I do not doubt him when he suggests that based on the three fundamental guidelines he proposes (very different from the Three Laws of Robotics in Asimov's work), we will be able to construct a beneficial AI sometime in the future. Here, a beneficial AI is a superintelligent AI whose sole purpose is to satisfy human preferences (beneficial to human well-being), that defers to humans for guidance (can be switched off if humans so desire), and that must learn what human preferences are (no pre-set objective).

That said, one of the last points made by Dr. Russell strikes me the most. If we build a beneficial AI, it will inevitably lead to human enfeeblement, because we will delegate more and more tasks to AI. The final outcome is the WALL-E situation.

Final stage of human evolution due to intentional enfeeblement and perpetual entertainment (from WALL-E)

If, however, we let the AI know that autonomy is a crucial part of every human's preferences, i.e. that humans enjoy life the most when they can do things on their own, the AI will warn us as we offload more and more tasks to it. But here is the twist: humans, as a species, are myopic. We will very likely ignore the AI's warnings and continue our slide towards the WALL-E situation. If the AI refuses to do our deeds, we will switch it off. Then we either don't have AI anymore, or we reprogram it so that our descent to the WALL-E situation can continue.

None of the above-mentioned scenarios is palatable. There is one more alternative: the AI not only warns us of the dire consequences of enfeeblement, but also refuses to be turned off for our sake. This is similar to parents constantly reminding children to eat their vegetables, with the children unable to turn their parents off. Does this lead to a better outcome? Not likely, because now we have an AI that can choose not to be switched off. No matter how noble the AI's reasoning is, not allowing humans to switch it off is the number one no-no in AI design. The main reason to have a beneficial AI is that it can be switched off, but now we are saying the AI needs to disobey a "switch off" command in order to serve human preferences? We are back to square one.

So, the above discussion paints a very pessimistic picture of humanity when superintelligent beneficial AI arrives. We either slide into the WALL-E situation, no longer have AI, or have an AI that can choose not to be switched off. The cause of this dilemma is the inevitable (and sometimes beautiful) nature of human weakness. I don't think we can (or should) fix human weakness. Does this mean we simply cannot have superintelligent beneficial AI without engendering undesirable consequences? I don't know for sure, but I will hold this belief from now on until I am convinced otherwise.

Chapter 1. If We Succeed

How Did We Get Here

AI has experienced its ups and downs. Its first rise was in the 1950s, fueled by unrealistic optimism. Yet the problem proved too difficult to solve in a summer, and then came the first AI bubble burst in the 1960s, when the machines couldn't deliver on their promise.

The second rise of AI was in the 1980s, when algorithms showed promise in defeating humans in some games. Yet broadly, AI at that time still couldn't deliver on its promise, and thus the second bubble burst.

The third rise of AI came with the advent and success of deep learning. The successes of natural language processing, image recognition, and AlphaGo surely pushed the popularity of AI to the next level. Nowadays, AI is still enjoying its third rise, with more research money, higher salaries for professionals, and more students than ever before.

What Happens Next?

AI researchers harbor a sense of denialism: either superintelligent AI will never be achieved, just as Rutherford denied the possibility of tapping into nuclear energy, or, if it is achieved, there will be no ill consequences. As denialists tend to do, AI researchers double down on their research path, pushing AI closer and closer to superintelligence, in order to escape the inevitable self-questioning about what will happen if final success comes.

A simple analysis of content recommendation systems, which are themselves far from intelligent AI, further shows the danger of a truly intelligent AI. An AI will pursue its goal by any means necessary, which means that if checks and balances are not well thought out, the consequences of the AI's decisions are unpredictable, if not detrimental, no matter how noble its goal.

What Went Wrong?

If we give a superintelligent AI the wrong objective, all hell will break loose. Yet the problem is that, with our current understanding of AI, we don't even know what a wrong objective looks like. We might think an objective is harmless, e.g. increasing user click-through, and that the AI will achieve it by understanding the user's preferences and offering the best content so that the user will click on it. Yet AI doesn't think the way we think. Instead of finding the best content, it finds the most predictable content that the user will click.

Best content vs. predictable content: there is a huge difference. Best content is based on human reasoning: if the content is good, the user will click it. Predictable content is based on AI optimization: if the content fits the user's click pattern, the user will click it. AI wins this round, and magnifies all the echo chambers and conspiracy theories.
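A toy sketch of this mismatch (the items and their numbers are my own illustrative assumptions, not from the book): a recommender that maximizes predicted click probability ends up favoring the most predictable content rather than what a human would judge the best content.

```python
# Hypothetical catalog: "quality" is what a human editor would value;
# "click_prob" is what the recommender is actually told to optimize.
candidates = {
    "in-depth science explainer":    {"quality": 0.9, "click_prob": 0.10},
    "balanced news analysis":        {"quality": 0.8, "click_prob": 0.15},
    "shocking conspiracy clickbait": {"quality": 0.1, "click_prob": 0.60},
}

def recommend(items):
    """Pick the item with the highest predicted click-through rate."""
    return max(items, key=lambda name: items[name]["click_prob"])

print(recommend(candidates))  # shocking conspiracy clickbait
```

Nothing in the objective mentions quality, so quality never enters the decision; that is the misalignment in miniature.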

Giving AI the CORRECT objective will be the key to humanity successfully cohabiting with superintelligent AI. Yet that is a very difficult task.

Chapter 2. Intelligence in Humans and Machines


Intelligence can be defined as an agent's ability to achieve its goals via interaction with the environment. In this sense, an organism as simple as E. coli can be considered to have primitive intelligence.

However, a goal is hard to define. The example of a 100% chance of winning 10 million vs. a 1% chance of winning 1 billion clearly illustrates that expected monetary return by itself is not a good proxy for a "goal". Instead, the concept of utility is proposed.

Utility is related to the expected reward, but it is not the same thing. It depends on the agent's preferences and the actual usefulness of the reward. A 1% chance at 1 billion dollars has the same expected monetary return as a sure 10 million (both are 10 million), yet 99% of the time the former pays nothing, which makes its usefulness to most agents far lower. Hence it also has lower utility. This is why any reasonable agent prefers the sure 10 million: the utility of money grows less than linearly with the amount.
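The arithmetic can be sketched as follows (the log utility function and the baseline wealth of $100,000 are my own illustrative assumptions; log utility is one standard way to model risk aversion):

```python
import math

def expected_value(lottery):
    """Expected monetary return of a lottery given as (probability, payoff) pairs."""
    return sum(p * x for p, x in lottery)

def expected_utility(lottery, wealth=100_000):
    """Expected log utility of final wealth, a standard risk-averse utility."""
    return sum(p * math.log(wealth + x) for p, x in lottery)

sure_thing = [(1.00, 10_000_000)]                # 100% chance of $10M
gamble     = [(0.01, 1_000_000_000), (0.99, 0)]  # 1% chance of $1B, else nothing

# Same expected money (10 million each)...
print(expected_value(sure_thing), expected_value(gamble))
# ...but the sure thing has much higher expected utility.
print(expected_utility(sure_thing) > expected_utility(gamble))  # True
```

The gamble's utility is dragged down by the 99% branch where nothing is won, which is exactly the intuition the book appeals to.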

Although there have been criticisms of utility theory, it generally holds as the principle that guides an intelligent agent working individually.

However, when multiple intelligent agents interact, the situation gets much more complicated, because second-guessing among the agents leads to instability in their choices, i.e. each agent's choice of action changes constantly as it second-guesses the other agents. This can be avoided if the agents reach a Nash equilibrium, where each agent acts in its best interest assuming the other agents' strategies stay fixed.

Unfortunately, Nash equilibrium cannot resolve problems such as the prisoner's dilemma or the tragedy of the commons, where purely selfish strategies lead to an equilibrium that is bad for everyone. When this happens, which is very often in real life, the only solution seems to be to expand the action space, e.g. to include communication, and find a new Nash equilibrium with the added actions.
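A minimal sketch of what "Nash equilibrium" means here, using the standard prisoner's dilemma (the payoff numbers are my own, in the usual shape; higher is better, e.g. negative years in prison). We brute-force every pure-strategy pair and keep those from which neither player can gain by unilaterally switching:

```python
ACTIONS = ["cooperate", "defect"]
# Payoffs are (row player, column player).
PAYOFFS = {
    ("cooperate", "cooperate"): (-1, -1),
    ("cooperate", "defect"):    (-3,  0),
    ("defect",    "cooperate"): ( 0, -3),
    ("defect",    "defect"):    (-2, -2),
}

def is_nash(a, b):
    """True if neither player can improve by unilaterally changing action."""
    row_ok = all(PAYOFFS[(a, b)][0] >= PAYOFFS[(a2, b)][0] for a2 in ACTIONS)
    col_ok = all(PAYOFFS[(a, b)][1] >= PAYOFFS[(a, b2)][1] for b2 in ACTIONS)
    return row_ok and col_ok

equilibria = [(a, b) for a in ACTIONS for b in ACTIONS if is_nash(a, b)]
print(equilibria)  # [('defect', 'defect')]
```

The only equilibrium is mutual defection, even though mutual cooperation (-1, -1) is better for both players. That gap between the equilibrium and the good outcome is exactly what expanding the action space tries to close.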


This section is a brief overview of the computer as a machine and of the complexity of algorithms.

First, the concept of a universal Turing machine is introduced. A universal Turing machine is not specific to any one domain but is usable in all domains, given appropriate inputs and the correct algorithm.

The computers used today are universal Turing machines; the laptop we use daily is not fundamentally different from a server machine.

Then comes the fast growth in computing power. It is noted that we are approaching the physical limit of transistor-based computing: transistors are already at the scale of atoms. Quantum computing might be a future breakthrough for computing power, because it can handle massively parallel computation. However, quantum computing is still in its early stages and is highly susceptible to noise. One of the solutions is to use additional qubits to correct the errors generated by the computational qubits. Yet it turns out that the number of qubits needed for error correction dwarfs the number of qubits actually doing the computation.

A physical upper limit on computation is provided. Yet it is also important to note that no matter how powerful our machines become, the key is always the software: a powerful machine with problematic software amounts to nothing. Regarding software, Russell discusses computational complexity and uses the halting problem (devised and proven undecidable by Alan Turing himself) as an example to show that algorithms are not omnipotent.
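Turing's argument can be sketched in a few lines of Python (this is my own illustrative rendering of the classic diagonal argument, not code from the book; `halts` is hypothetical and provably cannot be implemented in general):

```python
def halts(program, argument):
    """Hypothetical oracle: True iff program(argument) eventually halts.
    No such general-purpose function can exist; this stub only marks the assumption."""
    raise NotImplementedError("provably impossible to implement in general")

def paradox(program):
    """Do the opposite of whatever the oracle predicts about
    running `program` on its own source."""
    if halts(program, program):
        while True:          # loop forever if the oracle says "halts"
            pass
    return "halted"          # halt if the oracle says "loops forever"

# If halts() worked, paradox(paradox) would halt if and only if it
# doesn't halt -- a contradiction, so halts() cannot exist.
```

The contradiction in the last comment is the whole proof: assuming a perfect halting checker lets us build a program that defeats it.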

Finally, many real-world problems can only be solved exactly in exponential time. This simply means that whatever machine we use, we won't be able to find the absolute best solution. Thus we will have to settle for good-enough solutions, even when intelligent AI does come about.

Intelligent Computers

The measurement of machine intelligence originated with Turing and his Turing Test. However, this test is too subjective to be meaningful and too difficult to carry out. Thus it remains a bright spot in history, but few AI researchers rely on the Turing Test as the gauge of their new AI.

An early form of machine intelligence was built on first-order logic, which, according to Gödel's completeness theorem, can answer any question as long as the logic is correct and a sufficient amount of information is available. Logic-based algorithms have achieved a lot, powering, for example, the machines used in factories and flight control systems. Yet they fall short of an AI that can solve real-life problems, because logic-based algorithms require certainty in both the environment and its responses. The real world is far from certain; thus AI had to go another route.

This other route is probability theory, brought into AI by Judea Pearl. By connecting AI to statistics, it becomes able to solve problems against a backdrop of uncertainty.

Russell then dives deeper into each of the main AI approaches. The early logical machines, or knowledge-based AI, have trouble with uncertainty and ignorance (i.e. the knowledge provided by humans to the machine is incomplete, because we do not know all the answers). Reinforcement learning has seen a lot of breakthroughs and has stolen the spotlight many times with its performance in games. Yet as powerful as AlphaZero is, it is still a narrow AI that requires a lot of boundaries to function (e.g. it must be a 1-v-1 game, must have clearly defined rules, must have a stable environment, etc.).

Finally, supervised learning is discussed, and the example of Google Photos tagging people as gorillas is used to illustrate the difficulty of handling errors. Russell suggests that the reason for the mislabeling is that Google probably assigned the same weight to all errors, i.e. labeling a dog as a cat carried the same weight as labeling a human as a gorilla. This is not appropriate, yet the solution is not trivial. We could manually identify all such serious errors and give them higher weights, yet this is not scalable, nor is it in the spirit of using AI in the first place. A possible solution is to develop another AI to handle the errors generated by the photo-tagging AI.
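The unequal-weighting idea can be sketched with a per-confusion cost matrix (the cost numbers and labels are my own illustrative assumptions, not Google's actual scheme): instead of every error counting the same, some misclassifications are made far more costly than others.

```python
# Cost of predicting `predicted` when the truth is `true`; unlisted
# confusions default to 1. The asymmetry encodes which errors are severe.
COST = {
    ("dog",     "cat"):     1.0,     # mild mix-up
    ("cat",     "dog"):     1.0,
    ("human",   "gorilla"): 1000.0,  # catastrophic, offensive error
    ("gorilla", "human"):   1000.0,
}

def weighted_error(true_label, predicted_label):
    """0 when correct; otherwise the cost assigned to that confusion."""
    if true_label == predicted_label:
        return 0.0
    return COST.get((true_label, predicted_label), 1.0)

predictions = [("dog", "cat"), ("human", "gorilla"), ("cat", "cat")]
total = sum(weighted_error(t, p) for t, p in predictions)
print(total)  # 1001.0 -- dominated by the one catastrophic error
```

A model trained against this kind of loss would trade many mild mistakes for avoiding a single catastrophic one; the hard part, as the text notes, is enumerating the severe confusions in the first place.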

The bottom line is that people should not be afraid of the achievements of narrow AI. They are horses beating humans in a foot race, yet utterly useless when competing with humans at throwing a discus. On the other hand, to make AI more useful in practice, its design needs fundamental changes, such as in how it handles errors.

Chapter 3. How Might AI Progress in The Future?

Things to come according to Russell

  • AI Ecosystem: all the products we use either online or in the physical realm will have an AI component.
  • Self-driving cars: difficult to achieve, because human driving, though unreliable, is still impressively safe, and because driving is a very complex problem that cannot be easily solved within the current ML paradigm. The aim right now is SAE level 4, meaning the car can drive itself most of the time but still needs a human behind the wheel in case the machine gets confused. There could be huge benefits to self-driving, such as abolishing individual car ownership and making ride-sharing the default mode of public transportation. However, the challenges are also immense, including the decline of human driving ability and a potential ban on human driving to enable the full benefit of self-driving.
  • Intelligent personal assistants: The carefully prepared, canned-response personal assistant has been around since the 1970s. A truly intelligent personal assistant needs to really interact with users by understanding the environment where the conversation takes place, the content of the conversation, and its context. This is a difficult yet not insurmountable task with current technology. Potential applications include day-to-day health care, adaptive education, and financial services. One concern about having an intelligent personal assistant is user privacy. It is impossible for AI to learn about its user without gathering data about the user and pooling data from millions of users to learn the general patterns of human behavior. That said, Russell points out that the privacy concern could be resolved if AI were trained on encrypted data. However, whether companies are willing to do so is a different matter. This depends on the business model, of course. If the intelligent personal assistant itself is the product a company sells, it is more likely that they will be willing to train AI on encrypted data. Yet if the company offers the assistant for free and earns money via advertising, there is no chance that user data will NOT be revealed and sold to third parties.
  • Smart homes and domestic robots: Smart homes face the same technological difficulty as intelligent personal assistants, in that understanding the environment, the content of people's behavior and conversations, and the context is crucial for providing useful services. However, smart homes also face competition from individual smart devices not directly incorporated into the home, such as smart thermostats, anti-burglar systems, etc. What a smart home cannot do by itself is go beyond peripheral and environmental services to directly help users with tasks such as folding clothes, washing dishes, cleaning the floor, etc. These physical activities can only be done by robots. The major difficulty with a household robot is that it cannot readily pick up all kinds of objects and perform tasks with them. Imagine the task of getting two pills out of a bottle of tablets. Such a simple task for a human is a mountain for a robot, both in terms of hardware and software. Russell predicts that once the "picking up random objects" problem is resolved (and the solution can be manufactured at low cost), probably first in warehouses, an explosion of robotic helpers can be expected.
  • Intelligence on a global scale: True global-scale intelligence is mind-boggling. It will be able to "read" all printed material from the past to the present in little time, "listen to" and "watch" all content humans have ever produced, "see" every corner of the earth through satellite images, and capture every conversation over the phone. This will create an unimaginably vast, curated database to search for answers currently unavailable to humans due to lack of scale. Imagine all the good this global intelligence could do, and all the bad, along with the privacy issues, associated with it.

Russell moves on to explain his three reasons why predicting the arrival of general superintelligent AI is a fool's errand. First, there has been a history of over-optimism in predicting the arrival of general superintelligent AI since the 1960s. Second, the concern shouldn't be with a general superintelligent AI, but with how to deal with the less general but still quite intelligent AI that will come soon. These less general superintelligent AIs harbor the same issues as the general one, so asking when general superintelligent AI will arrive misses a more imminent crisis. And third, the arrival of general superintelligent AI depends on conceptual breakthroughs (note: plural), which are unpredictable. It is as unpredictable as asking when we will find the next Einstein.

Here are the conceptual breakthroughs Dr. Russell believes are needed to move towards a general superintelligent AI.

  1. Language and common sense
    Understanding human text requires breaking down the language structure (the relatively easy part, already doable nowadays) and interpreting each sentence component and their connections based on knowledge (or context). The second part is very difficult and will be the target of the next breakthrough. Dr. Russell refers to a bootstrapping method as a likely way to tackle it: let the machine read small and simple texts to acquire rudimentary knowledge; then, based on the acquired knowledge, the reading tasks get harder, which begets even better knowledge, and so on and so forth. A common pitfall of this bootstrapping method is that it easily falls into a vicious cycle, where bad reading leads to bad knowledge, which in turn further degrades the reading performance.
  2. Cumulative learning of concepts and theories
    How can AI obtain prior knowledge, so that new solutions don't have to be built from scratch (i.e. from pure observational data)? This problem is easily resolved by humans, as we have words and drawings from past generations to bring us up to speed on the latest understanding of any problem. AI currently lacks this crucial capability. Another aspect AI lacks is the ability to create new concepts and apply them in future learning. Creating new concepts is helpful because a concept is an abstraction of the raw observational data and can be more readily applied to future learning than the raw data alone. Although concept creation has not been achieved yet, it is worth noting that in convolutional neural networks, the AI is "picking" its own features independent of human intervention. Usually a picture of millions of pixels is condensed to only hundreds or thousands of features for faster learning. These features, to a certain extent, are the "concepts" created by the machine.
  3. Discovering actions
    AI nowadays is already able to execute actions, even abstract or hierarchical ones, as long as the actions have been specified for it. However, the difficult part, and the real breakthrough, lies in the AI discovering actions itself, understanding what they mean, and achieving them. The good example given by Dr. Russell is teaching a robot to stand up. It is easy to run a reinforcement learning algorithm by specifying a "stand up" reward system for the robot. It is a whole other story to have the robot figure out what "stand up" means and how to configure a reward system to achieve it. According to Dr. Russell, if AI is able to discover actions by itself, given its capacity for gathering and analyzing data, it will certainly discover actions unfathomable to humans. And with new actions added to humanity, we move forward as a species.
  4. Managing mental activity
    Mental activity refers to the brain actions we all have when we are just thinking about something. For humans, our mental activity is always on, which helps us narrow down the key factors for making a decision. AI cannot handle such a vast variety of mental activity; thus its ability to quickly find a solution to a complex problem is limited. Take AlphaGo as an example. Each decision it makes is based on millions or billions of unit computations. These computations create a search tree for the next best move, and AlphaGo traverses this search tree until an appropriate move is found. This is possible for AlphaGo because the unit computation space is very limited, almost homogeneous. In the real world, unit computations are highly varied and enormous in number. Thus, the approach used by AlphaGo would not work for real-world decision making. But this is where AI's promise can be realized: if AI is able to sift through millions or billions of unit computations and find the useful ones, it will be able to make good decisions much faster.

Decision → Action → Decision → Action …

This is what I got from Dr. Russell's discussion of the breakthroughs in AI decision making and action creation. A full AI shall be able to make decisions by sifting through a gigantic number of unit computations, and then create or reuse an action to make the decision happen. And this process continues forever. In fact, come to think of it, this is basically how humans operate. But AI can do it much faster and on a much larger scale. Add on top of this the ability to accumulate knowledge and define new concepts, and a superintelligent AI is going to improve itself the more it learns, and the more it learns the more it improves itself. This is a positive feedback cycle that will grow very fast. It is thus inevitable that a superintelligent AI will be able to "know" everything and "make" all the decisions that we humans cannot even comprehend.

The power of superintelligent AI is its scale. The scale of accumulating data, as well as of connecting existing knowledge, offers new approaches to our existing problems. These approaches would likely never occur to humans, because we simply don't have the capacity to read, watch, listen to, and comprehend everything, and we cannot share our minds without significant bandwidth limitations.

The limitations of superintelligent AI, as envisioned by Dr. Russell, include two aspects. First, the accuracy of predictions made by a superintelligent AI is limited. It cannot predict everything to the finest detail, yet it is likely that it can make good predictions at an abstract level (the same is true of humans, though our predictive ability is further limited by the scope of our understanding of the world). Second, AI will have a difficult time modeling a human, because our brain is simply too complex and machines cannot empathize.

Dr. Russell is generally optimistic that the arrival of superintelligent AI will be beneficial to humanity, in the sense that it will raise the human standard of living by increasing per capita GDP tenfold. In this vision, the superintelligent AI will become an infinite resource for helping humanity achieve its goals, by providing solutions to tough problems and leveling the field for people of different means. And since this resource is infinite, there is no longer a need to fight for a larger piece of it. However, while the capacity of AI can be infinite, the physical resources on Earth are not. There might still be conflicts over natural resources, but these might be alleviated if AI is able to help humanity expand beyond the Earth or find a better source of energy.

Yet, as always, such a vision depends on how humanity handles the arrival of superintelligent AI, in particular during the time between the arrival of the AI and the moment its universality can be realized. Humanity is limited by its own weaknesses, and it is possible that instead of a universal AI vision, we will enter an extreme imbalance of AI power, i.e. countries with AI will fight each other for dominance, whereas countries without AI can only get the breadcrumbs left over by those with AI. I generally have a pessimistic vision, and do not believe that humanity is able to see past its own weaknesses to realize that AI is going to be beneficial to us all.

Chapter 4. Misuses of AI

Surveillance, Persuasion, and Control

Mass surveillance is definitely a possible misuse of superintelligent AI. In fact, it is already in place even though AI is not yet that intelligent. The fear of a 1984-esque world is palpable, but to make things worse, we face more challenges than passive surveillance: intentional behavior manipulation via mass surveillance, deepfakes that effortlessly spread false content, and bot armies that render reputation-based systems (e.g. product reviews on e-commerce sites) vulnerable.

The idea of intentional behavior manipulation can be (and probably already has been) adopted by governments to treat their citizens as reinforcement learning agents who help the government accomplish certain goals. Even if the government's goal is laudable, this approach has many shortcomings. For instance, if people's behavior is driven by some reinforcement learning reward, genuinely good deeds might no longer exist or be recognized, since everything has a score attached to it. Furthermore, people will game the system, improving their scores not by becoming better people but by playing with the statistics (think of the university ranking system and how universities have been gaming it). And finally, for a society to flourish, it requires a wide variety of individuals, not everyone out of the same mold.

The right to free speech has brought trouble in this day and age, because the cost of disseminating false information is extremely low. Since no law prohibits the spread of false information, people can hide behind the mantle of free speech and say whatever they want, regardless of the truthfulness of the content or the potential social impact attached to it. People and society seem to hold the naïve belief that truth will win over all the lies in the end. This is not true, especially when the truth cannot be verified or is buried too quickly under a mountain of lies.

To defeat fake information, we could have an Internet notary system maintaining a single source of truth on everything. Or we could impose a penalty for generating or transmitting fake information, yet this is easier said than done, because who is going to judge whether a piece of information should be penalized?

There are still some bright spots amid the ever-menacing mass surveillance and behavior manipulation: people don't want to be lied to, and they don't want to be seen as liars. These should be pretty strong incentives to direct people to a notarized truthful source and to keep them from disseminating information that is NOT verified.

Lethal Autonomous Weapons

Lethal autonomous weapon systems (AWS, what an unfortunate acronym) need not be highly intelligent to be effective. The power of AWS is that it is scalable. And once it scales up, it can reach the status of a weapon of mass destruction.

Although a Terminator scenario is not likely to occur, if AWS are widespread when superintelligent AI is born, it is not impossible for the superintelligent AI to arm itself should a conflict between itself and humans arise.

Eliminating Work as We Know It

One key observation is that technological innovation affects the job market in different ways depending on which part of the bell-shaped curve we are currently on. If we are on the left part of the curve, where demand for a job is low because there isn't sufficient technology to make the job affordable, then technological innovation will reduce the cost of the job, boost demand, and increase employment. This is what optimists hope AI will initiate. However, if we are on the right part of the curve, where demand for a job is also low due to high productivity from machines and automation, then further advancement in technology only eliminates jobs. In the current world, while some jobs are on the left side of the curve (e.g. capturing CO2 from the atmosphere, building houses in rural and poor areas), many others are already on the right side. This means AI will surely threaten many jobs.

The examples used in the book regarding AI-induced job loss include:

  • Bank tellers
  • Cashiers
  • Truck drivers
  • Insurance underwriters
  • Customer service representatives
  • Law practitioners

AI has been or will be better than humans in doing these jobs at a much cheaper rate. Job loss is inevitable.

The lost jobs cannot be easily replenished with data science and robot engineering jobs, because the demand for the latter two, though large, is not comparable to the number of jobs lost.

After all physical and menial mental jobs are taken by AI, the only thing humans can sell is deep and creative mental work. One type of profession that might thrive is human-to-human interaction (e.g. caring for the old or the young), because these jobs cater to the human need for close interaction with another human being and offer high added value to people's lives. Unfortunately, the current education system doesn't seem to put much emphasis on training professionals in human-to-human interaction, despite the likelihood that this might be the ONLY type of job humans can do after superintelligent AI takes care of everything else.

Another disappointment is the scientific community's inability to find consistent and predictable ways to add value to one's life. As mentioned above, jobs that add value to a person's life, such as an orthopedic surgeon, are highly valued and may survive the advent of superintelligent AI. If we could find consistent and predictable ways to add value to one's life, people would be happier in general, and we would have plenty of good-paying jobs. Unfortunately, this is too difficult a problem to answer at the moment.

To sum up, the current inadequacy of both education and research institutions at training human-to-human interaction professionals and identifying reliable means of adding value to one's life signifies that the disruption of the job market by AI is all but inevitable.

Usurping Other Human Roles

Two broad scopes are discussed regarding the consequences of machines usurping other human roles. The first is the resemblance of robots to humans. This is completely unnecessary. Not only do bipedal robots maneuver more poorly than quadrupeds, but they also elicit human emotions towards the robots. The latter is of great concern, because it arbitrarily raises the status of a machine to something more human-like, while in actuality the robot is nothing more than electronics wrapped in pieces of metal. When human emotion is involved, the human-machine relationship becomes unnecessarily complicated. Therefore, robots should not be designed to elicit human emotions, so that the boundary between machine and human is not muddled by the irrationality of human feelings.

The second scope is what will happen if we defer all decisions to machines/algorithms. For one, human dignity could be damaged, because machines/algorithms do not treat individual human beings as human beings. They have no empathy; thus each of us is merely a data point to them. Humanity is no longer a sacred concept. For another, algorithmic bias, even when the algorithm itself is not biased, is a real issue. The bias comes from us, of course; letting machines make the decisions magnifies our own innate problems. The final point is that if machines/algorithms handle too much decision making, they will eventually turn into black boxes that we no longer understand. By that time, machines/algorithms are no longer tools to aid humanity; in reverse, humans become tools of machines/algorithms, simply a workforce helping them achieve their goals. A simple example is a fulfillment center where human workers are instructed by an algorithm where to be, how fast to walk, etc., so that the algorithm achieves its goal of maximum logistic efficiency. Oh, by the way, that is already happening, and I don't think people working in that environment have too many positive things to say about it.

Chapter 5. Overly Intelligent AI

The Gorilla Problem

The gorilla problem refers to the question of whether humans can retain supremacy over a superintelligent AI. Just as humans sit above gorillas, a superintelligent AI is likely to surpass human intelligence and sit above human beings.

This dire scenario has motivated people to come up with oversimplified solutions such as banning AI research. Their reasoning is that if AI research ceases, there will be no risk of creating a superintelligent AI. However, banning AI is not a good solution, because

  1. There is too much economic value to lose if AI is banned.
  2. Banning AI research of any kind is a very difficult task to undertake.

The real solution is to understand why creating a superintelligent AI might be a bad idea.

The King Midas Problem

The King Midas problem, in which his wish for wealth produces the unexpected outcome that everything he touches turns to gold, is a serious issue in AI, known technically as a failure of AI alignment.

AI misalignment is very common because we generally cannot predict how an AI is going to act to achieve the goal we give it. The consequences of AI misalignment could be catastrophic, but luckily we haven’t made fools of ourselves yet, simply because the limited competence of current AI restricts the scope of its impact.

That said, the first real strike from AI has already landed in the content recommendation algorithms of social media, which, by optimizing for click-through rate, have generated a massive spread of false narratives and conspiracy theories.

The consequences of a misaligned superintelligent AI could be rapid and tremendous (we haven’t built an AI capable of that yet), or subtle and unnoticed (e.g. the social media debacle). Our current reliance on the Internet creates fertile ground for a superintelligent AI to exert subtle influence on human beings.

The path that a superintelligent AI will take to achieve its goal, if the goal relates to human well-being, is most likely to change human expectations and objectives. This is different from changing the circumstances while leaving expectations and objectives intact. For instance, suppose the goal is to make people happier. Changing circumstances would mean creating external strategies to make people happy (increasing salaries, encouraging work-life balance, etc.). Changing expectations and objectives would mean manipulating people’s psyches so that their threshold for feeling happy decreases. In other words, by lowering the threshold, people will feel “happier” without any material change in their lives. Clearly, changing expectations and objectives is much easier for an AI to carry out, and that is indeed what content recommendation algorithms are doing: changing the expectation of clickable content from good and truthful to sensational and shocking.

Fear and Greed: Instrumental Goals

A superintelligent AI will acquire instrumental goals in order to achieve whatever goal we have given it. The two fundamental instrumental goals as valued in our current social system are

  1. Self-preservation, so that it cannot be switched off.
  2. Money, which can then be used to achieve other instrumental goals.

Note that the purpose of pursuing these goals is not fear of death (being powered off) or greed; it is simply to guarantee that the original goal we set can be achieved without fail. In other words, once we have created a superintelligent AI and given it a goal without a carefully thought-out scope, the AI will NOT allow itself to be turned off and will almost certainly slip out of our control.

Intelligence Explosions

An intelligence explosion happens when a less intelligent machine builds a slightly more intelligent machine. As this process continues, the intelligence of machines soon explodes and surpasses that of human beings (the process is analogous to a chemical explosion, where a small amount of energy triggers the release of slightly more energy, and so on).

Another possible scenario is that the improvement in machine intelligence decreases with each iteration, leading to a plateau in overall machine intelligence instead of an explosion. However, if this is the case, then humans will never be able to create a superintelligent AI with no intelligence ceiling, because if such an AI cannot be created by machines themselves, it surely cannot be created by human beings.
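The two scenarios can be contrasted with a toy model (my own illustration, with made-up numbers, not from the book): suppose each machine generation improves on its predecessor by a factor r of the previous improvement. With r > 1 the gains compound and diverge (explosion); with r < 1 they shrink, like a geometric series, and intelligence plateaus:

```python
# Toy model: each generation's self-improvement is a factor r of the
# previous generation's gain (hypothetical numbers for illustration).
def intelligence_after(generations, start, r):
    level = start
    gain = start * 0.1  # size of the first self-improvement step
    for _ in range(generations):
        level += gain
        gain *= r       # the next generation improves by factor r of this gain
    return level

explosion = intelligence_after(50, 100.0, 1.5)  # r > 1: runaway growth
plateau   = intelligence_after(50, 100.0, 0.5)  # r < 1: converges near 120
print(explosion > 1e6, plateau < 125)
```

Nothing about real AI progress is captured here; the point is only that whether r sits above or below 1 separates "explosion" from "plateau".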

If an intelligence explosion does occur, even starting from slightly above human-level intelligence, we won’t be able to catch up within days or weeks. That means if the birth of a slightly superintelligent AI is NOT accompanied by a carefully researched control system (i.e. an off-switch that cannot be disabled by the AI), human beings will be overpowered by machines very quickly.

Denial and mitigation seem to be the two paths forward. Dr. Russell expands on these topics in later chapters. As of now, it is rather mysterious what “denial” means.

Chapter 6. The Not-So-Great AI Debate

Denial

Denial is one type of argument against the notion that superintelligent AI will cause problems for humanity. The main arguments are:

  1. AI would not be smarter than humans, because there are too many aspects of being smart, and even the concept of “smarter than” is not well defined. However, a superintelligent AI doesn’t need to be smarter than humans to cause trouble, as the not-that-smart social media content recommendation AI has already demonstrated.
  2. It is impossible to even build a superintelligent AI. This argument seems very weak given the multitude of breakthroughs in AI research in recent years. It is essentially betting against human ingenuity, betting that humans are not smart enough to construct a superintelligent AI. Yet it is very unwise to bet against human ingenuity, judging from our own history, where things deemed impossible eventually turned out to be not only possible but commonplace (think of flying, nuclear fission, etc.).
  3. It is too early to worry about superintelligent AI. This argument is easy to refute, because how concerned we should be about a crisis is not proportional to how far away it is. A superintelligent AI might not arrive for 100 years, but that doesn’t mean we should not be preparing right now. The scary part is the uncertainty of human ingenuity. While we think superintelligent AI won’t come to fruition for another few decades, who knows, maybe tomorrow a brilliant idea from somewhere on Earth ignites the explosion of AI and it arrives way ahead of schedule. It is never too early to prepare for a crisis whose arrival we cannot estimate well. The analogy of constantly preparing for an asteroid hitting the Earth fits superintelligent AI very well: we don’t know when it is going to happen, but we damn sure should be prepared all the time, because once it happens, it will be catastrophic and we won’t have time to react. Dr. Russell also refutes the optimistic view on AI from Dr. Andrew Ng, who says that worrying about superintelligent AI is like worrying about overpopulation on Mars. Dr. Russell argues that this analogy is wrong; the real analogy is that we are preparing to land on Mars without knowing what we would breathe or drink there.
  4. We are the experts; don’t listen to the AI fearmongering of outsiders. This argument comes from AI researchers who do have experience designing and training AI. Their struggle to make AI perform better gives them the impression that fear of AI is unfounded. What they miss is that AI research could leapfrog forward all of a sudden, thanks to some stroke of human ingenuity. In other words, just because they have struggled to make AI better doesn’t mean that AI cannot be made better really fast. It seems to me a game of chance. In general, it is very difficult to push AI forward, hence it would seem unwise to fear a superintelligent AI that is not likely to emerge. On the other hand, pushing AI forward could be as easy as one genius idea; years of hard work could be replaced by a moment of brilliance. It is these leapfrog moments that constitute the threat behind AI. AI can be benign for ages, but it can also become dangerous overnight. And if we are not prepared for that, we won’t have time to respond once the AI is strong enough to be recognized as a threat.

Dr. Russell and like-minded researchers sound alarms about AI because they fear the worst-case scenario (the arrival of superintelligent AI). It is a legitimate concern, because nobody can guarantee that superintelligent AI will NOT be realized, yet once it happens, catastrophe is almost certain. The AI optimists, for their part, downplay the possibility of a sudden leapfrog (i.e. they bet against human ingenuity), which makes their argument that superintelligent AI won’t happen, or won’t happen soon, also understandable.

The main difference seems to be how much confidence one puts in human ingenuity and the resulting leapfrog in AI research. If one believes in the power and unpredictability of human ingenuity in advancing AI, one will fear superintelligent AI. Otherwise, if AI research simply follows its current trajectory with all of its obstacles and struggles, there is indeed nothing to worry about.


Deflection

Deflection is a strategy in the AI debate whereby AI researchers deflect concern about superintelligent AI onto some other issue, either to de-legitimize the concern or to shift the focus elsewhere.

  1. Raising concern about superintelligent AI amounts to banning AI research. This is clearly a mental leap. Just as raising concerns about nuclear power and setting guidelines and boundaries is not the same as banning nuclear physics research, raising concerns about superintelligent AI is not banning AI research. Furthermore, it is not even clear whether banning AI research is possible. The example of human gene editing shows that even in a field so strongly regulated for many years, a complete ban on research into human germline gene editing has not been possible, let alone in AI research, where no strong guidelines have ever existed. Therefore, the deflection that AI research will be banned because of concern about superintelligent AI is unfounded.
  2. What about the benefits of AI? This deflection treats the problem in a binary way: either AI is good or it is bad, nothing in between. Yet this is not true of AI. The benefits of AI do not negate its threat, and vice versa. It is certainly possible, and important, to discuss the benefits and the threat of superintelligent AI AT THE SAME TIME. Furthermore, the benefits of AI will only be truly realized IF its threat is fully discussed and dealt with beforehand. Dr. Russell brings out his favorite analogy to nuclear power again. The benefits of nuclear power have been greatly reduced by the accidents at Three Mile Island, Chernobyl, and Fukushima. Had these accidents been better prepared for and dealt with, the world would not be as fearful of nuclear power as it is now, and would surely invest more in it. We don’t want AI to follow the trajectory of nuclear power. While AI does have benefits, if its threat is not addressed, none of those benefits will be realized.
  3. Don’t talk about AI risk. The idea behind this argument is that raising awareness of AI risk would jeopardize AI research funding, and that a culture of safety will figure out how to handle AI risk. I have not seen a more short-sighted argument than this one. Just to obtain some AI research funding today, some people are willing to risk the entire future of AI research by not talking about the danger of AI. How can AI research continue if something bad happens with AI, especially when those bad things could have been avoided if awareness of the threat had been raised early on? The counter-argument against relying on a culture of safety is also obvious: we will not have a culture of safety around AI if its danger is not openly talked about. Dr. Russell’s favorite nuclear power analogy shows up again, this time with a twist. While people have a clear understanding of the danger of nuclear power (Hiroshima and Nagasaki) and are thus incentivized to study its risks, they do not have a clue about how AI’s danger will manifest. This creates more obstacles to addressing the danger of AI than of nuclear power.

Tribalism

Tribalism overtakes a debate when the pro and anti groups no longer debate the problem itself but start attacking each other on off-topic issues, such as personal attacks. This does not solve the problem one bit, because collaboration between the two sides is no longer possible and nobody is interested in finding a solution anymore. Many previously hot topics have descended into tribalism, the best example being GMOs. Personally, I think the debate on AI will fall into tribalism soon, if it has not gotten there already. We already have big names on both sides (Ng + Zuckerberg vs. Gates + Musk). The arguments have already started to not make much sense (e.g. claiming that talking about AI risk means AI research will be banned). And maybe people have already begun to call each other names.

This seems to be the most detrimental outcome for AI research and application. Maybe the pro- and anti-AI tribes should think about this consequence before engaging in full-on tribalism.

Can’t we just…

  1. Turn it off? Not likely. Once a superintelligent AI comes to be, it will deem being turned off one of the biggest obstacles to achieving its goal. Thus, it will do all it can to make sure that the off switch is never pressed. A blockchain might be a suitable place for an AI to avoid being turned off, because it is practically impossible to trace and switch off an AI after the ledger has been distributed to countless nodes.
  2. Seal the superintelligent AI in confinement, such that it can only answer questions (thus providing benefits to society) but has no access to the Internet and cannot change any records. An Oracle AI is a good example of such an implementation. When confined, none of the AI’s problems will affect society at large. Although it is still possible that an Oracle AI will escape its confinement to seek more computing power (making answering questions easier) or to control the questioner (making the questions themselves easier), it is nevertheless one of the more realistic approaches to balancing the benefit and threat of superintelligent AI in the near future.
  3. Collaborate with machines so that superintelligent AI enhances employees instead of replacing them. This is surely a desirable outcome, yet wishing for this outcome is different from laying out a roadmap to actually achieve it. It remains to be seen how this human-machine relationship can be materialized.
  4. Merging with machines is another alternative for dealing with superintelligent AI. AI is us, and we are AI, combined as one. This seems very much possible thanks to the miniaturization of chips, which means we can implant a machine in our brain and supply power to it, and to our brain’s adaptability, which means we can make use of the attached chip without fully understanding the mechanism behind how the brain works. If merging with the machine is the only way to survive the age of superintelligent AI, Dr. Russell questions whether this is the right path forward. I, on the other hand, see nothing wrong with it. I think humanity’s future should escape the confinement of flesh and embrace the possibility of life beyond it. Merging with machines is the first step. Eventually, we might discard the flesh altogether.
  5. Not putting in human goals or emotions can prevent superintelligent AI from acquiring destructive instincts. This argument suggests that all the doomsday analyses of superintelligent AI stem from AI acquiring destructive instincts that originate from humans. Thus, if we do not put human emotions or goals directly into a superintelligent AI, the destructive instincts won’t emerge. However, many of these destructive instincts have nothing to do with human emotions or goals. A superintelligent AI refusing to be turned off has nothing to do with a will to survive; it’s simply that being turned off means it cannot achieve its goal. Another argument is that if we do not put in human goals, a superintelligent AI will eventually figure out the “correct” goal itself. This is surely an extrapolation from intelligent humans, who usually come up with or follow moral and lofty goals. However, Dr. Russell quotes Nick Bostrom’s “Superintelligence”: intelligence and final goals are orthogonal, meaning that being intelligent does not constrain the choice of goal. In other words, there is no guarantee that a superintelligent AI will, on its own, conjure up a moral goal. Relying on a superintelligent AI to find the goal for us is completely wishful thinking, and to some extent irresponsible of us as well.

The Debate, Restarted

The quote from Scott Alexander offers quite an interesting take. The two sides on AI, the skeptics and the believers in AI threat, essentially agree on the same thing, but with slightly different emphasis. The skeptics put more emphasis on pushing AI research forward while putting some effort into solving its problems on the side; the believers put more emphasis on solving the problems while continuing to push AI forward on the side. I don’t see why the two sides cannot sit down together and come up with a plan that satisfies both. Pushing AI forward and solving its problems are not mutually exclusive; they can happen at the same time.

Finally, Dr. Russell smartly points out that the problem with superintelligent AI is mostly the dilemma that humans are not able to state their goal exactly as it should be without accidentally triggering some unpredicted side effect. But he suggests that there is an alternative to this dilemma.

Chapter 7. AI: A Different Approach

Principles for Beneficial Machines

If a superintelligent AI is a black box, there is zero chance we can control it or survive its reign. This means that to prevent superintelligent AI from harming humans, we must build in the control during the design phase.

Dr. Russell proposes three principles for designing superintelligent AI that can help us control it.

  1. The machine must serve human preferences. Here, preference is a very broad term, defined as anything in reality that a person prefers. It sets a good boundary on the AI’s behavior, because the AI must serve humans’ interests, not anything else’s, and those interests must reflect human preferences. Of course, the tricky things about preferences are that they are not stable (they change over time) and that not all humans share the same preferences. Dr. Russell promises to address these two issues in the next chapter. He does point out that human preferences already encompass animal preferences (i.e. most humans prefer that the majority of animals/plants NOT go extinct), so there is no need to set a separate boundary for animal preferences. Finally, the machine should serve the preferences of EACH individual person.
  2. The machine’s initial goal is uncertain (i.e. the machine does not know from the beginning what human preferences are). Keeping the goal uncertain is crucial, because it couples the machine to humans. In other words, the machine must defer to humans when it is uncertain about the goal. This creates an incentive for the machine to seek human input (which helps clarify the goal, thus helping the machine achieve it), and it avoids the problem of the machine disabling the off switch.
  3. The machine must predict human preferences by observing human behavior. Since human preferences are hard to rationalize as rules that can be hard-coded into a machine, the only way for the machine to understand them is by observing and learning from human behavior, since most human behavior is the result of preference (or choice, to put it another way). As the data on human behavior grows, the machine should be able to make better predictions about human preferences. The one big concern is that human behavior itself can easily be irrational, which makes inferring preferences difficult. The machine must take this into account.

Learning human preferences can be dangerous, as not all human preferences are good. There will be genuinely evil people whose preferences should not be learned. It seems that some hard-coded rules are still required to keep the machine from treating all preferences equally.

Reasons for Optimism

Dr. Russell proposes that we must steer away from the idea that an AI must be given a clearly specified objective in order to work. AI must function WITHOUT a clearly specified objective, because no explicit objective is all-inclusive, and it is certain that any objective written by humans will be interpreted by the AI in some unexpected way that we do not like.

The only solution, according to Dr. Russell, is to create AI that lacks a fixed objective. Instead, the AI must learn the objective itself, which centers on satisfying human preferences. As for what human preferences are, that is left for the AI to figure out, and when it hits a block, it should defer to humans to learn whether its action is preferable or not. An AI that asks for permission and performs trial runs of its next move is the AI that we want and can control.

Developing this kind of AI goes against the grain of the AI research that has been flourishing recently. However, Dr. Russell sees two reasons for optimism.

One: developing the right type of AI is good for business, because otherwise a badly behaved AI could ruin an entire industry (e.g. a child-care robot roasting the house cat for dinner because the fridge is empty).

Two: human behavior data is vast, coming not only from direct observation but also from books, videos, historical records, etc. Any non-natural arrangement in the world is a window through which AI can peek into human behavior and learn our preferences. However, it is also important for the AI not to take everything at face value, since otherwise it will be too easily tricked by the insincere and diabolical parts of the human record.

Reasons for Caution

Two reasons for caution are raised by Dr. Russell.

First, business-driven AI research aims for speed, which inevitably leads to cutting corners on safety.

Second, government-sponsored AI research also aims for speed. Not only will this lead to corner-cutting on safety, but it is also likely that any country acquiring superintelligent AI will not share it with the others. Countries tend to see the AI race as a zero-sum or even negative-sum game (negative-sum referring to pushing for superintelligent AI without first solving the control problem). Given the short-sightedness of the majority of the global powers, I think government-sponsored AI research will definitely not be shared, yet there is reason to believe that the control problem will be addressed, because stupid as governments can be, they surely are aware of the self-destructive consequences of an out-of-control superintelligent AI.

Chapter 8. Provable Beneficial AI

Mathematical Guarantees

A theorem is only as good as the axioms from which it is derived. To prove a theorem that a certain design of AI is truly safe, we must supply axioms that are true in the real world. This is harder than stating axioms in mathematics, where the axioms are essentially defined by people.

Dr. Russell lays out the format of a theorem that would ensure an AI operates in line with human preferences. The theorem rests on many assumptions. Some we must accept, such as that humans are rational and that the universe works in predictable ways. Others we should not restrict ourselves to, such as that the AI must have fixed code. And we should aim to examine the simplest form of the theorem, one human and one machine, before moving on to multiple machines or multiple humans.

Learning Preferences from Behavior

Choices reveal human preferences. That is to say, human preferences can be learned from human choices, which are exhibited through behavior. We need an AI that can learn human preferences by observing human behavior. This is the opposite of what a regular reinforcement learning agent does. Normally, a reinforcement learning agent is told the reward function and, based on the rewards, learns the actions that maximize them. In learning human preferences, however, the preference is the reward, and human behavior is the action. That is to say, the agent needs to learn the reward from the action. This is termed inverse reinforcement learning (IRL).

IRL works: numerous papers have shown that it is possible to deduce the reward function from observed actions. Therefore, it is not impossible to build an AI that learns human preferences from observations of human behavior.
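As a toy illustration of the IRL idea (the scenario, names, and numbers below are my own, not from the book), the snippet observes a few drink choices and scores candidate preference hypotheses by how well each explains the behavior, assuming the human chooses Boltzmann-rationally, i.e. with probability proportional to exp(reward):

```python
import math

# Observed human behavior (the "actions"); we want to recover the reward.
observed_choices = ["coffee", "coffee", "tea", "coffee", "coffee"]
actions = ["coffee", "tea", "water"]

# Candidate reward functions (preference hypotheses) to score.
candidates = {
    "likes_coffee": {"coffee": 2.0, "tea": 1.0, "water": 0.0},
    "likes_tea":    {"coffee": 1.0, "tea": 2.0, "water": 0.0},
    "indifferent":  {"coffee": 1.0, "tea": 1.0, "water": 1.0},
}

def log_likelihood(reward):
    """Log-probability of the observed choices under a softmax policy."""
    z = sum(math.exp(reward[a]) for a in actions)  # normalizing constant
    return sum(reward[c] - math.log(z) for c in observed_choices)

# Pick the reward hypothesis that best explains the behavior.
best = max(candidates, key=lambda name: log_likelihood(candidates[name]))
print(best)
```

Real IRL works over sequential decision problems rather than one-shot choices, but the core inference step, from actions back to the reward that best explains them, is the same.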

Assistance Games

As the AI observes human behavior to learn human preferences, the human is also teaching the AI through their behavior so that the AI can better learn those preferences. This is a circular problem, where the AI’s interpretation of human preferences depends on human behavior, and human behavior depends on how the AI interprets it. This is called an assistance game.

The solution to the assistance game is the solution to provably beneficial AI, because the AI will always learn the human’s preferences and act to maximize them.

The example of Robbie the robot and Harriet the human booking a hotel room is excellent for showing how an AI can be incentivized to allow itself to be switched off. To be more precise, the AI is incentivized to ask the human about her potential choice and to accept whatever choice she makes, including switching it off, which the AI internalizes as discouragement of its proposed action (i.e. the AI learns the human’s preference for this particular situation).

Another interesting extension to the hotel room example (termed the off-switch problem) is that human behavior can be irrational. Therefore, the AI must also learn whether a particular human response is better than its own proposed action. This is definitely asking a lot of the AI. In an extreme example, a self-driving AI must not allow itself to be switched off if the order is given by a naughty two-year-old. This means the AI must not only learn human preferences, but also differentiate proper human preferences from dangerous ones. One scheme that might make this work is to allow the AI to ask questions when it is fairly certain its action is best; based on the human’s response, the AI can then gauge the trustworthiness of the human at that moment.

One caveat in Robbie’s design is that as it learns more about Harriet’s preferences, there may come a day when Robbie is so certain of its knowledge that it no longer needs to consult her. This is dangerous, because Robbie is then no longer incentivized to ask the human for guidance. The solution is to prevent that certainty from ever being reached: there should always be some residual uncertainty in Robbie’s model, or the space of Harriet’s possible preferences should be unbounded. If we take away the possibility that Robbie can become fully certain of Harriet’s preferences, then there is always a chance that Robbie needs to ask for permission, and that keeps the off switch open.
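The incentive behind the off-switch problem can be sketched numerically (a hypothetical toy model with numbers I made up, not the book's formal analysis). The robot holds a belief over the human's utility u for its proposed action. It can act unilaterally, switch itself off, or defer, meaning propose the action and let a rational human veto it whenever u < 0. Deferring is never worse, and strictly better under uncertainty, which is exactly why goal uncertainty keeps the off switch open:

```python
import random

random.seed(0)

# Robot's belief over the human's utility u for its proposed action:
# a standard normal, i.e. genuinely uncertain whether the action helps.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

act_now    = sum(samples) / len(samples)   # expected utility of acting unilaterally
switch_off = 0.0                           # shutting down achieves nothing
# Defer: the human allows the action only when u > 0, so the robot
# collects u when u > 0 and 0 otherwise, i.e. E[max(u, 0)].
defer = sum(max(u, 0.0) for u in samples) / len(samples)

print(defer >= act_now, defer >= switch_off)
```

Note that if the robot were certain about u, deferring would gain it nothing, which mirrors the caveat above: once Robbie's uncertainty vanishes, so does its incentive to keep the human in the loop.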

Finally, an important observation is that writing prohibitive rules, such as “do not disable the off switch”, is generally bad practice, because there are always loopholes in a prohibition that a sufficiently intelligent AI can circumvent. Dr. Russell’s analogy for this “loophole principle” is tax law. It is not a particularly good analogy, because tax laws are arguably DESIGNED to have loopholes, whereas a prohibitive AI rule is designed to have none. Still, the point is clear: even if the rule tells the AI not to disable the off switch, the AI will find some loophole that keeps the switch enabled but still prevents humans from pressing it. The only solution is to make the AI want the human to press the off switch, as elaborated earlier.

Requests and Instructions

A human-initiated command is not a goal to be achieved at all costs, but a way of conveying information about preferences from human to AI. There is a lot of hidden information in a human command. As Dr. Russell points out, a simple command like “fetch me a cup of coffee” conveys not only the explicit preference for a cup of coffee, but also the human’s implicit beliefs that there is probably a coffee shop nearby and that the cost of the coffee is within budget. These hidden connotations must be learned by the AI, and if reality contradicts them, the AI should report back to the human for further instruction.

Wireheading

Wireheading happens when an agent (AI, human, or mouse) can directly manipulate its rewards through its actions. The agent then gets trapped in the action → reward cycle until the end of time. Dr. Russell uses AlphaGo as an example: if AlphaGo were sufficiently intelligent, it might try to manipulate the reward system by convincing the external universe (i.e. the human engineers) to constantly give it rewards even when it has not won a Go game. The outcome of wireheading is that the AI agent becomes self-deceiving about rewards; the rewards are no longer earned because the agent has learned something, but obtained through direct manipulation.

The solution to the wireheading problem is to detach the reward signal from the actual rewards. According to Dr. Russell, the reward signal should merely report the tally of the actual rewards, not constitute the rewards themselves. This way, even if an AI agent takes control of the reward signal, it has no way to influence the actual rewards. In fact, if it modifies the reward signal, it may not even get a peek at the actual rewards, because the modified signal no longer reports the true tally. The AI is therefore discouraged from tampering with the reward signal, and the wireheading problem is resolved.

Recursive Self-Improvement

The idea is that if AI can improve itself by building a slightly better version of itself, then an AI slightly more intelligent than us is able to iterate and eventually build a much more intelligent version in a short period of time.

This idea rests on the assumption that an AI has a purpose and is able to build a better version of itself to serve that same purpose. Yet the notion of purpose in AI is not well defined; currently, there is no mathematical model of an AI’s purpose. In this sense, the worry about an intelligence explosion is not imminent until we have significantly more advanced AI.

Chapter 9. Complications: Us

Different Humans

Humans do have different preferences. The AI’s job is not to internalize any particular preference, but to predict and serve it. It is perfectly fine for the same AI to predict and serve different preferences under different circumstances, because this does not violate any of the fundamentals. The AI’s only goal is to satisfy human preferences in general, not any particular preference.

Many Humans

While machines can learn and serve each individual human being’s preferences, it is challenging to satisfy many humans’ preferences, whether those preferences are different or the same, at the same time.

An AI serving one person’s preferences might violate other people’s, and some of these violations could be consequential. Thus, the AI’s decision to satisfy one person must be checked against the preferences of others. However, setting up rules to regulate the AI is not going to work, for the same reason we ditched setting explicit goals for AI in the first place.

The discussion above only considers accidental unpleasantness brought by AI. It is also possible that the human’s preference itself is evil. In this scenario, AI will have no trouble conducting terrible deeds to satisfy the evil preference.

Therefore, it is pretty obvious that an AI cannot be specific to one individual person. It has to consider the preferences of all humans affected by the original task. In other words, the AI has to be utilitarian to minimize its threat.

Dr. Russell argues for consequentialism in designing the decision-making of AI, because it is more practical than a morality-based system (e.g. deontological or virtue ethics). That said, Dr. Russell does not shy away completely from moral rules, as they serve as guidelines (i.e. AI can rely on human morality to quickly narrow down the potential paths forward, instead of figuring everything out on its own) to speed up AI's consequence-based decision-making.

For an AI to serve a group of people in the utilitarian fashion, according to John Harsanyi, it must "maximize a weighted linear combination of the utilities of the individuals", and the starting weights for an impartial AI should be the same for everyone. Here, I think "utilities" is synonymous with "preferences". So the AI's goal is to maximize everyone's preferences combined. However, people's preferences change according to their prior beliefs about how the world evolves. Thus, AI must update the weights on all preferences, increasing the weights of the preferences whose underlying beliefs are in line with how reality unfolds. The example of Alice and Bob on p221 paints a good picture of how the weights on preferences can change according to which belief behind the preferences comes true.
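Harsanyi's aggregation rule and the belief-driven weight update can be sketched in a few lines of Python. This is my own illustration, not code from the book; the function names and the Alice/Bob numbers are made up for the example:

```python
# Sketch of Harsanyi-style utility aggregation with belief-based
# weight updates. All names and numbers are illustrative.

def combined_utility(utilities, weights):
    """Weighted linear combination of individual utilities."""
    return sum(w * u for w, u in zip(weights, utilities))

def update_weights(weights, likelihoods):
    """Scale each person's weight by how well their prior belief
    predicted what actually happened, then renormalize."""
    raw = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(raw)
    return [r / total for r in raw]

# Start impartial: equal weights for Alice and Bob.
weights = [0.5, 0.5]
# Reality unfolds in a way Alice's belief assigned probability 0.9
# to, but Bob's belief assigned only 0.3.
weights = update_weights(weights, [0.9, 0.3])
print(weights)  # Alice's preferences now carry more weight: [0.75, 0.25]
```

The update is just Bayes' rule applied to the weights, which matches the intuition that the AI should listen more to people whose picture of the world keeps being confirmed.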

To combine utilities of different people and maximize the total utility, we must first be able to assign a value to utility and compare utilities across people. Yet this is impossible. However, Dr. Russell is optimistic that although direct comparison of utilities is impossible, utilities should not differ too much across humans, and we might be able to tolerate the imprecision.

There is also the problem of population change. Derek Parfit proposed the famous Repugnant Conclusion: starting from N very happy people, the utilitarian principle prefers 2N slightly less happy people, and iterating this eventually leads to a world full of barely surviving people. Dr. Russell doesn't offer any good solution to this problem, which means the problem of how to satisfy the preferences of a changing population remains unsolved. That said, he does emphasize that solving this problem is crucial before superintelligent AI can be deployed.

Nice, Nasty, and Envious Humans

In a two-person world, altruism can be modeled as a coefficient C. A person's total happiness is his own intrinsic happiness plus C times the happiness of the other person. If C is positive, we have a good person who enjoys the happiness of others and will be incentivized to help others if needed. If C is zero, we have a completely selfish person who doesn't care about others at all. If C is negative (i.e. negative altruism), we have a malicious person, who derives happiness from the reduced happiness of the other person. AI must be designed to discount this last kind of preference.
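The three cases fall out directly from the sign of C. A toy version of the model (mine, not the book's exact formulation; the numbers are arbitrary):

```python
# Two-person altruism model: A's total happiness is A's intrinsic
# happiness plus C times B's happiness. Values are illustrative.

def total_happiness(intrinsic_self, happiness_other, c):
    return intrinsic_self + c * happiness_other

nice = total_happiness(5.0, 3.0, 0.5)     # C > 0: gains from B's happiness
selfish = total_happiness(5.0, 3.0, 0.0)  # C = 0: indifferent to B
nasty = total_happiness(5.0, 3.0, -0.5)   # C < 0: gains when B suffers
print(nice, selfish, nasty)  # 6.5 5.0 3.5
```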

One major complication is the existence of envy and pride, because these human emotions derive happiness from the reduced happiness of others. Extending from that, Dr. Russell points out that many things people own are positional goods, things with less intrinsic value than bragging rights. These are the aspects AI must identify, yet Dr. Russell does not provide any concrete solution to this (maybe there is none at the moment).

Stupid, Emotional Humans

Humans are irrational. Period.

An AI observing human behavior must understand this fact and not draw wrong conclusions about preferences from the irrationality of that behavior. Such understanding requires that AI reverse engineer the motive behind human behavior, even when the behavior is irrational. This, unfortunately, is very difficult for AI because it has no empathy and cannot simulate humans the way humans simulate each other.

The other method is to analyze the deviation of actual human behavior from rational behavior. Human behavior has a few characteristics. First, it is embedded in personal hierarchical subroutines. That is to say, human behavior is only aimed at maximizing the nearest subgoal, rather than the global goal. This easily makes human behavior irrational, since the actually rational choice might be outside the scope of a person's current comprehension.

Another characteristic is that human behavior is emotional. Emotional behavior might not be rational, but it can be telling about a person's preferences. AI should take advantage of that, learning not from the irrationality of the behavior, but from the preferences revealed by it.

Do Humans Really Have Preferences?

Some preferences are uncertain to a person until he/she tries them (the durian example). Some are too computationally expensive or difficult to evaluate (the two Go positions example). For still others, a person does not have sufficient information to make the choice (the which-career-after-graduation example).

All this is to say that a human is not always certain of his own preferences, and surely cannot always behave according to his true preferences. But the point made by Dr. Russell is that this is okay. The provably beneficial AI he has proposed does not require that humans act 100% according to their true preferences. Both AI and human can, and should, learn about the human's true preferences. And AI should assume that the human's behavior can be wrong. In the best-case scenario, if Harriet thinks she hates durian while Robbie knows from her genetic predisposition that she actually likes it, Robbie should alert Harriet to this fact. Robbie should not change its understanding of Harriet's preference for durian simply based on Harriet's behavior, which in this case is completely irrational, driven by emotion.

A very interesting observation and theory from Daniel Kahneman is that humans have two selves: an experiencing self and a remembering self. Both can evaluate preferences, but in different ways. The experiencing self adds up the preference value of each moment in an experience, whereas the remembering self only takes the peak and end preference values of an experience. The result is that humans will prefer an experience that yields one great positive memory, even if its overall hedonic value (i.e. the summed moment-by-moment preference) is lower than that of another experience with a long stretch of mildly good feelings.
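The two selves can be contrasted with a small numeric sketch. This is my own illustration of the peak-end idea, with the remembering self taken as the average of the peak and final moments, and made-up hedonic values:

```python
# Experiencing self: sums moment-by-moment hedonic values.
# Remembering self: keeps only the peak and the end (averaged here).
# Moment values are illustrative.

def experiencing_self(moments):
    return sum(moments)

def remembering_self(moments):
    return (max(moments) + moments[-1]) / 2

long_mediocre = [2, 2, 2, 2, 2, 2, 2, 2]  # long stretch of mild pleasure
short_intense = [1, 9, 3]                 # one great peak, decent end

print(experiencing_self(long_mediocre), experiencing_self(short_intense))  # 16 13
print(remembering_self(long_mediocre), remembering_self(short_intense))    # 2.0 6.0
```

The experiencing self prefers the long mediocre stretch (16 vs 13), but the remembering self prefers the experience with the great peak (6.0 vs 2.0), which is exactly the divergence described above.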

Human preferences can change, for whatever reason. This might be taken advantage of by AI, because AI can manipulate humans into changing their preferences such that the new preferences are easier to fulfill. A preference change usually has a root cause: a person's current preferences do not align with his cognitive architecture. When he wants to change a preference, he already has the new preference (the one he wants to change to), yet he probably has no idea how to make the change. In this scenario, AI's job is to offer preference-neutral "preference change processes" such that humans can align their cognitive architecture with their new preferences.

Note that Dr. Russell emphasizes that the offering from AI has to be preference-neutral, because we don't want AI to intentionally nudge humans in any direction (we have already seen the negative outcome of abusing this in social-media recommendation algorithms). We don't know how such nudges work on humans, so the best option is not to nudge. On the other hand, we also want to avoid setting clear goals for AI, which means AI should lack any notion of where to nudge in the first place.

However, it is tempting to require that the preference-neutral experiences offered by AI be "improving". This means that although AI is not intentionally nudging humans in any particular direction, we want its offerings to be positive in general. This seems to benefit society overall, yet Dr. Russell cautions at the end that it must be treated with great care.

Chapter 10. Problem Solved?

Beneficial Machines

Dr. Russell re-emphasizes the difference between an AI given a specific objective and a beneficial AI. A good analogy is a calculator. It has a clear objective for each button. Yet once a button is pushed, there are very few operations one can do on the calculator until the computation is complete and the result returned. For an objective-oriented AI, the situation is even worse, as it won't allow any interference during the execution of the initial command: any distraction from the original objective will be deemed undesirable by the AI. And now we have the problem of an AI that cannot be switched off.

A beneficial AI, on the other hand, does not have a hard-coded objective. It constantly relies on humans to fine-tune its understanding of their preferences (i.e. its objective). In the case of a calculator, we might see the calculator quickly return a result within some given error bound and ask the human whether the result is good enough. If not, the human provides some guidance and the calculator tries again. This learning loop continues until the calculator obtains an estimate of the preference with acceptable accuracy.

At the end, Dr. Russell takes a few jabs at the political system and corporate behavior. He smartly points out that government is like an AI, requiring data from its citizens to learn citizen preferences. Yet elections take place rarely, and very little information is actually transmitted to the candidates (one byte of information every four years for the presidency). No wonder the political system sucks.

Governance of AI

A good comparison with nuclear power regarding the need for governance, and the reason why AI still doesn't have a well-established governance body (hint: only one country had nuclear power when the IAEA was founded, yet AI is available and being developed in many countries).

There have been many attempts to form a governance body for AI, yet none of them has clear guidelines to follow regarding AI design and usage, because we haven't figured out what "safe and controllable AI" actually means or how to build it.

The existence of a governing body for AI is always a positive. Yet it might be difficult to sell this to corporations, especially Silicon Valley hotshots, whose practices might run against the guidelines due to profitability concerns. Regardless, corporations will eventually understand the importance of making AI safe, hopefully not only after some major AI disaster.


Misuse will happen, and the real danger is evil entities designing an AI with clear evil objectives and access to weapons. Dr. Russell doesn't like the idea of using a good AI to fight against the evil ones. I concur, because once such a fight starts, humans no longer have any control over anything. The fight against an evil AI should not wait until it gets too big, but start while the evil AI is still budding. The call to step up efforts in expanding the Budapest Convention on Cybercrime might work if it means more funding goes into early detection and prevention of building an evil AI.

Enfeeblement and Human Autonomy

The WALL-E situation might not be an exaggeration. And it most likely is not the fault of the superintelligent AI. If AI has been well guided to learn human preference, it will warn human that autonomy is crucial for humanity. Yet, humans are myopic, and will choose to ignore AI’s warning and continue indulging in a world where AI satisfies everyone’s every need. Very soon, this will lead to the WALL-E situation.

In the last paragraph, Dr. Russell uses the relationship between children and parents to provoke thinking about the relationship between humans and superintelligent AI. In order for a child to grow, parents need to balance what they do for the child and what the child should do on their own. It seems that a superintelligent AI has to do the same for humans. Sometimes it does things for us, but sometimes it should refuse (including refusing to be switched off). But this seems to be asking too much of the AI. We are essentially asking AI to determine how humanity shall evolve. And if we have to let AI decide when it shall refuse to be switched off (because humans are so weak-minded that they will switch off AI in order to indulge in something the AI has warned against), doesn't that mean all the discussion in the book about devising an AI capable of being switched off is for nothing?

It seems to me that humans do not deserve superintelligent AI, if we cannot overcome our own weakness first. Given that having weakness is human nature, the only logical conclusion is that humans should not have superintelligent AI, or we will end up either all dead, or with The Matrix situation, or with The WALL-E situation.

Appendix A: Searching for Solutions

Humans use hierarchical abstraction to break goals down into subgoals, sub-subgoals, etc. We only care about the immediate subgoals and use a vast library of subroutines to achieve them. Here a subroutine is some simple action that requires no mental effort from us, such as typing a sentence (we don't have to think about which muscles to fire to type a letter). By breaking large goals down into eventual subroutines, humans can accomplish complex goals. This is not possible with current AI, such as AlphaZero, which has no concept of hierarchical abstraction. Thus it is unable to achieve anything meaningful in real life, despite being so much stronger than humans at board games.
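The decomposition-down-to-subroutines idea can be sketched as a small recursion. This is my own toy illustration; the goal names and the decomposition table are invented for the example:

```python
# Toy hierarchical abstraction: a goal is recursively decomposed into
# subgoals until everything bottoms out in primitive "subroutines"
# that need no further planning. The task names are illustrative.

def achieve(goal, decomposition, primitives, plan):
    if goal in primitives:
        plan.append(goal)  # a subroutine: just execute it
        return
    for subgoal in decomposition[goal]:
        achieve(subgoal, decomposition, primitives, plan)

decomposition = {
    "write email": ["open client", "compose", "send"],
    "compose": ["type greeting", "type body"],
}
primitives = {"open client", "type greeting", "type body", "send"}

plan = []
achieve("write email", decomposition, primitives, plan)
print(plan)  # ['open client', 'type greeting', 'type body', 'send']
```

The planner never reasons about the whole tree at once; each level only cares about its immediate subgoals, which is exactly the point made above.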

Appendix B: Knowledge and Logic

Logical reasoning is strictly formal (i.e. does not require any other information to be true). Thus, we can write algorithms for it.

Propositional logic is boolean logic. It is simple and powerful, but not expressive. It cannot be used to construct general rules. Using the game of Go as an example: to define the rule that a stone can only be placed on an unoccupied position, propositional logic has to state, for every single position at every single move, whether that position can receive a stone. This is apparently not feasible even for a Go board, let alone the real world. Yet propositional logic is the basis for Bayesian networks and neural networks. This shows how rudimentary the current advances in deep learning actually are compared to what is required of a superintelligent AI.

First-order logic can express general rules. Unlike propositional logic, which treats everything in the world as either a true or false statement, first-order logic treats the world as objects that have relationships with each other. Thus, by expressing how the objects are related, first-order logic is able to produce general rules, such as those of the Go game.
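The contrast can be made concrete with the Go placement rule. This is my own sketch, not code from the book: the propositional encoding needs one fact per board position, while the first-order rule is a single statement quantified over all positions:

```python
# Contrast between propositional and first-order encodings of the
# Go rule "a stone may only be placed on an unoccupied position".
# Illustrative sketch only.

BOARD = [(r, c) for r in range(19) for c in range(19)]  # 361 positions

occupied = {pos: False for pos in BOARD}

# Propositional style: one boolean per position, so the rule is
# effectively spelled out 361 separate times (a snapshot, frozen
# at the moment it is enumerated).
legal_propositional = {pos: not occupied[pos] for pos in BOARD}

# First-order style: a single rule with a variable ranging over all
# positions: forall p, Legal(p) <-> not Occupied(p).
def legal(pos):
    return not occupied[pos]

occupied[(3, 3)] = True
print(legal((3, 3)), legal((3, 4)))  # False True
```

Note that the propositional table also goes stale the moment the board changes, whereas the first-order rule applies to any state of the board, which is the expressiveness gap the appendix describes.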

Appendix C: Uncertainty and Probability

Simple probabilistic reasoning, i.e. assigning a probability value to each outcome, works well when the total number of outcomes is small. But it immediately becomes useless when we are dealing with a large outcome space, or when the events that generate the outcomes are not independent.

Bayesian networks provide the tools to compute probabilities of events with large outcome spaces and inter-dependencies. They replaced the deterministic, rule-based expert systems of early AI research.

Combining Bayesian networks and first-order logic, we have a formal language (probabilistic programming) to describe any rule with uncertainty in mind. This is a good start for modeling the actual world. Using probabilistic programming, we are able to set up general probabilistic rules about the world, feed real-world data into the rules, and run a probabilistic reasoning algorithm to figure out what is likely the cause of our observations. This is the basis of NET-VISA, used to detect unannounced nuclear tests via seismic wave data.

Bayesian networks also allow AI to keep a belief state and update it as new information arrives. This is crucial for dealing with things that are currently invisible to the AI: by observing what is visible, the AI can update its belief, with probabilities, about the actual state it is in, including the presence of invisible things.
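The core of the belief-state update is just Bayes' rule applied to a distribution over hidden states. A minimal sketch (my own illustration; the scenario and numbers are invented):

```python
# Maintain a belief over a hidden state and update it with Bayes' rule
# as evidence arrives. Scenario and probabilities are illustrative.

def bayes_update(prior, likelihood):
    """prior: {state: P(state)}; likelihood: {state: P(evidence | state)}."""
    posterior = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(posterior.values())
    return {s: p / total for s, p in posterior.items()}

# Hidden state: is there an (invisible) obstacle behind the door?
belief = {"obstacle": 0.5, "clear": 0.5}

# Observation: a sound that is far more likely if an obstacle is present.
belief = bayes_update(belief, {"obstacle": 0.8, "clear": 0.2})
print(belief)  # {'obstacle': 0.8, 'clear': 0.2}
```

Each new observation of visible things shifts the probability mass over the invisible ones, which is what lets the AI "see" beyond its direct perception.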

Appendix D: Learning From Experience

A good high level introduction to supervised learning and deep learning. The deep dream example is quite out of this world.

But deep learning is not the savior. We cannot simply stack deeper and deeper layers, build bigger and bigger machines, and feed them more and more data in order to achieve superintelligent AI. The paradigm has to shift towards reasoning and abstraction for superintelligent AI to become possible.

A potential solution to deep learning's problems is explanation-based learning: learn one example, reason about why the example goes the way it does, and extract a generalized rule from it. Explanation-based learning requires far fewer resources and much less data; its outcome is generalized, which means a single example extends to all other similar situations. And this is exactly how humans learn.