Finalist for the Los Angeles Times Book Prize.

A jaw-dropping exploration of everything that goes wrong when we build AI systems and the movement to fix them. Today’s “machine-learning” systems, trained by data, are so effective that we’ve invited them to see and hear for us—and to make decisions on our behalf. But alarm bells are ringing. Recent years have seen an eruption of concern as the field of machine learning advances. When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge. Researchers call this the alignment problem.

Systems cull résumés until, years later, we discover that they have inherent gender biases. Algorithms decide bail and parole—and appear to assess Black and White defendants differently. We can no longer assume that our mortgage application, or even our medical tests, will be seen by human eyes. And as autonomous vehicles share our streets, we are increasingly putting our lives in their hands. The mathematical and computational models driving these changes range in complexity from something that can fit on a spreadsheet to a complex system that might credibly be called “artificial intelligence.” They are steadily replacing both human judgment and explicitly programmed software.

In best-selling author Brian Christian’s riveting account, we meet the alignment problem’s “first-responders,” and learn their ambitious plan to solve it before our hands are completely off the wheel. In a masterful blend of history and on-the-ground reporting, Christian traces the explosive growth in the field of machine learning and surveys its current, sprawling frontier. Readers encounter a discipline finding its legs amid exhilarating and sometimes terrifying progress. Whether they—and we—succeed or fail in solving the alignment problem will be a defining human story.

The Alignment Problem offers an unflinching reckoning with humanity’s biases and blind spots, our own unstated assumptions and often contradictory goals. A dazzlingly interdisciplinary work, it takes a hard look not only at our technology but at our culture—and finds a story by turns harrowing and hopeful.
I am thoroughly impressed with Christian's documentation of AI's development and emergence from nascent geekery to world-altering capital-T Thing. This book was released in 2020, and a mere 3.5 years later basically every tech product you're likely to see has had “AI” thrown at the front or back of its name. There is so much fear, uncertainty, and doubt around this technology that half of the conversations I'm in that involve it seem to want to resolve into people fleeing for the woods.
Christian does a good job of documenting the historical, psychological, ethical, and epistemological origins of AI. I was particularly drawn to the psychological analogies, many of which surprised me. I rented this from the library in physical form, so to save my notes for future reference I had to painstakingly write page numbers on index cards and then go back to scan/dictate the text into my Notes app; I'm posting those notes here for my convenience.
—
Notes:
The Alignment Problem
P30 - In one of the first articles explicitly addressing the notion of bias in computing systems, the University of Washington's Batya Friedman and Cornell's Helen Nissenbaum had warned that “computer systems, for instance, are comparatively inexpensive to disseminate, and thus, once developed, a biased system has the potential for widespread impact. If the system becomes a standard in the field, the bias becomes pervasive.” ^40 (Representation)
P49 - As Princeton's Arvind Narayanan puts it: “Contrary to the ‘tech moves too fast for society to keep up' cliché, commercial deployments of tech often move glacially; just look at the banking and airline mainframes still running. ML [machine-learning] models being trained today might still be in production in 50 years, and that's terrifying.” ^93 (Representation)
Feedback loops
“Machine learning is not, by default, fair or just in any meaningful way.” - Moritz Hardt (^3, Fairness)
“No machinery is more efficient than the human element that operates it.” (??)
“One of the most important things in any prediction is to make sure that you're actually predicting what you think you're predicting. This is harder than it sounds.”
P123 - Thorndike sees here the makings of a bigger, more general law of nature. As he puts it, the results of our actions are either “satisfying” or “annoying.” When the result of an action is “satisfying,” we tend to do it more. When on the other hand the outcome is “annoying,” we'll do it less. The more clear the connection between action and outcome, the stronger the resulting change. Thorndike calls this idea, perhaps the most famous and durable of his career, “the law of effect.”
As he puts it:
The Law of Effect is: When a modifiable connection between a situation and a response is made and is accompanied or followed by a satisfying state of affairs, that connection's strength is increased: When made and accompanied or followed by an annoying state of affairs its strength is decreased. The strengthening effect of satisfyingness (or the weakening effect of annoyingness) upon a bond varies with the closeness of the connection between it and the bond. ^7 (Reinforcement)
P127 - Continuing to develop machines that could learn, in other words—by human instruction or their own experience—would alleviate the need for programming. Moreover, it would enable computers to do things we didn't know how to program them to do.
P141 - “This is apparently the first application of this algorithm to a complex non-trivial task,” Tesauro wrote. (Re: TD learning applied to game play: learning from guesses, steadily coming to learn what a good position looks like.) ... “With zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong, intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. This indicates that TD learning may work better in practice than one would expect based on current theory.”
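To make the TD idea concrete, here is a minimal sketch of a TD(0) value update (my own illustration, not Tesauro's TD-Gammon code): the estimate for the current position is nudged toward the observed reward plus the estimate for the next position, so the learner is, in effect, learning a guess from a guess.

```python
# Minimal TD(0) value-update sketch (illustrative only, not Tesauro's TD-Gammon code).
# V: value estimate per state; alpha: learning rate; gamma: discount factor.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    # Nudge the current state's estimate toward the bootstrapped target:
    # the observed reward plus the estimated value of the successor state.
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])
    return V

# Toy episode: the value of "s0" is pulled toward the value of "s1".
V = {"s0": 0.0, "s1": 0.5, "end": 0.0}
V = td0_update(V, "s0", reward=0.0, next_state="s1")
print(V["s0"])  # 0.05
```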
P151 - Meanwhile, we take up another question. Reinforcement learning in its classical form takes for granted the structure of the rewards in the world and asks the question of how to arrive at the behavior—the “policy”—that maximally reaps them. But in many ways this obscures the more interesting—and more dire—matter that faces us at the brink of AI. We find ourselves rather more interested in the exact opposite of this question: Given the behavior we want from our machines, how do we structure the environment's rewards to bring that behavior about?
How do we get what we want when it is we who sit in the back of the audience, in the critic's chair—we who administer the food pellets, or their digital equivalent?
This is the alignment problem, in the context of a reinforcement learner. Though the question has taken on a new urgency in the last five to ten years, as we shall see it is every bit as deeply rooted in the past as reinforcement learning itself.
P160 - But Miyamoto had a problem. There are also good mushrooms, which you have to learn, not to dodge, but to seek. “This gave us a real headache,” he explains. “We needed somehow to make sure the player understood that this was something really good.” So now what? The good mushroom approaches you in an area where you have too little headroom to easily jump over it; you brace for impact, but instead of killing you, it makes you double in size. The mechanics of the game have been established, and now you are let loose. You think you are simply playing.
But you are carefully, precisely, inconspicuously being trained. You learn the rule, then you learn the exception. You learn the basic mechanics, then you are given free rein.
P161 - In both cases, the use of a curriculum – an easier version of the problem, followed by a harder version – succeeded in cases where trying to learn the more difficult problem by itself could not.
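A minimal sketch of what a curriculum looks like in code (my own illustration, not the book's; `make_batch` and `train_step` are assumed stand-ins for whatever data generator and update rule are in use): the same model simply sees easier task variants before harder ones.

```python
# Hedged sketch of curriculum training: the same model is trained on
# progressively harder versions of the task rather than the full problem
# from the start. `make_batch` and `train_step` are assumed stand-ins.
def train_with_curriculum(model, make_batch, train_step,
                          difficulties=(0.25, 0.5, 1.0), steps_per_stage=1000):
    for difficulty in difficulties:          # easy -> hard
        for _ in range(steps_per_stage):
            batch = make_batch(difficulty)   # easier examples come first
            model = train_step(model, batch)
    return model
```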
P169 - “As a general rule,” says Russell, “it is better to design performance measures according to what one actually wants in the environment, rather than according to how one thinks the agent should behave.”^50 Put differently, the key insight is that we should strive to reward states of the world, not actions of our agent. These states typically represent “progress” toward the ultimate goal, whether that progress is represented in physical distance or in something more conceptual like completed subgoals (chapters of a book, say, or portions of a mechanical assembly). (^50 Shaping).
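One standard way to encode “reward states, not actions” is potential-based shaping, the Ng-Harada-Russell idea the shaping literature builds on. A hedged sketch, not the book's own example, assuming a “progress” measure Phi over states:

```python
# Hedged sketch of potential-based reward shaping. Phi is an assumed progress
# measure over states, e.g. completed subgoals or chapters written.
def shaped_reward(base_reward, phi_state, phi_next_state, gamma=0.99):
    # Shaping bonus F(s, s') = gamma * Phi(s') - Phi(s). Adding F to the reward
    # rewards progress through states while leaving the optimal policy unchanged.
    return base_reward + gamma * phi_next_state - phi_state

# Example: finishing a third subgoal earns a small bonus on top of the base reward.
print(shaped_reward(base_reward=0.0, phi_state=2.0, phi_next_state=3.0))  # 0.97
```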
P185 - Learned helplessness; “As the celebrated aphorist Ashleigh Brilliant put it, “If you're careful enough, nothing bad or good will ever happen to you.” ^11 (Curiosity)
P202 - All rewards are internal. ^61 (Curiosity).
P222 - Conway Lloyd Morgan - “Five minutes' demonstration is worth more than five hours' talking where the object is to impart skill. It is of comparatively little use to describe or explain how a skilled feat is to be accomplished; it is far more helpful to show how it is done.” ^32 (Imitation)
P228 - At its root, the problem stems from the fact that the learner sees an expert execution of the problem, and an expert almost never gets into trouble. No matter how good the learner is, though, they will make mistakes – whether blatant or subtle. But because the learner never saw the expert get into trouble, they have also never seen the expert get out. In fact, when the beginner makes beginner mistakes, they may end up in a situation that is completely different from anything they saw during their observation of the expert. “That means,” says Sergey Levine, “that, you know, all bets are off.” (Cascading errors)
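A back-of-the-envelope way to see why these errors cascade (my own illustration, not the book's): if a cloned policy deviates from the expert with some small probability at each step, and has never seen how to recover, the chance of staying on the expert's trajectory decays exponentially with the length of the task.

```python
# Illustrative only: compounding error in naive imitation (behavioral cloning).
# eps: per-step probability of drifting off the expert's trajectory; once off,
# the learner has never seen a recovery, so we treat the episode as lost.
eps, horizon = 0.01, 300
p_on_track = (1 - eps) ** horizon
print(f"P(still on the expert's trajectory after {horizon} steps) ~ {p_on_track:.3f}")  # ~0.05
```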
P247 - Eliezer Yudkowsky, cofounder of the Machine Intelligence Research Institute, wrote an influential 2004 manuscript in which he argues that machines should not simply imitate and uphold our norms as we imperfectly embody them; rather, we should instill in them what he calls our “coherent extrapolated volition.” “In poetic terms,” he writes, “our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were.”
P251 - Warneken, along with his collaborator Michael Tomasello of Duke, was the first to systematically show, in 2006, that human infants as young as eighteen months old will reliably identify a fellow human facing a problem, will identify the human's goal and the obstacle in the way, and will spontaneously help if they can, even if their help is not requested, even if the adult doesn't so much as make eye contact with them, and even when they expect (and receive) no reward for doing so.^2 (Inference)
P261 - We are now, it is fair to say, well beyond the point where our machines can do only that which we can program into them in the explicit language of math and code.
P268 - Russell dubbed this new framework cooperative inverse reinforcement learning (“CIRL,” for short).^40 In the CIRL formulation, the human and the computer work together to jointly maximize a single reward function, and initially only the human knows what it is.
“We were trying to think, what's the simplest change we can make to the current math and the current theoretical systems that fixes the theory that leads to these sorts of existential-risk problems?” says Hadfield-Menell. “What is a math problem where the optimal thing is what we actually want?”^41 (Inference)
P282 - He has the students play games where they must decide which side of various bets to take, figuring out how to turn their beliefs and hunches into probabilities, and deriving the laws of probability theory from scratch. They are games of epistemology: What do you know? And what do you believe? And how confident are you, exactly? “That gives you a very good tool for machine learning,” says Gal, “to build algorithms—to build computational tools—that can basically use these sorts of principles of rationality to talk about uncertainty.” (...) Gal: “I wouldn't rely on a model that couldn't tell me whether it's actually certain about its predictions.” (Re: uncertainty in models and models communicating uncertainty; ensembling; dropout...) ^14
There's a certain irony here, in that deep learning, despite being deeply rooted in statistics, has, as a rule, not made uncertainty a first-class citizen.
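A minimal sketch of the ensembling idea Gal alludes to (my own toy example, using a bootstrap ensemble of linear fits rather than a neural network; MC dropout plays an analogous role for deep models): train several models on resampled data and read their disagreement as uncertainty.

```python
# Illustrative only: disagreement within a bootstrap ensemble as a crude
# uncertainty estimate. Real systems would ensemble neural nets or use
# techniques like MC dropout; here a tiny linear fit keeps it self-contained.
import random

def fit_line(points):
    # Least-squares slope and intercept for a small 1-D dataset.
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x, _ in points)
    return slope, my - slope * mx

data = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in range(10)]
models = [fit_line(random.choices(data, k=len(data))) for _ in range(20)]  # bootstrap ensemble

x_new = 15.0  # a query point outside the training range
preds = [a * x_new + b for a, b in models]
mean = sum(preds) / len(preds)
spread = (sum((p - mean) ** 2 for p in preds) / len(preds)) ** 0.5
print(f"prediction {mean:.2f} +/- {spread:.2f}")  # spread ~ the ensemble's uncertainty
```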
Note from TB: thinking about uncertainty in prioritization. Weighing measures in a prioritization algorithm.
P292 - Another researcher who has been focused on these problems in recent years is DeepMind's Victoria Krakovna. Krakovna notes that one of the big problems with penalties for impact is that in some cases achieving a specific goal necessarily requires high-impact actions, which can lead to what's called “offsetting”: taking further high-impact actions to counterbalance the earlier ones. This isn't always bad: if the system makes a mess of some kind, we probably want it to clean up after itself. But sometimes these “offsetting” actions are problematic. We don't want a system that cures someone's fatal illness but then, to nullify the high impact of the cure, kills them. ^43 (Uncertainty)
Note from TB: thinking about uncertainty in prioritization again, and how to measure / quantify “impact on PEH,” in algorithm. What is the impact of each stage of the prioritization process, from inflow to referral, etc.
P294 - Turner's idea is that the reason we care about the Shanghai Stock Exchange, or the integrity of our cherished vase, or, for that matter, the ability to move boxes around the virtual warehouse, is that those things, for whatever reason, matter to us, and they matter to us because they are ultimately in some way or other tied to our goals. We want to save for retirement, put flowers in the vase, complete the Sokoban level. What if we model this idea of goals explicitly? His proposal goes by the name “attainable utility preservation”: giving the system a set of auxiliary goals in the game environment, and making sure that it can still effectively pursue these auxiliary goals after it's done whatever points-scoring actions the game incentivizes. Fascinatingly, the mandate to preserve attainable utility seems to foster good behavior in the AI safety gridworlds even when the auxiliary goals are generated at random. ^49 (Uncertainty)
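A hedged sketch of how such a penalty can be scored, assuming we already have attainable-utility (Q-value) estimates for a few auxiliary goals; the real definition is in Turner, Hadfield-Menell, and Tadepalli's paper.

```python
# Hedged sketch of an attainable-utility-preservation-style penalty, assuming
# precomputed Q-value estimates for auxiliary goals (not the paper's definition).
def aup_penalty(q_aux, state, action, noop="noop", scale=1.0):
    """q_aux: dict mapping each auxiliary goal to a dict of (state, action) -> Q."""
    # Penalize the action by how much it shifts the agent's ability to pursue
    # each auxiliary goal, relative to doing nothing at all.
    return scale * sum(
        abs(q[(state, action)] - q[(state, noop)]) for q in q_aux.values()
    )

# Toy example: breaking the vase sharply changes attainable utility for one
# randomly chosen auxiliary goal, so the action is scored as high-impact.
q_aux = {
    "aux_goal_1": {("room", "break_vase"): 0.1, ("room", "noop"): 0.9},
    "aux_goal_2": {("room", "break_vase"): 0.5, ("room", "noop"): 0.5},
}
print(aup_penalty(q_aux, "room", "break_vase"))  # ~0.8: a high-impact action
```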
P295 - One of the most chilling and prescient quotations in the field of AI safety comes in a famous 1960 article, “Some Moral and Technical Consequences of Automation,” by MIT's Norbert Wiener: “If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it... then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.”^51 It is the first succinct expression of the alignment problem.
No less crucial, however, is this statement's flip side: If we were not sure that the objectives and constraints we gave the machine entirely and perfectly specified what we did and didn't want the machine to do, then we had better be sure we can intervene. In the AI safety literature, this concept goes by the name of “corrigibility,” and—soberingly—it's a whole lot more complicated than it seems.^52 (Uncertainty)
P299 - But, they found, there's a major catch. If the system's model of what you care about is fundamentally “misspecified” (there are things you care about of which it's not even aware and that don't even enter into the system's model of your rewards), then it's going to be confused about your motivation. For instance, if the system doesn't understand the subtleties of human appetite, it may not understand why you requested a steak dinner at six o'clock but then declined the opportunity to have a second steak dinner at seven o'clock. If locked into an oversimplified or misspecified model where steak (in this case) must be entirely good or entirely bad, then one of these two choices, it concludes, must have been a mistake on your part. It will interpret your behavior as “irrational,” and that, as we've seen, is the road to incorrigibility, to disobedience.^63 (Uncertainty)
——
Notes
Representation
* 40 - Friedman and Nissenbaum, “Bias in Computer Systems.”
* 93 - Narayanan on Twitter: https://twitter.com/random_walker/status/993866661852864512
Fairness
* 3 - Hardt, “How Big Data Is Unfair.”
Reinforcement
* 7 - Thorndike, The Psychology of Learning.
Shaping
* 50 - Russell and Norvig, Artificial Intelligence.
Curiosity
* 11 - See Henry Alford, “The Wisdom of Ashleigh Brilliant,” http://www.ashleighbrilliant.com/BrilliantWisdom.html, excerpted from Alford, How to Live (New York: Twelve, 2009).
* 61 - Singh, Lewis, and Barto. For more discussion, see Oudeyer and Kaplan, “What Is Intrinsic Motivation?”
* Singh, Lewis, and Barto, “Where Do Rewards Come From?,” in Proceedings of the Annual Conference of the Cognitive Science Society, 2601-06, 2009.
Imitation
* 32 - Morgan, “An Introduction to Comparative Psychology.”
Inference
* 2 - See also Meltzoff, “Understanding the Intentions of Others,” which showed that eighteen-month-olds can successfully imitate the intended acts that adults tried and failed to do, indicating that they “situate people within a psychological framework that differentiates between the surface behavior of people and a deeper level involving goals and intentions.”
* The citation for the Warneken paper: Warneken, Felix, and Michael Tomasello. “Altruistic Helping in Human Infants and Young Chimpanzees.” Science 311, no. 5765 (2006): 1301-03.
* 40 - Hadfield-Menell et al., “Cooperative Inverse Reinforcement Learning.” (“CIRL” is pronounced with a soft c, homophonous with the last name of strong-AI skeptic John Searle (no relation). I have agitated within the community that a hard-c “curl” pronunciation makes more sense, given that “cooperative” uses a hard c, but it appears the die is cast.)
* Note from TB: I agree w/ the hard c note.
* 41 - Dylan Hadfield-Menell, personal interview, March 15, 2018.
Uncertainty
* 14 - Yarin Gal, “Modern Deep Learning Through Bayesian Eyes” (lecture), Microsoft Research, December 11, 2015, https://www.microsoft.com/en-us/research/video/modern-deep-learning-through-bayesian-eyes/.
* 43 - As Eliezer Yudkowsky put it, “If you're going to cure cancer, make sure the patient still dies!” See https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/. See also Armstrong and Levinstein, “Low Impact Artificial Intelligence,” which uses the example of an asteroid headed for earth. A system constrained to only take “low-impact” actions might fail to divert it—or, perhaps even worse, a system capable of offsetting might divert the asteroid, saving the planet, and then blow the planet up anyway.
* 49 - Turner, Hadfield-Menell, and Tadepalli, “Conservative Agency via Attainable Utility Preservation.” See also “Designing Agent Incentives to Avoid Side Effects,” DeepMind Safety Research, https://medium.com/@deepmindsafetyresearch/designing-agent-incentives-to-avoid-side-effects-e1ac80ea6107; Turner's “Reframing Impact” sequence at http://www.alignmentforum.org/s/7CdoznhJaLEKHwvJW; and additional discussion in his “Towards a New Impact Measure,” https://www.alignmentforum.org/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure; he writes, “I have a theory that AUP seemingly works for advanced agents not because the content of the attainable set's utilities actually matters, but rather because there exists a common utility achievement currency of power.” See Turner, “Optimal Farsighted Agents Tend to Seek Power.” For more on the notion of power in an AI safety context, including an information-theoretic account of “empowerment,” see Amodei et al., “Concrete Problems in AI Safety,” which, in turn, references Salge, Glackin, and Polani, “Empowerment: An Introduction,” and Mohamed and Rezende, “Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning.”
* 51 - Wiener, “Some Moral and Technical Consequences of Automation.”
* 52 - According to Paul Christiano, “corrigibility” as a tenet of AI safety began with the Machine Intelligence Research Institute's Eliezer Yudkowsky, and the name itself came from Robert Miles. See Christiano's “Corrigibility,” https://ai-alignment.com/corrigibility-3039e668638.
* 63 - For more on corrigibility and model misspecification using this paradigm, see also, e.g., Carey, “Incorrigibility in the CIRL Framework.”