Let’s Discuss: Situational Awareness

Situational Awareness is a series of essays written by Leopold Aschenbrenner regarding the future of AI within the next decade. Leopold presents a strikingly terrifying future, and gives his input on what we should do to prepare. Behind all of this, however, is a surprising number of descriptive statements, which makes this an especially tough piece. After all, if Leopold's predictions are correct (or close to it), then there are some tough truths to accept.

I'll start by giving my best shot at a summary of each section. I'll likely get stuff wrong, but hopefully I'll get a few things right too. I'll also try to make it clear what I disagree with. Let’s get started.

Part I. Counting the OOMs

(An OOM is an order of magnitude: to increase by one OOM is to multiply by 10.)

The main idea behind this part is quite simple: by measuring the speed of progress in recent years, we can extrapolate to predict how far we are from AGI. In fact, we can do a bit better: we can also look at reasons to believe that we will continue (or not continue) to follow these trends.
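To make the bookkeeping concrete, here is a minimal sketch of the OOM arithmetic used throughout this part (the function name is mine, purely for illustration):

```python
import math

def ooms(before: float, after: float) -> float:
    """Orders of magnitude between two quantities: log10(after / before)."""
    return math.log10(after / before)

# One OOM is a 10x increase, three OOMs a 1000x increase, and so on.
print(ooms(1, 10))    # 1.0
print(ooms(1, 1000))  # 3.0
```

Because OOMs are logarithms, gains from independent sources (compute, algorithms, unhobbling) simply add, which is what makes this accounting convenient.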

AGI in this section is defined similarly to how I did so last week: an AGI can do ML research. Once this threshold is passed, AI research itself can be automated with millions of AI copies that never sleep. This thus starts a feedback loop that ends with something akin to superintelligence.

Leopold first notes that, within the 4 years from GPT-2 to GPT-4, our models went from "about as smart as a preschooler" to "about as smart as a very bright highschooler". This progress was then broken down into three different categories as follows:

1. Compute. We're using bigger computers to train our models.

Leopold points out that, between GPT-2 and GPT-4, we had a 3.5-4.0 OOM increase in compute, and we have no reason to believe that this will slow down anytime soon. One large limiting factor here is going to be raw energy consumption, but companies are already planning massive clusters. In fact, a later section is about just this.

2. Algorithmic efficiencies. We're writing better algorithms that require less compute for equal capabilities.

Although this is a bit less intuitive to quantify, one can make very good estimates here based on performance on benchmarks (tests for the AI), as well as API costs. If we do X amount better on some benchmark with the same amount of FLOPs, we can say we've made algorithmic progress. We can also track API costs; as our algorithms get more efficient, AIs get cheaper to run.

Leopold counts the numbers here and estimates that we made approximately 2 OOMs of progress from algorithmic gains between GPT-2 and GPT-4. Again, he argues that we have no reason to think this will slow down very much.
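One hedged way to sketch this kind of accounting: if a newer algorithm reaches the same benchmark score with less training compute, the ratio of compute budgets measures the algorithmic-efficiency gain. The FLOP figures below are made up for illustration, not taken from the essay:

```python
import math

def algorithmic_ooms(flops_old: float, flops_new: float) -> float:
    """OOMs of algorithmic progress: how much less compute the newer
    algorithm needs to match the older one on a fixed benchmark score."""
    return math.log10(flops_old / flops_new)

# Hypothetical: the old recipe needed 1e23 FLOPs to hit some score,
# the new recipe needs only 1e21 FLOPs -> 2 OOMs of algorithmic progress.
print(algorithmic_ooms(1e23, 1e21))  # 2.0
```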

3. "Unhobbling gains". Giving the AI better access to tools, whether it be directly within the model itself, or external tools such as a code compiler.

To me, this was the most interesting of the three, as I hadn't thought much about this before (I certainly hadn't sectioned it off in my head as its own discrete thing). Leopold points out a huge list of unhobbling gains between GPT-2 and GPT-4. Some of the most notable are reinforcement learning from human feedback (having a human press thumbs up or thumbs down on the model output), chain-of-thought reasoning (forcing the model to "think out loud"), and allowing the model to use tools such as calculators and web browsers. From these, Leopold estimates that between GPT-2 and GPT-4 (again, calculated from benchmarks) we had something like 2 OOMs of improvement, although this is somewhat harder to quantify than the others.

It also seems clear that we're nowhere near the limit on unhobbling gains. Some that we should expect to see in the future include: full access to a computer, larger context length (input length) allowing for fast onboarding to current projects, and spending longer time "thinking" about hard problems vs easy problems. Potential gains from this in the future could be absolutely massive.

Ultimately, Leopold estimates that we are on track to maintain the same speed of progress as between GPT-2 and GPT-4. If he's right, this means that by 2027 we'll have another jump analogous to "preschooler to smart highschooler", but this time it'll be "smart highschooler to ???". Leopold then argues that this resulting AI will likely be able to do ML research, especially since it’ll have read every AI/ML paper that’s ever been written.
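Tallying the essay's rough per-category estimates gives the size of the GPT-2-to-GPT-4 jump in "effective compute"; the sketch below uses Leopold's figures, though the midpoint choice for the compute range is mine:

```python
# Leopold's rough GPT-2 -> GPT-4 estimates, in OOMs of effective compute:
gains = {
    "compute": 3.75,    # midpoint of his 3.5-4.0 OOM range
    "algorithms": 2.0,  # algorithmic efficiencies
    "unhobbling": 2.0,  # RLHF, chain of thought, tool use, etc.
}

total = sum(gains.values())
print(f"GPT-2 -> GPT-4: ~{total} OOMs of effective compute")

# The extrapolation is then: if the same rate holds for another ~4 years,
# we should expect a jump of comparable size again by ~2027 --
# the "smart highschooler to ???" leap.
```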

I agree with the calculations done in this section, and I wouldn’t be surprised if we end up with AGI before 2030. However, there is one relevant possibility that Leopold doesn't discuss in this section that I think has a meaningful probability of occurring:

As we continue to increase compute, discover algorithmic efficiencies, and collect unhobbling gains, what if capability gains run into diminishing returns? On the surface, it may seem like we have no reason to believe this. Our models have been doing better and better on benchmarks with no signs of slowing down. But I posit that there is one key question in this regard: what is the AI actually learning? Let's consider two simplistic options:

1. LLMs just do statistics over lots of training data. They have no internal model of the world, and no representations of complex objects or ideas.

2. LLMs have an internal world model, including representations of complex objects and ideas. They are able to manipulate these representations, have "ideas", and operate within the world model.

Of course, these are very simplified views; likely there is some mixture going on. But again, it will be useful to consider these two limiting cases to examine why we might see diminishing returns on capabilities.

If option 1 is correct, we have no reason to believe that the AI will be able to come up with any fundamentally new ideas. The absolute best that we can expect is the synthesis of existing ideas; that is, combining ideas in ways that have not been done before, often creating something more than the sum of the parts. If this is indeed true, and if furthermore this is a general characteristic of LLMs, then we would start to see diminishing returns on capabilities in the coming years. For eventually we will exhaust all of the ideas (and combinations of ideas) that have been proposed by humans regarding ML research. And once these combinations have been exhausted, the AIs will no longer get better at ML research, and we thus might never start the recursive improvement loop.

If option 2 is correct, we will likely not see diminishing returns. As we increase compute and discover algorithmic efficiencies, the AI's world model will get better, and eventually surpass our own. It will therefore eventually be able to do ML research as well as, or better than, humans.

If, as is indeed likely, LLMs operate with some combination of these options, then we might start to see diminishing returns. For the increase in capabilities can be broken down into two parts: that attributed to a better world model, and that attributed to better statistics producing combinations of existing ideas. As we increase compute and discover algorithmic efficiencies, the AI's world model will improve, but we will also run out of existing ideas and combinations thereof (regarding ML research). Hence, we would see diminishing improvements to AI capabilities in the field of ML research with increasing compute and algorithmic efficiencies.

An important question, then, is whether these diminishing returns will start before or after our LLMs can do successful ML research. And if they indeed start before this critical moment, how drastic will the effect be? If I had to guess, I would say the following:

It seems to me that a good way to create coherent sentences is to have internal representations of less complex objects and ideas, but not to have a fully complete world model. The AI can thus rely on statistics to operate within the space of more complex ideas, and can interface with its internal representations in order to output coherent responses. If this is correct, then we may run out of combinations of ideas (regarding ML research) before reaching recursive self improvement, and also before the AI's world model is complex enough to have good new ideas. This would then result in a very drastic slowdown in capability gains in the future, pushing back the recursive ML research loop—and thus AGI—many years into the future.

However, it is also possible that new combinations of old ideas will improve the world model of the subsequent AI to a degree such that it will be able to come up with new ideas.

It is worth noting that my intuitions regarding the inner workings of LLMs are much worse than those of any researcher actually working on the LLM frontier, and thus I give a very large variance to these ideas. Accordingly, I would not be surprised to see either of the two outcomes I suggest (or for the truth to unfold outside this simplistic model entirely).

Part II. The Intelligence Explosion

The central claim in this piece is something I've discussed previously: Once we get automated ML research, it's not long before we get superintelligence through a simple feedback loop of AI progress.

In short, Leopold does a few sanity checks here to make sure we're not missing anything. For example, a) how many copies of this AI researcher will we be able to run with our limited compute, and b) how hard will it be to get to this level of AI in the first place?

To a), the answer is a lot: likely in the many millions. To b), feel free to read Part I.

Part III a. Racing to the Trillion-Dollar Cluster

This piece argues for the feasibility of the trillion-dollar cluster, which would (based on previous estimates) be enough to reach AGI. (In fact, we may not even need a cluster this big, depending on algorithmic and unhobbling gains.) The argument is broken down into smaller chunks and taken one at a time. First, Leopold argues that the revenue of major AI companies (and associated industries) would be enough to finance such a cluster. He also gives a number of historical examples of projects undertaken at a similar price, providing additional evidence of feasibility.

Next, raw energy consumption (power) is discussed, which is likely the largest bottleneck to the trillion-dollar cluster. Leopold seems to think that once people "wake up to AGI" (i.e., realize that it could happen soon and what it implies), we'll ease up on regulations and will therefore be able to set up enough sources of power. (It is also noted that, if we don't ease up on these regulations, we'll probably end up putting clusters in other countries, such as in the Middle East, which is a horrible idea on the basis of national security.)

Finally, chips themselves are discussed. Like all of the other constraints, producing enough chips seems totally feasible as long as companies decide to do so. Personally, it seems reasonable that chip manufacturing companies will make it happen, simply because they're in the perfect position to see the AGI boom coming.

My further thoughts on this section are two-fold. In some sense, as we discussed regarding chips, the feasibility of the cluster is only half of the conversation. An equally important question is whether or not we (the United States) will actually try. In order to produce enough power for the cluster, for example, existing regulations will likely need to change. Ultimately, this latter question can be reduced to “when/will the US government and chip companies become AGI-pilled?” Leopold discusses the government's side of this later.

Part III b. Security for AGI

This piece argues that, given the power of AGI/ASI (all starting from the first automated ML researcher), it is of the utmost importance that our (the United States') AGI be protected as a matter of national security. Such power would allow for insane dystopian futures if put into the wrong hands, including authoritarian states such as the CCP. (For those with less knowledge in this regard, Leopold has a short list of atrocities undertaken by the CCP in section IIId of Situational Awareness.)

Leopold argues that there are two main threats we need to protect against: the stealing of algorithmic secrets, and the direct stealing of model weights. The second is a much more visceral threat, and is in some sense clearer to protect against (not easier, but it is more clear how it should be done). The first, the stealing of algorithmic ideas, is much less clear how to prevent, given that a single person defecting (being bribed) or being kidnapped could mean the end of such secrets (and thus the end of the AI lead held by the United States).

This piece then claims that the security needed to prevent both (or either) of these threats would only be possible with the help of the government. Leopold lists a few of the kinds of security measures we would need, which include airgapped datacenters with physical security equal to that of a military base, better encryption algorithms, all researchers working in SCIFs (see this visualization), and extreme personal security clearances for researchers.

Finally, Leopold argues that we are not on track to make this happen; he seems to predict that sometime in the next few years, China will steal some big algorithmic secret, thus causing a reform in our security measures. He hopes this does not happen too late.

Part III c. Superalignment

I have a lot of critiques of the arguments presented in this section. But like always, I'll try to steel-man his position during my summary.

Superalignment is the same old alignment problem, but specifically for a superintelligent AI. If you're unfamiliar with the alignment problem, aligning an AI means aligning the AI's goals with human goals such that it doesn't kill us (in other words, it's a control problem). To explain why the superalignment problem is so hard, Leopold juxtaposes it against our current (sufficient but not perfect) alignment techniques. Currently, the main technique we use to align our language models is RLHF (reinforcement learning from human feedback), where we click the thumbs up or thumbs down button after reading the output of the model. The model is then trained to make people press thumbs up instead of thumbs down. This, however, fails when the model gets sufficiently smart. How, for example, do you press thumbs up or down on a million lines of code? How do you press thumbs up or down when the model is using complex thought processes that you cannot understand?
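RLHF itself is a whole training pipeline, but the core "thumbs up vs. thumbs down" signal reduces to a preference loss over pairs of outputs. A toy sketch of that idea, assuming a standard Bradley-Terry-style formulation (the function name and the reward numbers are mine, not from the essay):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss on a (thumbs-up, thumbs-down) pair:
    small when the reward model scores the chosen output higher,
    large when it prefers the rejected output."""
    sigmoid = 1 / (1 + math.exp(-(reward_chosen - reward_rejected)))
    return -math.log(sigmoid)

# Reward model already prefers the thumbs-up output -> small loss:
print(preference_loss(2.0, -1.0))  # ~0.049
# Reward model prefers the thumbs-down output -> large loss:
print(preference_loss(-1.0, 2.0))  # ~3.049
```

The point of the superalignment worry is that this signal is only as good as the human doing the clicking; once outputs exceed human evaluation ability, the loss above is computed against labels we can no longer trust.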

It's hard to align a superintelligent agent because we can no longer evaluate whether the AI is doing good or bad things. (Not, at least, until something really bad or really good happens; then we'll know.)

Leopold, however, is not a doomer, and he thinks that we'll "muddle through" on this one. His prediction for how things will go seems to be roughly as follows:

1. We'll start by aligning the somewhat-superhuman models (the same models that will start the recursive ML research).

2. We'll use these models to align the smarter ones.

In order for this kind of scheme to work, Leopold presents a few areas of research that are particularly important.

a. Scalable oversight. In other words, can a less intelligent verifier evaluate a more intelligent model? It is well known that verification is much easier than generation. Accordingly, how much less intelligent can the verifier be and still do its job well?

b. Generalization. By studying how current models generalize as they get smarter, maybe we can learn something about how a superintelligent agent will generalize as it gets a bit smarter. Accordingly, as we improve our models by the next OOM, we can try to be sure they won't generalize in unwanted ways.

c. Top-down interpretability. For example, can we create some kind of lie detector for an AI? (This is contrasted against “bottom-up interpretability”, which is an attempt to understand all of the thought processes and model weights that create any given output.)

d. Adversarial testing / measurements. This one is heavily related to all the others, but the main idea here is to try to encounter every possible failure mode in the lab before we publicly release the model.

If we understand these things (and a few others) sufficiently well, Leopold thinks that we'll muddle through, and that our automated AI researchers will be sufficiently aligned in order to align the next, smarter model.

As mentioned previously, I have a few critiques here. My main criticism is that Leopold masks the difficulty of the alignment problem by presenting an even harder problem (superalignment). Leopold’s predictions are predicated on the assumption that we can successfully align a regular-intelligence model. While I agree that this is in theory a solvable problem (one much easier than superalignment), we’re not on track to solve it in time. Leopold mentions that there’s a lot of low-hanging fruit in alignment research, with which I fully agree. But he also agrees that the number of people actually doing the work is terrifyingly small. So unless something drastic changes in the next few years (especially given Leopold’s short AGI timelines), it seems unlikely to me that we’ll actually solve it. Ultimately, this section reads to me like Leopold says “we’ll muddle through” and tries to make it convincing by creating a list of important research areas. Then he goes one by one, saying “we’ll muddle through” to each individually. That’s not an argument, even if it sounds like one.

It’s possible that Leopold’s model is something like “once America wakes up to AI and begins The Project (part IV), we’ll put a sufficient amount of funding and compute towards solving the alignment problem.” This is certainly possible, but our biggest companies are not doing so currently; I hope this changes drastically once we’re closer to true AGI.

Part III d. The Free World Must Prevail

If you're on board with everything that's been argued up until now, this one kinda comes for free. The main arguments here are as follows:

1. Whoever first creates AGI will end up with a massive military advantage.

Again, once you get an intelligence explosion, scientific progress would skyrocket on extremely short timelines, thus giving rise to technologies we can only imagine. This also probably includes more powerful weapons of mass destruction, and also defense against current nukes (whether it be disarmament, or protection/absorption).

2. America should lead in this, certainly over other options such as the CCP, North Korea, Russia, or some terrorist group from the Middle East.

There's not much else to say about this one.

Part IV. The Project

At the heart of this piece is one central descriptive claim: Once America becomes AGI-pilled, we'll start a government project of similar scale to the making of the atomic bomb. If this indeed happens, Leopold hopes that those in charge do a good job.

It’s important to note that the predictions of this section are predicated on his short AGI timelines. If creating AGI took 70 years instead of 7, the global AI landscape would look much different, including security measures.

Conclusions

In Situational Awareness, Leopold presents a surprisingly bleak prediction about the future. His main line of prediction seems to be something like: "sometime around 2026, China will steal some important algorithmic secret from the US, and we'll start to take things more seriously. We'll ramp up security, and hopefully the government will get involved. (If not, companies will likely be locked in a tight race, causing less time to be spent on alignment and making things enormously more dangerous. Also, if the government doesn't get involved, our security measures will not be enough to prevent espionage from the CCP.) Hopefully, we'll act fast enough that China doesn't steal our secrets. (Again, if we're locked in a race with China, we'll spend less time on alignment...) Ultimately, we'll muddle our way through the alignment problem, be the first country to create superintelligence (thanks to The Project), and then we'll all put our hands in our pockets and walk away into the sunset".

The two places I disagree most heavily with Leopold are a) the probability/timeline with which we reach AGI, and b) the probability with which we solve alignment. For both of these disagreements, the root seems to lie in the larger uncertainties I assign to potential futures. For example, while I accept the possibility of AGI within the next decade, I also think it is possible that we’ll see a reduction in capability gains with upcoming compute/algorithmic advancements; if this happens rapidly enough, we may not reach AGI anytime soon. Moreover, I put the probability that we solve the alignment problem lower than Leopold does. Again, it’s totally possible that we’ll figure it out before the first automated ML researcher, but I think we’re not quite on track to do so; I would not be surprised by either outcome.

With all of this said, Situational Awareness is an incredibly important piece. Leopold’s discussions of national security and The Project are especially so, given the small amount of existing dialogue. In this regard, I mostly agree with the arguments presented; it is of the utmost importance that AGI does not fall into the wrong hands.

For next week, we’ll be reading Economics in One Lesson by Henry Hazlitt.
