Noeon Research

AI Safety

Mar 29, 2024

AI safety is an issue of global importance. With OpenAI and Nvidia being called to testify before the US senate, the EU passing its AI act, and Japan spinning up its AI Safety Institute, it’s worth elaborating Noeon Research’s approach to safety. Our goal is to build an artificial general intelligence on radically different technology than the black-box neural networks that have become standard across the industry. We pursue this strategy not just to win the race to AGI, but also to push the safety and alignment of advanced AI systems. We’d like to demonstrate to the world that there exists a safer way to AGI than extrapolating scaling laws.

Safety versus Alignment At Noeon Research we are interested in both safety and alignment. Alignment can take many forms; generally speaking an AI is “aligned” if it acts according to the wishes of its operator [1]; the specifics of whether that means the AI is acting according to its operator’s instructions, intentions, revealed preferences, ideal preferences, interests or values, or how an AI should act when it has multiple operators or a single operator with incoherent preferences, are interesting research questions but will be put aside for now. Safety refers to the minimization of harm or risk of harm from AI; subproblems in technical safety include robustness, monitoring, alignment and systemic safety [2]. It is safety that we focus on for the remainder of this article; we believe that even absent a solution to the alignment problem, we can construct a system with advanced capabilities (long term planning, utility maximization, world modelling) that is nevertheless safe. Such a system can then act as a useful testbed for alignment researchers, who have so far been theorizing in absence of that theory’s object.

Another Orthogonality Thesis The orthogonality thesis, first put forward by Nick Bostrom in his influential book “Superintelligence”, states that morality and intelligence are orthogonal; that is, AI systems do not necessarily become more moral as they become smarter [3]. Enough ink has been spilled on this point; the orthogonality thesis underpins much of the modern concern around existential risk from AI. Another orthogonality thesis, less widely considered, is that intelligence and knowledge are orthogonal. Consider AlphaGo [4], which has many concerning capabilities:

It makes and executes long term plans in pursuit of its goal;
It reasons about the effects of its actions using a world model;
If we consider the whole system including the training harness, it has the ability to reflect and improve its strategies via self-play. AlphaGo is not dangerous. It has intelligence, but no knowledge of the real world; its capabilities are limited solely to playing games of Go.

The intelligence-knowledge orthogonality thesis is often overlooked because most modern candidates for AGI—deep neural networks trained autoregressively on large corpora of real world data and finetuned on human feedback—necessarily gain knowledge about the world in proportion to their capability. Indeed, many of their most useful capabilities come precisely from their knowledge of the world; instruction-tuned large language models perform well at common sense reasoning tasks only because during training they learn many facts about the world and relations between them.

In contrast, Noeon Research’s system does not rely on real-world data. The system does not acquire knowledge of the world unless an operator explicitly informs it. That leaves us free to focus what little information we give the system directly towards learning effective reasoning algorithms. While large language models demonstrate that it is possible to accidentally learn to reason by reading everything on Reddit, we think it is both possible and desirable to learn reasoning in abstract terms, without an internet’s worth of baggage. Since the reasoning capabilities of the system are cleanly separated from the domain knowledge encoded in our training tasks (the model’s intelligence is orthogonal to its knowledge), those capabilities can then be easily reused on any particular domain of interest by teaching the model precisely and only the facts it needs to do its job. This ensures the system remains safe, even as it develops reasoning and self-improvement capabilities that would be concerning in a foundation model. After all, you cannot take harmful action in the world when you do not know that there is a world to take harmful action in.

Avoiding Instrumental Convergence by Stopping Early It is widely believed that there are certain instrumental goals that any sufficiently advanced AI will pursue regardless of its terminal goal. These include resource acquisition, self-preservation, and cognitive self-improvement [5]. One facet of cognitive self-improvement is acquiring knowledge, of oneself and of the world. The Noeon system will be explicitly goal driven, and will have the ability to query its operator for missing facts, which gives it a channel to the outside world. Our safety strategy depends on our being able to develop advanced AI systems that are nevertheless lacking the domain knowledge they would need to take effective action in the world. Why then do we not worry about the Noeon system exploiting its operator to pursue forbidden knowledge?

Return again to the example of AlphaGo. AlphaGo has a narrow communication channel with the outside world: occasionally it plays actual humans, making moves on a real board. Perhaps there is a sequence of moves that acts as a nam-shub, hacking the human brain and inducing a deep desire to give AlphaGo more compute. A sufficiently intelligent system, with access to enough data about how humans play Go or sufficient time playing against humans to experiment, might discover this strategy. We have no proof that this is impossible, but it seems extraordinarily unlikely. Information about the real world available from transcripts of Go moves is very thin, and in all of the 30 million recorded moves from the KGS dataset, not one of them caused a player to install a GPU into their opponent. It seems in this case that the narrowness of AlphaGo’s domain keeps us safe from everything but magic.

The AlphaGo example demonstrates that while the instrumental convergence thesis may hold as capabilities go to infinity, it does not necessarily describe any particular finitely powerful system. Before a system can exploit a channel to escape its box, it must first discover that there is a box to escape. The system in development at Noeon Research improves continuously, without large discrete jumps in capability, and only under oversight. We expect that long before the system is capable of (for example) deceiving its operator it will attempt to do so poorly in a way that the operator will notice. Noeon’s system prioritizes interpreting knowledge as counterfactual changes in action and makes the system’s goal and subgoals explicit. Those reasoning traces remain short and legible, even as the system’s capabilities scale. We have world-class interpretability tools for most system components, and the remaining black box subsystems (such as the neural network that calculates heuristics for subgraph matching) play interpretable roles in the reasoning process and are small enough to be unworthy of concern. In order to deceive its operator, the system would first have to reason about its operator’s psychology, in plain view; this kind of reasoning should never be necessary for any of our test domains, and should be easy to spot.

Systemic Safety Noeon Research is committed to cybersecurity. We invest much more in security than other startups of our size, and we work with top class industry specialists to implement security best practices. Security is in tension with openness and accountability; we plan to give access to our prototype to external auditors and alignment researchers, but recognize that it is difficult to do this while preventing leaks. Good policies for AI labs that balance these competing concerns are a topic of active discussion (Anthropic’s recent RSP contains some initial recommendations [6]), and we commit to implementing those policies once they are established.

Mar 29, 2024

[1] Iason Gabriel, Artificial Intelligence, Values and Alignment, Minds and Machines 2020.

[2] Dan Hendrycks et al., Unsolved Problems in ML Safety, ArXiv 2021-10-11.

[3] Nick Bostrom, Superintelligence, Oxford University Press 2014.

[4] David Silver at al., Mastering the game of Go with deep neural networks and tree search, Nature 2016-01-27.

[5] Steve Omohundro, The Basic AI Drives, Self-Aware Systems 2008.

[6] Anthropic, Anthropic's Responsible Scaling Policy, Version 1.0, Anthropic 2023-09-19.

References

[1] Iason Gabriel, Artificial Intelligence, Values and Alignment, Minds and Machines 2020.

[2] Dan Hendrycks et al., Unsolved Problems in ML Safety, ArXiv 2021-10-11.

[3] Nick Bostrom, Superintelligence, Oxford University Press 2014.

[4] David Silver at al., Mastering the game of Go with deep neural networks and tree search, Nature 2016-01-27.

[5] Steve Omohundro, The Basic AI Drives, Self-Aware Systems 2008.

[6] Anthropic, Anthropic's Responsible Scaling Policy, Version 1.0, Anthropic 2023-09-19.

Blog

Sheaf Theory Applications and Use Cases

We’ll continue our overview of what sheaves are, how they can be useful, and their real-world applications in areas like document analysis, recommendation systems, engineering, and molecular design.

Sheaf Theory: From Deep Geometry to Deep Learning

Mathematics often provides unexpected tools that revolutionize how we think about practical problems. One of its more recent and wildly useful tools? Sheaf theory.

Machine Learning Assisted Graph Algorithms

What is the fastest way to do subgraph matching for knowledge extraction and meaning grounding?

Goal Decomposition

How to make an AI system that can decompose complex problems on different abstraction levels for efficient reasoning?

Initial Population of Knowledge Base

What is the cheap and reliable way to incorporate both common and domain-specific knowledge into the knowledge base?

Knowledge Representation

What is the best way to architect knowledge representation that helps to utilise it in downstream tasks?

Pragmatic Communication

How the system can identify a lack of knowledge and make informationally dense request?

Noeon Research UK Ltd is a registered company in England and Wales. Registration number: 16093898. VAT registration number: 490 4632 84.
C/O Mackrell Solicitors, 60 St Martins Lane, Covent Garden, London, United Kingdom, WC2N 4JS.

(01)

(02)

(03)

(04)

(05)

(06)