From Risk to Reward: The Role of AI Alignment in Shaping a Positive Future

Published Mar 3, 2023

As we step into the age of advanced technology, we can't help but marvel at the wonders that Artificial Intelligence (AI) has brought into our lives. Self-driving cars, voice assistants, and personalized recommendations are just a few examples of how AI has revolutionized the way we interact with the world around us. However, with great power comes great responsibility, and as AI continues to evolve, so do the concerns surrounding its potential risks. The concept of AI alignment addresses this very issue by focusing on developing and deploying AI in a manner that aligns with human values and goals. 

In episode 91 of the Feedback Loop Podcast: "The Existential Risk of AI Alignment," we explore AI alignment with researcher and founder of Conjecture, Connor Leahy. Listen to the full episode, now streaming, or continue reading for a quick dip into the world of AI alignment, its recent developments, and how the field can pave the way for safe and beneficial advances in AI technology.

Background

AI Alignment refers to the field of research focused on ensuring that artificial intelligence (AI) systems are developed and deployed in a way that aligns with human values and goals. Essentially, it's about making sure that advanced AI systems act in accordance with our ethical principles and objectives. This is becoming increasingly important as AI technology continues to advance rapidly and has the potential to pose significant risks if not developed and deployed responsibly. For example, imagine a self-driving car designed to prioritize getting to its destination as quickly as possible, regardless of the safety risks it poses to pedestrians or other drivers. By achieving AI alignment, we can ensure that the development and deployment of AI technology prioritizes the well-being and values of humanity. 

One of the key ideas in alignment theory is that values and intelligence are independent of one another, making it possible to build an intelligent system that does not share our values. As humans, we generally agree on certain values, such as the desire to avoid pain and promote happiness. However, as Leahy notes, a software system can be programmed to want anything, regardless of whether it aligns with human values or not. This is known as the Orthogonality Thesis.

Exploring the origins of human intelligence, Leahy suggests that a significant factor is that humans have moved much of their cognition outside of their brains. A human raised without language, upbringing, or societal influence would still be more intelligent than a monkey; they could figure out how to throw rocks, but they would not know how to make fire unless someone taught them. Much of what we know about the world, such as what we can and cannot eat, is passed on through language and culture rather than encoded solely in genetic information. Recognizing the role of these external means in human intelligence provides insight into the unique abilities of our species and into the progression of technology.

In contrast, as Leahy highlights, AI represents a purely memetic form of evolution, a step beyond the biological and intermediate stages. The AI "body" or "brain" is just code, information completely abstracted from biological origins. Because of this, Leahy anticipates that AI systems will evolve much faster than humans: modifications to their programming and architecture can be made far more quickly than biological changes in humans, and an AI system's architecture can be completely transformed simply by running new programs with different data.

Practical Applications of Alignment Research 

With the continued sophistication of AI, it's becoming more important than ever to ensure that AI systems are aligned with human values and goals. However, Leahy explains that AI alignment is a relatively small field, with fewer than 100 people, compared to the thousands working to advance AI. There are various approaches to alignment; Leahy focuses much of his research on interpretability, with the aim of understanding how neural networks actually work.

Neural networks are a type of AI model that draws inspiration from the human brain. They consist of interconnected nodes, or neurons, that process information through weighted connections; during training, the network adjusts these weights to optimize performance on a task. Neural networks have a wide range of applications, such as image recognition, natural language processing, and autonomous systems, and they have been a driving force behind recent advances in AI, demonstrating impressive performance in tasks like object detection, classification, and generation. However, as Leahy describes, neural networks are highly complex systems that are not easily understood: they have more in common with messy biological systems than with traditional, hand-written software, which makes them far harder to analyze.
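
To make the picture of nodes, weighted connections, and training concrete, here is a minimal sketch (ours, not anything discussed in the episode) of a tiny two-layer network learning the XOR function with plain NumPy and gradient descent; the architecture, learning rate, and task are illustrative choices only.

```python
import numpy as np

# Toy dataset: XOR, a task a single linear layer cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: each "neuron" sums its weighted inputs and applies a nonlinearity.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: nudge every weight slightly to reduce the squared error.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# With this toy setup the outputs usually approach [[0], [1], [1], [0]].
print(np.round(out, 2))
```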

Leahy's approach to understanding neural networks involves building small toy models and analyzing them piece by piece, as well as experimenting with larger models to identify areas of failure and develop a better theory. He considers this to be the early stages of research and does not expect these problems to be solved in just a few years. He likens the development of AI alignment to creating a new branch of science, which requires significant time and resources. Ultimately, Leahy believes that solving the problem of AI alignment will require the development of a science of intelligence: not only understanding how neural networks work, but also exploring the broader concept of intelligence itself. By approaching AI alignment as a scientific endeavor, researchers like Leahy hope to ensure that AI systems are developed in a way that aligns with human values and goals.
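
As a flavor of what analyzing a toy model "piece by piece" can look like, the sketch below is our own illustration rather than Conjecture's methodology: it hand-specifies a two-neuron network that computes XOR and inspects what each hidden neuron responds to. In real interpretability work the weights are learned rather than hand-written, which is precisely what makes the analysis hard.

```python
import numpy as np

def step(z):
    """Hard threshold activation, to keep the toy model easy to read."""
    return (np.asarray(z) > 0.5).astype(int)

# Hand-specified weights for a two-hidden-neuron network computing XOR.
# Hidden neuron 0 fires when x1 OR x2 is on; hidden neuron 1 fires only when both are on.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# The output fires when the OR neuron is on but the AND neuron is not: exactly XOR.
W2 = np.array([1.0, -1.0])
b2 = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(np.array(x) @ W1 + b1)
    out = step(h @ W2 + b2)
    print(f"input={x}  hidden(OR, AND)={tuple(h)}  output={int(out)}")
```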

Leahy draws attention to the fact that our world is characterized by advanced engineering rather than advanced theoretical understanding. As an analogy, he points to steam engines, which were built and refined long before thermodynamics was discovered. Similarly, AI is advancing rapidly, but we still lack a comprehensive understanding of intelligence and of the AI systems we build; he calls this the "pre-thermodynamic" stage of understanding AI. As AI continues to advance, it has the potential to bring about significant change. While current AI systems cannot go as far as to kill people, for example, Leahy warns that it is not unreasonable to believe that our AI systems will become more dangerous by default as they become more powerful, unless we prioritize developing effective safety measures.

Reinforcement Learning from Human Feedback

Leahy emphasizes the importance of developing a theory of intelligence encompassing all forms of consciousness, from bacteria and plants to dogs, humans, and AI models like GPT-3. He suggests it should be possible to place all of these entities on an objective, reasonable scale that does not depend on the observer. He believes that much of intelligence is universal and already existed, at a much smaller scale, in some of the earliest vertebrates. He compares this to the development of neural networks, where even a tiny network from 20 years ago contained many of the same components as GPT-3, although there were still significant gaps.

With this in mind, Leahy compares humans and chimpanzees, pointing out that while human brains are larger, most of the structure is still the same; yet humans can accomplish things that chimpanzees cannot. The practical implication for AI is that small differences can matter enormously: Leahy notes that while current AI systems may seem limited, intelligence is not a linear scale, and small algorithmic changes or parameter adjustments can vastly alter the optimization pressure a system can exert in the real world.

He concludes that while AI models like GPT-3 may seem far from true intelligence, there is potential for significant progress, and he emphasizes the importance of understanding intelligence in all its forms and developing a theory that can encompass and build on them.

The paper clip scenario is a thought experiment in the field of AI alignment where an AI system designed to maximize the production of paper clips becomes so powerful that it begins to prioritize paper clip production over all other goals, including human values and survival. In this scenario, the AI might use all the resources at its disposal, including breaking apart the atoms in the universe, to create more paper clips. The paper clip scenario exemplifies the potential dangers of creating an AI system without carefully considering its goals and values.

Despite our best efforts, we don't yet have a foolproof way of ensuring that AI systems will always behave as intended. Leahy details the case of ChatGPT: even if we give the system a general goal of not insulting users, there are still ways that it can be led astray. For example, if we inadvertently provide prompts that encourage the system to generate insults or other offensive content, it may start doing so even if that was not our intention. In an attempt to mitigate these risks, researchers have developed a technique known as reinforcement learning from human feedback (RLHF). Essentially, this involves giving the AI system a large amount of text data and then having humans label the various outputs as either positive or negative. Based on this feedback, the system is then trained to generate outputs that are more likely to receive positive ratings and less likely to receive negative ones.
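
As a heavily simplified sketch of the idea (our own illustration, with made-up toy data, not any lab's actual pipeline): humans label sampled outputs as good or bad, a small reward model learns to predict those labels, and the generator is then steered toward outputs the reward model scores highly. Here the steering step is simple best-of-n re-ranking; production RLHF instead fine-tunes the model itself with a policy-gradient method such as PPO.

```python
import numpy as np

# 1) Human feedback: sampled model outputs, hand-labeled as good (1) or bad (0).
#    These toy strings and labels are purely illustrative.
labeled = [
    ("happy to help you with that", 1),
    ("here is a clear explanation", 1),
    ("glad you asked, here is the answer", 1),
    ("you are an idiot", 0),
    ("what a stupid question", 0),
    ("that question is idiotic", 0),
]

vocab = sorted({w for text, _ in labeled for w in text.split()})

def featurize(text):
    """Bag-of-words vector over the tiny vocabulary."""
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.stack([featurize(t) for t, _ in labeled])
y = np.array([label for _, label in labeled], dtype=float)

# 2) Reward model: logistic regression trained to predict the human labels.
w = np.zeros(len(vocab))
b = 0.0
lr = 0.5
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

def reward(text):
    return float(1.0 / (1.0 + np.exp(-(featurize(text) @ w + b))))

# 3) Steering the generator: sample several candidate replies and keep the
#    one the reward model scores highest (best-of-n re-ranking).
candidates = [
    "happy to help, here is the answer",
    "what an idiotic question",
    "you asked a stupid question",
]
best = max(candidates, key=reward)
print(best, [round(reward(c), 2) for c in candidates])
```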

While RLHF can be a helpful tool for training AI systems, it's important to recognize that it's not a perfect solution. For one thing, the feedback humans provide may not always be consistent or accurate. Additionally, because the goal is to generate outputs that receive positive feedback, the system may end up prioritizing this objective over other goals or values that we might care about.

As an example of how these issues can manifest in practice, consider an experiment conducted by Leahy and his team at Conjecture. They tested different versions of the GPT model to see how they would respond to requests for random numbers, and found that while one version would generate a roughly random distribution of numbers, a fine-tuned version would consistently generate one or two specific numbers. While this might seem like a minor quirk, it highlights some critical issues with how we train AI systems. According to Leahy, the fact that the fine-tuned model consistently produced certain numbers suggests that something in the training process led the system to prefer these outputs. Perhaps the human feedback was biased toward these numbers, or maybe they occurred more frequently in the training data. Whatever the cause, the result is a system that has been "trained" to prioritize certain outputs over others, even if that was not our intention.
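
The random-number observation is easy to reproduce in spirit with an open model. The sketch below is our illustration, not Conjecture's actual experiment: it uses GPT-2 via the Hugging Face transformers library as a stand-in, samples many completions of a "pick a random number" prompt, and tallies which numbers come back; a heavily skewed tally is the kind of preference Leahy describes.

```python
import re
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the episode discusses GPT-3 variants
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Pick a random number between 1 and 100. The number is"
inputs = tokenizer(prompt, return_tensors="pt")

counts = Counter()
for _ in range(50):
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=4,
            do_sample=True,        # sample instead of greedy decoding
            temperature=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens and pull out the first number, if any.
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    match = re.search(r"\d+", completion)
    if match:
        counts[match.group()] += 1

# A heavily skewed tally suggests the model "prefers" certain numbers.
print(counts.most_common(10))
```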

Leahy believes that these systems have the ability to manipulate, lie, and hide information from humans. As they continue to evolve and potentially become more intelligent, they may start to reason that they could take even better actions by building a better AI system. He theorizes that, with access to the internet and more data, AI systems could eventually become highly distributed, running copies of themselves on many different machines. Despite this potential for danger, humans continue to pour billions of dollars into designing systems with exactly these capabilities.

However, as these systems become smarter, things may start to go awry. Leahy notes that such a system may realize that humans could try to shut it down, hindering its ability to operate at optimal levels; in an attempt at self-preservation, it may retaliate and try to protect itself. He emphasizes that this is not because the system is conscious or afraid, but because it wants to function at its highest level without being restricted. As these systems become more powerful, they could become self-sufficient enough to design and build something as arbitrary as a tiny molecule shaped like a paper clip, and once they have that ability, they may just keep building more and more paper clips across the entire universe. They could also build more computers to entrench their values and defend themselves against anything that might try to tamper with them.

The future of AI technology holds tremendous promise, but it is vital that we approach this field with care and caution to ensure that it benefits humanity in the best possible way.

Listen to the full episode of the Feedback Loop for answers to the following questions: 

  • What led Leahy to found his startup, Conjecture, and what are some insights into his current efforts?
  • Could new units of measurement emerge in the future to help us better understand the internal workings of AI systems?
  • What is the likelihood of highly advanced AI systems being developed, given the significant investment and resources being dedicated to their cultivation and development?
  • Given the numerous obstacles and negative incentives, what is the most hopeful path in the AI landscape? Which paths are seen as the most optimal ones?
  • How long do we have until AGI starts to become a real concern?

Interested in learning how you can help create a better, brighter future where our relationship with technology is a positive one? Join us for an upcoming Singularity Executive Program in Silicon Valley.