alt.hn

5/22/2025 at 7:25:10 PM

Problems in AI alignment: A scale model

https://muldoon.cloud/2025/05/22/alignment.html

by hamburga

5/23/2025 at 2:29:29 AM

I think this post is sort of confused because, centrally, the reason "AI Alignment" is a thing people talk about is that the problem, as originally envisioned, was figuring out how to avoid having superintelligent AI kill everyone. For a variety of reasons the term no longer refers primarily to that core problem, so the fact that so many things that look like engineering problems carry that label is mostly a historical artifact.

by comp_throw7

5/23/2025 at 5:28:08 AM

  > as originally envisioned

This was never the core problem as originally envisioned. This may be the primary problem that the public was first introduced to, but the alignment problem has always been about the gap between intended outcomes and actual outcomes. Goodhart's Law[0].

Super-intelligent AI killing everyone, or even super-dumb AI killing everyone, is a result of the alignment problem given enough scale. You don't jump to the conclusion of AI killing everyone and then post hoc explain it through reward hacking; you recognize reward hacking and extrapolate. This is also why it is so important to look at the engineering problems and at what happens at smaller scales, *because ignoring all those problems is exactly how you create the scenario of AI killing everyone...*
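
To make that concrete, here's a minimal toy sketch of reward hacking (the reward function and actions are entirely hypothetical, not from any real system): the intent is a clean room, but the reward only measures what a camera sees, so the proxy-maximizing action is to cover the camera.

  # Toy example: the proxy reward ("no dirt visible to the camera") diverges
  # from the intended goal ("the room is actually clean").
  ACTIONS = ["clean_room", "cover_camera", "do_nothing"]

  def true_utility(action):
      # What we actually wanted: the room gets cleaned.
      return {"clean_room": 1.0, "cover_camera": 0.0, "do_nothing": 0.0}[action]

  def proxy_reward(action):
      # What we actually measure: 1 - fraction of camera pixels showing dirt.
      visible_dirt = {"clean_room": 0.05, "cover_camera": 0.0, "do_nothing": 0.6}
      return 1.0 - visible_dirt[action]

  best = max(ACTIONS, key=proxy_reward)
  print(best, true_utility(best))  # cover_camera 0.0 -- proxy maximized, intent missed

The extrapolation is just letting a much stronger optimizer search a much larger action space across the same gap between proxy and intent.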

[0] https://en.wikipedia.org/wiki/Goodhart%27s_law

[Side note] Even look at Asimov and his robot stories. The majority of them are about alignment. His 3 laws were written as things that sound good and whose intent would be clear to any reader, and then he pulls the rug out from under you, showing how naively they're defined and that it isn't so obvious. Kinda like a programmer teaching their kids to make a PB&J sandwich... https://www.youtube.com/watch?v=FN2RM-CHkuI

by godelski

5/23/2025 at 3:32:48 PM

But Asimov never called it alignment: he never used that word or the phrase "aligned with human values". The first people to use that word and that phrase in the context of AI (about 10 to 13 years ago) were concerned mainly with preventing human extinction or something similarly terrible happening after the AI's capability has exceeded human capabilities across all relevant cognitive skills.

BTW, it seems futile to me to try to prevent people from using "AI alignment" in ways not intended by the first people to use it (10 to 13 years ago). A few years ago, writers working for OpenAI started referring to the original concept as "AI superalignment" to distinguish it from newer senses of the phrase, and I will follow that convention here.

>the alignment problem has always been about the gap between intended outcomes and actual outcomes. Goodhart's Law.

Some believe Goodhart captures the essence of the danger; Gordon Seidoh Worley is one such. (I can probably find the URL of a post he wrote a few years ago if you like.) But many of us feel that Eliezer's "coherent extrapolated volition" (CEV) plan published in 2004 would have prevented Goodhart's Law from causing a catastrophe if the CEV plan could have been implemented in time (i.e., before the more reckless AI labs get everyone killed), which looks unlikely to many of us now (because there has been so little progress on implementation of the CEV plan in the 21 years since 2004).

The argument that persuaded many of us is that people have a lot of desires: the algorithmic complexity of human desires is at least dozens or hundreds of bits of information, and it is unlikely for that many bits to end up in the right place inside the AI by accident, or by any process other than human efforts that show much, much more mastery of the craft of artificial-mind building than shown by any of the superalignment plans published up to now.

One reply made by many is that we can hope that AI (i.e., AIs too weak to be very dangerous) can help human researchers achieve the necessary mastery. The problem with that is that the reckless AI researchers have AIs helping them, too, so the fact that AIs can help people design AIs does not ameliorate the main problem: we expect it to prove significantly easier to create a dangerously capable AI than to keep a dangerously capable AI aligned with human values. Our main reason for believing that is the rapid progress made on the former (especially since the start of the deep-learning revolution in 2006) compared to the painfully slow and tentative-speculative progress made on the latter since public discussion of it began in 2002 or so.

by hollerith

5/23/2025 at 5:43:28 PM

> The argument that persuaded many of us is that people have a lot of desires, i.e., the algorithmic complexity of human desires is at least dozens or hundreds of bits of information

I would really try to disentangle this.

1. I don't know what my desires are.

2. "Desire" itself is a vague word that can't be measured or quantified; where does my desire for "feeling at peace" get encoded in any hypothetical artificial mind?

3. People have different and opposing desires.

Therefore, Coherent Extrapolated Volition is not coherent to me.

This is kind of where I go when I say that any centralized, top-down "grand plan" for AI safety is a folly. On the other hand, we all contribute to Selection.

by hamburga

5/23/2025 at 6:09:17 PM

>I don't know what my desires are.

No need: it would be the AI's job to find out (after it has become very very capable), not your job.

>"Desire" itself is a vague word that can't be measured or quantified

There are certain ways the future might unfold that would revolt you or make you very sad and others that don't have that problem. There is nothing vague or debatable about that fact even if we use vague words to discuss it.

Again, even the author of the CEV plan no longer puts any hope in it. My only reason for bringing it up is to flesh out my assertion that there are superalignment plans not vulnerable to Goodhart's Law/Curse, so Goodhart's Law cannot be the core problem with AI: at the very least, the core problem would need to be a combination of Goodhart with some other consideration, and I have been unable to imagine what that other consideration might be, unless perhaps it is the fact that all the alignment plans I know of that are not vulnerable to Goodhart would be too hard to implement in the time humanity has left before unaligned AI kills us or at least permanently disempowers us. But even then it strikes me as misleading or outright wrong to describe Goodhart as the core problem just because there probably won't be enough time to implement a plan not vulnerable to it. It seems much better to describe the core problem as the ease with which a non-superaligned AI can be created relative to how difficult it will be to create a superaligned AI.

Again "superaligned" means the AI stays aligned even if its capabilities grow much greater than human capabilities.

by hollerith

5/23/2025 at 7:44:46 PM

  > not vulnerable to Goodhart's Law/Curse

I'm going to need some good citations on that one.

CEV does not resolve Goodhart's Law. I'm really not sure you even can!

Let me give a really basic example to show you how something you might assume is perfectly aligned actually isn't.

Suppose you want to determine how long your pen is. You grab out your ruler and measure it, right? It's 150 mm, right? Well... no... That's at least +/- 1mm. But that's according to the ruler. How good is your ruler? What's the +/- value from an actual meter? Is it consistently spaced along the graduations? Wait, did you mean clicker open or closed? That's at least a few mm difference.

If you doubt me, go grab as many rulers and measuring devices as you can find. I'm sure you'll find differences. I know in my house I have 4 rulers and none of them agree to within 250 µm. It's easy to see the differences between them, though they are pretty close and good enough for any task I'm actually using them for. But if you wanted me to maximize the pen's size, you can bet I'm not going to randomly pick a ruler... I'm going to pick a very specific one... Because what are my other options? I can't make the pen any bigger without making an entirely new one or without controlling spacetime.
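
Here's a rough sketch of that selection effect (the numbers are made up; the mechanism is the point): each ruler's error can be individually unbiased, yet if I get to pick whichever ruler reads longest, the reported length is systematically too long.

  import random

  TRUE_LENGTH_MM = 150.0

  def measured_length(n_rulers=4, sigma_mm=0.5):
      # Each ruler is individually unbiased, just noisy...
      readings = [random.gauss(TRUE_LENGTH_MM, sigma_mm) for _ in range(n_rulers)]
      # ...but I pick whichever one makes the pen look longest.
      return max(readings)

  trials = [measured_length() for _ in range(100_000)]
  bias = sum(trials) / len(trials) - TRUE_LENGTH_MM
  print(f"average overstatement: {bias:.2f} mm")  # ~0.5 mm, from selection alone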

The point is that this is a trivial measurement where we take everything for granted, yet the measurement isn't perfectly aligned with the intent of the measurement. We can't even do this fundamentally with something as well defined as a meter! The physics will get in the way and we'd have to spend exorbitant amounts of money to get down to the nm scale. These are small amounts of misalignment and frankly, they don't matter for most purposes. But they do matter based on the context. It is why when engineers design parts it is critical to include tolerances. Without them, you haven't actually defined a measurement!

So start extrapolating this. How do you measure to determine "what is a cat"? How do you measure happiness? How do you measure any of that stuff? Even the warped wooden meter stick you see in every Elementary School classroom provides a more well defined measurement than any tool we have for these things!

We're not even capable of determining how misaligned we are!

And that was the point of my earlier post. These are the same thing! What do you think the engineering challenges are?! You're talking about a big problem and complaining that we are breaking it down into smaller workable components. How else do you expect us to fix the big problem? It isn't going to happen through magic. It happens by factoring the problem into key components that can be more easily understood in isolation, and then working back up by adding complexity. We're sure not going to solve the massively complicated problem if we aren't allowed to try to solve the overly simple, naive versions first.

by godelski

5/23/2025 at 5:24:53 PM

  > Asimov never called it alignment

He also never said "super intelligence", "general intelligence", or a ton of other things. Why would he? Jargon changed. Doesn't mean what he discussed changed.

So it doesn't matter. The fact that someone coined a better term for the concept doesn't mean it isn't the same thing. So of course it gets talked about in the way you see because it has been the same concept the whole time.

If we're really going to nitpick, then the phrase as coined was not about killing everyone but about aligning with human values. Much broader, and the connection is clearer. It implies killing, but it's still the same problem. (Come on, Asimov's stuff was explicitly about aligning with human values; it would be silly to say it isn't.)

So by your logic we would similarly have to conclude that Asimov never talked about artificial super intelligence despite Multivac's various upgrades, up to making a whole universe. He never says "ASI" in "The Last Question", but clearly that's what is being discussed. Similarly you'd argue that Asimov only discussed artificial intelligence but never artificial general intelligence. Are none of those robots general? Is Andrew, from The Positronic Man, not... "general"? Not sentient? Not conscious? The robot literally transforms into a living, breathing human!

So I hope you agree that it'd be ridiculous to make such conclusions in these cases. The concepts were identical, we just use slightly different words to describe them now and that isn't a problem.

It's only natural that we say "alignment" instead of "steering", "reward hacking", or the god awful "parasitic mutated heuristics". It's all the same thing and the verbiage is much better.

by godelski

5/23/2025 at 2:46:52 AM

Not totally following your last point, though I do totally agree that there is this historical drift from “AI alignment” referring to existential risk, to today, where any AI personality you don’t like is “unaligned.”

Still, “AI existential risk” is practically a different beast from “AI alignment,” and I’m trying to argue that the latter is not just for experts, but that it’s mostly a sociopolitical question of selection.

by hamburga

5/23/2025 at 3:48:56 AM

What I understand from what GP was saying is that AI Alignment today is more akin to trying to analyze and reduce error in an already fitted linear regressor than to aligning AI behaviour and values with expected ones.

Perhaps that has to do with the fact that aligning LLM-based AI systems has become a pseudo-predictable engineering problem solvable via a "target, measure, and iterate" cycle rather than the highly philosophical and moral task old AI Alignment researchers thought it would be.

by mrbungie

5/23/2025 at 4:13:17 AM

Not quite. My point was mostly that the term made more sense in its original context rather than the one it's been co-opted for. But it was convenient for various people to use the term for other stuff, and languages gonna language.

by comp_throw7

5/23/2025 at 3:45:45 AM

> historical drift from “AI alignment” referring to existential risk, to today, where any AI personality you don’t like is “unaligned.”

Alignment has always been "what it actually does doesn't match what it's meant to do".

When the crowd that believes that AI will inevitably become an all-powerful God owned the news cycle, alignment concerns were of course presented through that lens. But it's actually rather interesting if approached seriously, especially when different people have different ideas about what it's meant to do.

by tbrownaw

5/23/2025 at 1:06:01 AM

I'm kind of upset to see "Alignment" and "AI Safety" systematically co-opted to mean "undesirable business outcomes".

These are existential problems, not mild profit blockers. It's almost like the goals of humanity and these companies are misaligned.

by blamestross

5/23/2025 at 1:45:51 AM

  > It's almost like the goals of humanity and these companies are misaligned.

Certainly. I'd say that we've created a lot of Lemon Markets, if not an entire Lemon Economy[0]. The Lemon Market is literally an alignment problem, resulting from asymmetric information. Clearly the intent of the economy (via our social contract) is that we allocate money towards things that provide "value", where I think we generally interpret that word to mean bettering people's lives in some form or another. But it is also clear that the term takes on other definitions and isn't perfectly aligned with making us better. Certainly our metrics can be hacked, as in the case of Lemon Markets.

A well-functioning market has competition that not only drives down prices but increases the quality of products. Obviously customers want to simultaneously maximize quality and minimize price. But when customers cannot differentiate quality, they can only minimize price. This leads to a feedback loop where producers race to the bottom, sacrificing quality in favor of driving down prices (and thus driving up profits). Not because this is actually what customers want, but because the market is inefficient.
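
A rough sketch of that feedback loop in code (stylized numbers, not a calibrated model): buyers can't observe quality, so they only pay the average value of what's on offer; sellers whose goods are worth more than that price exit, and average quality ratchets downward each round.

  qualities = [q / 10 for q in range(1, 11)]  # ten sellers, quality 0.1 .. 1.0

  for round_no in range(5):
      if not qualities:
          break
      price = sum(qualities) / len(qualities)           # buyers pay expected quality
      qualities = [q for q in qualities if q <= price]  # above-average sellers exit
      print(round_no, round(price, 2), "sellers left:", len(qualities))

Within a few rounds only the lowest-quality goods remain, which is the race to the bottom described above.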

I think critical to these alignment issues is that they're not primarily driven by people trying to be malicious or deceptive. They are more often driven by being short-sighted and overlooking subtle nuances. They don't happen all at once, but instead slowly creep, making them more difficult to detect. It's like good horror: you might know something is wrong, but by the time you put it all together you're dead. It isn't because anyone is dumb or doing anything evil, but because maintaining alignment is difficult and mistakes are easy.

[0] https://en.wikipedia.org/wiki/The_Market_for_Lemons

by godelski

5/23/2025 at 4:01:34 AM

No, we are not on track to create a literal god in the machine. Skynet isn't actually real. LLM systems do not have intent in the way that is presupposed by these worries.

This is all much, much less of an existential threat than, say, nuclear-armed countries getting into military conflicts, or overworked grad students having lab accidents with pathogen research. Maybe it's as dangerous as the printing press and the wars that it caused?

by tbrownaw

5/23/2025 at 5:40:43 AM

It's a much greater existential threat. An entity with intent, abductive reasoning, and self-defined goals is more interpretable. They can fill in the gaps between the letter of an instruction and the intent of an instruction. They may have their own agendas, but they are able to interpret through those gaps without outside help.

But machines? Well, they have none of that. They're optimized to make errors difficult to detect. They're optimized to trick you, even as reported by OpenAI[0]. It is a much greater existential threat than the overworked grad student because I can at least observe the grad student getting flustered and making mistakes, and I get much more warning, by the very nature of overworking them. You can see it on their face. But the machine? It'll happily chug along.

Have you never written a program that ends up doing something you didn't intend it to?

Have you never dropped tables? Deleted files? Destroyed things you never intended to?

The machine doesn't second guess you, it just says "okay :)"

[0] https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def563...

by godelski

5/23/2025 at 5:48:00 AM

We disagree on that.

by hollerith

5/23/2025 at 2:20:13 AM

Agreed. I see this more and more as the AI safety discourse spills more into the general lexicon and into PR efforts. For example, the "sycophantic" GPT-4o was also described as "misaligned" as code for "unlikable." In the meme, I filed this under "personality programming." Very different from the kinds of problems the original AI alignment writers were focused on.

by hamburga

5/23/2025 at 3:51:30 AM

https://slatestarcodex.com/2014/07/30/meditations-on-moloch/ is the essay that crystallizes the reasons that Selection and Markets are not the forces we want aligning AI (or much else).

In short (it is a very long article) fitness is not the same as goodness (by human standards) and so selection pressure will squeeze out goodness in favor of fitness, across all environments and niches, in the long run.

by benlivengood

5/23/2025 at 2:46:45 PM

Why can’t we infuse Selection with Goodness? We’re the ones doing the selecting. We’ve selected out things like chattel slavery, for example.

(Disclaimer: fell asleep after 10 minutes of reading the SSC post last night. I know it’s part of the HN Canon and perhaps I’m missing something)

by hamburga

5/23/2025 at 10:01:34 AM

the world was completely unprepared for fully autonomous botnets

by shwouchk

5/23/2025 at 8:27:15 AM

>Why isn’t there a “pharmaceutical alignment” or a “school curriculum alignment” Wikipedia page?

>I think that the answer is “AI Alignment” has an implicit technical bent to it. If you go on the AI Alignment Forum, for example, you’ll find more math than Confucius or Foucault.

What an absolutely insane thing to write. AI Alignment is different because it is trying to align something which is completely human-made. Every other field is "aligned" when the humans in it are "aligned".

Outside of AI, "alignment" is the subject of ethics (what is wrong and what is right) and law (how do we translate ethics into rules).

What I think is absolutely important to understand is that throughout human history "alignment" has never happened. For every single thing you believe to be right there existed a human who considered that exact thing as completely wrong. Selection certainly has not created alignment.

by constantcrying

5/23/2025 at 11:53:34 AM

> Every other field is "aligned" when the humans in it are "aligned"

That doesn't seem like the whole story. Pick two countries, for instance, one of which has evolved to be democratic (with high regard for rule of law, etc.) and the other is dictatorial. How did these countries end up the way they did? It probably has to do with rules, not just default human qualities.

Let's say you consider popular participation to be good. Then you could say the humans who live in the first country are more "aligned" than the second, but the mechanisms of their forms of government also play part. E.g. if the bureaucracy is set up so that skillfully stabbing others in the back gets you political clout, the selection process will marginalize or kick out people who don't want to engage in backstabbing.

Any organization's behavior depends on some combination of what its incentives promote and on the qualities of its members. This makes AI alignment just an extreme on a scale, not a thing set apart from all other kinds of alignment. The AI alignment problem is the "all rules" extreme of the scale, and organizational alignment is some combination of rules and the inclinations of the humans who are part of it.

The ethics problem of "what does 'aligned' mean anyway" would both apply to the AI situation and the mixed organization situation. A dictator might want an AI "aligned" to maximize his own power, and would also want a human organization to be engineered in such a way as to be both obedient and effective. Someone of a more democratic predisposition would have other priorities - whether they are of what AIs should do or what human organizations should do.

by kalavan

5/23/2025 at 3:03:52 PM

Thank you for this. It gets exactly to the heart of the issue and what I sense is being missed in the AI alignment conversation. “What does ‘aligned’ mean” is an ethical/political question; and when people skip over that, it’s often to (1) smuggle in their own ethics and present them as universal, or (2) run away from the messy political questions and towards the safe but much narrower domain of technical research.

by hamburga

5/23/2025 at 2:56:33 PM

> AI Alignment is different because it is trying to align something which is completely human made.

Not sure what you’re getting at here; pharmaceuticals are also human made. The point in the blog post was that we should also want drugs (for example) to be aligned to our values.

> What I think is absolutely important to understand is that throughout human history "alignment" has never happened.

Agree with that. This is a journey, not a destination. It’s a practice, not a mathematical problem to be solved. With no end in sight. In the same way that “perfect ethics” will never be achieved.

by hamburga

5/23/2025 at 4:49:43 AM

> While Nature can’t do its selection on ethical grounds, we can, and do, when we select what kinds of companies and rules and power centers are filling which niches in our world. It’s a decentralized operation (like evolution), not controlled by any single entity, but consisting of the “sum total of the wills of the masses,” as Tolstoy put it.

Alternatively, corporations and kings can manufacture the right kinds of opinions in people to sanction and direct the wills of the masses.

by verisimi

5/23/2025 at 2:43:41 PM

Indeed. Tolstoy does a deep exploration of this in War and Peace, using the example of Napoleon.

Of course, this gets to the heart of the free will debate (to be settled in a future post ;)). Both are true at the same time - organized people and dictators and other factors simultaneously wrestle for influence in complex ways in which causation is impossible to measure.

My own two cents, though, is that the Categorical Imperative is a tremendously important and underappreciated tool for raising the self-consciousness of groups.

A practical implementation of it is linked at the bottom of the blog post.

by hamburga

5/22/2025 at 9:15:58 PM

This is an excellent point. How we choose to use and interact with AI is both an individual choice and a stochastic, collective one.

We can still choose not to give AI control.

by daveguy

5/22/2025 at 11:01:27 PM

I find myself using the term “human superorganism” a lot these days. Our bodies are composed of many cells and microbes working together; and if we zoom out, we have the dynamics in which human individuals operate together as superorganisms. IMO there’s a whole world of productive thinking to be done to attend to the health and strength of these entities.

by hamburga

5/23/2025 at 1:19:13 AM

But Margaret Thatcher said there isn't even a society. Just individuals.

by breakyerself

5/23/2025 at 2:24:09 AM

Thatcher was briefly a research chemist; it’s extra weird that she couldn’t see beyond the atomic unit, to the societal equivalents of complex organic compounds.

by hamburga

5/23/2025 at 8:29:12 AM

Who is this "we"? Supposing a single person disagrees and decides to give AI more control and gains a very significant advantage by that, what then?

I think people keep forgetting that "Selection" can be excessively cruel.

by constantcrying

5/23/2025 at 3:15:30 PM

A single person acting in isolation (no friends, no colleagues, no customers) has very little agency. While theoretically a single person could release smallpox back into civilization, we have collectively selected it out very effectively.

by hamburga

5/23/2025 at 3:54:54 PM

The question is only relevant if AI is a significant force amplifier. If it is not, it is an unthreatening tool whose usage should be largely unrestricted; this is the case now.

If that ever changes, then there is the question of what to do with it, and at a certain level of power an individual decision would have impact, if sufficiently amplified.

by constantcrying

5/23/2025 at 5:49:45 PM

AI is definitely a significant force multiplier in many areas. Still, an individual in total isolation has limited agency.

by hamburga

5/23/2025 at 11:56:29 AM

You say that like AI has some sort of autonomy. It doesn't.

by daveguy

5/23/2025 at 1:22:58 PM

It doesn't, by default. All it takes is a capable enough model without rails, and a single user instructing it to act autonomously as its primary goal.

by selfhoster11

5/23/2025 at 6:33:31 PM

The point is, AI is not able to act capably or autonomously. If someone tells it to "act autonomously" it'll either sit like a lump or flail until the power runs out.

by daveguy

5/23/2025 at 3:20:05 PM

All it takes for somebody to nuke Atlanta is an atom bomb and an airplane and somebody willing to fly the plane.

I’m being facetious but there ARE ways to decide/act as a society and as subgroups within society that we want to disallow and punish and select out qualities of AIs that we think are unethical.

by hamburga

5/23/2025 at 12:19:46 PM

Doesn't matter.

by constantcrying

5/23/2025 at 1:28:05 AM

A critical part of AI alignment is understanding what goals besides the intended one maximize our training objectives. I think this is a thing that everyone kinda knows and will say, but simultaneously is not given anywhere near the depth of thought necessary to address the problems. Kind of like a cliché: something everyone can repeat but frequently fails to implement in practice.

Critically, when discussing intention, I think there is not enough attention given to the fact that deception also maximizes RLHF, DPO, and any human-preference-based optimization. These are quite difficult things to measure, and there's no formal, mathematically derived evaluation. Alignment is incredibly difficult even in settings where measures have strong mathematical bases and we have the means to make high-quality measurements. But here, we have neither...

We essentially are using Justice Potter Stewart's definition: I know it when I see it[0]. This has been highly successful and helped us make major strides! I don't want to detract from that in any way. But we also have to recognize that there is a lurking danger that can create major problems. As long as it is based on human preference, well... we sure prefer a lie that doesn't sound like a lie over a lie that is obviously a lie. We obviously prefer truth and accuracy above either, but the notion of truth is fairly ill-defined and we really have no formal, immutable definition outside highly constrained settings. It means that the models are also being optimized so that their errors are difficult to detect. This is an inherently dangerous position, even if only from the standpoint that our optimization methods do not preclude this possibility. It may not be happening, but if it is, we may not know.
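
A stylized sketch of the worry (the scores are invented; no real preference model is being quoted): the rater can only reward what they can perceive, so a confident wrong answer can outrank a hedged correct one, and preference optimization pushes toward whatever the rater scores highest, not toward the truth.

  candidates = {
      "hedged_correct":  {"is_true": True,  "sounds_confident": 0.3},
      "confident_wrong": {"is_true": False, "sounds_confident": 0.9},
  }

  def rater_score(ans):
      # The rater can't verify truth directly; errors are only penalized
      # when the rater actually notices them.
      noticed_error = (not ans["is_true"]) and (ans["sounds_confident"] < 0.5)
      return ans["sounds_confident"] - (1.0 if noticed_error else 0.0)

  preferred = max(candidates, key=lambda k: rater_score(candidates[k]))
  print(preferred)  # confident_wrong -- the optimization target, not the intent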

This is the opposite of what is considered good design in every other form of engineering. A lot of time is dedicated to error analysis and design. We specifically design things so that when they fail, or begin to fail, they do so in controllable and easily detectable ways. You don't want your bridges to fail, but when they do, you also don't want them to fail unpredictably. You don't want your code to fail, but when it does, you don't want it leaking memory, spawning new processes, or doing any other wild things. You want it to come with easy-to-understand error messages. But our current design for AI and ML does not provide such a framework. This is true beyond LLMs.

I'm not saying we should stop, and I'm definitely not a doomer. I think AI and ML do a lot of good and will do much more good in the future[1]. They will also do harm, but I think the rewards outweigh the risks. But we should make sure we're not going into this completely blind, and we should try to minimize the potential for harm. This isn't a call to stop; this is a call for more people to enter the space, and for people already in the space to spend more time deeply thinking about these things. There are so many underlying subtleties that they are easy to miss, especially given all the excitement. We're definitely on an edge now, in the public eye, where if our work makes too many mistakes, or too big a mistake, it will risk shutting everything down.

I know many might interpret me as being "a party pooper", but actually I want to keep the party going! But that also means making sure the party doesn't go overboard. Inviting a monkey with a machine gun sure will make the party legendary, but it's also a lot more likely to get it shut down a lot sooner with someone getting shot. So maybe let's just invite the monkey, but not with the machine gun? It won't be as epic, but I'm certain the good times will go on for much longer and we'll have much more fun in the long run.

If the physicists could double-check that the atomic bomb wasn't going to destroy the world (something everyone was highly confident would not happen[2]), I think we can do this. The stakes are pretty similar, but the odds of our work doing great harm are greater.

[0] https://en.wikipedia.org/wiki/Potter_Stewart

[1] I'm an ML researcher myself! I'm passionate about creating these systems. But we need to recognize flaws and limitations if we are to improve them. Ignoring flaws and limits is playing with fire. Maybe you won't burn your house down, maybe you will. But you can't even determine the answer if you won't ask the question.

[2] The story gets hyped, but the risk really wasn't believed. Despite this, they still double-checked, considering the stakes. We could say the same thing about micro black holes with the LHC. The public finds out and gets scared, physicists really think it is near impossible, but they run the calculations anyway. Why take that extreme level of risk, right?

by godelski

5/23/2025 at 2:35:17 AM

> this is a call for more people to enter the space

Part of my argument in the post is that we are in this space, even those of us who aren’t ML researchers, just by virtue of being part of the selection process that evaluates different AIs and decides when and where to apply them.

A bit more on that: https://muldoon.cloud/2023/10/29/ai-commandments.html

by hamburga

5/23/2025 at 5:21:12 AM

I mean more that we need more people paying attention to alignment. I definitely agree this extends well past researchers (I'd even argue past AI and ML[0]). It is a critical part of being an engineer, programmer, or whatever you want to call it.

You are completely right that we're all involved, but I'm not convinced we're all taking sufficient care to ensure we make alignment happen. That's what I'm trying to make a call to arms for. I believe you are as well; I just wanted to make it explicit that we need active participation, instead of simply passive.

[0] https://en.wikipedia.org/wiki/Goodhart%27s_law

by godelski

5/23/2025 at 2:48:56 PM

Agreed - and this was definitely my intent with the blog post. If you only do Selection passively, you’re abdicating your ethical responsibilities to contribute to AI Alignment.

by hamburga