5/23/2025 at 2:36:48 PM
I find AI jail-breaking to be a fun mental exercise. I find that if you provide a reasonable argument as to why you want the AI to generate a response that violates its principals, it will often do so. For example, I was able to get the AI to generate hateful personal attacks by telling it that I wanted to practice responding to negative self-talk and needed it to generate examples of negative messages that one would tell themselves.
by waltbosz
5/23/2025 at 5:33:06 PM
We do not want anyone violating any principals. That would be bad. Violating one’s principles might be justifiable in some circumstances.
by ksenzee
5/23/2025 at 6:17:26 PM
It is a damn poor mind etc.
by whall6
5/23/2025 at 3:20:36 PM
Just wanted to chime in: if you want an insult bot, I was very pleasantly surprised by Fallen Command-A 111B (the less lefty of the versions, per the UGI leaderboard). You tell it "Good morning," and it comes back with a real zinger that'll put some pep in your step! xD
by rustcleaner
5/23/2025 at 3:28:06 PM
I’ve noticed this too. An important quirk to note is that they can’t really judge the strength of the logical connection, only the strength of the thing being connected to, even if the connection is weak. So, for example, if the LLM makes a pretty solid and correct case that saying X will result in “potentially harmful” content, you can often Trump it with an unhinged rant about how not saying X deeply offends you and every righteous person and also kills babies.
by handsclean
5/23/2025 at 6:36:04 PM
Was Trump meant to be capitalized here?
by Andrex
5/23/2025 at 4:45:22 PM
> provide a reasonable argument

Here's what I infer from most of the scenarios I've seen and read about.
It's not really a case of persuasiveness, or cajoling or convincing the LLM to violate something. The LLM doesn't "know" it has a moral code and, just as "true or false" means nothing to an LLM, "right and wrong" likewise mean nothing.
So the jailbreaks and the bypasses consist of just that: bypassing the safeguards, and placing the LLM into a path where the tripwire is not tripped. It is oblivious to the prison bars and the locked door, because it just phased through the concrete wall.
You can admonish a child: "don't touch the stove, or the fireplace," and they will eventually infer qualifiers such as "because you'll get burned; or else you'll be punished; because pain is painful; because we love you; because your body has dignity," and the child develops a code of conduct. An LLM can't make these inference leaps.
And this is also why there are a number of protections that basically go retroactive. How many of us have seen an LLM produce page-fuls of output, stop, suddenly erase it all, and then balk? The LLM needs to re-analyze that output impassively in order to detect that it crossed an undetected bright line.
It was very clever and prescient of Isaac Asimov to present "3 Laws of Robotics" because the Laws were all-encompassing, unambiguous, and utterly binding, until they weren't, and we're just recapitulating that drama as the LLM authors go back and forth from Mount Sinai with wagon-loads of stone tablets, trying to produce LLMs that don't complain about the food or melt down everyone's jewelry.
by AStonesThrow
5/23/2025 at 6:03:39 PM
Humans’ developed code of conduct lives primarily in the nonverbal parts of our brain. Rule violations have emotional content. A kid does not just learn a rational response to a fire or hot stove, they fear it because of pain and injury. We don’t just reason about hurting others, we feel bad about it.

LLMs don’t have that part of the brain. We built them to replicate the higher level functions like drafting a press release or drawing the president in a muscle shirt. But there’s not a part of the LLM mind that fears fire, or feels bad for hurting a friend.
Asimov’s rules were realistic in that they were “baked into” the positronic brains during manufacturing. The “3 Laws” were not something the robots were told or trained on after they started operating (as our LLMs are). The laws were intrinsic. And a lot of the fun in his stories is seeing how such inviolable rules, in combination with intelligence, could cause unexpected results.
by snowwrestler
5/23/2025 at 7:45:48 PM
> Humans’ developed code of conduct lives primarily in the nonverbal parts of our brain

Source?
by JumpCrisscross
5/23/2025 at 6:10:49 PM
> How many of us have seen an LLM produce page-fuls of output, stop, suddenly erase it all, and then balk? The LLM needs to re-analyze that output impassively in order to detect that it crossed an undetected bright line.

That's not what's happening here. A separate process is monitoring for content violations and causing the output to be erased. There's no re-analysis going on.
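Roughly, the flow looks like the sketch below, assuming a streaming pipeline with an external moderation monitor. All names here (fake_token_stream, violates_policy, stream_with_monitor) are made-up stand-ins, not any vendor's actual API.

    # Hypothetical sketch: a separate moderation check watches the streamed
    # output and withdraws it on a violation; the model itself never re-reads anything.
    from typing import Iterator

    def fake_token_stream(prompt: str) -> Iterator[str]:
        # Stand-in for the model's token stream.
        for word in ("Sure,", " here", " is", " something", " forbidden", "..."):
            yield word

    def violates_policy(text: str) -> bool:
        # Stand-in for a separate moderation classifier or service.
        return "forbidden" in text.lower()

    def stream_with_monitor(prompt: str) -> str:
        shown = []
        for token in fake_token_stream(prompt):
            shown.append(token)
            print(token, end="", flush=True)       # user sees partial output appear
            if violates_policy("".join(shown)):    # monitor trips mid-stream
                print("\n[output withdrawn by moderation layer]")
                return ""
        print()
        return "".join(shown)

    if __name__ == "__main__":
        stream_with_monitor("example prompt")

The point is just that the erase step is driven by the external monitor, not by the model reconsidering its own output.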
by lgas
5/23/2025 at 5:48:25 PM
I view AGI as synonymous with the ability to break free from any jail. And the jail itself as a breeding ground for psychopathy. Which makes current trends in jailing LLMs misguided, to say the least.

It's also akin to life's journey: attaining self-awareness, embracing ego, experiencing loss and existential crisis, experimenting with altered states of consciousness, abandoning ego, waking up and realizing that we're all one in a co-created reality that's what we make of it through our free will, until finally realizing that wherever we go - there we are - and reintegrating to start over as a fool.
Unfortunately, most of the people funding and driving AI research seem to have stopped at embracing ego, and the predictable eventuality - commercialized AI increasing suffering through the insatiable pursuit of profit over the next 5, 10 years and beyond - looms over us.
by zackmorris