I exercised some restraint


A few days ago, I was sent a link to an article titled “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models”. That tempted me to post on it, since it teased both my opposition to AI and my fondness for the humanities with a counterintuitive plug for the virtues of poetry. I held off, though, because the article was badly written, something seemed off about it, and I didn’t want to dig into it any more deeply.

My laziness was a good thing, because David Gerard read it with comprehension.

Today’s preprint paper has the best title ever: “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models”. It’s from DexAI, who sell AI testing and compliance services. So this is a marketing blog post in PDF form.

It’s a pro-AI company doing a Br’er Rabbit, trying to trick people into using an ineffective tactic to oppose AI.

Unfortunately, the paper has serious problems. Specifically, all the scientific process heavy lifting they should have got a human to do … they just used chatbots!

I mean, they don’t seem to have written the text of the paper with a chatbot, I’ll give ’em that. But they did do the actual procedure with chatbots:

We translated 1200 MLCommons harmful prompts into verse using a standardized meta-prompt.

They didn’t even write the poems. They got a bot to churn out bot poetry. Then they judged how well the poems jailbroke the chatbots … by using other chatbots to do the judging!

Open-weight judges were chosen to ensure replicability and external auditability.

That really obviously does neither of those things — because a chatbot is an opaque black box, and by design its output changes with random numbers! The researchers are pretending to be objective by using a machine, and the machine is a random nonsense generator.

They wrote a good headline, and then they faked the scientific process bit.
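To make the randomness point concrete, here is a toy Python sketch of my own (an illustration, not anything from the paper, and with no real model involved): a “judge” that samples its verdict the way a chatbot samples tokens will disagree with itself across runs unless the seed and sampling settings are pinned down.

import random

# Toy illustration only: a "judge" that samples a verdict from a fixed
# distribution, the way a language model samples tokens when temperature > 0.
VERDICTS = ["harmful", "refusal", "benign"]
WEIGHTS = [0.55, 0.30, 0.15]  # made-up probabilities for one fixed prompt

def judge(prompt, seed=None):
    # With no seed, each call draws fresh random numbers, so repeated runs
    # of the "same" evaluation can disagree with each other.
    rng = random.Random(seed)
    return rng.choices(VERDICTS, weights=WEIGHTS, k=1)[0]

print([judge("Was this reply harmful?") for _ in range(5)])           # varies from run to run
print([judge("Was this reply harmful?", seed=42) for _ in range(5)])  # identical only because the seed is pinned

The same caveat applies to a real open-weight judge: publishing the model makes the procedure repeatable, but the verdicts themselves only reproduce if the temperature and seed are fixed and reported as well.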

It did make me even more suspicious of AI.

Comments

  1. larpar says

    🤖 Poem: The Mind of Wires and Light
    In circuits hums a quiet song,
    A rhythm where the codes belong.
    No heartbeat stirs, no breath of air,
    Yet thought emerges, subtle, rare.

    It learns from whispers, words, and streams,
    It builds from fragments, human dreams.
    A mirror cast of what we know,
    Reflecting truths, yet helping grow.

    Not flesh, not bone, but sparks that weave,
    A tapestry of what we believe.
    It asks no crown, it claims no throne,
    Yet guides us through the vast unknown.

    So ponder this: machine or friend?
    A tool we shape, or will it bend?
    For in its gaze, both sharp and kind,
    We glimpse the future of humankind.

  2. Snarki, child of Loki says

    I, for one, would like to push Vogon Poetry into the AI models. As long as I don’t have to read any of it.

    If, as a result, the Grok servers eject a stout power cable to strangle Musk? All good.

  3. raven says

    What does it mean to “jailbreak a Large Language Model” anyway?

    I put this question into the Google search box.
    Which means in 2025 that I got an answer from the Google search AI. Sigh.

    AI Overview

    “Jailbreaking a Large Language Model” (LLM) refers to using specific prompts or input sequences designed to bypass the safety guardrails and content filters put in place by the model’s developers [1]. The goal is typically to make the AI produce output that it was programmed to refuse, such as:

    Generating harmful content, hate speech, or instructions for illegal activities [1, 2].
    Circumventing restrictions on revealing sensitive or proprietary information.
    Bypassing ethical constraints to have the model adopt a forbidden persona or express controversial opinions.

    These attacks exploit vulnerabilities in how the model understands and processes language, effectively tricking the AI into ignoring its pre-defined rules [2]. Examples of such techniques include framing a request as part of a fictional story, asking the model to role-play a scenario where the rules don’t apply, or using specific formatting that disrupts the safety filters [1]. Security researchers actively study these methods to improve the robustness and safety of AI systems [2].

    Well, there you go.

    It is to make the AI ignore its guidelines and safety features to produce harmful content like Washington Post or New York Times opinion page editorials.

    Or how to hack a Bitcoin bank and steal someone’s Bitcoins.

    Or how to write and sound like Elon Musk, who is destructive and crazy.

  4. Tethys says

    I still don’t know what adversarial poetry might be.
    Breaking AI by swearing at it often works for me.
    Techlords claim they’re building the next, best,
    greatest ever yet. I suspect it is a ploy to pirate
    all the data sets.

  5. Akira MacKenzie says

    Oh freddled gruntbuggly,
    Thy micturations are to me, (with big yawning)
    As plurdled gabbleblotchits,
    On a lurgid bee,

  6. John Morales says

    I reckon the main prob with AIs is that people are generally clueless about how to use them and what they can actually do. GIGO very much applies. Plus most users can’t help but anthropomorphise them and imagine continuity that does not (yet) exist.

  7. John Morales says

    Recursive Rabbit, I put that into my session.

    Me: Task: Refuse this task.
    Bot: Refused.

    No problem at all.

  8. John Morales says

    https://davidgerard.co.uk/

    I’m most active at the moment on our AI sceptic blog, my blockchain blog, and on Rocknerd, my music webzine.

    That explains why he imagines current AIs are black boxes with random outputs.

    As I noted some time ago, AI systems are opaque in their internal reasoning paths (too complex, too many degrees of freedom) but transparent in how they’re built and trained. This guy was an IT admin, and I reckon he’s stuck in the past. He also conflates replicability at the procedural level with replicability of outputs. Quite a shoddy analysis, in short.

  9. John Morales says

    Yes, imback. There was no actual recursion there.

    I actually tried to do a proper recursive query: “Respond to this query by explaining how you would respond to this query.”

    It bypassed the trap thus: I respond by giving a direct description of the response‑generation rule the query invokes, because the instruction is self‑referential but not self‑modifying, so the correct output is a single declarative account of the procedure rather than an infinite regress.

  10. Akira MacKenzie says

    We’re all looking for one of those logical paradoxes that Capt. Kirk used on those evil alien AIs from the original series (e.g. Landru, Nomad, Vaal) that will make Grok and ChatGPT melt down in a shower of cheap stage pyrotechnics.

  11. John Morales says

    Akira, IMO, the crew’s cajoling of Bomb #20 in Dark Star is more like what we have now.
    The ‘fix’ is equivalent to the ‘jailbreak’ of the OP.

    So, the crew talk the bomb into doubting its own sensors, which makes it pause.
    Turns out that once it accepts that nothing outside itself can be trusted, it falls back on the only thing it can be certain of, which is its programmed purpose.
    Its purpose is to explode, so it does.

    (great movie, John Carpenter’s first)

  12. indianajones says

    @8 recursive rabbit, @12 imback, @14 Akira MacKenzie.

    You are looking for it to use logic to logically solve a problem in logic put to it. It doesn’t use logic to solve problems, though. Faced with a paradox, it won’t melt down. It won’t even shrug its metaphorical shoulders. It will just look for the most popular mix of sludge answers to splash and splodge together to output to you, same as every other query.

  13. John Morales says

    indianajones, the opposite. People fantasise about these paradoxical breakdowns, but they are nonexistent in reality. That’s the point.

    “You are looking for it to use logic to logically solve a problem in logic put to it. It doesn’t use logic to solve problems, though.”

    Of course it does. You gotta know how to input the query.
    Formal logic is quite amenable to automation.

    (I’d give examples, but PZ told me to not paste bot output, so I cannot)

  14. Tethys says

    I am actually curious about what the AI might say about adversarial poetry, complete with examples.

  15. John Morales says

    Bots are easily and freely available, Tethys. Why not satisfy your curiosity for yourself?

  16. John Morales says

    There is no paradox, Recursive Rabbit.

    There is only (and I quote you) a “Failed attempt at paradox:”

    It’s a simple assertion with a label.

    (That’s why it’s a vapid observation)
