I love to show that kind of shit to AI boosters. (In case you’re wondering, the numbers were chosen randomly and the answer is incorrect.)

They go waaa waaa it’s not a calculator, and then I can point out that it got the leading six digits and the last digit correct, which is a lot better than it did on the “softer” parts of the test.
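
If you want to run the same check yourself, here’s a quick Python sketch. The operands and the model’s wrong answer below are made up for illustration, since the actual numbers from the exchange aren’t shown.

```python
# Count how many leading and trailing digits of a model's answer match the
# true product. All numbers here are hypothetical stand-ins.
def digit_overlap(guess: int, truth: int) -> tuple[int, int]:
    a, b = str(guess), str(truth)
    n = min(len(a), len(b))
    lead = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), n)
    trail = next(
        (i for i, (x, y) in enumerate(zip(reversed(a), reversed(b))) if x != y), n
    )
    return lead, trail

x, y = 186_459, 912_883             # hypothetical operands
truth = x * y                       # 170_215_251_297, what a calculator gives
guess = 170_215_893_847             # hypothetical LLM answer: close, but wrong
print(digit_overlap(guess, truth))  # (6, 1): leading six digits and last digit
```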

  • Kazumara@discuss.tchncs.de · 3 months ago (edited)

    I was hoping, until seeing this post, that the reasoning text was actually related to how the answer is generated, especially regarding features such as using external tools, generating and executing code, and so on.

    I get how LLMs work (roughly; I didn’t take too many courses in ML at uni, and GANs were still all the rage then), which is why I specifically didn’t call it lies. But the part I’m always unsure about is how much external structure is imposed on LLM-based chat bots through traditional programming filling the gaps between rounds of token generation.
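
    A toy illustration of the kind of structure in question: a plain loop in ordinary code wrapped around rounds of token generation. The canned generate() outputs and the CALC:/RESULT: convention below are invented for the example, not anyone’s real API.

    ```python
    import re

    # Canned outputs standing in for two rounds of token generation: the "model"
    # first asks for a tool, then answers using the pasted-in result.
    _SCRIPT = iter([
        "I need to multiply. CALC: 186459 * 912883",
        "The product is 170215251297.",
    ])

    def generate(context: str) -> str:
        """Stand-in for one round of LLM token generation."""
        return next(_SCRIPT)

    def chat_turn(user_message: str) -> str:
        context = user_message
        while True:
            chunk = generate(context)
            match = re.search(r"CALC:\s*([\d\s*+-]+)", chunk)
            if not match:
                return chunk  # no tool request, so the turn is finished
            # Traditional programming fills the gap between generation rounds:
            # run the tool, paste the result into the context, generate again.
            result = eval(match.group(1), {"__builtins__": {}})  # toy calculator
            context += chunk + f"\nRESULT: {result}\n"

    print(chat_turn("What is 186459 * 912883?"))  # The product is 170215251297.
    ```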

    Apparently I was too optimistic :-)

    • rook@awful.systems · 3 months ago

      It is related, inasmuch as it’s all generated from the same prompt, and the “answer” is statistically likely to follow from the “reasoning” text. But it is only likely to follow, which is why you can sometimes see a lot of unrelated or incorrect guff in “reasoning” steps that gets misinterpreted as deliberate lying by AI doomers.
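
      A toy decoder makes the “only likely to follow” point concrete. The vocabulary and decode() below are invented stand-ins for a real model; the point is just that the reasoning and the answer are one token stream, split apart afterwards by string convention.

      ```python
      import random

      # A toy autoregressive decode: one tiny next-token table, nothing more.
      VOCAB = {
          "<think>":  ["step,", "</think>"],
          "step,":    ["step,", "</think>"],
          "</think>": ["42"],
          "42":       ["<eos>"],
      }

      def decode(prompt: str, seed: int = 0) -> str:
          random.seed(seed)  # the prompt is ignored entirely in this toy
          tokens = ["<think>"]
          while tokens[-1] != "<eos>":
              tokens.append(random.choice(VOCAB[tokens[-1]]))
          return " ".join(tokens[:-1])

      stream = decode("some question")
      reasoning, answer = stream.split(" </think> ")
      print("reasoning:", reasoning)  # the answer is merely conditioned on this,
      print("answer:", answer)        # so it is likely, but only likely, to match
      ```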

      I will confess that I don’t know what shapes the multiple “let me just check” or correction steps you sometimes see. It might just be a response stream that is shaped like self-checking. It is also possible that the response stream is fed through a separate LLM session which then pushes its own responses into the context window before the response is finished and sent back to the questioner, but that would boil down to “neural networks pattern matching on each other’s outputs and generating plausible response token streams” rather than any sort of meaningful introspection.
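
      If that second possibility were how it worked, a crude version might look like the sketch below. Both “models” return canned text; they are invented stand-ins, and nothing here reflects any vendor’s actual pipeline.

      ```python
      # A separate "checker" session whose output is pushed into the context
      # window before the reply goes out. All text here is canned.
      def draft_model(context: str) -> str:
          # pretends to correct itself once a critique has entered the context
          return "6 * 7 = 42" if "check" in context else "6 * 7 = 41"

      def checker_model(draft: str) -> str:
          # a token stream *shaped like* self-checking, not introspection
          return "Wait, let me check that multiplication again."

      def respond(question: str, rounds: int = 2) -> str:
          context = question
          draft = ""
          for _ in range(rounds):
              draft = draft_model(context)
              critique = checker_model(draft)
              # the checker's output lands in the context window, and
              # generation simply continues on top of it
              context += f"\n{draft}\n{critique}\n"
          return draft

      print(respond("What is 6 * 7?"))  # 6 * 7 = 42, after one "check" pass
      ```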

      I would expect the actual systems used by the likes of OpenAI to be far more full of hacks and bodges and work-arounds and let’s-pretend prompts than either you or I could imagine.