9
Wisecrack
127d

Remember my LLM post about 'ephemeral' tokens that aren't visible but change how tokens are generated?

Now GPT has them in the form of 'hidden reasoning' tokens:
https://simonwillison.net/2024/Sep/...

It's something I came up with a year earlier and put in my new black book, and they got to the idea just a week after I posted it publicly.

Just wanted to brag a bit. Someone at OpenAI has the same general vision I do.
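
A minimal toy sketch of the idea, for illustration only: tokens that are generated and counted but never shown to the user. The `Token` class, the `hidden` flag, and both helper functions are hypothetical names invented here, not OpenAI's implementation or API, and the generated sequence is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    hidden: bool = False  # hypothetical flag: hidden tokens never reach the user


def visible_text(tokens):
    """Join only the tokens the user is allowed to see."""
    return " ".join(t.text for t in tokens if not t.hidden)


def billed_tokens(tokens):
    """Count every generated token, hidden or not.

    Assumption for this sketch: hidden tokens are billed like normal output.
    """
    return len(tokens)


# A made-up generation: three hidden reasoning tokens, then a two-token answer.
generated = [
    Token("compare", hidden=True),
    Token("both", hidden=True),
    Token("options", hidden=True),
    Token("Option"),
    Token("B"),
]

print(visible_text(generated))   # Option B
print(billed_tokens(generated))  # 5
```

The user sees two tokens while five were generated, which is the gap between visible output and total usage that the comments argue about.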

Comments
  • 3
    when that happens to me I feel angry instead lol

    well, at least in regards to AI

    when I was younger and not as jaded my thought process was "wow! about time!"

    but then society happened and now I don't want anyone to know anything cuz I know they'll do bad shit with it, and I'd rather I did it first so I'm ahead of the bad shit and can springboard awayyy
  • 4
    @jestdotty totally relatable. I just figure it is a waste of mental energy to be angry about anything I can't control.

    Take the silver lining and run with it.
  • 0
    This doesn't sound like innovation to me. It's the same text-predicting process, just part of it is hidden now AND it incurs more cost on your behalf... Wow, hidden costs, always a good sign of something they will silently keep increasing
  • 0
    It's just more smoke and mirrors on top of the same architecture, but I wouldn't be mad if they weren't so adamant about calling it "reasoning" and "thoughts"... It's just more buzzwords to manipulate people into thinking they are making an actual AI. Normies have already started seeing through this LLM hype, so they have to humanize it more to keep it cool for a couple more months while also raking in more money by secretly making it more verbose. I like the tech, but I hate the marketing
  • 0
    @mostr4am nobody cares about software patents (anymore)
  • 0
    @mostr4am I don't, tell me. I'll get a bag of chips & coco
  • 2
    @Hazarth actually there is something a little bit original going on here. The idea of hidden tokens is that you're not restricted to dictionary words. You can train a model to use a single token to represent a much longer process, the consequence of which is that the attention heads can utilize that token more readily for specialized functions. One example of this already happening, by accident or emergently, is heads using the last token in a sequence for manipulating sequence loss, if I recall, but there's no reason in theory that other non-text functions couldn't be implemented.

    That's what they've done here. They did CoT prompting to generate CoT prompts themselves, then used RL to model and reduce loss, and then converted that training data down to individualized tokens that perform specific subfunctions.
  • 1
    @Wisecrack that's nice, but how do you train that? People in chats and forums don't talk in hidden tokens :) and RLHF is biased by default and only as good as the trainers and data... Magical "thought" tokens don't just appear out of thin air in existing training data... And anything that's synthetic is not adding much that the LLM couldn't regress on its own, so at best it's gonna be what? A bit more memory efficient because of preprocessing, and at worst it's gonna start being even more biased with only a slight upgrade in results.

    It's not exactly the jump from a text predictor to "thought" and "reason" that they would like you to believe.
  • 0
    Nah, what we need is to figure out how to add actual online memory to the architecture. Not just filling it with words. I'm interested in compression networks and energy-based models. Those concepts are very underdeveloped, but I think they have the right idea in mind
  • 0
    @Hazarth that is what I had to solve with my architecture, which I've mostly figured out, but since ideas are cheap yet simultaneously uncommon I haven't shared it yet. Especially cuz I have a habit of over-announcing and under-delivering. So my approach now is that I don't announce anything until I've at least figured it out and implemented a proof of concept. I've already figured out how to use noise functions to do much of what non-linearities do without backprop, and implemented that in practice. Google has hidden tokens for compressing reasoning, but they're conditioning on their own output from prior models, which is unreasonably effective but only up to a limit. I figure they'll need another three to six months before they figure out the same design I have in my black book. And by then, if everything works out, I'll have a model that performs about as well as a 2-billion-parameter one
  • 1
    @Wisecrack how big will your model be to perform as well as a 2B one?
  • 0
    @Hazarth well, considering convergence is faster than AdamW without overfitting, time complexity is O(n), where n is the token dictionary size.

    The graph component should convert that to O(n^2) but I have reason to think this can be avoided entirely.

    Owing to that I think we should be able to reasonably get away with 1/100th the parameters or even less.

    Even toy examples of the graph extension of the model, which rely on the same training methods, should result in unreasonably effective models for a given objective function.

    Basically I took the notion of the importance of the quality of training data, and I figured out that the order in which a model learns increasingly abstract concepts is critical to performing any other high-level function. Sort of like a child learning to walk before they run.

    Focus on training data or format alone and you hit a limit. Focus on order without focusing on quality and again, a limit.

    I'll have more Sunday or Monday.
  • 1
    @Wisecrack hmm, that estimate is similar to what I think is required for current LLMs. About 1/100th of the params should be enough to get similar behavior. I wish you luck on this!
  • 1
    @Hazarth thanks man. It's actually a pretty ridiculous claim, but I'm underselling it here. It's not like the sensational, shitposty claims I made with cryptography. Basically I've been studying machine learning a lot longer than math, developing a real intuition for it. And having found the current architecture, and seeing others come out mere weeks or months later with the same innovations, tells me that the existence of this architecture likely hints at a vast continent of newer architectures that are significantly faster, more efficient, and more robust than even this.

    If I'm correct then the transformer architecture will be outdated within a year and the industry will be using something completely different, even if that seems like a vast stretch from where we're sitting at right now.
  • 2
    @retoor they don’t need to care, you just need good enough footing to make them settle.
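
The training-order point made in the thread (simple concepts before abstract ones, "walk before they run") can be illustrated with a generic curriculum-learning sketch. This is not Wisecrack's architecture, which the thread does not describe; the `Example` type, the `difficulty` score, and the staging scheme are all assumptions invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    difficulty: float  # hypothetical score: higher means more abstract


def curriculum_stages(examples, num_stages):
    """Sort examples from easy to hard and split them into ordered stages.

    Returns at most num_stages lists; a trainer would consume them in order.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(examples, key=lambda e: e.difficulty)
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Made-up data with made-up difficulty scores.
data = [
    Example("prove the lemma", 0.9),
    Example("count to ten", 0.1),
    Example("add two numbers", 0.3),
    Example("solve for x", 0.6),
]

stages = curriculum_stages(data, num_stages=2)
for n, stage in enumerate(stages, start=1):
    print(n, [e.text for e in stage])
```

The sketch only covers ordering. The quality half of the argument would live in how the `difficulty` scores and the examples themselves are chosen, which is left open here.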