6
Wisecrack
24h

Adaptive Latent Hypersurfaces

The idea is that rather than adjusting the embedding latents directly, we learn a model that takes the context tokens as input and generates an efficient adapter or transform of the latents, so that when the latents are looked up for that same input, they produce outputs with much lower perplexity and loss.

This can be trained autoregressively.
This is similar in some respects to hypernetworks, but applied to embeddings.

The thinking is that we shouldn't change the latents directly, because in a high-dimensional space any given vector will generally be near-orthogonal to any other. Changing the latents introduces variance for some subset of other inputs, over distributions that are partially or fully out-of-distribution relative to the current training and validation sets, ultimately leading to a plateau in the loss drop.

Therefore, by autoregressively taking an input and learning a model that produces a transform on the latents of the token dictionary, we can avoid this ossification around a global minimum by finding hypersurfaces that adapt the embeddings rather than changing them directly.

The result is a network that essentially acts as a compressor of all relevant use cases, without leading to overfitting on in-distribution data and underfitting on out-of-distribution data.
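
A minimal sketch of what I mean, in PyTorch. All the names and the low-rank parameterization here are assumptions on my part, not a finished recipe: the base embedding table stays frozen, and a small hypernetwork reads the context and emits a per-context transform that gets applied at lookup time.

import torch
import torch.nn as nn

class LatentAdapterHypernet(nn.Module):
    # Sketch: instead of editing the embedding table, a hypernetwork reads the
    # context and emits a low-rank transform W = I + U V^T that is applied to
    # the looked-up latents for that context only.
    def __init__(self, vocab_size, d_model, rank=8):
        super().__init__()
        self.d_model, self.rank = d_model, rank
        self.base_emb = nn.Embedding(vocab_size, d_model)
        self.base_emb.weight.requires_grad = False       # latents never change
        self.context_enc = nn.GRU(d_model, d_model, batch_first=True)
        self.to_u = nn.Linear(d_model, d_model * rank)
        self.to_v = nn.Linear(d_model, d_model * rank)

    def forward(self, context_ids):                      # (B, T) token ids
        x = self.base_emb(context_ids)                   # (B, T, d) frozen latents
        _, h = self.context_enc(x)                       # (1, B, d) context summary
        h = h.squeeze(0)                                 # (B, d)
        U = self.to_u(h).view(-1, self.d_model, self.rank)
        V = self.to_v(h).view(-1, self.rank, self.d_model)
        W = torch.eye(self.d_model, device=x.device) + U @ V   # (B, d, d)
        return torch.einsum('btd,bde->bte', x, W)        # adapted latents for the LM

Training it autoregressively just means the usual next-token loss backpropagates through whatever LM consumes the adapted latents and into the hypernetwork; the dictionary itself never moves.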

Comments
  • 4
    Now I'm hungry.
  • 3
    see! tokens should just be regular WinZip compression

    it's awfully opinionated to assume you should dictate the way the AI interprets its information. there would be far more efficient ways if they were made on the fly
  • 3
    If anyone doesn't know what the fuck I'm talking about or is confused on anything I'm happy to answer any questions.

    Basically the reason it is important to preserve the latents and not change them directly is to preserve orthogonality (right angles) between vectors. On average, any random vector you pick is likely to be nearly orthogonal to most other vectors in a high enough dimensional embedding (quick numerical check at the end of this comment).

    Why is this important? Because the entire mechanism assumes statistical independence between vectors as a starting condition, a property of jointly Gaussian variables (for them, uncorrelated implies independent); otherwise models do weird things, show edge-case behaviors, and pick up more failure modes.

    Changing the latents directly during training will therefore cause other vectors on other training examples to have higher loss (poorer performance) than they otherwise wood.

    You can think of a hypersurface as a sort of projective hypothesis the system makes about a subset of vector embeddings for a particular prompt or training example.
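
    Here's that quick numerical check of the near-orthogonality claim, nothing fancy (NumPy; the dimension is just a typical embedding width I picked):

import numpy as np

d = 4096                                  # a typical embedding width (my pick)
a, b = np.random.randn(2, d)              # two random vectors
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # hovers around 0, spread ~ 1/sqrt(d) ≈ 0.016, i.e. nearly orthogonal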
  • 3
    @jestdotty "WinZip compression"

    Unironically, a few people tried something similar with zip for very early LLMs, back when they were still producing short stories intermixed with gibberish.

    It was still mostly gibberish, but it was semi-coherent gibberish from what I heard, which for the time was impressive for a language model built on zip!
  • 2
    @Wisecrack "otherwise wood" lol.

    Damn, don't rant on DR when tired.
  • 2
    I can understand this, but I couldn't explain it to someone without taking a good chunk of time processing it.

    It does sound reasonable, I assume it would also improve cohesion over previous conversation. Allowing it to rationalise the file it was asked to read as the same file if it were to read it again, but this time analysing it for typos instead of bugs, for example? I've had luck giving the AI a persona to do that otherwise, but this sounds like it could help there as well.
  • 3
    @BordedDev "It does sound reasonable, I assume it would also improve cohesion over previous conversation"

    Actually I didn't consider that but you're right. Damn good idea!

    What's more, now that you mention it, a smaller test would be to add this to just the lookups on the context window and see if it lowers perplexity (rough sketch at the end of this comment).

    Don't know if you could get away with it without training though.

    Effectively we could view it as a normalization layer for the error signal and noise introduced at each step in the process of going from token->embedding->logit to output token.

    So now that it's posted to the internet, we should see it appearing in big models in the next 30 days or so, ha.
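
    If anyone wants to run that smaller test, the comparison is just perplexity on the same frozen model with and without the adapter wired into the embedding lookup. Rough sketch reusing the LatentAdapterHypernet from the post; lm.base_emb and lm.transformer are stand-ins for whatever the frozen model actually exposes:

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(lm, ids, adapter=None):
    # ids: (B, T) token ids; lm stays frozen, only the lookup path changes
    emb = adapter(ids) if adapter is not None else lm.base_emb(ids)   # stand-in attr
    logits = lm.transformer(emb)                       # (B, T, vocab), stand-in call
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # predict token t+1 from t
        ids[:, 1:].reshape(-1),
    )
    return loss.exp().item()

# ppl_base  = perplexity(lm, batch)
# ppl_adapt = perplexity(lm, batch, adapter=hypernet)  # lower => the adapter helped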
  • 2
    @Wisecrack Hahaha definitely, and then we'll hear about the new breakthrough AI 10 days after that
  • 2
    @BordedDev I think we're gonna see some pretty big developments in 2025.

    There are too many people working at the coalface now for us to simply fail to climb the difficulty curve of diminishing returns on existing architectures.

    We're at the stage and momentum now where a year of research yields more than a year of progress. Of course that's a subjective statement open to interpretation, but I don't think too many experts would disagree with that assessment.
  • 1
    @Wisecrack fuck yeah, new science shit, LOVE IT! 🤟

    For anyone who struggles to understand our crazy mathematician’s theory, here is a short explanation:

    Imagine you have a box of different-shaped puzzle pieces. Each piece represents some information, like a word in a sentence.

    Now, instead of changing the shape of the puzzle pieces to make them fit together, you use a magic board that slightly changes the way the pieces connect when you put them on it. This way, the pieces stay the same, but they fit better depending on what’s around them.

    Some people think we should just squish and stretch the puzzle pieces to make them fit, but that can make a big mess and break the puzzle. Instead, it’s better to have the magic board do the adjusting so everything still works smoothly.

    This idea helps computers understand words better and make fewer mistakes when figuring out what people mean!