
What are the key differences between a large language model and traditional machine learning models in terms of architecture and application?

Follow-up: How do these differences impact the model's ability to understand and generate human-like text?

Comments
  • 4
    "understand" - LLMs don't understand anything.

    they just generate text tokens that might fit the pattern of a possibly relevant answer.

    also: try asking an LLM, it might give you a response more quickly than a platform that's dedicated to _ranting_ - not _explaining_ ;)
  • 1
    @tosensei that's all humans do too. Decades of training by the environment on what mouth shapes to make after others make certain mouth shapes.

    OP: I'm not an AI engineer, but I work with them and I have a passing interest in the tech.

    I think the big difference with LLMs is the ability to map the training data into tokens in an N-dimensional space, in such a way that tokens that are more likely to be connected end up spatially closer to each other.

    The details are fuzzy for me.
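
    To make the "spatially closer" idea concrete, here's a toy sketch (my own illustration, not something from the thread; the 4-dimensional vectors are made up, real embeddings have hundreds or thousands of dimensions). Related tokens end up with a higher cosine similarity than unrelated ones:

    ```python
    import torch
    import torch.nn.functional as F

    # Made-up "embeddings" for three tokens; the numbers are illustrative only.
    emb = {
        "cat": torch.tensor([0.9, 0.1, 0.3, 0.0]),
        "dog": torch.tensor([0.8, 0.2, 0.4, 0.1]),
        "carburetor": torch.tensor([0.0, 0.9, 0.1, 0.8]),
    }

    # Cosine similarity close to 1 means "pointing the same way" in the space.
    print(F.cosine_similarity(emb["cat"], emb["dog"], dim=0))         # high: related tokens sit close together
    print(F.cosine_similarity(emb["cat"], emb["carburetor"], dim=0))  # lower: unrelated tokens sit further apart
    ```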
  • 1
    there are no real differences here. There isn't really a single thing we could call "traditional machine learning"; there are a couple of approaches, Neural Networks being the most famous and relevant one right now, and LLMs are built with Neural Networks. The architecture is a transformer (the "T" in GPT, which stands for Generative Pre-trained Transformer), where the point is transforming one piece of data (for LLMs: a large set of tokens) into another piece of data (for LLMs: again a large set of tokens).

    Initially, transformers were made to do machine translation; sentence -> sentence mapping is what they were designed to do well, thanks to attention over the input context and the already-generated context. However, it was quickly discovered that the same sentence -> sentence mapping can be used to finish already-started sentences, and thus we now have LLMs. Machine text generation is not at all different from machine text translation for a transformer model. There are a few implementation differences between different LLMs, but at its core it's all transformers (a minimal generation sketch follows below).
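
    A minimal sketch of the "finish an already-started sentence" idea, assuming the Hugging Face transformers library and the small public gpt2 checkpoint (my choice of tooling, not something the thread specifies):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Give the model a started sentence and let it continue it,
    # one next-token prediction at a time.
    inputs = tok("The main difference between a transformer and older models is", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=25, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```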
  • 1
    For your follow-up question: you have to understand that NNs are a large black box. We don't really know which exact parameters end up mapping the prompt space to the prediction space, or why they get chosen. In fact, you can train the same model 1000 different times and the parameters will be largely different even though the results will be extremely similar (a toy demonstration of this is sketched below). You never really know whether you ended up in a local or a global optimum, but that matters less and less with more and more data.
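
    A toy demonstration of the "same result, different parameters" point, sketched with a tiny PyTorch network (my own example, not from the thread):

    ```python
    import torch
    import torch.nn as nn

    # Fixed toy data: y = 2x + noise.
    torch.manual_seed(123)
    x = torch.linspace(-1, 1, 200).unsqueeze(1)
    y = 2 * x + 0.1 * torch.randn_like(x)

    def train_once(seed):
        torch.manual_seed(seed)  # different seed -> different initial parameters
        model = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
        opt = torch.optim.Adam(model.parameters(), lr=0.05)
        for _ in range(300):
            opt.zero_grad()
            loss = ((model(x) - y) ** 2).mean()
            loss.backward()
            opt.step()
        return model, loss.item()

    m1, loss1 = train_once(0)
    m2, loss2 = train_once(1)
    print(loss1, loss2)  # the final losses come out very close...
    p1 = torch.cat([p.flatten() for p in m1.parameters()])
    p2 = torch.cat([p.flatten() for p in m2.parameters()])
    print((p1 - p2).abs().mean())  # ...but the learned parameters differ a lot
    ```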

    So comparing "understanding" across different transformers is currently impractical; however, we have some hints based on how transformers function in general. For one, we know that masked attention is an important part of the architecture for language prediction. This is quite intuitive: you need to know which words and sentences came before the one you're trying to predict, and different parts of the context, sentences and words will be important to different parts of the generation (the mask itself is sketched below).
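
    Here's what that mask looks like in practice, as a toy sketch (my own illustration): position i is only allowed to attend to positions up to and including i.

    ```python
    import torch

    seq_len = 5
    scores = torch.randn(seq_len, seq_len)           # raw attention scores, one row per query position
    mask = torch.tril(torch.ones(seq_len, seq_len))  # lower-triangular: 1 = visible, 0 = future token
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)          # future positions get exactly zero weight
    print(weights)
    ```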
  • 0
    The second thing that was discovered is that the encoder/decoder design of the original translation-oriented transformer architecture isn't strictly necessary for text generation and doesn't really provide much of a boost to the model.

    In simple terms, most LLMs right now are decoder-only models. The encoder part, while useful for translation tasks, proves to be less important for text generation. In other words, the neural net doesn't seem to need a separate encoder compressing the context to extract features; it performs perfectly fine working directly on the token embeddings. This is interesting because it allows you to simplify the transformer and thus save computation and memory, both during training and inference.

    I'm not sure what the results are for encoder/decoder LLMs, but I suppose they are not significantly better, or else that would be the status quo already. Translators, however, do seem to benefit from it, perhaps because the order of the tokens matters less in a translation than the actual semantic meaning of the input? Again, black box, we can only guess (the sketch below shows the two flavours side by side).
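
    For reference, here's how the two flavours look side by side, assuming the Hugging Face transformers library and the public gpt2 and t5-small checkpoints (my choice of examples, not the thread's):

    ```python
    from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

    # Decoder-only (GPT-2): attends to the previous tokens and predicts the next one.
    gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

    # Encoder/decoder (T5): the encoder reads the whole input, the decoder generates
    # conditioned on it - the classic translation-style setup.
    t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    tok = AutoTokenizer.from_pretrained("t5-small")

    out = t5.generate(**tok("translate English to German: The house is small.", return_tensors="pt"),
                      max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
    ```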
  • 0
    Finally, to touch on the topics of "understanding" and "human-like":

    As tosensei already mentioned, with LLMs you can't exactly talk about "understanding". The LLM is, strictly speaking, a space-mapping function in its truest sense: the same input always results in the same output. There's no real cognitive process taking place here, it's just a sequence of space transformations that turn a whole lot of word-part (token) embeddings into a vector of probabilities for the next token.

    In practice, sampling is applied at the end to provide more interesting and less rigid generation. But if you, for example, set the top-k parameter to 1, you'll see the true highest-probability generation without sampling; in other words, you'd remove what we perceive as "creative". And this answers the second topic: the "human-like" part is not at all part of the LLM architecture itself. We achieve more interesting and "human-like" generation by using sampling strategies at the output of the model (a toy illustration follows below).
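
    A toy illustration of that last point (my own, not from the thread): given one fixed "next-token" distribution, greedy (top-k = 1) decoding is fully deterministic, while sampling from the same distribution varies from run to run.

    ```python
    import torch

    torch.manual_seed(0)
    logits = torch.tensor([2.0, 1.0, 0.5, 0.1])  # pretend these are the model's scores for 4 tokens
    probs = torch.softmax(logits, dim=-1)

    # Greedy: the argmax is the same token every single time.
    print(torch.argmax(probs).item(), torch.argmax(probs).item())

    # Sampling: different tokens can show up, which is what reads as "creative".
    print(torch.multinomial(probs, num_samples=5, replacement=True))
    ```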
  • 0
    That's not to say that the sampling itself isn't an interesting topic. There are a couple of interesting approaches that are commonly used with LLMs:

    Greedy, Top K, Beam Search, Nucleus Sampling.

    This is a non-exhaustive list, and these are the ones I have hands-on experience with, except for Nucleus.

    Greedy is the naive top-k = 1 approach I described before: you just take the token with the highest probability and you're set... sub-par results, and by definition repetitive.

    Top-k is a commonly used approach where, instead of sampling over the entire result space (which could end up picking a very improbable token), you take the K most probable tokens and only sample among them. This is essentially a simple sampling guardrail and works fine while also being very fast.

    Beam Search is a more advanced technique where, instead of committing to a token immediately, you keep several candidate continuations alive over several steps and pick the most likely *sequence* of tokens.

    Nucleus I'm not really sure about.
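
    For completeness, here's a toy sketch of top-k and nucleus (top-p) sampling over a single next-token distribution (my own illustration; the reading of nucleus used here - keep the most probable tokens up to a cumulative probability of roughly p, then sample within that set - is mine, not the commenter's):

    ```python
    import torch

    def top_k_sample(logits, k):
        topk_vals, topk_idx = torch.topk(logits, k)          # keep the k highest-scoring tokens
        probs = torch.softmax(topk_vals, dim=-1)
        return topk_idx[torch.multinomial(probs, 1)].item()  # sample only among those k

    def nucleus_sample(logits, p):
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        keep = torch.cumsum(probs, dim=-1) <= p
        keep[0] = True                                       # always keep at least the top token
        kept_probs = probs[keep] / probs[keep].sum()         # renormalise over the "nucleus"
        return sorted_idx[: int(keep.sum())][torch.multinomial(kept_probs, 1)].item()

    logits = torch.tensor([3.0, 2.5, 1.0, 0.2, -1.0])        # pretend vocabulary of 5 tokens
    print(top_k_sample(logits, k=2), nucleus_sample(logits, p=0.9))
    ```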
  • 0
    I could talk about Neural Networks all day, probably. I've been messing with them since the early 2000s, along with genetic algorithms and SVMs. I work in the ML field right now, though not as a data scientist; rather as an operations engineer, making sure models are deployed where they belong, data gets to and from them, and everything keeps running. I've been messing with all sorts of neural networks, mostly as a hobby, for a long time, and reading papers when something catches my eye, so my knowledge and interest are relatively up to date.
  • 0
    Though do take what I said with a healthy grain of skepticism and do your own research into these topics if you really want to learn them properly. I'm fairly certain my understanding of this topic is pretty solid, but hey, I'm just one guy on the internet.
  • 2
    @Hazarth thanks for doing my homework, dad
  • 0
    @electrineer No problem, son!
  • 0
    @electrineer yeah, this sounded like a shitty test question.