
From my big black book of ML and AI, something I've kept since I was 16, and which has been a continual source of prescient predictions in the machine learning industry:

"Polynomial regression will one day be found to be equivalent to solving for self-attention."

Why run matrix multiplications when you can use the kernel trick and inner products?

Fight me.

    Could you explain this a little more? I don’t understand very complex math, but I do understand matrix multiplication (and its role in ML) as well as kernels (in the context of computer vision).
    @DeepHotel look at variations on the perceptron models and deep networks. They can be represented as arbitrary polynomials.

    Now look at classification, for example naive Bayes.

    Predicting a class is equivalent to generating a token if we treat the input+context as a feature vector, and the token dictionary as the class labels.
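    To make the "tokens as class labels" framing concrete, here is a minimal sketch of a naive Bayes next-token predictor. The toy corpus, the context width `CONTEXT`, and the Laplace smoothing constant `alpha` are all assumptions made for illustration, not something from the comment above.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical); each token is both a feature and a class label.
corpus = "the cat sat on the mat the cat ate the fish".split()
CONTEXT = 2  # assumed context width: the two preceding tokens are the features

vocab = sorted(set(corpus))
class_counts = Counter()               # how often each token appears as the target
feature_counts = defaultdict(Counter)  # (position, context token) -> target counts

for i in range(CONTEXT, len(corpus)):
    target = corpus[i]
    class_counts[target] += 1
    for pos in range(CONTEXT):
        feature_counts[(pos, corpus[i - CONTEXT + pos])][target] += 1


def log_posterior(context, token, alpha=1.0):
    """Unnormalised log P(token | context) under the naive independence assumption."""
    total = sum(class_counts.values())
    score = math.log((class_counts[token] + alpha) / (total + alpha * len(vocab)))
    for pos, ctx_tok in enumerate(context):
        num = feature_counts[(pos, ctx_tok)][token] + alpha
        den = class_counts[token] + alpha * len(vocab)
        score += math.log(num / den)
    return score


def predict_next(context):
    """Pick the class label (token) with the highest posterior score."""
    return max(vocab, key=lambda tok: log_posterior(context, tok))


next_token = predict_next(["the", "cat"])
```

    In this corpus "the cat" is followed once by "sat" and once by "ate", so the two candidates tie and the prediction is one of them.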

    But because things like naive Bayes assume independence between features, what we'd be training for is dependence. Because of that independence, we're looking for the *least* likely class rather than the most likely one. Think of it as inverted naive Bayes.

    Polynomial regression would let us run a layer over this for self-attention, modifying some sort of weighted norm layer to change the classifier's bias, on the basis that polynomials can approximate any continuous function on a bounded interval.
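    For anyone wondering how polynomials and inner products connect: a rough sketch of the kernel trick for a degree-2 polynomial. The kernel `(1 + <x, y>)^2` gives the same number as an ordinary inner product between explicitly expanded polynomial features. The function names and sample vectors here are made up for illustration.

```python
import math
from itertools import combinations


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Inhomogeneous degree-2 polynomial kernel: (1 + <x, y>)^2."""
    return (1.0 + dot(x, y)) ** 2


def explicit_features(x):
    """Explicit degree-2 feature map whose inner product equals poly_kernel."""
    s = math.sqrt(2.0)
    feats = [1.0]                                         # constant term
    feats += [s * a for a in x]                           # linear terms
    feats += [a * a for a in x]                           # squared terms
    feats += [s * a * b for a, b in combinations(x, 2)]   # cross terms
    return feats


x = [0.5, -1.0, 2.0]
y = [1.5, 0.25, -0.5]

k_implicit = poly_kernel(x, y)                                   # one inner product in 3 dims
k_explicit = dot(explicit_features(x), explicit_features(y))     # inner product in 10 dims
```

    The two values agree up to floating-point error, which is the whole point: the kernel evaluates a polynomial feature space without ever building it.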

    The effect is to teach attention what tokens to attend to, rather than teach attention what tokens are correct for a given input, and then let Bayes do the heavy lifting, augmented with this.
    I also think ngrams and Monte Carlo tree search are a bit of a baby-out-with-the-bathwater scenario.

    I could envision using M/M/1 queues for sequence modelling of attention and 'forgetting', as well as likelihood for recall.

    Monte Carlo tree search with UCB1, so we can use 'k' as a tuning constant to control 'exploration vs exploitation' on a per-ngram basis.
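    A small sketch of what the 'k' knob does in UCB1. The form used here, mean reward plus `k * sqrt(ln(N) / n)`, is the usual tunable variant; the candidate tokens, rewards, and visit counts are invented for the example.

```python
import math


def ucb1(total_reward, visits, parent_visits, k):
    """UCB1 score: average reward plus an exploration bonus scaled by k."""
    if visits == 0:
        return math.inf  # unvisited branches are always tried first
    exploit = total_reward / visits
    explore = k * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select(children, k):
    """Pick the child branch (candidate next token) with the highest UCB1 score."""
    parent_visits = sum(c["visits"] for c in children.values())
    return max(
        children,
        key=lambda name: ucb1(
            children[name]["reward"], children[name]["visits"], parent_visits, k
        ),
    )


# Hypothetical statistics for two continuations of the same n-gram prefix.
children = {
    "mat": {"reward": 9.0, "visits": 10},  # well explored, high average reward
    "rug": {"reward": 1.0, "visits": 2},   # barely explored, lower average reward
}

greedy_choice = select(children, k=0.0)   # pure exploitation
curious_choice = select(children, k=2.0)  # exploration bonus dominates
```

    With `k=0.0` the well-explored branch wins; with `k=2.0` the rarely visited branch gets picked instead.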

    Interest search (for faster minimax) takes 'temperature' and 'top k' and turns them on their head. The network uses interest ratings both during training and inference to rate branching possible outcomes (because every token probability is a branch), implementing look-ahead for those tokens.

    And then, just as next-token ngram inference has interest matrices generated at run time, each prior output gets re-rated on interest, based on its UCB1 and M/M/1 attention.
    Yes, I don't see what the big revelation is. NNs are literally just a big addition and multiplication of features and constants.

    If it weren't for the activation function that introduces more interesting non-linearities, it would literally just be a huge polynomial (in fact, a stack of purely linear layers collapses to a single affine map). Pretty much any book on statistical analysis will mention neural networks and support vector machines as curve fitting models (once again, just like higher order polynomials with tunable parameters).
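    A quick way to see what the activation function buys you: without it, two layers of weights and biases compose into one affine map. The weights below are arbitrary numbers picked for the sketch.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]


# Arbitrary weights for a 2 -> 3 -> 2 network with no activation function.
W1, b1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]], [0.1, 0.2, 0.3]
W2, b2 = [[1.0, 0.0, -2.0], [0.5, 1.5, 1.0]], [1.0, -1.0]


def two_layers(x):
    """Apply both layers in sequence, with nothing non-linear in between."""
    hidden = add(matvec(W1, x), b1)
    return add(matvec(W2, hidden), b2)


# Collapse the two layers into a single weight matrix and bias.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)


def one_layer(x):
    """The equivalent single affine map."""
    return add(matvec(W, x), b)


x = [0.7, -1.3]
deep = two_layers(x)
flat = one_layer(x)
```

    `deep` and `flat` match up to rounding, so the extra depth adds nothing until a non-linearity sits between the layers.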

    It's nothing revolutionary. The only revolution is happening in computing power and storage. Most interesting neural models simply weren't possible before due to hardware limitations. The first really cool application that stuck with me was AlphaFold from 2018.
    @Hazarth you're demonstrating your familiarity with the subject here for sure. Most people familiar with it have written the same thing: the more data and processing power, generally the better the performance, with negligible gains from architecture.

    Well, not entirely accurate. Architectural *variation* has negligible impact, that is, intra-algorithm.

    Inter-algorithm is another matter.

    Right now we can't control exploration vs exploitation. Temperature is probably the closest thing we have, and top-k isn't all that great.
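    For reference, a bare-bones sketch of what temperature and top-k actually do to a next-token distribution. The logits are made-up numbers and the function names are just for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely token indices and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in keep)
    return {i: probs[i] / mass for i in keep}


def sample(logits, temperature=1.0, k=None, rng=random):
    """Sample one token index after temperature scaling and optional top-k filtering."""
    probs = softmax_with_temperature(logits, temperature)
    filtered = top_k_filter(probs, len(probs) if k is None else k)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
top2 = top_k_filter(flat, k=2)
choice = sample(logits, temperature=2.0, k=2, rng=random.Random(0))
```

    Both knobs act on the whole distribution at once, which is one way to read the complaint that they give only coarse control.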

    The funny thing is, by representing a sequence as an ngram selected from a probability tree, we can selectively control exploration vs exploitation, or any other property.
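    One possible reading of "an ngram selected from a probability tree", sketched as a count trie over trigrams. The corpus, the trigram order, and the node layout are all assumptions for illustration.

```python
from collections import defaultdict


def make_node():
    """A tree node: how often this prefix was seen, plus its child branches."""
    return {"count": 0, "children": defaultdict(make_node)}


def build_tree(tokens, n=3):
    """Insert every n-gram of the token stream into a count trie."""
    root = make_node()
    for i in range(len(tokens) - n + 1):
        node = root
        node["count"] += 1
        for tok in tokens[i:i + n]:
            node = node["children"][tok]
            node["count"] += 1
    return root


def branch_probabilities(root, prefix):
    """Relative frequency of each continuation after the given prefix."""
    node = root
    for tok in prefix:
        if tok not in node["children"]:
            return {}  # prefix never observed
        node = node["children"][tok]
    total = sum(child["count"] for child in node["children"].values())
    return {tok: child["count"] / total for tok, child in node["children"].items()}


corpus = "the cat sat on the mat the cat ate the fish".split()  # toy data
tree = build_tree(corpus, n=3)
branches = branch_probabilities(tree, ["the", "cat"])
```

    Each branch carries its own statistics, so a per-branch rule (an exploration bonus, for instance) can be applied at exactly the node where it matters.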

    At the same time, interest algorithms like UCB1 allow us to move away from minimax and the local optima inherent in selective search.

    From there, plausibility ordering with Huffman encoding can be used as the basis for training embeddings in this case, as an alternative to matmuls.
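    A minimal sketch of the Huffman part only: building codes from token probabilities so that more plausible tokens get shorter codes. The probabilities are invented, and how the codes would then feed into training embeddings is left open, since the comment doesn't spell it out.

```python
import heapq
from itertools import count


def huffman_codes(probabilities):
    """Build a prefix-free binary code; likelier tokens end up with shorter codes."""
    tiebreak = count()  # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), {tok: ""}) for tok, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {tok: "0" for tok in heap[0][2]}
    while len(heap) > 1:
        p0, _, codes0 = heapq.heappop(heap)  # two least likely subtrees
        p1, _, codes1 = heapq.heappop(heap)
        merged = {tok: "0" + code for tok, code in codes0.items()}
        merged.update({tok: "1" + code for tok, code in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical next-token plausibilities.
token_probs = {"the": 0.5, "cat": 0.25, "sat": 0.15, "mat": 0.10}
codes = huffman_codes(token_probs)
code_lengths = {tok: len(code) for tok, code in codes.items()}
```

    Here the most plausible token gets a 1-bit code and the two least plausible ones get 3-bit codes, so code length tracks the plausibility ordering.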
