
How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention, using a Tale of Potion | by Lili Jiang | Jun, 2023

The backbone of ChatGPT is the GPT model, which is built using the Transformer architecture. The backbone of Transformer is the Attention mechanism. The hardest concept to grok in Attention for many is Key, Value, and Query. In this post, I will use an analogy of potions to internalize these concepts. Even if you already understand the maths of the transformer mechanically, I hope that by the end of this post you will have developed a more intuitive understanding of the inner workings of GPT from end to end.

This explanation requires no maths background. For the technically inclined, I add more technical explanations in […]. You can also safely skip notes in [brackets] and side notes in quote blocks like this one. Throughout my writing, I make up some human-readable interpretation of the intermediary states of the transformer model to aid the explanation, but GPT doesn't think exactly like that.

[When I talk about "attention", I exclusively mean "self-attention", as that is what's behind GPT. But the same analogy explains the general concept of "attention" just as well.]

GPT can spew out paragraphs of coherent content because it does one task superbly well: "Given a text, what word comes next?" Let's role-play GPT: "Sarah lies still on the bed, feeling ____". Can you fill in the blank?

One reasonable answer, among many, is "tired". In the rest of the post, I will unpack how GPT arrives at this answer. (For fun, I put this prompt in ChatGPT and it wrote a short story out of it.)

You feed the above prompt to GPT. In GPT, each word is equipped with three things: Key, Value, and Query, whose values are learned from devouring the entire internet of texts during the training of the GPT model. It's the interaction among these three ingredients that allows GPT to make sense of a word in the context of a text. So what do they do, really?

Let's set up our analogy of alchemy. For each word, we have: a potion (value) that holds the information this word can offer to other words; a tag on the potion bottle (key) that advertises what the potion contains; and an alchemist with a recipe (query) that describes what kinds of potions this word is looking for.

In the first step (attention), the alchemists of all words each go out on their own quests to fill their flasks with potions from relevant words.

Let's take the alchemist of the word "lies", for example. He knows from previous experience — after being pre-trained on the entire internet of texts — that words that help interpret "lies" in a sentence are usually of the form: "some flat surfaces, words related to dishonesty, words related to resting". He writes down these criteria in his recipe (query) and looks for tags (key) on the potions of other words. If a tag is very similar to his criteria, he will pour a lot of that potion into his flask; if the tag is not similar, he will pour little or none of that potion.

So he finds that the tag for "bed" says "a flat piece of furniture". That's similar to "some flat surfaces" in his recipe! He pours the potion for "bed" into his flask. The potion (value) for "bed" contains information like "tired, restful, sleepy, sick".

The alchemist for the word "lies" continues the search. He finds that the tag for the word "still" says "related to resting" (among other connotations of the word "still"). That matches the resting-related criterion in his recipe, so he pours in part of the potion from "still", which contains information like "restful, silent, stationary".

He looks at the tags of "on", "Sarah", "the", and "feeling" and doesn't find them relevant, so he doesn't pour any of their potions into his flask.

Remember, he needs to check his own potion too. The tag of his own potion, "lies", says "a verb related to resting", which matches his recipe. So he pours some of his own potion into the flask as well; it contains information like "tired; dishonest; can have a positive connotation if it's a white lie; …".

By the end of his quest to check the words in the text, his flask is full.
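If you want to see this quest spelled out in code, here is a minimal sketch in Python/NumPy of the mixing step for a single word. The seven words and the tiny 4-dimensional keys, values, and queries are made up purely for illustration (a real GPT learns them, at much higher dimensions); the mechanics are the point: score the recipe against every tag with a dot product, turn the scores into proportions, and mix the potions accordingly.

```python
import numpy as np

# Toy setup, purely illustrative: 7 words, 4-dimensional keys/values/queries.
words = ["Sarah", "lies", "still", "on", "the", "bed", "feeling"]
d_k = 4

rng = np.random.default_rng(0)
K = rng.normal(size=(len(words), d_k))  # one key (tag) per word
V = rng.normal(size=(len(words), d_k))  # one value (potion) per word
q_lies = rng.normal(size=(d_k,))        # the query (recipe) of the word "lies"

# How well does each tag match the recipe? (scaled dot product)
scores = K @ q_lies / np.sqrt(d_k)

# Turn the scores into mixing proportions that sum to 1 (the flask is exactly full).
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Pour the potions: the flask is a weighted mixture of the values.
flask_for_lies = weights @ V

for word, w in zip(words, weights):
    print(f"{word:>8}: {w:.2f} of the flask")
```

In the real model, every word's quest happens at once as a single matrix multiplication, which is exactly what the attention equation further below expresses.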
Unlike the original potion for "lies", this mixed potion now takes into account the context of this very specific sentence. Namely, it has a lot of elements of "tired, exhausted" and only a pinch of "dishonest".

In this quest, the alchemist knows to pay attention to the right words and combines the values of those relevant words. This is a metaphoric step for "attention". We've just explained the most important equation for Transformer, the underlying architecture of GPT:

Attention(Q, K, V) = softmax(Q·K.transpose() / sqrt(d_k)) · V (where d_k is the dimension of the Key vectors).

Advanced notes:

1. Each alchemist looks at every bottle, including their own. [Q·K.transpose()]

2. The alchemist can match his recipe (query) with the tag (key) quickly and make a fast decision. [The similarity between query and key is determined by a dot product, which is a fast operation.] Additionally, all alchemists do their quests in parallel, which also helps speed things up. [Q·K.transpose() is a matrix multiplication, which is parallelizable. Speed is a winning feature of Transformer compared to its predecessor, the Recurrent Neural Network, which computes sequentially.]

3. The alchemist is picky. He only selects the top few potions, instead of mixing in a bit of everything. [We use softmax to collapse Q·K.transpose(). Softmax pulls the inputs toward more extreme values and collapses many inputs to near-zero.]

4. At this stage, the alchemist does not take into account the ordering of words. Whether it's "Sarah lies still on the bed, feeling" or "still bed the Sarah feeling on lies", the filled flask (the output of attention) will be the same. [In the absence of "positional encoding", Attention(Q, K, V) is independent of word positions.]

5. The flask always comes back 100% filled, no more, no less. [The softmax is normalized to 1.]

6. The alchemist's recipe and the potions' tags must speak the same language. [The Query and Key must be of the same dimension to be able to dot-product together to communicate. The Value can take on a different dimension if you wish.]

7. Technically astute readers may point out that we didn't do masking. I don't want to clutter the analogy with too many details, but I will explain it here. In GPT's self-attention, each word can only see the words that come before it (and itself). So in the sentence "Sarah lies still on the bed, feeling", "lies" only sees "Sarah" (and itself); "still" only sees "Sarah" and "lies" (and itself). The alchemist of "still" can't reach into the potions of "on", "the", "bed", and "feeling".

Up till this point, the alchemist simply pours the potions from other bottles. In other words, he pours the potion of "lies" — "tired; dishonest; …" — as a uniform mixture into the flask; he can't distill out the "tired" part and discard the "dishonest" part just yet. [Attention simply sums the different V's together, weighted by the softmax.]

Now comes the real chemistry (feed forward). The alchemist mixes everything together and does some synthesis. He notices interactions between words, like "sleepy" and "restful". He also notices that "dishonesty" is only mentioned in one potion. He knows from past experience how to make some ingredients interact with each other and how to discard the one-off ones. [The feed-forward layer is a linear (and then non-linear) transformation of the Value. The feed-forward layer is the basic building block of neural networks. You can think of it as the "thinking" step in Transformer, while the earlier mixology step is simply "collecting".]
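To make the chemistry concrete, here is a similarly minimal sketch of the feed-forward step: one linear layer, a ReLU non-linearity, then another linear layer, applied to each word's flask on its own. The sizes (4 and 16) and the random weights are arbitrary stand-ins; in a trained model these weights encode the "past experience" the alchemist draws on.

```python
import numpy as np

d_model, d_ff = 4, 16  # illustrative sizes; real GPT models are far larger

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(flask: np.ndarray) -> np.ndarray:
    """Linear -> ReLU -> linear: the alchemist's synthesis of the mixed potion."""
    hidden = np.maximum(0, flask @ W1 + b1)  # let ingredients interact (non-linearity)
    return hidden @ W2 + b2                  # recombine into a richer potion

mixed_flask = rng.normal(size=(d_model,))    # stand-in for the flask mixed during attention
richer_potion = feed_forward(mixed_flask)
print(richer_potion)
```

The same weights are applied to every word's flask, one flask at a time, which is why this layer is often called position-wise.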
The resulting potion after his processing becomes much more useful for the task of predicting the next word. Intuitively, it represents some richer properties of this word in the context of its sentence, in contrast with the starting potion (value), which is out of context.

How do we get from here to the final output, which is to predict that the next word after "Sarah lies still on the bed, feeling ___" is "tired"?

So far, each alchemist has been working independently, only tending to his own word. Now all the alchemists of the different words assemble, stack their flasks in the original word order, and present them to the final linear and softmax layers of the Transformer. What do I mean by this? Here, we must depart from the metaphor.

This final linear layer synthesizes information across the different words. Based on pre-training data, one plausible thing to learn is that the immediately preceding word is important for predicting the next word. As an example, the linear layer might focus heavily on the last flask ("feeling"'s flask).

Then, combined with the softmax layer, this step assigns every single word in our vocabulary a probability of being the next word after "Sarah lies still on the bed, feeling…". For example, non-English words will receive probabilities close to 0, while words like "tired", "sleepy", and "exhausted" will receive high probabilities. We then pick the top winner as the final answer.

Now you've built a minimalist GPT!

To recap: in the attention step, you determine which words (including itself) each word should pay attention to, based on how well that word's query (recipe) matches the other words' keys (tags). You mix together those words' values (potions) in proportion to the attention that word pays to them. You process this mixture to do some "thinking" (feed forward). Once each word is processed, you combine the mixtures from all the words to do more "thinking" (linear layer) and make the final prediction of what the next word should be.

Side note: the term "decoder" is a vestige of the original paper, as the Transformer was first used for machine translation tasks. You "encode" the source language into embeddings, and "decode" from the embeddings into the target language.
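To tie the recap together, here is a toy end-to-end sketch: causal self-attention over the prompt, a feed-forward step, and a final projection to a tiny vocabulary followed by softmax. Everything here is invented for illustration: the nine-word vocabulary, the random weights, and the single layer with a single attention head. A real GPT learns all of these weights and stacks many such layers (with positional encodings, multiple heads, layer normalization, and so on), so don't expect the random weights below to actually predict "tired".

```python
import numpy as np

rng = np.random.default_rng(42)

vocab = ["Sarah", "lies", "still", "on", "the", "bed", "feeling", "tired", "sleepy"]
prompt = ["Sarah", "lies", "still", "on", "the", "bed", "feeling"]
d = 8  # toy embedding size

# Made-up "learned" parameters: embeddings, K/V/Q projections, feed forward, output projection.
E = rng.normal(size=(len(vocab), d))
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
Wout = rng.normal(size=(d, len(vocab)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

X = E[[vocab.index(w) for w in prompt]]          # one embedding per prompt word
K, V, Q = X @ Wk, X @ Wv, X @ Wq                 # tags, potions, recipes for every word

scores = Q @ K.T / np.sqrt(d)                    # how well each recipe matches each tag
mask = np.triu(np.ones_like(scores), k=1) == 1   # causal mask: no peeking at later words
scores[mask] = -np.inf
flasks = softmax(scores) @ V                     # attention: each word's mixed flask

thought = np.maximum(0, flasks @ W1) @ W2        # feed forward: the "thinking" step

logits = thought[-1] @ Wout                      # predict from the last word ("feeling")
probs = softmax(logits)
print(vocab[int(np.argmax(probs))])              # arbitrary with random weights
```

With trained weights the probability mass would concentrate on words like "tired" or "sleepy"; here the printed word is meaningless. The sketch only shows how the pieces of the recap fit together.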


