<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The AiEdge Newsletter]]></title><description><![CDATA[A newsletter for continuous learning about Machine Learning applications, Machine Learning System Design, MLOps, the latest techniques and news. 
Subscribe and receive a free Machine Learning book PDF!]]></description><link>https://newsletter.theaiedge.io</link><image><url>https://substackcdn.com/image/fetch/$s_!kRD-!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e9c582-b22b-45c5-a64e-a9105824fb01_1067x1067.png</url><title>The AiEdge Newsletter</title><link>https://newsletter.theaiedge.io</link></image><generator>Substack</generator><lastBuildDate>Sun, 19 Apr 2026 01:45:16 GMT</lastBuildDate><atom:link href="https://newsletter.theaiedge.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[AiEdge]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[damienb@theaiedge.io]]></webMaster><itunes:owner><itunes:email><![CDATA[damienb@theaiedge.io]]></itunes:email><itunes:name><![CDATA[Damien Benveniste]]></itunes:name></itunes:owner><itunes:author><![CDATA[Damien Benveniste]]></itunes:author><googleplay:owner><![CDATA[damienb@theaiedge.io]]></googleplay:owner><googleplay:email><![CDATA[damienb@theaiedge.io]]></googleplay:email><googleplay:author><![CDATA[Damien Benveniste]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AdalFlow: A PyTorch-Like Framework to Auto-Optimize Prompts for your LLM Agent]]></title><description><![CDATA[AI Agent frameworks are becoming just as important as model training itself!]]></description><link>https://newsletter.theaiedge.io/p/adalflow-a-pytorch-like-framework</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/adalflow-a-pytorch-like-framework</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 29 Sep 2025 15:01:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o1BQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p><strong>AI Agent frameworks are becoming just as important as model training itself! I am excited to introduce you to <a href="https://www.linkedin.com/in/li-yin-ai/">Li Yin</a>. She is the CEO of <a href="https://github.com/SylphAI-Inc">SylphAI</a> and the founder of <a href="https://github.com/SylphAI-Inc/AdalFlow">AdalFlow</a>, a PyTorch-like open-source library on GitHub that enables developers to build and auto-optimize any Language Model (LM) workflows.</strong></p><p><strong>In this guest post, <a href="https://www.linkedin.com/in/aria-ailearning/">Aria Shi</a>, the Developer Relations lead at SylphAI, walks you through how AdalFlow empowers AI Agent development, highlighting a hands-on example with a <a href="https://github.com/SylphAI-Inc/AdalFlow">LinkedIn Reachout Agent</a>.</strong></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o1BQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o1BQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!o1BQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!o1BQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!o1BQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o1BQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png" width="554" height="554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:554,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!o1BQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!o1BQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!o1BQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 1272w, 
https://substackcdn.com/image/fetch/$s_!o1BQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2431b183-b2fc-4935-87f1-689b6846781a_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><blockquote><p><em>Say goodbye to manual prompt engineering. <strong><a href="https://github.com/SylphAI-Inc/AdalFlow">AdalFlow</a></strong> is the all-in-one, auto-differentiative solution for optimizing prompts, whether you&#8217;re using zero-shot or few-shot learning.
Backed by our state-of-the-art research (LLM-AutoDiff and Learn-to-Reason), our framework achieves the highest accuracy among all automatic prompt optimization libraries.</em></p></blockquote><p>The rise of large language models has completely changed the way we build applications&#8212;whether it&#8217;s chatbots, RAG systems, or fully autonomous agents. But as an AI engineer, trying to bring these models into production often feels like stitching together a bunch of experiments, rather than building a stable and reliable system.</p><p>We introduce AdalFlow: a PyTorch-like library designed to bring structure, clarity, and optimization to the world of LLM application development. Built as a community-driven project, AdalFlow is uniting AI research and production engineering into a single ecosystem.</p><div class="pullquote"><p><strong><a href="https://github.com/SylphAI-Inc/AdalFlow">AdalFlow GitHub Repository</a></strong></p></div><h2>Why We Built AdalFlow</h2><p>Modern AI development faces a paradox. On one hand, researchers push the boundaries of model capabilities with new techniques in prompting, evaluation, and optimization. On the other hand, production teams need reproducibility, scalability, and a way to iterate safely on real-world data.</p><p>Most libraries excel at one side of the equation but leave the other underserved. AdalFlow was born to bridge this gap. 
With 100% control and clarity of source code, it empowers researchers to experiment freely while giving product engineers the tools to build and ship with confidence.</p><h3>Why AdalFlow Matters</h3><p>By treating prompts as first-class citizens and introducing LLM-AutoDiff, AdalFlow provides what&#8217;s been missing in the LLM ecosystem:</p><ul><li><p>For researchers: A familiar PyTorch-like environment to prototype new prompting and training methods.</p></li><li><p>For engineers: Production-ready workflows that are debuggable, reproducible, and optimizable.</p></li><li><p>For teams: A shared framework that unites research and production into one healthy ecosystem.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HKp4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HKp4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 424w, https://substackcdn.com/image/fetch/$s_!HKp4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 848w, https://substackcdn.com/image/fetch/$s_!HKp4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HKp4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HKp4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png" width="724" height="264.0412087912088" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HKp4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 424w, https://substackcdn.com/image/fetch/$s_!HKp4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 848w, https://substackcdn.com/image/fetch/$s_!HKp4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HKp4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aa8ef86-71f7-493b-b183-44b000d9c1b8_1600x584.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The overview of AdalFlow</figcaption></figure></div><h2>Core Philosophy: Prompt Is the New Programming Language</h2><p>If PyTorch turned tensors into the lingua franca of deep learning, AdalFlow treats prompts as the new programming primitives.</p><p>Every LLM application boils down to structured prompts and their transformations.
AdalFlow embraces this reality by making prompt engineering explicit and optimizable. Behind the scenes, it uses the <a href="https://jinja.palletsprojects.com/en/stable/">Jinja2</a> templating engine to let developers define composable prompt structures, ensuring that LLM apps are both modular and debuggable.</p><h3>Components: The Building Blocks of LLM Workflows</h3><p>At the heart of AdalFlow lies the Component abstraction. Just as <code>nn.Module</code> became the foundation for PyTorch models, Components unify every stage of an LLM pipeline.</p><div class="pullquote"><p><strong><a href="https://adalflow.sylph.ai/new_tutorials/core_concepts.html">AdalFlow Core Concepts</a></strong></p></div><ul><li><p>Component: The base class for all workflows. Handles both training (forward) and inference (call) modes, with bicall bridging the two.</p></li><li><p>GradComponent: Components capable of backpropagation (e.g., Generators, Retrievers).</p></li><li><p>DataComponent: Lightweight components for formatting and parsing data (e.g., DataClassParser).</p></li><li><p>LossComponent: Wraps evaluation metrics and enables gradient-like feedback for text optimization.</p></li></ul><h3>Example 1: Q&amp;A with Object Counting (Component + DataComponent)</h3><div class="pullquote"><p><strong><a href="https://adalflow.sylph.ai/use_cases/question_answering.html?source=post_page-----84f95a03f22b---------------------------------------">Question Answering - Build and Optimize LM Workflows</a></strong></p></div><pre><code>template = r<strong>"""&lt;START_OF_SYSTEM_PROMPT&gt;
{{system_prompt}}
&lt;END_OF_SYSTEM_PROMPT&gt;
&lt;START_OF_USER&gt;
{{input_str}}
&lt;END_OF_USER&gt;"""</strong>

<strong>import</strong> re  <em><strong># used by parse_integer_answer below</strong></em>
<strong>import</strong> adalflow <strong>as</strong> adal

<strong>@adal.func_to_data_component</strong>
<strong>def</strong> parse_integer_answer(answer: <strong>str</strong>):
    numbers = re.findall(r"\d+", answer)
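    # e.g. answer = "I count 7 objects in total. Answer: 7" gives numbers == ["7", "7"];
    # the last match becomes the returned int (illustrative input, not from the post)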
    <strong>return</strong> <strong>int</strong>(numbers[-1])</code></pre><h4>What&#8217;s happening here?</h4><ul><li><p><code>parse_integer_answer</code> is wrapped with <code>@adal.func_to_data_component</code>.</p></li><li><p>This turns a plain Python function into a <code>DataComponent</code>, which handles structured output parsing.</p></li><li><p>In this case, it ensures the model&#8217;s answer ends with a numerical value.</p></li></ul><p>Next, we define a full pipeline:</p><pre><code><strong>class</strong> ObjectCountTaskPipeline(adal.Component):
    <strong>def</strong> __init__(
            <strong>self</strong>, model_client: adal.ModelClient, model_kwargs: Dict
        ):
        <strong>super</strong>().__init__()
        system_prompt = adal.Parameter(
            data=<strong>"You will answer a reasoning question. Think step by step. The last line should be 'Answer: $VALUE'."</strong>,
            role_desc=<strong>"Task instruction for the model"</strong>,
            requires_opt=<strong>True</strong>,
            param_type=ParameterType.PROMPT,
        )
        <strong>self</strong>.llm_counter = adal.Generator(
            model_client=model_client,
            model_kwargs=model_kwargs,
            template=template,
            prompt_kwargs={<strong>"system_prompt"</strong>: system_prompt},
            output_processors=parse_integer_answer,
        )

    <strong>def</strong> bicall(<strong>self</strong>, question: <strong>str</strong>, id: <strong>str</strong> = <strong>None</strong>):
        <strong>return</strong> self.llm_counter(
              prompt_kwargs={<strong>"input_str"</strong>: question}, id=id
        )</code></pre><p><code>ObjectCountTaskPipeline</code> subclasses Component. Inside it, we define:</p><ul><li><p>A Parameter of type <code>PROMPT</code>, which AdalFlow can later auto-optimize.</p></li><li><p>A Generator (a <code>GradComponent</code>) that executes the prompt, then passes the raw LLM output through our <code>parse_integer_answer</code> <code>DataComponent</code>.</p></li></ul><blockquote><p><em>The workflow is:<br>Prompt &#8594; LLM Generation &#8594; Structured Output Parsing &#8594; Final Numerical Answer.</em></p></blockquote><h3>Example 2: Classification with Structured Output (Component + DataClass)</h3><div class="pullquote"><p><strong><a href="https://adalflow.sylph.ai/use_cases/classification.html?source=post_page-----84f95a03f22b---------------------------------------">Classification Optimization - Build and Optimize LM Workflows</a></strong></p></div><p>Classification tasks are a perfect showcase of AdalFlow&#8217;s DataClass feature.</p><pre><code><strong>@dataclass</strong>
<strong>class</strong> TRECExtendedData(adal.DataClass):
    question: <strong>str</strong> = field(
        metadata={<strong>"desc"</strong>: <strong>"The question to be classified"</strong>}
    )
    rationale: <strong>str</strong> = field(
        metadata={<strong>"desc"</strong>: <strong>"Step-by-step reasoning"</strong>}, default=<strong>None</strong>
    )
    class_name: Literal[
        <strong>"ABBR"</strong>, <strong>"ENTY"</strong>, <strong>"DESC"</strong>, <strong>"HUM"</strong>, <strong>"LOC"</strong>, <strong>"NUM"
    </strong>] = field(
        metadata={<strong>"desc"</strong>: <strong>"The class name"</strong>}, default=<strong>None</strong>
    )

    __input_fields__ = [<strong>"question"</strong>]
    __output_fields__ = [<strong>"rationale"</strong>, <strong>"class_name"</strong>]</code></pre><ul><li><p><code>TRECExtendedData</code> extends <code>DataClass</code>, which (like Pydantic) gives us schema enforcement.</p></li><li><p>Input: a question.</p></li><li><p>Output: a rationale (reasoning trace) and a <code>class_name</code> (final label).</p></li></ul><p>Now let&#8217;s plug it into a pipeline:</p><pre><code><strong>class</strong> TRECClassifierStructuredOutput(adal.Component):
    <strong>def</strong> __init__(
        <strong>self</strong>, model_client: adal.ModelClient, model_kwargs: Dict
    ):
        <strong>super</strong>().__init__()
        <em><strong># Task description prompt</strong></em>
        task_desc_str = adal.Prompt(
            template=task_desc_template,
            prompt_kwargs={
                <strong>"classes"</strong>: [
                    {<strong>"label"</strong>: l, <strong>"desc"</strong>: d} 
                    <strong>for</strong> l, d 
                    <strong>in</strong> <strong>zip</strong>(_COARSE_LABELS, _COARSE_LABELS_DESC)
                ]
            }
        )()

        parser = adal.DataClassParser(
            data_class=TRECExtendedData,
            return_data_class=<strong>True</strong>,
            format_type=<strong>"yaml"</strong>
        )

        prompt_kwargs = {
            <strong>"system_prompt"</strong>: adal.Parameter(
                data=task_desc_str,
                role_desc=<strong>"Task description"</strong>,
                requires_opt=<strong>True</strong>,
                param_type=adal.ParameterType.PROMPT,
            ),
            <strong>"output_format_str"</strong>: parser.get_output_format_str(),
        }

        <strong>self</strong>.llm = adal.Generator(
            model_client=model_client,
            model_kwargs=model_kwargs,
            prompt_kwargs=prompt_kwargs,
            template=template,
            output_processors=parser,
        )

    <strong>def</strong> bicall(<strong>self</strong>, question: <strong>str</strong>, id: Optional[<strong>str</strong>] = <strong>None</strong>):
        <strong>return</strong> <strong>self</strong>.llm(prompt_kwargs={"input_str": question}, id=id)</code></pre><ul><li><p>The Prompt defines the system instruction with class definitions.</p></li><li><p><code>DataClassParser</code> enforces structured YAML output that matches <code>TRECExtendedData</code>.</p></li><li><p>Generator (a GradComponent) runs the LLM with prompt + parser.</p></li><li><p>Output is guaranteed to follow the schema: rationale + class name.</p></li></ul><p>This ensures the model never drifts into free-form answers&#8212;it always returns structured classification results.</p><h3>Example 3: Training With LossComponent</h3><p>Finally, how do we train or optimize these components? That&#8217;s where <code>LossComponent</code> comes in:</p><pre><code>eval_fn = AnswerMatchAcc(type=<strong>"exact_match"</strong>).compute_single_item
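# exact_match scores one item at a time: str(y) == str(y_gt) gives 1, else 0
# (e.g. y="128", y_gt="128" -> 1; y="127", y_gt="128" -> 0; illustrative values)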
loss_fn = adal.EvalFnToTextLoss(
    eval_fn=eval_fn,
    eval_fn_desc=<strong>"exact_match: 1 if str(y) == str(y_gt) else 0"</strong>
)</code></pre><ul><li><p><code>AnswerMatchAcc</code> is the evaluation metric.</p></li><li><p><code>EvalFnToTextLoss</code> wraps it as a <code>LossComponent</code>, enabling LLM-AutoDiff to optimize prompts automatically during training.</p></li></ul><blockquote><p><em>By attaching this to your pipeline, you get a full training loop:<br>Forward pass &#8594; Eval metric &#8594; Backward engine &#8594; Prompt optimization.</em></p></blockquote><h2>Agents: Reasoning Meets Action</h2><p>AdalFlow embraces the ReAct paradigm&#8212;combining reasoning (plan) with acting (tool use)&#8212;to build autonomous, auditable AI systems. An agent reasons about the task, selects tools, executes them, observes results, and iterates until it can deliver a final answer.</p><ul><li><p><a href="https://adalflow.sylph.ai/new_tutorials/agents_runner.html">https://adalflow.sylph.ai/new_tutorials/agents_runner.html</a></p></li><li><p><a href="https://colab.research.google.com/github/SylphAI-Inc/AdalFlow/blob/main/notebooks/agents/agent_tutorial.ipynb">https://colab.research.google.com/github/SylphAI-Inc/AdalFlow/blob/main/notebooks/agents/agent_tutorial.ipynb</a></p></li></ul><h3>Architecture at a Glance</h3><ul><li><p>Agent (planner + tool manager)<br>Handles <em>planning and decision-making</em> via a Generator-based planner, and knows what tools are available and how to call them.</p></li><li><p>Runner (executor + conversation loop)<br>Orchestrates <em>multi-step execution</em>, tool calling, observation handling, timeouts, and final answer synthesis.</p></li></ul><p>This separation lets you swap or customize planning vs. 
execution independently.</p><h3>Execution Flow (ReAct Loop Recap)</h3><ol><li><p>Planning &#8211; The Agent (Generator planner) analyzes input and proposes the next action.</p></li><li><p>Tool Selection &#8211; Chooses a tool from the registered set.</p></li><li><p>Tool Execution &#8211; The Runner invokes the tool with arguments.</p></li><li><p>Observation &#8211; The result is fed back to the planner.</p></li><li><p>Iteration &#8211; Repeat 1&#8211;4 up to max_steps or until confident.</p></li><li><p>Final Answer &#8211; The planner synthesizes the answer (optionally into a structured type).</p></li></ol><h3>Minimal, End-to-End Example</h3><blockquote><p><em>1) Define a Tool (callable or FunctionTool)</em></p></blockquote><pre><code><em><strong># Tool: a plain Python callable works, or wrap with FunctionTool for extras.</strong></em>
<strong>def</strong> calculator(expression: str) -&gt; <strong>str</strong>:
    <em><strong>"""Evaluate a mathematical expression."""</strong></em>
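    # e.g. calculator("15 * 7 + 23") -> "Result: 128"; malformed input such as
    # calculator("15 *") falls through to the except branch and returns "Error: ..."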
    <strong>try</strong>:
        result = eval(expression)  <em><strong># caution: eval executes arbitrary code; avoid untrusted input</strong></em>
        <strong>return</strong> f<strong>"Result: {result}"</strong>
    <strong>except</strong> Exception <strong>as</strong> e:
        <strong>return</strong> f<strong>"Error: {e}"</strong></code></pre><blockquote><p><em>2) Build the Agent (Planner + Tools)</em></p></blockquote><pre><code><strong>from</strong> adalflow <strong>import</strong> Agent, Runner
<strong>from</strong> adalflow.components.model_client.openai_client <strong>import</strong> OpenAIClient

agent = Agent(
    name=<strong>"CalculatorAgent"</strong>,  <em><strong># Agent identifier</strong></em>
    tools=[calculator],  <em><strong># List of tools (callables or FunctionTool)</strong></em>
    <em><strong># LLM client used by the planner (Generator-based)</strong></em>
    model_client=OpenAIClient(),
    model_kwargs={<strong>"model"</strong>: <strong>"gpt-4o"</strong>, <strong>"temperature"</strong>: 0.3},
    max_steps=6,  <em><strong># Upper bound for ReAct loops</strong></em>
)</code></pre><p>What this maps to:</p><ul><li><p>Planner: An internal Generator that decides the next step (think: &#8220;reasoning trace&#8221;).</p></li><li><p><code>ToolManager</code>: The agent&#8217;s registry of permitted tools.</p></li><li><p>max_steps: Safety rail to prevent runaway loops.</p></li></ul><h4>Model Configuration (Swap Backends Easily)</h4><pre><code><em><strong># OpenAI</strong></em>
<strong>from</strong> adalflow.components.model_client.openai_client <strong>import</strong> OpenAIClient
agent = Agent(
    model_client=OpenAIClient(), 
    model_kwargs={<strong>"model"</strong>: <strong>"gpt-4o"</strong>}
)

<strong># Anthropic</strong>
<strong>from</strong> adalflow.components.model_client.anthropic_client <strong>import</strong> (
    AnthropicAPIClient
)
agent = Agent(
    model_client=AnthropicAPIClient(), 
    model_kwargs={<strong>"model"</strong>: <strong>"claude-3-sonnet-20240229"</strong>}
)</code></pre><blockquote><p><em>3) Execute with the Runner (Multi-step Orchestration)</em></p></blockquote><pre><code><em><strong># Manages turns, tool calls, observations, and finalization</strong></em>
runner = Runner(agent=agent) 

result = runner.call(
    prompt_kwargs={<strong>"input_str"</strong>: <strong>"Invoke the calculator tool and calculate 15 * 7 + 23"</strong>}
)

<strong>print</strong>(result.answer)
<em><strong># -&gt; "The result of 15 * 7 + 23 is 128."</strong></em></code></pre><h4>RunnerResult schema (returned by Runner.call)</h4><pre><code># result has:
# - result.step_history: [StepOutput(...)]  # Each step&#8217;s action + observation
# - result.answer: str | structured type     # Final synthesized answer
# - result.error: None | Exception info      # Error if something failed
# - result.ctx: dict | None                  # Optional execution metadata</code></pre><blockquote><p><em>This is the full ReAct loop in action:</em></p><p><em>Plan &#8594; Select Tool &#8594; Execute &#8594; Observe &#8594; Iterate &#8594; Answer.</em></p></blockquote><div><hr></div><h3>Advanced Features (Production-Ready)</h3><blockquote><p><em>1) Streaming Execution (Real-Time Updates)</em></p></blockquote><pre><code><em><strong># Pseudocode: actual API may differ slightly in your version.</strong></em>
stream = runner.stream(
    prompt_kwargs={<strong>"input_str"</strong>: <strong>"Compute 42 * 73 and explain."</strong>}
)
<strong>for</strong> update <strong>in</strong> stream:
    <em><strong># update contains partial thoughts, tool calls, observations, etc.</strong></em>
    <strong>print</strong>(update)</code></pre><p>Use streaming to surface <em>live</em> reasoning/tool progress in UIs.</p><blockquote><p><em>2) Human-in-the-Loop (Permission Management)</em></p></blockquote><pre><code><strong>from</strong> adalflow.permissions <strong>import</strong> PermissionManager

<strong>class</strong> MyPerms(PermissionManager):
    <strong>def</strong> approve(<strong>self</strong>, tool_name: <strong>str</strong>, args: <strong>dict</strong>) -&gt; <strong>bool</strong>:
        <em><strong># Example policy: only allow calculator; prompt user otherwise</strong></em>
        <strong>return</strong> tool_name == <strong>"calculator"</strong>
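        # e.g. approve("calculator", {"expression": "1+1"}) -> True,
        #      approve("search_tool", {"q": "foo"}) -> False (illustrative args)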

agent = Agent(
    name=<strong>"GuardedAgent"</strong>,
    tools=[calculator, search_tool],
    model_client=OpenAIClient(),
    model_kwargs={<strong>"model"</strong>: <strong>"gpt-4o"</strong>},
    permission_manager=MyPerms(),  <em><strong># &lt;- Every tool call can be inspected/approved</strong></em>
)</code></pre><p>Great for tools that hit external systems (files, emails, APIs).</p><blockquote><p><em>3) Custom System Templates (Planner Behavior)</em></p></blockquote><pre><code>custom_role_desc = <em><strong>"""
You are a careful, step-by-step data analyst.
When you use a tool, explain why and what you expect to get.
"""</strong></em>
agent = Agent(
    name=<strong>"DataAnalyst"</strong>,
    <em><strong># Custom planner persona and guardrails</strong></em>
    role_desc=custom_role_desc,        
    model_client=OpenAIClient(),
    model_kwargs={<strong>"model"</strong>: <strong>"gpt-4o"</strong>, <strong>"temperature"</strong>: 0.2},
)</code></pre><blockquote><p><em>4) Tracing (Observability)</em></p></blockquote><pre><code><em><strong># Configure tracing once (destination: console, file, or tracing backend)</strong></em>
<strong>from</strong> adalflow.tracing <strong>import</strong> enable_tracing  <em><strong># API sketch; check the tracing module in your version</strong></em>
enable_tracing(project=<strong>"adalflow-agents-demo"</strong>)

result = runner.call(prompt_kwargs={
    <strong>"input_str"</strong>: <strong>"Use the calculator for 88*19."</strong>
})
<em><strong># Inspect step_history, tool IO, latency, errors, etc.</strong></em></code></pre><p>Agent Summary:</p><ul><li><p>Agent = Reasoning + Tool selection (Generator-based planner + ToolManager)</p></li><li><p>Runner = Controlled execution loop (steps, tools, observations, final answer)</p></li><li><p>Tools = Safe, permissioned extensions to the agent&#8217;s capabilities</p></li><li><p>Production = Streaming, human approvals, tracing, structured outputs</p></li></ul><h2>Real-World Use Case: LinkedIn Recruitment Agent with AdalFlow</h2><p>Hiring top talent is one of the most resource-intensive parts of building a company. Recruiters spend hours scrolling LinkedIn, opening profiles, copying notes, and crafting outreach messages.</p><p>What if we could automate that entire workflow&#8212;turning hours of manual searching into minutes of AI-assisted sourcing?</p><p>That&#8217;s exactly what we built using AdalFlow&#8217;s Agent + Runner architecture combined with browser automation via Chrome DevTools Protocol (CDP).</p><h3>&#10024; Before vs. After</h3><p>Traditional Recruiting Workflow (Manual)</p><blockquote><p><em>&#10060; BEFORE: 2&#8211;3 hours per role</em></p><p><em>1. Navigate to LinkedIn people search</em></p><p><em>2. Type in &#8220;Product Manager, San Francisco&#8221;</em></p><p><em>3. Scroll endlessly, click into profiles</em></p><p><em>4. Skim experience, education, skills</em></p><p><em>5. Take notes in spreadsheets</em></p><p><em>6. Write &amp; send DMs manually</em></p><p><em>Automated Workflow with AdalFlow (Agentic)</em></p><p><em>&#9989; AFTER: 10 minutes per role</em></p><p><em>1. Run: linkedin-agent --query &#8220;Product Manager&#8221; --limit 10</em></p><p><em>2. Agent plans and executes:</em></p><p><em>- Smart search strategy</em></p><p><em>- Extract profiles via browser automation</em></p><p><em>- Evaluate candidates with scoring models</em></p><p><em>- Draft personalized outreach messages</em></p><p><em>3. 
Get structured output: JSON/CSV with names, titles, LinkedIn URLs, evaluation scores, outreach drafts. Recruiters get to focus on talking to people, not copy-pasting data.</em></p></blockquote><h3>How It Works &#8212; Global State Architecture</h3><p>We structured the solution around a global state shared between tools. Each tool contributes partial data (search results, profiles, evaluations, outreach drafts), which the Agent combines into a full pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OCV8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OCV8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 424w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 848w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 1272w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!OCV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png" width="1022" height="731" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:731,&quot;width&quot;:1022,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:995168,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OCV8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 424w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 848w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 1272w, https://substackcdn.com/image/fetch/$s_!OCV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F664cc551-6805-4f73-96fe-9e4e1b668a68_1022x731.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Full Pipeline of LinkedInAgent</figcaption></figure></div><h3>Implementation with AdalFlow</h3><p>We implemented the LinkedInAgent by encapsulating:</p><blockquote><p><em>Agent &#8594; Planner + Tools (search, extract, evaluate, outreach)</em></p><p><em>Runner &#8594; Execution loop with error handling and logging</em></p></blockquote><pre><code><strong>class</strong> LinkedInAgent:
<em><strong>    """
    LinkedIn recruitment agent powered by AdalFlow.
    - Encapsulates Agent + Runner
    - Provides default recruitment tools
    - Supports both sync call() and async acall()
    """</strong></em>

    <strong>def</strong> __init__(
        <strong>self</strong>,
        model_client: Optional[OpenAIClient] = <strong>None</strong>,
        model_kwargs: Optional[Dict[<strong>str</strong>, <strong>Any</strong>]] = <strong>None</strong>,
        max_steps: Optional[<strong>int</strong>] = <strong>None</strong>,
        role_desc: Optional[<strong>str</strong>] = <strong>None</strong>,
        **kwargs,
    ):
        <em><strong># Defaults</strong></em>
        model_client = model_client <strong>or</strong> OpenAIClient()
        model_kwargs = model_kwargs <strong>or</strong> {
            <strong>"model"</strong>: <strong>"gpt-4o"</strong>, <strong>"temperature"</strong>: 0.3
        }
        max_steps = max_steps <strong>or</strong> 6

        <em><strong># Recruitment workflow tools</strong></em>
        <strong>self</strong>.tools = [
            <em><strong># 1. Search LinkedIn via CDP</strong></em>
            SmartCandidateSearchTool,
            <em><strong># 2. Extract structured profile data</strong></em>     
            ExtractCandidateProfilesTool, 
            <em><strong># 3. Score candidates</strong></em>   
            CandidateEvaluationTool,
            <em><strong># 4. Draft personalized outreach</strong></em>         
            CandidateOutreachGenerationTool,
            <em><strong># 5. Persist results </strong></em>
            SaveOutreachResultsTool,         
        ]

        <em><strong># Agent role description (personality / instructions)</strong></em>
        role_desc = role_desc <strong>or</strong> <strong>"You are a recruitment assistant that sources and evaluates LinkedIn candidates."</strong>

        <em><strong># Initialize Agent + Runner</strong></em>
        <strong>self</strong>.agent = Agent(
            name=<strong>"LinkedInRecruiter"</strong>,
            tools=<strong>self</strong>.tools,
            model_client=model_client,
            model_kwargs=model_kwargs,
            max_steps=max_steps,
            role_desc=role_desc,
            **kwargs,
        )
        <strong>self</strong>.runner = Runner(agent=<strong>self</strong>.agent, max_steps=max_steps)

    <strong>def</strong> call(
        <strong>self</strong>, query: <strong>str</strong>, context: Optional[Dict[<strong>str</strong>, Any]] = <strong>None</strong>
    ):
        <strong>return</strong> <strong>self</strong>.runner.call(prompt_kwargs={<strong>"input_str"</strong>: query})

    <strong>async</strong> <strong>def</strong> acall(
        <strong>self</strong>, query: <strong>str</strong>, context: Optional[Dict[<strong>str</strong>, Any]] = <strong>None</strong>
    ):
        <strong>return</strong> <strong>await</strong> <strong>self</strong>.runner.acall(
            prompt_kwargs={<strong>"input_str"</strong>: query}
        )</code></pre><h4>Full Workflow Execution</h4><p>Here&#8217;s how we stitch the agent into a production workflow:</p><pre><code><strong>def</strong> execute_search_workflow(
    <strong>self</strong>, progress_tracker=None) -&gt; List[Dict[str, Any]]:
    logger = get_logger()
    logger.set_workflow_context(<strong>"workflow_main"</strong>, <strong>"initialization"</strong>)

    log_phase_start(<strong>"WORKFLOW_START"</strong>, f<strong>"Target: {self.limit} candidates for {self.query} in {self.location}"</strong>)

    candidates = []
    <strong>try</strong>:
        log_info(<strong>"&#129302; Initializing LinkedIn agent..."</strong>)
        agent, user_query = <strong>self</strong>.initialize_agent()

        <strong>if</strong> progress_tracker:
            progress_tracker.start_workflow()

        <em><strong># Run full pipeline</strong></em>
        result = agent.call(query=user_query)
        <strong>self</strong>._print_agent_execution_steps(result)

        <em><strong># Collect data from global state</strong></em>
        <strong>from</strong> ..core.workflow_state <strong>import</strong> get_complete_workflow_data
        workflow_data = get_complete_workflow_data()

        candidates = self._build_complete_candidate_data(workflow_data)

        log_info(f<strong>"&#9989; Found {len(candidates)} candidates"</strong>)
        <strong>return</strong> candidates

    <strong>except</strong> Exception <strong>as</strong> e:
        log_error(f<strong>"&#10060; Workflow failed: {e}"</strong>)
        <strong>return</strong> candidates</code></pre><h2>Example Output</h2><p>After running:</p><pre><code>linkedin-agent --query "Product Manager San Francisco" --limit 10</code></pre><p>We get structured results like:</p><pre><code>[
  {
    "name": "Alex Chen",
    "title": "Senior Product Manager @ Stripe",
    "location": "San Francisco Bay Area",
    "profile_url": "https://linkedin.com/in/alexchen",
    "score": 0.92,
    "outreach_message": "Hi Alex, I came across your experience at Stripe..."
  },
  {
    "name": "Maria Lopez",
    "title": "PM, Growth @ Airbnb",
    "location": "San Francisco Bay Area",
    "profile_url": "https://linkedin.com/in/marialopez",
    "score": 0.88,
    "outreach_message": "Hi Maria, your background in growth product design really stood out..."
  }
]</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kb2z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kb2z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 424w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 848w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 1272w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kb2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png" width="1456" height="806" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kb2z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 424w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 848w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 1272w, https://substackcdn.com/image/fetch/$s_!kb2z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02aba8ac-d751-4aa9-a1f3-e35c581ac506_1600x886.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Screenshot of the LinkedInAgent</figcaption></figure></div><blockquote><p><em>============================================================</em></p><p><em>WORKFLOW COMPLETION SUMMARY</em></p><p><em>============================================================</em></p><p><em>Success: &#9989; Yes</em></p><p><em>Total Candidates: 2</em></p><p><em>Duration: 68.8 seconds</em></p><p><em>Session: 20250909_220617</em></p><p><em>Log Files:</em></p><p><em>&#8226; Main: logs/workflow_20250909_220617.log</em></p><p><em>&#8226; Debug: logs/debug_20250909_220617.log</em></p><p><em>&#8226; Agent Steps: logs/agent_steps_20250909_220617.log</em></p><p><em>&#8226; Errors: logs/errors_20250909_220617.log</em></p><p><em>============================================================</em></p><p><em>&#127937; MAIN &#9989; COMPLETED - Processed 2 candidates</em></p><p><em>[RESULTS] &#9989; Recruitment workflow completed!</em></p><p><em>[RESULTS] &#128202; Final result: Successfully processed 2 candidates</em></p></blockquote><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FhEb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FhEb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 424w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 848w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 1272w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FhEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin" width="496" height="333" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:333,&quot;width&quot;:496,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FhEb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 424w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 848w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 1272w, https://substackcdn.com/image/fetch/$s_!FhEb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbdbb5d26-0c73-4e5b-99ce-c7f5920e417b_496x333.bin 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a></figure></div><h2>Looking Ahead</h2><p>As the LLM landscape evolves, frameworks like AdalFlow will become the backbone of application development. Just as PyTorch accelerated deep learning, AdalFlow has the potential to democratize LLM app building, from chatbots to agents and beyond.</p><p>If you&#8217;re excited about shaping the future of AI workflows, the project is open-source and community-driven. Whether you&#8217;re an AI researcher, product engineer, or just curious about building smarter applications, now&#8217;s the time to get involved.</p><p>&#128640; AdalFlow isn&#8217;t just another library. 
It&#8217;s a paradigm shift in how we think about programming with language models.</p>]]></content:encoded></item><item><title><![CDATA[Last Week to Register: Build Production-Ready Agentic-RAG Applications From Scratch Course!]]></title><description><![CDATA[Project-Based Course]]></description><link>https://newsletter.theaiedge.io/p/last-week-to-register-build-production</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/last-week-to-register-build-production</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Tue, 23 Sep 2025 15:02:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HM5S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is the last week to register for the <strong><a href="https://maven.com/damien-benveniste/agentic-rag">Build Production-Ready Agentic-RAG Applications From Scratch</a></strong> course! This is a fully hands-on course where we are going to implement step-by-step from scratch a production-ready Agentic-RAG application with LangGraph, FastAPI, and React!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag&quot;,&quot;text&quot;:&quot;Signup!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/damien-benveniste/agentic-rag"><span>Signup!</span></a></p><h2>What we are going to build</h2><p>We are going to build a fun web application where we can demonstrate how to orchestrate a robust RAG application using LangGraph, FastAPI, and React. 
Here is what we are going to build:</p><ol><li><p>A user can pass a GitHub repository URL</p></li><li><p>The files of the related repository are scraped and indexed in a vector database</p></li><li><p>Now the code is available for the user to ask questions about.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HM5S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HM5S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 424w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 848w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1272w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" width="1456" height="899" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HM5S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 424w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 848w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1272w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the frontend, we will need two main functionalities:</p><ul><li><p>A page where we can input the repository URL and start the crawling and indexing processes:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g2u1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g2u1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!g2u1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 848w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1272w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png" width="500" height="147.32142857142858" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:1456,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g2u1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 424w, 
https://substackcdn.com/image/fetch/$s_!g2u1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 848w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1272w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><ul><li><p>And a chatbot interface to ask questions about the code in the repository:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v21m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v21m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 424w, https://substackcdn.com/image/fetch/$s_!v21m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 848w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png" width="498" height="445.3269230769231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1302,&quot;width&quot;:1456,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v21m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 424w, https://substackcdn.com/image/fetch/$s_!v21m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 848w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On the backend, we will need the corresponding endpoints:</p><ul><li><p>The indexing endpoint will receive the provided GitHub repository URL and the &#8220;crawl&#8221; action, and start the crawling and indexing processes.</p></li><li><p>The chat endpoint will respond to the messages sent by the user from the chatbot interface.</p></li></ul><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fsuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fsuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 424w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 848w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1272w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png" width="1456" height="704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fsuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 424w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 848w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1272w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We are going to use the following tools:</p><ul><li><p><a href="https://react.dev/">React</a> for the frontend</p></li><li><p><a href="https://fastapi.tiangolo.com/">FastAPI</a> for the backend</p></li><li><p><a href="https://langchain-ai.github.io/langgraph/">LangGraph</a> for the agentic orchestration</p></li><li><p><a href="https://docs.pinecone.io/guides/get-started/overview">Pinecone</a> for the vector database</p></li><li><p><a href="https://www.langchain.com/langsmith">LangSmith</a> for observability</p></li><li><p>Deploy everything on Google Cloud!</p></li></ul><h2>Project-based course</h2><p>We will focus on building the project from the ground up, as we would on the job. 
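</p>

<p>To make the two backend endpoints described above concrete, here is a framework-agnostic sketch of their request contracts and of the indexing handler. All names below (IndexRequest, handle_index, and so on) are illustrative assumptions, not the course&#8217;s actual API; in the course itself these become FastAPI routes.</p>

```python
from dataclasses import dataclass

# Hypothetical request shapes for the two endpoints described above.
# These names are illustrative only, not the course's actual API.

@dataclass
class IndexRequest:
    repo_url: str          # GitHub repository URL to crawl and index
    action: str = "crawl"  # the "crawl" action starts both pipelines

@dataclass
class ChatRequest:
    session_id: str  # identifies the conversation (needed once memory is added)
    message: str     # the user's question about the indexed repository

def handle_index(req: IndexRequest) -> dict:
    # A real backend would enqueue the crawling + indexing job and return
    # immediately; here we only validate the action and acknowledge it.
    if req.action != "crawl":
        raise ValueError(f"unsupported action: {req.action}")
    return {"status": "queued", "repo": req.repo_url}
```

<p>The chat endpoint would follow the same pattern, taking a ChatRequest and returning the generated answer.</p>

<p>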
Here is how we are going to structure the project development:</p><ul><li><p>Introduction</p><ul><li><p>What we want to build</p></li><li><p>Setting up the environment</p></li></ul></li><li><p>The RAG Application</p><ul><li><p>The Data Parsing Pipeline</p></li><li><p>The Indexing Pipeline</p></li><li><p>The Basic RAG Pipeline</p></li><li><p>Adding Observability to the Pipeline with LangSmith</p></li><li><p>Going Agentic</p></li></ul></li><li><p>The Backend Application</p><ul><li><p>The Indexing API Endpoint</p></li><li><p>Adding Memory</p></li><li><p>Administering the Database Data</p></li></ul></li><li><p>The Frontend Application</p><ul><li><p>The Indexing Page</p></li><li><p>The Chatbot Page</p></li></ul></li><li><p>Deploying to GCP</p></li></ul><p>Each session will be a live, hands-on coding session in which we implement every component from scratch.</p><h2>Going Agentic</h2><p>&#8220;Agentic&#8221; means that we are going to use an LLM as a decision engine to enhance the quality of our pipeline. We will focus on improving the accuracy of the pipeline, at the expense of higher latency and cost, and discuss how small language models and fine-tuning can reduce those overheads. In the RAG pipeline, we are going to build a sub-agent for each of the main components:</p><ul><li><p>Intent router: the entry point of the pipeline that will decide whether the RAG pipeline is required. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wJDy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wJDy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 424w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 848w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 1272w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wJDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png" width="1456" height="597" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/174308314?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wJDy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 424w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 848w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 1272w, https://substackcdn.com/image/fetch/$s_!wJDy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf86c946-68f6-44aa-9b75-aea9f8688ee2_1621x665.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p>The retriever: the sub-agent that will extract the right data</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ayDj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ayDj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 424w, 
https://substackcdn.com/image/fetch/$s_!ayDj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 848w, https://substackcdn.com/image/fetch/$s_!ayDj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 1272w, https://substackcdn.com/image/fetch/$s_!ayDj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ayDj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png" width="1456" height="874" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:874,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127989,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/174308314?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ayDj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 424w, https://substackcdn.com/image/fetch/$s_!ayDj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 848w, https://substackcdn.com/image/fetch/$s_!ayDj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 1272w, https://substackcdn.com/image/fetch/$s_!ayDj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb186a60e-838d-44ed-9a23-940b84a25fb5_1568x941.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p>The generator: the sub-agent that will generate the response to the user</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nHed!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nHed!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 424w, https://substackcdn.com/image/fetch/$s_!nHed!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 848w, https://substackcdn.com/image/fetch/$s_!nHed!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 1272w, https://substackcdn.com/image/fetch/$s_!nHed!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nHed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png" 
width="1445" height="967" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:967,&quot;width&quot;:1445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/174308314?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nHed!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 424w, https://substackcdn.com/image/fetch/$s_!nHed!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 848w, https://substackcdn.com/image/fetch/$s_!nHed!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 1272w, https://substackcdn.com/image/fetch/$s_!nHed!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97a91147-6c52-460e-972d-c07d99ebbc8f_1445x967.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e7NK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e7NK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 424w, 
https://substackcdn.com/image/fetch/$s_!e7NK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 848w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png" width="1452" height="1482" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1482,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/172452651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!e7NK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 424w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 848w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Scaling up</h2><p>With this course, I want to focus on what it takes to deploy the application to 1M users. We will design every endpoint to be asynchronous, queue the indexing requests, and deploy behind elastic load balancing so the application scales horizontally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NOQR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NOQR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 424w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 848w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1272w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png" width="1456" height="643" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:643,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120017,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/172452651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NOQR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 424w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 848w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This is going to be a fun ride! 
Make sure to join us!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag&quot;,&quot;text&quot;:&quot;Signup!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/agentic-rag"><span>Signup!</span></a></p><h2><strong>The Real-World AI Engineering Roadblocks You Face Today</strong></h2><p>&#128075; <strong>Prototype &#8594; Production Gap</strong> &#8212; Moving from a notebook demo to a secure, observable, multi-tenant service requires orchestration, evals, guardrails, and ops most teams lack.</p><p>&#128075; <strong>&#8220;Easy RAG&#8221; vs &#8220;Reliable RAG&#8221;</strong> &#8212; Anyone can retrieve-then-generate; making answers faithful, fresh, fast, and cost-controlled under real traffic is the hard part.</p><p>&#128075; <strong>Framework Overload</strong> &#8212; The ecosystem is noisy; you need clear criteria (maturity, extensibility, latency, cost) and reference patterns to choose confidently.</p><p>&#128075; <strong>It&#8217;s Software Engineering First</strong> &#8212; Success hinges on clean interfaces, tests, typed configs, tracing, CI/CD, and change management&#8212;not just prompts and models.</p><p>&#128075; <strong>From Laptop to 1M Users</strong> &#8212; Scaling demands streaming, batching, caching, autoscaling, and SLOs, or your p95 explodes and costs spiral.</p><h3><strong>How this course will help you</strong></h3><p>&#9989; <strong>Ship a real Agentic RAG app, not a demo </strong>&#8212; Stand up an end-to-end stack&#8212;LangGraph &#8594; FastAPI &#8594; React, that runs locally today and deploys via a clean, fork-and-ship monorepo.</p><p>&#9989; <strong>Make retrieval dependable, not lucky</strong> &#8212; Adopt schema-aware chunking, strong dense embeddings with sensible metadata filters, and context packing with citations so answers stay faithful, fresh, and 
concise.</p><p>&#9989; <strong>Harden agentic workflows</strong> &#8212; Design a typed LangGraph state and build nodes for rewrite &#8594; retrieve &#8594; rerank &#8594; synthesize &#8594; cite &#8594; safety-check, with retries and timeouts so plans don&#8217;t loop or stall.</p><p>&#9989; <strong>Scale the experience, not the headaches</strong> &#8212; Enable server-streaming in FastAPI, cap top-k, trim context budgets, and add early-exit rules; deploy with autoscaling so you can serve real traffic without infra fuss.</p><p>&#9989; <strong>See enough to fix things fast</strong> &#8212; Bake in structured logs (no vendor tracing), per-step timing counters, and UI breadcrumbs/citations to follow <em>query &#8594; context &#8594; answer</em> and spot common failure patterns quickly.</p><p>&#9989; <strong>Choose frameworks with confidence</strong> &#8212; Follow an opinionated reference architecture plus a simple choice rubric (maturity, extensibility, latency, cost, swap effort) so you know when to stick&#8212;and how to swap components without rewrites.</p><p>&#9989; <strong>Write maintainable RAG code</strong> &#8212; Use clean module boundaries (ingest / retrieve / rerank / synthesize), typed configs (Pydantic Settings), and sensible secrets/env management so your team can extend it safely.</p><h3><strong>You&#8217;ll walk away with</strong></h3><p>&#10024; A running <strong>Agentic RAG app</strong> (LangGraph + FastAPI + React) in a <strong>fork-and-ship monorepo</strong>.</p><p>&#10024; An <strong>ingestion/indexing</strong> pipeline with metadata, hybrid retrieval, and optional re-ranking.</p><p>&#10024; A <strong>chat UI</strong> with citations, source previews, and conversation memory that behaves.</p><p>&#10024; <strong>Deploy</strong> scripts and env templates to go live right after class.</p><p>&#10024; A <strong>framework choice memo + adapters</strong> to swap models/vector stores without starting over.</p><p><strong>Bottom line:</strong> this 
isn&#8217;t a vitamin; it&#8217;s a blueprint you can put in production.</p><h3><strong>What you&#8217;ll get out of this course</strong></h3><ul><li><p><strong>Orchestrate complex RAG pipelines with LangGraph and OpenAI API:</strong> Build a typed LangGraph that routes rewrite &#8594; retrieve &#8594; rerank &#8594; synthesize &#8594; cite &#8594; self-check with retries, timeouts, early-exit rules, and real tool calls, exposed as a clean HTTP API.</p></li><li><p><strong>Build scalable asynchronous applications with FastAPI:</strong> Ship async FastAPI endpoints, well-typed request/response models, input validation, and sensible timeouts, ready to run locally and deploy to production.</p></li><li><p><strong>Implement chatbot interfaces with React:</strong> Create a chat UI that shows citations and source previews, lets users scope queries, preserves safe chat history, and handles transient API errors gracefully.</p></li><li><p><strong>Mitigate hallucinations with LLM judges, structured output, and context engineering:</strong> Cut errors via schema-aware chunking, dedupe and budgeted context packing, plus lightweight LLM checks and schema-constrained outputs to verify claims and enforce citations before responding.</p></li><li><p><strong>Design effective LLM prompts for high-level control over generation output:</strong> Write prompts that steer behavior: system prompts, task decomposition, Pydantic/JSON-schema constraints, and clear rules for tone, citations, and safe refusals.</p></li><li><p><strong>Develop end-to-end RAG applications using software engineering best practices:</strong> Produce a maintainable codebase: clean module boundaries (ingest/retrieve/rerank/synthesize), typed configs, secrets/env management, reproducible local dev, and a deployment setup that mirrors local.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20&quot;,&quot;text&quot;:&quot;Sign 
Up!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20"><span>Sign Up!</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Build Production-Ready Agentic-RAG Applications From Scratch Course: What we are going to build ]]></title><description><![CDATA[On Saturday, September 27th, I am launching a new course: Build Production-Ready Agentic-RAG Applications From Scratch! This is a fully hands-on course where we are going to deploy a production-ready Agentic-RAG application with LangGraph, FastAPI, and React! Here is what we are going to build.]]></description><link>https://newsletter.theaiedge.io/p/build-production-ready-agentic-rag</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/build-production-ready-agentic-rag</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Tue, 02 Sep 2025 15:01:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!HM5S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On Saturday, September 27th, I am launching a new course: <strong><a href="https://maven.com/damien-benveniste/agentic-rag">Build Production-Ready Agentic-RAG Applications From Scratch</a></strong>! This is a fully hands-on course where we are going to deploy a production-ready Agentic-RAG application with LangGraph, FastAPI, and React! 
Here is what we are going to build.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag&quot;,&quot;text&quot;:&quot;Signup!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/damien-benveniste/agentic-rag"><span>Signup!</span></a></p><h2>What we are going to build</h2><p>We are going to build a fun web application where we can demonstrate how to orchestrate a robust RAG application using LangGraph, FastAPI, and React. Here is what we are going to build:</p><ol><li><p>A user can pass a GitHub repository URL</p></li><li><p>The files of the related repository are scraped and indexed in a vector database</p></li><li><p>Now the code is available for the user to ask questions about.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HM5S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HM5S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 424w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 848w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png" width="1456" height="899" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HM5S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 424w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 848w, https://substackcdn.com/image/fetch/$s_!HM5S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HM5S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f30f1fd-2bcd-4fc4-8ca6-51b4f969edca_1578x974.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>On the frontend, we will need two main functionalities:</p><ul><li><p>A page where we can input the repository URL and start the crawling and indexing processes:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!g2u1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g2u1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 424w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 848w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1272w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png" width="500" height="147.32142857142858" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:1456,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g2u1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 424w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 848w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1272w, https://substackcdn.com/image/fetch/$s_!g2u1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc850272b-747f-42ed-b72a-b4d686a8304e_1526x450.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><ul><li><p>And a chatbot interface to ask questions about the code in the repository:</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!v21m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v21m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 424w, https://substackcdn.com/image/fetch/$s_!v21m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 848w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1272w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png" width="498" height="445.3269230769231" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1302,&quot;width&quot;:1456,&quot;resizeWidth&quot;:498,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v21m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 424w, https://substackcdn.com/image/fetch/$s_!v21m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 848w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1272w, https://substackcdn.com/image/fetch/$s_!v21m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff869f440-1d45-4194-aba0-cad9ccfbf4b7_2028x1814.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On the backend, we will need the corresponding endpoints:</p><ul><li><p>The indexing endpoint will receive the provided GitHub repository URL and the &#8220;crawl&#8221; action to start the crawling and indexing processes.</p></li><li><p>The chat endpoint will respond to messages sent by the user from the chatbot interface.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fsuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fsuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 
424w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 848w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1272w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png" width="1456" height="704" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fsuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 424w, 
https://substackcdn.com/image/fetch/$s_!fsuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 848w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1272w, https://substackcdn.com/image/fetch/$s_!fsuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9b4957a-91e3-4656-90b1-3e02d7e65596_1622x784.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are going to use the following tools:</p><ul><li><p><a href="https://react.dev/">React</a> for the frontend</p></li><li><p><a href="https://fastapi.tiangolo.com/">FastAPI</a> for the backend</p></li><li><p><a href="https://langchain-ai.github.io/langgraph/">LangGraph</a> for the agentic orchestration</p></li><li><p><a href="https://docs.pinecone.io/guides/get-started/overview">Pinecone</a> for the vector database</p></li><li><p><a href="https://www.langchain.com/langsmith">Langsmith</a> for observability</p></li><li><p>Deploy everything on Google Cloud!</p></li></ul><h2>Project-based course</h2><p>We will focus on building the project from the ground up, as we would on the job. Here is how we are going to structure the project development:</p><ul><li><p>Introduction</p><ul><li><p>What we want to build</p></li><li><p>Setting up the environment</p></li></ul></li><li><p>The RAG Application</p><ul><li><p>The Data Parsing Pipeline</p></li><li><p>The Indexing Pipeline</p></li><li><p>The Basic RAG Pipeline</p></li><li><p>Adding Observability to the Pipeline with Langsmith</p></li><li><p>Going Agentic</p></li></ul></li><li><p>The Backend Application</p><ul><li><p>The Indexing API Endpoint</p></li><li><p>Adding Memory</p></li><li><p>Administering the Database Data</p></li></ul></li><li><p>The Frontend Application</p><ul><li><p>The Indexing Page</p></li><li><p>The Chatbot Page</p></li></ul></li><li><p>Deploying to GCP</p></li></ul><p>Each session will be a live, hands-on coding session where we implement every component from scratch.</p><h2>Going Agentic</h2><p>&#8220;Agentic&#8221; means that we are going to use an LLM as a decision engine to enhance the quality of our pipeline. 
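</p><p>As a rough illustration of the decision-engine idea, the intent router can be a single constrained LLM call whose label picks the route. The sketch below is illustrative only: the llm() function is a stub standing in for a real model call, and the names are not the course code.</p>

```python
# Sketch of an LLM-as-decision-engine intent router (illustrative only).
# llm() stands in for a real model call (e.g., via LangGraph + an LLM API).

def llm(prompt: str) -> str:
    # Stub: a real call would send the prompt to a model and return its label.
    return "rag" if "repository" in prompt.split(":")[-1].lower() else "chitchat"

def route(query: str) -> str:
    """Decide whether the query needs the RAG pipeline or a direct answer."""
    label = llm(
        "Answer 'rag' if the query needs repository context, "
        f"else 'chitchat': {query}"
    )
    # Constrain the output; fall back to retrieval when the label is unexpected.
    return label if label in {"rag", "chitchat"} else "rag"

print(route("What does this repository's indexing code do?"))  # rag
print(route("Good morning!"))                                  # chitchat
```

<p>The same pattern of constrained LLM decisions extends to the retriever and generator sub-agents.</p>
<p>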
We will focus on improving the accuracy of the pipeline at the expense of latency and cost, and discuss how to reduce those trade-offs with small language models and fine-tuning. In the RAG pipeline, we are going to build a subagent for each of the main components:</p><ul><li><p>The intent router: the entry point of the pipeline that will decide whether the RAG pipeline is required.</p></li><li><p>The retriever: the sub-agent that will extract the right data.</p></li><li><p>The generator: the sub-agent that will generate the response to the user.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e7NK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e7NK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 424w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 848w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png" width="1452" height="1482" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1482,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/172452651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e7NK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 424w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 848w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 1272w, https://substackcdn.com/image/fetch/$s_!e7NK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F820cfde6-8e9c-47b2-b3c2-ce10ca086930_1452x1482.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Scaling up</h2><p>With this course, I want to focus on what we would need to do to deploy the application to 1M users. 
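</p><p>As a taste of the engineering involved, decoupling slow indexing work from request handling can be sketched with Python's <code>asyncio</code> (a minimal sketch, not the course code; the function and queue names are illustrative):</p>

```python
import asyncio

async def index_document(doc: str) -> str:
    """Placeholder indexing step (stands in for chunking + embedding)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"indexed:{doc}"

async def worker(queue: asyncio.Queue, results: list) -> None:
    # Drain queued indexing requests so request handlers never block on them.
    while True:
        doc = await queue.get()
        results.append(await index_document(doc))
        queue.task_done()

async def main() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(2)]
    for doc in ["doc-a", "doc-b", "doc-c"]:
        await queue.put(doc)  # an endpoint would return right after enqueueing
    await queue.join()        # wait until every queued request is processed
    for w in workers:
        w.cancel()
    return results

print(asyncio.run(main()))
```

<p>In production, the in-process queue would be replaced by a managed broker so that horizontally scaled replicas can share the indexing backlog.</p><p>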
We will make sure to design every endpoint to be asynchronous, queue the indexing requests, and deploy the application with elastic load balancing to scale the application horizontally.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NOQR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NOQR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 424w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 848w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1272w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png" width="1456" height="643" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:643,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120017,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/172452651?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NOQR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 424w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 848w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1272w, https://substackcdn.com/image/fetch/$s_!NOQR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd8a1046-e3fa-446e-bb31-f65b7dce728b_1728x763.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is going to be a fun ride! 
Make sure to join us!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag&quot;,&quot;text&quot;:&quot;Signup!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/agentic-rag"><span>Signup!</span></a></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[New Course: Build Production-Ready Agentic-RAG Applications From Scratch]]></title><description><![CDATA[End-to-end: orchestrate and deploy agentic Retrieval-Augmented Generation with LangGraph, FastAPI, and React frontend in 2 weeks.]]></description><link>https://newsletter.theaiedge.io/p/new-course-build-production-ready</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/new-course-build-production-ready</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 25 Aug 2025 15:01:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pWtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On Saturday, September 27th, I am launching a new course: <strong><a href="https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20">Build Production-Ready Agentic-RAG Applications From Scratch</a></strong>! This is a fully hands-on course where we are going to deploy a production-ready Agentic-RAG application with LangGraph, FastAPI, and React! 
<strong>The first 30 people to sign up will get a 20% discount by applying the promo code <a href="https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20">FIRST20</a>!</strong> So make sure to sign up early:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20&quot;,&quot;text&quot;:&quot;Sign Up!&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20"><span>Sign Up!</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pWtL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pWtL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!pWtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2377137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171854996?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pWtL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!pWtL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11f819bf-fb63-4c76-a1e6-521888220f3d_2560x1440.png 
1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>From Prototype to Production: Ship Reliable and Scalable RAG Pipelines</h2><h3>The Real-World AI Engineering Roadblocks You Face Today</h3><p>&#128075; <strong>Prototype &#8594; Production Gap</strong> &#8212; Moving from a notebook demo to a secure, observable, multi-tenant service requires orchestration, evals, guardrails, and ops most teams lack.</p><p>&#128075; <strong>&#8220;Easy RAG&#8221; vs &#8220;Reliable RAG&#8221;</strong> &#8212; Anyone can retrieve-then-generate; making answers faithful, fresh, fast, and cost-controlled under real traffic is the hard 
part.</p><p>&#128075; <strong>Framework Overload</strong> &#8212; The ecosystem is noisy; you need clear criteria (maturity, extensibility, latency, cost) and reference patterns to choose confidently.</p><p>&#128075; <strong>It&#8217;s Software Engineering First</strong> &#8212; Success hinges on clean interfaces, tests, typed configs, tracing, CI/CD, and change management&#8212;not just prompts and models.</p><p>&#128075; <strong>From Laptop to 1M Users</strong> &#8212; Scaling demands streaming, batching, caching, autoscaling, and SLOs, or your p95 explodes and costs spiral.</p><h3>How this course will help you</h3><p>&#9989; <strong>Ship a real Agentic RAG app, not a demo </strong>&#8212; Stand up an end-to-end stack&#8212;LangGraph &#8594; FastAPI &#8594; React, that runs locally today and deploys via a clean, fork-and-ship monorepo.</p><p>&#9989; <strong>Make retrieval dependable, not lucky</strong> &#8212; Adopt schema-aware chunking, strong dense embeddings with sensible metadata filters, and context packing with citations so answers stay faithful, fresh, and concise.</p><p>&#9989; <strong>Harden agentic workflows</strong> &#8212; Design a typed LangGraph state and build nodes for rewrite &#8594; retrieve &#8594; rerank &#8594; synthesize &#8594; cite &#8594; safety-check, with retries and timeouts so plans don&#8217;t loop or stall.</p><p>&#9989; <strong>Scale the experience, not the headaches</strong> &#8212; Enable server-streaming in FastAPI, cap top-k, trim context budgets, and add early-exit rules; deploy with autoscaling so you can serve real traffic without infra fuss.</p><p>&#9989; <strong>See enough to fix things fast</strong> &#8212; Bake in structured logs (no vendor tracing), per-step timing counters, and UI breadcrumbs/citations to follow <em>query &#8594; context &#8594; answer</em> and spot common failure patterns quickly.</p><p>&#9989; <strong>Choose frameworks with confidence</strong> &#8212; Follow an opinionated reference architecture 
plus a simple choice rubric (maturity, extensibility, latency, cost, swap effort) so you know when to stick&#8212;and how to swap components without rewrites.</p><p>&#9989; <strong>Write maintainable RAG code</strong> &#8212; Use clean module boundaries (ingest / retrieve / rerank / synthesize), typed configs (Pydantic Settings), and sensible secrets/env management so your team can extend it safely.</p><h3>You&#8217;ll walk away with</h3><p>&#10024; A running <strong>Agentic RAG app</strong> (LangGraph + FastAPI + React) in a <strong>fork-and-ship monorepo</strong>.</p><p>&#10024; An <strong>ingestion/indexing</strong> pipeline with metadata, hybrid retrieval, and optional re-ranking.</p><p>&#10024; A <strong>chat UI</strong> with citations, source previews, and conversation memory that behaves.</p><p>&#10024; <strong>Deploy</strong> scripts and env templates to go live right after class.</p><p>&#10024; A <strong>framework choice memo + adapters</strong> to swap models/vector stores without starting over.</p><p><strong>Bottom line:</strong> this isn&#8217;t a vitamin, it&#8217;s a blueprint you can put in production.</p><h3>What you&#8217;ll get out of this course</h3><ul><li><p><strong>Orchestrate complex RAG pipelines with LangGraph and OpenAI API:</strong> Build a typed LangGraph that routes <strong>rewrite &#8594; retrieve &#8594; rerank &#8594; synthesize &#8594; cite &#8594; self-check</strong> with <strong>retries, timeouts, early-exit rules</strong>, and real tool calls, exposed as a clean HTTP API.</p></li><li><p><strong>Build scalable asynchronous applications with FastAPI:</strong> Ship <strong>async</strong> FastAPI endpoints, well-typed request/response models, input validation, and sensible timeouts, ready to run locally and <strong>deploy to production</strong>.</p></li><li><p><strong>Implement chatbot interfaces with React:</strong> Create a<strong> chat UI</strong> that shows citations and source previews, lets users scope queries, preserves 
<strong>safe chat history</strong>, and handles transient API errors gracefully.</p></li><li><p><strong>Mitigate hallucinations with LLM judges, structured output, and context engineering:</strong> Cut errors via <strong>schema-aware chunking</strong>, dedupe and budgeted context packing, plus <strong>lightweight LLM checks</strong> and <strong>schema-constrained outputs</strong> to verify claims and enforce citations before responding.</p></li><li><p><strong>Design effective LLM prompts for high-level control on generation output:</strong> Write prompts that <strong>steer behavior</strong>: system prompts, task decomposition, <strong>Pydantic/JSON-schema</strong> constraints, and clear rules for tone, citations, and safe refusals.</p></li><li><p><strong>Develop end-to-end RAG applications using the software engineering best practices:</strong> Produce a maintainable codebase: <strong>clean module boundaries</strong> (ingest/retrieve/rerank/synthesize), <strong>typed configs</strong>, secrets/env management, reproducible local dev, and deploy that mirrors local.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20&quot;,&quot;text&quot;:&quot;Sign Up!&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/damien-benveniste/agentic-rag?promoCode=FIRST20"><span>Sign Up!</span></a></p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Mixture-of-Experts: Early Sparse MoE Prototypes in LLMs]]></title><description><![CDATA[Mixture-of-Experts might be one of the most important improvements in the Transformer architecture!]]></description><link>https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe</guid><dc:creator><![CDATA[Damien 
Benveniste]]></dc:creator><pubDate>Fri, 22 Aug 2025 15:01:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2RgN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a6fd37-5d37-49e4-b10f-d641af576d04_1500x918.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Mixture-of-Experts might be one of the most important improvements in the Transformer architecture! It allows for scaling the number of model parameters while keeping the latency associated with the forward and backward pass of the backpropagation algorithm almost constant. Scaling in the width direction, as opposed to the depth of the model, allows for keeping the gradient paths short, improving the stability of the training. We explore here 2 early models:</strong></em></p><ul><li><p><em><strong>The Sparsely-Gated Mixture-of-Experts Layer</strong></em></p></li><li><p><em><strong>GShard</strong></em></p></li></ul><div><hr></div><h2>The First Mixture-of-Experts</h2><p>The concept of the Mixture of Experts (MoE) architecture was introduced in 1991 by <a href="https://www.cs.toronto.edu/~fritz/absps/jjnh91.pdf">Jacobs et al</a>. The idea was to combine the learning of parallel learners using a gating mechanism. The goal was to increase the model's capacity while maintaining stable training and achieving faster convergence. Deep networks tend to suffer from vanishing or exploding gradients, and extending the capacity in the width direction allows for the learning of more complex statistical patterns while keeping short gradient paths. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4QZ-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4QZ-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 424w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 848w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4QZ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png" width="430" height="391.0164835164835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1324,&quot;width&quot;:1456,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:262551,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4QZ-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 424w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 848w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 1272w, https://substackcdn.com/image/fetch/$s_!4QZ-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfda17ce-cd0f-4574-9126-cf8f3391df3a_1500x1364.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An "expert" <em><strong>E<sub>i</sub></strong></em> can be a simple feed-forward network. For example:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E_i(\\mathbf{h})= \\text{ReLU}\\left(W_1^i\\mathbf{h} + \\mathbf{b}_1\\right)W_2^i + \\mathbf{b}_2&quot;,&quot;id&quot;:&quot;LKAUJBTGBW&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>W<sub>1</sub><sup>i</sup></strong></em> and <em><strong>W<sub>2</sub><sup>i</sup></strong></em> may have different dimensions depending on the expert. 
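</p><p>Concretely, the experts and the softmax-gated combination can be sketched in NumPy (a minimal sketch: the sizes, the number of experts, and the row-vector convention <em>h W</em> instead of <em>W h</em> are illustrative choices, not from the papers):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden, n_experts = 8, 16, 4  # illustrative dimensions

# Expert i: E_i(h) = ReLU(h W1_i + b1) W2_i + b2
W1 = rng.normal(size=(n_experts, d, d_hidden))
b1 = np.zeros((n_experts, d_hidden))
W2 = rng.normal(size=(n_experts, d_hidden, d))
b2 = np.zeros((n_experts, d))

# Gating projection of dimension |h| x n, one logit per expert
Wg = rng.normal(size=(d, n_experts))
bg = np.zeros(n_experts)

def moe(h: np.ndarray) -> np.ndarray:
    # g(h) = Softmax(h Wg + bg): probability-like weight per expert
    logits = h @ Wg + bg
    g = np.exp(logits - logits.max())
    g /= g.sum()
    experts = [np.maximum(h @ W1[i] + b1[i], 0.0) @ W2[i] + b2[i]
               for i in range(n_experts)]
    # y = sum_i g_i(h) E_i(h): weighted average of the expert outputs
    return sum(g[i] * experts[i] for i in range(n_experts))

h = rng.normal(size=d)
y = moe(h)
print(y.shape)  # (8,)
```

<p>Every expert runs on every input here; the sparse variants discussed next keep only the top gating weights to avoid that cost.</p><p>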
The gating mechanism generates a weight <em><strong>g<sub>i</sub>(h)</strong></em> for each expert, and the MoE output is a weighted average of the experts' outputs:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{y} = \\sum_{i=1}^n g_i(\\mathbf{h})E_i(\\mathbf{h})&quot;,&quot;id&quot;:&quot;GLFBQMREAH&quot;}" data-component-name="LatexBlockToDOM"></div><p><em><strong>g<sub>i</sub>(h)</strong></em> is typically the softmax transformation of a linear projection <em><strong>W<sub>g</sub></strong></em> of the input features:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{g}(\\mathbf{h}) = \\text{Softmax}\\left(W_g\\mathbf{h} + \\mathbf{b}_g\\right)&quot;,&quot;id&quot;:&quot;NYHNOGHFLO&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>W<sub>g</sub></strong></em> is a linear layer of dimension <em><strong>|h| &#10761; n</strong></em>. The softmax transformation yields a probability-like value that captures the proportion of contributions for each expert.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ScTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ScTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 424w, https://substackcdn.com/image/fetch/$s_!ScTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 848w, 
https://substackcdn.com/image/fetch/$s_!ScTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 1272w, https://substackcdn.com/image/fetch/$s_!ScTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ScTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png" width="1456" height="812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:327568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ScTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 424w, 
https://substackcdn.com/image/fetch/$s_!ScTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 848w, https://substackcdn.com/image/fetch/$s_!ScTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 1272w, https://substackcdn.com/image/fetch/$s_!ScTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11c30f39-add8-4c0f-b788-a4e05033d997_1500x837.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Early Sparse MoE Prototypes in LLMs</h2><h3>The Sparsely-Gated Mixture-of-Experts Layer</h3><h4>The Sparse mechanism</h4><p>The sparse Mixture of Experts (MoE) architecture was introduced by <a href="https://arxiv.org/pdf/1701.06538">Shazeer et al</a> in January 2017 in LSTM-based language models as a way to drastically scale the model capacity while keeping the number of operations constant, independent of the number of experts. RNN models are hard to scale because the LSTM operations are intrinsically iterative, preventing the high parallelism provided by other computational units. Scaling in depth with more LSTM units induces high latency, while scaling with feed-forward networks limits the ability of the LSTM layers to capture long-range coherence of the input sequences. Scaling in width with many parallel experts permits the LSTM to learn the long-term dependencies while keeping the latency to a minimum.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7hh0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7hh0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 424w, https://substackcdn.com/image/fetch/$s_!7hh0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 848w, 
https://substackcdn.com/image/fetch/$s_!7hh0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 1272w, https://substackcdn.com/image/fetch/$s_!7hh0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7hh0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png" width="1456" height="1185" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1185,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:549233,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7hh0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 424w, 
https://substackcdn.com/image/fetch/$s_!7hh0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 848w, https://substackcdn.com/image/fetch/$s_!7hh0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 1272w, https://substackcdn.com/image/fetch/$s_!7hh0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51128362-bd33-4d24-81cc-d16fc3d4e050_1500x1221.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t2a2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t2a2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 424w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 848w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t2a2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png" width="1456" height="1357" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1357,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:579829,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t2a2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 424w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 848w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 1272w, https://substackcdn.com/image/fetch/$s_!t2a2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c4cb031-9c90-48de-8cc5-bd3575331727_1500x1398.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To keep the number of operations independent of the number of experts, they introduced a routing mechanism that selects the top<em>-k</em> experts for each token. They tested a total number of experts that ranged between 4 and 131,072, but only the top-4 experts were used for each token. 
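As a rough illustration of why the cost stays roughly constant, here is a minimal sketch (PyTorch, with illustrative sizes; not the paper's actual code): one cheap matmul scores all experts, but only the k selected expert networks are evaluated, so the expensive expert FLOPs per token depend on k, not on the total number of experts n.

```python
import torch

# Illustrative sizes, not the paper's configuration
n_experts, k, d_model = 64, 4, 512
h = torch.randn(d_model)                      # hidden state of one token
W_g = torch.randn(n_experts, d_model) * 0.02  # gating projection
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]

logits = W_g @ h                              # one cheap matmul scores all experts
top_vals, top_idx = torch.topk(logits, k)     # keep only the k best
outputs = [experts[i](h) for i in top_idx.tolist()]  # only k experts actually run
```

Growing `n_experts` here only enlarges the cheap scoring matmul; the number of expert forward passes stays fixed at k.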
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5GWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5GWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 424w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 848w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 1272w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5GWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png" width="534" height="351.3543956043956" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:958,&quot;width&quot;:1456,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:283242,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5GWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 424w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 848w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 1272w, https://substackcdn.com/image/fetch/$s_!5GWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa19d20e5-172d-43d8-844c-fa114d7baae1_1500x987.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For the sparse MoE, the architecture is the same for every expert <em><strong>i</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E_i(\\mathbf{h})= \\text{ReLU}\\left(W_1^i\\mathbf{h} + \\mathbf{b}_1\\right)W_2^i + \\mathbf{b}_2&quot;,&quot;id&quot;:&quot;ZZQVVILKDI&quot;}" data-component-name="LatexBlockToDOM"></div><p>The gating mechanism is, as before, induced by a linear layer <em><strong>W<sub>g</sub></strong></em> mapping from hidden size <em><strong>d<sub>model</sub></strong></em> to <em><strong>n</strong></em> with an added normal noise <em><strong>&#1013;</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{l}(\\mathbf{h}) = W_g \\mathbf{h} + \\epsilon&quot;,&quot;id&quot;:&quot;TGLEQGWXHN&quot;}" 
data-component-name="LatexBlockToDOM"></div><p>The noise allows the model to explore the different experts uniformly early in training and prevents collapse onto a handful of "favorite" experts. To generate the noise, the input vector is first projected with another linear layer <em><strong>W<sub>n</sub></strong></em> and then passed through a softplus transformation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(\\mathbf{h}) = \\text{Softplus}(W_n \\mathbf{h}) = \\log(1 + \\exp(W_n \\mathbf{h})) &quot;,&quot;id&quot;:&quot;NWRFLCSWGK&quot;}" data-component-name="LatexBlockToDOM"></div><p>The resulting value <em><strong>&#120532;(h)</strong></em> is used as the standard deviation to generate the normal noise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\epsilon \\sim \\mathcal{N}(0, \\sigma(\\mathbf{h}))&quot;,&quot;id&quot;:&quot;YEMXDNKTUR&quot;}" data-component-name="LatexBlockToDOM"></div><p>The standard deviation <em><strong>&#120532;(h)</strong></em> controls how much randomness is injected for each expert on that particular input token. Larger <em><strong>&#120532;(h)</strong></em> leads to more exploration, while smaller <em><strong>&#120532;(h)</strong></em> pushes the gate to behave almost deterministically. Softplus is essentially a smooth version of ReLU: it grows linearly for large positive inputs and flattens toward 0 for negative inputs. It is used here to make sure the learned noise scale <em><strong>&#120532;(h)</strong></em> is positive and differentiable everywhere. 
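Putting the steps above together, the noisy logits might be computed as follows (a sketch with made-up sizes; <em>W_g</em> and <em>W_n</em> are the learned projections described above):

```python
import torch
import torch.nn.functional as F

d_model, n_experts = 512, 8                   # illustrative sizes
W_g = torch.randn(n_experts, d_model) * 0.02  # gating projection
W_n = torch.randn(n_experts, d_model) * 0.02  # noise projection
h = torch.randn(d_model)

sigma = F.softplus(W_n @ h)            # per-expert noise scale, always > 0
eps = torch.randn(n_experts) * sigma   # eps ~ N(0, sigma(h)), elementwise
logits = W_g @ h + eps                 # noisy gating logits l(h)
```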
Because the noise is input&#8209;adaptive, the router can start off as a near&#8209;uniform sampler (large <em><strong>&#120532;(h)</strong></em>) and gradually anneal into a confident switch (small <em><strong>&#120532;(h)</strong></em>).</p><p>Once the logits <em><strong>l(h) = {l<sub>1</sub>(h), l<sub>2</sub>(h), &#8230;, l<sub>n</sub>(h)} </strong></em>have been computed, we mask the contribution of every non-top-<em><strong>k</strong></em> expert with <em><strong>-&#8734;</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\overline{l}_i(\\mathbf{h})=\n\n\\begin{cases}\n\nl_i(\\mathbf{h}), &amp; \\text{if } l_i(\\mathbf{h})\\text{ is in the top-}k,\\\\\n\n-\\infty, &amp; \\text{otherwise}.\n\n\\end{cases}&quot;,&quot;id&quot;:&quot;SXFEJQAATB&quot;}" data-component-name="LatexBlockToDOM"></div><p>And we perform a softmax transformation to obtain the contribution of each top-<em>k</em> expert:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{g}(\\mathbf{h}) = \\frac{e^{\\overline{\\mathbf{l}}(\\mathbf{h})}}{\\sum_{i=1}^n e^{\\overline{l}_i(\\mathbf{h})}}&quot;,&quot;id&quot;:&quot;FRJJRPBOTM&quot;}" data-component-name="LatexBlockToDOM"></div><p>The <em><strong>-&#8734;</strong></em> masks force a <em><strong>g<sub>i</sub>(h) = 0</strong></em> contribution for every non-top-<em>k</em> expert while keeping the softmax normalization as if only the top-<em>k</em> experts contributed to the sum. 
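The effect of the -&#8734; mask can be checked numerically (toy logits and top-2 routing; a sketch, not the paper's code):

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5, 3.0, 0.0])  # toy gating logits, n = 5
k = 2
masked = torch.full_like(logits, float("-inf"))    # start with every expert masked
top_vals, top_idx = torch.topk(logits, k)
masked[top_idx] = top_vals                         # unmask only the top-k logits

g = torch.softmax(masked, dim=0)  # exp(-inf) = 0, so non-top-k weights vanish
# g sums to 1 using only the top-k terms, matching the normalization above
```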
The resulting hidden state <em><strong>y(h)</strong></em> is the weighted average of top-<em>k</em> experts' output:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{y}(\\mathbf{h})=\\sum_{i\\in \\text{top-}k} g_i(\\mathbf{h})E_i(\\mathbf{h})&quot;,&quot;id&quot;:&quot;XZDHWESRLK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LmkM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LmkM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 424w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 848w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 1272w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LmkM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png" width="1456" height="639" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211407,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/171576302?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LmkM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 424w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 848w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 1272w, https://substackcdn.com/image/fetch/$s_!LmkM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7429def3-c09e-4645-ad77-aa90b7a95f42_1500x658.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Hierarchical Mixture of Experts</h4><p>In the Sparsely-Gated Mixture-of-Experts paper, they scaled the number of experts up to 131,072! With a naive MoE, the gate and noise projection layers <em><strong>W<sub>g</sub></strong></em> and <em><strong>W<sub>n</sub></strong></em> would each have dimension <em><strong>(n &#10761; d<sub>model</sub>)</strong></em> ~ 67M parameters (with <em><strong>d<sub>model </sub></strong></em>= 512), with as many operations for each token in the input sequence. At this scale, the gating mechanism itself becomes the bottleneck. Instead, they introduced a hierarchical gating process where the experts were grouped into <em><strong>a </strong></em>blocks of <em><strong>b</strong></em> experts. 
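A quick back-of-the-envelope check of that bottleneck (pure Python; the <em>a</em>, <em>b</em>, and <em>k<sub>1</sub></em> values below are illustrative, not the paper's):

```python
d_model = 512
n = 131_072                            # largest number of experts tested

flat_gate_params = n * d_model         # W_g alone: 131072 * 512
assert flat_gate_params == 67_108_864  # ~67M parameters, as stated above

# Hierarchical gating over a blocks of b experts (a * b = n) only computes
# a first-level logits plus the b logits of each of the top-k1 blocks:
a, b, k1 = 256, 512, 4                 # illustrative split, 256 * 512 = 131072
flat_ops = n                           # per-token logits with a flat gate
hier_ops = a + k1 * b                  # per-token logits with a hierarchical gate
print(flat_ops, hier_ops)              # 131072 vs 2304
```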
The first gate projects from <em><strong>d<sub>model</sub></strong></em> to <em><strong>a</strong></em> blocks:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{l}^{(1)}(\\mathbf{h}) = W_g^{(1)} \\mathbf{h} + \\epsilon^{(1)}&quot;,&quot;id&quot;:&quot;KBASKINGAG&quot;}" data-component-name="LatexBlockToDOM"></div><p>From this first set of logits, we can pick the top-<em><strong>k<sub>1</sub></strong></em> expert blocks:</p>
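The excerpt stops before the second-level gate, but a plausible sketch of the two-level selection (assuming the inner gates mirror the first-level mechanism; all names and sizes here are illustrative) looks like:

```python
import torch

d_model, a, b, k1 = 512, 16, 8, 2             # a blocks of b experts (toy sizes)
h = torch.randn(d_model)

W_g1 = torch.randn(a, d_model) * 0.02         # first gate: d_model -> a blocks
block_logits = W_g1 @ h
_, top_blocks = torch.topk(block_logits, k1)  # pick the top-k1 expert blocks

# Only the gates of the selected blocks are evaluated, each one
# mapping d_model -> b experts inside its block.
W_g2 = torch.randn(a, b, d_model) * 0.02      # one inner gate per block
inner_logits = [W_g2[i] @ h for i in top_blocks.tolist()]
```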
      <p>
          <a href="https://newsletter.theaiedge.io/p/mixture-of-experts-early-sparse-moe">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Last Week to Register to the Build Production-Ready LLMs From Scratch Course!]]></title><description><![CDATA[From Prototype to Production: Ship Scalable LLM Systems in 6 Weeks]]></description><link>https://newsletter.theaiedge.io/p/last-week-to-register-to-the-build-417</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/last-week-to-register-to-the-build-417</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 09 Jul 2025 15:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xOou!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This Saturday, we kick off the latest cohort of the <strong><a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms">Build Production-Ready LLMs From Scratch</a></strong> course! This is the last week to register, so make sure to join us if you want to get the right skills as a Machine Learning engineer! This is a 6-week program to learn to build scalable LLMs from scratch and ship them to production. It will run between July 12th and Aug 17th, 2025. 
It includes 12 live sessions, 6 real-world hands-on projects, 64 recorded lectures, and more material.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms&quot;,&quot;text&quot;:&quot;Enroll&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms"><span>Enroll</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOou!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOou!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 424w, https://substackcdn.com/image/fetch/$s_!xOou!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 848w, https://substackcdn.com/image/fetch/$s_!xOou!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!xOou!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xOou!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png" width="1352" height="1062" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1062,&quot;width&quot;:1352,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:437035,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/167788158?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOou!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 424w, https://substackcdn.com/image/fetch/$s_!xOou!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 848w, https://substackcdn.com/image/fetch/$s_!xOou!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 1272w, https://substackcdn.com/image/fetch/$s_!xOou!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92b8d44d-fe3e-4574-a114-21e1fbfa9b15_1352x1062.png 
1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The Real-World LLM Engineering Roadblocks You Face Today</strong></h3><p><strong>&#128075; Transitioning from General ML to LLM Specialization:</strong> You&#8217;ve built recommendation engines or classifier models, but moving into Transformer&#8209;centric development feels like learning a whole new discipline&#8212;no clear roadmap exists.</p><p><strong>&#128075; Lack of LLM&#8209;Specific Career Path: </strong>You see &#8220;LLM Engineer&#8221; roles popping up on LinkedIn, but your current CV only shows &#8220;Data Scientist&#8221; or &#8220;ML 
Engineer.&#8221; You need hands&#8209;on projects and artifacts to credibly make the jump.</p><p><strong>&#128075; Career Stalled by &#8220;Academic&#8221; Skillset:</strong> You can recite Transformer papers, but when asked, &#8220;Have you shipped an LLM feature end&#8209;to&#8209;end?&#8221; you have no answer&#8212;and no portfolio to prove it!</p><p><strong>&#128075; Prototype Meltdown Under Production Load: </strong>You&#8217;ve fine&#8209;tuned a small model locally, but when you switch from 1 to 100 concurrent requests, your GPU memory spikes and inference grinds to a halt, because you&#8217;ve never applied continuous batching, KV caching, or paged&#8209;attention in a live setting.</p><p><strong>&#128075; RAG Integration Headaches: </strong>Turning a standalone model into a live Retrieval&#8209;Augmented Generation service becomes a multi&#8209;week integration nightmare.</p><h3>How this course will help you</h3><p>We&#8217;ve <strong>packaged every stage</strong> of the LLM lifecycle, <strong>from career transition to production rollout</strong>, into a <strong>six&#8209;week bootcamp</strong> that:</p><p>&#9989; <strong>Guides Your Career Pivot: </strong>You&#8217;ll emerge with six polished GitHub projects, a deployment playbook, and RAG demos that transform your resume from &#8220;ML generalist&#8221; to &#8220;LLM Specialist.&#8221;</p><p>&#9989; <strong>Attacks Each Pain&#8209;Point Head&#8209;On: </strong>Six job&#8209;mirroring projects (from scratch &#8594; RLHF &#8594; scaling &#8594; deployment &#8594; RAG) tackle each roadblock directly, so you never waste time on dead&#8209;end tutorials.</p><p>&#9989; <strong>Live Code&#8209;Along Workshops &amp; Office Hours: </strong>Tackle your own fine&#8209;tuning bugs, scaling hiccups, and deployment errors alongside Damien in dedicated sessions, so you get hands&#8209;on fixes for the exact issues you&#8217;ll face on the job.</p><p>&#9989; <strong>Ready&#8209;to&#8209;Use Repos
&amp; Playbooks: </strong>Grab our curated starter code, development scripts, deployment templates, and debugging checklists, so you can plug them straight into your next project without reinventing the wheel.</p><p>&#9989; <strong>A Portfolio of Six Production&#8209;Grade Projects: </strong>Leave with six end&#8209;to&#8209;end deliverables, from a Transformer built from scratch to a live RAG API, ready to showcase on GitHub, in performance reviews, or to hiring managers.</p><p>No more scattered blog-hopping or generic bootcamps: this is <strong>the only</strong> cohort where you&#8217;ll <strong>master</strong> Transformer internals <em>and</em> <strong>ship</strong> production&#8209;grade LLM systems while making the career leap you&#8217;ve been aiming for.</p><h3>What You&#8217;ll Actually Build and Ship</h3><p>Across six hands&#8209;on projects, you&#8217;ll deliver deployable LLM components and applications. No fluff, just job&#8209;ready code:</p><p>&#9989; <strong>A Modern Transformer Architecture from scratch: </strong>Implement sliding&#8209;window multihead attention to cut the O(N&#178;) attention cost to O(N&#183;w), RoPE for relative positional encoding, and the Mixture-of-Experts architecture for improved performance, all in PyTorch.</p><p>&#9989; <strong>Instruction&#8209;Tuned LLM: </strong>Fine&#8209;tune a model with supervised learning, RLHF, DPO, and ORPO for instruction following on a real benchmark and compare performance gains.</p><p>&#9989; <strong>Scalable Training Pipeline: </strong>Containerize a multi&#8209;GPU job with DeepSpeed ZeRO on SageMaker to maximize throughput and minimize cost.</p><p>&#9989; <strong>Extended&#8209;Context Model: </strong>Modify RoPE scaling, apply 4/8&#8209;bit quantization, and inject LoRA adapters to double your context window.</p><p>&#9989; <strong>Multi&#8209;Mode Deployment: </strong>Stand up a Hugging Face endpoint, a vLLM streaming API, and an OpenAI&#8209;compatible server, all Dockerized and optimized for low
latency.</p><p>&#9989; <strong>End&#8209;to&#8209;End RAG Chat App: </strong>Build a FastAPI backend with conversational memory and a Streamlit UI for live Retrieval&#8209;Augmented Generation.</p><p>By the end of Week 6, you won&#8217;t just know these techniques; you&#8217;ll have shipped six production&#8209;grade artifacts, each reflecting the exact pipelines, optimizations, and deployment routines you&#8217;ll use on the job.</p><h3>Live &amp; Recorded Content: Reinforce, Deepen, Accelerate</h3><p>&#10024; <strong>12 Interactive Live Workshops (3 hrs each): </strong>Each session follows the Concept &#8594; Code flow. I&#8217;ll introduce the day&#8217;s core topic (e.g., self-attention, LoRA, vLLM optimizations, ...), and we&#8217;ll implement the features step&#8209;by&#8209;step in code so you see exactly how theory maps to code. Bring your questions!</p><p>&#10024; <strong>10+ Hours of On&#8209;Demand Deep&#8209;Dive Lectures: </strong>Short videos (10&#8211;20 min) on Transformer internals, fine-tuning tricks, and deployment optimizations. Watch before each project to hit the ground running. Step through every line of code at your own pace; perfect for review or catching up if you miss a live session. Downloadable slide decks, annotated notebooks, and cheat sheets you&#8217;ll reference long after graduation.</p><p><strong>Why This Matters:</strong> Live workshops turn recorded concepts into <strong>actionable skills</strong>. You&#8217;ll see how theory maps directly onto code, get instant feedback, and internalize best practices. Then, recorded lectures become your <strong>asynchronous safety net</strong>, letting you revisit tricky topics, prepare for upcoming labs, and solidify your understanding on demand.</p><p>Let me know if you have any questions. I hope to see you there!
</p>]]></content:encoded></item><item><title><![CDATA[Build Production-Ready LLMs From Scratch Starting on July 12th!]]></title><description><![CDATA[From Prototype to Production: Ship Scalable LLM Systems in 6 Weeks]]></description><link>https://newsletter.theaiedge.io/p/build-production-ready-llms-from-c43</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/build-production-ready-llms-from-c43</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 16 Jun 2025 15:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J9Vr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Get ready! The latest iteration of the <strong><a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first">Build Production-Ready LLMs From Scratch</a></strong> live course is starting on <strong>July 12th</strong>! This is a 6-week program to learn to build scalable LLMs from scratch and ship them to production. It will run between July 12th and August 17, 2025. It includes 12 live sessions, 6 real-world hands-on projects, 64 recorded lectures, and more material. 
<strong>The first 30 people to sign up will get a 20% discount by applying the promo code <a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first">FIRST</a>!</strong> So make sure to sign up early:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first&quot;,&quot;text&quot;:&quot;Signup&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first"><span>Signup</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" width="574" height="322.875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:3569157,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/161774781?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, 
https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3><strong>The Real-World LLM Engineering Roadblocks You Face Today</strong></h3><p><strong>&#128075; Transitioning from General ML to LLM Specialization:</strong> You&#8217;ve built recommendation engines or classifier models, but moving into Transformer&#8209;centric development feels like learning a whole new discipline&#8212;no clear roadmap
exists.</p><p><strong>&#128075; Lack of LLM&#8209;Specific Career Path: </strong>You see &#8220;LLM Engineer&#8221; roles popping up on LinkedIn, but your current CV only shows &#8220;Data Scientist&#8221; or &#8220;ML Engineer.&#8221; You need hands&#8209;on projects and artifacts to credibly make the jump.</p><p><strong>&#128075; Career Stalled by &#8220;Academic&#8221; Skillset:</strong> You can recite Transformer papers, but when asked, &#8220;Have you shipped an LLM feature end&#8209;to&#8209;end?&#8221; you have no answer&#8212;and no portfolio to prove it!</p><p><strong>&#128075; Prototype Meltdown Under Production Load: </strong>You&#8217;ve fine&#8209;tuned a small model locally, but when you switch from 1 to 100 concurrent requests, your GPU memory spikes and inference grinds to a halt, because you&#8217;ve never applied continuous batching, KV caching, or paged&#8209;attention in a live setting.</p><p><strong>&#128075; RAG Integration Headaches: </strong>Turning a standalone model into a live Retrieval&#8209;Augmented Generation service becomes a multi&#8209;week integration nightmare.</p><h3><strong>How this course will help you</strong></h3><p>We&#8217;ve <strong>packaged every stage</strong> of the LLM lifecycle, <strong>from career transition to production rollout</strong>, into a <strong>six&#8209;week bootcamp</strong> that:</p><p>&#9989; <strong>Guides Your Career Pivot: </strong>You&#8217;ll emerge with six polished GitHub projects, a deployment playbook, and RAG demos that transform your resume from &#8220;ML generalist&#8221; to &#8220;LLM Specialist.&#8221;</p><p>&#9989; <strong>Attacks Each Pain&#8209;Point Head&#8209;On: </strong>Six job&#8209;mirroring projects (from scratch &#8594; RLHF &#8594; scaling &#8594; deployment &#8594; RAG) tackle each roadblock directly, so you never waste time on dead&#8209;end tutorials.</p><p>&#9989; <strong>Live Code&#8209;Along Workshops &amp; Office Hours: </strong>Tackle your own
fine&#8209;tuning bugs, scaling hiccups, and deployment errors alongside Damien in dedicated sessions, so you get hands&#8209;on fixes for the exact issues you&#8217;ll face on the job.</p><p>&#9989; <strong>Ready&#8209;to&#8209;Use Repos &amp; Playbooks: </strong>Grab our curated starter code, development scripts, deployment templates, and debugging checklists, so you can plug them straight into your next project without reinventing the wheel.</p><p>&#9989; <strong>A Portfolio of Six Production&#8209;Grade Projects: </strong>Leave with six end&#8209;to&#8209;end deliverables, from a Transformer built from scratch to a live RAG API, ready to showcase on GitHub, in performance reviews, or to hiring managers.</p><p>No more scattered blog-hopping or generic bootcamps: this is <strong>the only</strong> cohort where you&#8217;ll <strong>master</strong> Transformer internals <em>and</em> <strong>ship</strong> production&#8209;grade LLM systems while making the career leap you&#8217;ve been aiming for.</p><h3><strong>What You&#8217;ll Actually Build and Ship</strong></h3><p>Across six hands&#8209;on projects, you&#8217;ll deliver deployable LLM components and applications. No fluff, just job&#8209;ready code:</p><p>&#9989; <strong>A Modern Transformer Architecture from scratch: </strong>Implement sliding&#8209;window multihead attention to cut the O(N&#178;) attention cost to O(N&#183;w), RoPE for relative positional encoding, and the Mixture-of-Experts architecture for improved performance, all in PyTorch.</p><p>&#9989; <strong>Instruction&#8209;Tuned LLM: </strong>Fine&#8209;tune a model with supervised learning, RLHF, DPO, and ORPO for instruction following on a real benchmark and compare performance gains.</p><p>&#9989; <strong>Scalable Training Pipeline: </strong>Containerize a multi&#8209;GPU job with DeepSpeed ZeRO on SageMaker to maximize throughput and minimize cost.</p><p>&#9989; <strong>Extended&#8209;Context Model: </strong>Modify RoPE scaling, apply 4/8&#8209;bit
quantization, and inject LoRA adapters to double your context window.</p><p>&#9989; <strong>Multi&#8209;Mode Deployment: </strong>Stand up a Hugging Face endpoint, a vLLM streaming API, and an OpenAI&#8209;compatible server, all Dockerized and optimized for low latency.</p><p>&#9989; <strong>End&#8209;to&#8209;End RAG Chat App: </strong>Build a FastAPI backend with conversational memory and a Streamlit UI for live Retrieval&#8209;Augmented Generation.</p><p>By the end of Week 6, you won&#8217;t just know these techniques; you&#8217;ll have shipped six production&#8209;grade artifacts, each reflecting the exact pipelines, optimizations, and deployment routines you&#8217;ll use on the job.</p><h3><strong>Live &amp; Recorded Content: Reinforce, Deepen, Accelerate</strong></h3><p>&#10024; <strong>12 Interactive Live Workshops (3 hrs each): </strong>Each session follows the Concept &#8594; Code flow. I&#8217;ll introduce the day&#8217;s core topic (e.g., self-attention, LoRA, vLLM optimizations, ...), and we&#8217;ll implement the features step&#8209;by&#8209;step in code so you see exactly how theory maps to code. Bring your questions!</p><p>&#10024; <strong>10+ Hours of On&#8209;Demand Deep&#8209;Dive Lectures: </strong>Short videos (10&#8211;20 min) on Transformer internals, fine-tuning tricks, and deployment optimizations. Watch before each project to hit the ground running. Step through every line of code at your own pace; perfect for review or catching up if you miss a live session. Downloadable slide decks, annotated notebooks, and cheat sheets you&#8217;ll reference long after graduation.</p><p><strong>Why This Matters:</strong> Live workshops turn recorded concepts into <strong>actionable skills</strong>. You&#8217;ll see how theory maps directly onto code, get instant feedback, and internalize best practices.
Then, recorded lectures become your <strong>asynchronous safety net</strong>, letting you revisit tricky topics, prepare for upcoming labs, and solidify your understanding on demand.</p><p>Let me know if you have any questions. I hope to see you there!</p>]]></content:encoded></item><item><title><![CDATA[Last Week to Register to the Build Production-Ready LLMs From Scratch Course!]]></title><description><![CDATA[From Prototype to Production: Ship Scalable LLM Systems in 6 Weeks]]></description><link>https://newsletter.theaiedge.io/p/last-week-to-register-to-the-build</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/last-week-to-register-to-the-build</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 19 May 2025 15:54:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J9Vr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This Saturday, we kick off the <strong><a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms">Build Production-Ready LLMs From Scratch</a></strong> course! This is the last week to register, so make sure to join us if you want to get the right skills as a Machine Learning engineer! This is a 6-week program to learn to build scalable LLMs from scratch and ship them to production. It will run between May 24th and June 29, 2025. 
It includes 12 live sessions, 6 real-world hands-on projects, 64 recorded lectures, and more material.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms&quot;,&quot;text&quot;:&quot;Enroll&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms"><span>Enroll</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" width="574" height="322.875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:3569157,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/161774781?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h3><strong>The Real-World LLM Engineering Roadblocks You Face Today</strong></h3><p><strong>&#128075; Transitioning from General ML to LLM Specialization:</strong> You&#8217;ve built recommendation engines or classifier models, but moving into Transformer&#8209;centric development feels like learning a whole new discipline&#8212;no clear roadmap exists.</p><p><strong>&#128075; Lack of LLM&#8209;Specific Career Path: </strong>You see &#8220;LLM Engineer&#8221; roles popping up on LinkedIn, but your current CV only shows &#8220;Data Scientist&#8221; or &#8220;ML Engineer.&#8221; You need hands&#8209;on projects and artifacts to credibly make the jump.</p><p><strong>&#128075; Career Stalled by &#8220;Academic&#8221; Skillset:</strong>
You can recite Transformer papers, but when asked, &#8220;Have you shipped an LLM feature end&#8209;to&#8209;end?&#8221; you have no answer&#8212;and no portfolio to prove it!</p><p><strong>&#128075; Prototype Meltdown Under Production Load: </strong>You&#8217;ve fine&#8209;tuned a small model locally, but when you switch from 1 to 100 concurrent requests, your GPU memory spikes and inference grinds to a halt, because you&#8217;ve never applied continuous batching, KV caching, or paged&#8209;attention in a live setting.</p><p><strong>&#128075; RAG Integration Headaches: </strong>Turning a standalone model into a live Retrieval&#8209;Augmented Generation service becomes a multi&#8209;week integration nightmare.</p><h3>How this course will help you</h3><p>We&#8217;ve <strong>packaged every stage</strong> of the LLM lifecycle, <strong>from career transition to production rollout</strong>, into a <strong>six&#8209;week bootcamp</strong> that:</p><p>&#9989; <strong>Guides Your Career Pivot: </strong>You&#8217;ll emerge with six polished GitHub projects, a deployment playbook, and RAG demos that transform your resume from &#8220;ML generalist&#8221; to &#8220;LLM Specialist.&#8221;</p><p>&#9989; <strong>Attacks Each Pain&#8209;Point Head&#8209;On: </strong>Six job&#8209;mirroring projects (from scratch &#8594; RLHF &#8594; scaling &#8594; deployment &#8594; RAG) tackle each roadblock directly, so you never waste time on dead&#8209;end tutorials.</p><p>&#9989; <strong>Live Code&#8209;Along Workshops &amp; Office Hours: </strong>Tackle your own fine&#8209;tuning bugs, scaling hiccups, and deployment errors alongside Damien in dedicated sessions, so you get hands&#8209;on fixes for the exact issues you&#8217;ll face on the job.</p><p>&#9989; <strong>Ready&#8209;to&#8209;Use Repos &amp; Playbooks: </strong>Grab our curated starter code, development scripts, deployment templates, and debugging checklists, so you can plug them straight into your next
project without reinventing the wheel.</p><p>&#9989; <strong>A Portfolio of Six Production&#8209;Grade Projects: </strong>Leave with six end&#8209;to&#8209;end deliverables, from a Transformer built from scratch to a live RAG API, ready to showcase on GitHub, in performance reviews, or to hiring managers.</p><p>No more scattered blog-hopping or generic bootcamps: this is <strong>the only</strong> cohort where you&#8217;ll <strong>master</strong> Transformer internals <em>and</em> <strong>ship</strong> production&#8209;grade LLM systems while making the career leap you&#8217;ve been aiming for.</p><h3>What You&#8217;ll Actually Build and Ship</h3><p>Across six hands&#8209;on projects, you&#8217;ll deliver deployable LLM components and applications. No fluff, just job&#8209;ready code:</p><p>&#9989; <strong>A Modern Transformer Architecture from scratch: </strong>Implement sliding&#8209;window multihead attention to cut the O(N&#178;) attention cost to O(N&#183;w), RoPE for relative positional encoding, and the Mixture-of-Experts architecture for improved performance, all in PyTorch.</p><p>&#9989; <strong>Instruction&#8209;Tuned LLM: </strong>Fine&#8209;tune a model with supervised learning, RLHF, DPO, and ORPO for instruction following on a real benchmark and compare performance gains.</p><p>&#9989; <strong>Scalable Training Pipeline: </strong>Containerize a multi&#8209;GPU job with DeepSpeed ZeRO on SageMaker to maximize throughput and minimize cost.</p><p>&#9989; <strong>Extended&#8209;Context Model: </strong>Modify RoPE scaling, apply 4/8&#8209;bit quantization, and inject LoRA adapters to double your context window.</p><p>&#9989; <strong>Multi&#8209;Mode Deployment: </strong>Stand up a Hugging Face endpoint, a vLLM streaming API, and an OpenAI&#8209;compatible server, all Dockerized and optimized for low latency.</p><p>&#9989; <strong>End&#8209;to&#8209;End RAG Chat App: </strong>Build a FastAPI backend with conversational memory and a Streamlit UI for live
Retrieval&#8209;Augmented Generation.</p><p>By the end of Week 6, you won&#8217;t just know these techniques; you&#8217;ll have shipped six production&#8209;grade artifacts, each reflecting the exact pipelines, optimizations, and deployment routines you&#8217;ll use on the job.</p><h3>Live &amp; Recorded Content: Reinforce, Deepen, Accelerate</h3><p>&#10024; <strong>12 Interactive Live Workshops (3 hrs each): </strong>Each session follows the Concept &#8594; Code flow. I&#8217;ll introduce the day&#8217;s core topic (e.g., self-attention, LoRA, vLLM optimizations, ...), and we&#8217;ll implement the features step&#8209;by&#8209;step in code so you see exactly how theory maps to code. Bring your questions!</p><p>&#10024; <strong>10+ Hours of On&#8209;Demand Deep&#8209;Dive Lectures: </strong>Short videos (10&#8211;20 min) on Transformer internals, fine-tuning tricks, and deployment optimizations. Watch before each project to hit the ground running. Step through every line of code at your own pace; perfect for review or catching up if you miss a live session. Downloadable slide decks, annotated notebooks, and cheat sheets you&#8217;ll reference long after graduation.</p><p><strong>Why This Matters:</strong> Live workshops turn recorded concepts into <strong>actionable skills</strong>. You&#8217;ll see how theory maps directly onto code, get instant feedback, and internalize best practices. Then, recorded lectures become your <strong>asynchronous safety net</strong>, letting you revisit tricky topics, prepare for upcoming labs, and solidify your understanding on demand.</p><p>Let me know if you have any questions. I hope to see you there!
</p>]]></content:encoded></item><item><title><![CDATA[All About The Modern Positional Encodings In LLMs]]></title><description><![CDATA[The Positional Encoding in LLMs may appear somewhat mysterious the first time we come across the concept, and for good reasons!]]></description><link>https://newsletter.theaiedge.io/p/all-about-the-modern-positional-encodings</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/all-about-the-modern-positional-encodings</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 28 Apr 2025 15:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b3b068-0cc6-47c5-b160-3d3b07548912_5372x2875.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>The Positional Encoding in LLMs may appear somewhat mysterious the first time we come across the concept, and for good reasons! Over the years, researchers have found many different ad-hoc ways to encode positions and relative positions in the attention mechanism. The most recent surprise to me is the emergence of NoPE (no positional encoding) as being better for generalization on longer sequences. 
Today, we look at:</strong></em></p><ul><li><p><em><strong>Additive Relative Positional Embeddings</strong></em></p></li><li><p><em><strong>Multiplicative Relative Positional Embeddings</strong></em></p></li><li><p><em><strong>ALiBi: Attention With Linear Biases</strong></em></p></li><li><p><em><strong>RoPE: Rotary Position Embedding</strong></em></p><ul><li><p><em><strong>The Complex Number Representation</strong></em></p></li><li><p><em><strong>Increasing The Context Window With RoPE</strong></em></p></li></ul></li><li><p><em><strong>No Position Encoder: NoPE</strong></em></p><ul><li><p><em><strong>How NoPE Can Learn Relative Positional Information</strong></em></p></li><li><p><em><strong>Better Generalization for Longer Distance</strong></em></p></li><li><p><em><strong>Llama 4's iRoPE</strong></em></p></li></ul></li></ul><div><hr></div><p>The original positional encoding scheme introduced in <a href="https://arxiv.org/pdf/1706.03762">"Attention is All You Need"</a> uses sinusoidal functions to create absolute position representations, but this approach revealed significant shortcomings when handling longer sequences or transferring to different sequence lengths: </p><ul><li><p><strong>Fixed context windows and poor extrapolation:</strong> The model is trained with a fixed maximum sequence length. When faced with longer sequences, it must either truncate the input or use positional values it never encountered during training.</p></li><li><p><strong>The absolute vs. relative position problem:</strong> Perhaps the most significant limitation is that vanilla encodings represent absolute positions rather than relationships between tokens. In language, relative positions often matter more than absolute ones. Consider the sentence: "The dog that chased the cat ran away." 
The relationship between "dog" and "ran" remains the same whether this is the opening sentence of a document or appears on page fifty, but absolute encodings fail to directly capture this invariance.</p></li><li><p><strong>Information dilution through layers: </strong>As signals propagate through the multiple layers of a Transformer, the influence of the original positional information tends to weaken. The model must work harder to maintain positional awareness in deeper layers, especially for distant tokens.</p></li><li><p><strong>Limited inductive bias for local relationships:</strong> Natural language exhibits a locality bias as nearby words often have stronger relationships than distant ones. The vanilla encoding doesn't naturally encode this bias, treating position 5 and position 500 as equally valid attention targets from a structural perspective, requiring the model to learn this pattern from data alone.</p></li><li><p><strong>Mathematical constraints:</strong> The sinusoidal functions used in the original formulation were chosen partly for their theoretical ability to generalize to unseen positions. However, in practice, they still struggle with significant extrapolation beyond the training range. 
The model learns to associate specific patterns with specific position ranges, and these associations become increasingly unreliable as we move farther from the training distribution.</p></li></ul><p>These limitations collectively motivated researchers to develop more sophisticated positional encoding schemes, from Shaw's direct relative encodings to Transformer-XL's decomposed approach, ALiBi's distance-based penalties, and RoPE's elegant rotational solution, each addressing different aspects of these fundamental challenges.</p><p>You can find the discussion about the original positional encoding here:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5fc75b4b-f26e-41bf-85f2-040fa233fd6e&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Attention Is All You Need: The Original Transformer Architecture&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:24785675,&quot;name&quot;:&quot;Damien Benveniste&quot;,&quot;bio&quot;:&quot;I specialize in building large scale end to end Machine Learning capabilities. After a PhD in Physics, I have been a Data Scientist, ML Engineer and Software Engineer for the past 10 years. 
Until recently, I was a Machine Learning Tech Lead at Meta.\n&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2157884e-8c5d-4cde-ab33-455fa623975d_2060x2061.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2025-02-12T16:02:31.464Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://newsletter.theaiedge.io/p/attention-is-all-you-need-the-original&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:156937078,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:20,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;The AiEdge Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e9c582-b22b-45c5-a64e-a9105824fb01_1067x1067.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h2>Additive Relative Positional Embeddings</h2><p>One of the first works addressing capturing the relative position between tokens instead of the absolute ones was done by <a href="https://arxiv.org/pdf/1803.02155">Shaw et al in 2018</a>. It addressed many of the original positional encoding's shortcomings. Instead of having an embedding matrix added to the semantic representation of the tokens, it modifies directly the computation of the attention layers with two positional embeddings. 
The first one <em><strong>a<sub>ij</sub><sup>K</sup></strong></em> is added to the key representation, and the second <em><strong>a<sub>ij</sub><sup>V</sup></strong></em>, to the value representation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\mathbf{k}_j \\rightarrow \\mathbf{k}_j + \\mathbf{a}_{ij}^K \\nonumber\\\\\n\n   \\mathbf{v}_j \\rightarrow \\mathbf{v}_j + \\mathbf{a}_{ij}^V \n\n\\end{align}&quot;,&quot;id&quot;:&quot;NBSRMDTYCR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>a<sub>ij</sub><sup>K</sup></strong></em> and <em><strong>a<sub>ij</sub><sup>V</sup></strong></em> are learned embeddings of size <em><strong>d<sub>model</sub></strong></em> that depend only on the relative distance <em><strong>j-i</strong></em>. Those embeddings are learned in each attention layer and influence more directly the interactions between tokens. Ignoring the attention heads for simplicity, the computation of a context vector <em><strong>c<sub>i</sub></strong></em> is modified as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;  \\mathbf{c}_i = \\sum_{j=1}^N \\text{Softmax}\\left(\\frac{\\mathbf{q}_i^T\\left(\\mathbf{k}_j+\\mathbf{a}_{ij}^K\\right)}{\\sqrt{d_\\text{model}}}\\right)\\left(\\mathbf{v}_j+\\mathbf{a}_{ij}^V\\right)&quot;,&quot;id&quot;:&quot;DPPBNZTUMK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E3zP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!E3zP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 424w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 848w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 1272w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E3zP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png" width="1456" height="787" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:434965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E3zP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 424w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 848w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 1272w, https://substackcdn.com/image/fetch/$s_!E3zP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef5038e-1daa-4602-ac45-f5dee9a72b6a_1500x811.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They clip the maximum relative distance to a constant <em><strong>L</strong></em>, learning only <em><strong>2L+1</strong></em> position embeddings for each positional encoding. If we call the matrices of learned positions <em><strong>w<sup>K</sup></strong></em> and <em><strong>w<sup>V</sup></strong></em>, the actual relative position representations that are added to keys and values are the clipped version of those matrices:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\mathbf{a}_{ij}^K = w^K_{\\text{clip}(j-i, L)} \\nonumber\\\\\n\n    \\mathbf{a}_{ij}^V = w^V_{\\text{clip}(j-i, L)}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;WECGVRDLAI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{clip}(j-i, L)=\\max\\left(-L, \\min\\left(L, j-i\\right)\\right)&quot;,&quot;id&quot;:&quot;MJPSUATWIQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>which clamps the relative distance to the range <em><strong>[&#8722;L, L]</strong></em>. 
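To make the clipping concrete, here is a minimal NumPy sketch of the clipped embedding lookup (the toy sizes and variable names are ours, not from the paper):

```python
import numpy as np

L = 4            # clipping distance (toy value)
d_model = 8      # embedding size (toy value)
rng = np.random.default_rng(0)

# One learned table per encoding: 2L+1 embeddings, indexed by clipped distance.
w_K = rng.normal(size=(2 * L + 1, d_model))

def clip(rel, L):
    # clip(j - i, L) = max(-L, min(L, j - i))
    return max(-L, min(L, rel))

def a_K(i, j):
    # Shift the clipped distance from [-L, L] to a table index in [0, 2L].
    return w_K[clip(j - i, L) + L]

# Any distance beyond L maps to the same embedding:
assert np.allclose(a_K(0, L + 5), a_K(0, L))      # j - i > L
assert np.allclose(a_K(L + 5, 0), a_K(L, 0))      # j - i < -L
```

The same lookup, with a second table, produces the <em><strong>a<sub>ij</sub><sup>V</sup></strong></em> embeddings added to the values.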
This means that for any relative distance <em><strong>j-i &gt; L</strong></em>, </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{a}_{ij}^K = \\mathbf{a}_{i,i+L}^K&quot;,&quot;id&quot;:&quot;QZMCUELNLY&quot;}" data-component-name="LatexBlockToDOM"></div><p>and any relative distance<em><strong> j-i &lt; -L</strong></em>, </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{a}_{ij}^K = \\mathbf{a}_{i,i-L}^K&quot;,&quot;id&quot;:&quot;BDZUAUVDIO&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qpP0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qpP0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 424w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 848w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 1272w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qpP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png" width="1456" height="617" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:617,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:410044,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qpP0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 424w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 848w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 1272w, https://substackcdn.com/image/fetch/$s_!qpP0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F671af1e7-6bfd-497c-8851-ad9fa43532ee_1500x636.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In practice, <em><strong>L</strong></em> can be chosen quite small as the performance remained relatively stable for <em><strong>L &gt; 2</strong></em> (testing <em><strong>L = 4</strong></em>, <em><strong>L = 16</strong></em>, <em><strong>L = 64</strong></em>, and <em><strong>L = 256</strong></em>). This suggests that distinguishing the precise relative distances of tokens becomes less important beyond a few positions away. 
The model primarily needs fine-grained position information for nearby contexts, while more distant relationships can be effectively captured with less positional precision, allowing the content representations themselves to drive attention patterns at longer distances. Even with a very small clipping window (<em><strong>L = 2</strong></em>), the model with relative positional encoding achieved substantial improvements over absolute positioning.</p><p>This performance gain, however, comes at the cost of added parameters and added time complexity. At the time, Shaw et al. performed their experiments with a 65M-parameter Transformer with 6 layers and <em><strong>d<sub>model</sub></strong></em><strong> = 512</strong>. Each attention layer needs two new parameter matrices of size <em><strong>(2L+1) x d<sub>model</sub></strong></em>, so with <em><strong>L = 16</strong></em>, that is ~200K additional parameters across the 6 layers, which is negligible compared to the overall size of the model. More importantly, computing all the relative distances between the <em><strong>N</strong></em> tokens, gathering the corresponding embeddings, and adding them to the keys and values is an <em><strong>O(N<sup>2</sup>)</strong></em> process that imposes significant additional computational work on top of the already quadratic complexity of attention.</p><h2>Multiplicative Relative Positional Embeddings</h2><p>In a previous <a href="https://newsletter.theaiedge.io/p/how-to-construct-self-attention-mechanisms">newsletter</a>, we introduced the attention mechanism developed with <a href="https://arxiv.org/pdf/1901.02860">Transformer-XL</a> to handle arbitrarily long sequences, but we left out the discussion about the positional encoding. Developing a new relative positional encoding was one of the critical pieces needed to handle longer sequences. 
The approach is based on the following analysis we provided in a <a href="https://newsletter.theaiedge.io/p/attention-is-all-you-need-the-original">previous newsletter</a> for the original positional encoding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\ne_{i,i+k}\\sqrt{d_\\text{model}} &amp;= \\quad\\underbrace{\\mathbf{x}_i^\\top W^{Q\\top} W^K \\mathbf{x}_{i+k}}_{\\text{Token-Token Interaction}} \\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{x}_i^\\top W^{Q\\top} W^K \\mathbf{PE}(i) R(k)}_{\\text{Token-Position Interaction}} \\nonumber\\\\\n\n&amp;+ \\quad\\underbrace{\\mathbf{PE}(i)^\\top W^{Q\\top} W^K \\mathbf{x}_{i+k}}_{\\text{Position-Token Interaction}}\\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{PE}(i)^\\top W^{Q\\top} W^K \\mathbf{PE}(i) R(k)}_{{\\text{Position-Position Interaction}}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;GQFDRZMCNU&quot;}" data-component-name="LatexBlockToDOM"></div><p>In this equation, we decomposed the contributions of the different content and positional components in the first attention layer of a model using the vanilla absolute positional encoding. We showed how <em><strong>R(k)</strong></em> captured the relative positional information between tokens, and we hope the model learns <em><strong>W<sup>Q</sup></strong></em> and <em><strong>W<sup>K</sup></strong></em> such that it can effectively utilize that information. To help the model better capture the content and position interactions between tokens, the relative positional encoding introduced in Transformer-XL modifies this original equation. The relative positional encoding is now applied within every layer, following this strategy:</p><ul><li><p><em><strong> x<sub>i</sub><sup>T</sup> W<sup>QT</sup> W<sup>K</sup> x<sub>i+k</sub></strong></em> originally captured the pure content-based interaction between tokens. 
<em><strong>x<sub>i</sub></strong></em>, in the original Transformer, represents the token embedding vector, and the related hidden state <em><strong>h<sub>i</sub></strong></em> is the sum of the token embedding and the positional encoding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{h}_{i} = \\mathbf{x}_{i} + \\mathbf{PE}(i)&quot;,&quot;id&quot;:&quot;QPQARHPJBS&quot;}" data-component-name="LatexBlockToDOM"></div><p>We modify this interaction by applying it directly to the hidden states <em><strong>h<sub>i</sub></strong></em> and <em><strong>h<sub>i+k</sub></strong></em>, where <em><strong>k</strong></em> is the distance between the related tokens:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{h}_i^\\top W^{Q\\top} W^K_E \\mathbf{h}_{i+k}&quot;,&quot;id&quot;:&quot;UAFQVFLKYB&quot;}" data-component-name="LatexBlockToDOM"></div><p>We specifically distinguish the projection <em><strong>W<sub>E</sub><sup>K</sup></strong></em> to be content-specific. </p></li><li><p><em><strong>x<sub>i</sub><sup>T</sup> W<sup>QT</sup> W<sup>K</sup> PE(i) R(k)</strong></em>, or equivalently <em><strong>x<sub>i</sub><sup>T</sup> W<sup>QT</sup> W<sup>K</sup> PE(i+k)</strong></em>, captured how a token's preference for attending to positions was tied to fixed locations. However, this is confusing for the model because <em><strong>PE(i+k)</strong></em> depends on the specific position of <em><strong>x<sub>i</sub></strong></em>. 
Instead, in the <em>Transformer-XL</em>, they modified this interaction to account only for the relative distance <em><strong>k</strong></em> from the hidden state <em><strong>h<sub>i</sub></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{h}_i^\\top W^{Q\\top} W^K_R \\mathbf{r}_k&quot;,&quot;id&quot;:&quot;NLLRNWGAPI&quot;}" data-component-name="LatexBlockToDOM"></div><p><em><strong>r<sub>k</sub></strong></em> is a row of the positional encoding matrix <em><strong>R</strong></em>. As for the additive relative positional embeddings, <em><strong>R</strong></em> is a <em><strong>(2L+1) x d<sub>model</sub></strong></em> matrix where <em><strong>r<sub>k</sub></strong></em> only depends on the relative distance between <em><strong>h<sub>i</sub></strong></em> and <em><strong>h<sub>i+k</sub></strong></em>, and <em><strong>L</strong></em> is a constant chosen to clip the maximum distances that can be represented. However, <em><strong>R</strong></em> is not a learned parameter matrix, but a fixed encoding as in the original Transformer architecture. <em><strong>W<sub>R</sub><sup>K</sup></strong></em> (different from <em><strong>W<sub>E</sub><sup>K</sup></strong></em>) is a specialized weight matrix that further enhances this separation of positional processing. </p></li><li><p><em><strong>PE(i)<sup>T</sup> W<sup>QT</sup> W<sup>K</sup> x<sub>i+k</sub></strong></em> is a position-dependent bias for attending to content. This term is problematic because it carries more or less weight depending on the position of the query. 
To make it position independent, we replace <em><strong>PE(i)<sup>T</sup> W<sup>QT</sup></strong></em> with a global learnable parameter <em><strong>u<sup>T</sup></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{u}^\\top W^K_E \\mathbf{h}_{i+k}&quot;,&quot;id&quot;:&quot;QIJGSPEEIQ&quot;}" data-component-name="LatexBlockToDOM"></div><p><em><strong>u</strong></em> is a vector of size <em><strong>d<sub>model</sub></strong></em>. This reflects an insight that the attentive bias toward different content should remain consistent regardless of the query position. In other words, the importance of certain word types doesn't need to depend on position, so a single global parameter can replace position-specific queries. This simplifies the model while maintaining expressiveness.</p></li><li><p><em><strong>PE(i)<sup>T</sup> W<sup>QT</sup> W<sup>K</sup> PE(i+k)</strong></em> is a position-position interaction term that depends on the specific position of the query. To make it position independent while keeping the relative positional information, we introduce another global learnable parameter <em><strong>v<sup>T</sup></strong></em> of size <em><strong>d<sub>model</sub></strong></em> to replace <em><strong>PE(i)<sup>T</sup> W<sup>QT</sup></strong></em>, and again use the relative positional encoding <em><strong>R</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{v}^\\top W^K_R \\mathbf{r}_k&quot;,&quot;id&quot;:&quot;TTTYOEQMAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>As before, we use the position-specific projection <em><strong>W<sub>R</sub><sup>K</sup></strong></em>. The intuition is that certain relative distances might be generally more important than others, regardless of the absolute positions involved. 
For example, adjacent tokens (small <em><strong>k</strong></em>) might generally be more related than distant ones, regardless of their absolute positions in the sequence. </p></li></ul><p>This leads to redefining the alignment scores as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\ne_{i,i+k}\\sqrt{d_\\text{model}} &amp;= \\quad\\underbrace{\\mathbf{h}_i^\\top W^{Q\\top} W^K_E \\mathbf{h}_{i+k}}_{\\text{Token-Token Interaction}} \\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{h}_i^\\top W^{Q\\top} W^K_R \\mathbf{r}_k}_{\\text{Token-Position Interaction}} \\nonumber\\\\\n\n&amp;+ \\quad\\underbrace{\\mathbf{u}^\\top W^K_E \\mathbf{h}_{i+k}}_{\\text{Position-Token Interaction}}\\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{v}^\\top W^K_R \\mathbf{r}_k}_{\\text{Position-Position Interaction}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;TNHBEWUFNA&quot;}" data-component-name="LatexBlockToDOM"></div><p>with <em><strong>W<sup>Q</sup>h<sub>i</sub> = q<sub>i</sub></strong></em>, the query and <em><strong>W<sub>E</sub><sup>K</sup>h<sub>i+k</sub> = k<sub>i+k</sub></strong></em>, the key, we can regroup the terms:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; e_{i,i+k}\\sqrt{d_\\text{model}} = \\left(\\mathbf{q}_i+\\mathbf{u}\\right)^\\top\\mathbf{k}_{i+k}+\\left(\\mathbf{q}_i+\\mathbf{v}\\right)^\\top W_R^K\\mathbf{r}_{k}&quot;,&quot;id&quot;:&quot;EWSUNANZAY&quot;}" data-component-name="LatexBlockToDOM"></div><p>with</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left(\\mathbf{q}_i+\\mathbf{u}\\right)^\\top\\mathbf{k}_{i+k}&quot;,&quot;id&quot;:&quot;ZXTHXTWYNJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>being a pure content interaction term and </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left(\\mathbf{q}_i+\\mathbf{v}\\right)^\\top 
W_R^K\\mathbf{r}_{k}&quot;,&quot;id&quot;:&quot;ADIUGMANNX&quot;}" data-component-name="LatexBlockToDOM"></div><p>representing a content-to-relative distance interaction term. <em><strong>W<sub>E</sub><sup>K</sup>h<sub>i+k</sub> = k<sub>i+k</sub></strong></em> can be thought of as a content contribution to the key, while <em><strong>W<sub>R</sub><sup>K</sup>r<sub>k</sub></strong></em> is the relative positional piece of the key. The global parameters <em><strong>u</strong></em> and <em><strong>v</strong></em> can be seen as providing "default query projections" that are active regardless of the specific query token. This ensures that important content and positional patterns always receive some attention.</p><p><em><strong>R</strong></em> follows the same sinusoidal encoding function as the original Transformer, just applied to relative positions instead of absolute positions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{R}(k, j)\n\n= \n\n\\begin{cases}\n\n\\sin\\left(\\frac{k}{10000^{j/d_\\text{model}}}\\right) &amp; \\text{if $j$ is even} ,\\\\\n\n\\cos\\left(\\frac{k}{10000^{(j-1)/d_\\text{model}}}\\right) &amp; \\text{if $j$ is odd},\n\n\\end{cases}&quot;,&quot;id&quot;:&quot;WCFNZERAIT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>k</strong></em> is the relative distance between positions, ranging from <em><strong>&#8722;L</strong></em> to <em><strong>+L,</strong></em> and <em><strong>j</strong></em> is the dimension index from 0 to <em><strong>d<sub>model</sub></strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zsqx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Zsqx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 424w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 848w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 1272w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zsqx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png" width="1456" height="985" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:985,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:389195,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zsqx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 424w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 848w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 1272w, https://substackcdn.com/image/fetch/$s_!Zsqx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b8b5a2-25a8-4bfd-aa6d-47296feaf8a8_1500x1015.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Computing </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left(\\mathbf{q}_i+\\mathbf{v}\\right)^\\top W_R^K\\mathbf{r}_{k}&quot;,&quot;id&quot;:&quot;TVOSSZWGAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>requires careful thought to limit the complexity of the problem. With <em><strong>N</strong></em> tokens, we could naively extract all <em><strong>N<sup>2</sup></strong></em> related <em><strong>r<sub>k</sub></strong></em> vectors, since there are <em><strong>N<sup>2</sup></strong></em> pairs of tokens. However, that would produce many duplicate vectors, since different pairs of tokens can have the same distance between them. Instead, we can directly use the whole <em><strong>R</strong></em> matrix and pass it through the linear layer <em><strong>W<sub>R</sub><sup>K</sup></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;K_R=W_R^KR&quot;,&quot;id&quot;:&quot;SLABKMYRRF&quot;}" data-component-name="LatexBlockToDOM"></div><p><em><strong>R</strong></em> is a <em><strong>(2L+1) x d<sub>model</sub></strong></em> matrix, and <em><strong>W<sub>R</sub><sup>K</sup></strong></em> is a <em><strong>d<sub>model</sub> x d<sub>model</sub></strong></em> matrix; applying <em><strong>W<sub>R</sub><sup>K</sup></strong></em> to each relative position vector <em><strong>r<sub>k</sub></strong></em> therefore yields the "positional keys" <em><strong>K<sub>R</sub></strong></em>, a <em><strong>(2L+1) x d<sub>model</sub></strong></em> matrix with one row per relative distance. 
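As a concrete illustration, here is a minimal pure-Python sketch of the relative sinusoidal table described above (the function name is ours, not from the paper). It builds the (2L+1) x d_model matrix R, with one row per relative distance k from -L to +L; in a real model, R would then be passed through the learned linear layer W_R^K to produce the positional keys K_R.

```python
import math

def relative_sinusoidal_encoding(L, d_model):
    """Sinusoidal encodings for relative distances k in [-L, +L].

    Row index p corresponds to relative distance k = p - L, so the
    result has shape (2L+1, d_model). Even dimensions use sin, odd
    dimensions use cos, exactly as in the formula above.
    """
    R = []
    for k in range(-L, L + 1):
        row = []
        for j in range(d_model):
            if j % 2 == 0:
                row.append(math.sin(k / 10000 ** (j / d_model)))
            else:
                row.append(math.cos(k / 10000 ** ((j - 1) / d_model)))
        R.append(row)
    return R
```

Note that the row for k = 0 (index L) is the all-(0, 1, 0, 1, ...) pattern, since sin(0) = 0 and cos(0) = 1, which is a quick sanity check for the indexing.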
As a consequence </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left(\\mathbf{q}_i+\\mathbf{v}\\right)^\\top W_R^K\\mathbf{r}_{k}&quot;,&quot;id&quot;:&quot;MVCOXWETSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>is a <em><strong>(2L+1)</strong></em>-dimensional vector for each query, which leads to an <em><strong>N x (2L+1)</strong></em> content-to-position alignment matrix for <em><strong>N</strong></em> queries. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zM5u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zM5u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 424w, https://substackcdn.com/image/fetch/$s_!zM5u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 848w, https://substackcdn.com/image/fetch/$s_!zM5u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 1272w, https://substackcdn.com/image/fetch/$s_!zM5u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zM5u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png" width="510" height="295.98214285714283" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:1456,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:699786,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zM5u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 424w, https://substackcdn.com/image/fetch/$s_!zM5u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 848w, https://substackcdn.com/image/fetch/$s_!zM5u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 1272w, 
https://substackcdn.com/image/fetch/$s_!zM5u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f7cf736-1738-44e8-959c-071f27723bad_4344x2522.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qwtn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qwtn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 424w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 848w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 1272w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qwtn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png" width="1456" height="809" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:809,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:837244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qwtn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 424w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 848w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 1272w, https://substackcdn.com/image/fetch/$s_!qwtn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fbf0b85-7d6d-4fe9-927c-bcf44342f5da_5722x3181.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's make sure we understand the different tensors in play. We have the queries <em><strong>Q = [q<sub>1</sub>, &#8230;, q<sub>N</sub>]</strong></em>, the keys <em><strong>K = [k<sub>1</sub>, &#8230;, k<sub>N</sub>]</strong></em>, and the positional equivalent to the keys <em><strong>K<sub>R</sub> =[W<sub>R</sub><sup>K</sup>r<sub>-L</sub>, &#8230;, W<sub>R</sub><sup>K</sup>r<sub>L</sub>]</strong></em>. The relative positions in <em><strong>K<sub>R</sub></strong></em> are not aligned with the relative positions of <em><strong>K</strong></em> with respect to <em><strong>Q</strong></em>. 
This means we cannot simply add </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(Q + \\mathbf{u})^\\top K \\quad\\text{with}\\quad (Q + \\mathbf{v})^\\top K_R&quot;,&quot;id&quot;:&quot;XKVUWGAPOD&quot;}" data-component-name="LatexBlockToDOM"></div><p>because of the misalignment in the key ordering. We just need to change the indices of the elements in <em><strong>(Q + v)<sup>T</sup> K<sub>R</sub></strong></em>. For query position <em><strong>i</strong></em>, we need:</p><ul><li><p>For key position 0: a relative distance i&#8722;0 = i</p></li><li><p>For key position 1: a relative distance i&#8722;1</p></li><li><p>For key position 2: a relative distance i&#8722;2</p></li><li><p>And so on...</p></li></ul><p>In other words, for the <em><strong>i-</strong></em>th row of the content to position alignment matrix <em><strong>P = (Q + v)<sup>T</sup> K<sub>R</sub></strong></em>, we need to select values in a specific pattern. The solution is to reshape and shift the <em><strong>P</strong></em> matrix so that the correct relative position scores align with the positions we need in the final attention matrix. 
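To make the re-indexing concrete, here is a small pure-Python sketch (the function name is ours) that performs the selection with an explicit gather: row i of the aligned matrix picks, for each key position j, the score stored at column (i - j) + L of P. Transformer-XL obtains the same result with a cheap reshape-and-shift of P rather than an explicit gather.

```python
def align_relative_scores(P, L):
    """Re-index the content-to-position scores P (shape N x (2L+1)).

    Column p of P holds the score of a query against relative
    distance k = p - L. The aligned N x N matrix B satisfies
    B[i][j] = P[i][(i - j) + L], i.e. the score of query i against
    key j, whose relative distance is i - j.
    Assumes N <= L + 1 so every index stays in range.
    """
    N = len(P)
    return [[P[i][(i - j) + L] for j in range(N)] for i in range(N)]
```

This explicit gather costs O(N^2) index operations, which is consistent with the cost of the shift trick discussed next.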
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e-Lh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e-Lh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 424w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 848w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 1272w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e-Lh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png" width="1456" height="956" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:956,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:272820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e-Lh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 424w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 848w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 1272w, https://substackcdn.com/image/fetch/$s_!e-Lh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F575b812f-ce39-4e8e-b048-e7bfeb498a24_1500x985.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!smTf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!smTf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 424w, 
https://substackcdn.com/image/fetch/$s_!smTf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 848w, https://substackcdn.com/image/fetch/$s_!smTf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 1272w, https://substackcdn.com/image/fetch/$s_!smTf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!smTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png" width="1456" height="852" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc805bb8-e797-4c05-bc07-167400163b89_1500x878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:253204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!smTf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 424w, https://substackcdn.com/image/fetch/$s_!smTf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 848w, https://substackcdn.com/image/fetch/$s_!smTf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 1272w, https://substackcdn.com/image/fetch/$s_!smTf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc805bb8-e797-4c05-bc07-167400163b89_1500x878.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Aumn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Aumn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 424w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 848w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 1272w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Aumn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png" width="1456" height="883" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:883,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:418807,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162302291?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Aumn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 424w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 848w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 1272w, https://substackcdn.com/image/fetch/$s_!Aumn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14230fa9-fcb7-4e53-ab1c-431b086f41ae_1500x910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the case of <em>Transformer-XL</em>, remember that the sequences are processed in segments of size <em><strong>n &#8810; N</strong></em>, and the attention mechanism per segment has an <em><strong>O(n<sup>2</sup>)</strong></em> time complexity. <em><strong>W<sub>R</sub><sup>K</sup>R</strong></em> also operates on segments, and its time complexity is </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}\\left((2L+1)nd_\\text{model}\\right)=\\mathcal{O}\\left(Lnd_\\text{model}\\right)&quot;,&quot;id&quot;:&quot;ILJXPDETCM&quot;}" data-component-name="LatexBlockToDOM"></div><p>The shift operation needs at least <em><strong>O(n<sup>2</sup>)</strong></em> operations to reindex the alignment score matrix. 
Therefore, the overall time complexity associated with the relative positional encoding follows the asymptotic behavior <em><strong>O(n<sup>2</sup>)</strong></em>.</p><p><em>Transformer-XL</em> showed improvements in the model's ability to utilize longer contexts. When trained with segments of length 128 but evaluated with various attention lengths, the model showed continued improvements in perplexity up to 640 tokens. The paper reported improved perplexity when increasing the evaluation context length beyond the training length, something absolute positional encoding could not achieve.</p><h2>ALiBi: Attention With Linear Biases</h2><p>The relative positional encoding developed in <em>Transformer-XL</em> is designed to handle very long sequences, but it adds complexity to the attention layer, and, while it can handle longer contexts through its recurrence mechanism, it was not specifically designed for extrapolation to arbitrary lengths beyond training. <a href="https://arxiv.org/pdf/2108.12409">ALiBi</a> (<strong>A</strong>ttention With <strong>Li</strong>near <strong>Bi</strong>ases) was introduced in 2021 as a simpler approach that extrapolates easily to much longer sequences than the ones seen during training. It requires no additional learned parameters and can be expressed as a penalty on the alignment score:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;e_{ij} = \\mathbf{q}_i^\\top \\mathbf{k}_j + m_h\\left(j-i\\right)&quot;,&quot;id&quot;:&quot;DAYWUGTTGR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>j-i</strong></em> is the signed distance between the query <em><strong>q<sub>i</sub></strong></em> and the key <em><strong>k<sub>j</sub></strong></em>, and <em><strong>m<sub>h</sub></strong></em> is a head-specific constant, with <em><strong>h</strong></em> being the index of the head. In the case of causal language modeling, we always have j-i &#8804; 0, leading to lower attention for far-away tokens. 
ALiBi tends to sacrifice true long-range modeling for extrapolation capability. <em><strong>m<sub>h</sub></strong></em> is chosen as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;m_h=2^{-\\frac{8h}{n_\\text{head}}}&quot;,&quot;id&quot;:&quot;XQTFZKJXQM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>h &#8712; [0, &#8230;, n<sub>head</sub> -1]</strong></em>. For example, if we have 8 heads, the first head would have <em><strong>m<sub>0 </sub>= 2<sup>0 </sup>= 1</strong></em>, and the last head <em><strong>m<sub>7</sub> = 2<sup>-7</sup> = 0.0078125</strong></em>, with the intermediate heads spanning a spectrum of slopes in between. For a key 1000 tokens behind the query, so <em><strong>j-i = -1000</strong></em>, even the shallowest slope gives <em><strong>m<sub>7</sub>(j-i) &#8776; -7.8</strong></em>, a substantial penalty that hinders capturing long-range interactions between tokens. This suggests that ALiBi may not be optimal for tasks requiring very long-range dependencies (like book-length coherence), but it represents a valuable engineering trade-off that improves efficiency without sacrificing performance on many practical tasks. </p><p>Despite this limitation, ALiBi works well for extrapolation for several reasons:</p><ul><li><p>Graduated attention ranges: The different head slopes create a spectrum of attention distances. While no head truly specializes in very long-range attention, the collection of heads creates a gradient of focus distances.</p></li><li><p>Local coherence dominance: Language has a hierarchical structure where local coherence (within paragraphs or nearby sentences) often matters more than very distant relationships. The bias aligns with this natural property of language.</p></li><li><p>Information propagation: Information can still flow across long distances through multiple layers of the transformer. 
Even if direct attention across 1000 tokens is penalized, information can propagate through intermediate positions across layers.</p></li><li><p>Relative vs. absolute positioning: Unlike sinusoidal embeddings that break down completely outside their training range, ALiBi's linear bias at least provides a consistent, predictable signal at any distance.</p></li></ul><p>The paper showed that a model trained on 512 tokens could handle sequences of 3072 tokens with better perplexity than a sinusoidal model trained on 3072 tokens. A 1.3 billion parameter model trained on 1024 tokens achieved the same perplexity as a sinusoidal model trained on 2048 tokens when evaluated on 2048-token sequences. As input length increases beyond training length, sinusoidal models' performance degrades almost immediately, while ALiBi's performance continues improving up to ~2-3x training length before plateauing.</p><h2>RoPE: Rotary Position Embedding</h2><p>The <a href="https://arxiv.org/pdf/2104.09864">Rotary Position Embedding</a> (RoPE) is now one of the most common strategies used to inject the relative positional information within the attention mechanism. The idea behind RoPE is to rotate the keys and queries based on the position of the related tokens in the input sequence. This injects the absolute positional information directly into the queries and keys, and because the attention score is a dot product between a rotated query and a rotated key, that score ends up depending only on the relative position between the two tokens. Let's look at a toy example to understand the logic: consider a 2-dimensional query <em><strong>q<sub>i</sub></strong></em> and a 2-dimensional key <em><strong>k<sub>j</sub></strong></em>. To rotate 2-dimensional vectors, we use rotation matrices:</p>
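<p>To make the rotation idea concrete, here is a minimal pure-Python sketch. The toy 2-dimensional vectors and the base angle <code>theta</code> are arbitrary illustrative choices, not values from the RoPE paper; the point is only that shifting both positions by the same amount leaves the score unchanged:</p>

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by the angle pos * theta (a position-dependent rotation)."""
    a = pos * theta
    x, y = vec
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def score(q, k, i, j):
    """Alignment score between a query at position i and a key at position j."""
    qr, kr = rotate(q, i), rotate(k, j)
    return qr[0] * kr[0] + qr[1] * kr[1]

q, k = (1.0, 0.5), (0.3, 2.0)  # toy 2-D query and key

# R(i*theta)q . R(j*theta)k = q . R((j-i)*theta)k: the score depends only on j - i,
# so shifting both positions by the same offset (+100 here) leaves it unchanged.
s1 = score(q, k, 5, 2)
s2 = score(q, k, 105, 102)
assert abs(s1 - s2) < 1e-9
```

<p>The same idea generalizes to higher dimensions by rotating each consecutive pair of coordinates with its own base angle.</p>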
      <p>
          <a href="https://newsletter.theaiedge.io/p/all-about-the-modern-positional-encodings">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Join us for a Free LIVE Coding Event: Build The Self-Attention in PyTorch From Scratch]]></title><description><![CDATA[Next Friday, I am inviting you to join me for an exciting live coding event. It is a completely free event where I will explain the basics of the self-attention layer and implement it from scratch in PyTorch. From the vanilla self-attention to the multi-head attention layer, I will walk you through all the little details.]]></description><link>https://newsletter.theaiedge.io/p/join-us-for-a-free-live-coding-event</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/join-us-for-a-free-live-coding-event</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Fri, 25 Apr 2025 15:00:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yF_u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Next Friday, I am inviting you to join me for <strong><a href="https://maven.com/p/bdd423/build-the-self-attention-in-py-torch-from-scratch">an exciting live coding event</a></strong>. It is a completely free event where I will explain the basics of the self-attention layer and implement it from scratch in PyTorch. 
From the vanilla self-attention to the multi-head attention layer, I will walk you through all the little details.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://maven.com/p/bdd423/build-the-self-attention-in-py-torch-from-scratch" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yF_u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 424w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 848w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yF_u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png" width="459" height="258.1875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:459,&quot;bytes&quot;:1538362,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://maven.com/p/bdd423/build-the-self-attention-in-py-torch-from-scratch&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/162099225?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yF_u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 424w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 848w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 1272w, https://substackcdn.com/image/fetch/$s_!yF_u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F328ab538-6fba-487e-bc25-01621bb57109_3200x1800.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Over the past few years, I have realized that the logic behind self-attention still eludes many people who want to dive deeper into the field. For me, implementing from scratch is the best way to learn this core element of every LLM. Once the self-attention implementation becomes more intuitive, it opens the doors to understanding all the small improvements that have led to the level of maturity we have today in the field of LLMs.   
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/p/bdd423/build-the-self-attention-in-py-torch-from-scratch&quot;,&quot;text&quot;:&quot;Signup&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/p/bdd423/build-the-self-attention-in-py-torch-from-scratch"><span>Signup</span></a></p><p>This is an opportunity to talk face-to-face and ask questions. The event is on May 2nd at 9:30 AM PST. I hope to see you there!    </p>]]></content:encoded></item><item><title><![CDATA[Build Production-Ready LLMs From Scratch]]></title><description><![CDATA[From Prototype to Production: Ship Scalable LLM Systems in 6 Weeks]]></description><link>https://newsletter.theaiedge.io/p/build-production-ready-llms-from</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/build-production-ready-llms-from</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 21 Apr 2025 15:03:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!J9Vr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Big news! I am now partnering with <a href="https://maven.com/">Maven</a> as an instructor to teach the <strong><a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first">Build Production-Ready LLMs From Scratch</a></strong> live course! This is a 6-week program to learn to build scalable LLMs from scratch and ship them to production. It will run between May 24th and June 29, 2025. It includes 12 live sessions, 6 real-world hands-on projects, 64 recorded lectures, and more material. 
<strong>The first 30 people to sign up will get a 20% discount by applying the promo code <a href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first">FIRST</a>!</strong> So make sure to sign up early:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first&quot;,&quot;text&quot;:&quot;Signup&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first"><span>Signup</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png" width="574" height="322.875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:574,&quot;bytes&quot;:3569157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://maven.com/damien-benveniste/train-fine-tune-and-deploy-llms?promoCode=first&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/161774781?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J9Vr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1272w, 
https://substackcdn.com/image/fetch/$s_!J9Vr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dc84db0-31c2-46eb-a37e-754282b2fe22_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>The Real-World LLM Engineering Roadblocks You Face Today</strong></h3><p><strong>&#128075; Transitioning from General ML to LLM Specialization:</strong> You&#8217;ve built recommendation engines or classifier models, but moving into Transformer&#8209;centric development feels like learning a whole new discipline&#8212;no clear roadmap 
exists.</p><p><strong>&#128075; Lack of LLM&#8209;Specific Career Path: </strong>You see &#8220;LLM Engineer&#8221; roles popping up on LinkedIn, but your current CV only shows &#8220;Data Scientist&#8221; or &#8220;ML Engineer.&#8221; You need hands&#8209;on projects and artifacts to credibly make the jump.</p><p><strong>&#128075; Career Stalled by &#8220;Academic&#8221; Skillset:</strong> You can recite Transformer papers, but when asked, &#8220;Have you shipped an LLM feature end&#8209;to&#8209;end?&#8221; you have no answer&#8212;and no portfolio to prove it!</p><p><strong>&#128075; Prototype Meltdown Under Production Load: </strong>You&#8217;ve fine&#8209;tuned a small model locally, but when you switch from 1 to 100 concurrent requests, your GPU memory spikes and inference grinds to a halt, because you&#8217;ve never applied continuous batching, KV caching, or paged&#8209;attention in a live setting.</p><p><strong>&#128075; RAG Integration Headaches: </strong>Turning a standalone model into a live, Retrieval&#8209;Augmented Generation service becomes a multi&#8209;week integration nightmare.</p><h3>How this course will help you</h3><p>Because we&#8217;ve <strong>packaged every stage</strong> of the LLM lifecycle, <strong>from career transition to production rollout</strong>, into a <strong>six&#8209;week bootcamp</strong> that:</p><p>&#9989; <strong>Guides Your Career Pivot: </strong>You&#8217;ll emerge with six polished GitHub projects, a deployment playbook, and RAG demos that transform your resume from &#8220;ML generalist&#8221; to &#8220;LLM Specialist.&#8221;</p><p>&#9989; <strong>Attacks Each Pain&#8209;Point Head&#8209;On: </strong>Six job&#8209;mirroring projects (from scratch &#8594; RLHF &#8594; scaling &#8594; deployment &#8594; RAG) ensure you never waste time on dead&#8209;end tutorials.</p><p>&#9989; <strong>Live Code&#8209;Along Workshops &amp; Office Hours: </strong>Tackle your own fine&#8209;tuning
bugs, scaling hiccups, and deployment errors alongside Damien in dedicated sessions, so you get hands&#8209;on fixes for the exact issues you&#8217;ll face on the job.</p><p>&#9989; <strong>Ready&#8209;to&#8209;Use Repos &amp; Playbooks: </strong>Grab our curated starter code, development scripts, deployment templates, and debugging checklists, so you can plug them straight into your next project without reinventing the wheel.</p><p>&#9989; <strong>A Portfolio of Six Production&#8209;Grade Projects: </strong>Leave with six end&#8209;to&#8209;end deliverables, from a Transformer built from scratch to a live RAG API, ready to showcase on GitHub, in performance reviews, or to hiring managers.</p><p>No more scattered blog-hopping or generic bootcamps, this is <strong>the only</strong> cohort where you&#8217;ll <strong>master</strong> Transformer internals <em>and</em> <strong>ship</strong> production&#8209;grade LLM systems while making the career leap you&#8217;ve been aiming for.</p><h3>What You&#8217;ll Actually Build and Ship</h3><p>Across six hands&#8209;on projects, you&#8217;ll deliver deployable LLM components and applications, no fluff, just job&#8209;ready code:</p><p>&#9989; <strong>A Modern Transformer Architecture from scratch: </strong>Implement a sliding&#8209;window multihead attention to slash O(N&#178;) to O(N&#183;w), RoPE for relative positional encoding, and the Mixture-of-Expert architecture for improved performance, all in PyTorch.</p><p>&#9989; <strong>Instruction&#8209;Tuned LLM: </strong>Fine&#8209;tune a model with supervised learning, RLHF, DPO, and ORPO for instruction following on a real benchmark and compare performance gains.</p><p>&#9989; <strong>Scalable Training Pipeline: </strong>Containerize a multi&#8209;GPU job with DeepSpeed ZeRO on SageMaker to maximize throughput and minimize cost.</p><p>&#9989; <strong>Extended&#8209;Context Model: </strong>Modify RoPE scaling, apply 4/8&#8209;bit quantization, and inject LoRA adapters to 
double your context window.</p><p>&#9989; <strong>Multi&#8209;Mode Deployment: </strong>Stand up a Hugging Face endpoint, a vLLM streaming API, and an OpenAI&#8209;compatible server, all Dockerized and optimized for low latency.</p><p>&#9989; <strong>End&#8209;to&#8209;End RAG Chat App: </strong>Build a FastAPI backend with conversational memory and a Streamlit UI for live Retrieval&#8209;Augmented Generation.</p><p>By the end of Week 6, you won&#8217;t just know these techniques, you&#8217;ll have shipped six production&#8209;grade artifacts, each reflecting the exact pipelines, optimizations, and deployment routines you&#8217;ll use on the job.</p><h3>Live &amp; Recorded Content: Reinforce, Deepen, Accelerate</h3><p>&#10024; <strong>12 Interactive Live Workshops (3 hrs each): </strong>Each session follows the Concept &#8594; Code flow. I&#8217;ll introduce the day&#8217;s core topic (e.g. self-attention, LoRA, vLLM optimizations, ...), and we&#8217;ll implement the features step&#8209;by&#8209;step in code so you see exactly how theory maps to code. Bring your questions!</p><p>&#10024; <strong>10+ Hours of On&#8209;Demand Deep&#8209;Dive Lectures: </strong>Short videos (10&#8211;20 min) on Transformer internals, fine-tuning tricks, deployment optimizations. Watch before each project to hit the ground running. Step through every line of code at your own pace; perfect for review or catching up if you miss a live session. Downloadable slide decks, annotated notebooks, and cheat sheets you&#8217;ll reference long after graduation.</p><p><strong>Why This Matters:</strong> Live workshops turn recorded concepts into <strong>actionable skills</strong>. You&#8217;ll see how theory maps directly onto code, get instant feedback, and internalize best practices. 
Then, recorded lectures become your <strong>asynchronous safety net</strong>, letting you revisit tricky topics, prepare for upcoming labs, and solidify your understanding on demand.</p><p>Let me know if you have any questions. I hope to see you there!</p><p></p>]]></content:encoded></item><item><title><![CDATA[Chapter 4 of The Big Book of Large Language Models is Here!]]></title><description><![CDATA[Chapter 4 of the Big Book of Large Language Models is finally here!]]></description><link>https://newsletter.theaiedge.io/p/chapter-4-of-the-big-book-of-large</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/chapter-4-of-the-big-book-of-large</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Mon, 31 Mar 2025 15:02:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZbCz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><a href="https://drive.google.com/file/d/1znEc3ClvFR99GI6LMOBQGJJk0wYmk-4p/view?usp=drive_link">Chapter 4</a></strong> of the <strong><a href="https://book.theaiedge.io/">Big Book of Large Language Models</a></strong> is finally here!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://drive.google.com/file/d/1znEc3ClvFR99GI6LMOBQGJJk0wYmk-4p/view?usp=drive_link" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZbCz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZbCz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ZbCz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbCz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZbCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2889823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://drive.google.com/file/d/1znEc3ClvFR99GI6LMOBQGJJk0wYmk-4p/view?usp=drive_link&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/160238039?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ZbCz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ZbCz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ZbCz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ZbCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e8c5aa5-2370-4baa-a694-1a9707efc639_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That was a difficult chapter to write! Originally, I wanted to cram in that chapter all the improvements related to the Transformer architecture, since the <a href="https://arxiv.org/pdf/1706.03762">Attention is all you need paper</a>, but I realized that it would be too long for one chapter. I ended up focusing only on improvements related to the attention layer and delaying things like relative positional encoding and Mixture of Experts to the next chapter. In this chapter, I addressed the following improvements:</p><ul><li><p><em><strong>Sparse Attention Mechanisms</strong></em></p><ul><li><p><em><strong>The First Sparse Attention: Sparse Transformers</strong></em></p></li><li><p><em><strong>Choosing Sparsity Efficiently: Reformer</strong></em></p></li><li><p><em><strong>Local vs Global Attention: Longformer and BigBird</strong></em></p></li></ul></li><li><p><em><strong>Linear Attention Mechanisms</strong></em></p><ul><li><p><em><strong>Low-Rank Projection of Attention Matrices: Linformer</strong></em></p></li><li><p><em><strong>Recurrent Attention Equivalence: The Linear Transformer</strong></em></p></li><li><p><em><strong>Kernel Approximation: Performers</strong></em></p></li></ul></li><li><p><em><strong>Memory Efficient Attention</strong></em></p><ul><li><p><em><strong>Self-attention Does Not Need O(N 2) Memory</strong></em></p></li><li><p><em><strong>The FlashAttention</strong></em></p></li></ul></li><li><p><em><strong>Faster Decoding Attention Mechanisms</strong></em></p><ul><li><p><em><strong>Multi&#8209;Query Attention</strong></em></p></li><li><p><em><strong>Grouped&#8209;Query 
Attention</strong></em></p></li><li><p><em><strong>Multi-Head Latent Attention</strong></em></p></li></ul></li><li><p><em><strong>Long Sequence Attentions</strong></em></p><ul><li><p><em><strong>Transformer-XL</strong></em></p></li><li><p><em><strong>Memorizing Transformers</strong></em></p></li><li><p><em><strong>Infini-Attention</strong></em></p></li></ul></li></ul><p>Obviously, I could not include everything that was ever invented in the context of the attention layer, but I believe those use cases capture well the different research routes that have been explored since then. I believe it is a very important chapter, as most materials available online tend to focus on the vanilla self-attention, which is becoming an outdated concept by today&#8217;s standards. I also found that trying to understand how to improve the self-attention is a very good way to understand what it is we are trying to improve in the first place! The self-attention may appear odd at first, but diving into the inner workings of the layer in order to improve it gives us a level of understanding that is beyond anything we can learn just by looking at the original self-attention. I hope you will enjoy it!
</p>]]></content:encoded></item><item><title><![CDATA[Reduce AI Model Operational Costs With Quantization Techniques]]></title><description><![CDATA[A deep dive into quantization and precision levels]]></description><link>https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/reduce-ai-model-operational-costs</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 26 Mar 2025 15:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Model quantization is becoming a core strategy for training and deployment! I am excited to introduce you to <a href="https://www.linkedin.com/in/whats-ai/">Louis-Fran&#231;ois Bouchard</a>! He is an exceptional AI educator and entrepreneur, and in this guest post, he presents the fundamentals of model quantization and a detailed tutorial on how to quantize a Llama 3 model. </strong></em></p><div><hr></div><p><em>Louis-Fran&#231;ois Bouchard, a dedicated AI educator and entrepreneur since 2019, left his PhD studies after recognizing a disconnect between academic research and industry needs. As the founder of <strong><a href="https://academy.towardsai.net/">Towards AI</a></strong>, he is committed to making artificial intelligence accessible and bridging that gap through practical teaching on the Towards AI Academy platform. 
With a wealth of free resources online&#8212;videos, blogs, and <strong><a href="https://newsletter.towardsai.net/">newsletters</a></strong>&#8212;his academy empowers a diverse global community of developers and enthusiasts to innovate and thrive with new relevant AI technologies like LLMs and everything around them.</em></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oN9X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oN9X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 424w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 848w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 1272w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oN9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png" width="192" height="189.69934640522877" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:907,&quot;width&quot;:918,&quot;resizeWidth&quot;:192,&quot;bytes&quot;:1349487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159884460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oN9X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 424w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 848w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 1272w, https://substackcdn.com/image/fetch/$s_!oN9X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6588e37-a0c7-4d8c-855b-5096e1f1b000_918x907.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p>Large AI models are changing industries worldwide, yet their enormous size makes them challenging to deploy efficiently. 
With billions of parameters, they demand powerful GPUs, abundant VRAM, and extensive compute resources, leading to high memory usage and steep operational costs.</p><p>Model quantization has emerged as a powerful technique to address these issues. By reducing the precision of a model&#8217;s weights, quantization dramatically cuts memory footprints. A quantized model can often run significantly faster and use a fraction of the memory of its full-precision equivalent, lowering inference latencies and hardware costs with minimal impact on accuracy.</p><p>In this article, we&#8217;ll explore the fundamentals of model quantization, examining its underlying principles, various precision levels, and its practical implementation through a detailed code example with the Hugging Face bitsandbytes library. We will guide you step-by-step on how to load a full-precision <a href="https://ai.meta.com/blog/meta-llama-3/">Meta Llama 3</a> model, convert it into a 4-bit quantized version, and compare their memory usage, inference speed, and output quality. Additionally, we will explore the trade-offs and best practices needed to optimize model performance while achieving significant memory savings and faster inference.</p><h2>What is Model Quantization?</h2><p><a href="https://arxiv.org/pdf/2103.13630.pdf">Quantization</a> is a method for shrinking neural network models, including Transformers, by reducing the precision of their parameters (weights, biases) and activations. Lower precision reduces a model&#8217;s memory footprint and computational requirements, enabling deployment on resource-constrained devices like mobile phones, smartwatches, and embedded systems.</p><p>A model like Meta Llama3 8B, which contains 8 billion parameters, stores these parameters in model weight files loaded onto GPUs for inference. These weights are essentially matrices stored in different numerical precisions. 
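As a quick illustration of how the storage precision of those matrices translates into bytes, here is a small NumPy sketch (an added example, not from the original post) showing the memory consumed by a toy weight matrix at different precisions:

```python
import numpy as np

# A toy 1024x1024 "weight matrix"; .nbytes reports its storage cost.
w = np.random.normal(size=(1024, 1024)).astype(np.float32)

print(w.nbytes)                     # 4,194,304 bytes: 4 bytes per weight at FP32
print(w.astype(np.float16).nbytes)  # half of that at FP16
print(w.astype(np.int8).nbytes)     # a quarter (a real INT8 scheme also stores scales)
```

The same per-parameter arithmetic scales directly to billions of parameters.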
By quantizing these weights (reducing precision), you decrease the GPU compute and memory requirements. However, overly aggressive quantization can sometimes reduce inference accuracy.</p><p>Many open-source LLMs accessed through cloud APIs or downloaded locally are already quantized. Providers typically convert models from higher precision (FP32 or FP16) to lower precision formats (INT8 or 4-bit) to optimize performance. Properly executed quantization can significantly reduce hosting and deployment costs while preserving most of the original accuracy.</p><p>Think of this: when someone asks you the time, saying &#8220;about 11 p.m.&#8221; is faster but less precise than &#8220;10:58 p.m.&#8221; This is how quantization works. It accelerates processing at the expense of slight accuracy losses. The exact trade-off depends on the numeric format chosen (FP16, BFLOAT16, INT8, etc.).</p><p>Floating-point precision determines how accurately data is stored and processed in machine learning. Higher precision (e.g., Float32) offers better accuracy but requires more memory, whereas lower precision types (Float16, BFloat16) reduce memory usage at the cost of some precision. 
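To make the precision loss concrete, here is a short NumPy sketch (an added illustration, not from the original post) that rounds the same value to FP32 and FP16 and compares the error:

```python
import numpy as np

value = 1.0 / 3.0  # a value with no exact binary representation

# Narrower floats keep fewer mantissa bits, so the rounding
# error grows as the precision shrinks.
err32 = abs(float(np.float32(value)) - value)
err16 = abs(float(np.float16(value)) - value)

print(err32, err16)  # the FP16 error is a few thousand times larger
```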
The figure below illustrates how different floating-point formats allocate bits to sign, range, and precision:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dHE9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dHE9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 424w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 848w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 1272w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dHE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png" width="1200" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dHE9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 424w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 848w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 1272w, https://substackcdn.com/image/fetch/$s_!dHE9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a91a9c-ad5b-4d61-ad16-bc83b05a57ea_1200x750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For instance, a <a href="https://arxiv.org/abs/2307.09288">Meta Llama2 70B</a> model using FP16 precision consumes roughly 130 GB:</p><blockquote><p>(70,000,000,000 &#215; 2 bytes) / 1024&#179; &#8776; 130.385 GB</p></blockquote><p>Further quantization to 8-bit or 4-bit reduces memory and storage even more. Different inference providers (e.g., <a href="http://Together.ai">Together.ai</a> or Groq) use varying quantization schemes, affecting performance across identical models.</p><p>At its core, quantization is simple; it&#8217;s okay if you didn&#8217;t fully grasp the equation above. You just need to remember that quantization trades a little precision for efficiency. For LLMs running at scale, this trade-off doesn&#8217;t exist in isolation: you&#8217;ll need to weigh it alongside factors like inference speed, memory efficiency, and deployment constraints. 
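The back-of-the-envelope formula above extends to the other precision levels. A small sketch (weights only; real deployments also spend memory on activations and the KV cache):

```python
# Weight-memory footprint of a 70B-parameter model at common precisions,
# using the (parameters x bytes per parameter) / 1024^3 formula above.
PARAMS = 70_000_000_000

for fmt, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gigabytes = PARAMS * bytes_per_param / 1024**3
    # The FP16 line reproduces the ~130.4 GB figure computed above.
    print(f"{fmt}: {gigabytes:8.1f} GB")
```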
If you want to learn more about how these techniques work in a practical environment, you might find our <a href="https://tinyurl.com/quantizationforLLMs">From Beginner to Advanced LLM Developer</a> course quite useful. We discuss this trade-off in detail in the next section.</p><h3><strong>Precision Levels and Memory Savings</strong></h3><p>Quantization can use various numerical precisions. Each precision level has a different trade-off in terms of memory, computational speed, and model accuracy. The common ones for AI models are FP32, FP16 (or BF16), INT8, INT4, and, more recently, a special 4-bit float (NF4). The table below summarizes these precision levels, their memory costs relative to 32-bit precision (FP32), and their key characteristics:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!253E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!253E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 424w, https://substackcdn.com/image/fetch/$s_!253E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 848w, https://substackcdn.com/image/fetch/$s_!253E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 1272w, 
https://substackcdn.com/image/fetch/$s_!253E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!253E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png" width="1396" height="1628" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f629617-5998-418f-b375-78211469bcd6_1396x1628.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1628,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:389598,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159884460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!253E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 424w, https://substackcdn.com/image/fetch/$s_!253E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 848w, 
https://substackcdn.com/image/fetch/$s_!253E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 1272w, https://substackcdn.com/image/fetch/$s_!253E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f629617-5998-418f-b375-78211469bcd6_1396x1628.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For example, FP16 cuts memory usage in half with minimal performance impact, often speeding up inference on GPUs that support 
half-precision. Similarly, storing parameters in 8-bit integers (INT8) rather than 32-bit floating-point (FP32) shrinks the model by roughly four times and significantly decreases the computational load; when calibrated to minimize accuracy loss, this makes INT8 attractive for deployment.</p><p>While INT4 offers even greater compression, potentially up to eight times smaller, the effective reduction is closer to six times due to overhead and the retention of some values in higher precision to maintain accuracy. NF4 addresses this challenge by preserving more information than INT4, proving especially useful for fine-tuning LLMs using methods like QLoRA.</p><p>Using fewer bits not only reduces memory usage but also enhances computational speed by enabling faster data transfers and the use of specialized low-precision instructions.</p><h2>Types of Quantization</h2><p>Not all quantization approaches are the same; there are multiple strategies to quantize a model, each with its procedure and use case. The two broad categories are Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT).</p><p><strong>Post-Training Quantization (PTQ):</strong> PTQ is applied after a model is trained. The weights of a pre-trained model are converted to a lower precision in a single calibration step without further training. This approach is fast, does not require the full training dataset, and uses a small calibration dataset (a few hundred samples) to estimate value ranges for quantization. The process quantizes the weights (and optionally activations), dramatically reducing the model size in minutes. The main drawback is a potential small accuracy drop if the quantization error isn&#8217;t fully addressed. PTQ is an excellent choice when quick optimization is needed or retraining resources are limited. Modern PTQ methods, such as GPTQ and AWQ for LLMs, can reduce weights to 4 bits with minimal accuracy loss. 
Key variants include:</p><ul><li><p><strong>Weight-only Quantization:</strong></p><p>This strategy compresses the model&#8217;s weights, often the largest memory component, while keeping activations (inputs/outputs of each layer) in higher precision (FP16/FP32) to avoid additional errors. This approach reduces model memory size and can partially speed up inference by using lower precision for weight matrices. Many recent LLM quantization methods focus on weight-only quantization to preserve accuracy, offering substantial memory savings given that LLM weights can be hundreds of GB at FP32.</p></li><li><p><strong>Full Quantization (Weights + Activations):</strong></p><p>Full quantization compresses both weights and activations, often using int8 or int16 formats for additional speedups. However, quantizing activations can be challenging for LLMs due to outlier values: channels with very large activation magnitudes that, when quantized, may introduce significant errors. Naively quantizing these outliers can either hurt accuracy through clipping or lead to underflow. A common workaround is mixed precision, which retains higher precision for outlier activations while quantizing the rest. For instance, LLM.int8 detects outlier features in each layer and processes them in 16-bit while handling most multiplications in 8-bit.</p></li></ul><p><strong>Quantization-Aware Training (QAT):</strong> QAT involves quantizing model weights during training or fine-tuning. Weights are rounded to lower precision (like 8-bit) for calculations but stored and updated in higher precision (32-bit), a process called &#8220;fake quantization.&#8221; This allows the model to adapt its parameters to compensate for rounding errors, resulting in higher accuracy at a given bit-width. However, QAT requires additional training time and data, making it impractical for very large models. 
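The round-trip at the heart of fake quantization can be sketched in a few lines of NumPy (a simplified illustration of the idea, not a full QAT training loop):

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    # Snap weights onto a symmetric integer grid, then immediately
    # de-quantize back to float. During QAT, the forward pass uses these
    # rounded values while the optimizer updates the full-precision weights.
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    scale = np.abs(w).max() / qmax  # assumes w is not all zeros
    return np.round(w / scale) * scale

w = np.random.normal(scale=0.02, size=(64, 64)).astype(np.float32)
w_q = fake_quantize(w)

# The rounding error is bounded by half a quantization step.
print(np.abs(w - w_q).max())
```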
It is often used on smaller models or when maximum accuracy is essential, while smart PTQ methods are generally preferred for large language model deployments where retraining is not feasible.</p><p>Now that we&#8217;ve covered the &#8220;what&#8221; and &#8220;why&#8221; of quantization, let&#8217;s dive into the &#8220;how.&#8221; In the next section, we look at the different techniques to perform quantization.</p><h2>Quantization Techniques with Code Examples</h2><p>There are different techniques to perform quantization, from straightforward uniform quantization of each weight (scalar quantization) to more complex methods tailored for LLMs. In this section, we&#8217;ll explore a few key methods using code:</p><h3>Scalar Quantization</h3><p>Scalar quantization treats each dataset dimension independently. First, it calculates the minimum and maximum values for each dimension and then segments the range into uniform intervals (bins). Each value is assigned to a bin, effectively quantizing the data.</p><p>For example, let&#8217;s execute scalar quantization on a dataset with 2000 vectors (each 256-dimensional) generated from a Gaussian distribution:</p><pre><code><strong>import</strong> numpy <strong>as</strong> np

dataset = np.random.normal(size=(2000, 256))

<em><strong># Calculate and store minimum and maximum across each dimension</strong></em>
ranges = np.vstack((np.min(dataset, axis=0), np.max(dataset, axis=0)))</code></pre><p>Next, we determine the start and step for each dimension. Here, we use 8-bit unsigned integers (<code>uint8</code>), which provide 256 bins:</p><pre><code>starts = ranges[0,:]
steps = (ranges[1,:] - ranges[0,:]) / 255</code></pre><p>The quantized dataset is calculated as follows:</p><pre><code>scalar_quantized_dataset = np.uint8((dataset - starts) / steps)</code></pre><p>The scalar quantization process can be encapsulated in a function as below:</p><pre><code><strong>def</strong> scalar_quantisation(dataset):
    <em><strong># Calculate and store minimum and maximum across each dimension</strong></em>
    ranges = np.vstack((
        np.min(dataset, axis=0), 
        np.max(dataset, axis=0)
    ))
    starts = ranges[0,:]
    steps = (ranges[1,:] - starts) / 255
    return np.uint8((dataset - starts) / steps)</code></pre><h3>Product Quantization</h3><p>While scalar quantization treats each dimension independently, it may not account for the data distribution, potentially causing significant information loss. Consider the following vectors:</p><pre><code>array = [
&#9;[8.2, 10.3, 290.1, 278.1, 310.3, 299.9, 308.7, 289.7, 300.1],
&#9;[0.1, 7.3, 8.9, 9.7, 6.9, 9.55, 8.1, 8.5, 8.99]
]</code></pre><p>Applying scalar quantization (here with a single range shared across all values) to convert these vectors to 4-bit integers leads to a considerable loss of information:</p><pre><code>quantized_array = [
&#9;[0, 0, 14, 13, 15, 14, 14, 14, 14],
&#9;[0, 0, 0, 0, 0, 0, 0, 0, 0]
]</code></pre><p>Product quantization enhances this approach by splitting the original vector into sub-vectors and quantizing each of these sub-vectors separately. With product quantization, you can:</p><ol><li><p>Split each vector in the dataset into m separate sub-vectors.</p></li><li><p>Group the data in each sub-vector into k centroids, utilizing techniques such as k-means clustering.</p></li><li><p>Substitute each sub-vector with the index of the closest centroid from the relevant codebook.</p></li></ol><p>For example, with <em>m</em> = 3 sub-vectors and <em>k</em> = 2 centroids:</p><pre><code><strong>from</strong> sklearn.cluster <strong>import</strong> KMeans
<strong>import</strong> numpy <strong>as</strong> np

<em><strong># Given array</strong></em>
array = np.array([
    [8.2, 10.3, 290.1, 278.1, 310.3, 299.9, 308.7, 289.7, 300.1],
    [0.1, 7.3, 8.9, 9.7, 6.9, 9.55, 8.1, 8.5, 8.99]
])

<em><strong># Number of subvectors and centroids</strong></em>
m, k = 3, 2

<em><strong># Divide each vector into m disjoint sub-vectors</strong></em>
subvectors = array.reshape(-1, m)

<em><strong># Run one k-means over all sub-vectors (a simplification: true product quantization fits a separate codebook per sub-vector position)</strong></em>
kmeans = KMeans(n_clusters=k, random_state=0).fit(subvectors)

<em><strong># Replace each sub-vector with the index of the nearest centroid</strong></em>
labels = kmeans.labels_

<em><strong># Reshape labels to match the shape of the original array</strong></em>
quantized_array = labels.reshape(array.shape[0], -1)

<em><strong># Output the quantized array</strong></em>
quantized_array

<em><strong># Result
&gt; array([[0, 1, 1],
       [0, 0, 0]], dtype=int32)</strong></em></code></pre><p>By storing only the centroid indices, product quantization reduces memory usage and can speed up nearest-neighbor searches. It balances memory footprint and accuracy, depending on the number of centroids and sub-vectors used.</p><h3>LLM-Specific Quantization Methods (GPTQ, AWQ, LLM.int8)</h3><p>More advanced quantization techniques have been developed to address the challenges of maintaining accuracy in LLMs while effectively reducing their size. Let&#8217;s look at a few notable ones:</p><ol><li><p><a href="https://arxiv.org/abs/2208.07339">LLM.int8()</a>: This technique observes that activation outliers (values with unusually large magnitudes) disrupt the quantization of larger models. The proposed solution is to retain these outliers in higher precision, ensuring the model&#8217;s performance is not adversely affected.</p></li><li><p><a href="https://arxiv.org/abs/2210.17323">GPTQ</a>: GPTQ (post-training quantization for GPT models) quantizes each layer individually, minimizing the mean squared error (MSE) between quantized and full-precision weights. It uses a mixed int4-fp16 scheme, quantizing weights as int4 while keeping activations in float16, with real-time de-quantization during inference (<a href="https://arxiv.org/abs/2210.17323">GPTQ paper</a>).</p></li><li><p><a href="https://arxiv.org/abs/2306.00978">AWQ</a>: AWQ identifies a small percentage (0.1%-1%) of critical weights based on activation magnitude and avoids quantizing them, preserving vital information in FP16 format. 
This technique balances efficiency with performance, though it introduces mixed-precision data types that may require additional scaling to ensure uniformity (<a href="https://arxiv.org/abs/2306.00978">AWQ paper</a>).</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6hS6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6hS6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 424w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 848w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 1272w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6hS6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png" width="1456" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6hS6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 424w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 848w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 1272w, https://substackcdn.com/image/fetch/$s_!6hS6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae80b249-a6d1-4e45-b5bb-afb15a527ee1_1676x410.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image from the &#8220;AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration&#8221; paper</figcaption></figure></div><h3>QLoRA (Quantization + Low-Rank Adaptation)</h3><p>QLoRA uses quantization for efficient fine-tuning of LLMs. 
By quantizing a pre-trained model to 4-bit and then training small Low-Rank Adaptation (LoRA) matrices on top, QLoRA makes fine-tuning accessible even for large models (up to 65B parameters) on a single GPU. This approach employs the 4-bit NormalFloat (NF4) data type, optimized for weights following a normal distribution. Quantile quantization ensures each bin contains an equal number of values, minimizing quantization error. With standardized weights via &#963; scaling, QLoRA matches full 16-bit fine-tuning performance on NLP tasks (<a href="https://arxiv.org/pdf/2305.14314.pdf">QLoRA paper</a>) and has been highlighted in industry blogs for its efficiency (<a href="https://huggingface.co/blog/4bit-transformers-bitsandbytes#:~:text=,new%20data%20type%20that%20is">Hugging Face blog</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XNcf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XNcf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 424w, https://substackcdn.com/image/fetch/$s_!XNcf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 848w, https://substackcdn.com/image/fetch/$s_!XNcf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XNcf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XNcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png" width="1456" height="408" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XNcf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 424w, https://substackcdn.com/image/fetch/$s_!XNcf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 848w, https://substackcdn.com/image/fetch/$s_!XNcf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XNcf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0b41a2-1861-43af-b240-b8e1bdc04ecc_2122x594.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">From &#8220;QLoRA: Efficient Fine-tuning of Quantized LLMs&#8221; paper</figcaption></figure></div><p>We&#8217;ve covered the theory behind model quantization; now, let&#8217;s apply it. 
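</p>

As a quick illustration of the quantile-quantization idea behind NF4, here is a small sketch (a hypothetical simplification, not the bitsandbytes implementation): we choose 16 levels at equal-probability quantiles of the weight distribution, so each level represents roughly the same number of weights, and then round every weight to its nearest level.

```python
import numpy as np

# Hypothetical sketch of quantile quantization (the idea behind NF4),
# not the bitsandbytes implementation.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=10_000)  # weights roughly ~ N(0, 1)

num_levels = 16  # 4 bits -> 2^4 representable values
# Place levels at the midpoints of equal-probability quantiles,
# so each level covers about the same number of weights.
quantiles = (np.arange(num_levels) + 0.5) / num_levels
levels = np.quantile(weights, quantiles)

# Quantize: map every weight to the index of its nearest level (a 4-bit code).
codes = np.abs(weights[:, None] - levels[None, :]).argmin(axis=1)
dequantized = levels[codes]

print("codes fit in 4 bits:", codes.max() <= 15)
print(f"mean absolute error: {np.abs(weights - dequantized).mean():.3f}")
```

Because the levels track the distribution's quantiles, dense regions near zero get finer resolution than the sparse tails, which is why this scheme suits normally distributed weights.

<p>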
Let&#8217;s look at an example of how we can apply these quantization techniques with Hugging Face&#8217;s bitsandbytes to optimize our model for scalability and cost-efficiency without sacrificing performance.</p><h2>Practical Implementation</h2><div class="pullquote"><p>&#128161; You can access the complete Colab notebook for this article <a href="https://colab.research.google.com/drive/1Um-4iUJo1OBBQijtIna8Eu9GGfTY4GAr?usp=sharing">here</a>.</p></div><p>In practical settings, you don&#8217;t have to quantize models from scratch. Fortunately, several libraries and tools make quantizing models much easier. In our example, we use the <a href="https://huggingface.co/docs/bitsandbytes/main/en/index">bitsandbytes</a> quantization library, which is based on LLM.int8(). This library reduces the precision of model weights by converting them from formats like FP16 or FP32 into lower-bit representations, typically 8-bit or 4-bit, thereby saving memory and speeding up computations without a substantial loss in performance. In this tutorial, we load a full-precision (FP16) Llama 3 model, convert it into a 4-bit quantized version, and compare their memory usage, generation quality, and speed.</p><p><strong>Step 1: Setting Up the Environment</strong></p><p>First, we set up the necessary imports and configure the cache directory. We import PyTorch for the deep learning framework, transformers for the model libraries, and utilities for memory management and timing. You also need to set up an access token to access private models on <a href="https://huggingface.co/models">HuggingFace</a>. You can create an access token <a href="https://huggingface.co/settings/tokens">here</a>. You also need to ensure you have access to the <a href="https://huggingface.co/meta-llama/Meta-Llama-3-8B">meta-llama/Meta-Llama-3-8B</a> model on HuggingFace since it&#8217;s a gated model.</p><pre><code><strong>import</strong> torch
<strong>from</strong> transformers <strong>import</strong> (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    BitsAndBytesConfig
)
<strong>import</strong> time
<strong>import</strong> gc
<strong>import</strong> os

<em><strong># Set up your Hugging Face access token if you're using a private model</strong></em>
os.environ[<strong>"HUGGINGFACE_HUB_TOKEN"</strong>] = <strong>"your_huggingface_token_here"</strong>

<em><strong># Configure a custom cache directory to avoid 
# re-downloading large files.</strong></em>
CACHE_DIR = <strong>"/cache_dir/path"</strong>
os.environ[<strong>"TRANSFORMERS_CACHE"</strong>] = CACHE_DIR
os.environ[<strong>"HF_HOME"</strong>] = CACHE_DIR</code></pre><p>We specify a cache directory to store downloaded models, which helps avoid repeatedly downloading large files when working with these models.</p><p><strong>Step 2: Loading the Full Precision (FP16) Model</strong></p><p>In this step, we define a function to load the full-precision model in FP16 format using the <code>AutoModelForCausalLM</code> class from the transformers library with the torch_dtype set to float16. This class automatically selects the appropriate model architecture based on the model identifier provided. We also load the corresponding tokenizer with <code>AutoTokenizer</code>.</p><pre><code><strong>def</strong> load_fp16_model():
    <strong>print</strong>(<strong>"\n=== Loading Full Precision (FP16) Model ==="</strong>)
    model = AutoModelForCausalLM.from_pretrained(
        <strong>"meta-llama/Meta-Llama-3-8B"</strong>,
        torch_dtype=torch.float16,
        device_map={<strong>""</strong>: 0}
    )

    tokenizer = AutoTokenizer.from_pretrained(
        <strong>"meta-llama/Meta-Llama-3-8B"</strong>
    )

    <em><strong># Calculate memory usage</strong></em>
    memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
    <strong>print</strong>(<strong>f"Memory usage (FP16): {memory_gb:.2f} GB"</strong>)

    <strong>return</strong> model, tokenizer, memory_gb</code></pre><p>We place the model on the first GPU using the device_map parameter and load the corresponding tokenizer, which handles text-to-token conversions. Finally, we record the peak GPU memory allocated to understand how much VRAM the model consumes.</p><p><strong>Step 3: Loading the 4-bit Quantized Model</strong></p><p>Next, we define a function to load the 4-bit quantized version of the same model. Before loading, we clear the GPU memory to ensure accurate measurement. We configure quantization using <code>BitsAndBytesConfig</code>.</p><pre><code><strong>def</strong> load_4bit_model():
    <strong>print</strong>(<strong>"\n=== Loading 4-bit Quantized Model ==="</strong>)
    <em><strong># Clear memory first</strong></em>
    gc.collect()
    torch.cuda.empty_cache()

    <em><strong># Configure 4-bit quantization</strong></em>
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=<strong>True</strong>,
        bnb_4bit_quant_type=<strong>"nf4"</strong>,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=<strong>True</strong>
    )

    model = AutoModelForCausalLM.from_pretrained(
        <strong>"meta-llama/Meta-Llama-3-8B"</strong>,
        quantization_config=quantization_config,
        device_map={<strong>""</strong>: 0}
    )

    tokenizer = AutoTokenizer.from_pretrained(
        <strong>"meta-llama/Meta-Llama-3-8B"</strong>
    )

    <em><strong># Calculate memory usage</strong></em>
    memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
    <strong>print</strong>(f<strong>"Memory usage (4-bit): {memory_gb:.2f} GB"</strong>)

    <strong>return</strong> model, tokenizer, memory_gb</code></pre><p>We configure the 4-bit quantization using the BitsAndBytesConfig class, specifying that we want to load the model in 4-bit precision with the &#8220;nf4&#8221; quantization type (normalized float 4-bit). We enable double quantization for additional memory savings. After loading, we calculate the GPU memory usage to compare with the full-precision model.</p><p><strong>Step 4: Creating a Text Generation Function</strong></p><p>We then create a function to handle text generation for both models. The function tokenizes the input prompt, transfers it to the model&#8217;s device, and measures the generation time.</p><pre><code><strong>def</strong> run_generation(model, tokenizer, prompt):
    input_ids = tokenizer(
        prompt, return_tensors=<strong>"pt"</strong>
    ).input_ids.to(model.device)

    <em><strong># Time generation</strong></em>
    start_time = time.time()
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7
    )
    generation_time = time.time() - start_time

    <em><strong># Decode output tokens to text</strong></em>
    output_text = tokenizer.decode(output[0], skip_special_tokens=<strong>True</strong>)

    <strong>return</strong> output_text, generation_time</code></pre><p>This function generates text with up to 50 new tokens using temperature sampling (0.7) for a balance of creativity and coherence. After generation, we decode the output tokens back to the text and return both the generated text and the time taken.</p><p><strong>Step 5: Comparing the Models Qualitatively</strong></p><p>Here, we define a function to compare the output quality of both the FP16 and 4-bit models using example prompts.</p><pre><code><strong>def</strong> qualitative_comparison(
    model_fp16, 
    model_4bit, 
    tokenizer, 
    example_prompts
):
    <strong>print</strong>(<strong>"\n=== Qualitative Comparison: FP16 vs 4-bit ==="</strong>)
    <strong>print</strong>(<strong>"="</strong> * 50)

    <em><strong># Create file for saving full outputs</strong></em>
    <strong>with</strong> <strong>open</strong>(<strong>"comparison_results.txt"</strong>, <strong>"w"</strong>) <strong>as</strong> f:
        f.write(<strong>"=== Qualitative Comparison: FP16 vs 4-bit ===\n"</strong>)

        <strong>for</strong> i, prompt <strong>in</strong> <strong>enumerate</strong>(example_prompts):
            <strong>print</strong>(f<strong>"\nExample {i+1}: \"{prompt}\""</strong>)
            <strong>print</strong>(<strong>"-"</strong> * 50)

            <em><strong># Generate with FP16</strong></em>
            fp16_output, fp16_time = run_generation(
                model_fp16, tokenizer, prompt
            )

            <em><strong># Generate with 4-bit</strong></em>
            q4_output, q4_time = run_generation(
                model_4bit, tokenizer, prompt
            )

            <em><strong># Print truncated results to console</strong></em>
            <strong>print</strong>(f<strong>"FP16 ({fp16_time:.2f}s): {fp16_output[:150]}..."</strong>)
            <strong>print</strong>(f<strong>"4-bit ({q4_time:.2f}s): {q4_output[:150]}..."</strong>)

            <em><strong># Write full results to file</strong></em>
            f.write(f<strong>"\nExample {i+1}: \"{prompt}\"\n"</strong>)
            f.write(<strong>"-"</strong> * 50 + <strong>"\n"</strong>)
            f.write(f<strong>"FP16 ({fp16_time:.2f}s):\n{fp16_output}\n\n"</strong>)
            f.write(f<strong>"4-bit ({q4_time:.2f}s):\n{q4_output}\n\n"</strong>)</code></pre><p>For each prompt, we generate responses using both the FP16 and 4-bit models, recording the time taken for each generation. We print truncated outputs to the console for quick review and save the full outputs to a file for more detailed analysis later. This allows us to assess the models&#8217; quality and speed differences.</p><p><strong>Step 6: Running the Complete Comparison</strong></p><p>Finally, we implement the main function that runs the entire comparison process. This includes memory measurement, model loading, and qualitative output comparisons.</p><pre><code><strong>if</strong> <strong>__name__</strong> == <strong>"__main__"</strong>:
    example_prompts = [
        <strong>"A robot discovers what it means to be human when",
        "Explain quantum computing to a 5-year old child:",
        "Write a short poem about artificial intelligence:",
        "The main difference between supervised and unsupervised learning is",
        "Summarize the plot of Romeo and Juliet in three sentences:"</strong>
    ]

    <em><strong># Completely reset before measuring each model</strong></em>
    torch.cuda.empty_cache()
    gc.collect()
    torch.cuda.reset_peak_memory_stats()

    <em><strong># Load FP16 model and measure</strong></em>
    fp16_model, tokenizer, fp16_memory = load_fp16_model()

    <em><strong># Delete FP16 model before loading 4-bit model</strong></em>
    <strong>del</strong> fp16_model
    torch.cuda.empty_cache()
    gc.collect()
    torch.cuda.reset_peak_memory_stats()

    <em><strong># Now load 4-bit model and measure</strong></em>
    q4_model, tokenizer_4bit, q4_memory = load_4bit_model()

    <em><strong># Print memory usage stats</strong></em>
    memory_reduction = (fp16_memory - q4_memory) / fp16_memory * 100
    <strong>print</strong>(f<strong>"\nMemory usage comparison:"</strong>)
    <strong>print</strong>(f<strong>"FP16: {fp16_memory:.2f} GB"</strong>)
    <strong>print</strong>(f<strong>"4-bit: {q4_memory:.2f} GB"</strong>)
    <strong>print</strong>(f<strong>"Reduction: {memory_reduction:.2f}%"</strong>)

    <em><strong># reload the FP16 model for the comparison</strong></em>
    torch.cuda.empty_cache()
    gc.collect()
    fp16_model, _, _ = load_fp16_model()

    <em><strong># Run qualitative comparison</strong></em>
    qualitative_comparison(
        fp16_model, q4_model, tokenizer, example_prompts
    )

    <em><strong># Clean up</strong></em>
    <strong>del</strong> fp16_model, q4_model
    torch.cuda.empty_cache()</code></pre><p>We define test prompts covering various types of generation tasks, from creative writing to factual knowledge. We carefully manage GPU memory between model loads to ensure accurate measurements. We load the FP16 model first, measure its memory usage, then load the 4-bit model and measure its usage. We calculate and report the memory reduction achieved through quantization. We then reload the FP16 model (since we had to free its memory earlier) and run the qualitative comparison between both models, concluding by cleaning up resources.</p><p><strong>Results</strong></p><p>After running the complete comparison, you might see outputs like:</p><pre><code>=== Loading 4-bit Quantized Model ===
Loading checkpoint shards: 100%|&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;| 4/4 [00:07&lt;00:00,  1.89s/it]
Memory usage (4-bit): 5.42 GB

Memory usage comparison:
FP16: 14.96 GB
4-bit: 5.42 GB
Reduction: 63.73%

=== Loading Full Precision (FP16) Model ===
Loading checkpoint shards: 100%|&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;&#9608;| 4/4 [00:07&lt;00:00,  1.77s/it]
Memory usage (FP16): 20.27 GB

=== Qualitative Comparison: FP16 vs 4-bit ===
==================================================

Example 1: <strong>"A robot discovers what it means to be human when"</strong>
--------------------------------------------------
FP16 (1.84s): A robot discovers what it means to be human when it falls in love with a human woman in this futuristic science fiction tale. The year is 2036 and the...
4-bit (2.06s): A robot discovers what it means to be human when it's forced to interact with real people. When it becomes clear that a nuclear strike is imminent, th...

Example 2: <strong>"Explain quantum computing to a 5-year old child:"</strong>
--------------------------------------------------
FP16 (1.40s): Explain quantum computing to a 5-year old child: A Q&amp;A with IBM&#8217;s Dr. Talia Gershon
Dr. Talia Gershon is a quantum computing scientist at IBM. She has...
4-bit (2.03s): Explain quantum computing to a 5-year old child: a new way to think about quantum computing
Quantum computing is a very difficult concept to explain. ...

Example 3: <strong>"Write a short poem about artificial intelligence:"</strong>
--------------------------------------------------
FP16 (1.39s): Write a short poem about artificial intelligence: a poem about artificial intelligence.
A poem about artificial intelligence. A short poem about artif...
4-bit (2.02s): Write a short poem about artificial intelligence: The AI poem generator
AI is an amazing technology that can help us solve many problems. However, it&#8217;...

Example 4: <strong>"The main difference between supervised and unsupervised learning is"</strong>
--------------------------------------------------
FP16 (1.40s): The main difference between supervised and unsupervised learning is that supervised learning uses labeled data, while unsupervised learning uses unlab...
4-bit (2.03s): The main difference between supervised and unsupervised learning is that supervised learning is the learning process in which we have labeled data, wh...

Example 5: <strong>"Summarize the plot of Romeo and Juliet in three sentences:"</strong>
--------------------------------------------------
FP16 (1.39s): Summarize the plot of Romeo and Juliet in three sentences: Act 1
In Act I, Romeo and Juliet meet at a ball, fall in love, and decide to get married. T...
4-bit (2.04s): Summarize the plot of Romeo and Juliet in three sentences: What is the basic idea of Romeo and Juliet?
The basic idea of Romeo and Juliet is that two ...</code></pre><p><strong>Results Analysis</strong></p><p>Looking at the outputs of the FP16 and 4-bit quantized versions of the Llama 3 model, we can see a clear trade-off between memory usage, inference speed, and output quality. Quantization to 4-bit results in substantial memory efficiency, reducing the memory footprint from 14.96 GB in the FP16 model to 5.42 GB, representing approximately a 64% decrease. This considerable reduction is particularly advantageous for deployments with constrained memory resources.</p><p>However, quantization introduces a consistent slowdown in inference speed, with the 4-bit model experiencing around 30-45% longer response times compared to the FP16 model. Specifically, the 4-bit model typically generates outputs within 2.02-2.47 seconds, whereas the FP16 model completes similar tasks within 1.40-1.84 seconds. This slowdown is partly because we ran this setup on an NVIDIA GPU, which is highly optimized for FP16 computations through Tensor Cores. In contrast, native support for 4-bit operations on these GPUs is limited, resulting in additional overhead for dequantizing and scaling values during inference.</p><p>In terms of output quality, both models produced coherent and accurate responses across a variety of prompt types. Subtle differences do appear: the FP16 model occasionally provided more detailed answers or specific references, whereas the 4-bit model offered more generalized explanations. Even so, there was no significant degradation in quality. 
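</p>

As a sanity check on the memory numbers reported above, we can do some back-of-the-envelope arithmetic for an 8B-parameter model. The estimate below is a rough sketch: it ignores activations, the KV cache, and the layers (such as embeddings) that bitsandbytes keeps in higher precision, which is why the measured 4-bit footprint comes out somewhat larger than the theoretical one.

```python
# Rough memory estimate for an 8B-parameter model (approximation only).
params = 8e9

# FP16 stores 2 bytes per weight.
fp16_gb = params * 2 / 1024**3

# NF4 stores 4 bits per weight; per-block scales add roughly 0.5 extra
# bits per weight (less when double quantization is enabled).
nf4_gb = params * 4.5 / 8 / 1024**3

reduction = (fp16_gb - nf4_gb) / fp16_gb * 100
print(f"FP16 ~ {fp16_gb:.1f} GB")  # close to the measured 14.96 GB
print(f"NF4  ~ {nf4_gb:.1f} GB")
print(f"theoretical reduction ~ {reduction:.0f}%")
```

The FP16 estimate lands within a few percent of the measured 14.96 GB, while the measured 5.42 GB for the 4-bit model sits above the theoretical ~4.2 GB because some tensors stay in 16-bit.

<p>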
Structured creative tasks, like poetry and concise summarization, were challenging for both models, indicating that quantization did not disproportionately impact performance in these areas.</p><p>Overall, the quantization process achieved notable memory savings without substantial compromise in the coherence or quality of the generated outputs, making it highly suitable for memory-sensitive applications where slight increases in latency are acceptable.</p><p>With these results in mind, let&#8217;s now focus on the key performance trade-offs and challenges that arise in real-world deployments, where balancing efficiency, accuracy, and response times becomes essential.</p><h2>Performance Trade-offs and Challenges</h2><p>Quantization generally enhances speed and memory efficiency, yet it introduces several challenges that require careful consideration:</p><ul><li><p><strong>Accuracy Degradation:</strong> Lowering precision generally impacts metrics such as accuracy. For instance, FP16 precision typically results in negligible loss, while well-implemented int8 quantization often leads to less than a 1% drop. In contrast, 4-bit quantization may noticeably degrade performance unless advanced methods like GPTQ or AWQ are applied. A practical strategy is to begin with 8-bit quantization and to experiment with 4-bit only if further compression is needed, monitoring accuracy throughout. If losses are excessive, consider quantization-aware training (QAT) or alternative schemes (e.g., per-channel scaling or SmoothQuant).</p></li><li><p><strong>Selecting the Appropriate Method:</strong> Different scenarios call for different approaches. For quick CPU improvements, post-training quantization (PTQ) to int8 offers an easy 2&#8211;4&#215; speed boost. On GPUs with limited memory, using 8-bit quantization (via options like <code>load_in_8bit</code>) is a solid choice, while 4-bit methods such as GPTQ or AWQ can compress models further without extra training. 
If fine-tuning is possible, QAT may slightly improve accuracy over PTQ. Also, the impact of quantization can vary by task&#8212;for instance, slight increases in perplexity might more noticeably affect generative text quality than classification accuracy.</p></li><li><p><strong>Benchmarking and Performance Gains:</strong> It&#8217;s essential to assess both accuracy and latency. Quantization yields benefits only when the hardware and runtime are optimized for lower precision. With specialized runtimes (e.g., IPEX, TensorRT), researchers have reported up to 3.5&#215; faster inference on A100 GPUs using 3-bit GPTQ models and even 4.5&#215; on older GPUs. On CPUs, int8 quantization can offer 4&#8211;8&#215; speed improvements over FP32.</p></li><li><p><strong>Integration with Other Compression Techniques:</strong> Quantization can be effectively combined with methods like pruning, distillation, and efficient architectures. For example, distilling a large model into a medium one and then applying int8 quantization produces a lightweight yet high-performing model. Integrated frameworks like the Intel Neural Compressor combine both pruning and quantization, although each added compression step requires careful evaluation to balance performance and accuracy.</p></li></ul><p>In summary, while quantization dramatically improves performance, careful calibration, minimal fine-tuning, and advanced algorithms are crucial to maintaining accuracy. Always evaluate the quantized model using key metrics to ensure it meets your requirements.</p><h2>Conclusion</h2><p>Model quantization is a transformative technique for optimizing LLMs. By reducing numerical precision&#8212;from FP32 to formats such as FP16, INT8, or even 4-bit representations&#8212;quantization substantially lowers memory usage and computational demands. 
This process, whether through post-training quantization or quantization-aware training, enables faster inference and cost-effective deployment on resource-constrained devices while preserving acceptable accuracy. However, striking the right balance between efficiency and performance remains essential, as aggressive quantization can introduce trade-offs such as minor accuracy losses or slower generation speeds in some cases.</p><p>If you're working with large-scale LLMs, understanding techniques like quantization is just one piece of the puzzle. Optimizing model performance, fine-tuning effectively, and managing deployment trade-offs are all critical to building efficient AI systems. If you want to develop scalable, high-performance LLM products without wasting time on trial and error, our <a href="https://tinyurl.com/quantizationforLLMs">Beginner to Advanced LLM Developer</a> course provides the in-depth guidance you need. Learn model optimization, fine-tuning strategies, and practical implementation techniques&#8212;all designed to help you build smarter, more efficient AI solutions.</p>]]></content:encoded></item><item><title><![CDATA[How To Construct Self-Attention Mechanisms For Arbitrary Long Sequences]]></title><description><![CDATA[Toward Infinite Sequence Lengths]]></description><link>https://newsletter.theaiedge.io/p/how-to-construct-self-attention-mechanisms</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/how-to-construct-self-attention-mechanisms</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Fri, 21 Mar 2025 15:01:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F547ce887-3fc7-4bf9-8af9-76bfb39308a0_1500x887.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>With Gemini models having a 2M tokens context size and Claude having a 200K tokens context size while 
keeping time-to-first-token low, these models must rely on modifications of the attention mechanism that explicitly handle extremely long sequences. Considering the size of those models, naively attending over 2M tokens would require petabytes of GPU memory to generate each token! So, we need strategies to handle those sequence sizes efficiently! Let&#8217;s dive in:</strong></em></p><ul><li><p><em><strong>Transformer-XL</strong></em></p></li><li><p><em><strong>Memorizing Transformers</strong></em></p></li><li><p><em><strong>Infini-Attention</strong></em></p></li></ul><div><hr></div><p>In the previous newsletters, we examined methods to reduce attention's computational complexity. In this newsletter, we are going to focus on designing attention mechanisms specifically optimized for processing extremely long contexts. The fundamental difference lies in the objective. Low-complexity attention methods primarily aim to approximate standard attention more efficiently, whereas long-sequence attention mechanisms fundamentally rethink how information flows across distant positions. Rather than merely making attention more computationally feasible, we are going to look at strategies to make distant contextual information meaningfully accessible and useful to the model.</p><h2>Transformer-XL</h2><p><a href="https://arxiv.org/pdf/1901.02860">Transformer-XL</a> was proposed in 2019 as a way to process sequences of virtually unlimited length while maintaining coherent information flow across the entire document. The main limitations in handling sequences of any length are:</p><ul><li><p>The typical time complexity <em><strong>O(N<sup>2</sup>)</strong></em> of the attention layers. However, we have already seen strategies to reduce that complexity to <em><strong>O(N)</strong></em>. For an autoregressive process, we are bounded from below by, at least, a linear decoding process in the sequence length. Therefore, we can never do better than this theoretical constraint. 
</p></li><li><p>The absolute positional encoding proposed in the original "Attention is all you need" paper is the main blocker for encoding arbitrary sequence lengths. We would need a way to encode any possible position, which is hard in practice.</p></li><li><p>Another important blocker is the memory constraint. Longer sequences take more space in memory, and we would reach an upper bound in length when the GPU memory becomes saturated. We saw when discussing FlashAttention that a high-end NVIDIA A100-80GB could realistically handle a maximum sequence length of 5,932 for a GPT-3 model. </p></li></ul><p>With Transformer-XL, we are going to design an attention mechanism that processes sequences with linear time complexity, constant memory complexity, and a novel relative positional encoding that captures the relative distance between tokens instead of their absolute positions. We are going to delay diving into the relative positional encoding until a future newsletter and focus here on how to process arbitrary sequence lengths in linear time with bounded memory constraints.</p><p>To illustrate how it works, let's consider the following toy example of an input sequence: </p><blockquote><p><strong>"Teaching computers to see the world makes every colorful dataset an adventure"</strong></p></blockquote><p>The strategy with Transformer-XL is to handle the incoming tokens by segments. We break down the incoming sequence into segments, typically of size ~128-512 tokens during training and up to 1600 tokens during evaluation. 
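</p>

Before walking through the toy example, the segment-splitting and caching scheme can be sketched in a few lines (an illustrative simplification, not the paper's implementation):

```python
# Sketch of Transformer-XL's segment-level recurrence: fixed-size segments,
# each attending to a cached memory of the previous segment's hidden states.

def split_into_segments(tokens, n):
    """Break a token sequence into consecutive segments of length n."""
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]

tokens = ("Teaching computers to see the world makes every "
          "colorful dataset an adventure").split()
segments = split_into_segments(tokens, 4)
print(segments[0])  # ['Teaching', 'computers', 'to', 'see']

memory = []  # cached hidden states from the previous segment
for segment in segments:
    hidden = [f"h({tok})" for tok in segment]  # stand-ins for real vectors
    # Queries come from the current segment only, but keys/values also
    # see the cached memory, so information crosses segment boundaries.
    keys_and_values = memory + hidden
    # Cache the current states for the next segment (detached from the
    # computation graph in a real implementation, keeping memory bounded).
    memory = hidden
```

Note that the cache holds only one segment's worth of states, which is what bounds the memory footprint regardless of the total sequence length.

<p>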
For our toy example, let's assume that our segments are four tokens long:</p><ul><li><p>Segment 1: <strong>['Teaching', 'computers', 'to', 'see']</strong></p></li><li><p>Segment 2: <strong>['the', 'world', 'makes', 'every']</strong></p></li><li><p>Segment 3: <strong>['colorful', 'dataset', 'an', 'adventure']</strong></p></li></ul><p>Formally, we divide the incoming sequence into <em><strong>T = N / n</strong></em> segments, where <em><strong>n</strong></em> is the number of tokens per segment. We are going to create a segment-level recurrence to generate the output vectors from the model:</p><ol><li><p>Generate all the hidden states <em><strong>[H<sup>1</sup><sub>1</sub>, H<sup>2</sup><sub>1</sub>, &#8230;,H<sup>L</sup><sub>1</sub>]</strong></em> related to the first segment <em><strong>&#120649; = 1</strong></em>. <em><strong>l &#8712; {1, &#8230;, L}</strong></em> is the layer index, and <em><strong>L</strong></em> is the total number of layers in the model. Full attention is computed within each segment, so its time and memory complexity is <em><strong>O(n<sup>2</sup>)</strong></em>. 
We are going to cache the intermediate representations of the tokens <em><strong>[H<sup>1</sup><sub>1</sub>, H<sup>2</sup><sub>1</sub>, &#8230;,H<sup>L</sup><sub>1</sub>]</strong></em> for the next iteration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lQhM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lQhM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 424w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 848w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 1272w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lQhM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:457510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lQhM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 424w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 848w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 1272w, https://substackcdn.com/image/fetch/$s_!lQhM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cd684d3-0388-4a0b-bddc-3c50d7e1dd83_1500x643.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>In the next iteration, we are going to consider the second segment and its interaction with the first segment. 
At each layer, we retrieve the hidden states of segment <em><strong>&#120649; = 1</strong></em> and append them to the hidden state of segment 2</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{H}_2^l=\\left[H_1^l;H_2^l\\right]&quot;,&quot;id&quot;:&quot;BMGYBLIHUL&quot;}" data-component-name="LatexBlockToDOM"></div><p>and we compute the next hidden states by passing them through the layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H_2^{l+1}=\\text{Layer}_l\\left(\\tilde{H}_2^l\\right)&quot;,&quot;id&quot;:&quot;OYMUZGMRDD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the attention matrix is computed across segments 1 and 2 and is, therefore, of size <em><strong>2n x 2n</strong></em>, which still follows a quadratic complexity ~<em><strong>O(n<sup>2</sup>)</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EzNG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EzNG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 424w, https://substackcdn.com/image/fetch/$s_!EzNG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 848w, https://substackcdn.com/image/fetch/$s_!EzNG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EzNG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EzNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png" width="1456" height="931" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:582622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EzNG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 424w, https://substackcdn.com/image/fetch/$s_!EzNG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 848w, 
https://substackcdn.com/image/fetch/$s_!EzNG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 1272w, https://substackcdn.com/image/fetch/$s_!EzNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc965653-d8b3-46a2-bc5b-75844633b9cf_1500x959.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p>In general, for any segment &#120649;, we retrieve the computed hidden states 
<em><strong>[H<sup>1</sup><sub>&#120649;-1</sub>, H<sup>2</sup><sub>&#120649;-1</sub>, &#8230;,H<sup>L</sup><sub>&#120649;-1</sub>]</strong></em> for the previous segment <em><strong>&#120649; - 1</strong></em>, and append them to the hidden states of the current segment <em><strong>H<sup>l</sup><sub>&#120649;</sub></strong></em> for layer <em><strong>l</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{H}_{\\tau}^l=\\left[H_{\\tau-1}^l;H_{\\tau}^l\\right]&quot;,&quot;id&quot;:&quot;LSQPQAUENC&quot;}" data-component-name="LatexBlockToDOM"></div><p>and compute the next hidden states:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H_{\\tau}^{l+1}=\\text{Layer}_l\\left(\\tilde{H}_{\\tau}^l\\right)&quot;,&quot;id&quot;:&quot;LLAOSQDQFP&quot;}" data-component-name="LatexBlockToDOM"></div><p>At every point during this recurring process, the time and space complexity is at most ~<em><strong>O(n<sup>2</sup>)</strong></em>, and we iterate this process until we reach the last segment in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rx8c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rx8c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 424w, https://substackcdn.com/image/fetch/$s_!Rx8c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 848w, 
https://substackcdn.com/image/fetch/$s_!Rx8c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 1272w, https://substackcdn.com/image/fetch/$s_!Rx8c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rx8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png" width="1456" height="931" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:593381,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rx8c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 424w, 
https://substackcdn.com/image/fetch/$s_!Rx8c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 848w, https://substackcdn.com/image/fetch/$s_!Rx8c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 1272w, https://substackcdn.com/image/fetch/$s_!Rx8c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e25d566-8f8a-46b9-979d-d2153d7281ef_1500x959.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ol><p>The recurrence mechanism effectively creates a form of "memory" that allows the model to maintain coherent understanding across very long texts while keeping computational requirements manageable. Transformer-XL's ability to maintain coherence across long contexts is directly related to its depth. <em><strong>H<sup>l+1</sup><sub>&#120649;</sub></strong></em> depends on <em><strong>H<sup>l</sup><sub>&#120649;-1</sub></strong></em> and <em><strong>H<sup>l</sup><sub>&#120649;</sub></strong></em>, which means that the generation of the <em><strong>n</strong></em> hidden states in <em><strong>H<sup>l+1</sup><sub>&#120649;</sub></strong></em> depends on the <em><strong>2 x n</strong></em> hidden states in <em><strong>[H<sup>l</sup><sub>&#120649;-1</sub>; H<sup>l</sup><sub>&#120649;</sub>]</strong></em>. The hidden states <em><strong>H<sup>l</sup><sub>&#120649;-1</sub></strong></em> also depend on <em><strong>H<sup>l-1</sup><sub>&#120649;-2</sub></strong></em> and <em><strong>H<sup>l-1</sup><sub>&#120649;-1</sub></strong></em>. Therefore, <em><strong>H<sup>l+1</sup><sub>&#120649;</sub></strong></em> depends on the <em><strong>3 x n</strong></em> hidden states in <em><strong>H<sup>l-1</sup><sub>&#120649;</sub></strong></em>, <em><strong>H<sup>l-1</sup><sub>&#120649;-1</sub></strong></em>, <em><strong>H<sup>l-1</sup><sub>&#120649;-2</sub></strong></em>. If we go back <em><strong>k</strong></em> layers in the model, then the hidden states in <em><strong>H<sup>l+1</sup><sub>&#120649;</sub></strong></em> depend on the <em><strong>(k+1) x n</strong></em> hidden states in <em><strong>[H<sup>l-k+1</sup><sub>&#120649;</sub>, H<sup>l-k+1</sup><sub>&#120649;-1</sub>, &#8230;, H<sup>l-k+1</sup><sub>&#120649;-k</sub>]</strong></em>. 
By unrolling the recurrence of the hidden states' dependency, we establish the functional relationship:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; H_{\\tau}^{l+1}=f_k(H_{\\tau}^{l-k+1}, H_{\\tau-1}^{l-k+1}, \\ldots, H_{\\tau-k}^{l-k+1})&quot;,&quot;id&quot;:&quot;DJHHOHIYPR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This creates a recurrence relation where information from segment <em><strong>&#120649; &#8722;k</strong></em> can reach the final layer <em><strong>L</strong></em> in segment <em><strong>&#120649;</strong></em> only if <em><strong>k+1 &#8804; L</strong></em>. If we want to cover all the <em><strong>N</strong></em> tokens in the sequence, with <em><strong>k = N/n - 1</strong></em>, this means we need:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    &amp;\\frac{N}{n} \\leq L  \\nonumber\\\\\n    \\text{or equivalently } &amp;N  \\leq nL\n\\end{align}&quot;,&quot;id&quot;:&quot;AXJLCEVFFN&quot;}" data-component-name="LatexBlockToDOM"></div><p>The theoretical dependency length is, therefore,&nbsp;<em><strong>O(n x L)</strong></em>. If the network is not deep enough, its ability to propagate information across segments would be limited. This is because information flows through layers within a segment before being passed to the next segment. 
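</p><p>This depth constraint can be checked with a small dependency simulation (a pure-Python sketch of our own; it tracks which input tokens each hidden state can depend on, and it ignores the causal masking within a segment for simplicity):</p><pre><code class="language-python">def receptive_field(num_segments, n, num_layers):
    """How many input tokens the last position of the last segment can
    depend on, under segment-level recurrence with a one-segment cache."""
    memory = [[] for _ in range(num_layers)]  # per-layer cache of the previous segment
    deps = None
    for t in range(num_segments):
        # Layer-0 states: each position depends only on its own token.
        h = [{t * n + i} for i in range(n)]
        for l in range(num_layers):
            extended = memory[l] + h  # [SG(H_{tau-1}^l); H_tau^l]
            memory[l] = h             # cache the current states for the next segment
            # Attention over the extended window: every current position can
            # now depend on everything visible in the window.
            union = set().union(*extended)
            h = [set(union) for _ in range(n)]
        deps = h[-1]
    return len(deps)

# Each layer reaches one segment further back, so the dependency length is
# capped by the depth (~O(n x L)) no matter how long the sequence is:
print(receptive_field(num_segments=8, n=4, num_layers=2))  # 12
print(receptive_field(num_segments=8, n=4, num_layers=8))  # 32</code></pre><p>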
A shallow Transformer-XL would still have the unbounded context mechanism but might not effectively utilize information from distant parts of the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlqJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlqJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 424w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 848w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 1272w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png" width="1456" height="932" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:932,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:821219,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dlqJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 424w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 848w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 1272w, https://substackcdn.com/image/fetch/$s_!dlqJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5842d40b-cf33-4ca2-bb51-4bd49c3f1e4f_1500x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So far, we have only considered the forward pass, but it becomes messy when we consider the backward pass! 
When we compute the gradient of the loss function, we need to relate it to each hidden state it depends on, such as <em><strong>H<sup>l-k+1</sup><sub>&#120649;-k</sub></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\frac{\\partial \\mathcal{L}}{\\partial H_{\\tau-k}^{l-k+1}}&amp;=\\sum_{i=l-k+1}^{l+1}\\sum_{j=\\tau-k}^\\tau\\frac{\\partial \\mathcal{L}}{\\partial H_{j}^{i}}\\cdot\\frac{\\partial H_{j}^{i}}{\\partial H_{\\tau-k}^{l-k+1}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;TJYAHAQFOA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The layer-wise summation </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i=l-k+1}^{l+1}&quot;,&quot;id&quot;:&quot;ZTNGLOIUOM&quot;}" data-component-name="LatexBlockToDOM"></div><p>must consider all layers from <em><strong>l-k+1</strong></em> up to <em><strong>l+1</strong></em>, and the segment-wise summation </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{j=\\tau-k}^\\tau&quot;,&quot;id&quot;:&quot;URASMDYKUX&quot;}" data-component-name="LatexBlockToDOM"></div><p>must consider all segments from <em><strong>&#120649; &#8722;k</strong></em> up to <em><strong>&#120649;</strong></em>. This creates <em><strong>(k+1)<sup>2</sup></strong></em> gradient paths for a single hidden state segment. Now, for a sequence of length <em><strong>N</strong></em> divided into <em><strong>N/n</strong></em> segments, if we extend this to consider all possible hidden states that influence the loss, we get approximately:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    \\sum_{k=1}^{N/n}(k+1)^2\\approx \\frac{\\left(N/n\\right)^3}{3}\\sim\\mathcal{O}(N^3)&quot;,&quot;id&quot;:&quot;EVIGYNQPVA&quot;}" data-component-name="LatexBlockToDOM"></div><p>gradient paths to compute. This cubic growth makes training infeasible for long sequences. 
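</p><p>We can sanity-check this cubic growth with a simple counting sketch (our own illustration, not an actual backward pass):</p><pre><code class="language-python">def gradient_paths(T):
    """Approximate number of gradient paths for T = N/n segments when
    gradients are allowed to flow across all segment boundaries."""
    return sum((k + 1) ** 2 for k in range(1, T + 1))

# The count approaches T**3 / 3, i.e., cubic in the number of segments:
for T in (10, 100, 1000):
    print(T, gradient_paths(T))</code></pre><p>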
We are going to modify the segment-level dependency by introducing the Stop-Gradient <em><strong>SG(.)</strong></em> operator (<a href="https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html">detach</a> in PyTorch):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    \\tilde{H}_{\\tau}^l=\\left[SG(H_{\\tau-1}^l);H_{\\tau}^l\\right]&quot;,&quot;id&quot;:&quot;SISPFRCXTE&quot;}" data-component-name="LatexBlockToDOM"></div><p>This operation prevents gradients from flowing backward through the cached states during backpropagation. This means:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    SG(H_{\\tau-1}^l) &amp;= H_{\\tau-1}^l\\quad &amp;\\text{During the forward pass}&amp; \\nonumber\\\\\n\n    \\frac{\\partial \\tilde{H}_{\\tau}^l}{\\partial SG(H_{\\tau-1}^l)}&amp;=0 \\quad &amp;\\text{During the backward pass}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;SEQJEXENMH&quot;}" data-component-name="LatexBlockToDOM"></div><p>By applying <em><strong>SG(.)</strong></em>, the previous segment hidden states <em><strong>H<sup>l</sup><sub>&#120649;-1</sub></strong></em> are treated as constants during the backward pass. When we apply <em><strong>SG(.)</strong></em> to prevent gradient flow across segment boundaries, we introduce a profound asymmetry:</p><ul><li><p><strong>Information Asymmetry:</strong> During the forward pass, the model can access and use information from previous segments, but during backpropagation, it cannot receive gradient signals from future segments.</p></li><li><p><strong>Truncated Credit Assignment:</strong> The model cannot directly attribute credit or blame to decisions made in previous segments, even though those decisions influence future outcomes.</p></li></ul><p>This creates a unique learning paradigm where the model must learn to encode useful information in its hidden states without direct optimization signals for long-range dependencies. 
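</p><p>The effect of the stop-gradient operator can be demonstrated with a minimal scalar autograd (a micrograd-style sketch of our own, standing in for PyTorch's detach):</p><pre><code class="language-python">class Var:
    """A scalar value that records how it was computed, for backprop."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent, local_gradient)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def detach(self):
        # SG(.): same value in the forward pass, but no parents, so no
        # gradient flows back through this node during backprop.
        return Var(self.value)

    def backward(self, grad=1.0):
        self.grad += grad
        for parent, local in self.parents:
            parent.backward(grad * local)

# Without SG, the cached state receives a gradient signal:
h_prev = Var(2.0)
(h_prev * Var(3.0)).backward()
print(h_prev.grad)  # 3.0

# With SG, the cached state is treated as a constant:
h_prev = Var(2.0)
(h_prev.detach() * Var(3.0)).backward()
print(h_prev.grad)  # 0.0</code></pre><p>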
Interestingly, despite not being directly optimized for very long dependencies, Transformer-XL still develops impressive long-range capabilities because the recurrence mechanism creates paths for information to propagate forward.</p><p>For any two layers <em><strong>a</strong></em> and <em><strong>b</strong></em>, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\frac{\\partial H_{\\tau'}^{a}}{\\partial SG(H_{\\tau}^{b})}=0 \\quad\\text{ if } \\tau'\\neq \\tau&quot;,&quot;id&quot;:&quot;QQITKIRAXO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Therefore, the loss backpropagation simplifies to</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\frac{\\partial \\mathcal{L}}{\\partial H_{\\tau-k}^{l-k+1}}&amp;=\\sum_{i=l-k+1}^{l+1}\\frac{\\partial \\mathcal{L}}{\\partial H_{\\tau-k}^{i}}\\cdot\\frac{\\partial H_{\\tau-k}^{i}}{\\partial H_{\\tau-k}^{l-k+1}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;WEOTABKUNE&quot;}" data-component-name="LatexBlockToDOM"></div><p>which reduces the computational complexity from the cubic behavior <em><strong>O((N/n)<sup>3</sup>)</strong></em> to the linear one <em><strong>O(N/n)</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RsVA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RsVA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 424w, 
https://substackcdn.com/image/fetch/$s_!RsVA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 848w, https://substackcdn.com/image/fetch/$s_!RsVA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 1272w, https://substackcdn.com/image/fetch/$s_!RsVA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RsVA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png" width="1456" height="830" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:830,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:706139,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!RsVA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 424w, https://substackcdn.com/image/fetch/$s_!RsVA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 848w, https://substackcdn.com/image/fetch/$s_!RsVA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 1272w, https://substackcdn.com/image/fetch/$s_!RsVA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbfc09403-f998-46cc-820a-c2fb872523ab_1500x855.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Each segment has length <em><strong>n</strong></em>, and when processing a single segment, attention is computed over the current segment plus the cached previous segment. The time complexity of the computation per segment is, therefore,&nbsp;<em><strong>~O(4n<sup>2</sup>) = O(n<sup>2</sup>)</strong></em>. For a sequence of length <em><strong>N</strong></em>, we process approximately <em><strong>T = N/n</strong></em> segments, and each segment requires <em><strong>O(n<sup>2</sup>)</strong></em>. Therefore, the complexity of the total operations is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}\\left(\\frac{N}{n} \\times n^2\\right) = \\mathcal{O}(N n)&quot;,&quot;id&quot;:&quot;WCCPAOYPIY&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is linear in the sequence size <em><strong>N</strong></em>. The resulting attention pattern is very similar to sparse attention with a sliding window. 
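</p>

<p>To make this concrete, here is a minimal NumPy sketch of one attention step over a single segment with a cached previous segment (single head, no relative positional encodings, no stop-gradient bookkeeping; all names and shapes are illustrative, not the paper's implementation):</p>

```python
import numpy as np

def xl_segment_pass(h_seg, mem, Wq, Wk, Wv):
    """One single-head attention step over a segment, Transformer-XL style.
    h_seg: (n, d) current segment; mem: (n, d) cached previous segment."""
    n, d = h_seg.shape
    ext = np.concatenate([mem, h_seg], axis=0)       # (2n, d): cached + current
    q = h_seg @ Wq                                   # queries only for the current segment
    k, v = ext @ Wk, ext @ Wv                        # keys/values over both segments
    scores = q @ k.T / np.sqrt(d)                    # (n, 2n): ~4n^2 work per segment
    # causal mask: token i sees all of mem plus current tokens j <= i
    mask = np.concatenate(
        [np.ones((n, n), dtype=bool), np.tril(np.ones((n, n), dtype=bool))], axis=1)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n, d) new hidden states

# Processing N/n segments sequentially costs O((N/n) * n^2) = O(N * n):
rng = np.random.default_rng(0)
n, d, num_segments = 4, 8, 3
Wq, Wk, Wv = rng.normal(size=(3, d, d))
mem = np.zeros((n, d))                               # empty cache for the first segment
for _ in range(num_segments):
    seg = rng.normal(size=(n, d))
    out = xl_segment_pass(seg, mem, Wq, Wk, Wv)
    mem = seg                                        # cache the current segment for the next one
```

<p>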
Beyond the current and past segments, the hidden states are blind to the other tokens.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UgyD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UgyD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 424w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 848w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 1272w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UgyD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png" width="1456" height="921" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:337832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/159515179?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UgyD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 424w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 848w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 1272w, https://substackcdn.com/image/fetch/$s_!UgyD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ffae1a5-9c1b-41ed-a376-a926e8292644_1500x949.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Whereas with sparse attention the memory requirements grow linearly with the sequence size, with Transformer-XL the space complexity is bounded by <em><strong>O(n<sup>2</sup>)</strong></em> = <em><strong>O(1)</strong></em>, since <em><strong>n</strong></em> is a fixed number. We still need to choose <em><strong>n</strong></em> large enough to make efficient use of the high GPU parallelism.</p><h2>Memorizing Transformers</h2><p>In Transformer-XL, the long-range coherence is captured by the successive layers in the model, but the direct interaction between tokens is lost beyond a two-segment window. In 2022, Google introduced <a href="https://arxiv.org/pdf/2203.08913">Memorizing Transformers</a>, which extend the long-range coherence by caching previous key-value pairs for selective retrieval. 
The attention computation is broken down into two parts:</p><ul><li><p><strong>The local attention:</strong> As before, we partition the input sequence into segments (usually 512 tokens) and compute the token-token interactions within each segment:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; C_\\text{local} = \\text{Softmax}\\left(\\frac{Q_\\text{local}K^\\top_\\text{local}}{\\sqrt{d}}\\right)V_\\text{local}&quot;,&quot;id&quot;:&quot;ZEKBDPQOHF&quot;}" data-component-name="LatexBlockToDOM"></div><p> where <em><strong>Q<sub>local</sub></strong></em>, <em><strong>K<sub>local</sub></strong></em>, and <em><strong>V<sub>local</sub></strong></em> are the local queries, keys, and values within the segment. </p></li></ul>
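<p>As a rough NumPy sketch (single head, causal masking omitted for brevity; illustrative only), the local part amounts to per-segment softmax attention. The memory half of the model, a kNN lookup over the cached key-value pairs, is not shown here:</p>

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def local_attention(Q, K, V, seg_len=512):
    """C_local: each token attends only within its own segment.
    Q, K, V: (N, d). The kNN attention over the external memory of
    cached (key, value) pairs would be combined with this output."""
    N, d = Q.shape
    out = np.empty_like(V)
    for s in range(0, N, seg_len):
        q, k, v = Q[s:s+seg_len], K[s:s+seg_len], V[s:s+seg_len]
        out[s:s+seg_len] = softmax(q @ k.T / np.sqrt(d)) @ v
    return out

# Example: 8 tokens split into segments of 4 (tiny, for illustration)
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 8, 4))
C_local = local_attention(Q, K, V, seg_len=4)
```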
      <p>
          <a href="https://newsletter.theaiedge.io/p/how-to-construct-self-attention-mechanisms">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Improve Decoding Latency With Faster Self-Attention Mechanisms]]></title><description><![CDATA[In LLMs, handling large sequences is not enough, we need to make sure the decoding process is fast.]]></description><link>https://newsletter.theaiedge.io/p/how-to-improve-decoding-latency-with</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/how-to-improve-decoding-latency-with</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 12 Mar 2025 15:02:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0950e6c1-7b78-419c-ab24-723970786b85_1500x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>In LLMs, handling large sequences is not enough, we need to make sure the decoding process is fast. Here we explore 3 typical approaches used to speed up the decoding process: </strong></em></p><ul><li><p><em><strong>Multi&#8209;Query Attention</strong></em></p></li><li><p><em><strong>Grouped&#8209;Query Attention</strong></em></p></li><li><p><em><strong>DeepSeek Multi-head Latent Attention</strong></em></p></li></ul><div><hr></div><p>When generating text one token at a time, transformers face a significant performance challenge. In the original transformer architecture with Multi-Head Attention (MHA), each time we generate a new token, we need to:</p><ul><li><p>Load the entire history of key (<em><strong>K</strong></em>) and value (<em><strong>V</strong></em>) matrices for each attention head</p></li><li><p>Process them against the new query</p></li><li><p>Generate the next token</p></li><li><p>Repeat</p></li></ul><p>This creates a memory bandwidth bottleneck. For long sequences, we're constantly reloading massive tensors from memory, which becomes the limiting factor in generation speed. 
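</p>

<p>A quick back-of-the-envelope calculation shows the scale of the problem (the model configuration below is hypothetical, chosen only for illustration):</p>

```python
def kv_cache_bytes(n_layers, n_head, d_head, seq_len, bytes_per_elem=2):
    """Size of the cached K and V history for one sequence.
    The factor 2 counts both K and V; bytes_per_elem=2 assumes fp16."""
    return 2 * n_layers * n_head * d_head * seq_len * bytes_per_elem

# Hypothetical GPT-3-sized model: 96 layers, 96 heads of dim 128, 32k context
size = kv_cache_bytes(n_layers=96, n_head=96, d_head=128, seq_len=32768)
print(f"{size / 1e9:.0f} GB")  # ~155 GB of K/V data re-read while decoding
```

<p>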
Specifically, with <em><strong>n<sub>head</sub></strong></em> attention heads and a sequence of length <em><strong>N</strong></em>, the original approach required loading tensors of size approximately <em><strong>n<sub>head</sub> x d<sub>model</sub> x  N</strong></em>, which could be gigabytes of data for long contexts. Since those <em><strong>K</strong></em> and <em><strong>V</strong></em> tensors are a bottleneck for the decoding process, we are going to look at strategies to minimize their memory requirements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O68B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O68B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 424w, https://substackcdn.com/image/fetch/$s_!O68B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 848w, https://substackcdn.com/image/fetch/$s_!O68B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 1272w, https://substackcdn.com/image/fetch/$s_!O68B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!O68B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png" width="1456" height="696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:696,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416410,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158893660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O68B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 424w, https://substackcdn.com/image/fetch/$s_!O68B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 848w, https://substackcdn.com/image/fetch/$s_!O68B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 1272w, https://substackcdn.com/image/fetch/$s_!O68B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ca7e7c-15cc-458e-b9ed-e30414d1d589_1500x717.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Strategies like multi&#8209;query attention, grouped-query attention, and multi-latent attention are highly coupled with the <a href="https://arxiv.org/pdf/2211.05102">KV-caching technique</a>. With KV-caching, instead of recomputing the same <em><strong>K</strong></em> and <em><strong>V</strong></em> matrices over and over, we cache them in memory and update them at each iteration. 
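</p>

<p>In NumPy terms, a KV-cache for a single head amounts to an append-only buffer (a sketch with illustrative names, not a production implementation):</p>

```python
import numpy as np

class KVCache:
    """Append-only per-head cache: each decoding step stores only the new
    token's key and value instead of recomputing K and V for all history."""
    def __init__(self, d_head):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, k_new, v_new):
        # k_new, v_new: (d_head,) projections of the newest token only
        self.K = np.vstack([self.K, k_new])
        self.V = np.vstack([self.V, v_new])
        return self.K, self.V                # shapes (t, d_head) after t steps

cache = KVCache(d_head=4)
for t in range(3):
    K, V = cache.append(np.full(4, float(t)), np.ones(4))
```

<p>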
Loading the KV-cache from memory becomes the dominant bottleneck instead of the compute itself.</p><h2>Multi&#8209;Query Attention</h2><p>In 2019, Noam Shazeer published a variant of the multi-head attention: <a href="https://arxiv.org/pdf/1911.02150">the multi-query attention (MQA)</a>. He realized that the model could maintain most of its capability while sharing a single set of keys and values across all heads. This insight might seem counterintuitive since the whole point of multi-head attention was to have different "perspectives" on the same information. What Shazeer discovered is that the diversity of the query projections still allows different heads to extract different information: the shared key-value store acts as a common knowledge repository, and each head can still attend to different parts of it.</p><p>Formally, this means that the projection matrix <em><strong>W<sup>Q</sup></strong></em> is still a <em><strong>d<sub>model</sub> x d<sub>model</sub></strong></em> matrix, but the projections <em><strong>W<sup>K</sup></strong></em> and <em><strong>W<sup>V</sup></strong></em> are <em><strong>d<sub>model</sub> x d<sub>head</sub></strong></em> matrices. 
With an incoming hidden state <em><strong>H</strong></em> of dimension <em><strong>d<sub>model</sub> x N</strong></em>, we have at training time the initial projections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    Q &amp;= W^QH, \\quad \\text{Shape: }d_\\text{model}\\times N \\nonumber\\\\\n\n    K &amp;= W^KH, \\quad \\text{Shape: }d_\\text{head}\\times N  \\nonumber\\\\\n\n    V &amp;= W^VH, \\quad \\text{Shape: }d_\\text{head}\\times N\n\n\\end{align}&quot;,&quot;id&quot;:&quot;FEKPSFVMVI&quot;}" data-component-name="LatexBlockToDOM"></div><p>We then reshape the matrices into tensors to highlight the number of heads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    Q' &amp;= \\text{Reshape}(Q), \\quad \\text{Shape: }n_\\text{head}\\times d_\\text{head}\\times N \\nonumber\\\\\n\n    K' &amp;= K, \\quad \\text{Shape: }d_\\text{head}\\times N  \\nonumber\\\\\n\n    V' &amp;= V, \\quad \\text{Shape: }d_\\text{head}\\times N\n\n\\end{align}&quot;,&quot;id&quot;:&quot;RXWAQZAMDF&quot;}" data-component-name="LatexBlockToDOM"></div><p>At inference time, we only consider the last query <em><strong>q&#8217;<sub>N</sub></strong></em> of size <em><strong>d<sub>head</sub> x n<sub>head</sub></strong></em> in the input sequence since we only need to predict the last token. 
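</p>

<p>The per-step computation that the following equations spell out can be sketched in NumPy as (single layer, one decoding step; names and shapes are illustrative):</p>

```python
import numpy as np

def mqa_decode_step(q_heads, K, V, Wo):
    """One MQA decoding step for the newest token.
    q_heads: (n_head, d_head) per-head queries for the last position.
    K, V:    (N, d_head) shared key/value matrices (one set for all heads).
    Wo:      (d_model, d_model) output projection, d_model = n_head * d_head."""
    d_head = K.shape[1]
    scores = q_heads @ K.T / np.sqrt(d_head)  # (n_head, N): K broadcast to every head
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    ctx = weights @ V                         # (n_head, d_head): V broadcast likewise
    return Wo @ ctx.reshape(-1)               # (d_model,): mix information across heads

rng = np.random.default_rng(2)
n_head, d_head, N = 4, 8, 10
q = rng.normal(size=(n_head, d_head))
K = rng.normal(size=(N, d_head))
V = rng.normal(size=(N, d_head))
Wo = rng.normal(size=(n_head * d_head, n_head * d_head))
c_final = mqa_decode_step(q, K, V, Wo)
```

<p>Note how the cache holds a single <code>(N, d_head)</code> pair of tensors rather than one pair per head.</p><p>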
The alignment scores are computed by broadcasting the matrix <em><strong>K&#8217;</strong></em> to all the heads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{e}_N'=\\frac{\\mathbf{q}_N'^{\\top}K'}{\\sqrt{d_\\text{head}}}, \\quad \\text{Shape: } n_\\text{head}\\times N&quot;,&quot;id&quot;:&quot;LPMJAOYYUS&quot;}" data-component-name="LatexBlockToDOM"></div><p>We perform the softmax transformation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{a}_N'=\\text{Softmax}(\\mathbf{e}_N'), \\quad \\text{Shape: } n_\\text{head}\\times N&quot;,&quot;id&quot;:&quot;PUYAICDBRL&quot;}" data-component-name="LatexBlockToDOM"></div><p>The value is again broadcasted to all heads to compute the context vector corresponding to the prediction:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{c}_N' = \\mathbf{a}_N'V'^{\\top}, \\quad \\text{Shape: }  n_\\text{head} \\times d_\\text{head}&quot;,&quot;id&quot;:&quot;WBMNMTFPYD&quot;}" data-component-name="LatexBlockToDOM"></div><p>The context vector is reshaped in the original dimensions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{c}_N = \\text{Reshape}(\\mathbf{c}_N'), \\quad \\text{Shape: } d_\\text{model} \\times 1&quot;,&quot;id&quot;:&quot;NNBQNOARMF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Finally, the context vector is projected one last time to mix the information from the different heads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{c}_N^{\\text{final}} = W^O \\mathbf{c}_N, \\quad \\text{Shape: } d_\\text{model} \\times 1&quot;,&quot;id&quot;:&quot;SIQMCJAQTT&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!-DnP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-DnP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 424w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 848w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 1272w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-DnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png" width="1456" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:370501,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158893660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-DnP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 424w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 848w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 1272w, https://substackcdn.com/image/fetch/$s_!-DnP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98070c97-a2c5-4200-8bff-48c5c03ac6d9_1500x599.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's consider the memory access complexity (or memory bandwidth complexity). It measures the total amount of data that must be transferred between memory and compute units during the entire sequence of operations. This measures bandwidth requirements. For MHA, at each decoding step, we need to load <em><strong>W<sup>Q</sup></strong></em>, <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>V</sup></strong></em>, and <em><strong>W<sup>O</sup></strong></em>. They are all <em><strong>d<sub>model</sub> x d<sub>model</sub></strong></em> matrices, so over <em><strong>N</strong></em> decoding steps, the memory access complexity is <em><strong>~O(Nd<sub>model</sub><sup>2</sup>)</strong></em>. As we generate each token, we must reload the entire history. 
For the i-th token, we load keys and values of size <em><strong>i x d<sub>model</sub></strong></em>. Summing over all <em><strong>N</strong></em> steps:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i=1}^N i\\cdot d_\\text{model}=d_\\text{model}\\frac{N(N+1)}{2}\\sim \\mathcal{O}(d_\\text{model}N^2)&quot;,&quot;id&quot;:&quot;EACBBDTMAA&quot;}" data-component-name="LatexBlockToDOM"></div><p>We also need to load the <em><strong>N</strong></em> input hidden states of size <em><strong>d<sub>model</sub></strong></em>. So, the overall memory access complexity for MHA is </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}(d_\\text{model}N+  Nd_\\text{model}^2+ d_\\text{model}N^2)&quot;,&quot;id&quot;:&quot;VXHUBBMKKP&quot;}" data-component-name="LatexBlockToDOM"></div><p>For long sequences, <em><strong>d<sub>model</sub>N<sup>2</sup></strong></em> dominates, which is the problematic bottleneck.</p><p>For MQA, loading <em><strong>W<sup>Q</sup></strong></em>, <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>V</sup></strong></em>, and <em><strong>W<sup>O</sup></strong></em> is the same asymptotic behavior <em><strong>~O(Nd<sub>model</sub><sup>2</sup>)</strong></em> and the input hidden states as well <em><strong>~O(Nd<sub>model</sub>)</strong></em>. 
However, the <em><strong>K</strong></em> and <em><strong>V</strong></em> matrices have a size <em><strong>i x d<sub>head</sub></strong></em> at the i-th decoding step, which leads to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sum_{i=1}^N i\\cdot d_\\text{head}=d_\\text{head}\\frac{N(N+1)}{2}\\sim \\mathcal{O}(d_\\text{head}N^2)&quot;,&quot;id&quot;:&quot;RYFWBRTOLD&quot;}" data-component-name="LatexBlockToDOM"></div><p>So, the overall memory access complexity is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}(d_\\text{model}N+  Nd_\\text{model}^2+ d_\\text{head}N^2)&quot;,&quot;id&quot;:&quot;HLBIORQHXA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since the dominant term for long sequences shrinks from <em><strong>d<sub>model</sub>N<sup>2</sup></strong></em> to <em><strong>d<sub>head</sub>N<sup>2</sup></strong></em> (a factor of <em><strong>n<sub>head</sub></strong></em>), the impact is substantial: less memory traffic translates directly into lower latency. For the specific experiments run in Shazeer's paper, he found that MQA achieves ~12x faster decoder inference with minimal performance loss. For inference with a sequence length of 128 tokens, MHA's decoding time per token was 46 &#956;s, whereas MQA's was 3.8 &#956;s.</p><h2>Grouped&#8209;Query Attention</h2><p>With multi-query attention, we gain decoding speed but lose some performance compared to multi-head attention. <a href="https://arxiv.org/pdf/2305.13245">Grouped-query attention</a> provides a middle ground between MQA and MHA that keeps quality high while still improving decoding speed. GQA divides the query heads into <em><strong>G</strong></em> groups, where each group shares a single key head and value head. 
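</p>

<p>A minimal NumPy sketch of the grouped sharing for one decoding step (illustrative; real implementations vectorize this loop across heads):</p>

```python
import numpy as np

def gqa_decode_step(q_heads, K_groups, V_groups):
    """q_heads: (n_head, d_head); K_groups, V_groups: (G, N, d_head).
    Query head h uses the key/value pair of group h // (n_head // G)."""
    n_head, d_head = q_heads.shape
    G = K_groups.shape[0]
    heads_per_group = n_head // G
    ctx = np.empty_like(q_heads)
    for h in range(n_head):
        K = K_groups[h // heads_per_group]
        V = V_groups[h // heads_per_group]
        s = q_heads[h] @ K.T / np.sqrt(d_head)  # (N,) alignment scores for head h
        w = np.exp(s - s.max())
        w /= w.sum()
        ctx[h] = w @ V                          # (d_head,) context for head h
    return ctx                                  # G=1 -> MQA; G=n_head -> MHA

rng = np.random.default_rng(3)
q = rng.normal(size=(8, 4))                     # 8 query heads of dim 4
Kg, Vg = rng.normal(size=(2, 2, 6, 4))          # G=2 groups, 6 cached tokens
ctx = gqa_decode_step(q, Kg, Vg)
```

<p>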
This creates a configurable spectrum:</p><ul><li><p>When <em><strong>G = 1</strong></em>, it is equivalent to MQA (single key-value head for all queries)</p></li><li><p>When <em><strong>G = n<sub>head</sub></strong></em>, it is equivalent to standard MHA (separate key-value for each query)</p></li><li><p><em><strong>1 &lt; G &lt; n<sub>head</sub></strong></em>, we have the GQA sweet spot that balances efficiency and quality</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fm_M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fm_M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 424w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 848w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Fm_M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png" width="1456" height="896" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:724775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158893660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fm_M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 424w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 848w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 1272w, https://substackcdn.com/image/fetch/$s_!Fm_M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47adb729-f210-45ea-aee9-0c126ec65041_1500x923.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It is important to understand that latency is not proportional to memory access. Memory-level parallelism refers to a computer system's ability to process multiple memory operations simultaneously rather than sequentially. When we load too few key-value groups, the hardware's memory-level parallelism is underutilized; when we load too many, it saturates. 
Because of this parallelism, it is quite likely that <em><strong>G = 1</strong></em> (MQA) induces as much latency as <em><strong>G = 4</strong></em>, even though the larger group count delivers better predictive performance. The relationship between the number of groups and latency is highly hardware-specific, which is why empirical testing is necessary to find the optimal configuration for any given system. The TPUs used in the original paper showed this saturation around <em><strong>G = 8</strong></em>, but different architectures might have different saturation points.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aWn1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aWn1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 424w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 848w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 1272w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aWn1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png" width="1456" height="863" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:863,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278593,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158893660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aWn1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 424w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 848w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 1272w, https://substackcdn.com/image/fetch/$s_!aWn1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87aa47e5-756f-42f6-a391-a95a41243141_1500x889.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">From https://arxiv.org/pdf/2305.13245</figcaption></figure></div><p>GQA achieves quality close to MHA with speed comparable to MQA. GQA is also more stable during training than pure MQA, which can sometimes exhibit training instability. 
For example, on the T5-XXL model, the GQA-8 version presented in the original paper achieved an average performance of 47.1 across key benchmarks compared to 47.2 for MHA and 46.6 for MQA, while maintaining inference speeds much closer to MQA.</p><h3>From MHA to GQA</h3><p>One of the main appeals of GQA is the ability to convert a vanilla MHA model into GQA with minimal effort. The ability to convert existing models rather than train new ones has accelerated the adoption of GQA across the industry. Large models like <a href="https://arxiv.org/pdf/2305.10403">PaLM 2</a> and <a href="https://arxiv.org/pdf/2307.09288">LLaMA 2</a> have incorporated GQA, partly because this conversion process made it practical to do so. Converting from MHA to GQA is a two-step process:</p><ul><li><p><strong>Checkpoint Conversion:</strong> We first convert the model's weights by mean-pooling the key and value projection matrices within each group.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3tk1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3tk1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 424w, https://substackcdn.com/image/fetch/$s_!3tk1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 848w, 
https://substackcdn.com/image/fetch/$s_!3tk1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 1272w, https://substackcdn.com/image/fetch/$s_!3tk1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3tk1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png" width="1456" height="1056" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1056,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:927368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158893660?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3tk1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 424w, 
https://substackcdn.com/image/fetch/$s_!3tk1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 848w, https://substackcdn.com/image/fetch/$s_!3tk1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 1272w, https://substackcdn.com/image/fetch/$s_!3tk1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff96365f7-3fce-4a18-bbe0-e5f64cd43cc5_1500x1088.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li><li><p><strong>Additional Pre-training:</strong> After conversion, we continue pre-training the model for a small fraction (about 5%) of the original training steps, using the same pre-training dataset and objectives.</p></li></ul><p>This approach requires only about 5% of the original training compute to achieve performance comparable to the original MHA model. For large models that cost millions to train, this represents enormous savings.</p><h2>DeepSeek&#8217;s Multi-head Latent Attention</h2><p>Multi-Head Latent Attention (MLA) was introduced in <a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2</a> as a way to optimize training and inference speed while preserving predictive performance. Even though GQA keeps performance high, it still has to trade some quality for efficiency. MLA improves on both fronts at the same time.</p>
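<p>The mean-pooling checkpoint conversion described above can be sketched in a few lines. This is a minimal NumPy illustration, not code from the GQA paper; the weight layout (one projection matrix per head) and the function name are assumptions made for clarity:</p>

```python
import numpy as np

def mean_pool_kv_heads(w, n_groups):
    """Convert per-head K or V projection weights (MHA) into n_groups shared
    heads (GQA) by mean-pooling the heads inside each group.

    w: shape (n_head, d_model, d_head), one projection matrix per query head.
    Returns: shape (n_groups, d_model, d_head).
    """
    n_head, d_model, d_head = w.shape
    assert n_head % n_groups == 0, "heads must divide evenly into groups"
    # Put the heads belonging to the same group along one axis, then average.
    grouped = w.reshape(n_groups, n_head // n_groups, d_model, d_head)
    return grouped.mean(axis=1)

# Example: 8 query heads pooled into G = 2 shared key heads.
w_k = np.random.default_rng(0).normal(size=(8, 64, 16))
w_k_gqa = mean_pool_kv_heads(w_k, n_groups=2)
print(w_k_gqa.shape)  # (2, 64, 16)
```

<p>After this conversion, the pooled heads are a reasonable initialization but not yet optimal, which is why the short uptraining phase is still needed.</p>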
      <p>
          <a href="https://newsletter.theaiedge.io/p/how-to-improve-decoding-latency-with">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Reduce The Memory Usage Of The Self-Attention]]></title><description><![CDATA[With a bit of magic, we take a very inefficient computation like the Self-Attention and make it super memory-optimized for the specific hardware we use for training and inference.]]></description><link>https://newsletter.theaiedge.io/p/how-to-reduce-the-memory-usage-of</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/how-to-reduce-the-memory-usage-of</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Thu, 06 Mar 2025 16:03:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!K5lb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>With a bit of magic, we take a very inefficient computation like the Self-Attention and make it super memory-optimized for the specific hardware we use for training and inference. And we all need a bit of magic!</strong></em></p><ul><li><p><em><strong>Self-attention Does Not Need O(N<sup>2</sup>) Memory</strong></em></p></li><li><p><em><strong>The GPU Architecture</strong></em></p></li><li><p><em><strong>The FlashAttention-1</strong></em></p></li><li><p><em><strong>The FlashAttention-2</strong></em></p></li><li><p><em><strong>The FlashAttention-3</strong></em></p></li></ul><div><hr></div><p>So far, we have mainly explored how to reduce the complexity of the attention mechanisms by approximating the vanilla attention. The vanilla attention has a strict <em><strong>O(N<sup>2</sup>)</strong></em> time complexity, but the space complexity doesn't need to be <em><strong>O(N<sup>2</sup>)</strong></em>! 
Computing <em><strong>Q<sup>T</sup>K</strong></em> requires ~<em><strong>O(N<sup>2</sup>)</strong></em> operations, but the full <em><strong>N x N</strong></em> alignment scores and attention matrices do not need to be fully materialized all at once in memory. As models and sequence lengths scale, it becomes essential to minimize the memory requirements at training and inference time to better utilize the underlying hardware.  </p><h2>Self-attention Does Not Need <em><strong>O(N<sup>2</sup>)</strong></em> Memory</h2><p>Let's consider again the computation of the context vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n      \\mathbf{c}_i &amp;= \\frac{\\sum_{j=1}^N \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d_\\text{model}}}\\right) \\mathbf{v}_j}{\\sum_{j=1}^N \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d_\\text{model}}}\\right)}. \n\n\\end{align}&quot;,&quot;id&quot;:&quot;PNONKVBILP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here is how we arrive at the typical <em><strong>O(N<sup>2</sup>)</strong></em> space complexity:</p><ul><li><p>The typical assumption is that we first compute the dot product between the query <em><strong>q<sub>i</sub></strong></em> and all the keys <em><strong>[k<sub>1</sub>, &#8230;,k<sub>N</sub>]</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;  \\mathbf{e}_i = \\left[\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_1}{\\sqrt{d_\\text{model}}}, \\ldots, \\frac{\\mathbf{q}_i^\\top \\mathbf{k}_N}{\\sqrt{d_\\text{model}}}\\right] &quot;,&quot;id&quot;:&quot;SMFSLEBUDQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Where <em><strong>e<sub>i</sub></strong></em> is the alignment score vector of size <em><strong>N</strong></em> for the query <em><strong>q<sub>i</sub></strong></em>, which leads to the <em><strong>N x N</strong></em> matrix for the <em><strong>N</strong></em> queries. 
</p></li><li><p>We then perform the softmax transformation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{a}_i = \\frac{1}{\\sum_{j=1}^N\\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d_\\text{model}}}\\right)}\\left[\\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_1}{\\sqrt{d_\\text{model}}}\\right), \\ldots, \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_N}{\\sqrt{d_\\text{model}}}\\right)\\right] &quot;,&quot;id&quot;:&quot;EBQVFWJRZK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here,  <em><strong>a<sub>i</sub></strong></em> is the attention vector of size N for the query <em><strong>q<sub>i</sub></strong></em>. Again, for <em><strong>N</strong></em> queries, it leads to the typical <em><strong>N x N</strong></em> attention matrix. </p></li><li><p>And finally, we project <em><strong>a<sub>i</sub></strong></em> onto the different values <em><strong>[v<sub>1</sub>, &#8230;,v<sub>N</sub>]</strong></em>, which leads to <em><strong>c<sub>i</sub></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n      \\mathbf{c}_i &amp;= \\mathbf{a}_i^\\top V\\nonumber\\\\\n\n      &amp;=\\sum_{j=1}^Na_{ij}\\mathbf{v}_j\n\n\\end{align}&quot;,&quot;id&quot;:&quot;DTBMEPCVCK&quot;}" data-component-name="LatexBlockToDOM"></div></li></ul><p>Therefore, naively computing the alignment scores and the attention matrices first forces the materialization of those matrices in memory, which leads to the <em><strong>O(N<sup>2</sup>)</strong></em> space complexity. 
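<p>To make the memory cost concrete, here is a minimal NumPy sketch of this naive ordering (the function name and shapes are illustrative): both the score matrix and the attention matrix are materialized as full <em><strong>N x N</strong></em> arrays before the context vectors are computed.</p>

```python
import numpy as np

def naive_attention(Q, K, V):
    """Self-attention computed in the naive order: the full N x N score and
    attention matrices live in memory before the output is formed."""
    d_model = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_model)                     # N x N alignment scores e_i
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # N x N attention matrix a_i
    return weights @ V                                      # context vectors c_i, N x d

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(3, N, d))
C = naive_attention(Q, K, V)
print(C.shape)  # (6, 4)
```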
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K5lb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K5lb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 424w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 848w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 1272w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K5lb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png" width="1456" height="823" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58590382-6df0-40ff-8732-da015ddc2106_1500x848.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:315856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K5lb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 424w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 848w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 1272w, https://substackcdn.com/image/fetch/$s_!K5lb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58590382-6df0-40ff-8732-da015ddc2106_1500x848.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, we do not need to order the computations in this manner! In 2021, <a href="https://arxiv.org/pdf/2112.05682">Rabe and Staats</a> realized that by reordering the operations, we can greatly reduce the requirements on the memory. 
The idea is to consider the unnormalized context vector <em><strong>c&#771;<sub>i</sub> </strong></em>and the normalization constant <em><strong>&#8721;exp(q<sub>i</sub><sup>T</sup>k<sub>j</sub>)</strong></em> of the softmax transformation separately:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n      \\tilde{\\mathbf{c}}_i &amp;= \\sum_{j=1}^N \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d_\\text{model}}}\\right) \\mathbf{v}_j \\nonumber\\\\ s_i&amp;=\\sum_{j=1}^N\\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d_\\text{model}}}\\right)\\nonumber\\\\\n\n      \\mathbf{c}_i &amp;= \\frac{\\tilde{\\mathbf{c}}_i}{s_i}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;UYMUCCEAQK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Because <em><strong>c&#771;<sub>i</sub> </strong></em> and <em><strong>s<sub>i</sub></strong></em> are just sums, we can easily loop through the key-value pairs to compute the context vector:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TmHb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TmHb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 424w, https://substackcdn.com/image/fetch/$s_!TmHb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 848w, 
https://substackcdn.com/image/fetch/$s_!TmHb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 1272w, https://substackcdn.com/image/fetch/$s_!TmHb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TmHb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png" width="1456" height="463" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:463,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:169243,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TmHb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 424w, 
https://substackcdn.com/image/fetch/$s_!TmHb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 848w, https://substackcdn.com/image/fetch/$s_!TmHb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 1272w, https://substackcdn.com/image/fetch/$s_!TmHb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4218b5a-3ec5-422f-a217-b395d6270212_1956x622.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>At any point during the for-loop, we only need to store the intermediary values of <em><strong>c&#771;<sub>i</sub> </strong></em> and <em><strong>s<sub>i</sub></strong></em>. <em><strong>c&#771;<sub>i</sub></strong></em> is a vector of size <em><strong>d<sub>model</sub></strong></em> (ignoring heads for simplicity), and <em><strong>s<sub>i</sub></strong></em> is a scalar. Therefore, for one query, we need constant space complexity <em><strong>O(1)</strong></em> to compute one context vector. Even iterating through all the queries, we never need to capture more than the intermediary values of <em><strong>c&#771;<sub>i</sub></strong></em> and <em><strong>s<sub>i</sub></strong></em>, so we can compute the full attention mechanism in <em><strong>O(1)</strong></em> space complexity:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fax_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fax_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 424w, https://substackcdn.com/image/fetch/$s_!Fax_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 848w, https://substackcdn.com/image/fetch/$s_!Fax_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Fax_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fax_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png" width="1456" height="634" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:185994,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fax_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 424w, https://substackcdn.com/image/fetch/$s_!Fax_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 848w, 
https://substackcdn.com/image/fetch/$s_!Fax_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 1272w, https://substackcdn.com/image/fetch/$s_!Fax_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c302a6-e300-4ef0-8623-723dc7fe723b_1952x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!RJVe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RJVe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 424w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 848w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 1272w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RJVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:371261,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RJVe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 424w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 848w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 1272w, https://substackcdn.com/image/fetch/$s_!RJVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62fd9f00-3e53-49ad-b498-3664a5ce2c35_1500x857.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In reality, this is not a practical solution because sequential operations are not adapted to the parallelization capability of the CPU, GPU, or TPU hardware that is commonly used for neural network computations. In practice, the queries, keys, and values are partitioned into chunks to allow for a high degree of parallelization while keeping the memory requirement low. 
Let's assume that we partition the queries into <em><strong>n<sub>q</sub></strong></em> chunks and the keys and values into <em><strong>n<sub>k</sub></strong></em> chunks:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    Q&amp;=\\left[Q_1, Q_2, \\ldots, Q_{n_q}\\right] \\nonumber\\\\\n\n    K&amp;=\\left[K_1, K_2, \\ldots, K_{n_k}\\right] \\nonumber\\\\\n\n    V&amp;=\\left[V_1, V_2, \\ldots, V_{n_k}\\right]\n\n\\end{align}&quot;,&quot;id&quot;:&quot;TGQAHJOMDJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each <em><strong>Q<sub>i</sub></strong></em> is a <em><strong>N / n<sub>q</sub> x d<sub>model</sub></strong></em> matrix and <em><strong>K<sub>i</sub></strong></em>, <em><strong>V<sub>i</sub></strong></em> are <em><strong>N / n<sub>k</sub> x d<sub>model</sub></strong></em> matrices. Let's call <em><strong>N<sub>q </sub>= N / n<sub>q</sub></strong></em>, the number of queries per chunk, and <em><strong>N<sub>k </sub>= N / n<sub>k</sub></strong></em>, the number of key-value pairs per chunk. 
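To make the bookkeeping concrete, the partitioning above can be sketched in a few lines of NumPy (the sizes N = 16, d_model = 4, n_q = 4, and n_k = 2 are illustrative assumptions, not values from the post):

```python
import numpy as np

# Toy sizes (assumptions for illustration): N = 16 tokens, d_model = 4,
# n_q = 4 query chunks, n_k = 2 key/value chunks.
N, d_model, n_q, n_k = 16, 4, 4, 2

rng = np.random.default_rng(0)
Q = rng.standard_normal((N, d_model))
K = rng.standard_normal((N, d_model))
V = rng.standard_normal((N, d_model))

# Split along the sequence axis: each Q_i is (N / n_q) x d_model and each
# K_j, V_j is (N / n_k) x d_model.
Q_chunks = np.array_split(Q, n_q)
K_chunks = np.array_split(K, n_k)
V_chunks = np.array_split(V, n_k)

assert all(c.shape == (N // n_q, d_model) for c in Q_chunks)
assert all(c.shape == (N // n_k, d_model) for c in K_chunks + V_chunks)
```
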
We can now iterate through the chunks exactly in the same way:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-4T0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-4T0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 424w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 848w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 1272w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-4T0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png" width="1456" height="691" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:691,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267033,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-4T0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 424w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 848w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 1272w, https://substackcdn.com/image/fetch/$s_!-4T0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bcf646f-e9a2-407c-a8e7-ca4f537a0e16_1956x928.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xKcQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xKcQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 424w, 
https://substackcdn.com/image/fetch/$s_!xKcQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 848w, https://substackcdn.com/image/fetch/$s_!xKcQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 1272w, https://substackcdn.com/image/fetch/$s_!xKcQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xKcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png" width="1456" height="946" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:541313,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/158491599?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xKcQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 424w, https://substackcdn.com/image/fetch/$s_!xKcQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 848w, https://substackcdn.com/image/fetch/$s_!xKcQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 1272w, https://substackcdn.com/image/fetch/$s_!xKcQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ed556f9-1e77-44a2-aa7d-42357353d833_1500x975.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>As before, we need to store intermediary values of <em><strong>C&#771;<sub>i</sub></strong></em> and <em><strong>S<sub>i</sub></strong></em>. In this context, <em><strong>Q<sub>i</sub>K<sup>T</sup><sub>j</sub></strong></em> is a matrix of size <em><strong>N<sub>q</sub> x N<sub>k</sub></strong></em>, and so is <em><strong>A<sub>ij</sub></strong></em>. <em><strong>C&#771;<sub>i</sub></strong></em> is a matrix of size <em><strong>N<sub>q</sub> x d<sub>model</sub></strong></em> and <em><strong>S<sub>i</sub></strong></em> is a vector of size <em><strong>N<sub>q</sub></strong></em>. Therefore the space complexity is <em><strong>O(N<sub>q</sub> x N<sub>k</sub> + N<sub>q</sub> x d<sub>model</sub>)</strong></em>. To balance the number of chunks and the number of key-value pairs per chunk, they chose <em><strong>N<sub>k </sub>= n<sub>k </sub>= &#8730;N</strong></em> and fixed <em><strong>N<sub>q </sub></strong></em><strong>= 1024</strong>. This results in a space complexity:  </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}(1024\\sqrt{N} + 1024 d_\\text{model})=\\mathcal{O}(\\sqrt{N})&quot;,&quot;id&quot;:&quot;MJXDQGQCYZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This approach allows for efficient tensor operations within each chunk while dramatically reducing the peak memory requirements. Note that no approximation has been made, and it is mathematically equivalent to the vanilla attention mechanism. 
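As a sanity check, the chunked computation can be sketched in a few lines of NumPy (toy sizes, and unstabilized exponentials for clarity; this illustrates the idea rather than reproducing the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 16, 4                          # toy sizes (assumptions)
n_q, n_k = 4, 4                       # number of query and key/value chunks
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Reference: vanilla attention, materializing the full N x N matrix.
S = Q @ K.T / np.sqrt(d)
C_full = (np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)) @ V

# Chunked version: for each query chunk, stream over the key/value chunks,
# keeping only the running numerator C_tilde (N_q x d) and the running
# normalizer s (N_q,) in memory.
out = []
for Qi in np.array_split(Q, n_q):
    C_tilde = np.zeros((Qi.shape[0], d))
    s = np.zeros(Qi.shape[0])
    for Kj, Vj in zip(np.array_split(K, n_k), np.array_split(V, n_k)):
        expS = np.exp(Qi @ Kj.T / np.sqrt(d))   # N_q x N_k block of scores
        C_tilde += expS @ Vj
        s += expS.sum(axis=1)
    out.append(C_tilde / s[:, None])
C_chunked = np.vstack(out)

assert np.allclose(C_full, C_chunked)           # exact, not an approximation
```
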
This approach is slower (8-13% during the forward pass and 30-35% during the backward pass) because of its sequential computations, but it enables processing much longer sequences that would otherwise be impossible due to memory constraints. Along with <a href="https://arxiv.org/pdf/2205.14135">FlashAttention</a>, it is one of the memory optimization strategies used in the <a href="https://github.com/facebookresearch/xformers">xFormers</a> package developed by Meta and used to develop the Llama models.</p><h2>Stabilizing The Computations</h2><p>Until now, we have been ignoring the numerical stability of the softmax computation, but most implementations (PyTorch, TensorFlow, ...) use a couple of tricks to ensure it. Let's remind ourselves of the softmax function:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\text{Softmax}(x_i) = \\frac{e^{x_i}}{\\sum_{j=1}^N e^{x_j}}&quot;,&quot;id&quot;:&quot;NQUSTJLPNZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Computing <em><strong>e<sup>xi</sup></strong></em> can be tricky: if <em><strong>x<sub>i</sub> &#8805; 89</strong></em>, then <em><strong>e<sup>xi</sup></strong></em><strong> &#8773; 4.4e38</strong>, which exceeds the maximum value of <em><strong>3.4e38</strong></em> representable by a 32-bit float, leading to overflow errors. To prevent this, we typically shift the exponent by the maximum <em><strong>x<sub>i</sub></strong></em> value:</p>
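This max-subtraction trick can be sketched in pure Python (a minimal illustration, not the post's own code):

```python
import math

def softmax_stable(xs):
    # Subtract the max before exponentiating: every exponent is <= 0, so
    # exp() can never overflow; the common factor exp(-m) cancels between
    # the numerator and the denominator, leaving the result unchanged.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000.0) would overflow even in 64-bit floats (and x_i >= 89
# already overflows in 32-bit floats), but the shifted version is safe:
probs = softmax_stable([1000.0, 999.0, 998.0])
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] > probs[1] > probs[2]
```
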
      <p>
          <a href="https://newsletter.theaiedge.io/p/how-to-reduce-the-memory-usage-of">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Linearize The Attention Mechanism!]]></title><description><![CDATA[Today, we talk about how to engineer attention mechanisms in O(n) complexity instead of O(n2).]]></description><link>https://newsletter.theaiedge.io/p/how-to-linearize-the-attention-mechanism</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/how-to-linearize-the-attention-mechanism</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 26 Feb 2025 16:01:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92c74496-f93d-4235-b5ae-ad9be59a914d_1500x675.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Today, we talk about how to engineer attention mechanisms in O(n) complexity instead of O(n<sup>2</sup>). This newsletter tends to be a bit more math-flavored than my usual content, but it is liberating to be able to use math for the greater good! </strong></em></p><ul><li><p><em><strong>Low-Rank Projection of Attention Matrices: Linformer</strong></em></p></li><li><p><em><strong>Recurrent Attention Equivalence: The Linear Transformer</strong></em></p></li><li><p><em><strong>Kernel Approximation: Performer</strong></em></p></li></ul><div><hr></div><p>Self-attention&#8217;s quadratic complexity in sequence length has long been a central bottleneck for large-scale Transformer models. Handling tens of thousands of tokens becomes computationally prohibitive and can quickly exhaust available memory. Linear attention mechanisms represent a paradigm shift in transformer architecture by mathematically re-engineering the attention operation to achieve <em><strong>O(n)</strong></em> complexity while maintaining global context awareness. 
Unlike sparse attention's pattern restrictions, which limit interactions to predefined token subsets, linear attention fundamentally redefines how all tokens interact by reformulating the attention matrix computation rather than pruning token interactions. Where sparse attention sacrifices theoretical completeness for practical speed, linear attention preserves global relationships at the cost of approximating pairwise token influences. This enables native handling of extreme sequence lengths (<em>1M+</em> tokens) while avoiding sparse attention's blind spots.</p><h2>Low-Rank Projection of Attention Matrices: Linformer</h2><p>With sparse attention mechanisms, we saw that most of the token interaction information is contained in a small subset of token pairs. <a href="https://arxiv.org/pdf/2006.04768">Linformer</a> introduced the idea that the token-token interaction matrix could be compressed into a smaller representation without too much information loss. Instead of computing the full <em><strong>N x N</strong></em> interaction <em><strong>Q<sup>T</sup>K / &#8730;d</strong></em> (ignoring heads for simplicity), we could first project <em><strong>K</strong></em> into a lower-rank dimension <em><strong>k</strong></em>, and compute the lower-rank <em><strong>N x k</strong></em> approximation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\frac{Q^\\top KE}{\\sqrt{d}}&quot;,&quot;id&quot;:&quot;JKDKZQCFIT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>E</strong></em> is an <em><strong>N x k</strong></em> projection matrix that projects <em><strong>K</strong></em> from the original dimension <em><strong>d x N</strong></em> to <em><strong>d x k</strong></em>. 
This leads to <em><strong>N x k</strong></em> alignment score and attention matrices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NsmD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NsmD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 424w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 848w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 1272w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NsmD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png" width="1456" height="515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:515,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222275,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/157921789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NsmD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 424w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 848w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 1272w, https://substackcdn.com/image/fetch/$s_!NsmD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa341a472-41e5-475e-9405-b64b0b65e3f6_1500x531.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When we project with <em><strong>E</strong></em>, the approximation leads to the error:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{error} = \\left\\vert\\frac{Q^\\top K}{\\sqrt{d}}-\\frac{Q^\\top EK}{\\sqrt{d}} \\right\\vert&quot;,&quot;id&quot;:&quot;IOPXRIDWGB&quot;}" data-component-name="LatexBlockToDOM"></div><p>If the elements of <em><strong>E</strong></em> follow a Gaussian distribution ~<em><strong>N(0, 1/k)</strong></em>, the <a href="https://api.semanticscholar.org/CorpusID:117819162">Johnson&#8211;Lindenstrauss lemma</a> guarantees that:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P\\left[\\text{error} >\\epsilon\\right]\\leq e^{-\\gamma\\epsilon^2 k}.&quot;,&quot;id&quot;:&quot;VRYEADHXVE&quot;}" 
data-component-name="LatexBlockToDOM"></div><p>This means that the probability that we choose <em><strong>E</strong></em> such that the error is greater than <em><strong>&#120540;</strong></em> is bounded by <em><strong>exp(-&#120690;&#120540;<sup>2</sup>k)</strong></em>, where <em><strong>&#120690;</strong></em> is just a scaling constant. If we choose <em><strong>k &#8594; &#8734;</strong></em>, then <em><strong>P[error &gt; &#120540;] &#8594; 0</strong></em> for any <em><strong>&#120540;</strong></em>. A good choice is <em><strong>k ~ log N / &#120540;<sup>2</sup></strong></em>, yielding:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P\\left[\\text{error} >\\epsilon\\right]\\leq N^{-\\gamma}.&quot;,&quot;id&quot;:&quot;XMEKFCFXZO&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means that we can choose an arbitrarily small <em><strong>&#120540;</strong></em> such that <em><strong>P[error &gt; &#120540;] &#8594; 0</strong></em> as the sequence length increases <em><strong>N &#8594; &#8734;</strong></em>. Treat this as a theoretical guide: it tells us that choosing <em><strong>k ~ log N</strong></em> guarantees smaller errors as <em><strong>N</strong></em> increases. In practice, <em><strong>k</strong></em> is chosen independently of <em><strong>N</strong></em>, leading to the <em><strong>O(N)</strong></em> linear complexity while accepting the cost of the approximation error. Additionally, <em><strong>E</strong></em> is not a fixed random matrix but a parameter that the model learns during training. For example, the authors showed that choosing <em><strong>k = 64</strong></em> with <em><strong>N = 512</strong></em> leads to slightly worse performance than the full attention. 
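As a quick illustration, here is a minimal NumPy sketch of this low-rank attention (function names are mine; E and F are drawn as fixed Gaussian matrices here, whereas the actual model learns them during training):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, k, rng):
    """Low-rank attention: project the N keys/values down to k rows."""
    N, d = Q.shape
    # E, F ~ N(0, 1/k); in the real model they are learned parameters.
    E = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, N))
    F = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, N))
    scores = Q @ (E @ K).T / np.sqrt(d)  # (N, k) instead of (N, N)
    A = softmax(scores, axis=-1)         # attention over k projected keys
    return A @ (F @ V)                   # (N, d) context vectors

rng = np.random.default_rng(0)
N, d, k = 512, 64, 64
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
C = linformer_attention(Q, K, V, k, rng)
print(C.shape)  # (512, 64)
```

Note that the attention matrix has shape (N, k) rather than (N, N), which is exactly where the O(Nk) cost comes from.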
</p><p>Since the attention matrix has dimension <em><strong>N x k</strong></em>, we also need to project the values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C = \\text{Softmax}\\left(\\frac{Q^\\top EK}{\\sqrt{d}}\\right)FV&quot;,&quot;id&quot;:&quot;RQEWJRLRWA&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>F</strong></em> is the <em><strong>N x k</strong></em> projection matrix for the tensor <em><strong>V</strong></em>. As for <em><strong>E</strong></em>, <em><strong>F</strong></em> is also learned during training. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z--Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z--Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 424w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 848w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 1272w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Z--Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png" width="1456" height="877" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:877,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:575795,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/157921789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z--Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 424w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 848w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 1272w, https://substackcdn.com/image/fetch/$s_!Z--Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e57b90f-205a-481c-bf71-cd250d5ff5e9_1500x903.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Projecting the keys and values <em><strong>EK</strong></em>, <em><strong>FV</strong></em> leads to complexity <em><strong>O(Nk)</strong></em>. Computing the alignment scores <em><strong>Q<sup>T</sup>EK</strong></em> and the context vectors <em><strong>C = AFV</strong></em> are also following <em><strong>O(Nk)</strong></em>. 
Since we fix <em><strong>k</strong></em>, the overall time and space complexity is <em><strong>O(N).</strong></em> </p><h2>Recurrent Attention Equivalence: The Linear Transformer</h2><p>So far, we have accepted the attention mechanism to be represented by the following computation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    C &amp;= \\text{Softmax}\\left(\\frac{Q^\\top K}{\\sqrt{d}}\\right)V \\quad \\text{or}\\nonumber\\\\\n\n      \\mathbf{c}_i &amp;= \\frac{\\sum_{j=1}^N \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d}}\\right) \\mathbf{v}_j}{\\sum_{j=1}^N \\exp\\left(\\frac{\\mathbf{q}_i^\\top \\mathbf{k}_j}{\\sqrt{d}}\\right)} \\quad \\text{for individual vectors} \n\n\\end{align}&quot;,&quot;id&quot;:&quot;YTQKJIGABO&quot;}" data-component-name="LatexBlockToDOM"></div><p>However, this specific analytical choice is not the only one that could be chosen to fulfill the same functional role in capturing pairwise interactions between tokens. Let's review the roles of the different elements in this equation:</p><ul><li><p><strong>The dot-product </strong><em><strong>Q<sup>T</sup>K</strong></em><strong>: Similarity computation</strong>. For each query vector, it tells you how "compatible" or similar it is to each key vector. This yields a matrix of unnormalized attention scores.</p></li><li><p><strong>Normalizing by </strong><em><strong>&#8730;d</strong></em><strong>: Variance control.</strong> The primary purpose of scaling by <em><strong>&#8730;d</strong></em> is to control the scale of the attention logits before softmax, ensuring stable gradient flow and preventing the softmax from becoming too "confident" (peaked). 
Furthermore, extremely large logits can cause numerical instability (e.g., NaN in floating-point arithmetic), and scaling mitigates this.</p></li><li><p><strong>Softmax operation: Normalization and nonlinearity.</strong> The softmax turns the unnormalized similarity scores into a probability distribution, amplifying the effect of the most relevant keys.</p></li><li><p><strong>Multiplication by </strong><em><strong>V</strong></em><strong>: Weighted aggregation.</strong> Each output is a weighted sum of the values, where the weights come from the normalized similarity scores. This is how the model &#8220;mixes&#8221; information from across the input sequence.</p></li></ul><p>Functionally, we need a similarity function <em><strong>sim</strong></em> that is non-linear and captures the pairwise token interaction:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{c}_i = \\frac{\\sum_{j=1}^N \\text{sim}(\\mathbf{q}_i, \\mathbf{k}_j) \\mathbf{v}_j}{\\sum_{j=1}^N \\text{sim}(\\mathbf{q}_i, \\mathbf{k}_j)}&quot;,&quot;id&quot;:&quot;HUVKXEOWLB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the denominator ensures that the attention weights sum to 1. If we choose <em><strong>sim(q<sub>i</sub>, k<sub>j</sub>) = exp(q<sub>i</sub><sup>T</sup>k<sub>j</sub> / &#8730;d)</strong></em>, we recover the softmax transformation. The <a href="https://arxiv.org/pdf/2006.16236">Linear Transformer</a> proposed a new attention mechanism with a different analytical form, but with similar functional roles. 
More specifically, they suggested a similarity function where we can factorize the contribution from the keys and the queries as a product:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\text{sim}(\\mathbf{q}_i, \\mathbf{k}_j)=\\phi(\\mathbf{q}_i)^\\top\\phi(\\mathbf{k}_j).&quot;,&quot;id&quot;:&quot;EGLRRDGPVD&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the context of kernel methods in machine learning, <em><strong>&#632;</strong></em> is called a "feature map". A feature map is a function that transforms an input vector into a new space, often a higher-dimensional one, so that a kernel function (which measures similarity) can be expressed as an inner product in that space. Essentially, <em><strong>&#632;</strong></em> extracts or "maps" the original features into a new representation where the desired similarity (that mimics the softmax behavior) is computed simply by taking a dot product. In the context of the Linear Transformer, they simply chose <em><strong>&#632;</strong></em> as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;  \\phi(x) = \\begin{cases} \n\nx+ 1 &amp; \\text{if } x > 0, \\\\\n\n\\exp x &amp; \\text{otherwise}.\n\n\\end{cases}&quot;,&quot;id&quot;:&quot;FAIVUTXUGR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This ensures that <em><strong>sim(q<sub>i</sub>, k<sub>j</sub>)</strong></em> is always positive and is computationally stable while being non-linear. 
The main appeal of this linearization of the similarity kernel is the associativity property of the matrix multiplication:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{c}_i = \\frac{\\sum_{j=1}^N \\phi(\\mathbf{q}_i)^\\top\\phi(\\mathbf{k}_j)\\mathbf{v}_j}{\\sum_{j=1}^N\\phi(\\mathbf{q}_i)^\\top\\phi(\\mathbf{k}_j)}=\\frac{ \\phi(\\mathbf{q}_i)^\\top\\sum_{j=1}^N \\phi(\\mathbf{k}_j)\\mathbf{v}_j^\\top}{\\phi(\\mathbf{q}_i)^\\top\\sum_{j=1}^N \\phi(\\mathbf{k}_j)}&quot;,&quot;id&quot;:&quot;VSBKZBOUAT&quot;}" data-component-name="LatexBlockToDOM"></div><p>For one key and one query, <em><strong>&#632;(q<sub>i</sub>)<sup>T</sup>&#632;(k<sub>j</sub>)</strong></em> takes <em><strong>d</strong></em> operations. Multiplying <em><strong>v<sub>j</sub></strong></em> by the resulting scalar alignment score takes another <em><strong>d</strong></em> operations. Therefore, for all the keys, computing <em><strong>&#8721;&#632;(q<sub>i</sub>)<sup>T</sup>&#632;(k<sub>j</sub>)v<sub>j</sub></strong></em> takes <em><strong>2Nd</strong></em> operations, and the time complexity is <em><strong>O(Nd)</strong></em> per query. Similarly, computing the denominator <em><strong>&#8721;&#632;(q<sub>i</sub>)<sup>T</sup>&#632;(k<sub>j</sub>)</strong></em> also costs <em><strong>O(Nd)</strong></em>. 
Because we have <em><strong>N</strong></em> queries, the total cost is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    \\mathcal{O}(N^2d_\\text{model})\\quad \\text{(our typical quadratic complexity!).}&quot;,&quot;id&quot;:&quot;ZOTYADKNDZ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LRGg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LRGg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 424w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 848w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 1272w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LRGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png" width="1456" height="1201" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1201,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:637653,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.theaiedge.io/i/157921789?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LRGg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 424w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 848w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 1272w, https://substackcdn.com/image/fetch/$s_!LRGg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f8e2b29-3520-46e4-8265-69cef0bea33e_1500x1237.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If we consider the multiplications in a different order, <em><strong>&#632;(k<sub>j</sub>)v<sub>j</sub><sup>T</sup></strong></em> is an outer product and results in <em><strong>d<sup>2</sup></strong></em> operations. For <em><strong>N</strong></em> keys and values, we end up with <em><strong>Nd<sup>2</sup></strong></em> operations for <em><strong>&#8721;&#632;(k<sub>j</sub>)v<sub>j</sub><sup>T</sup></strong></em>. In the denominator, summing the different keys <em><strong>&#8721;&#632;(k<sub>j</sub>)</strong></em> requires <em><strong>Nd</strong></em> operations. Let's call <em><strong>S = &#8721;&#632;(k<sub>j</sub>)v<sub>j</sub><sup>T</sup></strong></em> and <em><strong>z = &#8721;&#632;(k<sub>j</sub>)</strong></em>. 
<em><strong>S</strong></em> is a matrix of size <em><strong>d x d</strong></em> and <em><strong>z</strong></em> is a vector of size <em><strong>d</strong></em>. Computing <em><strong>&#632;(q<sub>i</sub>)<sup>T</sup>S</strong></em> adds another <em><strong>d<sup>2</sup></strong></em> operations, and computing <em><strong>&#632;(q<sub>i</sub>)<sup>T</sup>z</strong></em> takes <em><strong>d</strong></em> operations. Therefore, the cost of <em><strong>&#632;(q<sub>i</sub>)<sup>T</sup>S / &#632;(q<sub>i</sub>)<sup>T</sup>z</strong></em> per query is <em><strong>O(d<sup>2</sup> + d) = O(d<sup>2</sup>)</strong></em>. For <em><strong>N</strong></em> queries, we obtain a total complexity of:</p>
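The reordered computation above can be sketched in a few lines of NumPy (a minimal illustration with my own function names; the feature map is the elu(x) + 1 choice defined earlier). A quadratic softmax attention is included only as a shape reference, since the two mechanisms use different similarity kernels and do not produce identical outputs:

```python
import numpy as np

def phi(x):
    """Feature map from the Linear Transformer: elu(x) + 1, always positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Attention in O(N d^2) by reordering the multiplications."""
    S = phi(K).T @ V               # (d, d): sum_j phi(k_j) v_j^T
    z = phi(K).sum(axis=0)         # (d,):   sum_j phi(k_j)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

def softmax_attention(Q, K, V):
    """Quadratic O(N^2 d) reference (a different similarity kernel)."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    A = np.exp(s - s.max(axis=-1, keepdims=True))
    return (A / A.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
C = linear_attention(Q, K, V)
print(C.shape)  # (256, 32)
```

Because S and z do not depend on the query, they are computed once and reused for every query, which is what removes the N × N attention matrix.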
      <p>
          <a href="https://newsletter.theaiedge.io/p/how-to-linearize-the-attention-mechanism">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Understanding The Sparse Transformers!]]></title><description><![CDATA[The First Sparse Attention: Sparse Transformers]]></description><link>https://newsletter.theaiedge.io/p/understanding-the-sparse-transformers</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/understanding-the-sparse-transformers</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 19 Feb 2025 16:01:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c670f0a5-065a-435f-9f14-0551384609d9_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ul><li><p><em><strong>The First Sparse Attention: Sparse Transformers</strong></em></p></li><li><p><em><strong>Choosing Sparsity Efficiently: Reformer</strong></em></p></li><li><p><em><strong>Local vs Global Attention: Longformer and BigBird</strong></em></p></li></ul><div><hr></div><p>The original Transformer architecture introduced in the <a href="https://arxiv.org/pdf/1706.03762">"Attention is All You Need"</a> paper opened a whole new avenue of research, and it became useful to address some of the bottlenecks that this architecture came with. Today, modern Transformers rarely use the original vanilla Transformer blueprint without modifications. 
Instead, they often combine multiple techniques:</p><ul><li><p>Faster or more memory-friendly attentions (sparse, linear, or memory-efficient) for large contexts,</p></li><li><p>Improved positional schemes (RoPE, ALiBi, or relative embeddings),</p></li><li><p>Enhanced feed-forward layers (MoE, GLU variants), and</p></li><li><p>Better normalization/optimizer choices (RMSNorm, AdamW).</p></li></ul><p>These enhancements address the core bottlenecks (quadratic complexity, high memory usage, limited context, and potentially weak or rigid feed-forward/positional representations), making Transformers more scalable, expressive, and practical for today&#8217;s large language modeling tasks. </p><p>Self-attention&#8217;s quadratic complexity in sequence length has long been a central bottleneck for large-scale Transformer models. Handling tens of thousands of tokens becomes computationally prohibitive and can quickly exhaust available memory. In the original Transformer, each query token attends to all tokens in the sequence (including itself), resulting in <em><strong>O(N<sup>2</sup>)</strong></em> time and memory complexity for a sequence of length <em><strong>N</strong></em>. As context windows grow into the thousands or tens of thousands of tokens, this quadratic scaling becomes impractical, consuming excessive memory and computational resources.</p><p>Sparse attention addresses this bottleneck by restricting, or "sparsifying", which tokens can attend to which. Instead of forming attention connections from every token to every other token, sparse mechanisms allow each token to attend to a subset of the sequence according to a specific pattern. 
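To make "sparsifying" concrete, here is a toy NumPy sketch of one such pattern, combining a causal local window of size w with a strided set of keys (the specific values of w and c are illustrative, not taken from any particular model):

```python
import numpy as np

def sparse_causal_mask(N, w, c):
    """True where query i may attend key j: causal, and either within a
    local window of size w or at a strided position (every c-th key)."""
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    causal = j <= i
    local = (i - j) <= w          # keys in [i - w, i]
    strided = (i - j) % c == 0    # every c-th key before i
    return causal & (local | strided)

N, w, c = 512, 32, 32             # stride on the order of sqrt(N)
mask = sparse_causal_mask(N, w, c)
dense_pairs, sparse_pairs = N * N, int(mask.sum())
print(sparse_pairs < dense_pairs // 4)  # True: far fewer score computations
```

Only the (query, key) pairs where the mask is True need an alignment score, which is where the savings over the dense N&#178; computation come from.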
By reducing the total number of key/value pairs, sparse attention can often achieve <em><strong>O(N log N)</strong></em> or even <em><strong>O(N)</strong></em> complexity.</p><h2>The First Sparse Attention: Sparse Transformers</h2><p>One of the first attempts at sparse attention was proposed by <a href="https://arxiv.org/pdf/1904.10509">OpenAI in 2019</a>, and this was the strategy chosen to build <a href="https://arxiv.org/pdf/2005.14165">GPT-3</a>. The idea is to limit the number of keys each query can attend to when computing the alignment scores. This reduces the number of alignment scores and attention weights computed. Because of this, we need to subset the values as well to ensure we can compute the context vectors <em><strong>C = AV</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Agkd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Agkd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 424w, https://substackcdn.com/image/fetch/$s_!Agkd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 848w, https://substackcdn.com/image/fetch/$s_!Agkd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Agkd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Agkd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png" width="1456" height="683" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:683,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:569686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Agkd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 424w, https://substackcdn.com/image/fetch/$s_!Agkd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 848w, https://substackcdn.com/image/fetch/$s_!Agkd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Agkd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0f47827-9e97-478e-9441-3ff3b83f8f4f_2016x946.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI suggested two different sparse patterns where, in the different heads, the queries attend the keys differently. The first pattern is the<em> strided</em> pattern. 
One head focuses on a local window of nearby tokens by having the <em>i</em>-th query attend only to the keys in <em><strong>[i-w, i]</strong></em>, where <em><strong>w</strong></em> is the window size. For example, if <em><strong>w = 64</strong></em>, it means we only select the keys <em><strong>[i - 64, i - 63, &#8230;, i - 1, i]</strong></em>. The other heads focus on more global token interactions by having the <em>i</em>-th query attend every <em><strong>c</strong></em>-th key. <em><strong>c</strong></em> is the stride and can be different for each head. For example, if <em><strong>c = 8</strong></em>, then we would only select the keys <em><strong>[0, &#8230;, i - 24, i - 16, i - 8, i]</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FS09!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FS09!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 424w, https://substackcdn.com/image/fetch/$s_!FS09!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 848w, https://substackcdn.com/image/fetch/$s_!FS09!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FS09!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FS09!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png" width="1456" height="995" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:995,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:411601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FS09!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 424w, https://substackcdn.com/image/fetch/$s_!FS09!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 848w, https://substackcdn.com/image/fetch/$s_!FS09!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!FS09!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3fa13c9-8733-427c-95c3-20e0cf7de4e1_2016x1378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They observed that the strided pattern (attending every <em>k</em>-th position) did not work well for text data, which lacks a naturally periodic structure. As a result, they introduced a <em>fixed</em> attention pattern. In one attention head, the sequence is divided into blocks of fixed size (e.g., 128 tokens). 
Each token within a block attends only to other tokens in that same block, capturing local dependencies in a more straightforward manner.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zlpO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zlpO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 424w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 848w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 1272w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zlpO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png" width="1456" height="729" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:729,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:599776,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zlpO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 424w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 848w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 1272w, https://substackcdn.com/image/fetch/$s_!zlpO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa4cf5e-bc45-4d23-b97a-803f73ca6874_2100x1051.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>However, using purely local attention inside blocks would prevent information from flowing across blocks in deeper layers. To address this, another head connects the "summary token" at the end of each block to the corresponding summary tokens of all previous blocks. Because that last token has attended to all tokens in its own block, its hidden state acts as a summary of that entire sub-sequence. 
By letting future blocks attend to these summary tokens, the model propagates information globally across blocks, layer by layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NRWI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NRWI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 424w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 848w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 1272w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NRWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:793358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NRWI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 424w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 848w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 1272w, https://substackcdn.com/image/fetch/$s_!NRWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8870b653-e8e0-4c39-b1f0-4b96f3baa6fc_2100x1247.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Thus, the in-block pattern computes context vectors (weighted averages of values) only from nearby tokens, while the across-block pattern computes context from previous summary tokens. In combination, these patterns ensure both local and long-range context can be aggregated throughout the Transformer stack, even without a dense (i.e., fully quadratic) attention mechanism.   
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-gbW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-gbW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 424w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 848w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 1272w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-gbW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png" width="1456" height="1095" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1095,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:600878,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-gbW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 424w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 848w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 1272w, https://substackcdn.com/image/fetch/$s_!-gbW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1b5de9b-6bc1-459f-90d4-cb9c6b98aa78_2100x1579.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's estimate the time complexity of those sparse attentions. In the strided case, we have a local window of size <em><strong>w</strong></em> and a stride <em><strong>c</strong></em>. Therefore, each query attends to roughly <em><strong>w + N / c</strong></em> keys (the local window plus the strided tokens). For <em><strong>N</strong></em> queries, the total cost is  <em><strong>~ N(w + N / c)</strong></em>. By tuning <em><strong>w</strong></em> and <em><strong>c</strong></em>, one can achieve sub-quadratic complexity. 
For example:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    N\\left(w+\\frac{N}{c}\\right) \\sim \\mathcal{O}(N\\sqrt{N}), \\quad &amp;\\text{if }c= \\sqrt{N}\\nonumber\\\\\n\n    N\\left(w+\\frac{N}{c}\\right) \\sim \\mathcal{O}(N\\log{N}), \\quad &amp;\\text{if }c=\\frac{N} {\\log{N}}\\nonumber\n\n\\end{align}&quot;,&quot;id&quot;:&quot;PENWTNUHMP&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the fixed case, the sequence is split into blocks of length <em><strong>l</strong></em>. Within each block, each query attends to at most <em><strong>l</strong></em> keys. Furthermore, each query attends to <em><strong>c = N / l</strong></em> summary tokens. Therefore, a query sees <em><strong>O(l + c)</strong></em> keys, and the total cost for all queries is <em><strong>O(N(l + c))</strong></em>. Typically, we choose <em><strong>l</strong></em> such that it grows sub-linearly with <em><strong>N</strong></em>. For example:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; N(l + c) \\sim \\mathcal{O}(N\\sqrt{N}), \\quad \\text{if }l= \\sqrt{N}&quot;,&quot;id&quot;:&quot;BBFJEGTJOH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Despite the improved time complexity, it is essential that these computations be performed as tensor operations to fully utilize the high parallelism of GPU hardware. For example, let's assume that we want to compute the attention scores for the local sliding window described in the strided case. The original keys tensor <em><strong>K</strong></em> is of dimension <em><strong>n<sub>head</sub> X d<sub>head</sub> X N</strong></em>. We can construct the windowed keys tensor <em><strong>K<sup>w</sup></strong></em> to compute the sliding window all at once by adding another dimension representing the window size <em><strong>w</strong></em>. Constructing this tensor is an <em><strong>O(N)</strong></em> operation. 
The resulting tensor is of size <em><strong>n<sub>head</sub> X d<sub>head</sub> X N X w</strong></em>, and each slice of size <em><strong>d<sub>head</sub> X N X w</strong></em> contains the necessary keys to compute the windowed attentions for each head.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QQ8Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 424w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 848w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 1272w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png" width="1456" height="711" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:711,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1284841,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 424w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 848w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 1272w, https://substackcdn.com/image/fetch/$s_!QQ8Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ec58c7-58d5-4b50-88bf-ad4282f66ecf_5541x2706.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let's now compute the product between the windowed keys <em><strong>K<sup>w</sup></strong></em> and the queries <em><strong>Q</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    E_{hnw}=\\sum_{d}\\frac{Q_{hdn}K^w_{hdnw}}{\\sqrt{d_\\text{head}}}, \\quad \\text{shape: }  n_\\text{head}\\times N\\times w&quot;,&quot;id&quot;:&quot;VAMBQSCEBQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>h</strong></em> represents the head dimension (<em><strong>n<sub>head</sub></strong></em>), <em><strong>n</strong></em> the sequence dimension (<em><strong>N</strong></em>), <em><strong>d</strong></em> the hidden size per head dimension (<em><strong>d<sub>head</sub></strong></em>) and <em><strong>w</strong></em> the window dimension. The resulting alignment scores tensor <em><strong>E</strong></em> is of shape <em><strong>n<sub>head</sub> X N X w</strong></em>. 
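</p><p>This contraction can be sketched directly in NumPy (the tensor sizes here are toy values, and clamping the window indices at the start of the sequence is an illustrative choice for handling the boundary, not necessarily what the original implementation does):</p><pre><code class="language-python">import numpy as np

n_head, d_head, N, w = 4, 16, 128, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((n_head, d_head, N))
K = rng.standard_normal((n_head, d_head, N))

# Build the windowed keys tensor K^w of shape (n_head, d_head, N, w):
# position n gathers the keys at [n-w+1, ..., n], clamped at 0 early on.
idx = np.arange(N)[:, None] - np.arange(w - 1, -1, -1)[None, :]  # (N, w)
valid = idx >= 0
Kw = K[:, :, np.clip(idx, 0, None)]           # (n_head, d_head, N, w)

# E_{hnw} = sum_d Q_{hdn} K^w_{hdnw} / sqrt(d_head)
E = np.einsum('hdn,hdnw->hnw', Q, Kw) / np.sqrt(d_head)
E = np.where(valid[None, :, :], E, -np.inf)   # mask out-of-range slots

print(E.shape)  # (4, 128, 8)
</code></pre><p>The out-of-range window slots are masked so that they vanish after the softmax. 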
The time complexity of this operation is <em><strong>O(d<sub>head</sub>Nw)</strong></em> instead of the vanilla <em><strong>O(d<sub>head</sub>N<sup>2</sup>)</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!juVx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!juVx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 424w, https://substackcdn.com/image/fetch/$s_!juVx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 848w, https://substackcdn.com/image/fetch/$s_!juVx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 1272w, https://substackcdn.com/image/fetch/$s_!juVx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!juVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png" width="1456" height="614" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:870344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!juVx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 424w, https://substackcdn.com/image/fetch/$s_!juVx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 848w, https://substackcdn.com/image/fetch/$s_!juVx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 1272w, https://substackcdn.com/image/fetch/$s_!juVx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F737fb9d1-6b01-4deb-a88f-96e94a9a6553_2100x885.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Choosing Sparsity Efficiently: Reformer</h2>
      <p>
          <a href="https://newsletter.theaiedge.io/p/understanding-the-sparse-transformers">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Attention Is All You Need: The Original Transformer Architecture]]></title><description><![CDATA[This newsletter is the latest chapter of the Big Book of Large Language Models. You can find the preview here, and the full chapter is available in this newsletter]]></description><link>https://newsletter.theaiedge.io/p/attention-is-all-you-need-the-original</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/attention-is-all-you-need-the-original</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Wed, 12 Feb 2025 16:02:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!z90S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>This newsletter is the latest chapter of the <a href="https://book.theaiedge.io/">Big Book of Large Language Models</a>. 
You can find the preview <a href="https://drive.google.com/file/d/15JrUy6JsY8BzhJrqlyTAtf6NiHV-_Of2/view">here</a>, and the full chapter is available in this newsletter</strong></em></p><ul><li><p><em><strong>The Self-Attention Mechanism</strong></em></p></li><li><p><em><strong>The Multi-head Attention Layer</strong></em></p></li><li><p><em><strong>The Positional Encoding</strong></em></p></li><li><p><em><strong>The Encoder</strong></em></p></li><li><p><em><strong>The Residual Connections</strong></em></p></li><li><p><em><strong>The Layer Normalization</strong></em></p></li><li><p><em><strong>The Position-wise Feed-Forward Network</strong></em></p></li><li><p><em><strong>The Decoder</strong></em></p></li><li><p><em><strong>The Cross-Attention</strong></em></p></li><li><p><em><strong>Masking The Self-Attention Layer</strong></em></p></li><li><p><em><strong>The Prediction Head</strong></em></p></li><li><p><em><strong>The Decoding Process</strong></em></p></li><li><p><em><strong>Training For Causal Language Modeling</strong></em></p></li><li><p><em><strong>Understanding the scale of the model</strong></em></p></li><li><p><em><strong>Estimating The Number Of Model Parameters</strong></em></p></li><li><p><em><strong>Estimating The Floating&#8208;Point Operations</strong></em></p></li><li><p><em><strong>The Different Architecture Variations</strong></em></p></li><li><p><em><strong>The Encoder-Only Architecture</strong></em></p></li><li><p><em><strong>The Decoder-Only Architecture</strong></em></p></li><li><p><em><strong>The Encoder-Decoder Architecture</strong></em></p></li></ul><div><hr></div><p>The <a href="https://arxiv.org/pdf/1706.03762">"Attention Is All You Need"</a> paper is one of the most influential works in modern AI. 
By replacing recurrence with self-attention mechanisms, the authors introduced the Transformer architecture, a design that enabled parallelized training, captured long-range dependencies in data, and scaled effortlessly to unprecedented model sizes. This innovation not only rendered RNNs obsolete but also laid the groundwork for BERT, GPT, and the modern LLM revolution, powering breakthroughs from conversational AI to protein folding. Beyond technical innovations, the paper catalyzed a paradigm shift toward general-purpose models with the rise of foundation models trained on massive datasets and reshaped industries from healthcare to creative arts. In essence, it transformed how humanity interacts with language, knowledge, and intelligence itself.</p><h2>Architecture Overview</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z90S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z90S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 424w, https://substackcdn.com/image/fetch/$s_!z90S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 848w, https://substackcdn.com/image/fetch/$s_!z90S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 1272w, 
https://substackcdn.com/image/fetch/$s_!z90S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z90S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png" width="1456" height="921" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1808569,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z90S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 424w, https://substackcdn.com/image/fetch/$s_!z90S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 848w, https://substackcdn.com/image/fetch/$s_!z90S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 1272w, 
https://substackcdn.com/image/fetch/$s_!z90S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feede3dd0-159d-43f2-a620-0f34b8d81652_4800x3037.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">The original Transformer architecture is composed of the encoder that computes a rich representation of the input sequence, the decoder that generates the output sequence, and the prediction head that uses the decoder output to predict the tokens of the output sequence.</figcaption></figure></div><p>The architecture presented in the 
"Attention Is All You Need" paper builds directly from the RNN encoder-decoder architecture while discarding recurrence entirely and replacing Bahdanau/Luong's cross-attention with intra-sequence attention. There are four important components to the architecture:</p><ul><li><p><strong>The embeddings:</strong> Besides the token embeddings necessary to project the tokens into their vector representations, the Transformer introduced the need for positional encoding to ensure that the information related to the token positions is captured by the model.</p></li><li><p><strong>The encoder:</strong> As for the RNN encoder-decoder, the encoder is in charge of encoding the input sequence into vector representations such that the decoder has enough information to decode the output sequence. It comprises a stack of identical encoder blocks, each with multi-head self-attention (capturing global dependencies) and a position-wise feed-forward network (applying non-linear transformations).</p></li><li><p><strong>The decoder:</strong> Similar to the encoder but adds masked multi-head self-attention (preventing future token visibility) and encoder-decoder attention (aligning decoder inputs with encoder outputs, akin to Bahdanau/Luong but without RNNs). As before, the autoregressive generation proceeds token-by-token.</p></li><li><p><strong>Prediction head:</strong> The prediction head is a classifier over the whole token vocabulary made out of a linear layer followed by Softmax, converting the decoder's final hidden states into token probabilities to predict the next word.</p></li></ul><p>We will cover each component in detail in the remainder of this chapter. Self-attention and position embedding are central to the transformer architecture, and we need to discuss those technical innovations before we can understand the entire architecture. 
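</p><p>As a concrete illustration of the positional-encoding component listed above, here is a minimal NumPy sketch of the sinusoidal encoding introduced in the paper (the function name and the toy sizes are illustrative, not part of the original text; it assumes an even hidden size):</p>

```python
import numpy as np

def positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    # angle rates shrink geometrically with the dimension index, as in the paper
    angles = positions / np.power(10000.0, dims / d_model)   # (num_positions, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 16)   # one 16-dimensional encoding per position
```

<p>Each row is then added to the corresponding token embedding before the first encoder block, so that the otherwise order-agnostic attention layers can see token positions.</p><p>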
</p><h2>The Self-Attention</h2><h3>The Self-Attention Mechanism</h3><h4>The Architecture</h4><p>In the case of the Bahdanau/Luong attention, the goal was to capture the interactions between the tokens of the input sequence and the ones of the output sequence. In the Transformer, the self-attention captures the token interactions within the sequences. It is composed of three linear layers: <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em>, and <em><strong>W<sup>V</sup></strong></em>. The input vectors to the attention layer are the internal hidden states <em><strong>h<sub>i</sub></strong></em> resulting from the model inputs. There are as many hidden states as tokens in the input sequence, and <em><strong>h<sub>i</sub></strong></em> corresponds to the <em><strong>i<sup>th</sup></strong></em> token. <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em> and <em><strong>W<sup>V</sup></strong></em> project the incoming hidden states into the so-called keys <em><strong>k<sub>i</sub></strong></em>, queries <em><strong>q<sub>i</sub></strong></em> and values <em><strong>v<sub>i</sub></strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\mathbf{k}_i &amp;= W^K\\mathbf{h}_i, \\quad \\text{keys} \\nonumber\\\\\n\n    \\mathbf{q}_i &amp;= W^Q\\mathbf{h}_i, \\quad \\text{queries} \\nonumber\\\\\n\n    \\mathbf{v}_i &amp;= W^V\\mathbf{h}_i, \\quad \\text{values} \n\n\\end{align}&quot;,&quot;id&quot;:&quot;FHVATDPNMV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AeFl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AeFl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 424w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 848w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 1272w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AeFl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png" width="1456" height="959" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:959,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:777068,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AeFl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 424w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 848w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 1272w, https://substackcdn.com/image/fetch/$s_!AeFl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea7b4a7-9de4-445b-923f-b3062451eaa4_4625x3047.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em>, and <em><strong>W<sup>V</sup></strong></em> are used to project the hidden states into keys, queries, and values.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NWT3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NWT3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 424w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 848w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 1272w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NWT3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png" width="1456" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:616977,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NWT3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 424w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 848w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 1272w, https://substackcdn.com/image/fetch/$s_!NWT3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44681136-0ebb-4f6f-97f7-70e4d490db99_5719x2653.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The keys and queries are used to compute the alignment scores.</figcaption></figure></div><p>The keys and queries are used to compute the alignment scores:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;e_{ij} =  \\frac{\\mathbf{q}_i^\\top\\mathbf{k}_j}{\\sqrt{d_{\\text{model}}}}&quot;,&quot;id&quot;:&quot;IHKTUXKMFT&quot;}" data-component-name="LatexBlockToDOM"></div><p>As in the case of the Bahdanau attention, <em><strong>e<sub>ij</sub></strong></em> is the alignment score between the <em><strong>i<sup>th</sup></strong></em> word and the <em><strong>j<sup>th</sup></strong></em> word in the input sequence. 
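</p><p>The projections and alignment scores can be sketched in a few lines of NumPy. This follows the equations above, except that vectors are stored as rows of <em><strong>H</strong></em> (the text stacks them as columns), and the sizes and weights are random toy values rather than trained parameters:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model = 5, 8                         # toy sequence length and hidden size

H = rng.normal(size=(N, d_model))         # hidden states h_i, one row per token
W_Q = rng.normal(size=(d_model, d_model)) # query projection
W_K = rng.normal(size=(d_model, d_model)) # key projection
W_V = rng.normal(size=(d_model, d_model)) # value projection

Q, K, V = H @ W_Q, H @ W_K, H @ W_V       # queries, keys, values, shape (N, d_model)
E = Q @ K.T / np.sqrt(d_model)            # E[i, j]: scaled alignment score between tokens i and j
```

<p>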
<em><strong>d</strong></em><strong><sub>model</sub></strong> is the common naming convention for the hidden size:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left\\vert\\mathbf{h}_i\\right\\vert=\\left\\vert\\mathbf{k}_i\\right\\vert=\\left\\vert\\mathbf{q}_i\\right\\vert=\\left\\vert\\mathbf{v}_i\\right\\vert=d_{\\text{model}}=\\text{Hidden size}&quot;,&quot;id&quot;:&quot;CXCTOXPZKA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The scaling factor <em><strong>&#8730;d</strong></em><strong><sub>model</sub></strong> in the scaled dot-product counteracts the growth of the dot product's magnitude with the dimensionality <em><strong>d</strong></em><strong><sub>model</sub></strong>, which stabilizes gradients and ensures numerical stability during training. It is common to represent those operations as matrix multiplications. With the matrices <em><strong>K = [k<sub>1</sub>, &#8230;, k<sub>N</sub>]</strong></em> and <em><strong>Q = [q<sub>1</sub>, &#8230;, q<sub>N</sub>]</strong></em>, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    E =  \\frac{Q^\\top K}{\\sqrt{d_{\\text{model}}}}&quot;,&quot;id&quot;:&quot;XFHNYZCGCN&quot;}" data-component-name="LatexBlockToDOM"></div><p>or:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    E = \\frac{1}{\\sqrt{d_{\\text{model}}}} \n\n    \\begin{bmatrix}\n\n        \\mathbf{q}_1^\\top \\mathbf{k}_1 &amp; \\mathbf{q}_1^\\top \\mathbf{k}_2 &amp; \\cdots &amp; \\mathbf{q}_1^\\top \\mathbf{k}_N \\\\\n\n        \\mathbf{q}_2^\\top \\mathbf{k}_1 &amp; \\mathbf{q}_2^\\top \\mathbf{k}_2 &amp; \\cdots &amp; \\mathbf{q}_2^\\top \\mathbf{k}_N \\\\\n\n        \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\\n\n        \\mathbf{q}_N^\\top \\mathbf{k}_1 &amp; \\mathbf{q}_N^\\top \\mathbf{k}_2 &amp; \\cdots &amp; \\mathbf{q}_N^\\top \\mathbf{k}_N \\\\\n\n    
\\end{bmatrix}&quot;,&quot;id&quot;:&quot;QUHJLJTFOQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>with <em><strong>N </strong></em>being the number of tokens in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h-Ci!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h-Ci!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 424w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 848w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 1272w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h-Ci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png" width="1456" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:560548,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h-Ci!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 424w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 848w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 1272w, https://substackcdn.com/image/fetch/$s_!h-Ci!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3cb281a-2675-4d12-876a-f28ef94dfc19_5528x2231.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The attention weights are the result of normalizing the alignment scores by using the softmax transformation.</figcaption></figure></div><p>As for the other attentions, the alignment scores are normalized to 1 through a Softmax transformation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    a_{ij} =  \\text{Softmax}(e_{ij})=\\frac{\\exp(e_{ij})}{\\sum_{j=1}^N \\exp(e_{ij})}&quot;,&quot;id&quot;:&quot;FZXETUFCIE&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>a<sub>ij</sub></strong></em> is the attention weight between the tokens <em><strong>i</strong></em> and <em><strong>j</strong></em>, quantifying how strongly the model should attend to token <em><strong>j</strong></em> when processing token <em><strong>i</strong></em>. 
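</p><p>The Softmax normalization can be sketched as follows; the small hand-written score matrix holds illustrative values only, which makes the row-stochastic property easy to check:</p>

```python
import numpy as np

def softmax_rows(E: np.ndarray) -> np.ndarray:
    """Normalize each row of the score matrix into attention weights."""
    # subtracting the row-wise max leaves the result unchanged but avoids overflow
    shifted = E - E.max(axis=1, keepdims=True)
    expo = np.exp(shifted)
    return expo / expo.sum(axis=1, keepdims=True)

E = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 0.5],
              [1.0, 3.0, 0.2]])
A = softmax_rows(E)   # A[i, j]: weight of token j when processing token i
```

<p>Each row of <em><strong>A</strong></em> sums to 1, which is exactly what allows the weights to be read as probabilities.</p><p>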
Because we have <em><strong>&#931; a<sub>ij</sub> = 1</strong></em>, <em><strong>a<sub>ij</sub></strong></em> can be interpreted as the probability that token <em><strong>j</strong></em> is relevant to token <em><strong>i</strong></em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hu2V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hu2V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 424w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 848w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 1272w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hu2V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png" width="1456" height="901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:901,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:704307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hu2V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 424w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 848w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 1272w, https://substackcdn.com/image/fetch/$s_!Hu2V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4474b1fd-5d6b-49c7-a87b-c8cf575e4aca_5134x3178.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Each context vector is the result of a weighted average of the value vectors by using the attention weights.</figcaption></figure></div><p>The attention weights are used to compute a weighted average of the values vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\mathbf{c}_i = \\sum_{j=1}^N a_{ij}\\mathbf{v}_j&quot;,&quot;id&quot;:&quot;SZNOVUYFBL&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the jargon used in the previous chapter, <em><strong>c<sub>i</sub></strong></em> are the context vectors coming out of the attention layer, but we can think of them as another intermediary set of hidden states within the network. 
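</p><p>Putting the steps together, the whole single-head self-attention computation fits in a few lines of NumPy. As before, this is a sketch that stacks vectors as rows and uses random placeholder weights, so the matrix products are transposed relative to the column conventions used in the text:</p>

```python
import numpy as np

def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention; H holds one hidden state per row."""
    d_model = H.shape[1]
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    E = Q @ K.T / np.sqrt(d_model)               # scaled alignment scores
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # attention weights, each row sums to 1
    return A @ V                                 # context vectors c_i, one per row

rng = np.random.default_rng(0)
N, d_model = 4, 8
H = rng.normal(size=(N, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
C = self_attention(H, W_Q, W_K, W_V)             # shape (4, 8): one context vector per token
```

<p>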
Using the more common matrix notation, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    C = AV^\\top&quot;,&quot;id&quot;:&quot;TLIGDIJEEC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>V =[v<sub>1</sub>, &#8230;, v<sub>N</sub>]</strong></em>, <em><strong>C =[c<sub>1</sub>, &#8230;, c<sub>N</sub>]</strong></em> and <em><strong>A = Softmax(E)</strong></em> is the matrix of attention weights.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ugmk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ugmk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 424w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 848w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 1272w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ugmk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png" width="1456" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1253691,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ugmk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 424w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 848w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 1272w, https://substackcdn.com/image/fetch/$s_!ugmk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03846f51-d681-4aba-8a6a-377924ecd8f3_5459x2241.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The entire attention layer process.</figcaption></figure></div><p>The whole set of computations happening in the attention layer can be summarized as the following equation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; C = \\text{Softmax}\\left(\\frac{QK^\\top}{\\sqrt{d_{\\text{model}}}}\\right)V&quot;,&quot;id&quot;:&quot;EWLONKQQCC&quot;}" data-component-name="LatexBlockToDOM"></div><h4>The Keys, Queries, and Values Naming Convention</h4><p>The names "queries," "keys," and "values" are inspired by information retrieval systems (such as databases or search engines). Each token generates a query, key, and value to "retrieve" relevant context from other tokens. 
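</p><p>As a minimal NumPy sketch of this computation (illustrative only, not code from the book; tokens are stored as rows of shape (N, d<sub>model</sub>), and the projection matrices here are hypothetical random ones):</p>

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, with tokens as rows: (N, d_model)."""
    d_model = Q.shape[-1]
    E = Q @ K.T / np.sqrt(d_model)   # (N, N) pairwise attention scores
    A = softmax(E, axis=-1)          # each row of attention weights sums to 1
    return A @ V                     # (N, d_model) context vectors

rng = np.random.default_rng(0)
N, d_model = 5, 8
H = rng.normal(size=(N, d_model))                # incoming hidden states
# Hypothetical projection matrices W_Q, W_K, W_V
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
C = attention(H @ W_Q, H @ W_K, H @ W_V)
print(C.shape)  # (5, 8)
```

<p>Each row of the attention matrix sums to 1, so every context vector is a convex combination of the value vectors.</p><p>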
The model learns to search for relationships between tokens dynamically.</p><p>The queries represent what the current token is "asking for." For example, for the word <em>"it"</em> in <em>"The cat sat because it was tired,"</em> the query seeks its antecedent (e.g., <em>"cat"</em>). The keys represent what other tokens "offer" as context. In our example, the key for <em>"cat"</em> signals that it is a candidate antecedent for <em>"it."</em> The values are the actual content to aggregate based on the attention weights. The value for <em>"cat"</em> encodes its contextual meaning (e.g., entity type, role in the sentence, ...). For each query (current token), the model "retrieves" values (context) by comparing the query to all keys (other tokens). For example, let us consider the sentence:</p><blockquote><p><em><strong>"The bank is steep, so it's dangerous to stand near it."</strong></em></p></blockquote><ul><li><p>Query (<em>"it"</em>): "What does <em>'it'</em> refer to?"</p></li><li><p>Keys (<em>"bank," "steep," "dangerous"</em>): Highlight candidates for reference.</p></li><li><p>Values: Encode the meaning of each candidate.</p></li></ul><p>The model computes high attention weights between the query (<em>"it"</em>) and the keys (<em>"bank,"</em> <em>"steep"</em>), then aggregates their values to infer that <em>"it"</em> refers to the riverbank.</p><h3>The Multi-head Attention Layer</h3><h4>The Naive Description</h4><p>We have talked about self-attention so far, but the transformer architecture uses the so-called multi-head attention layer. The multi-head attention layer works as multiple parallel attention mechanisms. With multiple attention layers running in parallel, each can learn different interaction patterns between the tokens in the input sequence. Combining them leads to more heterogeneous learning, and we can extract richer information from the input sequence.
Think about the multi-head attention layer as an ensemble of self-attentions, a bit like the random forest is an ensemble of decision tree models.</p><p>We call "heads" the parallel attention mechanisms. To ensure that the time complexity of the computations remains independent of the number of attention heads, we need to reduce the size of the internal vectors within the layers. The hidden size dimensionality per head is divided by the number of heads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    d_{\\text{head}} = \\frac{d_{\\text{model}}}{n_{\\text{head}}}&quot;,&quot;id&quot;:&quot;KDVDCETDJF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>n</strong></em><strong><sub>head</sub></strong> is the number of heads. This implies that the hidden size has to be chosen so that it is divisible by the number of heads. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bHiN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bHiN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 424w, https://substackcdn.com/image/fetch/$s_!bHiN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 848w, https://substackcdn.com/image/fetch/$s_!bHiN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 
1272w, https://substackcdn.com/image/fetch/$s_!bHiN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bHiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png" width="1456" height="869" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:869,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:604777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bHiN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 424w, https://substackcdn.com/image/fetch/$s_!bHiN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 848w, https://substackcdn.com/image/fetch/$s_!bHiN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bHiN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14bfbad7-67ef-4a19-818a-b533c621f679_4441x2650.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Each attention head generates vectors of size <em><strong>d<sub>head</sub> = d<sub>model </sub>/ n<sub>head</sub></strong></em>, depending on the number of heads.</figcaption></figure></div><p>Let us call <em><strong>H =[h<sub>1</sub>, &#8230;, h<sub>N</sub>]</strong></em> the incoming hidden states. 
Each head <em><strong>h</strong></em> generates resulting hidden states <em><strong>H&#8217;<sub>h</sub></strong></em> of size <em><strong>d</strong></em><strong><sub>head</sub></strong><em><strong><sub> </sub>=</strong></em> <em><strong>d</strong></em><strong><sub>model </sub></strong><em><strong>/ n</strong></em><strong><sub>head</sub></strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H'_h = \\text{Attention}_h(H)&quot;,&quot;id&quot;:&quot;UOTTXUYKTI&quot;}" data-component-name="LatexBlockToDOM"></div><p>To combine those heads' hidden states, we concatenate them, and we pass them through a final linear layer <em><strong>W<sup>O</sup></strong></em> to mix the signals coming from the different heads:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    H' = \\text{Concat}(H'_1, \\ldots, H'_{n_{\\text{head}}})W^O&quot;,&quot;id&quot;:&quot;IBANJPEEOD&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZRiF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZRiF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 424w, https://substackcdn.com/image/fetch/$s_!ZRiF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZRiF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRiF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZRiF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png" width="1456" height="764" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1210922,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZRiF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 424w, https://substackcdn.com/image/fetch/$s_!ZRiF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZRiF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 1272w, https://substackcdn.com/image/fetch/$s_!ZRiF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4db2633a-6d32-4f38-863f-90c3c41e56d7_4359x2287.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The result of each head is concatenated, and the signals are further mixed by a final linear layer 
<em><strong>W<sup>O</sup></strong></em>.</figcaption></figure></div><p>To generate smaller hidden states, we need to reduce the dimensionality of the internal matrices. In each head, the projection matrices <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em>, and <em><strong>W<sup>V</sup></strong></em> take vectors of size <em><strong>d</strong></em><strong><sub>model</sub></strong> and generate vectors of size <em><strong>d</strong></em><strong><sub>model </sub></strong><em><strong>/ n</strong></em><strong><sub>head</sub></strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P2Ur!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P2Ur!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 424w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 848w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 1272w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P2Ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png" width="1456" height="689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1351673,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P2Ur!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 424w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 848w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 1272w, https://substackcdn.com/image/fetch/$s_!P2Ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae9aa5b3-f9af-44ee-b12e-423d7e06a2a6_5853x2769.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">To generate smaller context vectors, the underlying projection matrices <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em>, and <em><strong>W<sup>V</sup></strong></em> need to be of size <em><strong>d</strong></em><strong><sub>model</sub></strong><em><strong> X d</strong></em><strong><sub>head</sub></strong> for each head.</figcaption></figure></div><h4>The Tensor Representation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X0dD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png" 
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X0dD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 424w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 848w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 1272w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X0dD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png" width="488" height="468.56043956043953" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1398,&quot;width&quot;:1456,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:340764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!X0dD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 424w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 848w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 1272w, https://substackcdn.com/image/fetch/$s_!X0dD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F802344b7-919f-4a25-8881-1b8f7ee88dda_3153x3028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In reality, the projection matrices are not spread across multiple heads; different sections of the same matrices handle the projections for the different heads.</figcaption></figure></div><p>Although the information we have described so far about the multi-head attention layer is accurate, there is a critical subtlety to understand when it comes to its implementation. To illustrate the mathematical properties of the attention heads, we pictured separate "boxes" where each attention mechanism operates in parallel, but in reality, they are slightly more connected. To fully utilize the efficient parallelization capability of the GPU hardware, it is critical to rethink every operation as a tensor operation. We described <em><strong>W<sup>K</sup></strong></em>, <em><strong>W<sup>Q</sup></strong></em>, and <em><strong>W<sup>V</sup></strong></em> of each head as separate matrices, but in practice, there are just three matrices that we conceptually break down into the number of heads needed.</p><p>Similarly, there is only one set of keys, queries, and values, and each head processes the entire sequence of tokens but operates on a distinct subset of features. The keys, queries, and values have dimension <em><strong>d</strong></em><strong><sub>model</sub></strong><em><strong> X N</strong></em>, where <em><strong>N</strong></em> is the number of tokens in the input sequence. 
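</p><p>A minimal NumPy sketch of this single-projection, head-splitting scheme (illustrative only, not the book's code; it uses the common row convention with tokens as rows of shape (N, d<sub>model</sub>), and scales the scores by the per-head dimension, as is common in implementations):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, n_head = 6, 16, 4
d_head = d_model // n_head           # d_head = d_model / n_head

H = rng.normal(size=(N, d_model))    # incoming hidden states
# Single projection matrices; each head uses a distinct slice of their output
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))
W_O = rng.normal(size=(d_model, d_model))

def split_heads(X):
    # (N, d_model) -> (n_head, N, d_head): each head sees a feature subset
    return X.reshape(N, n_head, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(H @ W_Q), split_heads(H @ W_K), split_heads(H @ W_V)

E = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_head, N, N) scores
A = np.exp(E - E.max(axis=-1, keepdims=True))
A = A / A.sum(axis=-1, keepdims=True)            # softmax over the keys
heads = A @ V                                    # (n_head, N, d_head)

# Concatenate the heads back to (N, d_model), then mix them with W_O
H_out = heads.transpose(1, 0, 2).reshape(N, d_model) @ W_O
print(H_out.shape)  # (6, 16)
```

<p>No per-head weight matrices exist in this sketch: the reshape and transpose implement the conceptual split, with each head reading a slice of the same three projections.</p><p>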
To specify each head's sub-segment explicitly, we reshape the matrices into 3-dimensional tensors with dimension <em><strong>n</strong></em><strong><sub>head</sub></strong><em><strong> X d</strong></em><strong><sub>head</sub></strong><em><strong> X N</strong></em>. Let us consider the incoming set of the hidden states. It is first projected into keys, queries, and values:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    K = W^K H, \\quad \\text{shape: }  d_\\text{model}\\times N\\\\\n\n    Q = W^Q H, \\quad \\text{shape: }  d_\\text{model}\\times N \\\\\n\n    V = W^V H, \\quad \\text{shape: }  d_\\text{model}\\times N \n\n\\end{align}&quot;,&quot;id&quot;:&quot;JWJTPUQMVW&quot;}" data-component-name="LatexBlockToDOM"></div><p>We then reshape the resulting matrices into 3-dimensional tensors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    K' = \\text{Reshape}(K), \\quad \\text{shape: }  n_\\text{head}\\times d_\\text{head}\\times N\\\\\n\n    Q' = \\text{Reshape}(Q), \\quad \\text{shape: }  n_\\text{head}\\times d_\\text{head}\\times N \\\\\n\n    V' = \\text{Reshape}(V), \\quad \\text{shape: }  n_\\text{head}\\times d_\\text{head}\\times N \n\n\\end{align}&quot;,&quot;id&quot;:&quot;BGHFYALNMA&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rL4J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!rL4J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 424w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 848w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 1272w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rL4J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png" width="1456" height="1161" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1161,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1998212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!rL4J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 424w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 848w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 1272w, https://substackcdn.com/image/fetch/$s_!rL4J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28cd36c0-6226-4f42-863d-1e194a995827_4131x3293.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The keys, queries, and values are reshaped into 3-dimensional tensors with dimension <em><strong>n</strong></em><strong><sub>head</sub></strong><em><strong> X d</strong></em><strong><sub>head</sub></strong><em><strong> X N</strong></em>, where each slice of the tensors corresponds to one head.</figcaption></figure></div><p>Reshaping is computationally efficient as it only reorganizes the tensor dimensions. When we compute the alignment scores <em><strong>E'</strong></em> from the new tensors, we obtain <em><strong>N X N</strong></em> scores for each head:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E' =  \\frac{Q'^\\top K'}{\\sqrt{d_{\\text{head}}}}, \\quad \\text{shape: }  n_\\text{head}\\times N\\times N&quot;,&quot;id&quot;:&quot;AOMZWXATAX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the superscript <em><strong><sup>T</sup></strong></em> is shorthand for permuting the last two indices of a tensor, similar to the transpose operation for matrices. For <em><strong>K'</strong></em>, for instance:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    (k'^\\top)_{ikj} = k'_{ijk}, \\quad \\text{shape: }  n_\\text{head}\\times N\\times d_\\text{head}&quot;,&quot;id&quot;:&quot;YUTSIVWAPA&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where <em><strong>k&#8217;<sub>ijk</sub></strong></em> is an element of <em><strong>K'</strong></em>. 
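As a sketch of the score computation on the reshaped tensors (names, sizes, and the column-per-token convention are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_head, N = 8, 2, 5
d_head = d_model // n_head

K = rng.standard_normal((d_model, N))   # fused keys, one column per token
Q = rng.standard_normal((d_model, N))   # fused queries

# Reshape (d_model, N) -> (n_head, d_head, N): slice h holds head h's features.
K_heads = K.reshape(n_head, d_head, N)
Q_heads = Q.reshape(n_head, d_head, N)

# N x N scaled scores per head: permute the last two axes of the queries and
# batch-multiply against the keys.
E = np.transpose(Q_heads, (0, 2, 1)) @ K_heads / np.sqrt(d_head)
assert E.shape == (n_head, N, N)

# Cross-check head 0 against the same computation on the fused matrices.
assert np.allclose(E[0], Q[:d_head].T @ K[:d_head] / np.sqrt(d_head))
```

One batched matrix multiplication computes all heads' scores at once, which is what makes this layout GPU-friendly.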
Notice that the way the operations are performed ensures the computation of <em><strong>N X N</strong></em> attention weights per head while keeping the total number of arithmetic operations the same as in the vanilla attention layer. The attention weights <em><strong>A'</strong></em> are obtained by normalizing on the last dimension:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;a'_{ijk} =  \\text{Softmax}(e'_{ijk})=\\frac{\\exp(e'_{ijk})}{\\sum_{m=1}^N \\exp(e'_{ijm})}, \\quad \\text{shape: } n_\\text{head}\\times N\\times N&quot;,&quot;id&quot;:&quot;HHLDUCQAXP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Again, <em><strong>e&#8217;<sub>ijk</sub></strong></em> is an element of the tensor <em><strong>E'</strong></em> and <em><strong>a&#8217;<sub>ijk</sub></strong></em> of the tensor <em><strong>A'</strong></em>. The context vectors are computed as the weighted average of the values with the attention weights:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;c'_{ijl} = \\sum_{k=1}^N v'_{ijk}a'_{ilk}, \\quad \\text{shape: }  n_\\text{head}\\times d_\\text{head}\\times N&quot;,&quot;id&quot;:&quot;HAKKOQURPT&quot;}" data-component-name="LatexBlockToDOM"></div><p>or in tensor notation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    C' = V'A'^\\top, \\quad \\text{shape: }  n_\\text{head}\\times d_\\text{head}\\times N&quot;,&quot;id&quot;:&quot;AGQPBERNBO&quot;}" data-component-name="LatexBlockToDOM"></div><p>At this point, we have <em><strong>N</strong></em> context vectors of size <em><strong>d</strong></em><strong><sub>head</sub></strong> per head. 
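A minimal NumPy sketch of these two steps, the softmax over the key axis and the weighted average of the values (sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_head, d_head, N = 2, 4, 5

E = rng.standard_normal((n_head, N, N))             # alignment scores per head
V_heads = rng.standard_normal((n_head, d_head, N))  # values per head

# Softmax over the last axis (the keys) gives the attention weights A'.
A = np.exp(E) / np.exp(E).sum(axis=-1, keepdims=True)
assert np.allclose(A.sum(axis=-1), 1.0)             # each weight row sums to 1

# Context vectors: for each head, average the value columns with the attention
# weights of each query token; result has shape (n_head, d_head, N).
C_heads = V_heads @ np.transpose(A, (0, 2, 1))
assert C_heads.shape == (n_head, d_head, N)
```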
We can reshape this tensor such that we have <em><strong>N</strong></em> context vectors of size <em><strong>d</strong></em><strong><sub>model</sub></strong><em><strong><sub> </sub>= n</strong></em><strong><sub>head</sub></strong> <em><strong>d</strong></em><strong><sub>head</sub></strong>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C = \\text{Reshape}(C'), \\quad \\text{shape: }  d_\\text{model}\\times N&quot;,&quot;id&quot;:&quot;WNTJLRKJMJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>We described this earlier as the concatenation of the different heads' context vectors. As a way to combine further the signal coming from the different heads, we pass the resulting context vectors through a final linear layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;C_{\\text{final}} = W^OC, \\quad \\text{shape: }  d_\\text{model}\\times N&quot;,&quot;id&quot;:&quot;QTPGZBTIKT&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BEBv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BEBv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 424w, https://substackcdn.com/image/fetch/$s_!BEBv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 848w, 
https://substackcdn.com/image/fetch/$s_!BEBv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 1272w, https://substackcdn.com/image/fetch/$s_!BEBv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BEBv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png" width="1456" height="760" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:760,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2810533,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BEBv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 424w, https://substackcdn.com/image/fetch/$s_!BEBv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 848w, 
https://substackcdn.com/image/fetch/$s_!BEBv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 1272w, https://substackcdn.com/image/fetch/$s_!BEBv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20a2ff45-1fad-49de-971e-d8b4fffd786a_5922x3090.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The computation of the attention and context vectors across multiple heads happens in parallel by making use of 
efficient tensor operations for GPU computing.</figcaption></figure></div><p>This approach lets the model process information more efficiently than sequential methods, making it better at understanding both nearby and far-apart relationships in the data.</p><h2>The Positional Encoding</h2><h3>The Structure</h3><p>The goal of the positional encoding (a.k.a. position embedding) in the Transformer architecture is to inject sequential order information into the model, enabling it to understand the position of tokens in a sequence. Since Transformers process all tokens in parallel (unlike sequential models like RNNs), they lack inherent awareness of token order. Position embeddings address this by encoding positional data. Without positional information, the Transformer would treat the input as a "bag of words," losing critical order-dependent structure.</p><p>In the "Attention is all you need" paper, the positional encoding is defined as another embedding matrix with the same embedding size as the token embedding. The number of rows in the position embedding defines the maximum number of tokens that the model can ingest within a sequence, also known as the <em>context size</em>. The positional information of the token is added to the model by summing the semantic vector representations of the tokens from the token embedding and their positional vector representations from the position embedding. 
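The sum of the two embeddings can be sketched as follows; the helper name and sizes are made up for the example, and the constant 10000 follows the paper:

```python
import numpy as np

# Hedged sketch: build the sinusoidal position embedding table and add it to
# the token embeddings (base 10000 as in "Attention Is All You Need").
def positional_encoding(context_size: int, d_model: int) -> np.ndarray:
    # Assumes an even d_model so sine/cosine columns pair up cleanly.
    pe = np.zeros((context_size, d_model))
    pos = np.arange(context_size)[:, None]   # i: token positions
    j = np.arange(0, d_model, 2)             # even feature indices
    angle = pos / 10000 ** (j / d_model)
    pe[:, 0::2] = np.sin(angle)              # even j -> sine
    pe[:, 1::2] = np.cos(angle)              # odd j -> cosine
    return pe

context_size, d_model, N = 32, 8, 5
token_emb = np.random.default_rng(0).standard_normal((N, d_model))

# First hidden states: semantic vector + positional vector, per token.
hidden = token_emb + positional_encoding(context_size, d_model)[:N]
assert hidden.shape == (N, d_model)
```

The table is computed once and reused for every sequence, since it depends only on position and feature index, not on the tokens themselves.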
This ensures that the self-attention weights carry the positional information such that the order of the tokens impacts the model inference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QAiH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QAiH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 424w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 848w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 1272w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QAiH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png" width="1456" height="959" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:959,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:682137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QAiH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 424w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 848w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 1272w, https://substackcdn.com/image/fetch/$s_!QAiH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa5da8ed-1287-4e33-b717-06fcd12b9006_4622x3044.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The first set of hidden states is computed by summing the vector representations from the token embedding with the position encoding.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xO0r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xO0r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 424w, 
https://substackcdn.com/image/fetch/$s_!xO0r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 848w, https://substackcdn.com/image/fetch/$s_!xO0r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 1272w, https://substackcdn.com/image/fetch/$s_!xO0r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xO0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:794688,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xO0r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 424w, 
https://substackcdn.com/image/fetch/$s_!xO0r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 848w, https://substackcdn.com/image/fetch/$s_!xO0r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 1272w, https://substackcdn.com/image/fetch/$s_!xO0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba35348e-fa37-461b-b23c-eb7653bb0135_4931x2778.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Without the position encoding, the model could not understand the order of the tokens in the sequence.</figcaption></figure></div><p>The position embedding (PE) is a static matrix of numbers. If <em><strong>i</strong></em> is the index position of the vectors in the embedding, and<em><strong> j</strong></em> is the index position of the elements in the vectors, the matrix elements are defined by the following formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{PE}(i, j)\n\n= \n\n\\begin{cases}\n\n\\sin\\left(\\frac{i}{10000^{j/d_\\text{model}}}\\right) &amp; \\text{if $j$ is even} ,\\\\\n\n\\cos\\left(\\frac{i}{10000^{(j-1)/d_\\text{model}}}\\right) &amp; \\text{if $j$ is odd},\n\n\\end{cases}.&quot;,&quot;id&quot;:&quot;TUTGLKTKJS&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>i</strong></em> ranges in <em><strong>[0, context size - 1]</strong></em> and <em><strong>j</strong></em> in <em><strong>[0, d</strong></em><strong><sub>model</sub></strong><em><strong> - 1]</strong></em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G5Al!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G5Al!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 424w, 
https://substackcdn.com/image/fetch/$s_!G5Al!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 848w, https://substackcdn.com/image/fetch/$s_!G5Al!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 1272w, https://substackcdn.com/image/fetch/$s_!G5Al!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G5Al!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png" width="1456" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:492769,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G5Al!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 424w, 
https://substackcdn.com/image/fetch/$s_!G5Al!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 848w, https://substackcdn.com/image/fetch/$s_!G5Al!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 1272w, https://substackcdn.com/image/fetch/$s_!G5Al!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06c20705-94b5-4fd2-b968-b9f918f59c87_4763x2516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Capturing The Relative Token Positions</h3><p>The motivation behind this sinusoidal functional form is that it makes it easier for the model to learn attention weights that reflect each token's relative position. It stems from the trigonometric identities for sine and cosine functions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\sin(x + y) &amp;= \\sin(x)\\cos(y)+\\cos(x)\\sin(y) \\nonumber\\\\\n\n    \\cos(x + y) &amp;= \\cos(x)\\cos(y)-\\sin(x)\\sin(y)\n\n\\end{align}&quot;,&quot;id&quot;:&quot;WYEVZGKGEF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us consider a fixed offset <em><strong>k</strong></em> and apply the trigonometric identities to the encoding formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\sin(\\omega_j(i + k)) &amp;= \\sin(\\omega_j i)\\cos(\\omega_j k)+\\cos(\\omega_j i)\\sin(\\omega_j k) \\nonumber\\\\\n\n    \\cos(\\omega_j(i + k)) &amp;= \\cos(\\omega_j i)\\cos(\\omega_j k)-\\sin(\\omega_j i)\\sin(\\omega_j k)\n\n\\end{align}&quot;,&quot;id&quot;:&quot;CVLFDYFSRF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em><strong>&#120538;<sub>j</sub> = 1 / 10000<sup>j/d<sub>model</sub></sup></strong></em>. 
In matrix notation, we have:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{bmatrix}\n\n\\sin(\\omega_j (i + k)) \\\\\n\n\\cos(\\omega_j (i + k))\n\n\\end{bmatrix}\n\n=\n\n\\begin{bmatrix}\n\n\\cos(\\omega_j k) &amp; \\sin(\\omega_j k) \\\\\n\n-\\sin(\\omega_j k) &amp; \\cos(\\omega_j k)\n\n\\end{bmatrix}\n\n\\begin{bmatrix}\n\n\\sin(\\omega_j i) \\\\\n\n\\cos(\\omega_j i)\n\n\\end{bmatrix}.&quot;,&quot;id&quot;:&quot;EBCTENPXIS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us define <strong>PE</strong><em><strong>(i, j)</strong></em> as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n\\mathbf{PE}(i, j)&amp;= \\begin{bmatrix}\n\n\\text{PE}(i, j) \\\\\n\n\\text{PE}(i, j+1)\n\n\\end{bmatrix} = \\begin{bmatrix}\n\n\\sin(\\omega_j i) \\\\\n\n\\cos(\\omega_j i)\n\n\\end{bmatrix}, \\text{and}\\nonumber\\\\ \n\n\\mathbf{PE}(i+k, j) &amp;=\\begin{bmatrix}\n\n\\text{PE}(i+k, j) \\\\\n\n\\text{PE}(i+k, j+1)\n\n\\end{bmatrix}= \\begin{bmatrix}\n\n\\sin(\\omega_j (i + k)) \\\\\n\n\\cos(\\omega_j (i + k))\n\n\\end{bmatrix}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;QKPMBGZZTB&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>We obtain:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\mathbf{PE}(i+k, j) = R_j(k) \\mathbf{PE}(i, j)&quot;,&quot;id&quot;:&quot;DTKNNMRYRU&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;  R_j(k)=\\begin{bmatrix}\n\n\\cos(\\omega_j k) &amp; \\sin(\\omega_j k) \\\\\n\n-\\sin(\\omega_j k) &amp; \\cos(\\omega_j k)\n\n\\end{bmatrix}&quot;,&quot;id&quot;:&quot;GSAABYXJTM&quot;}" data-component-name="LatexBlockToDOM"></div><p>In linear algebra, <em><strong>R<sub>j</sub>(k)</strong></em> is called a rotation matrix and is used to perform a rotation in Euclidean space. 
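This rotation relation is easy to verify numerically; here is a small sketch (the indices and the base of the frequency are arbitrary choices for the example, and the identity holds for any frequency):

```python
import numpy as np

# Check that the positional encoding at position i+k, restricted to the
# feature pair (j, j+1), equals the rotation matrix R_j(k) applied to the
# encoding at position i.
i, k, j, d_model = 7, 3, 4, 16
w_j = 1.0 / 10000 ** (j / d_model)          # frequency of feature pair (j, j+1)

pe = lambda pos: np.array([np.sin(w_j * pos), np.cos(w_j * pos)])
R = np.array([[np.cos(w_j * k),  np.sin(w_j * k)],
              [-np.sin(w_j * k), np.cos(w_j * k)]])

assert np.allclose(pe(i + k), R @ pe(i))    # rotation by a fixed angle
```

Because the rotation depends only on the offset k and not on the absolute position i, the model can relate any two positions a fixed distance apart with the same linear map.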
Effectively, it means that <strong>PE</strong><em><strong>(i+k, j)</strong></em> is the rotation of <strong>PE</strong><em><strong>(i, j)</strong></em> by an angle <em><strong>-&#120538;<sub>j</sub>k</strong></em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oWu4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oWu4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 424w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 848w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 1272w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oWu4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png" width="530" height="309.77335164835165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1456,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:112129,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oWu4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 424w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 848w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 1272w, https://substackcdn.com/image/fetch/$s_!oWu4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca4f1c3c-13e9-4306-a195-e4c16a462009_2716x1588.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p>So far, we have shown that, for two tokens with relative distance <em><strong>k</strong></em>, each pair of elements <em><strong>(j, j+1)</strong></em> within their positional encodings is related through a rotation by angle <em><strong>-&#120538;<sub>j</sub>k</strong></em>. Let us call <strong>PE</strong><em><strong>(i) = [</strong></em><strong>PE</strong><em><strong>(i, 0), </strong></em><strong>PE</strong><em><strong>(i, 2), &#8230;, </strong></em><strong>PE</strong><em><strong>(i, d</strong></em><strong><sub>model</sub></strong><em><strong>-2)]</strong></em>. 
We can relate <strong>PE</strong><em><strong>(i)</strong></em> and <strong>PE</strong><em><strong>(i+k)</strong></em> through the pairwise rotation matrix:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{PE}(i+k) =  R(k)\\mathbf{PE}(i)&quot;,&quot;id&quot;:&quot;GWTDZMQYFQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;    R(k) = \n\n    \\begin{bmatrix}\n\n        R_0(k) &amp; 0 &amp; \\cdots &amp; 0 \\\\\n\n        0 &amp; R_2(k) &amp; \\cdots &amp; 0 \\\\\n\n        \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\\n\n        0 &amp; 0 &amp; \\cdots &amp;  R_{d_\\text{model}-2}(k) \\\\\n\n    \\end{bmatrix}&quot;,&quot;id&quot;:&quot;HGXJCFOUWD&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6hNc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6hNc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 424w, https://substackcdn.com/image/fetch/$s_!6hNc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 848w, https://substackcdn.com/image/fetch/$s_!6hNc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6hNc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6hNc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png" width="600" height="355.2197802197802" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:600,&quot;bytes&quot;:402442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6hNc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 424w, https://substackcdn.com/image/fetch/$s_!6hNc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 848w, https://substackcdn.com/image/fetch/$s_!6hNc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6hNc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfb1d345-088c-4ff7-9b3e-5f0334bc2639_4591x2719.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Graphical representation of pairwise rotation of the different elements in the position encodings for two tokens separated by <em><strong>k</strong></em> tokens.</figcaption></figure></div><p>Let us now consider two hidden states <em><strong>h<sub>i</sub></strong></em> and <em><strong>h<sub>i+k</sub></strong></em>, corresponding to two tokens 
with relative distance <em><strong>k</strong></em>, coming into the self-attention layer. Both of them are the result of summing the token embedding vectors <em><strong>x<sub>i</sub></strong></em> and <em><strong>x<sub>i+k</sub></strong></em> and the positional encoding vectors <strong>PE</strong><em><strong>(i)</strong></em> and <strong>PE</strong><em><strong>(i+k)</strong></em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n    \\mathbf{h}_{i} = \\mathbf{x}_{i} + \\mathbf{PE}(i)\\nonumber\\\\\n\n    \\mathbf{h}_{i+k} = \\mathbf{x}_{i+k} + \\mathbf{PE}(i+k)\n\n\\end{align}&quot;,&quot;id&quot;:&quot;SNSMRSLOWO&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can compute their alignment score after projecting them into their keys and queries (we ignore heads for simplicity):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\n        e_{i,i+k}&amp;=\\frac{\\mathbf{q}_{i}^\\top\\mathbf{k}_{i+k}}{\\sqrt{d_\\text{model}}} \\nonumber\\\\\n\n        &amp;=\\frac{\\left(W^Q\\mathbf{h}_{i}\\right)^\\top\\left(W^K\\mathbf{h}_{i+k}\\right)}{\\sqrt{d_\\text{model}}}\\nonumber\\\\\n\n        &amp;=\\frac{\\left(W^Q\\left[\\mathbf{x}_{i} + \\text{\\textbf{PE}}(i)\\right]\\right)^\\top\\left(W^K\\left[\\mathbf{x}_{i+k} + \\text{\\textbf{PE}}(i+k)\\right]\\right)}{\\sqrt{d_\\text{model}}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;UBLCAUAPDK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M9ls!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!M9ls!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 424w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 848w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 1272w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M9ls!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png" width="1456" height="649" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:745783,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!M9ls!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 424w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 848w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 1272w, https://substackcdn.com/image/fetch/$s_!M9ls!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45186bf2-6349-4cbd-8ce7-fd75d016203c_5453x2431.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">When we compute the alignment score between two tokens separated by <em><strong>k</strong></em> tokens, we can decompose its value into various contributions, including how the relative position <em><strong>k</strong></em> interacts with the absolute position <em><strong>i</strong></em>.</figcaption></figure></div><p>If we expand, we obtain:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\ne_{i,i+k}\\sqrt{d_\\text{model}} &amp;= \\quad\\underbrace{\\mathbf{x}_i^\\top W^{Q\\top} W^K \\mathbf{x}_{i+k}}_{{\\text{Token-Token Interaction}}} \\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{x}_i^\\top W^{Q\\top} W^K R(k)\\mathbf{PE}(i)}_{{\\text{Token-Position Interaction}}} \\nonumber\\\\\n\n&amp;+ \\quad\\underbrace{\\mathbf{PE}(i)^\\top W^{Q\\top} W^K \\mathbf{x}_{i+k}}_{{\\text{Position-Token Interaction}}}\\nonumber\\\\\n\n&amp;+\\quad \\underbrace{\\mathbf{PE}(i)^\\top W^{Q\\top} W^K R(k)\\mathbf{PE}(i)}_{{\\text{Position-Position Interaction}}}\n\n\\end{align}&quot;,&quot;id&quot;:&quot;JZKOPLYXOA&quot;}" data-component-name="LatexBlockToDOM"></div><p>We effectively decomposed the alignment score into four components:</p><ul><li><p><em><strong>Token-Token Interaction:</strong></em> Pure content-based alignment between <em><strong>x<sub>i</sub></strong></em> and <em><strong>x<sub>i+k</sub></strong></em></p></li><li><p><em><strong>Token-Position Interaction:</strong></em> How the token at position <em><strong>i</strong></em> interacts with the relative position <em><strong>k</strong></em> of 
<em><strong>x<sub>i+k</sub></strong></em></p></li><li><p><em><strong>Position-Token Interaction:</strong></em> How the position <em><strong>i</strong></em> interacts with the token <em><strong>x<sub>i+k</sub></strong></em></p></li><li><p><em><strong>Position-Position Interaction:</strong></em> How the relative position <em><strong>k</strong></em> (encoded via <em><strong>R(k)</strong></em>) interacts with the absolute position <em><strong>i</strong></em>.</p></li></ul><p>Remember that <em><strong>R(k)</strong></em> is the fixed, mathematically defined transformation matrix (from the sinusoidal identities) that maps <strong>PE</strong><em><strong>(i)</strong></em> to <strong>PE</strong><em><strong>(i+k)</strong></em>; it exists purely as a property of the positional encoding scheme. With this linear relationship, the model parameters <em><strong>W<sup>K</sup></strong></em> and <em><strong>W<sup>Q</sup></strong></em> can learn to leverage the structure of positional encodings to compute attention scores that depend on content and relative positions. During training, the model will learn to weigh these interactions by adjusting <em><strong>W<sup>K</sup></strong></em> and <em><strong>W<sup>Q</sup></strong></em>. This makes training more efficient, as the model does not need to relearn positional relationships from scratch; it builds on the mathematical structure of <strong>PE</strong><em><strong>(i)</strong></em>. 
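As a sanity check of this decomposition, here is a NumPy sketch under toy assumptions (random token embeddings and projection matrices, a small hypothetical dimension d, and PE(i+k) computed directly rather than via R(k)PE(i)); it confirms that the four interaction terms sum to the full alignment score:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy stand-in for d_model

def pos_encoding(pos, d):
    # Standard sinusoidal encoding: sin on even indices, cos on odd indices
    pe = np.zeros(d)
    idx = np.arange(0, d, 2)
    w = 1.0 / 10000 ** (idx / d)
    pe[0::2] = np.sin(w * pos)
    pe[1::2] = np.cos(w * pos)
    return pe

i, k = 3, 5
x_i, x_ik = rng.normal(size=d), rng.normal(size=d)     # toy token embeddings
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

h_i = x_i + pos_encoding(i, d)
h_ik = x_ik + pos_encoding(i + k, d)
full = (Wq @ h_i) @ (Wk @ h_ik) / np.sqrt(d)           # alignment score e_{i,i+k}

# The same score, split into the four interaction terms from the text
A = Wq.T @ Wk                                          # shared bilinear form
pe_i, pe_ik = pos_encoding(i, d), pos_encoding(i + k, d)
terms = (x_i @ A @ x_ik      # token-token
         + x_i @ A @ pe_ik   # token-position
         + pe_i @ A @ x_ik   # position-token
         + pe_i @ A @ pe_ik) # position-position
assert np.allclose(full, terms / np.sqrt(d))
```

The decomposition is exact because the dot product is bilinear; attention never computes the four terms separately, but the split shows what the learned matrices can exploit.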
The sinusoidal nature of the encoding also helps the model generalize better to both unseen absolute positions and positional offsets.</p><h3>Positional Encoding's Multi-Frequency Design</h3><p>The positional encoding defines a frequency that depends on the element index <em><strong>j</strong></em> within the vector:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\omega_j=\\frac{1}{10000^{j/d_\\text{model}}}&quot;,&quot;id&quot;:&quot;GMABXAMYXL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, the constant 10,000 is a hyperparameter that controls the range of frequencies used to encode positional information. This means that the period of oscillations is <em><strong>2&#120703; 10000<sup>j/dmodel</sup></strong></em>. The frequencies range from <em><strong>1</strong></em> down to <em><strong>1 / 10000<sup>(dmodel-2)/dmodel</sup></strong></em>, close to <em><strong>1 / 10000</strong></em>. High frequencies lead to rapidly oscillating sine/cosine waves, which are well suited to distinguishing between nearby positions. This is crucial for local sentence syntax (e.g., word order in a phrase). 
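The spread of periods across dimensions can be computed directly (a short NumPy sketch, not from the original post; d_model = 512 is an arbitrary example value):

```python
import numpy as np

d_model = 512
j = np.arange(0, d_model, 2)           # even dimension indices, one per (sin, cos) pair
omega = 1.0 / 10000 ** (j / d_model)   # frequencies, from 1 down to ~1/10000
periods = 2 * np.pi / omega            # periods in tokens: 2*pi * 10000**(j/d_model)

print(f"shortest period: {periods[0]:.1f} tokens")   # ~6.3 tokens, resolves nearby positions
print(f"longest period: {periods[-1]:.0f} tokens")   # tens of thousands of tokens
```

The geometric progression of frequencies means each pair of dimensions observes position at a different scale, from word order within a phrase up to document-level distances.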
Low frequencies lead to slowly oscillating waves that generalize over longer distances, which is useful for capturing global structure (e.g., paragraph-level coherence).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MnMC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MnMC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 424w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 848w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 1272w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MnMC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png" width="1456" height="698" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1147434,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MnMC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 424w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 848w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 1272w, https://substackcdn.com/image/fetch/$s_!MnMC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5de10dc7-0853-41ff-80f7-acd123a08a9b_4681x2244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Each pair of columns captures different frequencies within the text data.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZMnx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZMnx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZMnx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 848w, https://substackcdn.com/image/fetch/$s_!ZMnx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMnx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZMnx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png" width="550" height="404.18956043956047" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1070,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:639304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZMnx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZMnx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 848w, https://substackcdn.com/image/fetch/$s_!ZMnx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMnx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3848ce8-9a43-495a-8739-b0042d855afc_3863x2838.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">High frequencies are suited to distinguishing between nearby positions. This is crucial for local sentence syntax. Low frequencies are useful for capturing global structure.</figcaption></figure></div><p>The large value of 10,000 ensures a smooth transition from high to low frequencies across the embedding dimensions. The maximum period supported by the model is <em><strong>2&#120703; 10000 &#8776; 62,832</strong></em>, so, in theory, the encoding provides unique positional signals for sequences of up to roughly 62,832 tokens. However, transformers trained on fixed, shorter sequence lengths (e.g., 512&#8211;4,096 tokens) do not learn to handle positional relationships beyond the lengths seen during training. While the encoding theoretically supports very long periods, the model's effective context size is constrained by the training data.</p><h2>The Encoder</h2>
      <p>
          <a href="https://newsletter.theaiedge.io/p/attention-is-all-you-need-the-original">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Introducing The Big Book of Large Language Models!]]></title><description><![CDATA[For the past few years, I have been creating educational content around machine learning and, specifically, large language models.]]></description><link>https://newsletter.theaiedge.io/p/introducing-the-big-book-of-large</link><guid isPermaLink="false">https://newsletter.theaiedge.io/p/introducing-the-big-book-of-large</guid><dc:creator><![CDATA[Damien Benveniste]]></dc:creator><pubDate>Thu, 30 Jan 2025 16:01:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1Fkp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For the past few years, I have been creating educational content around machine learning and, specifically, large language models. I have acquired a depth of knowledge through experience and practice in the field, and I want to share it with everybody! I have started writing what I believe is one of the most complete books on the subject of Large Language Models. 
You can access the book website here: <strong><a href="https://book.theaiedge.io/">The Big Book Of Large Language Models</a></strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://book.theaiedge.io/" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Fkp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 424w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 848w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Fkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png" width="284" height="453.3076923076923" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1494,&quot;width&quot;:936,&quot;resizeWidth&quot;:284,&quot;bytes&quot;:958359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://book.theaiedge.io/&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Fkp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 424w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 848w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!1Fkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6ffe702-fc6b-40ba-972f-61a8177517df_936x1494.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I will make the chapters available little by little as I write them. Don&#8217;t hesitate to leave comments so I can improve the current draft! The first chapter is now available: <strong><a href="https://drive.google.com/file/d/1O9Rfk1Sf_5coeUEiGyq0H5H0ucNCgoW9/view?usp=drive_link">Language Models Before Transformers</a></strong>. 
In that chapter, I address the following subjects:</p><ul><li><p><em>The Embedding Layers</em></p></li><li><p><em>Word2Vec</em></p></li><li><p><em>GloVe</em></p></li><li><p><em>The Jordan Network</em></p></li><li><p><em>The Elman Network</em></p></li><li><p><em>The Vanishing and Exploding Gradients Problem</em></p></li><li><p><em>Long Short Term Memory (LSTM)</em></p></li><li><p><em>Gated Recurrent Unit (GRU)</em></p></li><li><p><em>Sequence-to-Sequence Models</em></p></li><li><p><em>The RNN Encoder-Decoder Architecture</em></p></li><li><p><em>The Bahdanau Attention Mechanism</em></p></li><li><p><em>The Luong Attention</em></p></li></ul><p>Here are the chapters coming up:</p><ol><li><p><em>Introduction</em></p></li><li><p><em><a href="https://drive.google.com/file/d/1O9Rfk1Sf_5coeUEiGyq0H5H0ucNCgoW9/view?usp=drive_link">Language Models Before Transformers</a></em></p></li><li><p><em>Attention Is All You Need: The Original Transformer Architecture</em></p></li><li><p><em>A More Modern Approach To The Transformer Architecture</em></p></li><li><p><em>Multi-modal Large Language Models</em></p></li><li><p><em>Transformers Beyond Language Models</em></p></li><li><p><em>Non-Transformer Language Models</em></p></li><li><p><em>How LLMs Generate Text</em></p></li><li><p><em>From Words To Tokens</em></p></li><li><p><em>Training LLMs to Follow Instructions</em></p></li><li><p><em>Scaling Model Training</em></p></li><li><p><em>Fine-Tuning LLMs</em></p></li><li><p><em>Deploying LLMs</em></p></li></ol><p>My philosophy is to provide the depth of mathematical notation along with the ease of visual illustrations of the different concepts. I believe the book can be read at different levels: </p><ul><li><p>For somebody looking for the finest details, the equations should provide the foundations to thoroughly understand the concepts. </p></li><li><p>For somebody looking for a simpler read, the equations can be ignored to focus on the textual and visual explanations. 
</p></li><li><p>For somebody looking to strengthen their mathematical fundamentals in ML, the connection between the math and the visuals should help bridge the difficulties usually encountered when learning mathematics. </p></li></ul><p>Let me know if you think the book is missing the target on that &#8220;mission.&#8221; I am truly excited to share this with you! I hope you will enjoy reading it as much as I enjoy writing it!  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fKTR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fKTR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 424w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 848w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 1272w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fKTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png" width="339" height="338.44517184942714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:611,&quot;resizeWidth&quot;:339,&quot;bytes&quot;:569971,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fKTR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 424w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 848w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 1272w, https://substackcdn.com/image/fetch/$s_!fKTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4cc30835-6dd3-42dd-8819-951ca0468f9d_611x610.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item></channel></rss>