Attention Residuals
Learning how information should accumulate over depth
Core thesis
Residuals already decide what survives depth.
AttnRes turns that routing into an explicit mechanism.
The paper argues that residuals are already performing depth-wise routing. Once you see that, the natural next step is to let the model learn which prior layers should matter most.
This deck follows the paper's arc from problem to mechanism to practical system design to measurable wins.
Talk arc
1. Problem: why fixed residual accumulation loses selectivity over depth
2. Mechanism: full attention over prior layer outputs
3. Systems: a blockwise variant that keeps the idea practical
4. Results: scaling and downstream wins that justify the change
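The mechanism in step 2 can be sketched in a few lines: instead of adding each layer's output to a fixed running sum, the current layer attends over the stack of all prior layer outputs and takes a learned weighted mix. This is a minimal illustrative sketch, not the paper's implementation; the names (`attn_residual`, `w_q`, `w_k`) and the single-vector setting are assumptions for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, w_q, w_k):
    """Replace the fixed residual sum with attention over prior layers.

    layer_outputs: list of (d,) hidden vectors, one per layer so far.
    w_q, w_k: (d, d_k) learned projections (hypothetical names).
    """
    H = np.stack(layer_outputs)             # (L, d) all prior layer outputs
    q = H[-1] @ w_q                         # query from the current layer
    K = H @ w_k                             # keys from every prior layer
    scores = K @ q / np.sqrt(w_k.shape[1])  # scaled dot-product scores
    weights = softmax(scores)               # learned depth-wise routing
    return weights @ H                      # weighted mix replaces the plain sum

# usage: 5 layers of width 8, random projections standing in for learned ones
rng = np.random.default_rng(0)
d, d_k, L = 8, 4, 5
outs = [rng.standard_normal(d) for _ in range(L)]
stream = attn_residual(outs, rng.standard_normal((d, d_k)),
                       rng.standard_normal((d, d_k)))
```

Note that a uniform `weights` vector recovers (up to scale) the ordinary residual sum, which is why the mechanism strictly generalizes the fixed rule.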
Why it matters
Architecture
Residuals stop being a fixed rule and become a learned selector.
Infrastructure
Block AttnRes makes the mechanism trainable at large scale.
Evidence
The paper ties the story to scaling curves and downstream gains.
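One plausible reading of the blockwise variant: restrict the attention to layer outputs within the current block, so the cache of prior outputs stays O(block size) rather than O(depth). The function and its windowing scheme below are assumptions for illustration; the paper's exact blocking may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_attn_residual(layer_outputs, block_size, w_q, w_k):
    """Blockwise sketch: attend only over the last `block_size` layers.

    Keeps memory for cached layer outputs bounded by the block size,
    which is what makes the mechanism trainable at scale.
    """
    H = np.stack(layer_outputs[-block_size:])   # only the current block
    q = H[-1] @ w_q
    K = H @ w_k
    weights = softmax(K @ q / np.sqrt(w_k.shape[1]))
    return weights @ H

# usage: 12 layers, but only the last 4 are attended over
rng = np.random.default_rng(1)
d, d_k = 8, 4
outs = [rng.standard_normal(d) for _ in range(12)]
stream = block_attn_residual(outs, 4, rng.standard_normal((d, d_k)),
                             rng.standard_normal((d, d_k)))
```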