Attention Residuals
Learning how information should accumulate over depth
Core thesis
Residuals already decide what survives depth.
AttnRes turns that routing into an explicit mechanism.
The paper argues that residuals are already performing depth-wise routing. Once you see that, the natural next step is to let the model learn which prior layers should matter most.
This deck follows the paper's arc from problem to mechanism to practical system design to measurable wins.
Talk arc
1. Problem: why fixed residual accumulation loses selectivity over depth
2. Mechanism: full attention over prior layer outputs
3. Systems: a blockwise variant that keeps the idea practical
4. Results: scaling and downstream wins that justify the change
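The mechanism in step 2 can be sketched in a few lines: instead of adding each layer's output to a fixed running sum, the current layer attends over the stack of all prior layer outputs and takes a learned weighted mix. This is a minimal illustrative sketch, not the paper's implementation; the names (`attn_residual`, `w_q`, `w_k`) and the single-vector setting are assumptions for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_residual(layer_outputs, w_q, w_k):
    """Replace the fixed residual sum with attention over prior layers.

    layer_outputs: list of (d,) hidden vectors, one per layer so far.
    w_q, w_k: (d, d_k) learned projections (hypothetical names).
    """
    H = np.stack(layer_outputs)             # (L, d) all prior layer outputs
    q = H[-1] @ w_q                         # query from the current layer
    K = H @ w_k                             # keys from every prior layer
    scores = K @ q / np.sqrt(w_k.shape[1])  # scaled dot-product scores
    weights = softmax(scores)               # learned depth-wise routing
    return weights @ H                      # weighted mix replaces the plain sum

# usage: 5 layers of width 8, random projections standing in for learned ones
rng = np.random.default_rng(0)
d, d_k, L = 8, 4, 5
outs = [rng.standard_normal(d) for _ in range(L)]
stream = attn_residual(outs, rng.standard_normal((d, d_k)),
                       rng.standard_normal((d, d_k)))
```

Note that a uniform `weights` vector recovers (up to scale) the ordinary residual sum, which is why the mechanism strictly generalizes the fixed rule.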
Why it matters
Architecture
Residuals stop being a fixed rule and become a learned selector.
Infrastructure
Block AttnRes makes the mechanism trainable at large scale.
Evidence
The paper ties the story to scaling curves and downstream gains.
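One plausible reading of the blockwise variant: restrict the attention to layer outputs within the current block, so the cache of prior outputs stays O(block size) rather than O(depth). The function and its windowing scheme below are assumptions for illustration; the paper's exact blocking may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_attn_residual(layer_outputs, block_size, w_q, w_k):
    """Blockwise sketch: attend only over the last `block_size` layers.

    Keeps memory for cached layer outputs bounded by the block size,
    which is what makes the mechanism trainable at scale.
    """
    H = np.stack(layer_outputs[-block_size:])   # only the current block
    q = H[-1] @ w_q
    K = H @ w_k
    weights = softmax(K @ q / np.sqrt(w_k.shape[1]))
    return weights @ H

# usage: 12 layers, but only the last 4 are attended over
rng = np.random.default_rng(1)
d, d_k = 8, 4
outs = [rng.standard_normal(d) for _ in range(12)]
stream = block_attn_residual(outs, 4, rng.standard_normal((d, d_k)),
                             rng.standard_normal((d, d_k)))
```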