Almost-scaled dot-product attention

This post has moved.

Note that the approach and equations described in this post are legacy and do not reflect the current implementation of u-μP. Please see the code for a definitive reference.