Incorrect RMSNorm #4

arunmallya · 2024-03-13T18:25:12Z

The RMSNorm implementation in this codebase is wrong: it computes the RMS over the (T, D) dimensions instead of over the (D) dimension alone. Assume the input x has shape (B, T, D).

The current code does this:

# x is (B, T, D).
ff_rms = torch.linalg.norm(x, dim=(1,2)) * x[0].numel() ** -.5  # (B,).
raw = x / ff_rms.unsqueeze(-1).unsqueeze(-1)  # (B, 1, 1).

The original RMSNorm is here - https://github.com/meta-llama/llama/blob/main/llama/model.py#L34-L77

x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

The correct version using Frobenius norm would be:

ff_rms = torch.linalg.norm(x, dim=-1, keepdim=True) / math.sqrt(x.shape[-1])  # (B, T, 1).
raw = x / (ff_rms + eps)

Normalization should be per-token, not per-sequence.
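To make the difference concrete, here is a minimal sketch (not code from the repository) that applies both normalizations to a random tensor and measures the RMS of each token afterwards. The shapes, the `eps` value, and the helper `token_rms` are assumptions chosen for illustration; tokens are deliberately given different magnitudes so that the per-sequence version visibly fails to normalize them.

```python
import math
import torch

torch.manual_seed(0)
B, T, D = 2, 4, 8
eps = 1e-6  # assumed value, for illustration only

# Give each position a different magnitude so the effect is easy to see.
x = torch.randn(B, T, D) * torch.arange(1, T + 1).view(1, T, 1)

# Per-sequence RMS, as in the code quoted above: one scalar per batch element.
seq_rms = torch.linalg.norm(x, dim=(1, 2)) * x[0].numel() ** -0.5  # (B,)
per_sequence = x / seq_rms.unsqueeze(-1).unsqueeze(-1)  # (B, T, D)

# Per-token RMS: one scalar per (batch, position) pair.
tok_rms = torch.linalg.norm(x, dim=-1, keepdim=True) / math.sqrt(D)  # (B, T, 1)
per_token = x / (tok_rms + eps)  # (B, T, D)

# LLaMA-style formulation, with eps inside the square root.
llama_style = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)


def token_rms(t):
    """RMS of each token vector, shape (B, T)."""
    return t.pow(2).mean(-1).sqrt()


print(token_rms(per_sequence))  # differs from position to position
print(token_rms(per_token))     # close to 1 at every position
```

The two per-token variants differ only in where `eps` is added, so they should agree up to a term of order `eps` whenever the token RMS is not close to zero.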


nkkbr · 2024-05-12T09:56:31Z

I agree with you.

nkkbr · 2024-05-12T10:35:33Z

My version:

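import torch
from torch import nn


class RMSNorm(nn.Module):
    def __init__(self, layer_shape, eps=1e-8, bias=False):
        # layer_shape is expected to be (max_seq_len, d_model); bias is accepted but unused.
        super().__init__()
        self.scale = nn.Parameter(torch.ones(layer_shape))
        self.eps = eps

    def forward(self, x):
        """
        assumes shape is (batch, seq_len, d_model)
        """
        # Per-token RMS: mean of squares over the last (d_model) dimension only.
        f = torch.rsqrt(torch.mean(x.pow(2), dim=-1, keepdim=True) + self.eps)
        # Slice the scale to the current sequence length and broadcast over the batch.
        return x * f * self.scale[:x.shape[1], :].unsqueeze(0)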
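One way to sanity-check that a module like this normalizes per token is to perturb a single position and confirm the other positions are unaffected. The sketch below uses a hypothetical `TokenRMSNorm` (a name made up here) with a `(d_model,)` scale, following the LLaMA reference linked earlier in the thread, where the weight is `torch.ones(dim)`. That choice of scale shape is an assumption for this example and differs from the `(seq_len, d_model)` scale above.

```python
import torch
from torch import nn


class TokenRMSNorm(nn.Module):
    """Hypothetical per-token RMSNorm with a (d_model,) scale, for illustration only."""

    def __init__(self, d_model, eps=1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        # x: (batch, seq_len, d_model); statistics use the last dimension only.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.scale


torch.manual_seed(0)
norm = TokenRMSNorm(d_model=16)
x = torch.randn(2, 5, 16)
y = norm(x)

# Scale up only position 0; with per-token statistics, positions 1.. are unchanged.
x_perturbed = x.clone()
x_perturbed[:, 0, :] *= 100.0
y_perturbed = norm(x_perturbed)

others_unchanged = torch.allclose(y[:, 1:], y_perturbed[:, 1:])
print(others_unchanged)
```

With a per-sequence RMS, the same perturbation would shrink the output at every other position, because the scaled-up token inflates the shared denominator.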

bkitano · 2024-05-29T21:13:17Z

hi! open a PR?
