This is a great write-up. I like the intuition with the magnifying glass. I also tend to think of it as analogous to using multiple channels in a convolutional layer.
Thank you for your kind words, Sebastian! Comparing multi-head attention to multiple channels in a convolutional layer is also a good way to think about it: it emphasizes how different "perspectives" or "filters" can extract varied features from the same input.