Physics of Language Models: Architecture Design and the Magic of Canon Layers (papers.ssrn.com)
19 points by nkko 9 months ago
darknoon 9 months ago
Anyone know why they mix in the 3 previous tokens? They could have just as easily done 5 or 2, right?