Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Bibliographic Details
Published in: arXiv.org (Dec 3, 2024), p. n/a
Main Author: Ren, Liliang
Other Authors: Liu, Yang; Lu, Yadong; Shen, Yelong; Chen, Liang; Chen, Weizhu
Published: Cornell University Library, arXiv.org
Subjects: Recall; Modelling; Context; Query processing; State space models
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 3067013628
003 UK-CbPIL
022 |a 2331-8422 
035 |a 3067013628 
045 0 |b d20241203 
100 1 |a Ren, Liliang 
245 1 |a Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling 
260 |b Cornell University Library, arXiv.org  |c Dec 3, 2024 
513 |a Working Paper 
520 3 |a Efficiently modeling sequences with infinite context length has long been a challenging problem. Previous approaches have either suffered from quadratic computational complexity or limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall recent memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and demonstrate that it significantly outperforms state-of-the-art models across a variety of benchmarks. Pretrained on sequences of 4K length, Samba shows improved perplexity in context lengths of up to 1M in zero-shot. When finetuned on 4K-length sequences, Samba efficiently extrapolates to a 256K context length with perfect memory recall on the Passkey Retrieval task, and exhibits superior retrieval extrapolation on the challenging Phonebook task compared to full-attention models. As a linear-time sequence model, Samba achieves a 3.73x higher throughput compared to Transformers with grouped-query attention for user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. Our code for training on open source data is publicly available at https://github.com/microsoft/Samba. 
653 |a Recall 
653 |a Modelling 
653 |a Context 
653 |a Query processing 
653 |a State space models 
700 1 |a Liu, Yang 
700 1 |a Lu, Yadong 
700 1 |a Shen, Yelong 
700 1 |a Chen, Liang 
700 1 |a Chen, Weizhu 
773 0 |t arXiv.org  |g (Dec 3, 2024), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/3067013628/abstract/embedded/ZKJTFFSVAI7CB62C?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/2406.07522
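The abstract above describes Samba's layer-wise hybrid of a selective state space model (Mamba) with sliding window attention (SWA). Below is a minimal, hypothetical PyTorch sketch of one such hybrid block; it is not the authors' implementation. It assumes the third-party mamba-ssm package for the Mamba layer, and the sublayer ordering, hidden size, head count, and window length are illustrative placeholders rather than values taken from the paper.

import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: the third-party `mamba-ssm` package is installed

class SlidingWindowAttention(nn.Module):
    # Causal multi-head self-attention restricted to a fixed local window.
    def __init__(self, d_model, n_heads, window):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window

    def forward(self, x):
        t = x.size(1)
        pos = torch.arange(t, device=x.device)
        dist = pos[None, :] - pos[:, None]          # key position minus query position
        mask = (dist > 0) | (dist < -self.window)   # True = blocked (future token or outside window)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

class SambaLikeBlock(nn.Module):
    # One hybrid block: the Mamba sublayer compresses history into recurrent state,
    # while the SWA sublayer recalls recent tokens exactly; every sublayer is
    # pre-normalized and residual.
    def __init__(self, d_model=1024, n_heads=8, window=2048):
        super().__init__()
        self.mamba = Mamba(d_model=d_model)
        self.swa = SlidingWindowAttention(d_model, n_heads, window)
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(2)
        ])

    def forward(self, x):
        x = x + self.mamba(self.norms[0](x))    # selective SSM sublayer
        x = x + self.mlps[0](self.norms[1](x))  # MLP sublayer
        x = x + self.swa(self.norms[2](x))      # sliding-window attention sublayer
        x = x + self.mlps[1](self.norms[3](x))  # MLP sublayer
        return x

A full Samba-style model would stack many such blocks beneath an embedding layer and a language-model head; the authors' actual training code is available at the GitHub link given in the abstract.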