Anderson, P. W. (1972). More is different. *Science*, *177*(4047), 393–396. https://doi.org/10.1126/science.177.4047.393

Bavarian, M., Jun, H., Tezak, N., Schulman, J., McLeavey, C., Tworek, J., & Chen, M. (2022). *Efficient training of language models to fill in the middle*. arXiv. https://doi.org/10.48550/ARXIV.2207.14255

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. *Journal of Machine Learning Research*, *3*, 1137–1155.

Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G. B., Lespiau, J.-B., Damoc, B., Clark, A., De Las Casas, D., Guy, A., Menick, J., Ring, R., Hennigan, T., Huang, S., Maggiore, L., Jones, C., Cassirer, A., … Sifre, L. (2022). Improving language models by retrieving from trillions of tokens. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), *Proceedings of the 39th international conference on machine learning* (Vol. 162, pp. 2206–2240). PMLR. https://proceedings.mlr.press/v162/borgeaud22a.html

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). *Language models are few-shot learners*. arXiv. https://doi.org/10.48550/ARXIV.2005.14165

Chan, S. C. Y., Santoro, A., Lampinen, A. K., Wang, J. X., Singh, A., Richemond, P. H., McClelland, J., & Hill, F. (2022). *Data distributional properties drive emergent in-context learning in transformers*. arXiv. https://doi.org/10.48550/ARXIV.2205.05055

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. *Neural Networks*, *4*(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T

Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., & Grave, E. (2022). *Few-shot learning with retrieval augmented language models*. arXiv. https://doi.org/10.48550/ARXIV.2208.03299

Karpathy, A. (2015). *The unreasonable effectiveness of recurrent neural networks*. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, 66–71. https://doi.org/10.18653/v1/D18-2012

Lazaridou, A., Gribovskaya, E., Stokowiec, W., & Grigorev, N. (2022). *Internet-augmented language models through few-shot prompting for open-domain question answering*. arXiv. https://doi.org/10.48550/ARXIV.2203.05115

Marr, D., & Poggio, T. (1976). *From understanding computation to understanding neural circuitry*. Massachusetts Institute of Technology.

Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. *Advances in Neural Information Processing Systems*, *35*.

Meng, K., Sen Sharma, A., Andonian, A., Belinkov, Y., & Bau, D. (2022). Mass editing memory in a transformer. *arXiv Preprint arXiv:2210.07229*.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). *Efficient estimation of word representations in vector space*. arXiv. https://doi.org/10.48550/ARXIV.1301.3781

Min, S., & Xie, S. M. (2022). *How does in-context learning work? A framework for understanding the differences from traditional supervised learning*. http://ai.stanford.edu/blog/understanding-incontext/

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., … Olah, C. (2022). In-context learning and induction heads. *Transformer Circuits Thread*.

Savinov, N., Chung, J., Binkowski, M., Elsen, E., & Oord, A. van den. (2022). Step-unrolled denoising autoencoders for text generation. *International Conference on Learning Representations*. https://openreview.net/forum?id=T0GpzBQ1Fg6

Schunk, D. H. (2011). *Learning theories* (6th ed.). Pearson.

Strudel, R., Tallec, C., Altché, F., Du, Y., Ganin, Y., Mensch, A., Grathwohl, W., Savinov, N., Dieleman, S., Sifre, L., & Leblond, R. (2022). *Self-conditioned embedding diffusion for text generation*. arXiv. https://doi.org/10.48550/ARXIV.2211.04236

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), *Advances in neural information processing systems* (Vol. 30). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf