References

[1]

A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.

[2]

J. Kaplan et al., “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.

[3]

J. Hoffmann, S. Borgeaud, A. Mensch, et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022.

[4]

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, 2020.

[5]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” ICLR, 2021.

[6]

E. J. Hu et al., “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.

[7]

L. Ouyang, J. Wu, X. Jiang, et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, 2022.

[8]

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” NeurIPS, 2023.

[9]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[10]

R. Sutton, “The bitter lesson,” Incomplete Ideas (blog), 2019.

[11]

N. Tinbergen, “On aims and methods of ethology,” Zeitschrift für Tierpsychologie, vol. 20, no. 4, pp. 410–433, 1963.

[12]

K. Lorenz, The foundations of ethology. Springer-Verlag, 1981.

[13]

K. von Frisch, The dance language and orientation of bees. Harvard University Press, 1967.