Amazon's AlexaTM 20B Model Outperforms GPT-3 on NLP Benchmarks

Researchers at Amazon Alexa AI have announced Alexa Teacher Models (AlexaTM 20B), a 20-billion-parameter sequence-to-sequence (seq2seq) language model that exhibits state-of-the-art performance on 1-shot and few-shot NLP tasks. AlexaTM 20B outperforms GPT-3 on the SuperGLUE and SQuADv2 benchmarks with fewer than one-eighth as many parameters.

The model and experiments were described in an Amazon Science whitepaper. Unlike other large decoder-only language models such as GPT-3 and PaLM, AlexaTM 20B is a seq2seq model; that is, it contains an encoder as well as a decoder. The encoder stage gives AlexaTM 20B better performance on summarization and machine translation (MT) tasks than larger decoder-only models such as PaLM. The model is multilingual and achieves state-of-the-art performance on few-shot MT tasks on the Flores-101 dataset, even on low-resource languages. According to co-author Saleh Soltan,

All in all, we demonstrated in our work that the proposed style of pretraining enables seq2seq models that outperform much larger decoder-only LLMs across different tasks, both in a few-shot setting and with fine-tuning. We hope our work presents a compelling case for seq2seq models as a powerful alternative to decoder-only models for LLM training.
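
AlexaTM 20B itself has not been released, but the 1-shot prompting pattern the paper describes works with any encoder-decoder model. The following is a minimal sketch using the Hugging Face transformers library, with the publicly available google/flan-t5-small checkpoint standing in for a seq2seq model; the checkpoint and prompt are illustrative assumptions, not the paper's setup.

    # Minimal sketch of 1-shot prompting with an encoder-decoder (seq2seq) model.
    # AlexaTM 20B is not publicly available, so google/flan-t5-small serves as a
    # stand-in checkpoint; the prompting pattern, not the model, is the point.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "google/flan-t5-small"  # assumption: any seq2seq checkpoint works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # One demonstration (the "1-shot") followed by the query to complete.
    prompt = (
        "Translate English to German.\n"
        "English: The weather is nice today. German: Das Wetter ist heute schön.\n"
        "English: Where is the train station? German:"
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))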

The Alexa research team noted that their work is subject to several constraints that do not generally apply to other language models. The Alexa digital assistant supports multiple languages, and its input text is "spoken-form," which can differ from the written text used in training datasets. Further, because the model is intended to run on edge devices, memory is at a premium and inference must be low-latency; both constraints favor smaller models.
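
As a toy illustration of the spoken-form constraint (this is not Amazon's pipeline; production systems use much richer inverse text normalization), the sketch below spells out digits the way a speech recognizer might emit them:

    # Hypothetical written-to-spoken-form conversion, illustrating why spoken-form
    # input differs from the written text in typical training corpora.
    import re

    NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                 "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def to_spoken_form(text: str) -> str:
        # Spell out each digit and lowercase, as ASR output often is.
        text = re.sub(r"\d", lambda m: f" {NUM_WORDS[m.group()]} ", text)
        return re.sub(r"\s+", " ", text).strip().lower()

    print(to_spoken_form("Set a timer for 5 minutes"))
    # -> "set a timer for five minutes"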

To further reduce model size, the Amazon team investigated knowledge distillation. In a paper to be presented at the upcoming Knowledge Discovery and Data Mining Conference (KDD), the researchers demonstrated using a large model as a teacher, then training smaller student models that were only 0.2% of the teacher's size (for example, 17M parameters vs. 9.3B).
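
The paper's exact recipe is not reproduced here, but the generic distillation objective is standard: the student is trained to match the teacher's temperature-softened output distribution, usually blended with the ordinary cross-entropy loss on gold labels. A minimal PyTorch sketch, with all hyperparameter values assumed:

    # Generic knowledge-distillation loss (a sketch, not Amazon's exact method):
    # KL divergence between temperature-softened teacher and student distributions,
    # mixed with standard cross-entropy against the gold labels.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):  # assumed hyperparameters
        t = temperature
        soft = F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)  # rescale so gradient magnitudes stay comparable across temperatures
        hard = F.cross_entropy(
            student_logits.reshape(-1, student_logits.size(-1)),
            labels.reshape(-1),
        )
        return alpha * soft + (1 - alpha) * hard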

The researchers evaluated the 20B teacher model on several NLP benchmarks. On the MLSum benchmark, AlexaTM outperformed the state of the art for 1-shot summarization in German, Spanish, and French; it also led on 1-shot MT tasks for most language pairs. In particular, on low-resource languages such as Telugu, Tamil, and Marathi, the improvement was "significant." The model outperformed GPT-3 on MT tasks "in most English centric cases." Although the model outperformed GPT-3 on most SuperGLUE NLP tasks, it trailed Google's much larger PaLM model.
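
Machine-translation comparisons like these are conventionally scored with BLEU via the sacrebleu package; the snippet below is a generic scoring sketch (the paper's exact evaluation settings may differ), with placeholder strings:

    # Generic corpus-level BLEU scoring with sacrebleu (pip install sacrebleu).
    # The hypothesis and reference strings here are placeholders.
    import sacrebleu

    hypotheses = ["Das Wetter ist heute schön."]    # model outputs, one per segment
    references = [["Das Wetter ist heute schön."]]  # one list per reference set

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU: {bleu.score:.1f}")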

Several users discussed the work in a thread on Hacker News. One pointed out the advantages of AlexaTM 20B over GPT-3:

Building a model downstream of GPT-3 is difficult and usually yields suboptimal results; however, 20b is small enough that it would be easy to finetune this on a smaller dataset for a specific task. You could then distill that model and end up with something that’s a fraction of the size (6b parameters for example, just under 1/3, would fit on commercial GPUs like 3090s).

The AlexaTM 20B model has not yet been publicly released, but the researchers created a repository for it on GitHub and note that it will be released soon.
