HuggingFace's Transformers: State-of-the-art Natural Language Processing

  • 2020-02-11 14:42:10
  • Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Jamie Brew

Abstract

Recent advances in modern Natural Language Processing (NLP) research have been dominated by the combination of Transfer Learning methods with large-scale language models, in particular based on the Transformer architecture. With them came a paradigm shift in NLP, with the starting point for training a model on a downstream task moving from a blank specific model to a general-purpose pretrained architecture. Still, creating these general-purpose models remains an expensive and time-consuming process, restricting the use of these methods to a small sub-set of the wider NLP community. In this paper, we present HuggingFace's Transformers library, a library for state-of-the-art NLP, making these developments available to the community by gathering state-of-the-art general-purpose pretrained models under a unified API together with an ecosystem of libraries, examples, tutorials and scripts targeting many downstream NLP tasks. HuggingFace's Transformers library features carefully crafted model implementations and high-performance pretrained weights for two main deep learning frameworks, PyTorch and TensorFlow, while supporting all the necessary tools to analyze, evaluate and use these models in downstream tasks such as text/token classification, question answering and language generation, among others. The library has gained significant organic traction and adoption among both the researcher and practitioner communities. At HuggingFace, we are committed to continuing the development of this toolkit with the ambition of creating the standard library for building NLP systems. HuggingFace's Transformers library is available at https://github.com/huggingface/transformers.

 

HuggingFace Inc., Brooklyn, USA
NAVER LABS Europe, Grenoble, France

{first-name}@huggingface.co

1 Introduction

In the past 18 months, advances on many Natural Language Processing (NLP) tasks have been dominated by deep learning models and, more specifically, the use of Transfer Learning methods (ruder2019transfer) in which a deep neural network language model is pretrained on a web-scale unlabelled text dataset with a general-purpose training objective before being fine-tuned on various downstream tasks. Following noticeable improvements using Long Short-Term Memory (LSTM) architectures (Howard2018UniversalLM; peters2018deep), a series of works combining Transfer Learning methods with large-scale Transformer architectures (Vaswani2017AttentionIA) has repeatedly advanced the state-of-the-art on NLP tasks ranging from text classification (Yang2019XLNetGA), language understanding (Liu2019RoBERTaAR; Wang2018GLUEAM; Wang2019SuperGLUEAS), machine translation (Lample2019CrosslingualLM), and zero-shot language generation (Radford2019LanguageMA) up to co-reference resolution (joshi2019spanbert) and commonsense inference (Bosselut2019COMETCT).

While this approach has shown impressive improvements on benchmarks and evaluation metrics, the exponential increase in the size of the pretraining datasets as well as the model sizes (Liu2019RoBERTaAR; shoeybi2019megatron) has made it both difficult and costly for researchers and practitioners with limited computational resources to benefit from these models. For instance, RoBERTa (Liu2019RoBERTaAR) was trained on 160 GB of text using 1024 V100 GPUs with 32 GB of memory each. On Amazon Web Services (AWS) cloud computing, such a pretraining run would cost approximately 100K USD.

Contrary to this trend, the booming research in Machine Learning in general and Natural Language Processing in particular is arguably explained in large part by a strong focus on knowledge sharing and large-scale community efforts resulting in the development of standard libraries, an increased availability of published research code and strong incentives to share state-of-the-art pretrained models. The combination of these factors has allowed researchers to reproduce previous results more easily, investigate current approaches and test hypotheses without having to redevelop them first, and focus their efforts on formulating and testing new hypotheses.

To bring Transfer Learning methods and large-scale pretrained Transformers back into the realm of these best practices, the authors (and the community of contributors) have developed Transformers, a library for state-of-the-art Natural Language Processing with Transfer Learning models. Transformers addresses several key challenges:

Sharing is caring

Transformers gathers, in a single place, state-of-the-art architectures for both Natural Language Understanding (NLU) and Natural Language Generation (NLG) with model code and a diversity of pretrained weights. This allows a form of training-computation-cost-sharing so that low-resource users can reuse pretrained models without having to train them from scratch. These models are accessed through a simple and unified API that follows a classic NLP pipeline: setting up configuration, processing data with a tokenizer and encoder, and using a model either for training (adaptation in particular) or inference. The model implementations provided in the library follow the original computation graphs and are tested to ensure they match the performance of the original authors’ implementations on various benchmarks.

Easy-access and high-performance

Transformers was designed with two main goals in mind: (i) be as easy and fast to use as possible and (ii) provide state-of-the-art models with performance as close as possible to the originally reported results. To ensure a low entry barrier, the number of user-facing abstractions to learn was kept to just three standard classes: configuration, models and tokenizers, all of which can be initialized in a simple and unified way by using a common ‘from_pretrained()‘ instantiation method.

Interpretability and diversity

There is a growing field of study, sometimes referred to as BERTology after BERT (Devlin2018BERTPO), concerned with investigating the inner workings of large-scale pretrained models and trying to build a science on top of these empirical results. Some examples include Tenney2019BERTRT, Michel2019AreSH and clark2019what. Transformers aims at facilitating and increasing the scope of these studies by (i) giving easy access to the inner representations of these models, notably the hidden states, the attention weights and the head importance as defined in Michel2019AreSH, and (ii) providing different models in a unified API to prevent overfitting to a specific architecture (and set of pretrained weights). Moreover, the unified front-end of the library makes it easy to compare the performance of several architectures on a common language understanding benchmark. Transformers notably includes pre-processors and fine-tuning scripts for GLUE (Wang2018GLUEAM), SuperGLUE (Wang2019SuperGLUEAS) and SQuAD1.1 (Rajpurkar2016SQuAD10).

Pushing best practices forward

Transformers seeks a balance between sticking to the original authors’ code-base for reliability and providing clear and readable implementations featuring best practices in training deep neural networks, so that researchers can seamlessly use the code-base to explore new hypotheses derived from these models. To accommodate a large community of practitioners and researchers, the library is deeply compatible with (and actually makes compatible) two major deep learning frameworks: PyTorch (paszke2017automatic) and TensorFlow (from release 2.0) (tensorflow2015-whitepaper).

From research to production

Another essential question is how to make these advances in research available to a wider audience, especially in the industry. Transformers also takes steps towards a smoother transition from research to production. The provided models support TorchScript, a way to create serializable and optimizable models from PyTorch code, and the library features production code and integration with the TensorFlow Extended framework.

2 Community

The development of Transformers originally stemmed from open-sourcing internal tools used at HuggingFace, but the library has seen a huge growth in scope over its ten months of existence, as reflected by its successive name changes: from pytorch-pretrained-bert to pytorch-transformers to, finally, Transformers.

A fast-growing and active community of researchers and practitioners has gathered around Transformers. The library has quickly become used both in research and in the industry: at the moment, more than 200 research papers report using the library (http://search.arxiv.org:8081/?query=huggingface&qid=1565055415921multi_nCnN_-1835167213&byDate=1). Transformers is also included either as a dependency or with a wrapper in several popular NLP frameworks such as SpaCy (spacy2), AllenNLP (Gardner2017AllenNLP) or Flair (akbik2018coling).

Transformers is an ongoing effort maintained by the team of engineers and research scientists at HuggingFace (https://huggingface.co), with support from a vibrant community of more than 120 external contributors. We are committed to the twin efforts of developing the library and fostering positive interaction among its community members, with the ambition of creating the standard library for modern deep learning NLP.

Transformers is released under the Apache 2.0 license and is available through pip or from source on GitHub (https://github.com/huggingface/transformers). Detailed documentation, along with on-boarding tutorials, is available on HuggingFace’s website (https://huggingface.co/transformers/).

3 Library design

Transformers has been designed around a unified frontend for all the models: parameters and configurations, tokenization, and model inference. These steps reflect the recurring questions that arise when building an NLP pipeline: defining the model architecture, processing the text data and finally, training the model and performing inference in production. In the following section, we’ll give an overview of the three base components of the library: configuration, model and tokenization classes. All of the components are compatible with PyTorch and TensorFlow (starting 2.0). For complete details, we refer the reader to the documentation available on https://huggingface.co/transformers/.

3.1 Core components

All the models follow the same philosophy of abstraction enabling a unified API in the library.

Configuration - A configuration class instance (usually inheriting from a base class ‘PretrainedConfig‘) stores the model and tokenizer parameters (such as the vocabulary size, the hidden dimensions, dropout rate, etc.). This configuration object can be saved and loaded for reproducibility or simply modified for architecture search.

The configuration defines the architecture of the model but also architecture optimizations like the heads to prune. Configurations are agnostic to the deep learning framework used.
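As a minimal sketch of saving, reloading and modifying a configuration, assuming the `transformers` package is installed: the hyper-parameter values below are deliberately tiny and purely illustrative (they do not correspond to any released checkpoint), and no framework-specific code is involved.

```python
import tempfile

from transformers import BertConfig

# Tiny illustrative values (an assumption for this sketch);
# real checkpoints use much larger dimensions.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Serialize the configuration to a JSON file inside tmp_dir.
    config.save_pretrained(tmp_dir)
    # Reload it from disk, e.g. to reproduce an earlier experiment.
    reloaded = BertConfig.from_pretrained(tmp_dir)

# Change a single attribute to explore an architecture variant.
reloaded.num_hidden_layers = 4
```

The reloaded object carries the same values as the original one, so only the edited attribute differs between the two configurations.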

Tokenizers - A Tokenizer class (inheriting from a base class ‘PreTrainedTokenizer‘) is available for each model. This class stores the vocabulary token-to-index map for the corresponding model and handles the encoding and decoding of input sequences according to the model’s tokenization-specific process (e.g. Byte-Pair Encoding, SentencePiece, etc.). Tokenizers are easily modifiable to add user-selected tokens, special tokens (like classification or separation tokens) or resize the vocabulary.

Furthermore, Tokenizers implement additional useful features for the users by offering values to be used with a model. These range from token type indices in the case of sequence classification to truncation to a maximum sequence length that takes into account the added model-specific special tokens (most pretrained Transformers models have a maximum sequence length they can handle, defined during their pretraining step).

Tokenizers can be instantiated from existing configurations available through Transformers originating from the pretrained models or created more generally by the user from user-specifications.
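The responsibilities described above (vocabulary map, encoding and decoding, special tokens, token type indices, truncation that accounts for special tokens) can be illustrated with a toy whitespace tokenizer. This is a hypothetical sketch in plain Python: the class `ToyTokenizer`, its methods and its vocabulary are assumptions made for illustration and are not the library's implementation.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer; NOT the library's implementation."""

    def __init__(self, vocab, max_len=8):
        self.token_to_id = dict(vocab)
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}
        self.max_len = max_len  # maximum length, special tokens included

    def add_tokens(self, new_tokens):
        # Grow the vocabulary with user-selected tokens.
        for token in new_tokens:
            if token not in self.token_to_id:
                index = len(self.token_to_id)
                self.token_to_id[token] = index
                self.id_to_token[index] = token

    def encode(self, text_a, text_b=None):
        unk = self.token_to_id["[UNK]"]
        ids_a = [self.token_to_id.get(t, unk) for t in text_a.split()]
        ids_b = [self.token_to_id.get(t, unk) for t in text_b.split()] if text_b else []
        # Reserve room for [CLS] and one [SEP] per segment before truncating.
        num_special = 3 if ids_b else 2
        while len(ids_a) + len(ids_b) + num_special > self.max_len:
            # Drop the last token of the longer segment.
            longer = ids_a if len(ids_a) >= len(ids_b) else ids_b
            longer.pop()
        cls, sep = self.token_to_id["[CLS]"], self.token_to_id["[SEP]"]
        input_ids = [cls] + ids_a + [sep]
        token_type_ids = [0] * len(input_ids)
        if ids_b:
            input_ids += ids_b + [sep]
            token_type_ids += [1] * (len(ids_b) + 1)
        return {"input_ids": input_ids, "token_type_ids": token_type_ids}

    def decode(self, ids):
        specials = {"[CLS]", "[SEP]"}
        tokens = [self.id_to_token[i] for i in ids]
        return " ".join(t for t in tokens if t not in specials)


vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "cat": 4,
         "sat": 5, "a": 6, "dog": 7, "ran": 8}
tokenizer = ToyTokenizer(vocab, max_len=8)
encoded = tokenizer.encode("the cat sat", "a dog ran")
decoded = tokenizer.decode(encoded["input_ids"])
```

With a maximum length of 8 and three special tokens, only five of the six word tokens fit, so one token is dropped from the first segment before the special tokens are added.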

Model - All models follow the same hierarchy of abstraction: a base class implements the model’s computation graph from encoding (projection on the embedding matrix) through the series of self-attention layers and up to the last layer hidden states. The base class is specific to each model and closely follows the original implementation, allowing users to dissect the inner workings of each individual architecture.

Additional wrapper classes are built on top of the base class, adding a specific head on top of the base model hidden states. Examples of these heads are language modeling or sequence classification heads. These classes follow a similar naming pattern, XXXForSequenceClassification or XXXForMaskedLM, where XXX is the name of the model, and they can be used for adaptation (fine-tuning) or pre-training.

All models are available both in PyTorch and TensorFlow (starting 2.0) and offer deep inter-operability between both frameworks. For instance, a model trained in one of the frameworks can be saved to disk following the library’s standard serialization practice and then be reloaded from the saved files in the other framework seamlessly, making it particularly easy to switch from one framework to the other along the model life-time (training, serving, etc.).

Auto classes - In many cases, the architecture to use can be automatically guessed from the shortcut name of the pretrained weights (e.g. ‘bert-base-cased‘). A set of Auto classes provides a unified API that enables very fast switching between different models/configs/tokenizers. There are a total of four high-level abstractions referred to as Auto classes: AutoConfig, AutoTokenizer, AutoModel (for PyTorch) and TFAutoModel (for TensorFlow). These classes automatically instantiate the right configuration, tokenizer or model class instance from the name of the pretrained checkpoints.
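The idea of guessing an architecture from a checkpoint shortcut name can be sketched as a simple pattern lookup. The classes and the `auto_model_class` helper below are hypothetical stand-ins written for illustration; they are a simplified picture of name-based dispatch and not the library's actual code.

```python
# Hypothetical stand-ins for per-architecture classes (not the library's classes).
class BertLikeModel:
    pass


class DistilBertLikeModel:
    pass


class Gpt2LikeModel:
    pass


# Order matters: "distilbert" must be tested before "bert", its substring.
NAME_PATTERNS = [
    ("distilbert", DistilBertLikeModel),
    ("bert", BertLikeModel),
    ("gpt2", Gpt2LikeModel),
]


def auto_model_class(shortcut_name):
    """Return the class whose pattern appears in the checkpoint shortcut name."""
    for pattern, model_class in NAME_PATTERNS:
        if pattern in shortcut_name:
            return model_class
    raise ValueError(f"Unrecognized checkpoint name: {shortcut_name!r}")


selected = auto_model_class("bert-base-cased")
```

Switching to another architecture then only requires changing the shortcut name string, which is the convenience the Auto classes aim to provide.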

3.2 Training

Optimizer - The library provides a few optimization utilities as subclasses of PyTorch ‘torch.optim.Optimizer‘ which can be used when training the models. The additional optimizer currently available is the Adam optimizer (Kingma2014AdamAM) with an additional weight decay fix, also known as ‘AdamW‘ (loshchilov2017fixing).
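To make the "weight decay fix" concrete, here is a minimal sketch of one AdamW update for a single scalar parameter, written in plain Python. It follows the decoupled weight decay formulation of loshchilov2017fixing; the function name `adamw_step`, the state dictionary and the default values are assumptions for this illustration and do not reproduce the library's optimizer code.

```python
import math


def adamw_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    beta1, beta2 = betas
    state["step"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** state["step"])
    v_hat = state["v"] / (1 - beta2 ** state["step"])
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: applied to the parameter directly instead of
    # being added to the gradient, where it would be rescaled by 1/sqrt(v_hat).
    return param - lr * (adam_update + weight_decay * param)


state = {"step": 0, "m": 0.0, "v": 0.0}
new_param = adamw_step(1.0, grad=0.5, state=state, lr=0.1, weight_decay=0.1)
```

With a zero gradient the Adam term vanishes and the parameter simply shrinks by a factor of `1 - lr * weight_decay`, which is the behaviour that distinguishes decoupled decay from L2 regularization folded into the gradient.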

Scheduler - Additional learning rate schedulers are also provided as subclasses of PyTorch ‘torch.optim.lr_scheduler.LambdaLR‘, offering various schedules used for transfer learning and transformers models with customizable options including warmup schedules which are relevant when training with Adam.
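A warmup schedule of this kind boils down to a function that maps a step index to a multiplier of the base learning rate. The sketch below, in plain Python, builds a linear warmup followed by a linear decay; the helper name `linear_warmup_then_decay` and the step counts are assumptions chosen for illustration.

```python
def linear_warmup_then_decay(warmup_steps, total_steps):
    """Build a function mapping a step index to a learning-rate multiplier."""
    def lr_lambda(step):
        if step < warmup_steps:
            # Ramp up linearly from 0 to 1 during warmup.
            return step / max(1, warmup_steps)
        # Then decay linearly from 1 down to 0 at total_steps.
        remaining = total_steps - step
        return max(0.0, remaining / max(1, total_steps - warmup_steps))
    return lr_lambda


schedule = linear_warmup_then_decay(warmup_steps=10, total_steps=110)
multipliers = [schedule(step) for step in (0, 5, 10, 60, 110)]
```

A function of this shape is what PyTorch's `torch.optim.lr_scheduler.LambdaLR` expects as its `lr_lambda` argument, so it could be passed to that scheduler together with an optimizer.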

4 Experimenting with Transformers

In this section, we present some of the major tools and examples provided in the library to experiment on a range of downstream Natural Language Understanding and Natural Language Generation tasks.

4.1 Language understanding benchmarks

The language models provided in Transformers are pretrained with a general purpose training objective, usually a variant of language modeling like standard (sometimes called causal) language modeling as used for instance in Radford2019LanguageMA or masked language modeling as introduced in Devlin2018BERTPO. A pretrained model is often evaluated using wide-range language understanding benchmarks. Transformers includes several tools and scripts to evaluate models on GLUE (Wang2018GLUEAM) and SuperGLUE (Wang2019SuperGLUEAS). These two benchmarks gather a variety of datasets to evaluate natural language understanding systems. Details of the datasets can be found in Appendix A.

Regarding the machine comprehension tasks, the library features evaluations on SQuAD1.1 (Rajpurkar2016SQuAD10) and SQuAD2.0 (Rajpurkar2018KnowWY).

Other currently supported benchmarks include SWAG (Zellers2018SWAGAL), RACE (Lai2017RACELR) and ARC (Clark2018ThinkYH).
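The two pretraining objectives mentioned at the start of this section differ mainly in how inputs and labels are built from a token sequence. The following plain-Python sketch shows both constructions; the function names are hypothetical, and the masking step is a simplification (it always substitutes a mask token and does not reproduce the exact corruption procedure of Devlin2018BERTPO).

```python
import random


def causal_lm_example(tokens):
    """Each position is trained to predict the next token."""
    return tokens[:-1], tokens[1:]


def masked_lm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens; only hidden positions carry a label."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(token)  # the model must recover the original token
        else:
            inputs.append(token)
            labels.append(None)   # position ignored by the loss
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
causal_inputs, causal_labels = causal_lm_example(tokens)
masked_inputs, masked_labels = masked_lm_example(tokens, mask_prob=0.3)
```

In the causal case the labels are the inputs shifted by one position, whereas in the masked case the model sees context on both sides of each hidden token.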

4.2 Language model fine-tuning

Fine-tuning a language model on a downstream text corpus usually leads to significant gains for tasks on this corpus, in particular when the domain is different (domain adaptation). It also significantly reduces the amount of training data required for fine-tuning on a target task in the target domain. Transformers provides simple scripts to fine-tune models on custom text datasets with the option to add or remove tokens from the vocabulary and several other adaptability features.

4.3 Ecosystem

Write with Transformer

Because Natural Language Processing does not have to be serious and boring, the generative capacities of auto-regressive language models available in Transformers are showcased in an intuitive and playful manner. Built by the authors on top of Transformers, Write with Transformer (https://transformer.huggingface.co) is an interactive interface that leverages the generative capabilities of pretrained architectures like GPT, GPT2 and XLNet to suggest text like an auto-completion plugin. Generating samples is also often used to qualitatively (and subjectively) evaluate the generation quality of language models (Radford2019LanguageMA). Given the impact of the decoding algorithm (top-K sampling, beam-search, etc.) on generation quality (Holtzman2019TheCC), Write with Transformer offers various options to dynamically tweak the decoding algorithm and investigate the resulting samples from the model.

Figure 1: Write With Transformer
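As an illustration of one of the decoding algorithms mentioned above, here is a minimal plain-Python sketch of top-K sampling: the next token is drawn only from the K highest-scoring candidates. The function name `top_k_sample`, the temperature parameter and the logits are made up for this example and are not taken from Write with Transformer.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample a token index from the k highest-scoring logits."""
    rng = rng or random.Random()
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights restricted to those candidates (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # made-up scores over a 5-token vocabulary
samples = {top_k_sample(logits, k=2, rng=random.Random(seed)) for seed in range(50)}
```

With `k=2` only the two best-scoring indices can ever be returned, and with `k=1` the procedure reduces to greedy decoding.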

Conversational AI

HuggingFace has been using Transfer Learning with Transformer-based models for end-to-end natural language understanding and text generation in its conversational agent, Talking Dog. The company also demonstrated in fall 2018 that this approach can be used to reach state-of-the-art performance on academic benchmarks, topping by a significant margin the automatic metrics leaderboard of the Conversational Intelligence Challenge 2 held at the Thirty-second Annual Conference on Neural Information Processing Systems (NIPS 2018). The approach used to reach this performance is described in Wolf2019TransferTransfoAT; golovanov2019large and the code and pretrained models, based on the Transformers library, are available online (https://github.com/huggingface/transfer-learning-conv-ai).

Using in production

To facilitate the transition from research to production, all the models in the library are compatible with TorchScript (https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html), an intermediate representation of a PyTorch model that can then be run either in Python in a more efficient way, or in a high-performance environment such as C++. Fine-tuned models can thus be exported to a production-friendly environment.

Optimizing large machine learning models for production is an ongoing effort in the community and there are many current engineering efforts towards that goal. The distillation of large models (e.g. DistilBERT (sanh2019distilbert)) is one of the most promising directions. It lets users of Transformers run more efficient versions of the models, even with strong latency constraints and on inexpensive CPU servers. We also convert Transformers models to Core ML weights that are suitable to be embedded inside a mobile application, to enable on-the-edge machine learning. The corresponding code is also made available (https://github.com/huggingface/swift-coreml-transformers).

Community

Many libraries in NLP and Machine Learning have been created on top of Transformers or have integrated Transformers as a package dependency or through wrappers. At the time of writing, the authors are mainly aware of FastBert (https://github.com/kaushaltrivedi/fast-bert), FARM (https://github.com/deepset-ai/FARM), flair (akbik2018coling; akbik2019naacl), AllenNLP (Gardner2017AllenNLP) and PyText (https://github.com/facebookresearch/pytext), but there are likely more interesting developments to be found, from research and internal projects to production packages.

5 Architectures

Here is a list of architectures for which reference implementations and pretrained weights are currently provided in Transformers. These models fall into two main categories: generative models (GPT, GPT-2, Transformer-XL, XLNet, XLM) and models for language understanding (BERT, DistilBERT, RoBERTa, XLM).

  • BERT (Devlin2018BERTPO) is a bi-directional Transformer-based encoder pretrained with a linear combination of masked language modeling and next sentence prediction objectives.

  • RoBERTa (Liu2019RoBERTaAR) is a replication study of BERT which showed that carefully tuning hyper-parameters and training data size leads to significantly improved results on language understanding.

  • DistilBERT (sanh2019distilbert) is a smaller, faster, cheaper and lighter version of BERT pretrained with knowledge distillation.

  • GPT (Radford2018GPT) and GPT2 (Radford2019LanguageMA) are two large auto-regressive language models pretrained with language modeling. GPT2 showcased zero-shot task transfer capabilities on various tasks such as machine translation or reading comprehension.

  • Transformer-XL (Dai2019TransformerXLAL) introduces architectural modifications enabling Transformers to learn dependency beyond a fixed length without disrupting temporal coherence via segment-level recurrence and relative positional encoding schemes.

  • XLNet (Yang2019XLNetGA) builds upon Transformer-XL and proposes an auto-regressive pretraining scheme combining BERT’s bi-directional context flow with auto-regressive language modeling by maximizing the expected likelihood over permutations of the word sequence.

  • XLM (Lample2019CrosslingualLM) shows the effectiveness of pretrained representations for cross-lingual language modeling (both on monolingual data and parallel data) and cross-lingual language understanding.

We systematically release the model with the corresponding pretraining heads (language modeling, next sentence prediction for BERT) for adaptation using the pretraining objectives. Some models fine-tuned on downstream tasks such as SQuAD1.1 are also available. Overall, more than 30 pretrained weights are provided through the library including more than 10 models pretrained in languages other than English. Some of these non-English pretrained models are multi-lingual models, with two of them being trained on more than 100 languages (https://huggingface.co/transformers/multilingual.html).

6 Related work

The design of Transformers was inspired by earlier libraries on transformers and Natural Language Processing. More precisely, organizing the modules around three main components (configuration, tokenizers and models) was inspired by the design of the tensor2tensor library (tensor2tensor) and the original code repository of BERT (Devlin2018BERTPO) from Google Research, while the concept of providing easy caching for pretrained models stemmed from features of the AllenNLP library (Gardner2017AllenNLP) open-sourced by the Allen Institute for Artificial Intelligence (AI2).

Works related to the Transformers library can be generally organized along three directions, at the intersection of which stands the present library. The first direction includes Natural Language Processing libraries such as AllenNLP (https://allennlp.org/; Gardner2017AllenNLP), SpaCy (https://spacy.io//; spacy2), flair (https://github.com/zalandoresearch/flair; akbik2018coling; akbik2019naacl) or PyText (https://github.com/facebookresearch/pytext). These libraries precede Transformers and target somewhat different use-cases, for instance those with a particular focus on research for AllenNLP or a strong attention to production constraints (in particular with a carefully tuned balance between speed and performance) for SpaCy. The previously mentioned libraries have now been provided with integrations for Transformers, through a direct package dependency for AllenNLP, flair or PyText or through a wrapper called spacy-transformers (https://github.com/explosion/spacy-transformers) for SpaCy.

The second direction concerns lower-level deep-learning frameworks like PyTorch (paszke2017automatic) and TensorFlow (tensorflow2015-whitepaper) which have both been extended with model sharing capabilities or hubs, respectively called PyTorch Hub (https://pytorch.org/hub) and TensorFlow Hub (https://github.com/tensorflow/hub). These hubs are more general and while they offer ways to share models they differ from the present library in several ways. In particular, they provide neither a unified API across models nor standardized ways to access the internals of the models. Targeting a more general machine-learning community, these Hubs lack the NLP-specific user-facing features provided by Transformers like tokenizers, dedicated processing scripts for common downstream tasks and sensible default hyper-parameters for high performance on a range of language understanding and generation tasks.

The last direction is related to machine learning research frameworks that are specifically used to test, develop and train architectures like Transformers. Typical examples are the tensor2tensor library (https://github.com/tensorflow/tensor2tensor; tensor2tensor), fairseq (https://github.com/pytorch/fairseq; ott2019fairseq) and Megatron-LM (https://github.com/NVIDIA/Megatron-LM). These libraries are usually not provided with the user-facing features that allow easy download, caching and fine-tuning of the models as well as seamless transition to production.

7 Conclusion

We have presented the design and the main components of Transformers, a library for state-of-the-art NLP. Its capabilities, performance and unified API make it easy for both practitioners and researchers to access various large-scale language models, build and experiment on top of them and use them in downstream tasks with state-of-the-art performance. The library has gained significant organic traction since its original release and has become widely adopted among researchers and practitioners, fostering an active community of contributors and an ecosystem of libraries building on top of the provided tools. We are committed to supporting this community and making recent developments in transfer learning for NLP both accessible and easy to use while maintaining high standards of software engineering and machine learning engineering.

References

Appendix A GLUE and SuperGLUE

The datasets in GLUE are: CoLA (warstadt2018neural), Stanford Sentiment Treebank (SST) (socher2013recursive), Microsoft Research Paraphrase Corpus (MRPC) (dolan2005automatically), Semantic Textual Similarity Benchmark (STS) (agirre2007semantic), Quora Question Pairs (QQP) (WinNT), Multi-Genre NLI (MNLI) (williams2018broad), Question NLI (QNLI) (Rajpurkar2016SQuAD10), Recognizing Textual Entailment (RTE) (dagan2006pascal; bar2006second; giampiccolo2007third; bentivogli2009fifth) and Winograd NLI (WNLI) (levesque2011winograd).

The datasets in SuperGLUE are: Boolean Questions (BoolQ) (clark2019boolq), CommitmentBank (CB) (demarneffe:cb), Choice of Plausible Alternatives (COPA) (roemmele2011choice), Multi-Sentence Reading Comprehension (MultiRC) (khashabi2018looking), Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) (zhang2018record), Word-in-Context (WiC) (pilehvar2018wic), Winograd Schema Challenge (WSC) (rudinger2018winogender), Diverse Natural Language Inference Collection (DNC) (poliak2018dnc), Recognizing Textual Entailment (RTE) (dagan2006pascal; bar2006second; giampiccolo2007third; bentivogli2009fifth) and Winograd NLI (WNLI) (levesque2011winograd).