
Hugging Face NLP Course

Hugging Face offers a natural language processing (NLP) course that covers the libraries, models, and datasets within their ecosystem. The Data Science & Machine Learning Collaborative Learning Group meetup is currently working through this course; the group meets on the fourth Tuesday of every month and covers about one chapter per meeting.


There are 12 chapters planned in the course, of which 9 have been written.

Chapter 1 - Transformer Models

November 26, 2024

Introduction

This course teaches about Natural Language Processing (NLP) using libraries from the Hugging Face ecosystem.


Chapters 1-4 provide an introduction to the Hugging Face Transformers library, including how they work, how to use the models from the hub, how to fine-tune models, and how to share results on the hub.

 

Chapters 5-8 teach the basics of Hugging Face datasets and tokenizers before diving into classic NLP tasks.

 

Chapters 9-12 go beyond NLP and explore how Transformer models can tackle tasks in speech recognition and computer vision.

 

This course requires a good knowledge of Python and a fair knowledge of deep learning, but does not require any in-depth knowledge of PyTorch or TensorFlow.

Natural Language Processing

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP is to understand the context of words, not just single words in isolation. Common tasks in NLP include:

  • Classifying whole sentences - determining the sentiment of a review, detecting spam email, checking the grammatical correctness of a sentence, or deciding whether two sentences are logically related

  • Classifying each word in a sentence - grammatical components of a sentence (noun, verb, adjective) or the named entities (person, location, organization)

  • Generating text content - completing a prompt with auto-generated text or filling in the blanks in a text

  • Extracting an answer from a text - given a question and a context, extract the answer to the question based on the context

  • Generating a new sentence from an input text - translating to another language or summarizing a text


NLP is difficult for computers because the context of very similar sentences can be wildly different.

Transformers, what can they do?

Hugging Face provides a tool in the form of the pipeline() function, which connects a model with its necessary preprocessing and postprocessing steps. In addition, Hugging Face Hub has thousands of pre-trained models published and ready to use. You can also publish your own models.

 

The pipeline() function selects a pre-trained model that has been fine-tuned for the task you're requesting (e.g. sentiment analysis in English). Passing text to the returned pipeline object performs the task on that text (see the example after the list below). There are three main steps:

  • Input text is preprocessed into a format the model can understand

  • The preprocessed inputs are passed to the model

  • The predictions of the model are post-processed, so you (the human) can make sense of them
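
As a rough sketch of what this looks like in code (the library picks a default English sentiment-analysis checkpoint here, and the input text is just an example):

    from transformers import pipeline

    # Creating the pipeline downloads/loads a default sentiment-analysis checkpoint.
    classifier = pipeline("sentiment-analysis")

    # Preprocessing (tokenization), the model forward pass, and postprocessing
    # (turning raw scores into a label and probability) all happen inside this call.
    result = classifier("I love the Hugging Face NLP course!")
    print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]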

 

Many pipelines are available:

  • Feature extraction - get the vector representation of the text

  • Fill mask - fill in the masked word or phrase in the text

  • NER (Named Entity Recognition) - find the named entities in the text, e.g. people, locations, organizations

  • Question answering - given a context and a question in the form of text, deduce the answer to the question

  • Sentiment analysis - analyze the text for positive or negative sentiment

  • Summarization - summarize the input text into output text

  • Text generation - generate some text that logically completes the prompted text as input

  • Translation - translate input text from one language into another

  • Zero-shot classification - given a text and a set of candidate labels, generate a probability distribution over the candidate labels for the text
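
For instance, a zero-shot classification pipeline might be used like this (the text and candidate labels below are arbitrary examples):

    from transformers import pipeline

    # Zero-shot classification scores the text against labels the model was never
    # explicitly trained on.
    classifier = pipeline("zero-shot-classification")
    output = classifier(
        "This is a course about the Transformers library",
        candidate_labels=["education", "politics", "business"],
    )
    print(output["labels"])  # candidate labels sorted by score
    print(output["scores"])  # corresponding probabilities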

 

Pipelines have a default model associated with them, but you can always override the default by passing a specific model when creating the pipeline, as sketched below.
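
A minimal sketch, assuming distilgpt2 is the Hub checkpoint you want to try for text generation (any checkpoint suited to the task would do):

    from transformers import pipeline

    # Override the default text-generation checkpoint with a specific model from the Hub.
    generator = pipeline("text-generation", model="distilgpt2")
    print(generator("In this course, we will teach you how to", max_new_tokens=20))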

How do Transformers work?

The Transformer architecture was introduced in June 2017 in the paper "Attention Is All You Need", and many pre-trained Transformer models have been released since.

 

Most of the released Transformer models are trained as language models. Language models are trained on large amounts of raw text in a self-supervised fashion, where the training objective is computed automatically from the model's inputs, so no humans are needed to label the data. While language models develop a statistical understanding of the language they're trained on, they're not very useful for specific practical tasks on their own. A general pretrained model therefore goes through a process called transfer learning, where it is fine-tuned in a supervised way (using human-annotated labels) on a given task.

 

Unfortunately, pretraining a large model requires a large amount of data, which leads to high compute cost and time, and even a measurable environmental impact. Sharing pre-trained models therefore saves significant cost and reduces the overall carbon footprint. Fine-tuning happens after a model has been pretrained and typically requires a much smaller amount of specialized data relevant to the end purpose of the fine-tuned model. Because the amount of data is smaller, the compute resources, time, and carbon footprint are also smaller.
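
A minimal sketch of what fine-tuning might look like with the Trainer API (covered later in the course); the bert-base-uncased checkpoint and the GLUE MRPC dataset are just example choices:

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # Start from a pretrained checkpoint instead of training from scratch.
    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # A small, labeled, task-specific dataset (paraphrase detection here).
    raw_datasets = load_dataset("glue", "mrpc")

    def tokenize(batch):
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length")

    tokenized = raw_datasets.map(tokenize, batched=True)

    args = TrainingArguments(output_dir="bert-finetuned-mrpc",
                             num_train_epochs=1, per_device_train_batch_size=8)
    trainer = Trainer(model=model, args=args,
                      train_dataset=tokenized["train"],
                      eval_dataset=tokenized["validation"])
    trainer.train()  # far cheaper than the original pretraining run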

 

Models are typically composed of two blocks: Encoder and Decoder.

Encoder Models

The encoder receives an input and builds a representation of it. Encoder-only models are good for sentence classification and named entity recognition.

Decoder Models

The decoder uses the encoder’s representation along with other inputs to generate a target sequence. Decoder-only models are good for text generation.

Sequence-to-Sequence Models

Encoder-decoder (or sequence-to-sequence) models have an encoder and a decoder component. The outputs of the encoder are passed to the decoder. Sequence-to-sequence models are good for translation and summarization.


Attention layers tell the model which words in a sequence to pay particular attention to when building its representation of each word.

  • Each encoder block has a single (self-)attention layer. During training, the encoder's attention has access to all the words in the input. As a result, encoder models are sometimes called auto-encoding models because they have bidirectional attention. BERT-like models are encoder-only.

  • A decoder block has two attention layers - a masked self-attention layer over the output generated so far and, in an encoder-decoder model, a cross-attention layer over the encoder's representation of the input. Because the self-attention can only see the words generated thus far, decoder models are sometimes called auto-regressive models. GPT-like models are decoder-only.

  • A sequence-to-sequence model has both an encoder and a decoder. BART-like models are sequence-to-sequence.
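
One way to make this concrete is to load a representative checkpoint from each family (the specific checkpoints below are just common examples):

    from transformers import pipeline

    # Encoder-only (BERT-like): good at understanding tasks such as fill-mask.
    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    print(fill_mask("Transformers are [MASK] models.")[0]["token_str"])

    # Decoder-only (GPT-like): good at generating text.
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Transformers are", max_new_tokens=10)[0]["generated_text"])

    # Encoder-decoder (BART-like): good at sequence-to-sequence tasks like summarization.
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")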


A model has an architecture and a checkpoint. The architecture is the skeleton of the model - the definition of each layer and each operation that happens within the model. The checkpoint is the set of weights that will be loaded into a given architecture. A model is an umbrella term that could mean an architecture, a checkpoint, or the combination of both.
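
A brief illustration of the distinction, using BERT as an example:

    from transformers import BertConfig, BertModel

    # Architecture only: a BERT-shaped model with randomly initialized weights.
    config = BertConfig()
    untrained_model = BertModel(config)

    # Architecture + checkpoint: the same skeleton with pretrained weights loaded.
    pretrained_model = BertModel.from_pretrained("bert-base-uncased")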

Bias and Limitations

Models inherit the biases of the underlying data they’re trained on. And because of the large data requirement, most models are trained on the best and the worst of what the Internet has to offer. This inherent bias needs to be kept in mind when using these models.
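
The course demonstrates this with a fill-mask example along these lines (the exact completions will vary, but they often reflect gender stereotypes in the training data):

    from transformers import pipeline

    # Compare the occupations the model suggests for "man" vs. "woman".
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    print([r["token_str"] for r in unmasker("This man works as a [MASK].")])
    print([r["token_str"] for r in unmasker("This woman works as a [MASK].")])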
