# Introducing [Huggingface](https://huggingface.co/) 🤗

## Resource for pre-trained models, data, and more

<span style="color:red">Disclaimer: Using models will download model weights onto your machine!</span>

Note: Distilled models are downsized clones of the original model that often achieve comparable performance. For an explanation of the process, see for example [this post](https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f).

## Setup

The following cells only need to be executed once.
To run them, remove the # in the code cells below and execute them.

The ! tells the notebook to execute the following on the command line.

First, install the requirements with Python's package manager (pip3).

Huggingface requires either [pytorch](https://pytorch.org/) or [tensorflow](https://www.tensorflow.org/) as deep learning backend. We will use pytorch here.

In [None]:
#!pip3 install --user torch torchvision torchaudio

Pitfall: Models can also be pre-trained using either pytorch or tensorflow as backend. The resulting weight files will be in different formats!
Huggingface provides an option to use them with either backend, but this requires both backends to be installed.

In [None]:
#!pip3 install --user tensorflow

Last, install the huggingface library.

In [None]:
#!pip3 install --user transformers
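
To see which of the packages above are already available, a small check like the following can help. This is only a sketch using Python's standard library; the helper name `installed_backends` is made up for this example and is not part of huggingface.

```python
from importlib.util import find_spec


def installed_backends(names=("torch", "tensorflow", "transformers")):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = installed_backends()
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If `tensorflow` is reported as missing, loading weight files that were saved with the tensorflow backend will most likely fail.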

## Load backend

Python loads modules using the **import** statement.
We will use the pytorch backend here.

In [None]:
import torch

Note: For the sake of completeness, we will import all the necessary _transformers_ 
classes and functions in each section. 
Usually it is enough to import each of them only once within the notebook or script.

## Using pre-trained models
* Browse pre-trained models on https://huggingface.co/models
* There are official models trained by the huggingface people and user-built models (everyone can upload a model)
* Filter available models by
 * tasks
 * languages
 * training data
 * DL framework
* Find additional information (also on limitations) in model card
 * Check model card content and model "popularity" as a metric for usability
* Test model on Hosted Inference API in browser (at least for some of them)
* Get examples on how to use them
* Check [model class documentation](https://huggingface.co/docs/transformers/model_doc/auto) for further details 


## Using pipelines

* Pipelines provide black-box wrappers for many tasks for very easy use
* See [pipelines documentation](https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines) for further details on how to use them

In the following, we will first see how to use the pipeline wrapper and then look at what actually happens inside it for a few example tasks.


### 1. Using pre-trained models for text completion

#### a. Generate a cloze-style probability distribution for a masked word

In [None]:
# Cloze-style word prediction using masked language models (BERT & co.) with pipeline

from transformers import pipeline

unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

What is happening within this pipeline?

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

In [None]:
# encode input

# The input for BERT-based models always starts with the classification token 
# representing the whole sequence and ends with the separator token 
# marking the end of the sequence.
text = "[CLS] Hello I'm a [MASK] model. [SEP]"

tokenized_text = tokenizer.tokenize(text)
print(f"Tokenized input: {tokenized_text}")

# get index of mask token in tokenized text for extracting predictions later
masked_index = tokenized_text.index("[MASK]")
print(f"Mask token is at position {masked_index}")

# convert tokens to IDs
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(f"Token IDs: {indexed_tokens}")

# transform IDs to torch tensor datatype
tokens_tensor = torch.tensor([indexed_tokens])
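
The conversion from tokens to IDs is essentially a vocabulary lookup. The following is a minimal sketch in plain Python, assuming a tiny made-up vocabulary; the IDs and the helper `tokens_to_ids` are hypothetical and do not correspond to the real DistilBERT vocabulary or tokenizer implementation.

```python
# Toy vocabulary (hypothetical IDs, not the real DistilBERT vocabulary)
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "[MASK]": 3,
         "hello": 4, "i": 5, "'": 6, "m": 7, "a": 8, "model": 9, ".": 10}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


tokens = ["[CLS]", "hello", "i", "'", "m", "a", "[MASK]", "model", ".", "[SEP]"]
ids = tokens_to_ids(tokens, vocab)
mask_position = tokens.index("[MASK]")

print(f"Token IDs: {ids}")
print(f"Mask token is at position {mask_position}")
```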

In [None]:
# get model predictions (= score distribution across all vocabulary tokens for each token)
# we use no_grad because we only want to apply the model, not train it further,
# so no gradients need to be computed
with torch.no_grad():
    outputs = model(tokens_tensor)
    print(f"Model output:\n{outputs}")

In [None]:
predictions = outputs.logits
print(predictions.shape)

In [None]:
# transform scores into a probability distribution for the mask token using softmax
# see https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html#torch-nn-functional-softmax
# for details
print(f"Shape of [MASK] token vector {predictions[0, masked_index].shape}")
probs = torch.nn.functional.softmax(predictions[0, masked_index], dim=-1)

# extract the top 5 matches
top_k_probs, top_k_indices = torch.topk(probs, 5, sorted=True)

# print top 5 matches with their probabilities
for token_prob, token_id in zip(top_k_probs, top_k_indices):
    # convert token id back to token
    # token_id is a tensor with one element, .item() extracts that element
    predicted_token = tokenizer.convert_ids_to_tokens(token_id.item())
    print(f"[MASK]: {predicted_token} | prob: {token_prob}")
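
To make the two steps above (softmax, then top-k) more concrete, here is a small sketch in plain Python without torch. The five-token vocabulary and the scores are invented for illustration and are not actual model output.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)
    # subtracting the maximum does not change the result but avoids overflow
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return the k (index, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for a five-token toy vocabulary
toy_vocab = ["role", "fashion", "business", "toy", "new"]
toy_logits = [2.0, 1.0, 0.5, -1.0, 0.0]

toy_probs = softmax(toy_logits)
for index, prob in top_k(toy_probs, 3):
    print(f"[MASK]: {toy_vocab[index]} | prob: {prob:.3f}")
```

The real model does the same thing, only over a vocabulary with tens of thousands of entries.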

#### b. Sample a continuation of a given prompt

In [None]:
# Text continuation sampling from autoregressive (causal) language models (GPT-2 & co.)
from transformers import pipeline, set_seed

set_seed(42)

generator = pipeline('text-generation', model='distilgpt2')
generator("This is the beginning of a very thrilling story. One day,", max_length=40)

What is happening within this pipeline?

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

# Generation requires sampling, set seed for reproducibility
set_seed(42)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [None]:
text = "This is the beginning of a very thrilling story. One day,"

tokenized_text = tokenizer.tokenize(text)
print(f"Tokenized input: {tokenized_text}")
# the GPT tokenizers apply subword splitting (to increase vocabulary coverage) 
# and each token that was preceded by a space (usually the beginning of a word)
# is marked with the special character Ġ

# convert tokens to IDs
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
print(f"Token IDs: {indexed_tokens}")

# transform IDs to torch tensor datatype
encoded_input = torch.tensor([indexed_tokens])

# short version of tokenize + convert_to_ids + transform into torch tensor 
# + create attention mask (required for padding when processing several sequences of different length at once)
#encoded_input = tokenizer(text, return_tensors='pt')
#print(f"Encoded Input:\n{encoded_input}")
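
The attention mask mentioned above tells the model which positions hold real tokens and which are only padding. A rough sketch of the idea in plain Python, assuming right-padding and made-up token IDs (the helper `pad_batch` is hypothetical; the huggingface tokenizer handles this internally):

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token, 0 = padding that the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[11, 12, 13, 14], [21, 22]])
print(batch["input_ids"])       # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

With a single input sequence, as in this notebook, no padding is needed and the mask consists only of ones.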

In [None]:
# model.generate() repeatedly samples from token distributions, see 
# https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate
# for different strategies (the parameters can also be passed into the pipeline above)
tokens = model.generate(encoded_input, max_length=40, do_sample=True)

# squeeze removes the automatically added batch dimension
# (which is 1 in our case as we only have one input sequence)
output = tokenizer.decode(tokens.squeeze())
print(output)
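
Conceptually, sampling a continuation is a loop: get a probability distribution for the next token, draw one token from it, append it, and repeat. The following sketch illustrates that loop in plain Python. The probability table is invented, and unlike a real language model (which conditions on the whole preceding context), this toy version only looks at the last token.

```python
import random

# Hypothetical next-token distributions for a tiny vocabulary
# (a real model computes these from the whole context with a neural network)
NEXT_TOKEN_PROBS = {
    "One": {"day": 0.9, "night": 0.1},
    "day": {",": 1.0},
    "night": {",": 1.0},
    ",": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.5, "cat": 0.5},
    "dog": {"barked": 1.0},
    "cat": {"slept": 1.0},
}


def sample_continuation(prompt, max_length, seed):
    """Extend the prompt one sampled token at a time until max_length tokens."""
    rng = random.Random(seed)  # fixed seed makes the sampling reproducible
    tokens = list(prompt)
    while len(tokens) < max_length:
        distribution = NEXT_TOKEN_PROBS.get(tokens[-1])
        if distribution is None:  # no known continuation: stop early
            break
        candidates = list(distribution)
        weights = list(distribution.values())
        tokens.append(rng.choices(candidates, weights=weights, k=1)[0])
    return tokens


print(" ".join(sample_continuation(["One"], max_length=6, seed=42)))
```

As with `set_seed` above, using the same seed yields the same continuation, while a different seed may yield a different one.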

------------

### 2. Using fine-tuned models for text classification
* given a sentence (or segment), predict a label for it
* possible labels: 
 * sentiment (positive/negative/neutral), 
 * paraphrase (yes/no)
 * **grammatical acceptability (acceptable/unacceptable)**
 * ...

#### [CoLA: Corpus of Linguistic Acceptability](https://aclanthology.org/Q19-1040/)
* collection of acceptable and unacceptable linguistic examples
* extracted from syntax text books
* used to train models on predicting grammatical acceptability
* [several fine-tuned versions of models](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=cola) available through huggingface
    * select one based on model card/popularity/Inference API

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Here we have a special case:
# We want to use a user-built model whose weights were saved using the tensorflow backend, 
# so we need to specify the from_tf flag to have the weights converted for the pytorch model
classification_model = AutoModelForSequenceClassification.from_pretrained("ccsobral/distilbert-base-uncased-finetuned-cola", from_tf=True)
classification_tokenizer = AutoTokenizer.from_pretrained("ccsobral/distilbert-base-uncased-finetuned-cola")

# We can also explicitly pass a model and tokenizer into the pipeline (instead of just the identifier as above)
classifier = pipeline("text-classification", model=classification_model, tokenizer=classification_tokenizer)
print(classifier("This is a grammatical sentence."))
print(classifier("This grammatical is sentence a."))

What is happening within this pipeline?

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ccsobral/distilbert-base-uncased-finetuned-cola")
# The weights of this model were saved using the tensorflow backend, so we need to specify the 
# from_tf flag to have the weights converted for the pytorch model
model = AutoModelForSequenceClassification.from_pretrained("ccsobral/distilbert-base-uncased-finetuned-cola", from_tf=True)

In [None]:
input_sentence = "This is a grammatical sentence."
#input_sentence = "This grammatical is sentence a."

In [None]:
inputs = tokenizer(input_sentence, return_tensors="pt")
# feed the input into the model and extract the calculated scores at the output layer
with torch.no_grad():
    logits = model(**inputs).logits

# convert into probabilities
probs = torch.nn.functional.softmax(logits, dim=-1)
print(f"Label probabilities: {probs}")

# get the ID of the class with the highest probability
predicted_class_id = probs.argmax().item()
print(f"Predicted class ID: {predicted_class_id}")
# for training the model, classes were mapped to ids as well
# this mapping is stored in the model.config
print(f"Model class mapping: {model.config.id2label}")
# so we can transform this back into the label
print(f"label: {model.config.id2label[predicted_class_id]}, score: {probs.max()}")
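
The step from scores to a labelled prediction can also be traced by hand. The sketch below uses invented scores and an invented label mapping (the actual label names of a model are stored in its `model.config.id2label`) and builds a result in the same label/score form that the pipeline prints.

```python
import math

# Hypothetical output-layer scores and label mapping for a two-class model
logits = [-1.2, 2.3]
id2label = {0: "unacceptable", 1: "acceptable"}

# softmax: exponentiate and normalize so that the values sum to 1
exps = [math.exp(score) for score in logits]
probs = [e / sum(exps) for e in exps]

# argmax: index of the most probable class
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)

result = {"label": id2label[predicted_class_id], "score": probs[predicted_class_id]}
print(result)
```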

See the [Huggingface documentation on how to fine-tune a model yourself](https://huggingface.co/docs/transformers/training)

----------
### 3. Using signals from pre-trained models
#### a. Calculating surprisal

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [None]:
text = "I like my coffee with cream and"

model_input = tokenizer(text, return_tensors="pt")

next_words = ["Ġsugar", "Ġhoney", "Ġdog"]
next_word_ids = [tokenizer.convert_tokens_to_ids(word) for word in next_words]
    
with torch.no_grad():
    outputs = model(**model_input)

predictions = outputs.logits
# apply softmax on last position to get probability distribution for next word
probs = torch.nn.functional.softmax(predictions[0, -1], dim=-1)
# compute surprisals 
surprisals = -torch.log(probs)

for word, word_id in zip(next_words, next_word_ids):
    print(f"{word}\n\tProbability: {probs[word_id]}\n\tSurprisal: {surprisals[word_id]}")
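
Surprisal is the negative logarithm of a probability: the less likely a word, the higher its surprisal. `torch.log` is the natural logarithm, so the values above are in nats; surprisal is also often reported in bits (logarithm base 2). A short sketch with made-up probabilities (not actual model output) shows both units:

```python
import math


def surprisal(probability, base=math.e):
    """Negative log probability; base e gives nats, base 2 gives bits."""
    return -math.log(probability) / math.log(base)


# Hypothetical next-word probabilities (not actual model output)
toy_probs = {"sugar": 0.5, "honey": 0.125, "dog": 0.001}

for word, prob in toy_probs.items():
    print(f"{word}\n\tProbability: {prob}"
          f"\n\tSurprisal: {surprisal(prob):.3f} nats = {surprisal(prob, base=2):.3f} bits")
```

A probability of 0.5 corresponds to exactly 1 bit, and every halving of the probability adds one more bit.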

#### b. Extracting embeddings, hidden representations and attention weights
This goes beyond the scope of this introduction, but if you are interested in what other information one can extract from pre-trained models, you'll find the documentation [here](https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/output#model-outputs).