Bag of Words (BoW) model
In the Bag of Words (BoW) model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
It is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
Specifically,
- create a vocabulary (of length vocab_size) from the dataset
- construct a vector (i.e. transform the input text into a "bag of words"), where
- each word in the vocabulary corresponds to an index of the vector
- each element of the vector is the (frequency of) occurrence of that word in the input text
- take that vector and its label as a "(features, label)" sample
An Example of BoW
Here are two simple text documents:
- John likes to watch movies. Mary likes movies too.
- Mary also likes to watch football games.
We can create a vocabulary:
{“John”,”likes”,”to”,”watch”,”movies”,”Mary”,”too”,”also”,”football”,”games”}
Representing each bag of words as a JSON object, assigned to a variable for each document:
BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};
The corresponding vectors for these two documents (using the vocabulary order above) are:
- [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
- [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]
We can think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.
We can then take a vector and its label as a sample for training a classifier.
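As a quick sketch of the one-hot-sum view above (plain Python, reusing the example vocabulary):
# Build a BoW vector by summing one-hot vectors of the individual words.
vocabulary = ["John", "likes", "to", "watch", "movies", "Mary",
              "too", "also", "football", "games"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}
def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[word_to_index[word]] = 1
    return vec
def bow(words):
    vec = [0] * len(vocabulary)
    for w in words:
        for i, v in enumerate(one_hot(w)):
            vec[i] += v
    return vec
print(bow(["John", "likes", "to", "watch", "movies", "Mary", "likes", "movies", "too"]))
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]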
Drawbacks
not memory-efficient
BoW converts a text from a compact sequence of word indices into a high-dimensional, sparse vector with one dimension per vocabulary word, most of whose entries are zero.
Solutions: embeddings
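A minimal sketch of the embedding idea (the sizes below are illustrative and not tied to the classifier later in this section):
import torch
# An embedding table maps a word index directly to a small dense vector,
# instead of materializing a vocab_size-dimensional one-hot vector.
vocab_size, embed_dim = 95811, 64                  # illustrative sizes
embedding = torch.nn.Embedding(vocab_size, embed_dim)
word_indices = torch.tensor([475, 21, 30, 5297])   # e.g. "here is an example"
dense = embedding(word_indices)
print(dense.shape)                                 # torch.Size([4, 64]), not [4, 95811]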
frequency
In the BoW representation, all word occurrences are weighted equally, regardless of the word itself. However, frequent words such as "a", "in", etc. carry much less information for classification than specialized terms. In fact, in most NLP tasks some words are more relevant than others.
Solutions: Term Frequency–Inverse Document Frequency (TF-IDF)
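A rough sketch of the TF-IDF idea (the exact formula varies between libraries; this is not the scikit-learn or torchtext implementation):
import math
# TF-IDF down-weights words that appear in many documents.
docs = [["john", "likes", "to", "watch", "movies", "mary", "likes", "movies", "too"],
        ["mary", "also", "likes", "to", "watch", "football", "games"]]
def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)            # term frequency within this document
    df = sum(1 for d in docs if word in d)     # number of documents containing the word
    idf = math.log(len(docs) / df)             # rarer across documents => larger weight
    return tf * idf
print(tf_idf("movies", docs[0], docs))   # > 0: "movies" occurs only in the first document
print(tf_idf("likes", docs[0], docs))    # 0.0: "likes" occurs in every document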
word independence
Each word is treated independently of every other word, i.e. one-hot encoded vectors do not express any semantic similarity between words.
Solutions: embeddings, …
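A small sketch of the problem (the word indices are arbitrary):
import torch
# One-hot vectors of two different words are always orthogonal,
# so they express no semantic similarity at all.
vocab_size = 10
movies = torch.nn.functional.one_hot(torch.tensor(4), vocab_size).float()
films  = torch.nn.functional.one_hot(torch.tensor(8), vocab_size).float()
print(torch.nn.functional.cosine_similarity(movies, films, dim=0))   # tensor(0.)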
word ordering
BoW discards word order entirely, so e.g. "John likes Mary" and "Mary likes John" receive exactly the same representation.
Solutions: n-grams, …
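A minimal sketch of the n-gram idea: treat short runs of adjacent words as vocabulary items, so some local word order is preserved:
# Bigrams keep local word order that plain BoW throws away.
def ngrams(tokens, n=2):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
print(ngrams(["john", "likes", "to", "watch", "movies"]))
# [('john', 'likes'), ('likes', 'to'), ('to', 'watch'), ('watch', 'movies')]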
Text Classification via BoW
Here is an example classifier that assigns text documents to different classes. It takes the BoW vector of a document as input.
dataset: AG_NEWS
create a vocabulary from the dataset
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
train_iter = AG_NEWS(root="./data", split="train")
tokenizer = get_tokenizer("basic_english")
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
# example
# vocab(["here", "is", "an", "example"])
# output: [475, 21, 30, 5297]
create a function to transform a text document into a BoW vector
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1
encode = text_pipeline
import torch
vocab_size = len(vocab)
def to_bow(text, bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size, dtype=torch.float32)
    for i in encode(text):
        if i < bow_vocab_size:
            res[i] += 1
    return res
# example
# b = to_bow("how are you")
# print(b, b.shape)
# output: tensor([0., 0., 0., ..., 0., 0., 0.]) torch.Size([95811])
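The resulting vector is almost entirely zeros; its few non-zero entries can be inspected like this:
# example
# b = to_bow("how are you")
# print(b.nonzero(as_tuple=True)[0], b.sum())
# output: the vocabulary indices of "how", "are" and "you", and tensor(3.)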
train the classifier
from torch.utils.data import DataLoader
import numpy as np
# this collate function gets list of batch_size tuples, and needs to
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
        torch.LongTensor([t[0]-1 for t in b]),
        torch.stack([to_bow(t[1]) for t in b])
    )
# Because the datasets are iterators, if we want to use the
# data multiple times we need to convert them to lists
train_dataset, test_dataset = AG_NEWS(root="./data")
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)
# wrap the datasets in DataLoaders; the bowify collate function converts
# each minibatch of (label, text) pairs into bag-of-words representation
train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))
def train_epoch(net, dataloader, lr=0.01, optimizer=None, loss_fn=torch.nn.NLLLoss(), epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(), lr=lr)
    net.train()
    total_loss, acc, count, i = 0, 0, 0, 0
    for labels, features in dataloader:
        optimizer.zero_grad()
        # forward pass & compute loss
        out = net(features)
        loss = loss_fn(out, labels)  # NLLLoss on log-probabilities, i.e. cross-entropy
        # backward pass to get gradients
        loss.backward()
        # update parameters
        optimizer.step()
        # collect statistics to print
        total_loss += loss
        _, predicted = torch.max(out, 1)
        acc += (predicted == labels).sum()
        count += len(labels)
        i += 1
        if i % report_freq == 0:
            print(f"{count}: acc={acc.item()/count}")
        if epoch_size and count > epoch_size:
            break
    return total_loss.item()/count, acc.item()/count
# specify small epoch_size because of low compute power
train_epoch(net,train_loader,epoch_size=15000)
# logs and output
3200: acc=0.811875
6400: acc=0.845
9600: acc=0.8578125
12800: acc=0.863671875
Out[14]: (0.026340738796730285, 0.8640724946695096)
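Note that test_loader was created above but not used during training; a minimal evaluation sketch over the test set (assuming the same net and bowify collation) could look like this:
def evaluate(net, dataloader):
    net.eval()
    acc, count = 0, 0
    with torch.no_grad():
        for labels, features in dataloader:
            out = net(features)
            # count correct argmax predictions
            acc += (out.argmax(dim=1) == labels).sum().item()
            count += len(labels)
    return acc / count
# evaluate(net, test_loader)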
Let's examine some examples. For AG_NEWS, classes = ['World', 'Sports', 'Business', 'Sci/Tech'].
net.eval()
x = torch.stack([to_bow("Let's play football")]); y = net(x); print(y)
# output:
# tensor([[-1.5573, -0.7191, -1.8730, -1.9076]], grad_fn=<LogSoftmaxBackward0>)
x = torch.stack([to_bow("the stock market")]); y = net(x); print(y)
# output:
# tensor([[-2.8083, -2.6868, -0.5916, -1.1453]], grad_fn=<LogSoftmaxBackward0>)
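The outputs are log-probabilities over the four classes, so the prediction is simply the argmax; a small helper sketch (the classes list follows the AG_NEWS order quoted above):
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
def predict(text):
    # pick the class with the largest log-probability
    with torch.no_grad():
        out = net(to_bow(text).unsqueeze(0))
    return classes[out.argmax(dim=1).item()]
# predict("Let's play football")  -> 'Sports'   (index 1 is largest above)
# predict("the stock market")     -> 'Business' (index 2 is largest above)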