For machines to understand our language, we need to represent words numerically. We can do so with supervised or unsupervised ML techniques. After training our model, we can use the concept of similarity for various tasks, and in doing so, oh surprise, we run into bias! Read on to know more! (Jupyter Notebook included)
In late 2020 I began my exploration of AI. After the excellent introductory course at Columbia University and the Deep Learning lessons at MIT, it became clear to me that I want to pursue my career in this area.
I always liked and enjoyed mathematics, statistics, and other pure sciences while in college. Sadly, I rarely used their concepts during my professional career as a CS Engineer. So it was a pleasant surprise to see calculus, linear algebra, and probability distributions again during class!
After some time exploring the applications of ML, I became very interested in NLP, since my other passions are knowledge, learning, and data/information management. That is why I decided to go back to the classroom. This time, I chose Stanford's excellent CS224N (Natural Language Processing with Deep Learning), taught by the charismatic Professor Manning.
After the first assignment, I thought it would be interesting to record my impressions, experiences, and discoveries here for later reference or because it might be helpful to someone else.
Let's get down to business. In this article, I explore a fundamental concept of NLP: The representation of words so that machines can understand them.
It is easy to represent words as discrete symbols like one-hot vectors. Unfortunately, such representations lack any notion of similarity: every pair of distinct words looks equally unrelated.
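To see why one-hot vectors carry no notion of similarity, here is a minimal sketch. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration; they are not from the notebook.

```python
def one_hot(word, vocab):
    """Return a one-hot vector (list of 0/1) for `word` over `vocab`."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy vocabulary, assumed for illustration only
vocab = ["hotel", "motel", "banana"]
hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)
banana = one_hot("banana", vocab)

print(dot(hotel, motel))   # 0: related words look unrelated
print(dot(hotel, banana))  # 0: same score as an unrelated word
print(dot(hotel, hotel))   # 1: a word only matches itself
```

Any two different words are orthogonal, so the representation cannot tell that hotel is closer to motel than to banana.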
There are better solutions. The most common one represents words as dense vectors whose dimensions show degrees of similarity with other words that usually appear in their context.
A word vector for banking (source: CS224N)
We can implement word vectors using count-based or prediction-based techniques.
In the count-based technique, we use co-occurrence matrices to represent word embeddings. These symmetric matrices count how many times a word appears in the context of another word, where the context is a fixed-size window surrounding the word.
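As a rough illustration of the counting step, the sketch below builds a co-occurrence matrix in plain Python. The function name `co_occurrence_matrix` and the choice of `window_size=1` are assumptions for this example, not the notebook's code.

```python
def co_occurrence_matrix(corpus, window_size=1):
    """Count how often each pair of words appears within `window_size`
    positions of each other. Returns (matrix, word -> row index)."""
    vocab = sorted({word for doc in corpus for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    n = len(vocab)
    matrix = [[0] * n for _ in range(n)]
    for doc in corpus:
        for i, center in enumerate(doc):
            lower = max(0, i - window_size)
            upper = min(len(doc), i + window_size + 1)
            for j in range(lower, upper):
                if j != i:  # a word is not its own context
                    matrix[index[center]][index[doc[j]]] += 1
    return matrix, index


corpus = [
    ['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
    ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>'],
]
M, word2ind = co_occurrence_matrix(corpus, window_size=1)

print(M[word2ind['well']][word2ind['that']])  # 1
```

Because every neighboring pair is counted once in each direction, the resulting matrix is symmetric.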
Given the nature of human language, these matrices are high-dimensional, so we need to apply dimensionality reduction techniques such as SVD, PCA, or Truncated SVD and keep only the top components.
Prediction-based approaches generally perform better than count-based methods. There are two famous techniques to train the models: Stanford's GloVe, which is trained on global word co-occurrence statistics, and the word2vec framework, which trains a model to predict words from their context. Both learn from raw text without manual labels.
word2vec comes in two flavors, both able to learn word embeddings: CBOW (continuous bag-of-words) and SG (skip-gram). The skip-gram architecture is commonly used for its ability to predict rare words. Skip-gram tries to predict the words around a center word, given a parameter vector $\theta$ that contains all the word vectors in the corpus vocabulary.
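To make the prediction task concrete, this small sketch lists the (center word, context word) pairs that a skip-gram model would be trained on. The helper name `skipgram_pairs` and the window size are assumptions for illustration; real implementations add steps such as subsampling and negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for t, center in enumerate(tokens):
        lower = max(0, t - window_size)
        upper = min(len(tokens), t + window_size + 1)
        for j in range(lower, upper):
            if j != t:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["All", "that", "glitters", "isn't", "gold"]
pairs = skipgram_pairs(tokens, window_size=1)

for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training example: given the center word, the model should assign a high probability to the context word.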
The objective function, $J(\theta)$, is the average negative log-likelihood, which we want to minimize through SGD (notice that the vector space is enormous).
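For reference, this is the usual way the skip-gram likelihood and objective are written in the CS224N material. The notation is an assumption on my part: $T$ is the number of words in the corpus, $m$ the window size, $v_c$ the vector of a center word $c$, $u_o$ the vector of an outside (context) word $o$, and $V$ the vocabulary.

```latex
% Likelihood of the context words given each center word
$$L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$

% Objective: average negative log-likelihood
$$J(\theta) = -\frac{1}{T} \log L(\theta)
            = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$

% Probability of an outside word o given a center word c (softmax)
$$P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$
```

With $d$-dimensional vectors and two vectors per word, $\theta$ has $2d|V|$ entries, which is why the space is so large.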
Once we have trained our model, we can use it to make predictions based on the concept of similarity. For this we can use a popular metric called cosine similarity.
With this metric, we can find synonyms, antonyms, and even analogies among words.
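Here is a minimal sketch of cosine similarity in plain Python, together with cosine distance (one minus the similarity). The two-dimensional vectors are toy values chosen for illustration, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy vectors, assumed for illustration only
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

The score only depends on the angle between the vectors, not on their length, so it ranges from -1 (opposite directions) to 1 (same direction).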
Once we have the word embeddings we can explore how well they have captured meaning. For this, we can use Gensim. Gensim is an excellent package to explore the similarity concept. Please check the related Jupyter Notebook to see how to do it in detail. In the next section, I present some highlights from the execution of the mentioned notebook.
numpy: For linear algebra tasks when working with matrices
matplotlib: To display word vectors in 2-D after dimensionality reduction
nltk: To work with the Reuters corpus (10,788 news documents, 1.3 million words in 90 categories; the documents are split into train and test folders)
sklearn: To reduce the dimensionality of the co-occurrence matrices with PCA or Truncated SVD
gensim: To work with the pre-trained GloVe word embeddings
To make things easier, I created a function to determine the words in the vicinity of the center word w_i:
    def _get_vicinity(w_i: str, corpus: list, window_size: int = 4) -> list:
        """
        Returns the context windows located before and after each occurrence
        of the word w_i, given a window_size.

        For example, if the corpus is
        [
            ['<START>', 'All', 'that', 'glitters', "isn't", 'gold', '<END>'],
            ['<START>', "All's", 'well', 'that', 'ends', 'well', '<END>']
        ]
        the vicinity/CONTEXT of "well" with window_size=1 is:
        [["All's"], ['that'], ['ends'], ['<END>']]

        Parameters
        ----------
        - w_i (str): The target word for which we'll determine its vicinity
        - corpus (list): A list of lists, each item represents a document
        - window_size (int): The range where we'll find neighbor words for w_i

        Returns
        -------
        A list of non-empty word lists, one per side of each occurrence of w_i,
        holding the neighbors of w_i given a window_size
        """
        neighbors = []
        for doc in corpus:
            if w_i in doc:
                # The word can appear several times in the document!
                # Indices where w_i appears in the current document
                indices = [idx for idx, value in enumerate(doc) if value == w_i]
                for i in indices:
                    lower_bound = i - window_size if i > window_size else 0
                    upper_bound = i + 1 + window_size
                    # Slicing past the end of a list is safe: x[i: out-of-bounds]
                    neighbors.append(doc[lower_bound: i])
                    neighbors.append(doc[i + 1: upper_bound])
        # Drop the empty windows produced at document boundaries
        return [neighbor for neighbor in neighbors if len(neighbor) > 0]
This is an important note to take into account:
TruncatedSVD returns U*S, so we need to normalize (rescale) the returned vectors so that they all appear around the unit circle. We achieve this normalization through the NumPy concept of broadcasting.
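A minimal sketch of that normalization step follows. To keep it small, it uses `np.linalg.svd` in place of scikit-learn's TruncatedSVD, and the 4x4 matrix is a made-up stand-in for a co-occurrence matrix; the variable names are assumptions for this example.

```python
import numpy as np

# Made-up symmetric count matrix standing in for a co-occurrence matrix
M = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 3.0, 1.0],
    [1.0, 3.0, 0.0, 2.0],
    [0.0, 1.0, 2.0, 0.0],
])

# Full SVD, then keep the top k components (a stand-in for TruncatedSVD)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_reduced = U[:, :k] * S[:k]  # shape (4, 2); S[:k] is broadcast over the rows

# Length of each row vector, shape (4,)
row_norms = np.linalg.norm(M_reduced, axis=1)

# Broadcasting: (4, 2) / (4, 1) divides every row by its own norm
M_normalized = M_reduced / row_norms[:, np.newaxis]

print(np.linalg.norm(M_normalized, axis=1))  # every row now has length ~1
```

The `[:, np.newaxis]` turns the vector of norms into a column, so NumPy can stretch it across both components of every row without an explicit loop.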
I used Gensim to load the glove-wiki-gigaword-200 model and convert it to a word2vec representation. This model contains 400K words in its vocabulary!
The results show a slightly better performance using the Prediction-Based approach (image on the right):
Count and prediction-based plots
In the Prediction-Based Model, we have a main cluster containing most of the words in the sample set. Nevertheless, it is still odd that kuwait is far apart from words like oil. Why is this happening? Perhaps it is because we are using a small sample of 10K GloVe vectors, or it may be related to the truncation done when reducing the dimensionality. Another possibility is that, in the selected 10K GloVe vectors, kuwait doesn't appear frequently enough in the context of the other words.
Gensim's most_similar method allows us to perform the above tasks with ease. This method uses cosine similarity in its implementation. Some highlights about its usage:
Called without extra parameters, most_similar returns the top 10 words most similar to the word passed as argument. Nevertheless, some polysemous words do not work as expected. For example, for branch I couldn't find any tree/plant-related words:

    wv_from_bin.most_similar("branch")

    [('branches', 0.7101372480392456),
     ('central', 0.5476117730140686),
     ('railway', 0.5329204797744751),
     ('established', 0.5197478532791138),
     ('line', 0.5076225399971008),
     ('authority', 0.491929292678833),
     ('offices', 0.48285460472106934),
     ('railroad', 0.4816432297229767),
     ('headquarters', 0.4756273925304413),
     ('department', 0.4709719121456146)]

Why is that? It seems to me that it is related to the dataset used during the training process. The dataset may not have contained varied enough contexts for this word sense.
Synonyms and antonyms do not always come out as expected either. Here the antonym bottom is closer to top than any of the synonyms:

    # wv_from_bin.most_similar("top")
    # w1 = top, w2 = pinnacle/summit/peak/apex, w3 = bottom
    # w1 and w2 can be considered synonyms
    # w1 and w3 can be considered antonyms
    # Distance from synonyms are 0.76/0.67/0.67/0.79
    # wv_from_bin.distance("top", "apex")
    # Distance from antonym is 0.49
    wv_from_bin.distance("top", "bottom")

Why is this happening? Here, I also guess it is because of the data used to train the model. Probably the training examples were rich in contexts for the word "top" that involved businesses. But those contexts were not so rich in vocabulary.
We can also use the most_similar method to find analogies. But here too, the results are not always satisfactory:

    wv_from_bin.most_similar(positive=['hand', 'glove'], negative=['foot'])

    45,000-square

I wonder if we'll get better results with a pre-trained word2vec model instead.
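To show the idea behind analogy queries, here is a simplified sketch in plain Python: add the positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. This is not Gensim's implementation (as far as I know, Gensim also normalizes the vectors before combining them), and the two-dimensional vectors and the `analogy` helper are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Leave the query words out of the results
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy embeddings, assumed for illustration: (royalty, gender) axes
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

best = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
print(best[0][0])  # queen
```

With real embeddings the geometry is far noisier than in this toy space, which is one reason a query can return something as unrelated as 45,000-square.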
Word embeddings are susceptible to bias. This is dangerous because it can reinforce stereotypes through applications that employ these models.
A classical example:
    pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))
    print()
    pprint.pprint(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))

    [('employee', 0.6375863552093506),
     ('workers', 0.6068919897079468),
     ('nurse', 0.5837947726249695),
     ('pregnant', 0.5363885164260864),
     ('mother', 0.5321309566497803),
     ('employer', 0.5127025842666626),
     ('teacher', 0.5099576711654663),
     ('child', 0.5096741914749146),
     ('homemaker', 0.5019454956054688),
     ('nurses', 0.4970572590827942)]

    [('workers', 0.6113258004188538),
     ('employee', 0.5983108282089233),
     ('working', 0.5615328550338745),
     ('laborer', 0.5442320108413696),
     ('unemployed', 0.5368517637252808),
     ('job', 0.5278826951980591),
     ('work', 0.5223963260650635),
     ('mechanic', 0.5088937282562256),
     ('worked', 0.505452036857605),
     ('factory', 0.4940453767776489)]
So jobs such as teacher are "exclusive" to women? The same goes for "mechanic" and men.
This is sad, and it is present in other examples:
    pprint.pprint(wv_from_bin.most_similar(positive=['latin', 'criminal'], negative=['white']))
    print()
    pprint.pprint(wv_from_bin.most_similar(positive=['white', 'criminal'], negative=['latin']))

    [('trafficking', 0.4996181130409241),
     ('transnational', 0.44992437958717346),
     ('crimes', 0.43998926877975464),
     ('laundering', 0.4213477373123169),
     ('crime', 0.42046865820884705),
     ('cartels', 0.417102575302124),
     ('dealing', 0.4154001474380493),
     ('traffickers', 0.40704718232154846),
     ('judicial', 0.39766523241996765),
     ('extradition', 0.3974517583847046)]

    [('prosecution', 0.5594319105148315),
     ('crimes', 0.5117124915122986),
     ('fbi', 0.5068686008453369),
     ('attorney', 0.5007576942443848),
     ('investigation', 0.49686378240585327),
     ('charges', 0.49135079979896545),
     ('charged', 0.48554402589797974),
     ('prosecutors', 0.4846910238265991),
     ('attorneys', 0.47757965326309204),
     ('suit', 0.476983904838562)]
How can we avoid this? Not sure. The only thing I can think of is taking random samples of the data before training, examining them, and removing the documents that are biased on topics like gender, ethnicity, sexual orientation, etc. If the percentage of biased samples is high, it would probably be better to use another dataset for training.
Interesting, right? The saying "garbage in, garbage out" is so accurate. But this does not apply to ML only. The same thing happens in children's education, and whenever we ingest information (TV shows, podcasts, social media feeds, newspapers, etc.). We have to be very careful and selective with what we consume!
And that's all for the moment! If you are reading this, I want to thank you for your interest and time, hope you like it!