Understanding Word Embeddings and BERT with PyTorch


This is a small project that compares a simple word embedding with BERT contextual embeddings using PyTorch. The goal is to explore semantic similarity and word meaning in a vector space, and so to understand how LLMs build context-based representations.

Python code:

1. Install the requirements:

Open a command prompt and run the following command to install the requirements:

pip install torch transformers

2. Import the required libraries:

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

Here:

import torch imports the base library.
import torch.nn imports the layer classes (nn = neural network).
torch.nn.functional provides direct functions such as F.relu, F.softmax, etc.
The transformers import provides the BERT tokenizer and model classes.
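
To make the difference between the two imports concrete, here is a minimal sketch. The input tensor x is made up for illustration. It applies the same ReLU once as a layer object from torch.nn and once as a plain function from torch.nn.functional.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up input values, for illustration only
x = torch.tensor([-1.0, 0.0, 2.0])

relu_layer = nn.ReLU()          # module style: create an object, then call it
out_module = relu_layer(x)

out_functional = F.relu(x)      # functional style: a plain function call

print(out_module)
print(out_functional)

# softmax turns the values into probabilities that sum to 1
probs = F.softmax(x, dim=0)
print(probs.sum().item())
```

Both styles compute the same result. The module form is convenient inside a model definition, while the functional form is handy for one-off calls.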

3. Embedding model:

vocab = {
    "king": 0,
    "queen": 1,
    "man": 2,
    "woman": 3,
    "dog": 4,
    "cat": 5
}
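vocab_size = len(vocab)
embedding_dim = 8

# One trainable 8-dimensional vector per word, randomly initialized
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

def get_embedding(word):
    # Convert the word to its index, then look up its vector
    idx = torch.tensor([vocab[word]])
    return embedding_layer(idx)

def cosine_sim(a, b):
    # Cosine similarity along dim=1 (the default)
    return F.cosine_similarity(a, b)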

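vocab → lookup table that maps each word to an integer index.

embedding_dim = 8 → each word is represented by an 8-dimensional vector.

embedding_layer = nn.Embedding(vocab_size, embedding_dim) → creates the embedding matrix of shape (vocab_size, embedding_dim), with randomly initialized values.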
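
As a quick check of the "lookup table" idea, the sketch below uses a reduced three-word vocabulary, chosen only to keep the example short. It shows that calling the layer with an index returns exactly the matching row of the weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initial weights are repeatable

# Reduced vocabulary, for illustration only
vocab = {"king": 0, "queen": 1, "man": 2}
embedding_layer = nn.Embedding(len(vocab), 8)

# The layer stores one row of weights per word
print(embedding_layer.weight.shape)   # torch.Size([3, 8])

idx = torch.tensor([vocab["queen"]])
looked_up = embedding_layer(idx)               # shape (1, 8)
row = embedding_layer.weight[vocab["queen"]]   # shape (8,)

# The lookup result is the same as reading the row directly
print(torch.equal(looked_up[0], row))
```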

4. Test Simple Embedding:

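Check the cosine similarity between king and queen, and between king and man. Because the embedding layer is randomly initialized and is not trained here, the scores are arbitrary and change from run to run.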

king = get_embedding("king")
queen = get_embedding("queen")
man = get_embedding("man")

print("king vs queen:", cosine_sim(king, queen).item())
print("king vs man:", cosine_sim(king, man).item())
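
To see what the cosine score measures, here is a small sketch that computes it by hand and compares it with F.cosine_similarity. The helper name manual_cosine and the 2-dimensional vectors are made up for illustration, so the values are easy to check mentally.

```python
import torch
import torch.nn.functional as F

def manual_cosine(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|), computed along the last dimension
    dot = (a * b).sum(dim=-1)
    return dot / (a.norm(dim=-1) * b.norm(dim=-1))

# Hand-picked vectors, for illustration only
a = torch.tensor([[1.0, 0.0]])
b = torch.tensor([[1.0, 1.0]])
c = torch.tensor([[0.0, 3.0]])

print(manual_cosine(a, b).item())        # about 0.7071 (45 degree angle)
print(manual_cosine(a, c).item())        # 0.0 (perpendicular vectors)
print(F.cosine_similarity(a, b).item())  # matches the manual value
```

The score depends only on the angle between the vectors, not on their length, which is why c can be three times longer than a and still score 0.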

5. Using the BERT model for contextual embeddings:

This model generates context-dependent embeddings: the vector for a word depends on the sentence around it.

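model_name = "bert-base-uncased"

# Downloads the pretrained tokenizer and weights on first use
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)

def get_bert_embedding(text):
    # Tokenize the text and return PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt")

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = bert_model(**inputs)

    # Sentence embedding: average the token vectors (including [CLS] and [SEP])
    embeddings = outputs.last_hidden_state  # shape (1, seq_len, 768)
    sentence_embedding = embeddings.mean(dim=1)  # shape (1, 768)

    return sentence_embedding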
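
The plain mean above works for one sentence at a time. If several sentences were batched with padding, the padding positions would be averaged in as well. The following is a minimal sketch of one common alternative, mean pooling weighted by the attention mask. The function name masked_mean_pool and the toy tensors are assumptions for illustration and are not part of the code above.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden)
    # attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1.0)          # avoid division by zero
    return summed / counts

# Toy stand-in for model output: two real tokens and one padding position
hidden = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)               # tensor([[2., 2.]])
print(hidden.mean(dim=1))   # plain mean, distorted by the padding position
```

With the real model, the inputs would be outputs.last_hidden_state and inputs["attention_mask"], which the tokenizer returns when called on a list of sentences with padding=True.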

6. Testing the BERT model:

sent1 = "I deposited money in the bank"
sent2 = "I sat near the river bank"

emb1 = get_bert_embedding(sent1)
emb2 = get_bert_embedding(sent2)

print("\n=== BERT CONTEXTUAL EMBEDDINGS ===")
print("Sentence similarity:",
      F.cosine_similarity(emb1, emb2).item())
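
The score above compares whole-sentence averages, so it does not isolate the word "bank" itself. One possible way to compare the two "bank" vectors directly is sketched below. The helper token_vector is hypothetical, and the token lists and hidden states are hand-written stand-ins (random vectors, not real BERT output), so that the example runs without downloading the model.

```python
import torch
import torch.nn.functional as F

def token_vector(last_hidden_state, tokens, word):
    # last_hidden_state: (1, seq_len, hidden); tokens: list of seq_len strings
    position = tokens.index(word)           # first occurrence; ValueError if absent
    return last_hidden_state[0, position]   # shape (hidden,)

# Stand-in data: assumed token lists and random vectors, NOT real BERT output
torch.manual_seed(0)
tokens1 = ["[CLS]", "i", "deposited", "money", "in", "the", "bank", "[SEP]"]
tokens2 = ["[CLS]", "i", "sat", "near", "the", "river", "bank", "[SEP]"]
hidden1 = torch.randn(1, len(tokens1), 16)
hidden2 = torch.randn(1, len(tokens2), 16)

bank1 = token_vector(hidden1, tokens1, "bank")
bank2 = token_vector(hidden2, tokens2, "bank")

# Both vectors are 1-D, so compare along dim=0
similarity = F.cosine_similarity(bank1, bank2, dim=0).item()
print("bank vs bank:", similarity)
```

With the real model, the token list can be obtained from tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) and the hidden states from outputs.last_hidden_state. The random stand-ins here say nothing about what score BERT would actually produce.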

7. Comparing differently worded sentences with related meanings:

s1 = "I love AI"
s2 = "Artificial intelligence is amazing"

e1 = get_bert_embedding(s1)
e2 = get_bert_embedding(s2)

print("\n=== SEMANTIC SIMILARITY ===")
print("AI sentences similarity:",
      F.cosine_similarity(e1, e2).item())