
Text Embedding Model Steps
My overview of a basic Vector Database workflow is this: take an unstructured dataset (text, images, video, audio, etc.) and embed it (turn it into a vector). This embedding (vector) is then used in the Vector Database for a Similarity Search. Vectors are used because most machine learning algorithms, including neural networks, cannot process plain text in its raw form. Once the most similar vector(s) are located, they are returned and run through a language model to convert the result back into text.
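To make the Similarity Search idea concrete, here is a minimal sketch using cosine similarity over toy vectors (the texts and the 3-dimensional embeddings are made up purely for illustration, not output from a real model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by their magnitudes
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "database": stored text mapped to made-up 3-dimensional embeddings
database = {
    "Snow absorbs sound due to its porous structure.": np.array([0.9, 0.1, 0.3]),
    "Bananas are rich in potassium.": np.array([0.1, 0.8, 0.5]),
}

query = np.array([0.85, 0.15, 0.25])  # pretend embedding of "Does snow absorb sound?"

# Return the stored text whose vector is most similar to the query vector
best = max(database, key=lambda text: cosine_similarity(query, database[text]))
print(best)  # -> "Snow absorbs sound due to its porous structure."
```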
What are the specific steps? As I’m still looking into all of this, I think the steps are different for audio, text, and video. So, I decided to start with text embedding and will continue to use the “Does snow absorb sound?” text. Text embedding is a process that converts text into a numerical representation that computers can understand. This representation is a vector of numbers, or a high-dimensional point in a latent space.
Now, I know there are already open-source text embedding models available on platforms like Hugging Face, so why reinvent the wheel? It’s just my nature to want to know what goes on behind the scenes; I want to understand why each step is done.
Taking information from online sources and ChatGPT, I believe these are the general steps involved in working with a Vector Database. I say “believe” because I am still trying to understand all the specifics.
1. Tokenization
Before text embedding can begin, the input text is split into smaller units called tokens (e.g., words, subwords, or characters). Here, the sentence is broken into words and punctuation.
EXAMPLE:
Tokens: [“Does”, “snow”, “absorb”, “sound”, “?”]
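As a concrete illustration, here is a minimal word-level tokenizer in Python using a simple regular expression. This is a simplification; real embedding models typically use subword tokenizers such as WordPiece or BPE:

```python
import re

def tokenize(text: str) -> list[str]:
    # Match runs of word characters, or any single non-space punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Does snow absorb sound?"))
# ['Does', 'snow', 'absorb', 'sound', '?']
```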
_________________________________
2. Mapping Tokens to Vectors (Encoding)
Each token is mapped to an initial numerical representation, often a one-hot encoding or an index. These representations are sparse and high-dimensional, so they need to be transformed into dense, meaningful vectors.
EXAMPLE: Using One-Hot Encoding
Each token is represented as a one-hot vector. For a vocabulary size of 5 (the tokens themselves), the one-hot vectors are:
One-Hot Vectors:
“Does” → [1, 0, 0, 0, 0]
“snow” → [0, 1, 0, 0, 0]
“absorb” → [0, 0, 1, 0, 0]
“sound” → [0, 0, 0, 1, 0]
“?” → [0, 0, 0, 0, 1]
I need to look into this further. Are vectors just being assigned, or are there calculations going on?
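Here is a minimal sketch of how those one-hot vectors could be built in Python, assuming each unique token simply gets the next available index:

```python
import numpy as np

tokens = ["Does", "snow", "absorb", "sound", "?"]
vocab = {tok: i for i, tok in enumerate(tokens)}  # token -> index

# An identity matrix: row i is the one-hot vector for token i
one_hot = np.eye(len(vocab))

for tok, idx in vocab.items():
    print(f"{tok!r:10} -> {one_hot[idx]}")
```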
_________________________________
3. Word Embedding Models
An embedding layer is a type of neural network layer that maps discrete inputs (like words or tokens) into dense, continuous, low-dimensional vector representations. Its weights are learned during training, or come from pre-trained embedding models like Word2Vec, GloVe, or the embeddings inside transformer models such as BERT.
Embedding Matrix:
Each token’s one-hot vector is multiplied by the embedding matrix to retrieve its corresponding dense embedding. In practice, this is done via direct indexing into the embedding matrix.
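A small sketch of that equivalence, using a made-up 5×3 embedding matrix (in a real model this matrix is learned during training, not random):

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))  # toy embedding matrix

one_hot = np.zeros(vocab_size)
one_hot[1] = 1.0  # one-hot vector for "snow" (index 1)

via_matmul = one_hot @ E  # matrix multiplication selects row 1 of E...
via_index = E[1]          # ...which is equivalent to direct indexing

assert np.allclose(via_matmul, via_index)
print(via_index)
```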
Another step I need to look into further: is the embedding matrix static, or is there some calculation that needs to be done to get it?
_________________________________
4. Contextual Embeddings (Advanced Models like BERT)
Contextual embeddings consider the meaning of a word in context. For example, “sound” in “absorb sound” differs from “sound reasoning.”
Transformers use attention mechanisms to calculate these embeddings, in roughly these stages (a sketch follows the list):
- Input Embedding
- Linear Transformations
- Attention Scores
- Contextualized Embedding
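Below is a rough sketch of scaled dot-product attention with NumPy. The dimensions and the random weight matrices Wq, Wk, and Wv are toy assumptions; real transformers learn these weights and use multiple attention heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                      # 5 tokens, embedding dimension 8
X = rng.normal(size=(seq_len, d))      # input embeddings, one row per token

# Linear transformations: project inputs into queries, keys, and values
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)          # attention scores between every pair of tokens
weights = softmax(scores, axis=-1)     # softmax normalizes each row into probabilities
contextual = weights @ V               # contextualized embeddings, shape (5, 8)
print(contextual.shape)
```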
_________________________________
5. Fine-Tuning or Training
If the embeddings are part of a larger model (e.g., GPT or BERT), the embedding vectors are fine-tuned during training using gradient descent and backpropagation to minimize a loss function (like cross-entropy loss).
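Here is a minimal sketch of that training loop using PyTorch, with a made-up two-class task just so there is a loss to minimize; the point is that the gradient updates change the embedding matrix itself:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)  # the embedding matrix
classifier = nn.Linear(3, 2)  # a toy downstream task with 2 classes

params = list(embedding.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

token_ids = torch.tensor([0, 1, 2, 3])  # "Does", "snow", "absorb", "sound"
labels = torch.tensor([0, 1, 0, 1])     # dummy labels, purely for illustration

for _ in range(10):
    optimizer.zero_grad()
    logits = classifier(embedding(token_ids))  # forward pass through embeddings
    loss = loss_fn(logits, labels)             # cross-entropy loss
    loss.backward()                            # backpropagation computes gradients
    optimizer.step()                           # gradient descent updates the embeddings
```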
_________________________________
6. Math Summary of Embedding Generation
At its core, text embedding involves:
- Matrix multiplication: Transforming sparse vectors into dense ones.
- Linear transformations: Projecting embeddings into different spaces.
- Attention mechanisms: Computing relationships between tokens.
- Softmax: Normalizing scores for probabilistic interpretation.
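Putting those four pieces into standard notation (the generic formulation, not tied to any single model):

```latex
\[
\mathbf{e} = \mathbf{x}_{\text{one-hot}}\, E,
\qquad
Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
\qquad
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]
```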
_________________________________
7. Dimensionality of the Embedding Space
The final embedding vector is typically d-dimensional, where d depends on the model (for example, 300 dimensions for classic Word2Vec or GloVe vectors, and 768 for BERT-base).
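As an end-to-end check, the sentence-transformers library can embed the example sentence directly; the all-MiniLM-L6-v2 model shown here produces 384-dimensional vectors (the model choice is just one example):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vector = model.encode("Does snow absorb sound?")
print(vector.shape)  # (384,) -> d = 384 for this particular model
```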
_________________________________
As I go through these steps, I’m inclined to create an article on each one, going into as much detail as I can. For now, I believe these are the steps taken in a text embedding model.
They say that “the best way to learn is to teach,” so these articles are to help me learn, and I hope others can get something from them as well.
______________________________________________
All articles posted Wednesdays & Saturdays by 8PM
(with additional postings here and there)