Different techniques to represent words as vectors (Word Embeddings)
From Count Vectorizer to Word2Vec
Currently, I'm working on a Twitter Sentiment Analysis project. While reading about how I could input text to my neural network, I realized that I had to convert the text of each tweet into a vector of a specified length. This would allow the neural network to train on the tweets and correctly learn sentiment classification.
Thus, I set out to do a thorough analysis of the various approaches I can take to convert the text into vectors, popularly referred to as word embeddings.
Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. (Wikipedia)
In this article, I'll explore the following word embedding techniques:
- Count Vectorizer
- TF-IDF Vectorizer
- Hashing Vectorizer
- Word2Vec
Sample text data
I'm creating 4 sentences on which we'll apply each of these techniques and understand how they work. For each of the techniques, I'll use lowercase words only.
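For the code snippets later in this article, I'll work with the following list of sentences. Only the first one, He is playing in the field, appears explicitly in the worked examples below; the other three are assumptions, chosen to be consistent with the 17-word vocabulary used throughout.

```python
# Sample sentences used in the code snippets below. Only the first one appears in
# the worked examples; the other three are assumed, chosen to match the article's
# 17-word vocabulary.
sentences = [
    "He is playing in the field.",
    "He is running towards the football.",
    "The football game ended.",
    "It started raining while everyone was playing in the field.",
]
```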
Count Vectorizer
The most basic way to convert text into vectors is through a Count Vectorizer.
Step 1: Identify the unique words in the complete text data. In our case, the list is as follows (17 words):
['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']
Step 2: For each sentence, we'll create an array of zeros with the same length as above (17)
Step 3: Taking each sentence one at a time, we'll read the first word and find its total occurrence count in the sentence. Once we have the number of times it appears in that sentence, we'll identify the position of the word in the list above and replace the zero at that position with this count. This is repeated for all words and for all sentences
Example
Let's take the first sentence, He is playing in the field. Its vector is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
The first word is he. Its total count in the sentence is 1. Also, in the list of words above, its position is 6th from the start (all words are lowercase). I'll update its vector, which now becomes:
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Considering the second word, which is is, the vector becomes:
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Similarly, I'll update the rest of the words as well, and the vector representation of the first sentence becomes:
[0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
The same will be repeated for all other sentences as well.
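To make Steps 1 to 3 concrete, here is a minimal sketch in plain Python that builds the count vector for the first sentence by hand, using the 17-word vocabulary above:

```python
# 17-word vocabulary from Step 1
vocab = ['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it',
         'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']

sentence = "He is playing in the field"

# Step 2: start with an array of zeros of the same length as the vocabulary
vector = [0] * len(vocab)

# Step 3: count each word and place the count at the word's position in the vocabulary
for word in sentence.lower().split():
    if word in vocab:
        vector[vocab.index(word)] += 1

print(vector)
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```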
Code
sklearn provides the CountVectorizer() method to create these word embeddings. After importing the package, we just need to apply fit_transform() on the complete list of sentences, and we get the array of vectors for each sentence.
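Here is a minimal sketch of that, using the sample sentences listed earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)   # sparse matrix, one row per sentence

print(vectorizer.get_feature_names_out())       # the 17-word vocabulary
print(vectors.toarray())                        # count vector for each sentence
```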
The output shows the vector representation of each sentence.
TF-IDF Vectorizer
While the Count Vectorizer converts each sentence into its own vector, it does not consider the importance of a word across the complete list of sentences. For instance, he appears in two sentences and provides no useful information in differentiating between the two. Thus, it should have a lower weight in the overall vector of the sentence. This is where the TF-IDF Vectorizer comes into the picture.
TF-IDF is a product of two parts:
- TF (Term Frequency): It is defined as the number of times a word appears in the given sentence.
- IDF (Inverse Document Frequency): It is defined as the log to the base e of the total number of documents divided by the number of documents in which the word appears.
Step 1: Identify the unique words in the complete text data. In our case, the list is as follows (17 words):
['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']
Step 2: For each sentence, we'll create an array of zeros with the same length as above (17)
Step 3: For each word in each sentence, we'll calculate the TF-IDF value and update the corresponding value in the vector of that sentence
Example
We'll first define an array of zeros for all the 17 unique words across all sentences.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I'll take the word he in the first sentence, He is playing in the field, and apply TF-IDF to it. The value will then be updated in the array for that sentence, and the process repeated for all words.
Total documents (N): 4
Documents in which the word appears (n): 2
Number of times the word appears in the first sentence: 1
Number of words in the first sentence: 6
Term Frequency (TF) = 1
Inverse Document Frequency (IDF) = log(N/n) = log(4/2) = log(2)
TF-IDF value = 1 * log(2) = 0.69314718
Updated vector:
[0, 0, 0, 0, 0, 0.69314718, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
The same will be repeated for all other words. However, some libraries may use different methods to calculate this value. For example, sklearn calculates the Inverse Document Frequency as:
IDF = log(N/n) + 1
Thus, the TF-IDF value would be:
TF-IDF value = 1 * (log(4/2) + 1)
= 1 * (log(2) + 1)
= 1.69314718
When this process is repeated, the vector for the first sentence becomes:
[0, 0, 1.69314718, 0, 0, 1.69314718, 1.69314718, 1.69314718, 0, 1.69314718, 0, 0, 0, 1, 0, 0, 0]
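As a quick check of the arithmetic, here is a minimal sketch of both IDF variants for the word he in the first sentence:

```python
import math

N = 4   # total number of sentences (documents)
n = 2   # documents containing the word "he"
tf = 1  # occurrences of "he" in the first sentence

print(tf * math.log(N / n))        # plain definition:  0.6931471805599453
print(tf * (math.log(N / n) + 1))  # sklearn's variant: 1.6931471805599454
```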
Code
sklearn provides the TfidfVectorizer method to calculate the TF-IDF values. However, it applies l2 normalization by default, which I'll disable by setting norm to None, and I'll keep the smooth_idf flag as False so that the IDF formula above is used.
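A minimal sketch of that setup, again on the sample sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# norm=None disables the default l2 normalization; smooth_idf=False makes
# sklearn use idf = log(N/n) + 1, as described above
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
vectors = vectorizer.fit_transform(sentences)

print(vectors.toarray())  # TF-IDF vector for each sentence
```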
The output shows the vector representation of each sentence.
Hashing Vectorizer
This vectorizer is very useful as it allows us to convert any word into its hash and does not require the generation of any vocabulary.
Step 1: Define the size of the vector to be created for each sentence
Step 2: Apply the hashing algorithm (such as MurmurHash) to the sentence
Step 3: Repeat step 2 for all sentences
Code
As the process is simply the application of a hash function, we can jump straight to the code. I'll use the HashingVectorizer method from sklearn. The normalization will be removed by setting norm to None. Given that both of the vectorization techniques discussed above produced 17 columns in each vector, I'll set the number of features to 17 here as well.
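A minimal sketch of that, using the same sample sentences:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# n_features=17 keeps the vector length consistent with the earlier examples;
# norm=None removes the normalization
vectorizer = HashingVectorizer(n_features=17, norm=None)
vectors = vectorizer.fit_transform(sentences)

print(vectors.toarray())  # note: entries can be negative because of the signed hash
```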
This will generate the necessary vector of hash values.
Word2Vec
These are a set of neural network models that aim to represent words in the vector space. These models are highly efficient and performant at understanding the context of and relations between words. Similar words are placed close together in the vector space, while dissimilar words are placed far apart.
These models are so good at representing words that they are even able to identify key relationships such as:
King - Man + Woman = Queen
They are able to decipher that what a Man is to a King, a Woman is to a Queen. The corresponding relationships can be identified through these models.
There are two models in this class:
- CBOW (Continuous Bag of Words): The neural network looks at the surrounding words (say, 2 to the left and 2 to the right) and predicts the word that comes in between
- Skip-gram: The neural network takes in a word and then tries to predict the surrounding words
The neural network has one input layer, one hidden layer, and one output layer to train on the data and build the vectors. Since this is just the basic functionality of how a neural network works, I'll skip the step-by-step process.
Code
To implement the word2vec model, I'll use the gensim library, which provides many features on top of the model, such as finding the odd one out, the most similar words, and so on. However, it does not lowercase or tokenize the sentences, so I do that first. The tokenized sentences are then passed to the model. I've set the size of the vectors to 2, the window to 3, which defines how far away to look, and sg = 0, which uses the CBOW model.
I used the most_similar method to find all words similar to the word football and then printed the most similar one. Different training runs will give different results, but in the last run I tried, the most similar word was game. The dataset here consists of just four sentences; if we increased it, the neural network would be able to find relationships much better.
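A minimal sketch of that workflow is below; note that in gensim 4.x the parameter is called vector_size rather than size, and min_count=1 is needed so that no word is dropped from this tiny corpus:

```python
from gensim.models import Word2Vec

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# gensim expects lists of tokens, so lowercase and tokenize the sentences first
tokenized = [s.lower().rstrip(".").split() for s in sentences]

# vector_size=2 (called `size` in older gensim), window=3, sg=0 for CBOW
model = Word2Vec(tokenized, vector_size=2, window=3, sg=0, min_count=1)

print(model.wv["football"])                       # the 2-dimensional vector for "football"
print(model.wv.most_similar("football", topn=1))  # most similar word and its similarity score
```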
Conclusion
There we have it. We've looked at 4 ways to create word embeddings and how we can implement them in code. If you have any thoughts, ideas, or suggestions, do share them. Thanks for reading!
Source: https://towardsdatascience.com/different-techniques-to-represent-words-as-vectors-word-embeddings-3e4b9ab7ceb4