Different techniques to represent words as vectors (Word Embeddings)
From Count Vectorizer to Word2Vec
Currently, I'm working on a Twitter Sentiment Analysis project. While reading about how I could input text to my neural network, I realized that I had to convert the text of each tweet into a vector of a specified length. This would allow the neural network to train on the tweets and correctly learn sentiment classification.
Thus, I set out to do a thorough analysis of the various approaches I can take to convert the text into vectors, popularly referred to as word embeddings.
Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. (Wikipedia)
In this article, I'll explore the following word embedding techniques:
- Count Vectorizer
- TF-IDF Vectorizer
- Hashing Vectorizer
- Word2Vec
Sample text data
I'm creating 4 sentences on which we'll apply each of these techniques and understand how they work. For each of the techniques, I'll use lowercase words only.
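For the code snippets later in this article, I'll work with the following list of sentences. Only the first one, He is playing in the field, appears explicitly in the worked examples below; the other three are assumptions, chosen to be consistent with the 17-word vocabulary used throughout.

```python
# Sample sentences used in the code snippets below. Only the first one appears in
# the worked examples; the other three are assumed, chosen to match the article's
# 17-word vocabulary.
sentences = [
    "He is playing in the field.",
    "He is running towards the football.",
    "The football game ended.",
    "It started raining while everyone was playing in the field.",
]
```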
Count Vectorizer
The most basic way to convert text into vectors is through a Count Vectorizer.
Step 1: Identify the unique words in the complete text data. In our case, the list is as follows (17 words):
['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']
Step 2: For each sentence, we'll create an array of zeros with the same length as above (17)
Step 3: Taking each sentence one at a time, we'll read the first word and find its total occurrence count in the sentence. Once we have the number of times it appears in that sentence, we'll identify the position of the word in the list above and replace the zero at that position with this count. This is repeated for all words and for all sentences
Example
Let's take the first sentence, He is playing in the field. Its vector is [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].
The first word is he. Its total count in the sentence is 1. Also, in the list of words above, its position is 6th from the start (all words are lowercase). I'll update its vector, which now becomes:
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Considering the second word, which is is, the vector becomes:
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Similarly, I'll update the rest of the words as well, and the vector representation of the first sentence becomes:
[0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
The same will be repeated for all other sentences as well.
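To make Steps 1 to 3 concrete, here is a minimal sketch in plain Python that builds the count vector for the first sentence by hand, using the 17-word vocabulary above:

```python
# 17-word vocabulary from Step 1
vocab = ['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it',
         'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']

sentence = "He is playing in the field"

# Step 2: start with an array of zeros of the same length as the vocabulary
vector = [0] * len(vocab)

# Step 3: count each word and place the count at the word's position in the vocabulary
for word in sentence.lower().split():
    if word in vocab:
        vector[vocab.index(word)] += 1

print(vector)
# [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```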
Code
sklearn provides the CountVectorizer() method to create these word embeddings. After importing the package, we just need to apply fit_transform() on the complete list of sentences, and we get the array of vectors for each sentence.
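Here is a minimal sketch of that, using the sample sentences listed earlier:

```python
from sklearn.feature_extraction.text import CountVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences)   # sparse matrix, one row per sentence

print(vectorizer.get_feature_names_out())       # the 17-word vocabulary
print(vectors.toarray())                        # count vector for each sentence
```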
The output shows the vector representation of each sentence.
TF-IDF Vectorizer
While the Count Vectorizer converts each sentence into its own vector, it does not consider the importance of a word across the complete list of sentences. For instance, he appears in two sentences and provides no useful information in differentiating between the two. Thus, it should have a lower weight in the overall vector of the sentence. This is where the TF-IDF Vectorizer comes into the picture.
TF-IDF is a product of two parts:
- TF (Term Frequency): It is defined as the number of times a word appears in the given sentence.
- IDF (Inverse Document Frequency): It is defined as the log to the base e of the total number of documents divided by the number of documents in which the word appears.
Step 1: Identify the unique words in the complete text data. In our case, the list is as follows (17 words):
['ended', 'everyone', 'field', 'football', 'game', 'he', 'in', 'is', 'it', 'playing', 'raining', 'running', 'started', 'the', 'towards', 'was', 'while']
Step 2: For each sentence, we'll create an array of zeros with the same length as above (17)
Step 3: For each word in each sentence, we'll calculate the TF-IDF value and update the corresponding value in the vector of that sentence
Example
We'll first define an array of zeros for all the 17 unique words across all sentences.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
I'll take the word he in the first sentence, He is playing in the field, and apply TF-IDF to it. The value will then be updated in the array for that sentence, and the process repeated for all words.
Total documents (N): 4
Documents in which the word appears (n): 2
Number of times the word appears in the first sentence: 1
Number of words in the first sentence: 6
Term Frequency (TF) = 1
Inverse Document Frequency (IDF) = log(N/n) = log(4/2) = log(2)
TF-IDF value = 1 * log(2) = 0.69314718
Updated vector:
[0, 0, 0, 0, 0, 0.69314718, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
The same will be repeated for all other words. However, some libraries may use different methods to calculate this value. For example, sklearn calculates the Inverse Document Frequency as:
IDF = log(N/n) + 1
Thus, the TF-IDF value would be:
TF-IDF value = 1 * (log(4/2) + 1)
= 1 * (log(2) + 1)
= 1.69314718
When this process is repeated, the vector for the first sentence becomes:
[0, 0, 1.69314718, 0, 0, 1.69314718, 1.69314718, 1.69314718, 0, 1.69314718, 0, 0, 0, 1, 0, 0, 0]
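As a quick check of the arithmetic, here is a minimal sketch of both IDF variants for the word he in the first sentence:

```python
import math

N = 4   # total number of sentences (documents)
n = 2   # documents containing the word "he"
tf = 1  # occurrences of "he" in the first sentence

print(tf * math.log(N / n))        # plain definition:  0.6931471805599453
print(tf * (math.log(N / n) + 1))  # sklearn's variant: 1.6931471805599454
```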
Code
sklearn provides the TfidfVectorizer method to calculate the TF-IDF values. However, it applies l2 normalization by default, which I'll disable by setting norm to None, and I'll keep the smooth_idf flag as False so that the IDF formula above is used.
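A minimal sketch of that setup, again on the sample sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# norm=None disables the default l2 normalization; smooth_idf=False makes
# sklearn use idf = log(N/n) + 1, as described above
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False)
vectors = vectorizer.fit_transform(sentences)

print(vectors.toarray())  # TF-IDF vector for each sentence
```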
The output shows the vector representation of each sentence.
Hashing Vectorizer
This vectorizer is very useful as it allows us to convert any word into its hash and does not require the generation of any vocabulary.
Step 1: Define the size of the vector to be created for each sentence
Step 2: Apply the hashing algorithm (such as MurmurHash) to the sentence
Step 3: Repeat step 2 for all sentences
Code
As the process is simply the application of a hash function, we can jump straight to the code. I'll use the HashingVectorizer method from sklearn. The normalization will be removed by setting norm to None. Given that both of the vectorization techniques discussed above produced 17 columns in each vector, I'll set the number of features to 17 here as well.
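A minimal sketch of that, using the same sample sentences:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# n_features=17 keeps the vector length consistent with the earlier examples;
# norm=None removes the normalization
vectorizer = HashingVectorizer(n_features=17, norm=None)
vectors = vectorizer.fit_transform(sentences)

print(vectors.toarray())  # note: entries can be negative because of the signed hash
```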
This will generate the necessary vector of hash values.
Word2Vec
These are a set of neural network models that aim to represent words in the vector space. These models are highly efficient and performant at understanding the context of and relations between words. Similar words are placed close together in the vector space, while dissimilar words are placed far apart.
These models are so good at representing words that they are even able to identify key relationships such as:
King - Man + Woman = Queen
They are able to decipher that what a Man is to a King, a Woman is to a Queen. The corresponding relationships can be identified through these models.
There are two models in this class:
- CBOW (Continuous Bag of Words): The neural network looks at the surrounding words (say, 2 to the left and 2 to the right) and predicts the word that comes in between
- Skip-gram: The neural network takes in a word and then tries to predict the surrounding words
The neural network has one input layer, one hidden layer, and one output layer to train on the data and build the vectors. Since this is just the basic functionality of how a neural network works, I'll skip the step-by-step process.
Code
To implement the word2vec model, I'll use the gensim library, which provides many features on top of the model, such as finding the odd one out, the most similar words, and so on. However, it does not lowercase or tokenize the sentences, so I do that first. The tokenized sentences are then passed to the model. I've set the size of the vectors to 2, the window to 3, which defines how far away to look, and sg = 0, which uses the CBOW model.
I used the most_similar method to find all words similar to the word football and then printed the most similar one. Different training runs will give different results, but in the last run I tried, the most similar word was game. The dataset here consists of just four sentences; if we increased it, the neural network would be able to find relationships much better.
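A minimal sketch of that workflow is below; note that in gensim 4.x the parameter is called vector_size rather than size, and min_count=1 is needed so that no word is dropped from this tiny corpus:

```python
from gensim.models import Word2Vec

# the four sample sentences from the "Sample text data" section
sentences = ["He is playing in the field.", "He is running towards the football.",
             "The football game ended.", "It started raining while everyone was playing in the field."]

# gensim expects lists of tokens, so lowercase and tokenize the sentences first
tokenized = [s.lower().rstrip(".").split() for s in sentences]

# vector_size=2 (called `size` in older gensim), window=3, sg=0 for CBOW
model = Word2Vec(tokenized, vector_size=2, window=3, sg=0, min_count=1)

print(model.wv["football"])                       # the 2-dimensional vector for "football"
print(model.wv.most_similar("football", topn=1))  # most similar word and its similarity score
```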
Conclusion
There we have it. We've looked at 4 ways to create word embeddings and how we can implement them in code. If you have any thoughts, ideas, or suggestions, do share them. Thanks for reading!
Source: https://towardsdatascience.com/different-techniques-to-represent-words-as-vectors-word-embeddings-3e4b9ab7ceb4