
Neural Network-based Sentiment Analysis

1. Overview#

Sentiment analysis is a common task in natural language processing: determining whether the sentiment expressed in a piece of text is positive or negative. Some time ago I read an article by Mr. Yunfan titled Quantitative Analysis Using Natural Language Processing, which was very informative. Combined with my recent work on natural language processing projects, this article is a brief summary of sentiment analysis in NLP.

In this article, I will continue to use the Keras framework used by Mr. Yunfan, with the main neural network architectures being Embedding+LSTM or Embedding+CNN+Pooling. (This article will provide some explanations on Embedding, which is unique to natural language processing.)

The training dataset and validation dataset used are from the Sarcasm_Headlines_Dataset (Misra, Rishabh and Arora, Prahal, 2019). This dataset classifies article headlines as sarcastic or non-sarcastic, with 3 attributes (this article mainly uses the first 2 attributes):

  1. is_sarcastic: 1 if the record is sarcastic, otherwise 0
  2. headline: the headline of the news article
  3. article_link: link to the original news article. Useful in collecting supplementary data

The pre-trained word vectors used here are from GloVe (Jeffrey Pennington, Richard Socher, and Christopher D. Manning, 2014). They were trained on the 2014 English Wikipedia and cover 400k distinct words, each represented by a 100-dimensional vector. Besides GloVe, word2vec-api links to many other pre-trained word vector sets, and the Chinese Word Vectors project offers many pre-trained vectors for Chinese.
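
For reference, a minimal sketch for loading such a GloVe file into a Python dictionary might look like the following (the file name glove.6B.100d.txt is an assumption based on the 100-dimensional, 400k-word description above):

import numpy as np

# Each line of the GloVe text file is: word v1 v2 ... v100
embeddings_index = {}
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

print(len(embeddings_index))          # roughly 400k words
print(embeddings_index['the'].shape)  # (100,)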

2. Data Acquisition#

2.1 Getting Data from Files#

import json

# Each line of the file is one JSON record with is_sarcastic, headline, article_link
def parse_data(file):
    for l in open(file, 'r'):
        yield json.loads(l)

data = list(parse_data('./Sarcasm_Headlines_Dataset_v2.json'))

2.2 Splitting Data into Training and Validation Sets#

X = []
y = []
for s in data:
    X.append(s['headline'])
    y.append(s['is_sarcastic'])

from sklearn.model_selection import train_test_split

train_ratio = 0.9   # (training + validation) : test = 9 : 1
valid_ratio = 0.2   # training : validation = 8 : 2

# Training + validation set : test set = 9 : 1
trainval_size = int(len(X) * train_ratio)

# Training and validation set
X_trainval = X[:trainval_size]
y_trainval = y[:trainval_size]

# Training set : validation set = 8 : 2
X_train, X_valid, y_train, y_valid = train_test_split(X_trainval,
                                                      y_trainval,
                                                      test_size=valid_ratio,
                                                      random_state=12)

# Test set
X_test = X[trainval_size:]
y_test = y[trainval_size:]

2.3 Generating Sequences for Training and Validation#

  1. Generating a tokenizer
    The tokenizer's main function is to convert all words into indices.

from tensorflow.keras.preprocessing.text import Tokenizer

# Define tokenizer
tokenizer = Tokenizer(num_words=num_words, oov_token=oov_token)

# Train tokenizer
tokenizer.fit_on_texts(training_sentences)

# tokenizer.word_index maps each word to its index
word_index = tokenizer.word_index

  2. Converting sentences composed of words into sequences composed of indices
    This is where the num_words parameter set above takes effect: only the num_words most frequent words keep their own index in the output sequences, while less frequent words are mapped to the oov_token index (or simply dropped if no oov_token is set). The tokenizer's word_index still contains every word seen during fitting, so num_words is typically smaller than len(word_index).

sequences = tokenizer.texts_to_sequences(sentences)

  3. Padding sequences
    Although the tokenizer converts sentences into numerical sequences, the sentences vary in length, so the sequences are not uniform. Sequence padding is therefore needed: sequences are aligned to the length of the longest sentence (or to a specified maximum length), and the empty positions are filled with 0, either at the beginning or at the end of each sequence. In this article, a maximum sequence length of 16 is used, and sentences shorter than 16 are padded with 0 at the beginning.

from tensorflow.keras.preprocessing.sequence import pad_sequences

paddedSequences = pad_sequences(sequences,
                                padding=padding,
                                maxlen=MAX_SEQUENCE_LENGTH,
                                truncating=trunc_type)
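
As an illustration of the three steps above on a toy corpus (not the article's dataset), assuming tf.keras and a maximum length of 16:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

toy_sentences = ['the cat sat on the mat', 'the dog ate my homework']

toy_tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
toy_tokenizer.fit_on_texts(toy_sentences)

toy_sequences = toy_tokenizer.texts_to_sequences(toy_sentences)
# roughly [[2, 3, 4, 5, 2, 6], [2, 7, 8, 9, 10]]  (index 1 is reserved for <OOV>)

toy_padded = pad_sequences(toy_sequences, padding='pre', maxlen=16)
print(toy_padded.shape)  # (2, 16); shorter sentences are padded with 0 at the front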

3. Model Construction and Training#

3.1 Model Construction#

  1. Embedding layer + Flatten layer
    So far, we have obtained an integer sequence, which is a 2D tensor with a shape of (batch_size, sequence_length) (where batch_size is None if not specified). However, this approach does not capture the relationship between words.

A common approach is to convert the integers into high-dimensional sparse vectors using one-hot encoding and then map them to low-dimensional dense vectors. For example, a word may be represented by a 100-, 300-, or 500-dimensional vector, and the similarity between words is measured by the inner product of their vectors. This word-to-vector conversion is the idea behind word2vec, and here it is implemented with Embedding().

First, each word index in the sequence is converted into a one-hot sparse vector. The purpose is to simplify computation: when multiplying a sparse matrix, only the positions equal to 1 need to be multiplied and summed, which is simpler and faster than computing with the raw integer indices directly.

Next, the sparse matrix (m, n) is multiplied by a word vector matrix (n, output_dim) to form a new word vector matrix (m, output_dim).
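
To make this concrete, here is a tiny numpy illustration (toy sizes, not the dimensions used in this article): multiplying a one-hot matrix by the word-vector matrix simply selects the corresponding rows.

import numpy as np

# 4-word vocabulary (n = 4), 3-dimensional word vectors (output_dim = 3)
one_hot = np.array([[0, 1, 0, 0],    # word with index 1
                    [0, 0, 0, 1]])   # word with index 3
word_vectors = np.random.rand(4, 3)  # the (n, output_dim) word-vector matrix

# (m, n) @ (n, output_dim) -> (m, output_dim): rows 1 and 3 of word_vectors
print(one_hot @ word_vectors)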

Finally, replace all integers in the input sequences with their corresponding word vectors, resulting in a new tensor that is expanded to a 3D tensor with a shape of (batch_size, sequence_length, output_dim), where output_dim is the dimension of the word vectors.

For natural language processing, the Embedding layer is necessary and must be the first layer. For detailed Embedding principles, refer to the following:

[Figure: illustration of how the Embedding layer maps word indices to word vectors]

Embedding(num_words,
          EMBEDDING_DIM,
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=True),
Flatten(),

  2. LSTM layer
    The LSTM part can be a single layer or several stacked layers. When stacking, every LSTM layer except the last should set return_sequences=True so that it passes the full sequence on to the next layer.

Bidirectional(LSTM(64,return_sequences=True)),
Bidirectional(LSTM(32)),

  3. Conv1D layer + GlobalMaxPooling1D layer
    A 1D convolutional layer followed by a global max-pooling layer.

Conv1D(128, 5, activation='relu'),
GlobalMaxPooling1D(),
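
For reference, one way these fragments could be assembled into a full model for the Embedding+CNN+Pooling variant is sketched below; the Dense layers at the end (a 32-unit hidden layer and a sigmoid output) are my assumption and are not taken from the article:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(num_words, EMBEDDING_DIM,
              input_length=MAX_SEQUENCE_LENGTH, trainable=True),
    Conv1D(128, 5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(32, activation='relu'),     # assumed hidden layer
    Dense(1, activation='sigmoid'),   # sarcastic (1) vs. non-sarcastic (0)
])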

3.2 Introduction of Pre-trained Word Vectors#

Let's go back to the Embedding layer. As you can see, the previous code did not introduce pre-trained word vectors when using the Embedding layer in Keras. The word vectors were randomly initialized and then trained during the training process.

Now, let's try introducing pre-trained word vectors, specifically GloVe.
First, construct the embedding_matrix, whose shape is (num_words, EMBEDDING_DIM), where EMBEDDING_DIM matches the GloVe vector dimension (100 here).
Next, copy the pre-trained vectors of the words in word_index into the embedding_matrix.
Finally, pass the embedding_matrix when defining the Embedding layer: set embeddings_initializer=Constant(embedding_matrix) and trainable=False (so the word vectors are not updated during training).
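
A minimal sketch of the first two steps, assuming embeddings_index is the word-to-vector dictionary built from the GloVe file (as in the loading sketch in Section 1):

import numpy as np

embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < num_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector  # words not found in GloVe stay all-zero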

Embedding(num_words,
          EMBEDDING_DIM,
          embeddings_initializer=Constant(embedding_matrix),
          input_length=MAX_SEQUENCE_LENGTH,
          trainable=False),

3.3 Model Training#

We compile and train the model using different layer designs and with or without the introduction of pre-trained word vectors.
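
A minimal compile-and-fit sketch, assuming a binary cross-entropy objective and that X_train_padded and X_valid_padded hold the padded sequences of the training and validation sets (these variable names and the epoch count are placeholders, not from the article):

import numpy as np

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(X_train_padded, np.array(y_train),
                    epochs=10,
                    validation_data=(X_valid_padded, np.array(y_valid)),
                    verbose=2)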

4. Model Prediction#

Based on the different layer designs and whether pre-trained word vectors are introduced, the model prediction results are as follows:

[Table: prediction results for each case (layer design, with/without pre-trained word vectors)]

Taking case6 as an example, the accuracy and loss graphs are as follows:

[Figure: case6 training and validation accuracy]

[Figure: case6 training and validation loss]

From the above, we can see that:

  1. With the current parameter settings, the CNN+Pooling design performs slightly better than the LSTM design. This is not conclusive, since the outcome depends heavily on how the hyperparameters are chosen, but the LSTM models are clearly more resource-intensive.
  2. Introducing pre-trained word vectors reduces training time, but it does not guarantee a better model.

5. Application of Chinese Sentiment Analysis#

So far, we have been working with English. What about Chinese?
The biggest difference between Chinese and English is that English words are naturally separated by spaces, while Chinese text is written without word boundaries. Therefore, for Chinese, the first step is word segmentation before the text goes to the tokenizer. Commonly used segmentation tools such as jieba solve this problem well.
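
For example, a minimal segmentation sketch with jieba (the exact segmentation depends on jieba's dictionary):

import jieba

words = jieba.lcut('情感分析是自然语言处理中的常见任务')
# e.g. something like ['情感', '分析', '是', '自然语言', '处理', '中', '的', '常见', '任务']

# Join with spaces so the Keras Tokenizer can split the text the same way it splits English
headline = ' '.join(words)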

Next, we need to address pre-trained word vectors. The results above show that using pre-trained word vectors as features is very effective, so it is worth introducing them whenever possible. There are generally two ways to obtain them:

  1. Use word vectors pre-trained on a general corpus such as Wikipedia. Advantages: wide vocabulary coverage and semantics that match everyday language. Disadvantages: specialized terms are missing, and the vector files are large.
  2. Train your own word vectors. Advantages: covers specialized terms and captures the semantics of the specific task more accurately. Disadvantages: poor generalization.

For semantic recognition, a good Python library is gensim, which also provides many methods for importing and training word vectors.
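
A minimal sketch of training word vectors with gensim (gensim 4.x API; in 3.x the vector_size argument was called size). The toy corpus here is only for illustration:

from gensim.models import Word2Vec

# A corpus is a list of tokenised sentences, e.g. the output of jieba segmentation
corpus = [['情感', '分析', '很', '有用'],
          ['自然语言', '处理', '是', '常见', '任务']]

# Train a small word2vec model (min_count=1 only because the toy corpus is tiny)
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

print(w2v.wv['处理'].shape)  # (100,) -- one 100-dimensional vector per word

# Pre-trained vectors in word2vec text format can instead be loaded with
# gensim.models.KeyedVectors.load_word2vec_format(...)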

Once the word segmentation and word vector library issues are resolved, the remaining steps are basically the same.


The download link for the files used in this article is as follows:
Link: https://pan.baidu.com/s/1jH_4Vzj1kh4af6domUCObQ
Password: cr0t
