PLM-AS: Pre-trained Language Models Augmented with Scanpaths for Sentiment Classification

Recent research has demonstrated that deep neural networks can generate meaningful feature representations from both eye-tracking data and sentences without handcrafted features, achieving competitive performance on cognitive NLP tasks such as sentiment classification over gaze datasets. However, previous work mainly encodes the text and gaze data separately, neither modeling the interaction between the two modalities nor applying large-scale pre-trained models. To address these challenges, we introduce PLM-AS, a novel framework that takes full advantage of textual and eye-tracking features through sequence modeling, fusing the two modalities in a highly interactive way. It is also the first attempt to combine large-scale pre-trained models with eye-tracking features in the cognitive reading task. We show that PLM-AS captures cognitive signals from eye-tracking data and improves sentiment classification performance within and across three datasets from different domains.


Introduction
Recent studies have shown that eye-tracking features reflect cognitive information and lead to stable improvements in natural language tasks [1,12,16,24,28], such as sentiment classification [14,17], sarcasm detection [18], named entity recognition [8], and coreference resolution [3]. This can be explained at three levels: 1) at the lexical level, entities and lengthy, complex words catch attention more easily than common words; 2) at the semantic level, implicit expressions may lead to longer fixation durations with a second pass over the content; 3) at the syntactic level, since sentiment judgment is an auxiliary task to content comprehension [2], participants facing comprehension difficulty will re-read the whole sentence several times when the phrasal structure is complex [11]. In all these aspects, human gaze data provides a wealth of cognitive information for content comprehension and thereby supports sentiment classification.
Alongside the breakthroughs that deep neural architectures have brought to many fields over the past decade, research on modeling text and human gaze data with deep neural networks has also been conducted. A convolutional neural network was first applied to learn feature representations from both text and human gaze data [19]; the gaze component of that model handled two fundamental eye-tracking attributes, fixation and saccade. A multi-task framework based on the recurrent LSTM architecture also achieved competitive performance with gaze features [20], and sentence-level attention over fixated words and their neighbors has likewise been applied to sentiment classification [2]. These works are limited in two respects: (a) the text and gaze representations were learned without any interaction between the two modalities, and the models simply concatenated the final outputs for multimodal fusion; (b) it is difficult to apply large-scale transformer architectures directly within such a two-tower framework.
In this paper, instead of encoding the two modalities separately with neural networks, we propose a novel neural network structure that first encodes the text modality and then performs sequence modeling over the fixation order of words from gaze scanpaths, with the support of pre-trained language models. We use the abbreviation PLM-AS for our proposed model, Pre-trained Language Models Augmented with Scanpaths. We conduct various experiments on different gaze datasets: the ETSA-I, ETSA-II, and ZuCo datasets released by [11], [18], and [9], respectively. In summary, our contributions in this work are: (1) We introduce a novel framework, PLM-AS, that combines text representations with eye-tracking data in a more intuitive way than previous early-fusion models, specifically by leveraging the processing information encoded in the fixation order.
(2) It is also the first attempt to combine contextualized embeddings from pre-trained language models with eye-tracking data over gaze datasets on sentiment classification.
(3) We conduct a series of controlled studies, reorganizing the outputs of pre-trained language models to investigate the impact of eye fixations on the framework, e.g., fixation words and fixation order. (4) We also test the cross-corpus capabilities of the PLM-AS framework and analyze the results from the perspective of generalization.

Motivation
The concept of a scanpath was first proposed by [22]; it refers to the trajectory (path) of the eyes when scanning the visual field while viewing and analyzing any kind of visual information. In human reading [25], a scanpath mainly records the sequence of eye fixations (50-1500 ms pauses on a fragment of text), revealing saccades (quick movements between two or more fixations in the same direction) and regressions (backward saccades to a previously visited fragment).
The inspiration for our proposed model is that, although human reading is not a strictly linear process in a single direction, the trajectory of the eye movements can still be organized as a time series of eye fixations. The composition of this new sequence overlaps heavily with the text itself, which means that we can represent the fixation sequence with fragments of the text, and it fits naturally into recurrent neural networks [26], e.g., the GRU architecture [5], due to its sequential nature. Since the annotations are produced by the subjects themselves, the cognitive information is automatically embedded in the scanpaths recorded while they read the sentences. Feeding the eye-tracking scanpaths into deep neural networks directly is therefore equivalent to combining the cognitive features with the corresponding text features in a more intuitive way.

Figure 1: The scanpath of reading the sentence S from the ETSA-II Dataset [18]. The fixation sequence records the positions of fixated words in the sentence, following the time series of eye fixations.
Our framework proceeds as follows: 1) First, contextualized word representations are generated by a transformer architecture over the sentences being read; we take the final layer of BERT [7] as the text representation. 2) By retrieving the features at the corresponding positions from BERT, step by step, according to the index sequence of fixated words (the scanpath), we obtain the text feature sequence in fixation order. 3) Since recurrent neural networks are designed for sequence modeling, the newly generated feature sequence serves as the input of the scanpath encoder, a GRU architecture [4], for the final multimodal fusion. 4) The output of the scanpath encoder at the final step, determined by the actual length of the fixation sequence, is used for sentiment polarity prediction.

Text Representation: Given a sentence S, each word is split to the subword level by the WordPiece tokenizer before entering the BERT architecture. This yields the subword sequence x_0, with the two special tokens [CLS] and [SEP] inserted at the beginning and the end of the sequence, respectively.
Each token is represented by the sum of its word embedding, position embedding, and segment embedding, and passes through bidirectional multi-head self-attention across multiple transformer blocks:

x_l = Transformer_l(x_{l-1}), l = 1, ..., L

Finally, we take the output sequence v = (v_1, ..., v_n) from the final layer of BERT as the text representation, where each v_i ∈ R^P and P is the hidden dimension of BERT.

Scanpath Encoder: Given a subword-based fixation index sequence f = (f_1, ..., f_t), we retrieve the features at the corresponding positions, step by step, from the output sequence v according to the indices of the fixated words, generating the new scanpath feature sequence

s = (v_{f_1}, v_{f_2}, ..., v_{f_t}), f_i ∈ N,

where N is the set of fixation word indices. The scanpath feature sequence s is then passed to the GRU architecture, giving the output sequence o = (o_1, ..., o_t) of the scanpath encoder, with each o_i ∈ R^Q, where Q is the output dimension of the scanpath encoder. Finally, we use the output o_t at the last time step for training and evaluation, where t is the actual length of the fixation index sequence.
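The retrieval of BERT features in fixation order can be sketched as a simple gather operation (an illustrative helper under our own naming, not the authors' released code):

```python
def gather_scanpath_features(bert_outputs, fixation_indices):
    """Build the scanpath feature sequence s from BERT's final layer.

    bert_outputs: list of per-token hidden vectors (one per subword).
    fixation_indices: token positions in the order they were fixated;
        positions may repeat (regressions) and may skip tokens.
    Returns the reordered feature sequence that is fed to the GRU encoder.
    """
    return [bert_outputs[i] for i in fixation_indices]
```

In a PyTorch implementation the same step would typically be a batched index-select over the hidden-state tensor, but the logic is exactly this reordering.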
Sentiment Polarity Classification: For the final classification, we take the output of the GRU at the last step, determined by the actual length of the fixation index sequence, as the final feature vector.

1) Binary classification:
The feature vector is passed to a linear layer with a sigmoid activation function to predict the sentiment label {0, 1}, positive or negative.
We optimize the model with the binary cross-entropy loss between the true labels and the predicted values during training.
2) Multi-class classification: The feature vector is passed to a linear layer with a softmax function, and we pick the index with the highest probability as the sentiment label {0, 1, 2}: positive, negative, or neutral.
We optimize the model with the softmax cross-entropy loss between the true labels and the predicted values during training.
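The two decision rules above can be sketched in plain Python (a minimal illustration of the inference-time heads; the actual model applies them on top of the GRU output with PyTorch layers and the corresponding cross-entropy losses):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_prediction(logit, threshold=0.5):
    # Binary head: label 1 (positive) if the sigmoid probability
    # exceeds the threshold, else 0 (negative).
    return 1 if sigmoid(logit) > threshold else 0

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiclass_prediction(logits):
    # Three-class head: argmax over softmax probabilities,
    # with indices {0, 1, 2} for positive, negative, neutral.
    probs = softmax(logits)
    return probs.index(max(probs))
```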

Datasets
All experiments followed the principle that a pair of one scanpath and one sentence is treated as a single example, rather than pairing multiple reading scanpaths with one sentence, so we reconstructed the examples in this way across the three cognitive datasets. In addition, we removed a few examples with senseless annotation results from readers.
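The sample-construction principle above can be sketched as follows (the field names are our own illustrative schema, not the datasets' actual format): each sentence read by k subjects expands into k (scanpath, sentence) examples.

```python
def build_examples(records):
    """Expand per-sentence records into one example per scanpath.

    records: iterable of dicts like
        {"sentence": str, "label": int,
         "scanpaths": [list-of-fixation-indices, ...]}  # one list per subject
    Returns a flat list of (scanpath, sentence, label) examples.
    """
    examples = []
    for rec in records:
        for path in rec["scanpaths"]:
            examples.append({
                "sentence": rec["sentence"],
                "fixations": path,
                "label": rec["label"],
            })
    return examples
```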
ETSA-I: We worked on the cognitive reading dataset Eye-Tracking and Sentiment Analysis-I, which was used by [11] for sentiment classification. The dataset contains 1,059 sentences in total from movie reviews and tweets, with annotations from five subjects, including eye-tracking data recorded by a remote Tobii TX 300 eye-tracker and a sentiment label (positive, negative, or neutral) for each sentence.
ETSA-II: We also applied our proposed framework to the cognitive reading dataset released by [18]. The original dataset, Eye-Tracking and Sentiment Analysis-II, which mainly supplements an NLP dataset with advanced eye-movement information, contains fixation sequence data for 383 positive and 611 negative sentences, including sarcastic quotes, short movie reviews, and tweets. Eye-tracking data from 7 subjects are included for each sentence, recorded by an SR Research EyeLink-1000 eye-tracker during reading.
ZuCo: Experiments on cross-domain learning were also carried out on the Zurich Cognitive Language Processing Corpus released by [9], which combines EEG and eye-tracking recordings of adult English native speakers reading natural sentences, as a resource for investigating the human reading process. The dataset includes simultaneous EEG and eye-tracking signals collected from 12 subjects during natural text reading; in this work we extract only the textual and gaze features. The gaze data was recorded by an SR Research EyeLink-1000 Plus eye-tracker. The corpus contains 400 sentences in total, of which 140 are positive, 123 negative, and 137 neutral, drawn from movie reviews and biographical sentences.

Parameter Settings
For the ETSA-I and ETSA-II datasets, we split each dataset into two subsets: 90% of the examples are used for training and 10% for validation. Following [21], we perform 25 runs for each model setting with different random initializations, using the same data split and the same hyperparameter settings, and report results averaged over these runs. Training runs for 20 epochs with a batch size of 16. We adopt the AdamW optimizer [15] with a learning rate of 0.0002 to minimize the loss, keeping the default settings of the PyTorch framework; the learning rate is linearly increased for the first 10% of steps and then linearly decayed to zero.
These settings are applied to BERT and the scanpath encoder equally. The scanpath encoder is a single-direction GRU with one recurrent layer; the hidden size is set to 768, and dropout of 0.1 is applied to the recurrent layer. We initialize the hidden state of the scanpath encoder with the [CLS] output from the final layer of BERT. Our implementation uses the PyTorch framework, and pre-trained models are loaded from HuggingFace Transformers [27], an open-source machine learning library in Python.

Table 1: Performance evaluation over the cognitive reading datasets [11,18]. Except for removing a few noisy samples, we split the datasets in the same way as the previous work (*) in [19]. We report macro-averaged precision (P), recall (R), and F1 score (F).
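The learning-rate schedule described above (linear warmup over the first 10% of steps, then linear decay to zero) can be sketched as a pure function of the step count; the base rate of 2e-4 follows the settings in the text, while the function name is our own:

```python
def lr_at_step(step, total_steps, base_lr=2e-4, warmup_frac=0.1):
    """Linear warmup for the first `warmup_frac` of steps, then linear
    decay to zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at the end.
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

In practice this corresponds to pairing AdamW with HuggingFace's linear-warmup scheduler rather than computing the rate by hand.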

Performance Evaluation
Similar to previous cognitive studies [19], we evaluate PLM-AS on two cognitive reading datasets for the sentiment classification task. The goal of our experiments is to investigate whether the proposed model can take full advantage of textual and eye-tracking features for multimodal fusion on sentiment classification, and to analyze where the improvement comes from via controlled baselines. Table 1 presents the performance of previous works and our proposed model. In addition, we evaluate the model in cross-domain learning over three different datasets, shown in Table 3.

Single modality vs. multimodality: Previous work [19] shows that CNN architectures trained on both text and eye-tracking data outperform single-modality settings. However, applying large-scale pre-trained models has since become the mainstream approach across natural language tasks. Our results show that the BERT model is another strong baseline on this task, even with text input only, but our proposed framework, PLM-AS, performs multimodal fusion and beats this new baseline on both datasets by exploiting large-scale pre-trained models and gaze features at the same time. Replacing BERT with a more advanced pre-trained model for text representation, e.g., RoBERTa [13], would likely yield further gains across all related settings, but that is not our main research purpose here.
Effect of fixation words (a): We consider that fixation words are selected subconsciously by the human cognitive process during reading, contributing to sentiment judgment after content comprehension [2]. We also ask whether our proposed model works equally well with random word choices instead of these human-made choices. To investigate this in PLM-AS, we carried out our first controlled baseline by replacing the fixation sequence with randomly chosen words before feeding the corresponding BERT outputs into the scanpath encoder. The results in Table 2 show that PLM-AS outperforms setting (a) on both datasets; to some extent, the word choices made during natural human reading share common cognitive ground and support sentiment judgment within our proposed model.

Effect of fixation order (b): The core idea of our proposed model is to capture eye-tracking features through the fixation sequences, which provide cognitive information about both word choice and fixation order. To better understand the impact of fixation order in PLM-AS, we shuffle the fixation order before feeding the sequences into the scanpath encoder while keeping the word choices intact. Unsurprisingly, PLM-AS outperforms the shuffled setting (b) in Table 2, which means that the fixation words alone cannot contribute to the overall performance without the order information, at least not in such an RNN-based sequential setting [26] of the scanpath encoder.

Effect of the scanpath encoder (c): We also question whether the improvement might come from the scanpath encoder itself rather than the eye-tracking features, so we carried out a third controlled baseline by replacing the fixation sequences with the word sequences of the natural text; the only difference between this setting and BERT is the extra GRU architecture, making it a text-only setting. The results in Table 2 show that the performance of this setting (c) is close to the BERT baseline but lower than that of PLM-AS, which indicates that the GRU architecture by itself, without any cognitive features, does not contribute much to the overall performance of PLM-AS.
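The three controlled index sequences compared in Table 2 can be sketched as follows (a hypothetical helper reflecting our reading of the setup; `n_tokens` denotes the number of subword tokens in the sentence):

```python
import random

def make_baseline_sequence(setting, fixation_seq, n_tokens, seed=0):
    """Build the index sequence fed to the scanpath encoder for each
    controlled setting:
      "a" - random word choices, same length as the real scanpath;
      "b" - the same fixated words, with their order shuffled;
      "c" - the natural text order, i.e. no gaze information at all.
    """
    rng = random.Random(seed)
    if setting == "a":
        return [rng.randrange(n_tokens) for _ in fixation_seq]
    if setting == "b":
        shuffled = list(fixation_seq)
        rng.shuffle(shuffled)
        return shuffled
    if setting == "c":
        return list(range(n_tokens))
    raise ValueError(f"unknown setting: {setting!r}")
```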
Cross-domain evaluation: Apart from these controlled baselines, we also perform a cross-domain evaluation on the ZuCo dataset. The results in Table 3 show that our PLM-AS framework achieves more competitive cross-domain performance relative to the BERT baseline when the models are trained on the ETSA-II Dataset rather than the ETSA-I Dataset. Note that the reading texts in the ETSA-II Dataset are collected from two popular sarcastic-quote websites, Twitter, and the Amazon Movie Corpus [23], with a higher level of complexity and diversity than the ETSA-I and ZuCo datasets. Since nearly half of the reading texts in the ETSA-II Dataset are sarcastic, it can be assumed that its eye-tracking data (scanpaths) are more abundant and diverse, which improves the overall performance. In addition, human scanpaths may vary not only across text domains but also from person to person due to individual reading behaviors. Instead of averaging eye-tracking features across all subjects, our PLM-AS framework may learn the reading patterns of a certain group of subjects and face challenges in generalizing these learned patterns to other subjects under such subject-based sample construction. At the testing stage of the cross-domain measurements, this inconsistency between datasets' subjects should, to some extent, also be considered a factor in the undesirable results when the models are trained on the ETSA-I Dataset.

Conclusion
In this paper, we propose a novel framework that fully combines text representations with eye-tracking features through scanpath modeling, and we carry out experiments to evaluate our model (PLM-AS) on sentiment classification. The results show that PLM-AS captures cognitive signals from the eye-tracking data and improves sentiment classification performance within and across three datasets of different domains. This indicates that the order of fixations during text reading carries linguistic information that is useful for NLP tasks. Since all experiments are carried out on small datasets with limited text, and unstable results are observed when applying large-scale language models, we follow the evaluation strategy in [21] to obtain convincing results. Since it is not always practical to obtain eye-tracking data for augmentation at test time, many studies have been proposed in recent years for predicting gaze features on text [10] and images [6]. However, scanpath prediction on text has not yet been explored sufficiently; it could be investigated as an auxiliary task in different NLP challenges during training to remove this limitation.

Table 2: Performance evaluation of the controlled baselines. Note that the text inputs to the pre-trained language model stay the same, without any shuffling or replacement, so the text representation is identical across all settings; the changes occur in the next stage of encoding with the GRU: (a) we create a subword-based index sequence with random words from the sentence to replace the fixation sequence; (b) we create another index sequence by shuffling the fixation sequence; (c) we replace the fixation sequence with the natural text, in the same order as the text inputs to the pre-trained language model.

Table 3: Cross-domain evaluation over the three datasets.