
On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training

Published: 21 December 2022


Abstract

Aspect-based sentiment analysis (ABSA) aims to automatically infer the sentiment polarities toward specific aspects of products or services mentioned in social media texts or reviews, and has become a fundamental real-world application. Since the early 2010s, ABSA has achieved remarkably high accuracy with various deep neural models. However, existing ABSA models with strong in-house performance may fail to generalize to challenging cases where the contexts are variable, i.e., they show low robustness in real-world environments. In this study, we propose to enhance ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training. First, we strengthen the current best-performing syntax-aware models by incorporating the rich external syntactic dependencies and their labels together with the target aspect, via a universal-syntax graph convolutional network. From the corpus perspective, we propose to automatically induce high-quality synthetic training data of various types, allowing models to learn sufficient inductive bias for better robustness. Last, based on the rich pseudo data, we perform adversarial training to enhance the resistance to context perturbation and meanwhile employ contrastive learning to reinforce the representations of instances with contrasting sentiments. Extensive robustness evaluations are conducted. The results demonstrate that our enhanced syntax-aware model achieves better robustness than all the state-of-the-art baselines. By additionally incorporating our synthetic corpus, the robust testing results are improved by around 10% accuracy, and they are further improved by installing the advanced training strategies. In-depth analyses are presented to reveal the factors influencing ABSA robustness.


1 INTRODUCTION

Sentiment analysis, which mines the user’s opinion behind social media or product review texts, has long been a hot research topic in the data mining and natural language processing (NLP) communities [26, 47, 58, 70, 74]. Aspect-based sentiment analysis (ABSA), or fine-grained sentiment analysis, a later-emerging research direction that aims to infer the sentiment polarity toward a specific aspect in text, has gained an overwhelming number of research efforts since the early 2010s [9, 11, 41, 42, 60, 82]. In recent years, ABSA has secured prominent performance gains [4, 8, 35, 61, 71, 79] with the establishment of various deep neural networks.

Although current strong-performing ABSA models have achieved high accuracy on standard test sets (e.g., the SemEval data [52, 53]), they may fail to generalize correctly to new cases in the wild where the contexts vary. Especially in real-world applications, unlike in an enclosed test, the ABSA system receives all kinds of diversified inputs from a variety of users, which naturally calls for more robust ABSA models. Figure 1 gives a running example. An ABSA system well trained on in-house training data can perform well when evaluated on in-house testing data. Unfortunately, once the ABSA model is deployed in a recommender system with real-world user inputs, it fails to generalize to unseen cases in the wild, i.e., it shows low robustness to the factual environment.


Fig. 1. An example to illustrate the performance gap between in-house evaluation and real-world scenario of aspect-based sentiment analysis model.

Recent research on robustness has shown that the accuracy of current ABSA methods can drop drastically, by over 50% [76]. Within one piece of text, only a small subset of the context genuinely triggers the sentiment polarity of the target aspect (generally the opinion terms). Correspondingly, a robust ABSA system needs to place most of its focus on such critical cues instead of other trivial or even misleading clues, and should not be disturbed by changes of the non-critical background contexts [76].

Based on recent studies [35, 76], two major robustness challenges in ABSA can be summarized, as shown in Table 1. The first type, the aspect-context binding challenge, requires that the target aspect be correctly bound to its corresponding key clues instead of other trivial words. Taking [Raw S1] as an example, altering the crucial opinion expression of the target aspect (i.e., from “fabulous” to “awful”) should directly flip its polarity (i.e., from positive to negative), as in [Mod S1-1]. Also, diversifying the non-critical words (i.e., altering the trivial background contexts) should not influence its polarity, as in [Mod S1-2]. A low-robustness model would be vulnerable when faced with such changes. The other type is the multi-aspect anti-interference challenge. When multiple aspects coexist in one sentence, the sentiment of the target aspect should not be interfered with by the other aspects. For example, based on [Raw S2], adding an additional non-target aspect (as in [Mod S2-1]), especially one with an opposite polarity (as in [Mod S2-2]), should not influence the judgment of the polarity of the target aspect.


Table 1. Challenges of Robustness Tests in Aspect-based Sentiment Analysis

In this study, we explore the enhancement of ABSA robustness. We cast the following research questions:

Q1:

What types of neural models are more robust for ABSA?

Q2:

Is the current ABSA corpus informative enough for models to learn good bias with high robustness?

Q3:

Will the model become more robust via a better training strategy?

These three questions together reflect the bottlenecks of ABSA robustness from different aspects, i.e., model, data, and training.

With respect to the robust ABSA model, the key is to effectively model the relationship between the target aspect and its valid contexts, e.g., using attention mechanisms [32, 60, 72] or position encoding [59] to enhance the sense of the location of the target aspect. In particular, a large proportion of work has shown that leveraging syntactic dependency information helps the most [31, 68, 76, 81]. We note, however, that most existing syntax-aware models only integrate the word dependencies while leaving the syntactic dependency labels unused. In fact, dependency arcs of different types carry distinct evidence and may contribute to different degrees, which can help to better infer the relations between the aspect and its valid clues. Thus, how to better navigate the rich external syntax for better robustness remains unexplored.

As for the corpus, almost all ABSA models are trained and evaluated in an enclosed setting based on the SemEval datasets [51, 52, 53]. But 80% of the sentences in these datasets have either a single aspect or multiple aspects with the same polarity [35]. Even a well-trained ABSA model on such data will suffer performance degradation when exposed to an open environment with complex inputs. Ideally, a training corpus with varied and challenging instances would enable ABSA models to be more robust. Yet manually annotating data is labor-intensive, which makes automatic high-quality data acquisition indispensable. Besides, most ABSA frameworks are optimized directly toward the gold targets with a cross-entropy loss. This inevitably leads to inefficient utilization of the training data, or even poor resistance to perturbation. We believe a better training strategy can help excavate the knowledge behind the data more efficiently and sufficiently.

To this end, we aim to enhance ABSA robustness by rethinking the model, data, and training, respectively, for each of which we propose a retrofitting solution. First, we introduce a universal-syntax graph convolutional network (USGCN) that incorporates the syntactic dependencies and their labels simultaneously. By effectively modeling rich syntactic indications, USGCN learns to reason better between the aspect and its contexts. Second, we present an algorithm for automatic synthetic data construction. Three types of high-quality pseudo corpora are induced from the raw data, enriching the data diversity for robust learning. Third, we leverage two enhanced training strategies for robust ABSA: adversarial training [14, 56] and contrastive learning [5, 28]. Based on the synthetic training data, the adversarial training helps reinforce the perception of contextual change, while the contrastive learning further consolidates, in an unsupervised manner, the recognition of different labels.

We note that the two main challenges of ABSA shown in Table 1 are essentially two sides of the same coin. In this article, all three perspectives (i.e., model, data, and training) and the corresponding methods we propose target these two challenges. For example, we propose three synthetic data construction methods, where the sentiment modification method (Section 4.1) and the background rewriting method (Section 4.2) both directly address the aspect-context binding challenge, and the non-target aspect addition method (Section 4.3) is proposed to relieve the multi-aspect anti-interference challenge. With respect to the model, our proposed syntax-aware ABSA model (Section 3) enhances the aspect–opinion binding, which indirectly solves the aspect anti-interference issue. The advanced training strategies (Section 5) help solve both challenges. We summarize the overall proposal in Figure 2.


Fig. 2. A high-level overview of our solutions for enhancing the robustness of aspect-based sentiment analysis.

We perform extensive experiments on multiple robustness testing datasets for ABSA. Experimental results show that the USGCN model achieves more robust performance than all the state-of-the-art baseline systems. All ABSA models achieve substantially enhanced robustness when additionally using our pseudo training data, which can be further strengthened by installing the advanced training strategies. Further in-depth analyses from multiple angles reveal the factors influencing ABSA robustness.

In general, the contributions of our work are as follows.

(1)

We propose a novel syntax-aware model: we model the syntactic dependency structure, the arc labels, and the target aspect simultaneously with a Graph Convolutional Network (GCN) encoder, namely USGCN. With USGCN, we navigate richer syntax information for the best ABSA robustness.

(2)

We build an algorithm for automatically inducing high-quality synthetic training data of various types, allowing models to learn sufficient inductive bias for better robustness. Each type of pseudo data aims to improve one specific angle of ABSA robustness.

(3)

We perform adversarial training based on the pseudo data to enhance the resistance to environment perturbation. Meanwhile, we employ unsupervised contrastive learning for a further enhancement of representation learning, based on the contrastive samples in the pseudo data.

(4)

Our overall framework achieves significant improvements on robustness tests on the benchmark datasets. In-depth analyses are presented to reveal the factors influencing ABSA robustness.

The remainder of the article is organized as follows. Section 2 surveys the related work. Section 3 elaborates in detail the enhanced syntax-aware ABSA neural model. In Section 4, we present the algorithm for the synthetic training corpus construction. Section 5 shows how to perform the advanced training strategies. Section 6 gives the experimental setups and the results of the robustness study of our system. Section 7 analyzes in depth the factors influencing ABSA robustness. Finally, in Section 8 we present the conclusions and future work.


2 RELATED WORK

In this section, we give a literature review of related work on sentiment analysis and the robustness study of aspect-based sentiment analysis.

2.1 Sentiment Analysis and Opinion Mining

Sentiment analysis, or opinion mining, aims to use machines to automatically infer the sentiment intensities or attitudes of texts generated by users on the Internet [12, 17, 48, 50]. Since it has great impact on real-world society, sentiment analysis facilitates a wide range of downstream applications and has long been a fundamental research direction in the NLP and data mining communities over the past decades [55, 69, 70]. Initial methods for sentiment analysis employed rule-based models, e.g., using sentiment or opinion lexicons or designing hard-coded regular expressions [48, 50]. Then, researchers incorporated statistical machine learning models with hand-crafted features [47, 74].

Since the early 2010s, deep learning methods have received great attention. Neural networks together with continuous distributed features have been extensively adopted to enhance sentiment analysis performance [21, 22, 38, 55]. In particular, Long Short-Term Memory (LSTM) models [30], Convolutional Neural Networks (CNN) [37], attention mechanisms [72, 73], and GCNs [75, 81] are the most notable deep learning methods extensively adopted for sentiment analysis. For example, Wang et al. [72] propose an attention-based LSTM network for attending to different parts of the aspects for aspect-level sentiment classification. Xue et al. [79] propose a CNN-based model with gating mechanisms for selectively learning the sentiment features while staying computationally efficient with convolutions. Zhang et al. [81] build a GCN encoder over the syntactic dependency trees of sentences to exploit syntactic information and word dependencies.

More recently, the research focus has shifted to ABSA, which detects the sentiment polarities toward specific aspects in a sentence [53, 64]. Compared with standard coarse-grained (i.e., sentence-level) sentiment analysis, such fine-grained analysis has more impact on real-world scenarios, such as social media texts and product reviews, and thus facilitates a wider range of downstream applications. Prior methods mostly employed statistical machine learning models with manually crafted discrete features [47, 50, 74]. Later, neural networks together with continuous distributed features, as used in sentence-level sentiment analysis, were extensively adopted and achieved big wins [12, 31, 35, 81]. The difference between the neural models for coarse-grained sentiment analysis and for ABSA lies in that ABSA additionally needs to model the target aspect with respect to its contexts. Tang et al. [60] use a memory network to cache the sentential representations into external memory and then calculate the attention with the target aspect. Recently, Veyseh et al. [54] regulate the GCN-based representation vectors based on the dependency trees to benefit from the overall contextual importance scores of the words.

2.2 Robustness Study of Aspect-based Sentiment Analysis

Analyzing the robustness of learning systems is a crucial step prior to model deployment; robustness study has thus been an important research direction in many areas. A highly performing system on a test set may fail to generalize to new examples where the contexts vary, such as under distribution shift or adversarial noise [3, 34, 45]. Similarly, although current state-of-the-art ABSA models obtain high scores on the test datasets, they could be low in robustness. In recent ABSA robustness probing studies [35, 76], all existing models show huge accuracy degradation when tested on the robustness test sets. It thus becomes imperative to strengthen ABSA robustness. However, unlike the robustness problem in other NLP tasks such as text classification, the ABSA task is characterized by multiple aspect mentions whose supporting clues can be intertwined in one sentence, which makes it more difficult to solve. As we argued earlier, there are at least three angles to begin with, i.e., model, data, and training.

Various neural models have been investigated for better ABSA, e.g., RNNs [59], memory networks [60], attention networks [32, 72], and graph networks [31, 61, 68]. Later research has repeatedly shown that syntactic dependency trees are of great effectiveness for ABSA, since such information provides additional signals that help infer the relations between the target and its valid contexts [31, 68, 81]. A very recent study [76], however, shows that many of those neural models that achieve high accuracy on standard test sets, e.g., with attention or memory mechanisms, exhibit low robustness. It revealed that explicit aspect-position modeling (as in syntax-aware models) and pre-trained language models yield better robustness. We find that the arc labels in the dependency structure are also useful but are abandoned by existing syntax-aware models. We thus present a better solution for leveraging external syntax knowledge, i.e., simultaneously modeling the dependency arcs and their types with graph models. Besides, in our experiments we further explore whether better pre-trained language models (PLMs) can improve robustness.

As preliminary works noted, most sentences in current ABSA datasets (i.e., SemEval) contain either a single aspect or multiple aspects with the same polarity, which downgrades the problem to coarse-grained (sentence-level) sentiment classification [35, 78]. This underlies the weak robustness of current ABSA models that nevertheless perform well on the testing sets. To combat that, Jiang et al. [35] crafted a much more challenging dataset, in which each sentence contains at least two aspects with different sentiment polarities (i.e., the multi-aspect multi-sentiment (MAMS) data). They show that MAMS can prevent ABSA from degenerating to sentence-level sentiment analysis and thus improve ABSA robustness. We also show in later experiments that training with these data makes ABSA models more generalizable. However, we note that the robustness-driven MAMS data are fully annotated with human labor, which incurs huge costs. To ensure data diversity for robust learning while avoiding manual costs, we consider in this work a scalable method for automatic data construction. We obtain three types of high-quality pseudo corpora by (1) flipping the sentiment of the target aspect, (2) rewriting the background contexts of the target aspect, and (3) adding extra non-target aspects. In this respect, our work partially draws inspiration from the recent work of Xing et al. [76]. Yet we differ from their work in four ways. First, they locate the crucial opinion expressions for each target by additionally using the existing labeled TOWE data [15], while our automatic algorithm finds such valid expressions heuristically. Second, they only construct a small set for testing, whereas we construct larger volumes of data for training. Third, they rely on human evaluation for quality inspection, while our method ensures high data quality without human interference. Moreover, we consider diversifying the background contexts of examples, which is absent from their consideration.

This work also relates to adversarial training, which alters the input slightly so as to keep the original meaning but cause different predictions, and which has been a long-standing method to enhance the robustness of NLP systems [14, 46, 56, 80]. We note that the adversarial training strategy has been employed in some existing ABSA works, but for improving in-house performance [36, 39, 40]. In this work, we for the first time design adversarial training based on multiple types of synthetic training data to reinforce the model’s perception of contextual change, so as to obtain better environment independence. Moreover, based on the synthetic examples, we employ a contrastive learning algorithm to consolidate, without supervision, the representations of examples with different polarities in high-dimensional space. Contrastive learning is an unsupervised or self-supervised approach that has recently been successfully employed in multiple areas, e.g., computer vision and NLP [28, 29, 67]. The main idea is to force a model to narrow the distance between examples with similar targets and meanwhile widen the distance between those with different targets. To our knowledge, we are the first to utilize the contrastive learning technique for robust ABSA learning.


3 SYNTAX-AWARE NEURAL MODEL

Task Formulation. The goal of ABSA is to determine the sentiment polarity toward a specific aspect, which we formalize as a classification problem on sentence–aspect pairs. Technically, given an input sentence \(X=\lbrace x_1,\ldots ,x_n\rbrace\) and an aspect term \(A=\lbrace x_i,\ldots ,x_j\rbrace\) that is a sub-string of input sentence \(X\), the model is expected to predict the corresponding sentiment label \(\hat{y}\). Note that one sentence may contain multiple aspect terms, and we can correspondingly construct multiple sentence–aspect pairs for one sentence under a one-to-many mapping. Through our framework, the classification can be formalized as: (1) \(\begin{equation} y^C = \mathop {\text{argmax}}_{y \in C} p(y | X, A) \,, \end{equation}\) where \(C\) denotes the set of all sentiment polarity labels, i.e., “Positive,” “Negative,” and “Neutral.”

Model Overview. The proposed neural framework mainly consists of three layers: the base encoder layer, the syntax fusion layer, and the aggregation layer. The base encoder layer employs the Transformer model [65], taking as input the sentence and the aspect term and yielding contextualized word representations as well as the aspect term representation. The syntax fusion layer, i.e., our proposed USGCN, fuses the rich external syntactic knowledge into the feature representations. Finally, the aggregation layer summarizes and gathers the feature representations into a single final representation, based on which the classification layer makes a prediction. The overall framework is shown in Figure 3.


Fig. 3. The overall framework for aspect-based sentiment analysis.

3.1 Base Encoder Layer

We employ a multi-layer Transformer to yield the contextualized word representations \(\mathbf {h}^X_i\) as well as the aspect representation \(\mathbf {r}^{asp}\). The Transformer encoder has proven prominent at learning the interaction between each pair of input words, leading to better contextualized word representations. Technically, in the Transformer encoder, the input \(\mathbf {x}\) is first mapped into queries \(\mathbf {Q}\), keys \(\mathbf {K}\), and values \(\mathbf {V}\) via linear projections. We then compute the relatedness between \(\mathbf {K}\) and \(\mathbf {Q}\) via the scaled dot-product alignment function, which is multiplied by the values \(\mathbf {V}\), (2) \(\begin{equation} \mathbf {\alpha } = \text{Softmax}\left(\frac{\mathbf {Q}\cdot \mathbf {K}^{\mathrm{T}}}{\sqrt {d_{k}}}\right) \cdot \mathbf {V}, \end{equation}\) where \(d_{k}\) is a scaling factor. \(\mathbf {Q}\), \(\mathbf {K}\), and \(\mathbf {V}\) are all derived from the same input words in our practice. Multiple parallel attention heads focus on different parts of semantic learning. Also, we can alternatively take the pre-trained BERT parameters [10] as the Transformer’s initialization to boost the performance.
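For concreteness, the scaled dot-product attention of Equation (2) can be sketched in a few lines of PyTorch. This is a minimal single-head version; the full encoder additionally applies learned linear projections and multiple parallel heads:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k), obtained by linear projections of the input
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # relatedness of Q and K
    alpha = F.softmax(scores, dim=-1)              # attention distribution
    return alpha @ V                               # weighted sum of values
```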

To form an input sequence, we first concatenate the input sentence \(X\) and the aspect term \(A\) and add some special tokens: \(\hat{X}=\lbrace `CLS^{\prime }, X, `SEP^{\prime }, A, `SEP^{\prime } \rbrace\), where ‘CLS’ is a symbol token for yielding the sentence-level overall representation and ‘SEP’ is a special token for separating the sentential words and the aspect terms. In total, we can summarize the calculations in the base encoder as follows: (3) \(\begin{equation} \lbrace \mathbf {h}^{CLS}, \mathbf {H}^{X}, \mathbf {H}^{asp}\rbrace = \text{Trm}(\hat{X}) \,, \end{equation}\) where \(\mathbf {h}^{CLS}\) is the representation of the overall sentence. \(\mathbf {H}^{X} = \lbrace \mathbf {h}^X_1, \ldots , \mathbf {h}^X_n\rbrace\) are the sentential word representations, and \(\mathbf {H}^{asp} = \lbrace \mathbf {{h}}^{asp}_i, \ldots , \mathbf {{h}}^{asp}_j\rbrace\) are the aspect term representations, which are pooled into one \(\mathbf {r}^{asp}\).
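A minimal sketch of this input construction, assuming the Hugging Face transformers tokenizer (which inserts the ‘CLS’ and ‘SEP’ tokens for a sentence pair automatically); the example sentence is illustrative:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = "The food is fabulous and the service is ordinary."
aspect = "food"
# yields: [CLS] sentence tokens [SEP] aspect tokens [SEP]
enc = tokenizer(sentence, aspect, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
```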

3.2 Syntax Fusion Layer

We further fuse the dependency syntax for feature enhancement with the USGCN module illustrated in Figure 4. Previous works on ABSA unfortunately make use of only the syntactic dependency edge features (i.e., the tree structure) [18, 27, 31, 54]. Without modeling the syntactic dependency labels attached to the dependency arcs, prior studies are limited to treating all word–word relations in the graph equally [16, 19, 20, 23]. Intuitively, dependency edges with different labels can reveal the relationship between the target aspect and the crucial clues within the context more informatively, as exemplified in Figure 5.


Fig. 4. Illustration of the proposed (a) USGCN based on the (b) syntactic dependency tree of input sentence.


Fig. 5. An example of syntactic dependency structure with edges types, based on the sentence of [Raw S1] in Table 1.

Compared with the other arcs within the syntax structure, the one labeled nsubj presents the most distinctive clue for locating the aspect “food” with its direct opinion term “fabulous,” which strongly guides the sentiment polarity.

Also, GCN [81] has proven effective at aggregating the feature vectors of neighboring nodes within a syntactic structure and propagating the information of a node to its neighbors. Based on GCN, we propose the novel USGCN, modeling the dependency arcs and labels together with the target aspect term simultaneously, as shown in Figure 4(a). Technically, we are given the input sentence \(X\) with its corresponding dependency parse (including edges \(\Omega\) and labels \(\Gamma\)). We define an adjacency matrix \(B = \lbrace b_{i,j}\rbrace _{n \times n}\) for the dependency edges between each pair of words \(w_i\) and \(w_j\), where \(b_{i,j}=1\) if there is an edge (\(\in \Omega\)) between \(w_i\) and \(w_j\), and \(b_{i,j}=0\) otherwise. There is also a dependency label matrix \(R = \lbrace r_{i,j}\rbrace _{n \times n}\), where each \(r_{i,j}\) denotes the dependency relation label (\(\in \Gamma\)) between \(w_i\) and \(w_j\). In addition to the pre-defined labels in \(\Gamma\), we add a “self” label as the self-loop arc \(r_{i,i}\) for \(w_i\), and a “none” label representing no arc between \(w_i\) and \(w_j\). We maintain a vectorial embedding \(\mathbf {x}^e_{i,j}\) for each dependency label in \(\Gamma\).

USGCN consists of \(L\) layers, and we denote the resulting hidden representation of \(w_i\) at the \(l\)th layer as \(\mathbf {r}^{l}_i\), (4) \(\begin{equation} \mathbf {r}^{l}_i = \text{ReLU}\left(\sum _{j=1}^n \alpha _{i,j}^{l} (\mathbf {W}_a^{l} \cdot [ \mathbf {r}^{l-1}_j ; \mathbf {x}^e_{i,j} ; \mathbf {r}^{asp}] + b^{l}) \right) \,, \end{equation}\) where \(\mathbf {W}_a^{l}\) is a parameter matrix, \(b^{l}\) is the bias term, \([;]\) denotes concatenation, and \(\alpha _{i,j}^{l}\) is the neighbor connecting-strength distribution calculated via a softmax function, (5) \(\begin{equation} \alpha _{i,j}^{l} = \frac{ b_{i,j} \cdot \exp {(\mathbf {W}_b^{l}[\mathbf {r}^{l-1}_j;\mathbf {x}^e_{i,j}; \mathbf {r}^{asp}}])}{ \sum _{k=1}^n b_{i,k} \cdot \exp {(\mathbf {W}_b^{l}[\mathbf {r}^{l-1}_k;\mathbf {x}^e_{i,k}; \mathbf {r}^{asp}])} } \,. \end{equation}\)

The weight distribution \(\alpha _{i,j}\) entails the structural information from both the dependent edges and the corresponding labels jointly with the target aspect and thus can comprehensively reflect the syntactic attributes toward aspect. Note that for the first-layer USGCN, \(\mathbf {r}^{0}_i=\mathbf {h}_i^X\).
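The following PyTorch sketch shows one USGCN layer under our reading of Equations (4)-(5); the dimensions, the batching (a single sentence), and the scalar scoring via \(\mathbf{W}_b\) are our assumptions where the text leaves them implicit:

```python
import torch
import torch.nn as nn

class USGCNLayer(nn.Module):
    """One USGCN layer (Eqs. (4)-(5)): label-aware, aspect-conditioned GCN."""
    def __init__(self, hid_dim, label_dim, asp_dim):
        super().__init__()
        in_dim = hid_dim + label_dim + asp_dim
        self.W_a = nn.Linear(in_dim, hid_dim)  # W_a^l with bias b^l
        self.W_b = nn.Linear(in_dim, 1)        # scores connecting strength

    def forward(self, r, x_e, r_asp, B):
        # r:     (n, hid_dim)        node states r^{l-1}
        # x_e:   (n, n, label_dim)   dependency-label embeddings x^e_{i,j}
        # r_asp: (asp_dim,)          pooled aspect representation
        # B:     (n, n)              0/1 adjacency, self-loops included
        n = r.size(0)
        # feat[i, j] = [r_j^{l-1} ; x^e_{i,j} ; r^asp]
        feat = torch.cat([r.unsqueeze(0).expand(n, -1, -1),
                          x_e,
                          r_asp.expand(n, n, -1)], dim=-1)
        scores = self.W_b(feat).squeeze(-1).exp() * B   # masked by edges
        alpha = scores / scores.sum(-1, keepdim=True).clamp_min(1e-9)
        msg = self.W_a(feat)                            # (n, n, hid_dim)
        return torch.relu((alpha.unsqueeze(-1) * msg).sum(dim=1))
```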

3.3 Aggregation Layer

Next, we perform an aspect-aware aggregation to collect the salient and useful features relevant to target aspect. (6) \(\begin{equation} \begin{aligned}\mathbf {v}_{i} &= \mathrm{Tanh}\left(\mathbf {W}_c [ \mathbf {r}^{L}_i ; \mathbf {r}^{asp}] + b\right) \,,\\ \beta _{i} &= \mathrm{Softmax}(\mathbf {v}_{i}) \,, \\ \mathbf {r}^{a} &= \sum \beta _{i} \cdot \mathbf {r}^{L}_i \end{aligned} \end{equation}\)

We then concatenate \(\mathbf {r}^{a}\) with the sentence representation \(\mathbf {r}^{CLS}\) into a final feature representation \(\mathbf {r}^f\), based on which we finally apply a softmax function for predicting \({y}^c\).
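A sketch of this aggregation, under one plausible reading of Equation (6) in which \(\mathbf{W}_c\) produces a scalar score per token and the softmax normalizes over tokens (the equation leaves the normalization axis implicit):

```python
import torch

def aspect_aware_aggregate(R_L, r_asp, W_c):
    # R_L: (n, hid) final USGCN states; r_asp: (asp,); W_c: nn.Linear(hid+asp, 1)
    n = R_L.size(0)
    v = torch.tanh(W_c(torch.cat([R_L, r_asp.expand(n, -1)], dim=-1)))  # (n, 1)
    beta = torch.softmax(v, dim=0)   # attention over tokens
    return (beta * R_L).sum(dim=0)   # r^a, later concatenated with r^CLS
```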


4 SYNTHETIC CORPUS CONSTRUCTION

Synthetic data construction is a popular direction in the NLP community that effectively helps relieve data annotation issues, such as data scarcity [57], label imbalance [6], and cross-lingual data [25]. In this section, we elaborate the synthetic corpus construction for diversifying the raw training data (denoted \(\mathbb {D}_o\)). We introduce three types of pseudo data: (1) sentiment modification of the target aspect (\(\mathbb {D}_a\)), (2) background rewriting of the target aspect (\(\mathbb {D}_n\)), and (3) extra non-target aspect addition (\(\mathbb {D}_m\)). These three supplementary sets provide rich signals from different angles, together helping models learn sufficient bias for more robust ABSA. We denote the union of the three synthetic sets as \(\mathbb {D}_s = \mathbb {D}_a \cup \mathbb {D}_n \cup \mathbb {D}_m\).

4.1 Sentiment Modification

Modifying the sentiments of aspects is the primary operation. For the \(k\)th aspect \(A_{i,k}\) (with polarity label \(y_{i,k}^{C}\)) in the \(i\)th original sample \(X^o_i\) (\(X^o_i \in \mathbb {D}_o\)), we aim to generate a batch of new sentences \(X_{i,k(j)}^o \in \mathbb {D}_a\) where the sentiment polarity of \(A_{i,k}\) is (1) kept the same as \(y_{i,k}^{C}\) or (2) flipped into one of the two other labels, i.e., \(y_{i,k}^{C} \mapsto y_{i,k}^{C^{^{\prime }}}\). The creation of \(\mathbb {D}_a\) involves two steps: locating the opinion and changing the sentiment.

4.1.1 Locating Opinion.

The key to sentiment modification is to locate the exact opinion text \(O_{i,k}\) of the target aspect \(A_{i,k}\). In Xing et al. [76], the TOWE data [15] are used, where such opinion expressions are labeled explicitly based on the SemEval data. In this work, however, we do not consider using TOWE for multiple reasons. First, TOWE comes from fully manual annotation, while we aim to build a completely automatic algorithm. Second, training ABSA models with additional labeled opinion signals (i.e., TOWE) can lead to unfair comparisons. Instead, we reach the goal heuristically with rules: we extract the aspect’s explicit opinion expressions that satisfy the following syntactic dependency relations (a minimal extraction sketch follows the list).

(1)

amod (adjectival modifier) relation, for example, the aspect–opinion pair “price”–“reasonable” in “a reasonable price.”

(2)

nsubj (nominal subject) relation, e.g., a pair “room”–“small” in “the room is small.”

(3)

dobj (direct object) relation, e.g., “smell”–“love” in “I love the smell.”

(4)

xcomp (open clausal complement) relation, e.g., “beer”–“spicy” in “the beer tastes spicy.”
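A minimal sketch of this rule-based extraction, assuming a spaCy dependency parser (any parser exposing heads and relation labels works); the traversal is our heuristic reading of the four rules, covering the aspect both as head (amod) and as dependent (nsubj, dobj, xcomp):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
RELATIONS = {"amod", "nsubj", "dobj", "xcomp"}

def locate_opinions(sentence, aspect):
    """Return candidate opinion tokens tied to the aspect word."""
    opinions = []
    for tok in nlp(sentence):
        if tok.text.lower() != aspect.lower():
            continue
        # aspect as head, e.g., amod: "price" <- "reasonable"
        opinions += [c for c in tok.children if c.dep_ in RELATIONS]
        # aspect as dependent, e.g., nsubj: "room" -> "small", dobj: "smell" -> "love"
        if tok.dep_ in RELATIONS:
            opinions.append(tok.head)
    return opinions

print(locate_opinions("The room is small.", "room"))  # e.g., [small]
```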

4.1.2 Changing Sentiment.

Then, we consult a sentiment lexicon resource, such as SentiWordNet [1], for opinion word replacement. For example, for the word “difficult,” we can obtain its antonymous opinion words “easy” and “simple” and its synonymous words “hard” and “tough,” and so on. In addition, we can flip the polarity with negation words or adverbs. In this way we obtain a set of target candidates \(O^t_{i,k(j)}\) for the replacement of the source opinion \(O^s_{i,k}\). We perform such replacements one by one to get the new sentences \(X_{i,k(j)}^o\).

To control the induction quality, we define a modification confidence as the likelihood of a successful modification, i.e., correctly finding the opinion statement and amending the sentiment into the target. Note that with the lexicon resource, for each word we can easily obtain its sentiment strength score \(a(O,C) \in [0,1]\) toward each of the three polarities. For the source opinion expression \(O^s_{i,k}\) we take its sentiment score \(a(O^s,C_s)\) toward the source gold polarity \(y_{i,k}^{C_s}\) as the opinion localization confidence. Likewise, for \(O^s_{i,k}\)’s \(j\)th candidate replacement \(O^t_{i,k(j)}\), we collect all three of its sentiment scores. We then define the modification confidence as (7) \(\begin{equation} p_a(O^{s\mapsto t},y^{C_s\mapsto C_t}) = a(O^s,C_s) \cdot \frac{2 a(O^t,C_t)}{\sum _{C_e\ne C_t} a(O^t,C_e)} \,, \end{equation}\) where the first term \(a(O^s,C_s)\) indicates the confidence of the correct opinion localization and the latter part \(\frac{2 a(O^t,C_t)}{\sum _{C_e\ne C_t} a(O^t,C_e)}\) indicates the sentiment flipping confidence. We filter out those cases whose modification confidence is lower than a pre-defined threshold \(\theta _a\).
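Equation (7) is straightforward to compute once the lexicon scores \(a(O, C)\) are in hand; the sketch below uses hypothetical, illustrative score lookups:

```python
def modification_confidence(a_src, a_tgt_scores, target_polarity):
    """Eq. (7): a(O^s, C_s) * 2 a(O^t, C_t) / sum_{C_e != C_t} a(O^t, C_e).
    a_src: lexicon score of the source opinion toward the gold polarity;
    a_tgt_scores: {"pos": ..., "neu": ..., "neg": ...} for the replacement."""
    others = sum(v for c, v in a_tgt_scores.items() if c != target_polarity)
    return a_src * 2 * a_tgt_scores[target_polarity] / max(others, 1e-9)

# e.g., flipping "fabulous" (positive) to "awful" (negative); scores illustrative
p_a = modification_confidence(0.9, {"pos": 0.05, "neu": 0.10, "neg": 0.80}, "neg")
keep = p_a >= 0.2  # theta_a = 0.2, as set in Section 6.1.3
```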

There are also several special cases worth noting. For example, we always keep candidate opinion terms whose part-of-speech (POS) tags are the same as those of the source opinion terms. Specifically, for target (post-modification) opinion terms with the same POS tags as the source opinion terms in the original sentences, we believe there is a high alignment between them and the opinion positioning will be accurate, so those candidates are kept. Besides, in some cases, e.g., with neutral sentiment or opinions in dobj and xcomp syntax relations, adding negation words is the only way to modify the sentiment. For example, to change the sentiment state of the instance “I will try this restaurant next time,” the only feasible method is to add the negation word “not”: “I will not try this restaurant next time.” Also, it is likely that more than one potential opinion expression determines the aspect’s sentiment, in which case we conduct modifications combinatorially. Specifically, when multiple opinion expressions are detected, we modify all of them with their target replacements simultaneously. For each opinion expression we perform the sentiment flipping in the same way as in the single-opinion case.

4.2 Background Rewriting

To enhance the robustness of ABSA models, it is important not only to diversify the opinion changes of aspects but also to enrich the background contexts. We rewrite the non-opinion expressions in the original sentence \(X^o_i\) of the aspect terms to form new sentences \(X^n_{i,k} \in \mathbb {D}_n\).

We mainly consider the following three strategies:

(1)

Changing the opinion-less contexts, such as morphology, tense, personal pronouns, punctuation, and quantifiers. Morphology reflects the structure of words and their parts, e.g., stems, prefixes, and suffixes; transforming words into their morphological derivations can diversify the contexts, such as “heterogeneous” vs. “homogeneous.” Replacing the original tense or personal pronouns in a sentence likewise serves this purpose, and adding punctuation or changing quantifiers also leads to context modification.

(2)

Substituting neutral words with their synonyms or antonyms by looking up WordNet [44]. This is partially the same as the step in Section 4.1, but we only modify words with neutral-opinion labels, i.e., by first consulting their sentiment intensities with SentiWordNet.

(3)

Paraphrasing the original sentence via back-translation, e.g., first translating it into another language and then translating it back into the source language (see the sketch after this list). Intuitively, the background text of the raw sentence may be re-phrased after back-translation while the core semantics are not totally changed; thus we reach the goal of background expression rewriting. Note that we keep the target aspect term unchanged after the back-translation. There are three cases. (1) The opinion terms are not changed during back-translation, which is the ideal case. (2) The opinion terms are partially changed, i.e., replaced with a part of the raw phrase. For example, the phrase “French fries” may be turned into the word “fries,” but the meaning is unchanged. In this case, we replace the translated partial expression with the original opinion expression. (3) The target opinion words are totally changed after the back-translation. In this case, we first use the sentiment lexicon SentiWordNet to find the most likely target opinion words corresponding to the original opinion expression. If the likelihood is considerable, i.e., the sentiment polarity agreement between the target and the original is over 0.5, then we replace the translated expression with the original opinion expression.
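A sketch of the back-translation step, assuming the Helsinki-NLP Marian checkpoints from the Hugging Face transformers library (the text does not name a specific translation system, so this pairing is our assumption):

```python
from transformers import MarianMTModel, MarianTokenizer

def back_translate(sentences, pivot="fr"):
    """Paraphrase by an English -> pivot -> English round trip."""
    def translate(texts, model_name):
        tok = MarianTokenizer.from_pretrained(model_name)
        model = MarianMTModel.from_pretrained(model_name)
        batch = tok(texts, return_tensors="pt", padding=True)
        out = model.generate(**batch)
        return [tok.decode(t, skip_special_tokens=True) for t in out]

    pivoted = translate(sentences, f"Helsinki-NLP/opus-mt-en-{pivot}")
    return translate(pivoted, f"Helsinki-NLP/opus-mt-{pivot}-en")
```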

We maintain the validity of such modifications for the rewritten sentence with the METEOR metric [2], i.e., the rewriting confidence, (8) \(\begin{equation} \begin{aligned}p_n\left(X^n_{i,k}\right) &= \text{METEOR}\left(X^n_{i,k}\right) \,. \\ \end{aligned} \end{equation}\) METEOR measures the fluency of the sentence by taking into consideration the matching rationality at the whole-corpus level. We define a threshold \(\theta _n\) and drop low-quality modifications, i.e., those with \(p_n(X^n_{i,k})\) < \(\theta _n\).
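Equation (8) writes METEOR with the rewritten sentence only; in the sketch below we score it against the original sentence as the reference (our assumption), using NLTK's implementation, which expects pre-tokenized inputs in recent versions:

```python
from nltk.translate.meteor_score import meteor_score

def keep_rewrite(original, rewritten, theta_n=0.25):
    # p_n(X^n) as in Eq. (8); drop the rewrite if its score is too low
    p_n = meteor_score([original.split()], rewritten.split())
    return p_n >= theta_n
```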

4.3 Non-target Aspects Addition

Finally, we add non-target aspects to existing sentences to create multi-aspect coexistence cases. The construction of \(\mathbb {D}_m\) consists of three steps. First, for all aspects at the corpus level, we locate the opinion–aspect expressions with the method described in Section 4.1. We then extract the minimal text unit containing the opinion–aspect expression from different sentences. Inspired by Xing et al. [76], we extract the linguistic branch (e.g., noun/verb phrases) in a constituency structure, such as “a reasonable price.” Second, we group all aspects based on their embeddings derived from a pre-trained language model (e.g., BERT), so as to obtain the semantic relevance score between each pair of aspects, i.e., \(\phi (A,\hat{A}) \in [0,1]\).

Third, we select a certain number (top \(J\)) of non-target aspects \(\hat{A}_{i,k(j)}\) for each target aspect in descending order of their correlation degrees. We then concatenate the original sentence \(X^o_i\) of the target aspect \(A_{i,k}\) with the opinion–aspect expressions of the non-target aspects, forming a new sentence \(X^m_{i,k} \in \mathbb {D}_m\). Note that for each \(X^m_{i,k}\) we keep the expressions of the non-target aspects diversified in their sentiment polarities. Also, we can construct more than one pseudo sentence for each target aspect with different non-target aspects. To control the quality of this construction, we define an addition confidence as the average similarity score between the target and non-target aspects in a pseudo sentence: (9) \(\begin{equation} p_m\left(X^m_{i,k}\right)=\frac{1}{J}\sum _j \phi (A_{i,k},\hat{A}_{i,k(j)}) \,, \end{equation}\) keeping only those with \(p_m\gt \theta _m\) as valid constructions.
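A sketch of the grouping and confidence computation, assuming mean-pooled BERT token states as the aspect embedding (the pooling choice is our assumption; the text only specifies a PLM-derived embedding):

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def aspect_embedding(phrase):
    enc = tok(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state  # (1, len, 768)
    return hidden.mean(dim=1).squeeze(0)

def addition_confidence(target, non_targets, theta_m=0.85):
    # Eq. (9): mean cosine similarity phi(A, A_hat) over the J added aspects
    t = aspect_embedding(target)
    sims = [torch.cosine_similarity(t, aspect_embedding(a), dim=0)
            for a in non_targets]
    p_m = torch.stack(sims).mean().item()
    return p_m, p_m > theta_m
```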

It is also noteworthy that the linguistic replacements or modifications (i.e., data augmentation techniques) used in this section may render the resulting sentences semantically altered or even meaningless, i.e., unnatural. For example, changing personal pronouns is more likely to alter semantics than the other methods. We mainly adopt altering methods that are also commonly used for other NLP tasks: changing morphology, tense, personal pronouns, punctuation, and quantifiers. Thus, in practice, to best avoid generating semantically nonsensical sentences, we change personal pronouns very carefully. For instance, we mainly perform pronoun changes for easy sentences with very simple and few pronouns; for compound sentences or sentences containing many pronouns, we only consider changes between the third-person pronouns “he” and “she,” or do not make any change.


5 TOWARD ROBUSTNESS TRAINING

Training with Cross-entropy Objective. Generally, ABSA frameworks can be directly optimized toward the gold target \(\hat{y}^C\) with a cross-entropy objective, based on the original training data \(\mathbb {D}_o\): (10) \(\begin{equation} \mathcal {L}_e(\mathbb {D}_o) = - \frac{1}{||\mathbb {D}_o||} \sum ^{||\mathbb {D}_o||}_i \hat{y}^C_i \log y^C_i, \end{equation}\) where \(||\mathbb {D}_o||\) is the size of the training set. Further, based on the enriched training data, i.e., the original set (\(\mathbb {D}_o\)) + the robust synthetic corpus (\(\mathbb {D}_s\) as in Section 4), an ABSA model achieves much better robustness with \(\mathcal {L}_e(\mathbb {D}_o+\mathbb {D}_s)\).
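A minimal training-loop sketch of \(\mathcal{L}_e(\mathbb{D}_o+\mathbb{D}_s)\); here `model`, `optimizer`, and the datasets `D_o`/`D_s` are placeholders for the components described above:

```python
import torch.nn.functional as F
from torch.utils.data import ConcatDataset, DataLoader

loader = DataLoader(ConcatDataset([D_o, D_s]), batch_size=16, shuffle=True)
for batch in loader:
    logits = model(batch["sentence"], batch["aspect"])   # p(y | X, A)
    loss = F.cross_entropy(logits, batch["label"])       # Eq. (10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```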

5.1 Adversarial Training

As mentioned earlier, ABSA models under cross-entropy training show low sensitivity to environment changes, leading to weak robustness. For higher robustness, the resistance to context perturbations (e.g., opinion flips, background rewriting, and multi-aspect coexistence) should be enhanced. We thus devise an adversarial training procedure based on the above three kinds of synthetic training data. As illustrated in Figure 6, in the adversarial framework, two individual neural models (as in Section 3), \(\Omega ^{o}\) and \(\Omega ^{s}\), (1) take as input the raw sentences in \(\mathbb {D}_o\) and the synthetic sentences in \(\mathbb {D}_s\), respectively; (2) produce middle-layer representations \(\mathbf {r}^{adv,o}\) and \(\mathbf {r}^{adv,s}\), respectively; and (3) finally make their own predictions \(y^{C,o}_i\) and \(y^{C,s}_i\). Here \(\mathbf {r}^{adv}=[\mathbf {r}^{CLS};\mathbf {r}^a;\mathbf {r}^s]\), where \(\mathbf {r}^s\) is the pooled representation of \(\lbrace \mathbf {r}^L_1,\ldots ,\mathbf {r}^L_n\rbrace\) from USGCN.


Fig. 6. The adversarial training framework.

The adversarial training is conducted intermittently with the regular training of \(\Omega ^{o}\) and \(\Omega ^{s}\). Specifically, a matcher first calculates the relatedness between \(\mathbf {r}^{adv,o}\) and \(\mathbf {r}^{adv,s}\), (11) \(\begin{equation} \mathbf {v} = [ \mathbf {r}^{adv,o} ; \mathbf {r}^{adv,s} ; \mathbf {r}^{adv,o} - \mathbf {r}^{adv,s} ; \mathbf {r}^{adv,o} \odot \mathbf {r}^{adv,s} ] \,, \end{equation}\) where the resulting representation \(\mathbf {v}\) is then passed into the type discriminator \(\mathcal {D}\) to distinguish the type of the synthetic input to \(\Omega ^{s}\). We define three outputs of \(\mathcal {D}\), \(y^V_a, y^V_n, y^V_m\), for \(\mathbb {D}_a, \mathbb {D}_n, \mathbb {D}_m\), respectively. The adversarial goal is thus a min-max optimization: minimizing the cross-entropy loss of the ABSA model for the sentiment prediction \(y^{C}\) while maximizing the cross-entropy loss of the type discriminator on \(y^V\): (12) \(\begin{equation} \begin{aligned}\mathcal {L}_{a_1} &= \mathop {\min }\limits _{\Omega }\left[ \mathop {\max } \limits _{ \mathcal {D}} \bigg (\sum \hat{y}^{V} \log y^{V} \bigg)\right] \,, \\ \mathcal {L}_{a_2} &= - \sum \hat{y}^C \log y^C \,, \\ \mathcal {L}_{a}(\mathbb {D}_o+\mathbb {D}_s) &= \frac{1}{||\mathbb {D}_o+\mathbb {D}_s||} (\lambda _a \mathcal {L}_{a_1}+\mathcal {L}_{a_2}) \,, \end{aligned} \end{equation}\) where \(\lambda _a\) controls the interaction of the two learning processes.
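One standard way to realize the min-max objective in Equation (12) is a gradient-reversal layer between the matcher and the discriminator; the text does not prescribe this exact mechanism, so the sketch below is one possible implementation:

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def type_discrimination_loss(v, discriminator, y_type, lambd=0.6):
    # v: matcher output of Eq. (11). The discriminator D learns to identify
    # the synthetic-data type, while the reversed gradient pushes the two
    # encoders to make the types indistinguishable (the min-max of L_{a_1}).
    logits = discriminator(GradReverse.apply(v, lambd))
    return F.cross_entropy(logits, y_type)
```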

5.2 Contrastive Learning

To sufficiently utilize the synthetic training corpus, we further employ the contrastive learning technique to consolidate the ABSA model’s recognition of different labels. Contrastive learning has been shown effective for representation enhancement in an unsupervised manner [5, 24, 62, 63]. It encourages narrowing the distance between the embedding representations of examples with similar targets and meanwhile widening the distance between those with different targets. Here our goal is to increase the ABSA model’s awareness of the sentiment changes of target aspects caused by (1) varying opinions and (2) non-target-aspect interference. Accordingly, we design two types of contrastive objectives, intra-aspect and inter-aspect, where the former reinforces the differentiation between homogeneous and contrary opinions for an aspect, and the latter accounts for distinguishing alterations of target vs. non-target aspects.

In intra-aspect contrastive learning, for each raw sample \(X_i^o \in \mathbb {D}_o\) we construct positive pairs \(\lt\)\(X_{i,j}^{a,+},X_i^o\)\(\gt\), where \(X_{i,j}^{a,+} \in \mathbb {D}_a\) is a pseudo instance with the same polarity label as \(X_i^o\), and negative pairs \(\lt\)\(X_{i,k}^{a,-},X_i^o\)\(\gt\), where \(X_{i,k}^{a,-} \in \mathbb {D}_a\) comes from a different polarity label. We encourage the system to learn a nearer distance between positive pairs (Attract) while enlarging the distance between negative pairs (Repel), (13) \(\begin{equation} \begin{aligned} \mathcal {L}_c^{ita} =& - \sum _j \log \frac{\exp [ \text{Sim}(\mathbf {r}(X_{i,j}^{a,+}),\mathbf {r}(X_i^o)) / \mu ]}{\sum _k \exp [\text{Sim}(\mathbf {r}(X_{i,k}^{a,-}),\mathbf {r}(X_i^o)) / \mu ]} \,, \\ &\text{Sim}(\mathbf {r}(a),\mathbf {r}(b))= \frac{\mathbf {r}(a)^T \mathbf {r}(b)}{||\mathbf {r}(a)|| \cdot ||\mathbf {r}(b)||} \,, \end{aligned} \end{equation}\) where Sim(\(\cdot\)) is the cosine similarity measurement and \(\mu\) is a temperature factor.

Likewise, in inter-aspect contrastive learning, for each raw sample we use the same positive pairs \(\lt\)\(X_{i,j}^{a,+},X_i^o\)\(\gt\) as in \(\mathcal {L}_c^{ita}\), representing inner-aspect changes, and construct negative pairs \(\lt\)\(X_{i,k}^{m},X_i^o\)\(\gt\), where \(X_{i,k}^{m} \in \mathbb {D}_m\) represents outer-aspect changes, (14) \(\begin{equation} \mathcal {L}_c^{itr} = - \sum _j \log \frac{\exp [ \text{Sim}(\mathbf {r}(X_{i,j}^{a,+}),\mathbf {r}(X_i^o)) / \mu ]}{\sum _k \exp [\text{Sim}(\mathbf {r}(X_{i,k}^{m}),\mathbf {r}(X_i^o)) / \mu ]} \,. \end{equation}\)
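A sketch of the contrastive term shared by Equations (13) and (14), with cosine similarity and temperature \(\mu\) (the value 0.1 here is illustrative); the positive/negative feature batches are assumed to be precomputed:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r_anchor, r_pos, r_neg, mu=0.1):
    # r_anchor: (d,) raw-sample features; r_pos: (P, d); r_neg: (K, d)
    sim_pos = F.cosine_similarity(r_pos, r_anchor.unsqueeze(0), dim=-1) / mu
    sim_neg = F.cosine_similarity(r_neg, r_anchor.unsqueeze(0), dim=-1) / mu
    # -sum_j log( exp(sim_pos_j) / sum_k exp(sim_neg_k) ), as in Eqs. (13)-(14)
    return -(sim_pos - torch.logsumexp(sim_neg, dim=0)).sum()
```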

In Equations (13) and (14), \(\mathbf {r}(X)\) is the feature representation \(\mathbf {r}^f\) of the input \(X\), which summarizes the opinion toward the target aspect, i.e., it is opinion-guided. To further strengthen the learning effect, we propose a structure-guided contrastive method. Technically, we instead use the pooled syntax representation from USGCN, i.e., \(\mathbf {r}^s=Pool(\lbrace \mathbf {r}^L_1,\ldots ,\mathbf {r}^L_n\rbrace\)), which directly reflects the structural skeleton of the overall sentence. Therefore, we have in total four contrastive learning schemes: \(\mathcal {L}_c^{ita\#o}\), \(\mathcal {L}_c^{ita\#s}\), \(\mathcal {L}_c^{itr\#o}\), \(\mathcal {L}_c^{itr\#s}\), as illustrated in Figure 7. We summarize the overall loss as (15) \(\begin{equation} \mathcal {L}_c(\mathbb {D}_o+\mathbb {D}_s) = \frac{1}{||\mathbb {D}_o+\mathbb {D}_s||} \left(\lambda _{c1} \mathcal {L}_c^{ita\#o} + \lambda _{c2} \mathcal {L}_c^{ita\#s} + \lambda _{c3} \mathcal {L}_c^{itr\#o} + \lambda _{c4} \mathcal {L}_c^{itr\#s} \right) \,. \end{equation}\)

Jointly with Supervised Training. The unsupervised contrastive learning loss (\(\mathcal {L}_c\) in Equation (15)) can be combined with (1) the cross-entropy objective (\(\mathcal {L}_e\) in Equation (10)) or (2) the adversarial training objective (\(\mathcal {L}_a\) in Equation (12)), i.e., \(\mathcal {L}_{e+c}\) or \(\mathcal {L}_{a+c}\).


Fig. 7. Different schemes of the contrastive representation learning. The opinion-guided contrastive learning happens at aggregation layer ( \(\blacksquare\) ), while the structure-guided contrastive learning happens at syntax fusion layer ( \(\blacksquare\) ).


6 EXPERIMENT

6.1 Setups

6.1.1 Data and Resources.

Our experiments are based on the SemEval 2014 data [53], which includes two subsets in two domains, Restaurant and Laptop. The vanilla SemEval data evaluates the in-house performance of ABSA models. For the robustness evaluation, we consider two datasets, MAMS [35] and ARTS [76], which are described in Section 2. Each dataset provides its own training/development/testing sets. Table 2 details the statistics of the datasets. Note that we cannot provide the statistics of our constructed pseudo data (Section 4), because the data induction is a dynamic process, i.e., we control the data quality by changing the thresholds, during which the data quantity varies. Besides, we obtain the syntax annotation of each sentence with a biaffine dependency parser [13], which is trained on the Penn Treebank corpus and achieves an overall 93.4% LAS on its test set.

Table 2. Statistics of Datasets

| Split | Dataset | Domain | Sentence | Positive | Neutral | Negative |
|---|---|---|---|---|---|---|
| Training | SemEval | Res | 1,895 | 2,164 | 633 | 805 |
| Training | SemEval | Lap | 1,365 | 987 | 460 | 866 |
| Training | MAMS | Res | 4,297 | 3,380 | 5,042 | 2,764 |
| Developing | SemEval | Res | 84 | 70 | 54 | 26 |
| Developing | SemEval | Lap | 98 | 57 | 27 | 66 |
| Developing | MAMS | Res | 500 | 403 | 604 | 325 |
| Testing | SemEval | Res | 600 | 728 | 196 | 196 |
| Testing | SemEval | Lap | 411 | 341 | 169 | 128 |
| Testing | MAMS | Res | 500 | 400 | 607 | 329 |
| Testing | ARTS | Res | 492 | 1,953 | 473 | 1,104 |
| Testing | ARTS | Lap | 331 | 883 | 407 | 587 |

6.1.2 Comparing Methods.

To make comprehensive comparisons across different neural network architectures, we consider various types of existing ABSA systems as baselines.

  • \(\blacktriangleright\) LSTM-based model. (1) TD-LSTM. Tang et al. [59] use two separate LSTMs to encode the forward and backward contexts of the target aspect (inclusive) and concatenate the last hidden states of the two LSTMs for the sentiment classification.

  • \(\blacktriangleright\) Convolutional-based model. (1) GCAE. Xue et al. [79] propose a CNN-based model with gating mechanisms for selectively learning the sentiment features while staying computationally efficient with convolutions.

  • \(\blacktriangleright\) Attention-based models. (1) MemNet. Tang et al. [60] use a memory network to cache the sentential representations into external memory and then calculate the attention with the target aspect. (2) AttLSTM. Wang et al. [72] equip the LSTM model with an attention mechanism and concatenate the aspect and word embeddings of each token for the final prediction. (3) AOA. Huang et al. [32] introduce an attention-over-attention network to jointly and explicitly capture the interaction between aspects and context sentences.

  • \(\blacktriangleright\) Capsule network. (1) CapNet. Jiang et al. [35] employ a capsule network to encode the sentence as well as the aspect term so as to learn the encapsulated features of each sentiment polarity and then take the routing algorithm to predict the polarity.

  • \(\blacktriangleright\) Syntax-based models. (1) ASGCN. Zhang et al. [81], as the first effort, utilize an aspect-specific GCN to encode the syntactic structure of the input sentence and then impose an aspect-specific masking layer on top to make the prediction. (2) TD-GAT. Huang et al. [31] propose a multi-layer target-dependent graph attention network to explicitly encode the dependency tree information for better modeling the syntactic context of the target aspect. (3) RGAT. Wang et al. [68] transform the original dependency tree into an aspect-oriented structure rooted at the target aspect, so as to prune the tree information for better sentiment prediction. (4) RGCN. Veyseh et al. [54] regulate the GCN-based representation vectors based on the dependency trees to benefit from the overall contextual importance scores of the words.

Also, we explore the differences when additionally using PLM representations, i.e., BERT, including PT+BERT [77], TD-LSTM+BERT, CapNet+BERT, ASGCN+BERT, and RGAT+BERT.

6.1.3 Implementations and Evaluations.

We use pre-trained 300-dimensional (300D) GloVe embeddings [49]. The Transformer encoder has 768D hidden states in four layers. USGCN has 300D hidden states in three layers (\(L=3\)). The syntax label embedding is 100D. We use mini-batches of size 16, training for 10k iterations with early stopping. We adopt the Adam optimizer with an initial learning rate of 1e-4 and an \(\ell _2\) weight decay of 5e-5. We apply a dropout ratio of 0.3 for word embeddings and 0.1 for all other feature embeddings. The thresholds \(\theta _a, \theta _n\), and \(\theta _m\) are set to 0.2, 0.25, and 0.85, respectively. Based on preliminary experiments, \(\lambda _a=0.6\), \(\lambda _{c1}=\lambda _{c3}=0.3\), and \(\lambda _{c2}=\lambda _{c4}=0.2\). Following prior works, we use accuracy to evaluate performance. Each result of our model is averaged over 10 runs, and all reported scores are statistically significant under a paired \(t\)-test. We fine-tune the hyper-parameters for all models on the validation set. All experiments are conducted on an NVIDIA GeForce RTX 3090 Ti GPU with 24 GB of graphics memory.

6.2 Main Results

We consider two types of evaluations, i.e., training on the SemEval data and on the MAMS data, where the former uses less challenging data and the latter uses challenge-aware data. Under both setups, we evaluate the effectiveness of our model, corpus, and training strategies by comparing against the baselines. In the first setup, ABSA models are trained on SemEval data and evaluated on the different testing sets (i.e., SemEval, ARTS, and MAMS). In the second setup, we train ABSA models on MAMS and then perform testing.

6.2.1 Training Based on SemEval Data.

Table 3 shows the main performance on each test set. In addition to training on the SemEval data (denoted \(\mathbb {D}_o\)), we also consider training with the additional synthetic data (i.e., \(\mathbb {D}_o+\mathbb {D}_s\)). We make multiple observations. The first is that all ABSA models (even the state-of-the-art ones) trained on SemEval data drop significantly when tested on the challenging data (ARTS and MAMS). This reveals the imperative to enhance ABSA robustness.

Table 3. Testing Results (Accuracy) of ABSA Systems on Each Test Set, Where the Models Are Trained on Raw SemEval Data (\(\mathbb {D}_o\)) and the Hybrid Data with Synthetic Data (+\(\mathbb {D}_s\)), Respectively. Each cell shows \(\mathbb {D}_o\) / +\(\mathbb {D}_s\).

| Model (w/o BERT) | SemEval-Restaurant | SemEval-Laptop | ARTS-Restaurant | ARTS-Laptop | MAMS-Restaurant |
|---|---|---|---|---|---|
| MemNet | 75.18 / 78.55 (+3.37)\(^\dagger\) | 64.42 / 73.15 (+8.73)\(^\dagger\) | 33.34\(^\dagger\) / 40.12 (+6.78)\(^\dagger\) | 32.34\(^\dagger\) / 43.50 (+11.16)\(^\dagger\) | 39.85\(^\dagger\) / 48.20 (+8.35)\(^\dagger\) |
| AttLSTM | 75.98 / 77.14 (+1.16)\(^\dagger\) | 67.55 / 72.54 (+4.99)\(^\dagger\) | 26.52\(^\dagger\) / 33.38 (+6.86)\(^\dagger\) | 31.87\(^\dagger\) / 39.21 (+7.34)\(^\dagger\) | 30.21\(^\dagger\) / 42.54 (+12.33)\(^\dagger\) |
| TD-LSTM | 78.12 / 78.92 (+0.80)\(^\dagger\) | 68.03 / 73.68 (+5.65)\(^\dagger\) | 35.62\(^\dagger\) / 43.85 (+8.23)\(^\dagger\) | 41.57\(^\dagger\) / 52.52 (+10.95)\(^\dagger\) | 34.42\(^\dagger\) / 44.92 (+10.50)\(^\dagger\) |
| AOA | 79.32 / 80.15 (+0.83)\(^\dagger\) | 72.60 / 74.50 (+1.90)\(^\dagger\) | 30.02\(^\dagger\) / 45.52 (+15.50)\(^\dagger\) | 40.35\(^\dagger\) / 49.48 (+9.13)\(^\dagger\) | 32.36\(^\dagger\) / 47.51 (+15.15)\(^\dagger\) |
| GCAE | 79.53 / 80.23 (+0.70)\(^\dagger\) | 73.15 / 74.82 (+1.67)\(^\dagger\) | 36.58\(^\dagger\) / 48.31 (+11.73)\(^\dagger\) | 35.66\(^\dagger\) / 50.68 (+15.02)\(^\dagger\) | 40.25\(^\dagger\) / 50.89 (+10.64)\(^\dagger\) |
| CapNet | 80.16 / 80.58 (+0.42)\(^\dagger\) | 73.54 / 75.21 (+1.67)\(^\dagger\) | 38.89\(^\dagger\) / 44.65 (+5.76)\(^\dagger\) | 45.32\(^\dagger\) / 54.51 (+9.19)\(^\dagger\) | 38.16\(^\dagger\) / 50.52 (+12.36)\(^\dagger\) |
| ASGCN | 80.86 / 81.39 (+0.53)\(^\dagger\) | 74.61 / 75.98 (+1.37)\(^\dagger\) | 44.20\(^\dagger\) / 52.47 (+8.27)\(^\dagger\) | 59.24\(^\dagger\) / 66.77 (+7.53)\(^\dagger\) | 45.25\(^\dagger\) / 52.02 (+6.77)\(^\dagger\) |
| TD-GAT | 81.20 / 82.07 (+0.87)\(^\dagger\) | 74.00 / 75.34 (+1.34)\(^\dagger\) | 40.32\(^\dagger\) / 49.15 (+8.83)\(^\dagger\) | 53.38\(^\dagger\) / 60.85 (+7.47)\(^\dagger\) | 43.10\(^\dagger\) / 52.77 (+9.67)\(^\dagger\) |
| RGAT | 82.12 / 82.65 (+0.53)\(^\dagger\) | 75.20 / 75.72 (+0.52)\(^\dagger\) | 41.73\(^\dagger\) / 51.58 (+9.85)\(^\dagger\) | 54.91\(^\dagger\) / 62.34 (+7.43)\(^\dagger\) | 41.89\(^\dagger\) / 51.50 (+9.61)\(^\dagger\) |
| Ours (\(\mathcal {L}_{e}\)) | 82.85\(^\ddagger\) / 83.13 (+0.28)\(^\ddagger\) | 76.22\(^\ddagger\) / 76.85 (+0.63)\(^\ddagger\) | 46.57\(^\ddagger\) / 55.58 (+9.01)\(^\ddagger\) | 61.33\(^\ddagger\) / 69.12 (+7.79)\(^\ddagger\) | 47.25\(^\ddagger\) / 55.34 (+8.09)\(^\ddagger\) |
| Ours (\(\mathcal {L}_{a}\)) | – / 83.52\(^\ddagger\) | – / 77.12\(^\ddagger\) | – / 58.61\(^\ddagger\) | – / 70.53\(^\ddagger\) | – / 56.12\(^\ddagger\) |
| Ours (\(\mathcal {L}_{e+c}\)) | – / 83.98\(^\ddagger\) | – / 77.07\(^\ddagger\) | – / 58.20\(^\ddagger\) | – / 70.68\(^\ddagger\) | – / 56.53\(^\ddagger\) |
| Ours (\(\mathcal {L}_{a+c}\)) | – / 84.45\(^\ddagger\) | – / 77.53\(^\ddagger\) | – / 60.39\(^\ddagger\) | – / 71.21\(^\ddagger\) | – / 57.02\(^\ddagger\) |
| Avg. | 79.53 / 80.48 (+0.95) | 71.93 / 74.78 (+2.85) | 37.38 / 46.46 (+9.08) | 45.60 / 54.90 (+9.30) | 39.27 / 49.62 (+10.35) |

| Model (w/ BERT) | SemEval-Restaurant | SemEval-Laptop | ARTS-Restaurant | ARTS-Laptop | MAMS-Restaurant |
|---|---|---|---|---|---|
| BERT | 83.04\(^\dagger\) / 84.66 (+1.62)\(^\dagger\) | 77.59\(^\dagger\) / 78.69 (+1.10)\(^\dagger\) | 66.23\(^\dagger\) / 75.35 (+9.12)\(^\dagger\) | 62.42\(^\dagger\) / 69.55 (+7.13)\(^\dagger\) | 51.32\(^\dagger\) / 56.85 (+5.53)\(^\dagger\) |
| TD-LSTM+BERT | 84.51\(^\dagger\) / 85.28 (+0.77)\(^\dagger\) | 77.98\(^\dagger\) / 78.86 (+0.88)\(^\dagger\) | 68.45\(^\dagger\) / 75.56 (+7.11)\(^\dagger\) | 63.26\(^\dagger\) / 69.63 (+6.37)\(^\dagger\) | 50.67\(^\dagger\) / 57.12 (+6.45)\(^\dagger\) |
| CapNet+BERT | 85.48\(^\dagger\) / 86.04 (+0.56)\(^\dagger\) | 77.12\(^\dagger\) / 79.30 (+2.18)\(^\dagger\) | 69.36\(^\dagger\) / 77.48 (+8.12)\(^\dagger\) | 64.01\(^\dagger\) / 70.21 (+6.20)\(^\dagger\) | 52.23\(^\dagger\) / 57.14 (+4.91)\(^\dagger\) |
| PT+BERT | 86.40\(^\dagger\) / 86.75 (+0.35)\(^\dagger\) | 78.06\(^\dagger\) / 79.12 (+1.06)\(^\dagger\) | 71.41\(^\dagger\) / 77.59 (+6.18)\(^\dagger\) | 65.23\(^\dagger\) / 72.02 (+6.79)\(^\dagger\) | 54.16\(^\dagger\) / 58.69 (+4.53)\(^\dagger\) |
| ASGCN+BERT | 86.82\(^\dagger\) / 87.24 (+0.42)\(^\dagger\) | 78.53\(^\dagger\) / 79.53 (+1.00)\(^\dagger\) | 73.48\(^\dagger\) / 78.18 (+4.70)\(^\dagger\) | 67.63\(^\dagger\) / 72.85 (+5.22)\(^\dagger\) | 55.42\(^\dagger\) / 59.48 (+4.06)\(^\dagger\) |
| RGAT+BERT | 86.60\(^\dagger\) / 87.03 (+0.43)\(^\dagger\) | 78.20\(^\dagger\) / 79.38 (+1.18)\(^\dagger\) | 72.83\(^\dagger\) / 78.25 (+5.42)\(^\dagger\) | 67.28\(^\dagger\) / 71.35 (+4.07)\(^\dagger\) | 55.84\(^\dagger\) / 60.52 (+4.68)\(^\dagger\) |
| Ours+BERT (\(\mathcal {L}_{e}\)) | 87.05\(^\ddagger\) / 87.15 (+0.10)\(^\ddagger\) | 79.61\(^\ddagger\) / 80.28 (+0.67)\(^\ddagger\) | 75.01\(^\ddagger\) / 80.65 (+5.64)\(^\ddagger\) | 68.78\(^\ddagger\) / 73.89 (+5.11)\(^\ddagger\) | 57.03\(^\ddagger\) / 62.37 (+5.34)\(^\ddagger\) |
| Ours+BERT (\(\mathcal {L}_{a}\)) | – / 87.53\(^\ddagger\) | – / 80.85\(^\ddagger\) | – / 81.95\(^\ddagger\) | – / 74.52\(^\ddagger\) | – / 63.07\(^\ddagger\) |
| Ours+BERT (\(\mathcal {L}_{e+c}\)) | – / 87.49\(^\ddagger\) | – / 80.34\(^\ddagger\) | – / 81.42\(^\ddagger\) | – / 74.36\(^\ddagger\) | – / 63.24\(^\ddagger\) |
| Ours+BERT (\(\mathcal {L}_{a+c}\)) | – / 87.87\(^\ddagger\) | – / 81.26\(^\ddagger\) | – / 82.38\(^\ddagger\) | – / 75.65\(^\ddagger\) | – / 63.58\(^\ddagger\) |
| Avg. | 85.70 / 86.31 (+0.61) | 78.16 / 79.31 (+1.15) | 70.97 / 77.58 (+6.61) | 65.52 / 71.36 (+5.84) | 53.81 / 58.88 (+5.07) |

Note: In the brackets are the improvements from using the additional synthetic training data. \(\dagger\) indicates a significance test with \(p\) \(\le\) 0.05, and \(\ddagger\) indicates \(p\) \(\le\) 0.03. The underlined scores in the original table are the best results using common cross-entropy training, and the bold scores are the best results using the advanced training strategies.

The second observation concerns the ABSA models. Baselines with different kinds of neural architectures show different generalization capabilities. For example, the syntax-aware models not only give stronger performances on the in-house tests than other model types but also consistently preserve better robustness. This confirms the prior finding in Reference [76] that explicitly modeling the aspect-position information (as syntax-aware models do) leads to superior robustness. More significantly, our proposed syntax-aware neural system shows the best performances in both in-house and out-of-house tests, i.e., stronger generalization ability. At the same time, we find that the attention-based models give much lower robustness performances, while with pre-trained BERT representations the drops of ABSA models on the challenging test data are markedly reduced, i.e., PLMs can help enhance ABSA robustness.

Furthermore, when additionally trained with the synthetic data, all the ABSA models outperform their counterparts (improvements marked in brackets) on both in-house and out-of-house tests across all test sets. In particular, the robustness performances on the out-of-house test data are substantially enhanced, and these boosts are more pronounced when the BERT PLM is not used. This reveals the significance of enriching the training data with additional challenging signals for robust ABSA, and again corroborates the help of PLMs for improving ABSA robustness [35, 76].

Last, with the training paradigms built on our pseudo corpus, our system receives further consistent enhancements on all the test sets. Specifically, we consider different combinations of training mechanisms, i.e., \(\mathcal {L}_e\), \(\mathcal {L}_a\), \(\mathcal {L}_{e+c}\), and \(\mathcal {L}_{a+c}\) (a sketch of the composed objective follows). Both adversarial training (\(\mathcal {L}_a\)) and contrastive learning (\(\mathcal {L}_{e+c}\)) yield better performances than the basic cross-entropy training (\(\mathcal {L}_e\)), while integrating the two training strategies (\(\mathcal {L}_{a+c}\)) gives our model the best effects. Also, we see from Table 4 that all comparing baselines achieve consistent robustness improvements when the advanced training strategies (\(\mathcal {L}_{e+c}\) and \(\mathcal {L}_{a+c}\)) are equipped with the pseudo data.
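
To make the composition of these objectives concrete, the following is a minimal PyTorch sketch of how a hybrid loss of the \(\mathcal {L}_{a+c}\) kind could be assembled. The batch structure, the model signature, the weighting coefficients, and the InfoNCE-style contrastive form are our illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(model, batch, adv_batch, pos_pairs, neg_feats,
                lambda_adv=1.0, lambda_con=0.1, tau=0.1):
    """Sketch of a hybrid objective L_{a+c}: cross-entropy on clean data,
    an adversarial term on perturbed pseudo samples, and a contrastive term.
    All argument names and weights here are illustrative assumptions."""
    # Standard cross-entropy term (L_e) on the original instances.
    logits = model(batch["input"], batch["aspect"])
    loss_ce = F.cross_entropy(logits, batch["label"])

    # Adversarial term (L_a): the same objective on perturbed pseudo data
    # (opinion flips, background rewrites, added non-target aspects).
    adv_logits = model(adv_batch["input"], adv_batch["aspect"])
    loss_adv = F.cross_entropy(adv_logits, adv_batch["label"])

    # Contrastive term (L_c), InfoNCE style: pull together representations
    # of same-sentiment pairs, push apart contrastive (label-flipped) ones.
    z_anchor = F.normalize(pos_pairs[0], dim=-1)      # [B, d]
    z_pos = F.normalize(pos_pairs[1], dim=-1)         # [B, d]
    z_neg = F.normalize(neg_feats, dim=-1)            # [B, K, d]
    pos_sim = (z_anchor * z_pos).sum(-1, keepdim=True) / tau      # [B, 1]
    neg_sim = torch.einsum("bd,bkd->bk", z_anchor, z_neg) / tau   # [B, K]
    logits_con = torch.cat([pos_sim, neg_sim], dim=-1)
    target = torch.zeros(z_anchor.size(0), dtype=torch.long,
                         device=z_anchor.device)      # positive is index 0
    loss_con = F.cross_entropy(logits_con, target)

    return loss_ce + lambda_adv * loss_adv + lambda_con * loss_con
```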

Table 4. Training with Advanced Strategies

Model | SemEval-Res | SemEval-Lap | ARTS-Res | ARTS-Lap | MAMS-Res
\(\bullet\) Training with \(\mathcal {L}_{e+c}\):
TD-LSTM | 79.65\(^\dagger\) | 74.88\(^\dagger\) | 46.85\(^\dagger\) | 54.32\(^\dagger\) | 48.55\(^\dagger\)
GCAE | 81.42\(^\dagger\) | 75.41\(^\dagger\) | 50.47\(^\dagger\) | 52.40\(^\dagger\) | 52.02\(^\dagger\)
CapNet | 81.69\(^\dagger\) | 76.10\(^\dagger\) | 47.78\(^\dagger\) | 56.85\(^\dagger\) | 52.63\(^\dagger\)
ASGCN | 82.02\(^\dagger\) | 76.49\(^\dagger\) | 55.46\(^\dagger\) | 67.85\(^\dagger\) | 54.22\(^\dagger\)
RGAT | 83.02\(^\dagger\) | 76.28\(^\dagger\) | 53.91\(^\dagger\) | 63.47\(^\dagger\) | 53.28\(^\dagger\)
Ours | 83.98\(^\ddagger\) | 77.07\(^\ddagger\) | 58.20\(^\ddagger\) | 70.68\(^\ddagger\) | 56.53\(^\ddagger\)
Ours (w/o s.l.) | 82.16\(^\ddagger\) | 76.50\(^\ddagger\) | 54.56\(^\ddagger\) | 69.56\(^\ddagger\) | 54.70\(^\ddagger\)
Ours (w/o a.) | 83.65\(^\ddagger\) | 76.95\(^\ddagger\) | 57.92\(^\ddagger\) | 70.44\(^\ddagger\) | 55.00\(^\ddagger\)
Ours (w/o Trm) | 83.31\(^\ddagger\) | 76.88\(^\ddagger\) | 57.70\(^\ddagger\) | 70.32\(^\ddagger\) | 55.13\(^\ddagger\)
Avg. | 82.32 | 76.28 | 53.65 | 63.99 | 53.56
\(\bullet\) Training with \(\mathcal {L}_{a+c}\):
TD-LSTM | 80.67\(^\dagger\) | 75.23\(^\dagger\) | 48.61\(^\dagger\) | 55.11\(^\dagger\) | 49.34\(^\dagger\)
GCAE | 81.83\(^\dagger\) | 75.83\(^\dagger\) | 52.42\(^\dagger\) | 53.22\(^\dagger\) | 53.83\(^\dagger\)
CapNet | 81.96\(^\dagger\) | 76.76\(^\dagger\) | 50.03\(^\dagger\) | 58.23\(^\dagger\) | 52.95\(^\dagger\)
ASGCN | 82.35\(^\dagger\) | 76.89\(^\dagger\) | 56.74\(^\dagger\) | 68.50\(^\dagger\) | 54.87\(^\dagger\)
RGAT | 83.64\(^\dagger\) | 76.79\(^\dagger\) | 55.62\(^\dagger\) | 64.72\(^\dagger\) | 53.94\(^\dagger\)
Ours | 84.45\(^\ddagger\) | 77.53\(^\ddagger\) | 60.39\(^\ddagger\) | 71.21\(^\ddagger\) | 57.02\(^\ddagger\)
Ours (w/o s.l.) | 82.58\(^\ddagger\) | 76.95\(^\ddagger\) | 56.82\(^\ddagger\) | 70.14\(^\ddagger\) | 55.06\(^\ddagger\)
Ours (w/o a.) | 83.90\(^\ddagger\) | 77.32\(^\ddagger\) | 59.35\(^\ddagger\) | 71.02\(^\ddagger\) | 56.79\(^\ddagger\)
Ours (w/o Trm) | 83.72\(^\ddagger\) | 77.06\(^\ddagger\) | 58.45\(^\ddagger\) | 70.75\(^\ddagger\) | 56.21\(^\ddagger\)
Avg. | 82.79 | 76.71 | 55.38 | 64.77 | 54.45

“w/o s.l.” and “w/o a.”: removing the syntax label (\(\mathbf {x}^e_{i,j}\)) and the aspect embedding (\(\mathbf {r}^{asp}\)) from USGCN (Equation (5)), respectively. “w/o Trm”: replacing the Transformer encoder with a BiLSTM.

Table 4 also shows the ablation results of our proposed model. Removing the aspect from the unified modeling with syntax in USGCN leads to inferior accuracies. Without encoding the dependency syntax knowledge, our USGCN encoder suffers significant performance drops, which reflects the importance of modeling the universal syntax for ABSA. Further, without the Transformer encoder, we also witness degraded performances. Nevertheless, each ablated variant still outperforms the best baseline, ASGCN, which only encodes the dependency edge information.

6.2.2 Fine-grained Robustness Testing.

In the ARTS challenging test set, there are three subsets (REVTGT, REVNON, and ADDDIFF), each of which evaluates ABSA robustness from a different angle. REVTGT measures whether a model can correctly bind a target aspect to its critical opinion clues, REVNON detects the sensitivity of a model to sentiment changes of non-target aspects, and ADDDIFF tests whether a model is robust to the presence of non-target aspects; a toy example of each perturbation type is given below. In Table 5, we show the specific performances w.r.t. each ARTS subset (Restaurant). To further evaluate the robustness to changes of trivial background contexts, we additionally build a test set14 RWTBG.
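
For intuition, below is a toy illustration (our own constructed example, not drawn from ARTS) of what each perturbation type looks like for the target aspect "service"; the gold label of the target aspect is given alongside each variant.

```python
original = "The service is great, and the pizza is delicious."  # target: "service", positive

perturbed = {
    # REVTGT: reverse the opinion on the target aspect itself; the label flips.
    "REVTGT":  ("The service is terrible, and the pizza is delicious.", "negative"),
    # REVNON: reverse only the non-target aspect; the target label must not change.
    "REVNON":  ("The service is great, but the pizza is awful.", "positive"),
    # ADDDIFF: add extra non-target aspects with different sentiments.
    "ADDDIFF": ("The service is great, and the pizza is delicious, "
                "though the price is high and the wait was long.", "positive"),
    # RWTBG (our additional set): rewrite only the trivial background context.
    "RWTBG":   ("We stopped by last Sunday; the service is great, "
                "and the pizza is delicious.", "positive"),
}
```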

Table 5. Fine-grained Robustness Testing Performances on Each Subset of ARTS Data

Model | REVTGT \(\mathbb {D}_o\) | REVTGT +\(\mathbb {D}_s\) | REVNON \(\mathbb {D}_o\) | REVNON +\(\mathbb {D}_s\) | ADDDIFF \(\mathbb {D}_o\) | ADDDIFF +\(\mathbb {D}_s\) | RWTBG \(\mathbb {D}_o\) | RWTBG +\(\mathbb {D}_s\)
\(\bullet\) w/o BERT:
MemNet | 27.54\(^\dagger\) | 80.73 (+53.19)\(^\dagger\) | 73.65\(^\dagger\) | 84.46 (+10.81)\(^\dagger\) | 60.71\(^\dagger\) | 75.18 (+14.47)\(^\dagger\) | 77.50\(^\dagger\) | 80.33 (+2.83)\(^\dagger\)
AttLSTM | 28.98\(^\dagger\) | 82.98 (+54.00)\(^\dagger\) | 61.26\(^\dagger\) | 77.26 (+16.00)\(^\dagger\) | 52.32\(^\dagger\) | 75.98 (+23.66)\(^\dagger\) | 69.64\(^\dagger\) | 84.44 (+14.80)\(^\dagger\)
AOA | 30.51\(^\dagger\) | 84.36 (+53.85)\(^\dagger\) | 73.95\(^\dagger\) | 84.13 (+10.18)\(^\dagger\) | 63.51\(^\dagger\) | 72.55 (+9.04)\(^\dagger\) | 70.54\(^\dagger\) | 82.36 (+11.82)\(^\dagger\)
GCAE | 33.02\(^\dagger\) | 85.15 (+52.13)\(^\dagger\) | 75.02\(^\dagger\) | 85.63 (+10.61)\(^\dagger\) | 63.72\(^\dagger\) | 76.45 (+12.73)\(^\dagger\) | 74.27\(^\dagger\) | 84.67 (+10.40)\(^\dagger\)
CapNet | 30.15\(^\dagger\) | 85.37 (+55.22)\(^\dagger\) | 76.36\(^\dagger\) | 84.69 (+8.33)\(^\dagger\) | 57.65\(^\dagger\) | 75.59 (+17.94)\(^\dagger\) | 78.56\(^\dagger\) | 86.85 (+8.29)\(^\dagger\)
ASGCN | 34.78\(^\dagger\) | 86.76 (+51.98)\(^\dagger\) | 79.50\(^\dagger\) | 88.51 (+9.01)\(^\dagger\) | 70.88\(^\dagger\) | 78.86 (+7.98)\(^\dagger\) | 80.63\(^\dagger\) | 90.04 (+9.41)\(^\dagger\)
RGAT | 37.05\(^\dagger\) | 87.26 (+50.21)\(^\dagger\) | 81.15\(^\dagger\) | 87.03 (+5.88)\(^\dagger\) | 67.05\(^\dagger\) | 79.48 (+12.43)\(^\dagger\) | 78.15\(^\dagger\) | 89.85 (+9.70)\(^\dagger\)
Ours (\(\mathcal {L}_{e}\)) | 40.41\(^\ddagger\) | 88.33 (+47.92)\(^\ddagger\) | 80.62\(^\ddagger\) | 90.52 (+9.90)\(^\ddagger\) | 74.66\(^\ddagger\) | 81.56 (+6.90)\(^\ddagger\) | 82.84\(^\ddagger\) | 92.54 (+9.70)\(^\ddagger\)
Ours (\(\mathcal {L}_{a}\)) | – | 89.51\(^\ddagger\) | – | 91.30\(^\ddagger\) | – | 82.69\(^\ddagger\) | – | 92.98\(^\ddagger\)
Ours (\(\mathcal {L}_{e+c}\)) | – | 89.28\(^\ddagger\) | – | 90.89\(^\ddagger\) | – | 81.98\(^\ddagger\) | – | 92.71\(^\ddagger\)
Ours (\(\mathcal {L}_{a+c}\)) | – | 90.42\(^\ddagger\) | – | 91.65\(^\ddagger\) | – | 83.13\(^\ddagger\) | – | 93.45\(^\ddagger\)
Avg. | 32.81 | 85.12 (+52.31) | 75.44 | 85.53 (+10.09) | 63.81 | 76.96 (+13.15) | 77.02 | 86.63 (+9.61)
\(\bullet\) w/ BERT:
BERT | 63.00\(^\dagger\) | 84.15 (+21.15)\(^\dagger\) | 83.33\(^\dagger\) | 86.33 (+3.00)\(^\dagger\) | 79.20\(^\dagger\) | 85.79 (+6.59)\(^\dagger\) | 81.36\(^\dagger\) | 82.20 (+0.84)\(^\dagger\)
TD-LSTM+BERT | 67.32\(^\dagger\) | 85.85 (+18.53)\(^\dagger\) | 80.68\(^\dagger\) | 88.15 (+7.47)\(^\dagger\) | 79.35\(^\dagger\) | 86.22 (+6.87)\(^\dagger\) | 80.30\(^\dagger\) | 88.41 (+8.11)\(^\dagger\)
CapNet+BERT | 71.87\(^\dagger\) | 87.74 (+15.87)\(^\dagger\) | 78.55\(^\dagger\) | 86.48 (+7.93)\(^\dagger\) | 77.86\(^\dagger\) | 85.96 (+8.10)\(^\dagger\) | 83.02\(^\dagger\) | 87.05 (+4.03)\(^\dagger\)
PT+BERT | 72.83\(^\dagger\) | 84.33 (+11.50)\(^\dagger\) | 81.76\(^\dagger\) | 88.87 (+7.11)\(^\dagger\) | 80.27\(^\dagger\) | 87.77 (+7.50)\(^\dagger\) | 82.48\(^\dagger\) | 84.68 (+2.20)\(^\dagger\)
ASGCN+BERT | 74.51\(^\dagger\) | 89.76 (+15.25)\(^\dagger\) | 85.12\(^\dagger\) | 90.35 (+5.23)\(^\dagger\) | 82.52\(^\dagger\) | 88.31 (+5.79)\(^\dagger\) | 83.85\(^\dagger\) | 91.68 (+7.83)\(^\dagger\)
RGAT+BERT | 75.68\(^\dagger\) | 90.48 (+14.80)\(^\dagger\) | 83.38\(^\dagger\) | 91.21 (+7.83)\(^\dagger\) | 80.45\(^\dagger\) | 87.88 (+7.43)\(^\dagger\) | 84.64\(^\dagger\) | 92.45 (+7.81)\(^\dagger\)
Ours+BERT (\(\mathcal {L}_{e}\)) | 78.02\(^\ddagger\) | 91.32 (+13.30)\(^\ddagger\) | 86.32\(^\ddagger\) | 92.86 (+6.54)\(^\ddagger\) | 82.14\(^\ddagger\) | 89.68 (+7.54)\(^\ddagger\) | 85.45\(^\ddagger\) | 93.52 (+8.07)\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{a}\)) | – | 92.45\(^\ddagger\) | – | 93.45\(^\ddagger\) | – | 90.46\(^\ddagger\) | – | 94.22\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{e+c}\)) | – | 92.04\(^\ddagger\) | – | 93.11\(^\ddagger\) | – | 90.35\(^\ddagger\) | – | 94.06\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{a+c}\)) | – | 93.12\(^\ddagger\) | – | 93.76\(^\ddagger\) | – | 90.85\(^\ddagger\) | – | 95.18\(^\ddagger\)
Avg. | 72.31 | 87.89 (+15.58) | 83.00 | 89.41 (+6.41) | 80.27 | 87.41 (+7.14) | 83.14 | 89.05 (+5.91)

From the results in Table 5, we learn that almost all ABSA models suffer their most significant accuracy drops on REVTGT, which we regard as the major bottleneck of robust ABSA. However, our pseudo training data substantially compensates for such drops on REVTGT for all these models, e.g., an average accuracy increase of 52.31%. For the other robustness testing subsets, our synthetic data also helps, e.g., an approximately 10% accuracy increase. Likewise, our proposed ABSA model always shows better results than the baselines. Interestingly, with BERT PLM information, the drops of each model on the robustness tests are largely alleviated, and correspondingly the positive effects from our pseudo data become less prominent. Still, introducing the two advanced training strategies into our model steadily leads to further improvements.

6.2.3 Training Based on MAMS Data.

Table 6 shows the performances of ABSA models trained on MAMS data. The earlier viewpoint is verified, i.e., robustness can be greatly improved by training with more challenging data: The accuracy gaps between in-house testing (on MAMS) and out-of-house testing (on ARTS) are not as significant as those observed in Table 3. This conclusion is further supported by the observation that additionally using our pseudo data (\(\mathbb {D}_o+\mathbb {D}_s\)) brings much more limited improvements. The remaining observations are the same as in Table 3, i.e., (1) syntax-aware models show stronger capabilities, (2) our proposed model gives the best performances, and (3) PLMs help achieve better robustness.

Table 6. Robustness Test Results Where Models Are Trained on MAMS (Denoted as \(\mathbb {D}_o\)) and with the Additional Pseudo Data (+\(\mathbb {D}_s\))

Model | MAMS \(\mathbb {D}_o\) | MAMS +\(\mathbb {D}_s\) | ARTS \(\mathbb {D}_o\) | ARTS +\(\mathbb {D}_s\)
\(\bullet\) w/o BERT:
MemNet | 73.24\(^\dagger\) | 75.85 (+2.61)\(^\dagger\) | 69.67\(^\dagger\) | 74.15 (+4.48)\(^\dagger\)
AttLSTM | 70.53\(^\dagger\) | 74.12 (+3.59)\(^\dagger\) | 65.25\(^\dagger\) | 70.45 (+5.20)\(^\dagger\)
TD-LSTM | 74.59\(^\dagger\) | 76.27 (+1.68)\(^\dagger\) | 69.51\(^\dagger\) | 72.36 (+2.85)\(^\dagger\)
AOA | 75.27\(^\dagger\) | 77.54 (+2.27)\(^\dagger\) | 68.33\(^\dagger\) | 71.85 (+3.52)\(^\dagger\)
GCAE | 75.82\(^\dagger\) | 77.80 (+1.98)\(^\dagger\) | 71.52\(^\dagger\) | 76.44 (+4.92)\(^\dagger\)
CapNet | 75.77\(^\dagger\) | 77.36 (+1.59)\(^\dagger\) | 73.78\(^\dagger\) | 77.38 (+3.60)\(^\dagger\)
ASGCN | 76.95\(^\dagger\) | 79.45 (+2.50)\(^\dagger\) | 75.12\(^\dagger\) | 78.57 (+3.45)\(^\dagger\)
TD-GAT | 78.54\(^\dagger\) | 80.06 (+1.52)\(^\dagger\) | 75.69\(^\dagger\) | 78.02 (+2.33)\(^\dagger\)
RGAT | 79.09\(^\dagger\) | 81.20 (+2.11)\(^\dagger\) | 76.24\(^\dagger\) | 79.24 (+3.00)\(^\dagger\)
Ours (\(\mathcal {L}_{e}\)) | 80.65\(^\ddagger\) | 82.63 (+1.98)\(^\ddagger\) | 77.50\(^\ddagger\) | 80.48 (+2.98)\(^\ddagger\)
Ours (\(\mathcal {L}_{a}\)) | – | 83.48\(^\ddagger\) | – | 82.02\(^\ddagger\)
Ours (\(\mathcal {L}_{e+c}\)) | – | 82.92\(^\ddagger\) | – | 81.25\(^\ddagger\)
Ours (\(\mathcal {L}_{a+c}\)) | – | 84.17\(^\ddagger\) | – | 82.44\(^\ddagger\)
Avg. | 76.04 | 78.23 (+2.19) | 72.26 | 75.89 (+3.63)
\(\bullet\) w/ BERT:
CapNet+BERT | 83.39\(^\dagger\) | 84.72 (+1.33)\(^\dagger\) | 79.18\(^\dagger\) | 82.48 (+3.30)\(^\dagger\)
BERT+Xu | 82.52\(^\dagger\) | 84.65 (+2.13)\(^\dagger\) | 79.38\(^\dagger\) | 82.67 (+3.29)\(^\dagger\)
PT+BERT | 83.10\(^\dagger\) | 84.88 (+1.78)\(^\dagger\) | 80.07\(^\dagger\) | 83.24 (+3.17)\(^\dagger\)
RGAT+BERT | 83.93\(^\dagger\) | 85.15 (+1.22)\(^\dagger\) | 80.48\(^\dagger\) | 83.45 (+2.97)\(^\dagger\)
Ours+BERT (\(\mathcal {L}_{e}\)) | 84.23\(^\ddagger\) | 86.04 (+1.81)\(^\ddagger\) | 81.56\(^\ddagger\) | 84.66 (+3.10)\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{a}\)) | – | 86.78\(^\ddagger\) | – | 85.47\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{e+c}\)) | – | 86.45\(^\ddagger\) | – | 85.02\(^\ddagger\)
Ours+BERT (\(\mathcal {L}_{a+c}\)) | – | 87.12\(^\ddagger\) | – | 86.93\(^\ddagger\)
Avg. | 83.43 | 85.09 (+1.66) | 80.13 | 83.30 (+3.17)

7 ANALYSIS AND DISCUSSION

In the prior experiments, we showed the effectiveness of the proposed ABSA model, the synthetic training corpus, and the advanced training paradigms for better ABSA robustness. In this section, we take a further step and explore the factors influencing the performances from these three angles.

7.1 Model Evaluation

Above we show that the syntax integration and PLMs greatly enhance robustness. Here we seek answers to the following questions.

Q1: How much do the robustness scores vary across different syntax integration methods?

Q2: Why do syntax-based models improve robustness?

Q3: To what extent does syntax quality influence robustness?

Q4: Can stronger PLMs bring better ABSA robustness?

7.1.1 Performances of Different Syntax Integration Methods.

Previously, we compared the performances of several syntax-based models, e.g., ASGCN [81], TD-GAT [31], RGAT [68], and our model. In addition, we take into consideration other state-of-the-art syntax-aware ABSA models from recent works, such as DGEDT [61] and KumaGCN [4]. Note that both ASGCN and our model use a GCN to encode the syntactic dependency structure, while our model additionally incorporates the syntax labels and the aspect into the modeling (see the sketch below). TD-GAT employs a graph attention network (GAT) [66] to encode the dependency tree. RGAT reshapes the original syntax tree into a new one rooted at the target aspect. Besides encoding the dependency tree, DGEDT additionally considers the flat representations learned from a Transformer, while KumaGCN leverages latent syntax structures. We also implement an ABSA model encoding random trees for comparison.
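
As a rough illustration of this design space, the following is a minimal PyTorch sketch of a syntax-label- and aspect-aware GCN layer in the spirit of USGCN; the class and argument names are our assumptions, and the paper's exact gating and fusion (Equation (5)) may differ.

```python
import torch
import torch.nn as nn

class LabelAwareGCNLayer(nn.Module):
    """Sketch of a GCN layer that conditions message passing on the
    dependency structure, the arc labels, and the target aspect."""
    def __init__(self, dim, num_labels):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, dim)  # arc-label embeddings
        self.w = nn.Linear(3 * dim, dim)                # node + label + aspect

    def forward(self, h, adj, label_ids, aspect_repr):
        # h: [B, N, d] token states; adj: [B, N, N] dependency adjacency;
        # label_ids: [B, N, N] dependency-label ids; aspect_repr: [B, d].
        B, N, d = h.shape
        lab = self.label_emb(label_ids)                        # [B, N, N, d]
        asp = aspect_repr[:, None, None, :].expand(B, N, N, d)
        nbr = h[:, None, :, :].expand(B, N, N, d)              # neighbor states
        msg = self.w(torch.cat([nbr, lab, asp], dim=-1))       # [B, N, N, d]
        # Aggregate messages from syntactic neighbors only, degree-normalized.
        h_new = (adj.unsqueeze(-1) * msg).sum(dim=2)
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        return torch.relu(h_new / deg)
```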

We measure their performances15 on five subsets of robustness testing (Restaurant), as plotted in Figure 8. We observe some interesting patterns, where different models have distinct capabilities on each type of robustness test. Among all of them, our USGCN-based system performs best on the REVTGT, ADDDIFF, and MAMS challenges. RGAT gives the strongest performance on the REVNON test, while DGEDT is most reliable on the RWTBG test. We also note that encoding random trees gives the worst results on all attributes, and that encoding the latent structure (KumaGCN) actually helps little with robustness, largely due to the noise it introduces.

Fig. 8. Radar map of the performances by different syntax-aware ABSA models on each specific robust test.

7.1.2 Faithfulness of Opinion Clues of Aspect.

The key to robust ABSA modeling (for both the “Aspect-context binding” and the “Multi-aspect anti-interference” challenge) lies in the capability of locating the exact opinion texts for the target aspect, i.e., the faithfulness of the target aspect's opinion clues. To verify this faithfulness, we experiment with the manually annotated TOWE test set [15], where the exact opinion expressions of each target aspect are explicitly labeled. We measure the deviation between the words highly weighted by an ABSA model and the gold opinion expressions, and take it as the (inverse) faithfulness; an illustrative computation follows Figure 9. We compare the different syntax-aware models, additionally including the attention-based AttLSTM model. In Figure 9, we plot the results. We clearly see that different models come with varying faithfulness. For example, RGAT, ASGCN, and our USGCN-based model give much smaller deviations than the other models, and all the syntax-aware models show higher faithfulness than AttLSTM.

Fig. 9. Model deviations on the faithfulness of aspect’s opinion clues.
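
One plausible way to compute such a deviation score is sketched below; this formulation (the weight mass a model places outside the gold opinion words) is our illustrative assumption rather than the paper's exact metric.

```python
import numpy as np

def opinion_deviation(word_weights, gold_opinion_mask):
    """Illustrative faithfulness measure: the fraction of the model's
    (normalized) word-importance mass that falls outside the gold opinion
    words of the target aspect. 0 means perfectly faithful."""
    w = np.asarray(word_weights, dtype=float)
    w = w / w.sum()                            # normalize to a distribution
    mask = np.asarray(gold_opinion_mask, dtype=float)
    return 1.0 - float((w * mask).sum())

# E.g., for "The service is great" with gold opinion word "great" (index 3):
dev = opinion_deviation([0.05, 0.30, 0.05, 0.60], [0, 0, 0, 1])  # -> 0.40
```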

7.1.3 Impacts of Syntax Quality.

The quality of the syntax is crucial to syntax-based models, since it influences the robustness testing performances. However, the ABSA data has no gold syntactic dependency annotations, so we use automatic parses instead. By controlling the quality of the dependency parser, i.e., varying its testing LAS, we obtain an array of parsers of different quality, which we use to generate annotations of varying quality. We then run the experiments and observe the corresponding performances. Figure 10 shows the robustness testing accuracy under varying parser quality. As the parse quality decreases, the performance drops dramatically. Interestingly, the RGAT model degrades the most when the syntax quality decreases, mostly because reshaping a suboptimal syntax structure introduces substantial noise. Besides, compared with ASGCN, our model is more sensitive to the parse quality, as it additionally relies on the syntax label information.

Fig. 10. Influence of the syntax quality.

7.1.4 Effect of Pre-trained Language Model.

The robustness of ABSA models is universally improved by BERT, in that PLMs entail abundant linguistic and semantic knowledge for reasoning about the relation between the aspect and the valid contexts, which coincides with related works [35, 76, 78]. Here we explore whether we can obtain better results with enhanced PLMs, e.g., other types of PLMs or task-aware pre-training. First, we compare BERT with RoBERTa,16 an upgraded version of BERT. Besides, we additionally perform a “post-training” of the PLMs between the pre-training and fine-tuning stages, i.e., predicting the opinion texts of a given aspect via the masked language modeling (MLM) technique on the synthetic data (\(\mathbb {D}_a\)) (“A.O.MLM”); a sketch follows Figure 11. From the trends in Figure 11, we see that compared with BERT, RoBERTa gives very substantial improvements. Further, with the post-training of aspect-opinion MLM, each BERT/RoBERTa-based model obtains prominently improved results.

Fig. 11. Performances with different pre-trained language models.
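
The following is a hedged sketch of the A.O.MLM post-training step using the HuggingFace transformers library: the opinion words of a given aspect are masked in a synthetic sample, and the PLM is trained to recover them. The single-sentence wiring and the masking policy are our assumptions; dataset iteration, optimizer, and hyperparameters are omitted.

```python
import torch
from transformers import BertTokenizerFast, BertForMaskedLM

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

sentence = "The service is great, and the pizza is delicious."
opinion_words = {"great"}      # opinion clue of the aspect "service" (assumed given)

enc = tokenizer(sentence, return_tensors="pt")
labels = enc["input_ids"].clone()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for i, tok in enumerate(tokens):
    if tok in opinion_words:
        enc["input_ids"][0, i] = tokenizer.mask_token_id  # mask the opinion word
    else:
        labels[0, i] = -100        # ignore all non-opinion positions in the loss

loss = model(**enc, labels=labels).loss   # MLM loss on the opinion words only
loss.backward()
```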

7.2 Corpus Evaluation

We study two major questions w.r.t. the synthetic data induction.

Q1: What is the contribution of each of the three types of pseudo data?

Q2: How does the quality of the pseudo data influence the robustness learning?

7.2.1 Contributions from Different Types of Synthetic Data.

Each type of our constructed pseudo training data (\(\mathbb {D}_a\), \(\mathbb {D}_n\), and \(\mathbb {D}_m\)) is devoted to a different perspective of the robustness challenges. Here we examine the contribution of each data type to the different robustness testing subsets. In Figure 12, we show the results (based on Restaurant), from which we gain some interesting observations. First, it is clear that the sentiment modification data (\(\mathbb {D}_a\)) contributes the most to REVTGT and REVNON, where the former takes the major proportion of the overall robustness test. This is reasonable, since enriching the sentiment diversity of each target aspect with various opinion words via \(\mathbb {D}_a\) directly targets the aspect-context binding challenge, helping the ABSA model link the target aspects to the critical opinion clues more correctly.

Fig. 12. Training using additional synthetic data of different types, and evaluating on each specific robustness test set.

Second, the non-target aspect addition data (\(\mathbb {D}_m\)) benefits ADDDIFF and MAMS more, while the background rewriting data (\(\mathbb {D}_n\)) mainly improves RWTBG. This is easy to understand: Because \(\mathbb {D}_m\) increases the number of non-target aspects in sentences, it creates rich cases of multi-aspect coexistence that facilitate the learning of the ABSA model, so when faced with the multi-aspect challenge in ADDDIFF and MAMS, the model naturally performs better. Finally, when combining the full set of all three data types (\(\mathbb {D}_o+\mathbb {D}_s\)), all the robustness challenges receive the highest results, which, notably, can be further enhanced by the better training strategies, i.e., \(\mathcal {L}_{a+c}(\mathbb {D}_o+\mathbb {D}_s)\). We also notice that any use of our enhanced data improves robustness compared with the \(\mathcal {L}_{e}(\mathbb {D}_o)\) setting.

7.2.2 Impacts of the Pseudo Data Quality.

In Section 4, we devised three threshold values, i.e., \(\theta _a\), \(\theta _n\), and \(\theta _m\), for the quality control of the corresponding data types; a filtering sketch follows Figure 13. Now we study the influence of the constructed synthetic data quality. Figure 13 plots the performance curves under varying threshold values. First, when the thresholds are increased (for \(\theta _a\), \(\theta _n\), and \(\theta _m\) alike), the numbers of induced samples are all reduced dramatically; intuitively, the higher-quality constructed samples are always the minority. ABSA models achieve their best performances at a tradeoff between data quantity and quality: too little training data provides insufficient signal for learning the inductive bias, despite the comparatively high quality of the training instances, while larger amounts of noisy training data also undermine the learning. The equilibrium points vary among the different types of synthetic data, e.g., \(\theta _a=0.2\), \(\theta _n=0.25\), and \(\theta _m=0.85\), at which the sample numbers in \(\mathbb {D}_a\), \(\mathbb {D}_n\), and \(\mathbb {D}_m\) are approximately 10,000, 12,500, and 4,000, respectively. We find that our USGCN model consistently performs best in all cases across the three datasets.

Fig. 13. Results under different quality of synthetic data.
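
A minimal sketch of this quality control, assuming a generic per-sample scorer (the scorer itself is not specified here), is:

```python
def filter_by_threshold(samples, score_fn, threshold):
    """Keep a synthetic sample only if its confidence score passes the
    corresponding threshold (theta_a, theta_n, or theta_m). `score_fn` is
    an assumed scorer, e.g., a consistency check used during induction."""
    return [s for s in samples if score_fn(s) >= threshold]

# Sweeping the threshold exposes the quantity-quality tradeoff in Figure 13:
# higher thresholds shrink the corpus but raise its average quality, e.g.,
# sizes = {t: len(filter_by_threshold(d_a, scorer, t)) for t in (0.1, 0.2, 0.3)}
```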

7.3 Training Evaluation

We have confirmed earlier that better training strategies further help improve robustness. Correspondingly, we care about one main question:

Q1: How does the training paradigm affect the robustness learning?

In Section 5.2, we proposed four types of learning schemes to fully utilize, in an unsupervised manner, the rich contrastive signals within the synthetic corpus. Furthermore, we explore the following:

Q2: How much do the performances of the contrastive learning schemes vary?

7.3.1 Visualization for Advanced Training Strategy.

Q1 asks for the underlying reason that different training methods lead to diversified performances. As introduced earlier, the adversarial training helps reinforce the perception of contextual change via the three types of enhanced pseudo data, while the contrastive learning unsupervisedly consolidates the recognition of different labels. To verify this, we visualize the model representations produced by the different training strategies, e.g., adversarial training (\(\mathcal {L}_a\)) and contrastive learning (\(\mathcal {L}_c\)) as well as the hybrid training (\(\mathcal {L}_{e+c}\) and \(\mathcal {L}_{a+c}\)). We render the final feature representation \(\mathbf {r}^f\) of each instance in the ARTS test set (Restaurant) with the t-SNE algorithm, as shown in Figure 14 (a plotting sketch is given after the figure). The gaps in the models' capabilities across training methods are easy to see. First, from the patterns in (b) and (c) we understand that the decision boundaries learned by the standard cross-entropy training objective can be quite obscure, while the advanced training, especially the adversarial training, indeed helps greatly in learning clearer decision boundaries between different sentiment labels. Besides, when we instead combine the cross-entropy objective with the additional unsupervised contrastive representation learning (without adversarial training), the decision boundaries also become much clearer. This reflects the importance of leveraging contrastive representation learning to sufficiently mine the inherent knowledge in the data for better ABSA robustness. Notably, the hybrid of adversarial training and contrastive learning (\(\mathcal {L}_{a+c}\)) gives the best effect. Additionally, comparing Figure 14(a) with Figure 14(b), we see the high effectiveness of leveraging the pseudo training corpus.

Fig. 14. Visualizations of the model representations by different training strategies. Best viewed in color and by zooming in.
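
A minimal sketch of how such a visualization can be produced with scikit-learn and matplotlib (the t-SNE settings here are illustrative defaults, not the paper's exact configuration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_features(r_f, labels, title):
    """Project the final feature representations r^f (one row per ARTS
    instance) to 2-D with t-SNE and color by sentiment label."""
    z = TSNE(n_components=2, init="pca", random_state=0).fit_transform(r_f)
    labels = np.asarray(labels)
    for lab, name in enumerate(("negative", "neutral", "positive")):
        pts = z[labels == lab]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
    plt.legend()
    plt.title(title)
    plt.show()
```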

7.3.2 Into Contrastive Learning.

Each of the contrastive learning schemes focuses on one case of the intra-/inter-aspect and opinion-/structure-guided perspectives. In all the above experiments, we use the full combination of these schemes in pursuit of the maximum effect. Here we check the contribution of each scheme separately by ablating the corresponding loss term and observing the performance drop (see the sketch after Figure 15); intuitively, the bigger the drop, the more important the term. We plot the accuracy drops for \(\mathcal {L}_{e+c}\) and \(\mathcal {L}_{a+c}\) in Figure 15. In general, the drops under \(\mathcal {L}_{e+c}\) are higher than those under \(\mathcal {L}_{a+c}\); the most plausible reason is that the adversarial training alone already learns good biases compared with the cross-entropy training. Besides, from a global view, both \(\mathcal {L}_{e+c}\) and \(\mathcal {L}_{a+c}\) show the same trend, i.e., the opinion-guided learning matters more than the structure-guided one. This largely proves that the final feature representations at the last aggregation layer carry the major opinion features for the target aspect, whereas the representations from the syntax fusion module at the middle layers may not fully cover the final opinion-aware features. We note, however, that within the scope of the inter-aspect learning, the structure-guided scheme is on par with the opinion-guided one, because two different aspects can have a clearer distinction in syntax structures, allowing for a better contrast. Overall, all four learning schemes contribute to ABSA robustness.

Fig. 15. Performances by different contrastive learning schemes.
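
A minimal sketch of the ablation logic, with our shorthand names for the four schemes (the pairing of opinion-guided final-layer features versus structure-guided mid-layer features is our assumption about how the contrast pairs are formed):

```python
# Our shorthand for the four schemes of Section 5.2: intra-/inter-aspect
# pairs, each guided by opinion features (final-layer r^f) or by structure
# features (mid-layer syntax-fusion states).
SCHEMES = ("intra_opinion", "intra_structure", "inter_opinion", "inter_structure")

def total_contrastive_loss(scheme_losses, ablate=()):
    """scheme_losses maps each scheme name to its InfoNCE-style loss tensor
    (see the earlier training sketch). Ablating a scheme for the Figure 15
    study simply drops its term from the sum."""
    return sum(loss for name, loss in scheme_losses.items()
               if name not in ablate)
```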

8 CONCLUSION AND FUTURE WORK

Since the early 2010s, a great number of neural ABSA models have emerged in pursuit of stronger task performances and higher testing scores. They, however, can be vulnerable to new cases in the wild where the contexts vary. Improving ABSA robustness thus becomes imperative. In this study, we rethink the bottlenecks of ABSA robustness and improve it from a systematic perspective, i.e., model, data, and training. Improving the robustness of ABSA models strengthens their capability to adapt to real-world environments and facilitates commercial applications for our society. In addition, the methods we propose for the robustness improvement of ABSA can be readily transferred to other AI-based applications and tasks and thus benefit society. In the following, we conclude what works for robust ABSA and then shed light on what is next.

After a comprehensive comparison of current strong-performing ABSA models, syntax-based models show the best robustness, owing to their extraordinary capability of locating the exact opinion texts for the target aspect. In this work, we introduce a novel syntax-aware model: We model the syntactic dependency structure, the arc labels, and the target aspect simultaneously with a GCN encoder, namely USGCN. With USGCN, we achieve the goal of incorporating richer syntax information for the best ABSA robustness. We also reveal that better pre-trained language models greatly help the robustness learning. As future work, we encourage either relieving the negative effects of syntax-based methods (e.g., heavy reliance on syntax quality) or devising syntax-agnostic models with strong aspect-context binding abilities. Alternatively, we recommend integrating external syntax knowledge into PLMs during the post-training stage and then performing opinion-aware fine-tuning.

Another key bottleneck is the data. Strong ABSA models achieve good accuracy on in-house testing data but fail to scale to unseen cases, because the training set is insufficient for learning a good inductive bias. We thus construct additional synthetic training data: Three types of high-quality corpora are automatically induced from the raw SemEval data, enabling sufficient robustness learning for ABSA models, with each type of pseudo data targeting one particular angle of ABSA robustness. Future work may explore better approaches to automatically constructing higher-quality corpora, e.g., inducing more reliable data with less sentiment uncertainty. In addition, automatically constructing large-scale sentiment data for training better PLMs for robust ABSA will be a promising direction.

The training paradigm is also important. Most existing ABSA frameworks adopt standard training with the cross-entropy objective. In this work, we propose performing adversarial training on the pseudo data to enhance the resistance to environment perturbations, such as opinion flips, background rewriting, and multi-aspect coexistence. Meanwhile, we employ the unsupervised contrastive learning technique to further enhance the representation learning based on the contrastive samples in the pseudo data, designing four different learning schemes to fully consolidate the recognition of the robustness challenges. As future work, we believe it will be meaningful to build a more reasonable and efficient adversarial training framework that achieves higher robustness performance in less time.

Footnotes

1. “Enclosure test” describes a process in which the data used for training and testing are drawn from the same sources and share the same distribution.
2. Here “well-trained” describes an ABSA model that is trained on the in-house training set and achieves peak performance on the development set.
3. We use “trivial contexts” and “misleading clues” to describe those contexts that are not the opinion expressions triggering the aspects in ABSA.
4. A “robustness test” refers to a probing test of a model's robustness, often performed on a dedicated robustness testing dataset.
5. “Valid contexts” describes the parts of the context that are critical cues of the opinion expressions.
6. The dependency label nsubj refers to a nominal subject.
7. Replacing with words having the same POS tags.
8. Major languages, including five language pairs other than English: Chinese–English, French–English, German–English, Spanish–English, and Portuguese–English. According to recent findings in NMT, the performance of back-translation on these languages is satisfactory [7, 33, 43].
9. We employ an off-the-shelf translation system for high-quality translation, i.e., Google Translate: https://translate.google.com/.
10. Following universal dependency v3.9.2.
11. https://catalog.ldc.upenn.edu/LDC99T42.
12. We run the experiments via their released codes.
13. https://github.com/google-research/bert, uncased base version.
14. We first derive the pseudo data as in Section 4.2 and then manually inspect the data to ensure its quality.
15. We normalize each value by dividing by the maximum on each subset.
16. https://github.com/pytorch/fairseq/tree/master/examples/roberta.

REFERENCES

  1. [1] Baccianella Stefano, Esuli Andrea, and Sebastiani Fabrizio. 2010. Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the International Conference on Language Resources and Evaluation. 22002204.Google ScholarGoogle Scholar
  2. [2] Banerjee Satanjeev and Lavie Alon. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Workshop on Evaluation Measures for Machine Translation and Summarization. 6572.Google ScholarGoogle Scholar
  3. [3] Belinkov Yonatan and Bisk Yonatan. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  4. [4] Chen Chenhua, Teng Zhiyang, and Zhang Yue. 2020. Inducing target-specific latent structures for aspect sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 55965607.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Chen Ting, Kornblith Simon, Norouzi Mohammad, and Hinton Geoffrey E.. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. 15971607.Google ScholarGoogle Scholar
  6. [6] Chia Yew Ken, Bing Lidong, Poria Soujanya, and Si Luo. 2022. RelationPrompt: Leveraging prompts to generate synthetic data for zero-shot relation triplet extraction. In Findings of the Association for Computational Linguistics: ACL 2022. 4557.Google ScholarGoogle Scholar
  7. [7] Chiang Ting-Rui, Chen Yi-Pei, Yeh Yi-Ting, and Neubig Graham. 2022. Breaking down multilingual machine translation. In Findings of the Association for Computational Linguistics: ACL 2022. 27662780.Google ScholarGoogle Scholar
  8. [8] Chivukula Aneesh Sreevallabh and Liu Wei. 2019. Adversarial deep learning models with multiple adversaries. IEEE Trans. Knowl. Data Eng. 31, 6 (2019), 10661079.Google ScholarGoogle ScholarCross RefCross Ref
  9. [9] Clercq Orphée De, Lefever Els, Jacobs Gilles, Carpels Tijl, and Hoste Véronique. 2017. Towards an integrated pipeline for aspect-based sentiment analysis in various domains. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 136142.Google ScholarGoogle ScholarCross RefCross Ref
  10. [10] Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics. 41714186.Google ScholarGoogle Scholar
  11. [11] Do Hai Ha, Prasad P. W. C., Maag Angelika, and Alsadoon Abeer. 2019. Deep learning for aspect-based sentiment analysis: A comparative review. Expert Syst. Appl. 118 (2019), 272299.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. [12] Dong Li, Wei Furu, Tan Chuanqi, Tang Duyu, Zhou Ming, and Xu Ke. 2014. Adaptive recursive neural network for target-dependent Twitter sentiment classification. In Proceedings of the Annual Meeting of the Association for Computer Linguistics (ACL’14). 4954.Google ScholarGoogle ScholarCross RefCross Ref
  13. [13] Dozat Timothy and Manning Christopher D.. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  14. [14] Ebrahimi Javid, Rao Anyi, Lowd Daniel, and Dou Dejing. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 3136.Google ScholarGoogle ScholarCross RefCross Ref
  15. [15] Fan Zhifang, Wu Zhen, Dai Xin-Yu, Huang Shujian, and Chen Jiajun. 2019. Target-oriented opinion words extraction with target-fused neural sequence labeling. In Proceedings of the North American Chapter of the Association for Computational Linguistics. 25092518.Google ScholarGoogle ScholarCross RefCross Ref
  16. [16] Fei Hao, Li Fei, Li Bobo, and Ji Donghong. 2021. Encoder-decoder based unified semantic role labeling with label-aware syntax. In Proceedings of the AAAI Conference on Artificial Intelligence. 1279412802.Google ScholarGoogle ScholarCross RefCross Ref
  17. [17] Fei Hao, Li Jingye, Ren Yafeng, Zhang Meishan, and Ji Donghong. 2022. Making decision like human: Joint aspect category sentiment analysis and rating prediction with fine-to-coarse reasoning. In Proceedings of the WWW: The Web Conference. 30423051.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. [18] Fei Hao, Ren Yafeng, and Ji Donghong. 2020. Improving text understanding via deep syntax-semantics communication. In Findings of the Association for Computational Linguistics: EMNLP 2020. 8493.Google ScholarGoogle Scholar
  19. [19] Fei Hao, Ren Yafeng, and Ji Donghong. 2020. Mimic and conquer: Heterogeneous tree structure distillation for syntactic NLP. In Findings of the Association for Computational Linguistics: EMNLP 2020. 183193.Google ScholarGoogle Scholar
  20. [20] Fei Hao, Ren Yafeng, and Ji Donghong. 2020. Retrofitting structure-aware transformer language model for end tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 21512161.Google ScholarGoogle ScholarCross RefCross Ref
  21. [21] Fei Hao, Ren Yafeng, Wu Shengqiong, Li Bobo, and Ji Donghong. 2021. Latent target-opinion as prior for document-level sentiment classification: A variational approach from fine-grained perspective. In Proceedings of the WWW: The Web Conference. 553564.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Fei Hao, Ren Yafeng, Zhang Yue, and Ji Donghong. 2021. Nonautoregressive encoder-decoder neural framework for end-to-end aspect-based sentiment triplet extraction. IEEE Trans. Neural Netw. Learn. Syst. (2021), 113.Google ScholarGoogle Scholar
  23. [23] Fei Hao, Wu Shengqiong, Ren Yafeng, Li Fei, and Ji Donghong. 2021. Better combine them together! Integrating syntactic constituency and dependency representations for semantic role labeling. In Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. 549559.Google ScholarGoogle Scholar
  24. [24] Fei Hao, Wu Shengqiong, Ren Yafeng, and Zhang Meishan. 2022. Matching structure for dual learning. In Proceedings of the International Conference on Machine Learning (ICML’22). 63736391.Google ScholarGoogle Scholar
  25. [25] Fei Hao, Zhang Meishan, and Ji Donghong. 2020. Cross-lingual semantic role labeling with high-quality translated training corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 70147026.Google ScholarGoogle ScholarCross RefCross Ref
  26. [26] Fei Hao, Zhang Yue, Ren Yafeng, and Ji Donghong. 2020. Latent emotion memory for multi-label emotion classification. In Proceedings of the AAAI Conference on Artificial Intelligence. 76927699.Google ScholarGoogle ScholarCross RefCross Ref
  27. [27] Gao Yuze, Zhang Yue, and Xiao Tong. 2017. Implicit syntactic features for target-dependent sentiment analysis. In Proceedings of the International Joint Conference on Natural Language Processing. 516524.Google ScholarGoogle Scholar
  28. [28] He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, and Girshick Ross B.. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 97269735.Google ScholarGoogle ScholarCross RefCross Ref
  29. [29] He Keqing, Zhang Jinchao, Yan Yuanmeng, Xu Weiran, Niu Cheng, and Zhou Jie. 2020. Contrastive zero-shot learning for cross-domain slot filling with adversarial attack. In Proceedings of the International Conference on Computational Linguistics. 14611467.Google ScholarGoogle ScholarCross RefCross Ref
  30. [30] Hochreiter Sepp and Schmidhuber Jürgen. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 17351780.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Huang Binxuan and Carley Kathleen. 2019. Syntax-aware aspect level sentiment classification with graph attention networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 54695477.Google ScholarGoogle ScholarCross RefCross Ref
  32. [32] Huang Binxuan, Ou Yanglan, and Carley Kathleen M.. 2018. Aspect level sentiment classification with attention-over-attention neural networks. In Proceedings of the International Conference of Social, Cultural, and Behavioral Modeling (SBP-BRiMS’18). 197206.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Huang Dandan, Wang Kun, and Zhang Yue. 2021. A comparison between pre-training and large-scale back-translation for neural machine translation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 17181732.Google ScholarGoogle Scholar
  34. [34] Jia Robin and Liang Percy. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 20212031.Google ScholarGoogle ScholarCross RefCross Ref
  35. [35] Jiang Qingnan, Chen Lei, Xu Ruifeng, Ao Xiang, and Yang Min. 2019. A challenge dataset and effective models for aspect-based sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 62806285.Google ScholarGoogle ScholarCross RefCross Ref
  36. [36] Akbar Karimi, Rossi Leonardo, and Full Andrea Katharina. 2020. Adversarial training for aspect-based sentiment analysis with bert. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR). 8797–8803.Google ScholarGoogle Scholar
  37. [37] Kim Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 17461751.Google ScholarGoogle ScholarCross RefCross Ref
  38. [38] Li Xin, Bing Lidong, Lam Wai, and Shi Bei. 2018. Transformation networks for target-oriented sentiment classification. In Proceedings of the Annual Meeting of the Association of Computer Linguistics (ACL’18). 946956.Google ScholarGoogle ScholarCross RefCross Ref
  39. [39] Li Zheng, Li Xin, Wei Ying, Bing Lidong, Zhang Yu, and Yang Qiang. 2019. Transferable end-to-end aspect-based sentiment analysis with selective adversarial learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 45904600.Google ScholarGoogle ScholarCross RefCross Ref
  40. [40] Liang Bin, Yin Rongdi, Gui Lin, He Yulan, and Xu Ruifeng. 2020. Aspect-invariant sentiment features learning: Adversarial multi-task learning for aspect-based sentiment analysis. In Proceedings of the International Conference on Information and Knowledge Management. 825834.Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. [41] Liu Ning and Shen Bo. 2020. ReMemNN: A novel memory neural network for powerful interaction in aspect-based sentiment analysis. Neurocomputing 395 (2020), 6677.Google ScholarGoogle ScholarCross RefCross Ref
  42. [42] Liu Peng, Zhang Lemei, and Gulla Jon Atle. 2021. Multilingual review-aware deep recommender system via aspect-based sentiment analysis. ACM Trans. Inf. Syst. 39, 2 (2021), 15:1–15:33.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Liu Xuebo, Wang Longyue, Wong Derek F., Ding Liang, Chao Lidia S., Shi Shuming, and Tu Zhaopeng. 2021. On the complementarity between pre-training and back-translation for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2021. 29002907.Google ScholarGoogle Scholar
  44. [44] Miller and George A.. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 3941.Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Miller John, Krauth Karl, Recht Benjamin, and Schmidt Ludwig. 2020. The effect of natural distribution shift on question answering models. In Proceedings of the International Conference on Machine Learning. 69056916.Google ScholarGoogle Scholar
  46. [46] Morris John X., Lifland Eli, Yoo Jin Yong, and Qi Yanjun. 2020. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 119–126.Google ScholarGoogle Scholar
  47. [47] Mullen Tony and Collier Nigel. 2004. Sentiment analysis using support vector machines with diverse information sources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 412418.Google ScholarGoogle Scholar
  48. [48] Pang Bo and Lee Lillian. 2007. Opinion mining and sentiment analysis. Found. Trends Inf. Retriev. 2, 1-2 (2007), 1135.Google ScholarGoogle Scholar
  49. [49] Pennington Jeffrey, Socher Richard, and Manning Christopher D.. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 15321543.Google ScholarGoogle ScholarCross RefCross Ref
  50. [50] Phienthrakul Tanasanee, Kijsirikul Boonserm, Takamura Hiroya, and Okumura Manabu. 2009. Sentiment classification with support vector machines and multiple kernel functions. In Proceedings of the International Conference on Neural Information Processing. 583592.Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. [51] Pontiki Maria, Galanis Dimitris, Papageorgiou Haris, Androutsopoulos Ion, Manandhar Suresh, AL-Smadi Mohammad, Al-Ayyoub Mahmoud, Zhao Yanyan, Qin Bing, Clercq Orphée De, Hoste Véronique, Apidianaki Marianna, Tannier Xavier, Loukachevitch Natalia, Kotelnikov Evgeniy, Bel Nuria, Jiménez-Zafra Salud María, and Eryiğit Gülşen. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation (SemEval’16). 1930.Google ScholarGoogle ScholarCross RefCross Ref
  52. [52] Pontiki Maria, Galanis Dimitrios, Papageorgiou Harris, Manandhar Suresh, and Androutsopoulos Ion. 2015. Semeval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation (SemEval’15). 486495.Google ScholarGoogle ScholarCross RefCross Ref
  53. [53] Pontiki Maria, Galanis Dimitris, Pavlopoulos John, Papageorgiou Harris, Androutsopoulos Ion, and Manandhar Suresh. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the International Workshop on Semantic Evaluation (SemEval’14). 2735.Google ScholarGoogle ScholarCross RefCross Ref
  54. [54] Veyseh Amir Pouran Ben, Nouri Nasim, Dernoncourt Franck, Tran Quan Hung, Dou Dejing, and Nguyen Thien Huu. 2020. Improving aspect-based sentiment analysis with gated graph convolutional networks and syntax-based regulation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 45434548.Google ScholarGoogle Scholar
  55. [55] Ren Yafeng, Zhang Yue, Zhang Meishan, and Ji Donghong. 2016. Context-sensitive Twitter sentiment classification using neural network. In Proceedings of the AAAI Conference on Artificial Intelligence. 215221.Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Ribeiro Marco Tulio, Singh Sameer, and Guestrin Carlos. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 856865.Google ScholarGoogle ScholarCross RefCross Ref
  57. [57] Samanta Bidisha, Ganguly Niloy, and Chakrabarti Soumen. 2019. Improved sentiment detection via label transfer from monolingual to synthetic code-switched text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 35283537.Google ScholarGoogle ScholarCross RefCross Ref
  58. [58] Schouten Kim and Frasincar Flavius. 2016. Survey on aspect-level sentiment analysis. IEEE Trans. Knowl. Data Eng. 28, 3 (2016), 813830.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. [59] Tang Duyu, Qin Bing, Feng Xiaocheng, and Liu Ting. 2016. Effective LSTMs for target-dependent sentiment classification. In Proceedings of the International Conference on Computational Linguistics. 32983307.Google ScholarGoogle Scholar
  60. [60] Tang Duyu, Qin Bing, and Liu Ting. 2016. Aspect level sentiment classification with deep memory network. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 214224.Google ScholarGoogle ScholarCross RefCross Ref
  61. [61] Tang Hao, Ji Donghong, Li Chenliang, and Zhou Qiji. 2020. Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 65786588.Google ScholarGoogle ScholarCross RefCross Ref
  62. [62] Tian Yonglong, Krishnan Dilip, and Isola Phillip. 2020. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision. 776794.Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Tschannen Michael, Djolonga Josip, Rubenstein Paul K., Gelly Sylvain, and Lucic Mario. 2020. On mutual information maximization for representation learning. In Proceedings of the International Conference on Learning Representations.Google ScholarGoogle Scholar
  [64] Raisa Varghese and M. Jayasree. 2013. Aspect based sentiment analysis using support vector machine classifier. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics. 1581–1586.
  [65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS’17). 5998–6008.
  [66] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the International Conference on Learning Representations.
  [67] Petar Velickovic, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R. Devon Hjelm. 2019. Deep graph infomax. In Proceedings of the International Conference on Learning Representations.
  [68] Kai Wang, Weizhou Shen, Yunyi Yang, Xiaojun Quan, and Rui Wang. 2020. Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 3229–3238.
  [69] Shuai Wang, Sahisnu Mazumder, Bing Liu, Mianwei Zhou, and Yi Chang. 2018. Target-sensitive memory networks for aspect sentiment classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 957–967.
  [70] Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2017. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In Proceedings of the AAAI Conference on Artificial Intelligence. 3316–3322.
  [71] Yanyan Wang, Qun Chen, Murtadha H. M. Ahmed, Zhanhuai Li, Wei Pan, and Hailong Liu. 2021. Joint inference for aspect-level sentiment analysis by deep neural networks and linguistic hints. IEEE Trans. Knowl. Data Eng. 33, 5 (2021), 2002–2014.
  [72] Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 606–615.
  [73] Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In Proceedings of the International Conference on Learning Representations.
  [74] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 347–354.
  [75] Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, and Jingye Li. 2021. Learn from syntax: Improving pair-wise aspect and opinion terms extraction with rich syntactic knowledge. In Proceedings of the 30th International Joint Conference on Artificial Intelligence. 3957–3963.
  [76] Xiaoyu Xing, Zhijing Jin, Di Jin, Bingning Wang, Qi Zhang, and Xuanjing Huang. 2020. Tasty burgers, soggy fries: Probing aspect robustness in aspect-based sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 3594–3605.
  [77] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the North American Chapter of the Association for Computational Linguistics. 2324–2335.
  [78] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. A failure of aspect sentiment classifiers and an adaptive re-weighting solution. CoRR abs/1911.01460 (2019).
  [79] Wei Xue and Tao Li. 2018. Aspect based sentiment analysis with gated convolutional networks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 2514–2523.
  [80] Guoyang Zeng, Fanchao Qi, Qianrui Zhou, Tingji Zhang, Bairu Hou, Yuan Zang, Zhiyuan Liu, and Maosong Sun. 2021. OpenAttack: An open-source textual adversarial attack toolkit. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations. 363–371.
  [81] Chen Zhang, Qiuchi Li, and Dawei Song. 2019. Aspect-based sentiment classification with aspect-specific graph convolutional networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 4568–4578.
  [82] David Zimbra, Ahmed Abbasi, Daniel Zeng, and Hsinchun Chen. 2018. The state-of-the-art in Twitter sentiment analysis: A review and benchmark evaluation. ACM Trans. Manage. Inf. Syst. 9, 2 (2018), 5:1–5:29.

Published in: ACM Transactions on Information Systems, Volume 41, Issue 2 (April 2023), 770 pages. ISSN: 1046-8188. EISSN: 1558-2868. DOI: 10.1145/3568971.

Publisher: Association for Computing Machinery, New York, NY, United States.

Publication History: Received 30 May 2021; revised 10 August 2022; accepted 11 September 2022; online AM 19 September 2022; published 21 December 2022.
