DATA CARTOGRAPHY BASED AUGMENTATION TECHNIQUES FOR STANCE DETECTION

Authors

  • Bianca-Ștefania MUȘAT, Quant Risk Analyst at London Stock Exchange Group
  • Cornelia CARAGEA, University of Illinois at Chicago, USA
  • Florentina HRISTEA, University of Bucharest, Romania

DOI:

https://doi.org/10.62229/aubinf63/93-113

Keywords:

stance detection, data cartography, training dynamics, data augmentation

Abstract

Stance detection is the task of determining whether the information conveyed in a text is against, neutral toward, or in favor of a particular target. Since there is a plethora of targets upon which one can adopt a position, one common challenge of the stance detection task is the scarcity of annotations. At the same time, an emphasis on data quantity frequently entails a compromise in data quality. To address both challenges, we propose two data augmentation techniques that leverage training dynamics (the model's behavior on individual instances during training) to identify and combine data instances with differing properties, for example instances that improve the generalization capabilities of the model and instances that aid its optimization process. The first data augmentation method uses training dynamics to generate additional virtual samples during model training by interpolating existing annotated samples with differing characteristics. The second data augmentation approach is defined as a conditional masked language modeling task that generates additional samples by predicting the masked words of the input sentence, conditioned not only on its context but also on an auxiliary sentence sampled based on its characteristics. We empirically validated that fine-tuning a pre-trained language model on a subset of the training data that excludes the instances that harm the training process achieves better performance than fine-tuning the same model on the entire training dataset. Moreover, in most cases, the performance of existing augmentation approaches also improved when data with differing properties were used during the augmentation process instead of randomly sampled data.
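
As an illustration of the two ingredients named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes that training dynamics are summarized per instance in the usual data cartography way, as confidence (mean probability assigned to the gold label across epochs) and variability (standard deviation of that probability), and that virtual samples are formed by linear interpolation of two feature vectors and their one-hot labels. The function names, the toy probabilities, and the mixing weight `lam` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def training_dynamics(gold_probs_per_epoch):
    """Summarize per-instance training dynamics.

    gold_probs_per_epoch[e][i] is the probability the model assigned to the
    gold label of instance i after epoch e (assumed to be logged during
    fine-tuning). Returns one (confidence, variability) pair per instance.
    """
    n_instances = len(gold_probs_per_epoch[0])
    stats = []
    for i in range(n_instances):
        probs = [epoch[i] for epoch in gold_probs_per_epoch]
        # confidence: mean gold-label probability; variability: its spread
        stats.append((mean(probs), pstdev(probs)))
    return stats


def interpolate(x_a, y_a, x_b, y_b, lam):
    """Build one virtual sample as a convex combination of two samples.

    x_* are feature vectors (e.g. sentence embeddings), y_* one-hot labels,
    and lam in [0, 1] is the mixing weight (hypothetical choice here).
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Toy example: two instances tracked over two epochs.
stats = training_dynamics([[0.9, 0.2], [0.9, 0.8]])
# Instance 0 is stable (high confidence, no variability);
# instance 1 fluctuates across epochs (higher variability).
virtual_x, virtual_y = interpolate([1.0, 0.0], [1, 0, 0],
                                   [0.0, 1.0], [0, 0, 1], lam=0.5)
```

In this sketch, instances with differing characteristics (for example one stable and one fluctuating instance) would be paired before interpolation; how the pairs are actually selected is specific to the paper and is not reproduced here.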

Author Biographies

  • Bianca-Ștefania MUȘAT, Quant Risk Analyst at London Stock Exchange Group

  • Cornelia CARAGEA, University of Illinois at Chicago, USA

    Full Professor in the Department of Computer Science at the University of Illinois at Chicago, USA, and Adjunct Associate Professor at Kansas State University, USA

  • Florentina HRISTEA, University of Bucharest, Romania

    Full Professor (Prof. Univ. Dr.) in the Department of Computer Science at the University of Bucharest, Romania

AUBINF63-20

Published

2024-06-06