SIDATA

tweet_irony_detection

URL: https://github.com/paihengxu/tweet_irony_detection

Description: Course project for CS666.

Dataset Creation

The dataset used in this repository is sourced from SemEval-2018 Task 3, which involves sarcasm and irony detection in tweets. Tweets were collected using the Twitter API with hashtags like #irony, #sarcasm, and #not. The corpus contains a total of 4,792 tweets which are split into:

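The corpus supports two tasks: Task A, binary classification of tweets as ironic or non-ironic, and Task B, four-way classification into verbal irony by polarity contrast, other verbal irony, situational irony, and non-ironic.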

Prior to annotation, the corpus was cleaned by removing retweets, duplicates, and non-English tweets.
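A minimal sketch of this kind of cleaning step, assuming retweets and exact duplicates are among the items removed. The function name `clean_corpus`, the `RT @` prefix check, and the sample tweets are illustrative assumptions, not code from the repository. Language filtering is left out because it needs an external language detector.

```python
import html


def clean_corpus(tweets):
    """Drop retweets and duplicate tweets, unescaping XML/HTML entities first.

    Hypothetical helper for illustration only; the actual corpus cleaning
    may have used different rules.
    """
    seen = set()
    cleaned = []
    for text in tweets:
        # Turn escaped entities such as "&amp;" back into plain characters.
        text = html.unescape(text).strip()
        # Assumption: retweets are marked by a leading "RT @user".
        if text.lower().startswith("rt @"):
            continue
        # Compare case-insensitively and ignore whitespace differences.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


raw = [
    "I just love waiting in line for hours #not",
    "RT @someone: I just love waiting in line for hours #not",
    "I just love waiting in line for hours #not",
    "Great weather &amp; great traffic #irony",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Only the first and last sample tweets survive: the second is treated as a retweet and the third is a duplicate of the first.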

Training Methods

Models were trained by pairing several feature generation techniques with two types of classifier:

  1. Feature Generation:
    • Baseline features were created.
    • Behavior-based features were generated.
    • Word representations were taken from BERT, ELMo, Skip-gram, and CBOW embeddings.
  2. Classification Models:
    • Logistic Regression (LR) and Multilayer Perceptron (MLP) were used as classifiers.
    • Several feature and classifier combinations were tried, including:
      • Baseline + Logistic Regression
      • Behavior + Logistic Regression
      • Word embeddings + Logistic Regression
      • Fine-tuned BERT + Logistic Regression
  3. Fine-Tuning BERT:
    • The BERT model was fine-tuned on the preprocessed dataset for tweet classification.
    • Features from the fine-tuned BERT model were combined with the other feature sets to improve performance.
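The "features + Logistic Regression" setups listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's code: bag-of-words counts stand in for the baseline features, the hypothetical `surface_features` function stands in for a second feature block (behavior-based features or an embedding vector), and the toy tweets and labels are invented. A real run would more likely use a library classifier and the SemEval data.

```python
import math
from collections import Counter


def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the tweet."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def surface_features(text):
    """Hypothetical stand-in for a second feature block."""
    has_caps_word = any(tok.isupper() and len(tok) > 1 for tok in text.split())
    return [float(text.count("!")), 1.0 if has_caps_word else 0.0]


def combine(*blocks):
    """Concatenate feature blocks into one input vector."""
    return [value for block in blocks for value in block]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias with plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 (ironic) or 0 (non-ironic)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


# Invented toy data: 1 = ironic, 0 = non-ironic.
train = [
    ("i just love being stuck in traffic", 1),
    ("great another monday morning meeting", 1),
    ("oh great my phone died again!", 1),
    ("the meeting starts at nine", 0),
    ("traffic is light this morning", 0),
    ("my phone is charging", 0),
]
texts = [text for text, _ in train]
labels = [label for _, label in train]
vocab = sorted({word for text in texts for word in text.lower().split()})

X = [combine(bag_of_words(t, vocab), surface_features(t)) for t in texts]
w, b = train_logistic_regression(X, labels)
print([predict(w, b, x) for x in X])
```

Concatenation is the simplest way to combine feature sets: each block can be swapped out (for example, replacing `surface_features` with a sentence vector from a fine-tuned encoder) without changing the classifier code.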

Results

Dataset Files: