View on GitHub

SIDATA

tweet_irony_detection

URL: https://github.com/paihengxu/tweet_irony_detection

Description: Course project for CS666.

Dataset Creation

The dataset used in this repository is sourced from SemEval-2018 Task 3, which involves sarcasm and irony detection in tweets. Tweets were collected using the Twitter API with hashtags like #irony, #sarcasm, and #not. The corpus contains a total of 4,792 tweets which are split into:

Training set (80%)
Testing set (20%)

The corpus includes two tasks:

Task A: Binary classification for irony detection (0 for non-ironic, 1 for ironic).
Task B: Multiclass classification with labels for various types of irony (1 for irony by clash, 2 for situational irony, 3 for other irony, and 0 for non-ironic).

Prior to annotation, the corpus was cleaned by removing:

Retweets
Duplicates
Non-English tweets
Emoji were converted into UTF-8 descriptions (e.g., “:smiling_face:”).

Training Methods

Several feature generation techniques were used for training models:

Feature Generation:
- Baseline features were created.
- Behavior-based features were generated.
- Word embeddings like BERT, ELMo, Skip-gram, and CBOW were used for word representation.
Classification Models:
- Logistic Regression (LR) and Multilayer Perceptron (MLP) were used for training.
- Various combinations of features were experimented with, such as:
  - Baseline + Logistic Regression
  - Behavior + Logistic Regression
  - Word embeddings + Logistic Regression
  - Fine-tuned BERT + Logistic Regression
Fine-Tuning BERT:
- The BERT model was fine-tuned for better tweet classification using the preprocessed dataset.
- The fine-tuned BERT model was combined with other features for improved performance.

Results

Task A (Binary):
- Labels: 0 (non-ironic), 1 (ironic).
- Dataset size: 4,792 tweets with an 80% training and 20% testing split.
Task B (Multiclass):
- Labels: 1 (irony by clash), 2 (situational irony), 3 (other irony), 0 (non-ironic).
- The dataset also includes additional emoji information.
Classification Models:
- Performance metrics for individual combinations of features are available, but specific results (accuracy, F1-score, etc.) would depend on model execution with the provided scripts.

Dataset Files:

Training Data:
- SemEval2018-T3-train-taskA.txt (size: 372 KB)
- SemEval2018-T3-train-taskA_emoji.txt (size: 359 KB)
- SemEval2018-T3-train-taskB.txt (size: 372 KB)
- SemEval2018-T3-train-taskB_emoji.txt (size: 359 KB)
Testing Data:
- SemEval2018-T3_input_test_taskA.txt (size: 76.7 KB)
- SemEval2018-T3_input_test_taskA_emoji.txt (size: 73.8 KB)
- SemEval2018-T3_input_test_taskB.txt (size: 76.7 KB)
- SemEval2018-T3_input_test_taskB_emoji.txt (size: 74.7 KB)