Irony-detection
URL: https://github.com/fatemenajafi135/Irony-detection
Description: Persian Irony Detection, including a Persian dataset, automatic dataset creation, and fine-tuning transformer-based language models for the task.
Dataset Creation
The Persian irony detection data comes from two sources, one labeled manually and one labeled automatically:
- MirasIrony: A manually labeled dataset for Persian irony detection.
- Persian Irony Detection: An automatically labeled dataset generated by crawling Persian tweets from a Telegram channel. The process involves:
  - Crawling: Collecting public messages from Telegram using the Telegram API (via the crawling.py script).
  - Gathering: Combining the crawled data, cleaning it, and saving it into a CSV file (messages.csv).
  - Cleaning: Basic text cleaning and saving the cleaned dataset (messages_cleaned.csv).
  - Labeling: Tweets are labeled based on the most common reactions from users (top-2 reactions).
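The labeling step can be illustrated with a minimal sketch. The source only states that labels come from the top-2 user reactions, so everything else here is an assumption: the two emoji sets, the function names `top_k_reactions` and `label_message`, and the rule that a message is left unlabeled when its top-2 reactions point in both directions are all hypothetical and may differ from what the repository does.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical reaction sets: the source does not say which emojis map to which label.
IRONIC_REACTIONS = {"😂", "🤣"}
NON_IRONIC_REACTIONS = {"👍", "😢", "🙏"}


def top_k_reactions(reactions: Dict[str, int], k: int = 2) -> List[str]:
    """Return the k most frequent reaction emojis, most common first."""
    return [emoji for emoji, _ in Counter(reactions).most_common(k)]


def label_message(reactions: Dict[str, int]) -> Optional[int]:
    """Return 1 (ironic), 0 (non-ironic), or None if the top-2 reactions are ambiguous."""
    top = set(top_k_reactions(reactions, k=2))
    has_ironic = bool(top & IRONIC_REACTIONS)
    has_non_ironic = bool(top & NON_IRONIC_REACTIONS)
    if has_ironic and not has_non_ironic:
        return 1
    if has_non_ironic and not has_ironic:
        return 0
    # Mixed or unknown reactions (or no reactions at all): leave unlabeled.
    return None
```

Under these assumptions, a message whose two most common reactions are both laughing emojis is labeled ironic, while one with mixed top reactions is dropped rather than guessed.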
Training Methods
The dataset is used to train transformer-based language models for Persian irony detection. Fine-tuning is applied to the following models:
- ParsBert vr3
- XLM-RoBERTa-Base
- XLM-RoBERTa-Large
The models are evaluated on accuracy, recall, precision, and F1 score.
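For reference, the four metrics can be computed from gold and predicted labels as in the sketch below. It treats the ironic class (label 1) as the positive class; the source does not say whether the reported scores are per-class, macro-averaged, or weighted, so this choice and the function name `binary_metrics` are assumptions.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy plus precision, recall, and F1 for the chosen positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so empty inputs or empty classes yield 0.0.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Calling `binary_metrics(gold_labels, predictions)` on a held-out split returns all four scores as fractions, which can be multiplied by 100 to compare against percentage figures.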
Results
The table below compares the fine-tuned language models on the Persian dataset:
| Language Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| ParsBert vr3 | 81.3% | 81.4% | 81.3% | 81.3% |
| XLM-RoBERTa-Base | 82.6% | 82.8% | 82.6% | 82.5% |
| XLM-RoBERTa-Large | 84.7% | 84.7% | 84.6% | 84.6% |
Dataset Files:
- Persian_irony_detection.csv: 4.7 MB
- test.csv: 987 KB
- train.csv: 3.81 MB
Dataset Statistics:
- Ironic Tweets: 7,014
- Non-Ironic Tweets: 7,932
- Avg. Tokens per Ironic Tweet: 30
- Avg. Tokens per Non-Ironic Tweet: 45
- Max Tokens per Ironic Tweet: 260
- Max Tokens per Non-Ironic Tweet: 430
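Statistics of this kind could be recomputed with a sketch like the following. It assumes whitespace tokenization and rows given as `(text, label)` pairs with 1 for ironic and 0 for non-ironic; the source does not state which tokenizer or column layout was used, so the numbers it produces may not match the ones listed above. The function name `token_stats` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def token_stats(rows: Iterable[Tuple[str, int]]) -> Dict[int, Dict[str, float]]:
    """Per-label tweet count, mean token count, and max token count."""
    lengths: Dict[int, List[int]] = {}
    for text, label in rows:
        # Assumption: tokens are whitespace-separated chunks.
        lengths.setdefault(label, []).append(len(text.split()))
    return {
        label: {
            "count": len(values),
            "avg_tokens": sum(values) / len(values),
            "max_tokens": max(values),
        }
        for label, values in lengths.items()
    }


# Class balance derived from the counts reported above.
ironic, non_ironic = 7014, 7932
total = ironic + non_ironic
ironic_share = ironic / total
```

With the reported counts, the dataset holds 14,946 tweets in total, of which roughly 46.9% are ironic, so the two classes are close to balanced.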
Sample Tweets:
- Ironic: “پشت یه کامیونه نوشته بود: سلطان خیانت هیدروژن! هم پیوند کوالانسی میگیره هم هیدروژنی! فکر کنم رانندش لیسانس شیمی داشته 🙁😂🤦🏻♂️”
- Non-Ironic: “آره مهاجرت خوبه ولی قشنگترش این بود که همینجا کنار خانواده و دوستامون به خواستههایی که داشتیم برسیم :(”
- Ironic: “تاس کباب داشتیم بابام جفت شیش آورد همهشو خورد”
- Non-Ironic: “مدیون تاول های پامون تو راه اشتباه نباشیم! هر جا که فهمیدیم مسیر درست را انتخاب نکردیم،بدون تردید دور بزنیم و برگردیم!”