
SIDATA

Irony-detection

URL: https://github.com/fatemenajafi135/Irony-detection

Description: Persian Irony Detection, including a Persian dataset, automatic dataset creation, and fine-tuning transformer-based language models for the task.

Dataset Creation

The Persian irony detection dataset was created using two methods:

  1. MirasIrony: A manually labeled dataset for Persian irony detection.
  2. Persian Irony Detection: An automatically labeled dataset generated by crawling Persian tweets from a Telegram channel. The process involves:
    • Crawling: Collecting public messages from Telegram using the Telegram API (via the crawling.py script).
    • Gathering: Combining the crawled data, cleaning it, and saving it into a CSV file (messages.csv).
    • Cleaning: Basic text cleaning and saving the cleaned dataset (messages_cleaned.csv).
    • Labeling: Each tweet is labeled according to the two most common user reactions it received (top-2 reactions).
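
The cleaning and labeling steps could be sketched roughly as follows. This is a minimal illustration, not the repository's code. The function names `clean_text` and `label_from_reactions`, the specific cleaning rules, and the `IRONIC_REACTIONS` emoji set are all assumptions made for this example; the description above only states that basic cleaning is applied and that labels come from the top-2 user reactions.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical set of reactions treated as a signal of irony.
# This mapping is an assumption for illustration; it is not given in the source.
IRONIC_REACTIONS = {"😂", "🤣"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Assumed 'basic cleaning': drop URLs and @mentions, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def label_from_reactions(reactions: Dict[str, int], top_k: int = 2) -> str:
    """Return 'ironic' if any of the top_k most common reactions is in
    IRONIC_REACTIONS, otherwise 'non-ironic'."""
    top = [emoji for emoji, _count in Counter(reactions).most_common(top_k)]
    return "ironic" if any(e in IRONIC_REACTIONS for e in top) else "non-ironic"
```

Under these assumptions, a message whose reaction counts are `{"😂": 10, "👍": 7, "😢": 1}` would be labeled `ironic`, because 😂 is among its two most common reactions.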

Training Methods

The dataset is used to fine-tune transformer-based language models for Persian irony detection. The fine-tuned models are the three compared in the Results section: ParsBert vr3, XLM-RoBERTa-Base, and XLM-RoBERTa-Large.

The models are evaluated using common metrics like accuracy, recall, precision, and F1 score.
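
For reference, the four metrics can be computed from gold and predicted labels as in the sketch below. It treats the task as binary, with `1` (ironic) as the positive class. The averaging scheme behind the reported numbers is not stated here, so that choice is an assumption, and the helper name `binary_metrics` is hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = ironic)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # ironic, predicted ironic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-ironic, predicted ironic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # ironic, predicted non-ironic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-ironic, predicted non-ironic

    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is zero (no positive predictions or labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A call such as `binary_metrics([1, 1, 1, 0], [1, 1, 0, 0])` gives an accuracy of 0.75, a precision of 1.0, and a recall of 2/3.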

Results

The comparison of different fine-tuned language models on the Persian dataset shows the following results:

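| Language Model    | Accuracy | Recall | Precision | F1    |
|-------------------|----------|--------|-----------|-------|
| ParsBert vr3      | 81.3%    | 81.4%  | 81.3%     | 81.3% |
| XLM-RoBERTa-Base  | 82.6%    | 82.8%  | 82.6%     | 82.5% |
| XLM-RoBERTa-Large | 84.7%    | 84.7%  | 84.6%     | 84.6% |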

Dataset Files:

Dataset Statistics:

Sample Tweets:

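  1. Ironic: “پشت یه کامیونه نوشته بود: سلطان خیانت هیدروژن! هم پیوند کوالانسی میگیره هم هیدروژنی! فکر کنم رانندش لیسانس شیمی داشته 🙁😂🤦🏻‍♂️”
     (English: "On the back of a truck it said: the king of betrayal is hydrogen! It forms both covalent bonds and hydrogen bonds! I think the driver had a bachelor's degree in chemistry 🙁😂🤦🏻‍♂️")
  2. Non-Ironic: “آره مهاجرت خوبه ولی قشنگترش این بود که همینجا کنار خانواده و دوستامون به خواسته‌هایی که داشتیم برسیم :(”
     (English: "Yes, emigrating is good, but it would have been nicer if we could achieve what we wanted right here, next to our family and friends :(")
  3. Ironic: “تاس کباب داشتیم بابام جفت شیش آورد همه‌شو خورد”
     (English: "We had tas kabab; my dad rolled double sixes and ate all of it." The joke plays on "tas", which also means "dice".)
  4. Non-Ironic: “مدیون تاول های پامون تو راه اشتباه نباشیم! هر جا که فهمیدیم مسیر درست را انتخاب نکردیم،بدون تردید دور بزنیم و برگردیم!”
     (English: "Let's not feel indebted to the blisters we got on the wrong road! Wherever we realize we did not choose the right path, let's turn around and go back without hesitation!")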