ArSarcasm
URL: https://github.com/iabufarha/ArSarcasm
Description: This repository contains the Arabic sarcasm dataset (ArSarcasm).
Project Overview:
ArSarcasm is a dataset for sarcasm detection in Arabic tweets. It was built from existing Arabic sentiment analysis datasets (SemEval 2017 and ASTD), whose tweets were annotated for sarcasm and dialect and given a new sentiment label.
Dataset:
The dataset contains 10,547 tweets, of which 1,682 (16%) are labeled as sarcastic. It is distributed in CSV format with an 80/20 train-test split:
- Training set: 8,437 tweets
- Test set: 2,110 tweets
Dataset Fields:
- tweet: The original tweet text.
- sarcasm: Boolean indicating whether the tweet is sarcastic.
- sentiment: Sentiment label newly annotated for ArSarcasm (positive, negative, or neutral).
- original_sentiment: Sentiment from the original dataset annotation.
- source: The original dataset (SemEval or ASTD).
- dialect: The Arabic dialect used in the tweet, one of:
  - msa: Modern Standard Arabic
  - egypt: Egyptian and Sudanese Arabic
  - levant: Levantine Arabic (Palestine, Jordan, Syria, Lebanon)
  - gulf: Gulf Arabic (Saudi Arabia, UAE, Qatar, etc.)
  - magreb: North African Arabic (Algeria, Libya, Tunisia, Morocco)
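The fields above can be read with Python's standard csv module. The following is a minimal sketch, not code from the repository. It assumes that the CSV header uses exactly the field names listed above, that the sarcasm column is stored as the text True or False, and that dialect labels are lowercase. The helper names parse_row and read_rows, and the two sample rows, are hypothetical placeholders.

```python
import csv
import io

# Dialect labels as listed in the field description.
DIALECTS = {"msa", "egypt", "levant", "gulf", "magreb"}


def parse_row(row):
    """Convert one raw CSV row (a dict of strings) into typed fields."""
    dialect = row["dialect"].strip().lower()
    if dialect not in DIALECTS:
        raise ValueError(f"unexpected dialect label: {dialect!r}")
    return {
        "tweet": row["tweet"],
        # Assumption: the boolean is serialized as the text "True"/"False".
        "sarcasm": row["sarcasm"].strip().lower() == "true",
        "sentiment": row["sentiment"].strip().lower(),
        "original_sentiment": row["original_sentiment"].strip().lower(),
        "source": row["source"].strip(),
        "dialect": dialect,
    }


def read_rows(handle):
    """Read every row from an open CSV file handle."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Placeholder rows for illustration only; these are not real dataset entries.
SAMPLE = """dialect,sarcasm,sentiment,original_sentiment,tweet,source
msa,False,neutral,neutral,placeholder tweet one,semeval
egypt,True,negative,positive,placeholder tweet two,astd
"""

rows = read_rows(io.StringIO(SAMPLE))
print(len(rows), rows[1]["sarcasm"], rows[1]["dialect"])
```

To read the real training file, the in-memory sample could be replaced with open("ArSarcasm_train.csv", encoding="utf-8", newline=""), assuming the file sits in the working directory.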
Dataset Usage:
The dataset is intended for sarcasm detection research. Because each tweet also carries sentiment and dialect labels, it can support other Arabic NLP tasks such as sentiment analysis and dialect identification.
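As one illustration of combining the labels, the sketch below computes the share of sarcastic tweets per dialect. It is an assumption-laden example rather than an analysis from the repository: the function name sarcasm_rate_by_dialect is hypothetical, rows are assumed to be dicts with a string dialect and a boolean sarcasm value, and the toy rows are placeholders, not dataset entries.

```python
from collections import Counter


def sarcasm_rate_by_dialect(rows):
    """Return {dialect: fraction of rows in that dialect marked sarcastic}."""
    totals = Counter()
    sarcastic = Counter()
    for row in rows:
        totals[row["dialect"]] += 1
        if row["sarcasm"]:
            sarcastic[row["dialect"]] += 1
    return {dialect: sarcastic[dialect] / totals[dialect] for dialect in totals}


# Toy input for illustration only.
toy_rows = [
    {"dialect": "msa", "sarcasm": False},
    {"dialect": "msa", "sarcasm": True},
    {"dialect": "egypt", "sarcasm": True},
]

rates = sarcasm_rate_by_dialect(toy_rows)
print(rates)
```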
Dataset Statistics:
- Total tweets: 10,547
- Sarcastic tweets: 1,682 (16%)
- Train set size: 8,437
- Test set size: 2,110
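The figures above are mutually consistent, which a few lines of arithmetic can confirm. The constant names are illustrative; the numbers are the ones reported above.

```python
TOTAL = 10_547
SARCASTIC = 1_682
TRAIN = 8_437
TEST = 2_110

# The two splits should account for every tweet.
assert TRAIN + TEST == TOTAL

sarcastic_share = SARCASTIC / TOTAL  # reported as 16% after rounding
train_share = TRAIN / TOTAL          # reported as the 80 in "80/20"

print(f"sarcastic: {sarcastic_share:.1%}, train: {train_share:.1%}")
# prints "sarcastic: 15.9%, train: 80.0%"
```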
Training Methods:
No specific training methods are provided in the repository.
Results:
No specific results or performance metrics are provided in the repository.
Dataset Files:
- ArSarcasm_train.csv (1.7 MB)
- ArSarcasm_test.csv (435 KB)