View on GitHub

SIDATA

Automatic Sarcasm Detection

URL: https://github.com/EducationalTestingService/sarcasm

Description

This repository contains datasets and research materials related to sarcasm detection, including data from the 2nd FigLang Workshop at ACL 2020. It provides shared tasks and benchmarks for sarcasm detection in Twitter and Reddit conversations.

Dataset

The repository includes training and testing datasets for sarcasm detection in Twitter and Reddit, provided in JSONL format.

Data Format

Each entry in the dataset contains:

label: "SARCASM" or "NOT_SARCASM"
id: Unique identifier for each sample (used for submission tracking)
response: The sarcastic Tweet or Reddit post
context: Ordered conversation context, where each message is a reply to the previous one

Dataset Statistics

| Platform | Train Size | Test Size |
|———-|———–|———-|
| Reddit | 4,400 | 1,800 |
| Twitter | 5,000 | 1,800 |

Shared Task & Evaluation

Participants were provided with training data (response + context + label).
Test data included the response and context, but without the label.
Submissions had to predict sarcasm labels for the test samples.

References

Main Paper:

A Report on the 2020 Sarcasm Detection Shared Task
- Debanjan Ghosh, Avijit Vajpyee, Smaranda Muresan
- Proceedings of the Second Workshop on Figurative Language Processing

Ethical Considerations

Potentially Controversial Content:

The dataset is collected from social media, so it may contain informal, politically charged, or controversial discussions.
The training data has been lightly pre-processed to remove offensive content, but some sensitive material may still be present.

Implementation & Results

No specific model implementations are included in this repository.
There are no pre-trained models, evaluation metrics, or performance benchmarks.