SIDATA

S3D

URL: https://github.com/surrey-nlp/S3D

Description:
This repository contains sarcasm-annotated datasets, along with notebooks that show how to use the fine-tuned language models. The accompanying paper, “Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset,” was presented at an EMNLP 2022 workshop.

Datasets

The repository provides three datasets focused on sarcasm detection in Twitter data:

SAD: Contains Tweet IDs and sarcasm labels for 2,340 manually annotated tweets collected using the #sarcasm hashtag.
- Size: 50.3 KB
- Available on HuggingFace
S3D-v1: Contains Tweet IDs for 100,000 tweets labeled automatically by a fine-tuned BERTweet model. This model was trained on a corpus of over 1 million tweets and Reddit comments labeled for sarcasm in previous studies.
- Size: 2.1 MB
- Available on HuggingFace
S3D-v2: Contains Tweet IDs for 100,000 tweets labeled automatically by an ensemble of the three best fine-tuned sarcasm detection models.
- Available on HuggingFace
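
The description of S3D-v2 says only that an ensemble of three models produced the labels; it does not say how their predictions were combined. As a minimal sketch, assuming a simple majority vote over binary labels (1 = sarcastic, 0 = not sarcastic), the combination step might look like the following. The `Labeler` type, the `majority_vote` function, and the toy keyword labelers are all hypothetical stand-ins, not the repository's code.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical type: a labeler maps tweet text to 1 (sarcastic) or 0 (not sarcastic).
Labeler = Callable[[str], int]


def majority_vote(text: str, labelers: Sequence[Labeler]) -> int:
    """Return the label chosen by most labelers.

    Majority voting is an assumed combination rule; the source does not
    state how the three models were ensembled.
    """
    if not labelers:
        raise ValueError("at least one labeler is required")
    votes = Counter(labeler(text) for labeler in labelers)
    # With an odd number of binary labelers there is always a strict majority.
    return votes.most_common(1)[0][0]


# Toy stand-ins for the three fine-tuned models (illustration only).
toy_labelers = [
    lambda t: int("#sarcasm" in t.lower()),
    lambda t: int("yeah right" in t.lower()),
    lambda t: int(t.endswith("...")),
]

sarcastic = majority_vote("Oh yeah right, great day #sarcasm", toy_labelers)
plain = majority_vote("Nice weather today", toy_labelers)
```

In practice each toy labeler would be replaced by a call to one of the fine-tuned classifiers, with its output mapped to 0 or 1.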

Experiments

The repository includes Python notebooks that demonstrate the dataset labeling process and reproduce the experiments used to create S3D-v1 and S3D-v2. The fine-tuned models are loaded from HuggingFace.

Models Used

Model	Fine-tuned Version	Description
BERTweet	bertweet-base-finetuned-SARC-combined-DS	Fine-tuned on a combined dataset for sarcasm detection
BERTweet	bertweet-base-finetuned-SARC-DS	Fine-tuned specifically on the SARC dataset
RoBERTa_large	roberta-large-finetuned-SARC-combined-DS	RoBERTa-large model fine-tuned on the combined dataset
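
Since the models are loaded via HuggingFace, a short sketch of how one of the fine-tuned versions listed above might be loaded with the `transformers` text-classification pipeline is shown here. The Hub namespace `surrey-nlp` is an assumption inferred from the GitHub URL, and the helper names `hub_id` and `load_classifier` are hypothetical; check the model pages for the exact identifiers and output label names.

```python
# Assumption: the models are published under the same namespace as the GitHub organisation.
HUB_NAMESPACE = "surrey-nlp"

# Fine-tuned versions as listed in the table above.
FINE_TUNED_MODELS = [
    "bertweet-base-finetuned-SARC-combined-DS",
    "bertweet-base-finetuned-SARC-DS",
    "roberta-large-finetuned-SARC-combined-DS",
]


def hub_id(model_name: str, namespace: str = HUB_NAMESPACE) -> str:
    """Build a Hub identifier of the form 'namespace/model-name'."""
    return f"{namespace}/{model_name}"


def load_classifier(model_name: str):
    """Create a text-classification pipeline for one fine-tuned model.

    The import is done lazily so that this module can be imported without
    transformers installed; calling this function downloads the model weights.
    """
    from transformers import pipeline

    return pipeline("text-classification", model=hub_id(model_name))


model_ids = [hub_id(name) for name in FINE_TUNED_MODELS]
```

Calling `load_classifier(FINE_TUNED_MODELS[0])` and passing it a tweet string would return a list of dictionaries with `label` and `score` keys; the label strings depend on each model's configuration.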
