SIDATA

News-Headlines-Dataset-For-Sarcasm-Detection

URL: https://github.com/rishabhmisra/News-Headlines-Dataset-For-Sarcasm-Detection

Description:
This is a high-quality dataset for the task of Sarcasm Detection. It is collected from two major news websites, TheOnion (for sarcastic headlines) and HuffPost (for non-sarcastic headlines), and aims to provide better quality data for sarcasm detection compared to noisy Twitter datasets.

Methods

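The dataset is collected from two sources: TheOnion, which supplies the sarcastic headlines, and HuffPost, which supplies the non-sarcastic headlines.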

By using formal language and self-contained headlines, the dataset avoids issues present in Twitter-based datasets, such as noisy labels and lack of context in replies.
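As a rough illustration of how records from the two sources could be separated, the sketch below parses JSON Lines text and splits headlines by label. It assumes each record is a JSON object with `is_sarcastic`, `headline`, and `article_link` fields; this page does not state the schema, so treat those field names as assumptions. The function names and sample records are placeholders, not real data.

```python
import json
from typing import Iterable


def load_headlines(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text into record dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def split_by_label(records: list[dict]) -> tuple[list[str], list[str]]:
    """Return (sarcastic_headlines, non_sarcastic_headlines).

    Assumes `is_sarcastic` is 1 for sarcastic and 0 otherwise (an assumption,
    not something this page confirms).
    """
    sarcastic = [r["headline"] for r in records if r["is_sarcastic"] == 1]
    non_sarcastic = [r["headline"] for r in records if r["is_sarcastic"] == 0]
    return sarcastic, non_sarcastic


# Placeholder records for illustration only; not taken from the dataset.
sample_lines = [
    '{"is_sarcastic": 1, "headline": "placeholder sarcastic headline", "article_link": "https://example.com/a"}',
    '{"is_sarcastic": 0, "headline": "placeholder plain headline", "article_link": "https://example.com/b"}',
]

records = load_headlines(sample_lines)
sarcastic, non_sarcastic = split_by_label(records)
```

Because `load_headlines` accepts any iterable of strings, an open file handle can be passed in directly in place of `sample_lines`.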

Results

Dataset

Statistics

| Statistic/Dataset | Headlines | Semeval |
|---|---|---|
| # Records | 28,619 | 3,000 |
| # Sarcastic records | 13,635 | 2,396 |
| # Non-sarcastic records | 14,984 | 604 |
| % of pre-trained word embeddings not available | 23.35 | 35.53 |
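The class balance implied by these counts can be checked with a few lines of arithmetic. The sketch below uses only the record counts from the statistics above; the dictionary layout and the function name `sarcastic_share` are illustrative choices.

```python
# Record counts copied from the statistics table.
counts = {
    "Headlines": {"sarcastic": 13_635, "non_sarcastic": 14_984},
    "Semeval": {"sarcastic": 2_396, "non_sarcastic": 604},
}


def sarcastic_share(dataset: str) -> float:
    """Return the percentage of records labeled sarcastic in the named dataset."""
    c = counts[dataset]
    total = c["sarcastic"] + c["non_sarcastic"]
    return 100.0 * c["sarcastic"] / total


for name in counts:
    print(f"{name}: {sarcastic_share(name):.1f}% sarcastic")
# prints:
# Headlines: 47.6% sarcastic
# Semeval: 79.9% sarcastic
```

By this measure the Headlines dataset is close to balanced, while the Semeval dataset is dominated by sarcastic records.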

WordClouds

Word clouds are provided to visually explore the most frequent terms used in sarcastic vs. non-sarcastic headlines.
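A word cloud is driven by term frequencies, so the underlying counts can be approximated with the standard library alone. The following is a minimal sketch, not the procedure used to produce the word clouds here: the tokenizer, the small stopword set, and the sample headlines are all assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately tiny stopword set (an assumption); a real analysis would use a fuller list.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "for", "and", "on", "is"}


def top_terms(headlines: list[str], n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent non-stopword terms across the headlines."""
    counter: Counter = Counter()
    for headline in headlines:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", headline.lower())
        counter.update(t for t in tokens if t not in STOPWORDS)
    return counter.most_common(n)


# Placeholder headlines for illustration only; not taken from the dataset.
sample = ["Area man wins the award", "Area man loses an award"]
frequent = top_terms(sample, 3)
```

Running `top_terms` separately on the sarcastic and non-sarcastic headlines gives two frequency lists that can be compared directly or passed to a plotting tool.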

Acknowledgements

This dataset was collected from TheOnion and HuffPost.