Public NLP Datasets

This list is a work in progress. Suggestions are most welcome; if you have any additions or corrections, please email me.


Stanford Large Network Dataset Collection

Contains online review datasets suitable for sentiment analysis and text mining applications, including classification and clustering using machine learning.
Twitter and Memetracker: Memetracker phrases, links, and 467 million tweets
Online communities: Data from online communities such as Reddit and Flickr
Online reviews: Data from online review systems such as BeerAdvocate and Amazon
SNAP networks are also available from the SuiteSparse Matrix Collection by Tim Davis.