Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to 'encourage the creation and distribution of eBooks'. It was founded in 1971 by Michael S. Hart and is the oldest digital library. This dataset is a collection of the 1000 most popular books on Project Gutenberg, as determined by download counts. Each book has information about its authorship, publication date, Library of Congress classification, and a few other fields. It also includes some simple computed statistics based on common metrics such as sentiment analysis, Flesch-Kincaid reading level, and average sentence length.
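
As a rough illustration of two of the metrics mentioned above, the sketch below computes average sentence length and the standard Flesch-Kincaid grade level from raw counts. It assumes the word, sentence, and syllable counts are already available; the function names are hypothetical, and the dataset's own values may have been computed with different tokenization rules.

```python
def average_sentence_length(total_words, total_sentences):
    """Average number of words per sentence."""
    return total_words / total_sentences


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Standard Flesch-Kincaid grade level formula.

    Assumption: the counts are supplied by the caller; how the dataset
    counted words, sentences, and syllables is not stated in this document.
    """
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```

A higher grade level indicates a text that is harder to read, since longer sentences and longer words both push the score up.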



This dataset has no indexes, so you cannot use it in a bar chart.


The following are research questions to explore about this dataset.

  1. What is the distribution of polarity and subjectivity?
  2. Is there a correlation between books' difficulty and their popularity?
  3. In what century were most of the books published?
  4. What is the relationship between a book's rank and its number of downloads?
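
A minimal sketch of how questions 3 and 4 might be approached is shown below, using only the Python standard library. The record layout and the field names (`rank`, `downloads`, `year`) are assumptions made for illustration, and the sample values are invented placeholders, not rows from the dataset.

```python
from collections import Counter
from math import sqrt

# Invented placeholder records; field names and values are NOT from the dataset.
books = [
    {"rank": 1, "downloads": 900, "year": 1850},
    {"rank": 2, "downloads": 500, "year": 1890},
    {"rank": 3, "downloads": 450, "year": 1920},
    {"rank": 4, "downloads": 200, "year": 1780},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def century(year):
    """Map a year to its century, e.g. 1900 -> 19 and 1901 -> 20."""
    return (year - 1) // 100 + 1


def most_common_century(records):
    """Return the century in which the most records were published."""
    counts = Counter(century(r["year"]) for r in records)
    return counts.most_common(1)[0][0]


# Question 4: relationship between rank and number of downloads.
rank_vs_downloads = pearson(
    [b["rank"] for b in books],
    [b["downloads"] for b in books],
)

# Question 3: century with the most publications.
top_century = most_common_century(books)
```

The same `pearson` helper could be applied to question 2 by pairing a difficulty score with the download count, and a histogram of the polarity and subjectivity values would be a natural starting point for question 1.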