Overview

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to 'encourage the creation and distribution of eBooks'. It was founded in 1971 by Michael S. Hart and is the oldest digital library. This dataset is a collection of the 1000 most popular books on Project Gutenberg, as determined by download counts. Each book record includes its authorship, publication date, Library of Congress classification, and a few other fields. Each record also includes simple computed statistics, such as sentiment analysis scores, the Flesch-Kincaid reading level, and average sentence length.

https://www.gutenberg.org/ebooks/search/?sort_order=downloads

Downloads

Download all of the following files.
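
Once downloaded, the data can be read with any JSON library. The sketch below is a minimal Python example, assuming the file is a JSON array of book records shaped like the paths listed under Field Descriptions. The file name `classics.json` and the helper names `load_books` and `year_or_none` are hypothetical. To stay self-contained, the example writes a one-record sample to a temporary file instead of reading a real download.

```python
import json
import os
import tempfile

# One abbreviated record, using values from the Field Descriptions examples.
SAMPLE = [{
    "bibliography": {
        "title": "Pride and Prejudice",
        "author": {"name": "Austen, Jane", "birth": 1775, "death": 1817},
        "publication": {"year": 1998, "month": 6, "day": 1},
    },
    "metadata": {"id": 1342, "rank": 1, "downloads": 36576},
}]


def load_books(path):
    """Read the whole dataset: a JSON array with one object per book."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def year_or_none(value):
    """Missing years are coded as 0 in the dataset; map them to None."""
    return value if value != 0 else None


# Stand-in for the downloaded file (the file name is hypothetical).
path = os.path.join(tempfile.mkdtemp(), "classics.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(SAMPLE, handle)

books = load_books(path)
first = books[0]                                   # path prefix [0]
title = first["bibliography"]["title"]             # [0].bibliography.title
rank = first["metadata"]["rank"]                   # [0].metadata.rank
birth = year_or_none(first["bibliography"]["author"]["birth"])
print(title, rank, birth)
```

Each segment of a JSON path maps to one dictionary key or list index. Converting the "0" placeholder to `None` keeps unknown years from being mistaken for real values in later calculations.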

Field Descriptions

| JSON Path | Type | Comment | Example Value |
| --- | --- | --- | --- |
| [0].metadata.url | unicode | | https://www.gutenberg.org/ebooks/1342 |
| [0].metadata.downloads | int | The number of times this book has been downloaded from Project Gutenberg, as of the last update (circa Spring 2016). | 36576 |
| [0].metadata.id | int | Every book on Project Gutenberg has a unique ID number. You can use this number to look up the book on Project Gutenberg (e.g., book 110 is at http://www.gutenberg.org/ebooks/110). | 1342 |
| [0].metadata.rank | int | The rank of this book relative to other books on Project Gutenberg, measured by number of downloads. A lower rank indicates that the book is more popular. | 1 |
| [0].metadata.formats | dict | | {u'total': 8, u'types': [u'text/plain', u'text/plain; charset=us-ascii', u'application/pdf', u'application/x-mobipocket-ebook', u'application/zip', u'application/rdf+xml', u'application/epub+zip', u'text/html; charset=us-ascii']} |
| [0].metrics.statistics.polysyllables | int | The number of words that have 3 or more syllables. | 4603 |
| [0].metrics.statistics.characters | int | The number of characters (letters and symbols) in the text, not the number of people in the story. | 586794 |
| [0].metrics.statistics.average sentence length | float | | 18.0 |
| [0].metrics.statistics.words | int | | 121533 |
| [0].metrics.statistics.sentences | int | | 6511 |
| [0].metrics.statistics.syllables | float | | 170648.1 |
| [0].metrics.statistics.average sentence per word | float | | 0.05 |
| [0].metrics.statistics.average letter per word | float | | 4.83 |
| [0].metrics | dict | | {u'difficulty': {u'flesch reading ease': 70.13, u'automated readability index': 10.7, u'coleman liau index': 10.73, u'flesch kincaid grade': 7.9, u'linsear write formula': 13.5, u'dale chall readability score': 5.7, u'gunning fog': 9.200000000000001, u'smog index': 3.1, u'difficult words': 9032}, u'statistics': {u'polysyllables': 4603, u'characters': 586794, u'average sentence length': 18.0, u'words': 121533, u'sentences': 6511, u'syllables': 170648.1, u'average sentence per word': 0.05, u'average letter per word': 4.83}, u'sentiments': {u'polarity': 0.13671337760500446, u'subjectivity': 0.5222391494704692}} |
| [0].bibliography | dict | | {u'publication': {u'month': 6, u'month name': u'June', u'full': u'June, 1998', u'day': 1, u'year': 1998}, u'author': {u'death': 1817, u'name': u'Austen, Jane', u'birth': 1775}, u'title': u'Pride and Prejudice', u'languages': [u'en'], u'subjects': [u'Sisters -- Fiction', u'Courtship -- Fiction', u'Social classes -- Fiction', u'England -- Fiction', u'Domestic fiction', u'Young women -- Fiction', u'Love stories'], u'congress classifications': [u'PR'], u'type': u'Text'} |
| [0].metadata | dict | | {u'url': u'https://www.gutenberg.org/ebooks/1342', u'downloads': 36576, u'id': 1342, u'rank': 1, u'formats': {u'total': 8, u'types': [u'text/plain', u'text/plain; charset=us-ascii', u'application/pdf', u'application/x-mobipocket-ebook', u'application/zip', u'application/rdf+xml', u'application/epub+zip', u'text/html; charset=us-ascii']}} |
| [0].bibliography.author.death | int | The recorded year of the author's death. If the death year is unknown, it is recorded as "0". | 1817 |
| [0].bibliography.author.name | unicode | | Austen, Jane |
| [0].bibliography.author.birth | int | The recorded birth year of the author. If the birth year is unknown, it is recorded as "0". | 1775 |
| [0].metrics.difficulty | dict | | {u'flesch reading ease': 70.13, u'automated readability index': 10.7, u'coleman liau index': 10.73, u'flesch kincaid grade': 7.9, u'linsear write formula': 13.5, u'dale chall readability score': 5.7, u'gunning fog': 9.200000000000001, u'smog index': 3.1, u'difficult words': 9032} |
| [0].metrics.statistics | dict | | {u'polysyllables': 4603, u'characters': 586794, u'average sentence length': 18.0, u'words': 121533, u'sentences': 6511, u'syllables': 170648.1, u'average sentence per word': 0.05, u'average letter per word': 4.83} |
| [0].metrics.sentiments | dict | | {u'polarity': 0.13671337760500446, u'subjectivity': 0.5222391494704692} |
| [0].metadata.formats.total | int | Project Gutenberg makes books available in a wide variety of file formats, including raw text files, HTML web pages, audio books, etc. This field gives the number of formats in which this book is available. | 8 |
| [0].metadata.formats.types | list[unicode] | | [u'text/plain', u'text/plain; charset=us-ascii', u'application/pdf', u'application/x-mobipocket-ebook', u'application/zip', u'application/rdf+xml', u'application/epub+zip', u'text/html; charset=us-ascii'] |
| [0].metrics.sentiments.polarity | float | Sentiment analysis attempts to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. Polarity in particular refers to how positive or negative the author is towards the content. | 0.136713377605 |
| [0].metrics.sentiments.subjectivity | float | Sentiment analysis attempts to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. Subjectivity (as opposed to objectivity) in particular refers to whether the text is opinionated or attempts to stay factual. | 0.52223914947 |
| [0].bibliography.publication | dict | | {u'month': 6, u'month name': u'June', u'full': u'June, 1998', u'day': 1, u'year': 1998} |
| [0].bibliography.title | unicode | | Pride and Prejudice |
| [0].bibliography.author | dict | | {u'death': 1817, u'name': u'Austen, Jane', u'birth': 1775} |
| [0].bibliography.languages | list[unicode] | | [u'en'] |
| [0].bibliography.subjects | list[unicode] | | [u'Sisters -- Fiction', u'Courtship -- Fiction', u'Social classes -- Fiction', u'England -- Fiction', u'Domestic fiction', u'Young women -- Fiction', u'Love stories'] |
| [0].bibliography.congress classifications | list[unicode] | | [u'PR'] |
| [0].bibliography.type | unicode | | Text |
| [0].metrics.difficulty.flesch reading ease | float | The Flesch Reading Ease score combines sentence length (number of words per sentence) and the number of syllables per word in an equation. Texts with a very high Flesch Reading Ease score (about 100) are very easy to read: they have short sentences and no words of more than two syllables. | 70.13 |
| [0].metrics.difficulty.automated readability index | float | The Automated Readability Index is a number indicating the understandability of the text. This number is an approximate US grade level needed to comprehend the text, calculated using the characters per word and words per sentence. | 10.7 |
| [0].metrics.difficulty.coleman liau index | float | The Coleman Liau Index is a number indicating the understandability of the text. This number is an approximate US grade level needed to comprehend the text, calculated using characters instead of syllables, similar to the Automated Readability Index. | 10.73 |
| [0].metrics.difficulty.gunning fog | float | The Gunning Fog Index measures the readability of English writing. The index estimates the years of formal education needed to understand the text on a first reading. The formula uses the ratio of words to sentences and the percentage of words that are complex (i.e., have three or more syllables). | 9.2 |
| [0].metrics.difficulty.linsear write formula | float | Linsear Write is a readability metric for English text, purportedly developed for the United States Air Force to help calculate the readability of its technical manuals. It was designed to calculate the US grade level of a text sample based on sentence length and the number of words that have three or more syllables. | 13.5 |
| [0].metrics.difficulty.dale chall readability score | float | The Dale Chall Readability Score provides a numeric gauge of the comprehension difficulty that readers encounter when reading a text. It uses a list of 3000 words that groups of fourth-grade American students could reliably understand, considering any word not on that list to be difficult. This number is an approximate US grade level needed to comprehend the text. | 5.7 |
| [0].metrics.difficulty.flesch kincaid grade | float | The Flesch-Kincaid Grade Level formula presents a score as a US grade level, making it easier to interpret. It uses a formula similar to the Flesch Reading Ease measure. | 7.9 |
| [0].metrics.difficulty.smog index | float | The SMOG grade is a measure of readability that estimates the years of education needed to understand a piece of writing. SMOG is an acronym for "Simple Measure of Gobbledygook". Its formula is based on the number of polysyllables (words with three or more syllables) and the number of sentences. | 3.1 |
| [0].metrics.difficulty.difficult words | int | The number of words in the text that are considered "difficult"; that is, they are not on a list of 3000 words that are considered understandable by fourth-grade American students. | 9032 |
| [0].bibliography.publication.month name | unicode | | June |
| [0].bibliography.publication.full | unicode | | June, 1998 |
| [0].bibliography.publication.year | int | The year when the book was published according to Project Gutenberg. Keep in mind that this may not be the original publication date of the work, only that of this particular edition. Missing values are coded as "0". | 1998 |
| [0].bibliography.publication.day | int | The day of the month when the book was published. Missing values are coded as "0". | 1 |
| [0].bibliography.publication.month | int | The month of the year when the book was published; 1 corresponds to January, 2 to February, etc. Missing values are coded as "0". | 6 |
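
As a worked example of how the difficulty scores relate to the statistics fields, the sketch below applies the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas to the example counts above. It is an illustration only: this page does not say exactly how the stored scores were computed, so the results come out close to, but not necessarily identical to, the stored example values of 70.13 and 7.9. The function names are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Example values from the [0].metrics.statistics fields.
stats = {"words": 121533, "sentences": 6511, "syllables": 170648.1}

ease = flesch_reading_ease(**stats)
grade = flesch_kincaid_grade(**stats)
print(round(ease, 2), round(grade, 2))
```

Both formulas use the same two ratios, words per sentence and syllables per word, which is why the two scores move in opposite directions: longer sentences and longer words lower the reading ease and raise the grade level.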