Overview

From the United States Cancer Statistics, part of the U.S. Centers for Disease Control and Prevention, the following data set focuses on the crude rate for all types of cancer reported for different demographic groups. Significant groupings include age, sex, race, and geographic area.

http://www.cdc.gov/cancer/npcr/uscs/download_data.htm

Explore Structure




Index Type Example Value
0 dict { }
... ... ...
Keys of the "Age" dictionary:
Key Type Example Value Comment
"Age Adjusted Rate" float 165.5 A number representing the expected cancer rate, adjusted for the age of the participants. An age-adjusted rate is a weighted average of the age-specific rates, where the weights are the proportions of persons in the corresponding age groups of a standard population. The potential confounding effect of age is reduced when comparing age-adjusted rates computed using the same standard population.
"Age Adjusted CI Lower" float 160.6 The lower bound of the confidence interval (CI) for the age-adjusted rate. It is unlikely that the actual rate is lower than this number.
"Age Adjusted CI Upper" float 170.5 The upper bound of the confidence interval (CI) for the age-adjusted rate. It is unlikely that the actual rate is higher than this number.
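Keys of each report dictionary:
Key Type Example Value Comment
"Age" dict { } The age-adjusted rate and its confidence interval (the "Age Adjusted ..." keys).
"Year" int 1999 The 4-digit year that this report covers.
"Data" dict { } The count, crude rate, population, and demographic group for this report.
"Area" str "Alabama" The area of the country (typically the name of the state) for this report.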
Keys of the "Data" dictionary:
Key Type Example Value Comment
"Count" int 4366 The number of cancer events recorded for this particular group (see "Event Type").
"Crude Rate" float 190.4 The number of events per 100,000 people: the Count divided by the Population, times 100,000. Expressing the count as a rate makes it easy to compare cancer rates between different locations and over time.
"Crude CI Upper" float 196.1 The upper bound of the confidence interval (CI) for the Crude Rate. It is unlikely that the actual rate is higher than this number.
"Crude CI Lower" float 184.8 The lower bound of the confidence interval (CI) for the Crude Rate. It is unlikely that the actual rate is lower than this number.
"Sex" str "Female" The sex of the people in this particular report.
"Race" str "All Races" The races included in this particular report.
"Event Type" str "Mortality" The type of event counted in this report, such as deaths from cancer ("Mortality") as opposed to new diagnoses.
"Population" int 2293259 The size of the population covered by this report.

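The two rate definitions in the tables above can be checked with a little arithmetic. The sketch below is not part of the cancer library: the function names are hypothetical, and the per-100,000 scaling is an assumption that happens to reproduce the "Crude Rate" example from the "Count" and "Population" examples. The age groups and numbers in the second half are made up purely to show how the weighted average works.

```python
def crude_rate(count, population, per=100_000):
    """Events per `per` people: the count scaled by population size."""
    return count / population * per


def age_adjusted_rate(age_specific_rates, standard_population):
    """Weighted average of age-specific rates.

    Both arguments are dicts keyed by age group. Each group's weight is its
    share of the standard population.
    """
    total = sum(standard_population.values())
    return sum(
        rate * standard_population[group] / total
        for group, rate in age_specific_rates.items()
    )


# Example values copied from the "Count" and "Population" rows above.
computed = crude_rate(count=4366, population=2293259)
print(round(computed, 1))  # 190.4, matching the "Crude Rate" example

# Made-up age groups and numbers, for illustration only (not from the dataset).
rates = {"0-39": 20.0, "40-64": 200.0, "65+": 1000.0}  # events per 100,000
standard = {"0-39": 550_000, "40-64": 320_000, "65+": 130_000}

adjusted = age_adjusted_rate(rates, standard)
print(round(adjusted, 1))  # 205.0
```

Because the weights come from a fixed standard population rather than from each area's own age mix, two areas with the same age-specific rates get the same adjusted rate even if one has an older population.
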
Downloads

Download all of the following files.

Usage

This library has 3 functions you can use.
import cancer
list_of_report = cancer.get_reports()
list_of_report = cancer.get_reports_by_year(1999)
list_of_report = cancer.get_reports_by_area("Alabama")
Additionally, the functions can return a small sample of the full dataset when given an extra argument, test=True. Working with the sample may be much faster. When you are sure your code is correct, you can remove the argument to use the full dataset.
import cancer
# test=True returns a sample, so these should be fast. Without it, they may be slow!
list_of_report = cancer.get_reports(test=True)
list_of_report = cancer.get_reports_by_year(1999, test=True)
list_of_report = cancer.get_reports_by_area("Alabama", test=True)
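Once a list of reports is loaded, a common next step is to narrow it to one demographic group. The following is a minimal sketch, not a function of the cancer library: filter_reports is a hypothetical helper, and it assumes that "Sex", "Race", and "Event Type" sit inside each report's "Data" dictionary. A hand-built list stands in for the library call so the example runs on its own.

```python
def filter_reports(reports, criteria):
    """Return the reports whose "Data" fields match every key/value in criteria."""
    return [
        report
        for report in reports
        if all(report["Data"].get(key) == value for key, value in criteria.items())
    ]


# Stand-in for cancer.get_reports(test=True), so the sketch runs without the library.
sample_reports = [
    # Built from the example values listed under Explore Structure.
    {"Year": 1999, "Area": "Alabama",
     "Data": {"Sex": "Female", "Race": "All Races",
              "Event Type": "Mortality", "Count": 4366}},
    # Hypothetical report with a placeholder count, present only to be filtered out.
    {"Year": 1999, "Area": "Alabama",
     "Data": {"Sex": "Male", "Race": "All Races",
              "Event Type": "Mortality", "Count": 0}},
]

female_mortality = filter_reports(
    sample_reports, {"Sex": "Female", "Event Type": "Mortality"}
)
print(len(female_mortality))  # 1
```

The criteria are passed as a dictionary rather than keyword arguments because keys such as "Event Type" contain spaces.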

Documentation

 cancer.get_reports(test=False)

Returns the cancer reports in the dataset. If test is True, returns a sample of the reports instead.

 cancer.get_reports_by_year(year, test=False)

Given a year, returns all the cancer reports for that year in the database.

 cancer.get_reports_by_area(area, test=False)

Given an area, returns all the cancer reports for that area in the database.
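
The lists these functions return can be summarized with ordinary Python. As a rough sketch, the hypothetical helper below totals the "Count" field per year. It assumes "Count" lives inside each report's "Data" dictionary, and it uses hand-written reports with placeholder counts in place of a call such as cancer.get_reports_by_area("Alabama").

```python
from collections import defaultdict


def total_count_by_year(reports):
    """Sum the "Count" of every report, grouped by the report's "Year"."""
    totals = defaultdict(int)
    for report in reports:
        totals[report["Year"]] += report["Data"]["Count"]
    return dict(totals)


# Hypothetical reports with placeholder counts (not real data).
sample_reports = [
    {"Year": 1999, "Area": "Alabama", "Data": {"Count": 10}},
    {"Year": 1999, "Area": "Alabama", "Data": {"Count": 5}},
    {"Year": 2000, "Area": "Alabama", "Data": {"Count": 7}},
]

print(total_count_by_year(sample_reports))  # {1999: 15, 2000: 7}
```

If a list mixes overlapping groups (for example, an "All Races" report alongside race-specific ones), it may be necessary to filter it first so that the same events are not counted twice.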