Working with the census Package

Warning

This is the old documentation that prepared by Ben, we need to find the right place and merge it into the new documentation.

Running the Command Line Interface

When census is installed, a command line utility (also named census) is automatically made available in your environment. Documentation on the CLI can be accessed by running:

census --help

An example run would look like this:

census --var_file census_vars.yml -y 2009:2019 --geom tract -d population -i x --out ../data/census_tract_2009_2019.csv

This would take the variable definitions in census_vars.yml (--var_file census_vars.yml) , process them from the API for census tracts (--geom tract) for 2009 - 2019 (-y 2009:2019), calculate density per square mile for the “population” variable (-d population), would not interpolate (-i x), and would write the created data frame to the path specified by --out.

Main Python Workflow

All main functionality for this package is contained within the DataPlan object. For detailed documentation on its methods, please see assemble_data.py.

The general workflow is as follows:

  • Create your DataPlan Object (this one is for county data for 2000-2019):

    plan = census.DataPlan("census_vars.yml", "county", years = census.census_years(2000, 2019))
    

On creation, the object creates a plan for a series of API queries to calculate the desired variables based on the passed in yaml file. Details on how to structure that yaml file can be found in Census Variable File Structure.

  • Make the API calls:

    plan.assemble_data()
    

This tells the DataPlan object to start making and combing all the specified API calls.

After this completes, the data is usable for analysis. However, it will only contain data for years that are available through the US census. It also will not have any densities, or other columns. Despite this, if you are interpolating or calculating densities, it is still best practice for reproducuibility to save a copy of your data at this point. You can do that by running

plan.write_data("census_uninterpolated.csv").

After this, we can begin interpolation.

  • Interpolate the data:

    plan.interpolate(min_year = 1999, max_year = 2019)
    

This will interpolate missing data using a weighted moving average model missing data for each variable, for each geographic unit, for each year in the dataset. Since ACS data/Decennial data is available for counties in 2000, and in 2009 onward, this will create data for 1999, and 2001-2008. More information on the interpolation methods can be seen in the nsaph_utils package documentation.

  • Calculate Densities:

    plan.calculate_densities(["population"], sq_mi = True)
    

This will calculate the density per square mile of the variable population within the assembled data set. This can take some time to run, as it needs to get the area of each geographic unit from the tigerweb API. Also note that due to limited data availability, the area may not 100% correspond to the area of the year in question.

  • Write the data:

    plan.write_data("census_interpolated.csv")
    

Your data set is now complete and written to disk.

Shapefile Downloads

In addition to the general processing workflow, the package also includes a function for downloading Census geography shape files. Please see tigerweb.py for documentation on the functions interacting with the Census geographic resources.