Added a script to automatically pull 2010 population data and add to spreadsheet #6

Closed
wants to merge 8 commits into from

Conversation

minorsecond

This will be useful for those who want to use the shooting data for research as rate calculation will be simplified.

A new spreadsheet will be created containing the following fields:

  • total population
  • white population
  • black population
  • asian population
  • hispanic population

Run the script with: python get_population.py. It iterates through the rows and queries the US Census Bureau for the data. The Census API is slow, and because the spreadsheet contains over 1,000 different towns, processing the entire file will take some time.

Upon completion the script will report which towns it had trouble finding data for, so that the user can manually search if desired.
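The flow described above (iterate over rows, look up each place, report the misses at the end) can be sketched roughly as follows. This is only an illustration, not the code in get_population.py: the `city` and `state` column names, the `add_population` helper, and the `fake_lookup` stand-in for the Census query are all assumptions, and the numbers are made up.

```python
FIELDS = ["total", "white", "black", "asian", "hispanic"]


def add_population(rows, lookup):
    """Return (augmented_rows, failed_places).

    `lookup(city, state)` is a hypothetical callable that returns a dict
    of counts keyed by FIELDS, or None when the place cannot be found.
    """
    augmented, failed = [], []
    for row in rows:
        counts = lookup(row["city"], row["state"])
        if counts is None:
            # Remember the miss so it can be reported at the end.
            failed.append((row["city"], row["state"]))
            counts = {}
        new_row = dict(row)
        for field in FIELDS:
            new_row[field + "_population"] = counts.get(field, "")
        augmented.append(new_row)
    return augmented, failed


def fake_lookup(city, state):
    """Stand-in for the real Census query; the figures are invented."""
    table = {
        ("Exampleville", "TX"): {
            "total": 1000, "white": 600, "black": 200,
            "asian": 50, "hispanic": 150,
        },
    }
    return table.get((city, state))


rows = [
    {"city": "Exampleville", "state": "TX"},
    {"city": "Missingtown", "state": "TX"},
]
augmented, failed = add_population(rows, fake_lookup)
for city, state in failed:
    print(f"No population data found for {city}, {state}")
```

Passing the lookup in as a function keeps the slow network call separate from the row handling, so the loop can be tried out without hitting the API.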

This was done in reference to issue #5 .

@fine101w

fine101w commented Oct 5, 2016

OK...first I'm new to Git; second I know nothing about Python; just an end user of applied statistical applications. Have been working with the original spreadsheet and U.S. Census records (where I merged in some population data as well as %'s for white, black and Hispanic pop for the town/city or place referenced in the original file).
The issue I'm having is less about programming and more about goal/intent. When reviewing the 'places' named one can see that some are towns, others are cities, there are also unincorporated areas...and even a few counties. In addition, some of the places are actually neighborhoods within large urban areas, e.g. Los Angeles and Knoxville. I think these different levels of geography undercut efforts to generate rates. There's a denominator problem.
My interest in % racial/ethnic composition involves using those data for a multi-level modeling analysis (the distribution of fatalities within the US is not random, so comparing the fatalities data set to population totals is flawed. We've done this sort of thing with some other health outcomes.)
Now, my wish list actually would be to get a ZIP code at the center/centroid or almost anywhere within each geoplace, and then use those ZIPs to merge in urban-rural status (urban/large town/small town/remote). The population #'s will not work because some of those neighborhoods/areas listed are small--but are really part of urban clusters. The ZIP codes could be used to merge in an old but excellent national database (RUCA, or rural-urban commuting area codes) that has been used a lot...but never mind.

@minorsecond
Author

minorsecond commented Oct 5, 2016

@fine101w The discrepancies in the places are certainly an issue. I think using county-level population would probably be best since it would capture those smaller-level geographies. If you're familiar with GIS, it shouldn't be too difficult to take the FIPS and join it to a census shapefile, and then do a spatial join so that you have the number of cases per race, per county.
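As a rough sketch of the end result described here (cases per race, per county, turned into rates), assuming each case has already been assigned a county FIPS code. This is a plain table aggregation, not a GIS spatial join. The `fips` and `race` field names, the race codes, the FIPS values, and the population figures are all made-up placeholders.

```python
from collections import Counter


def cases_per_county_race(cases):
    """Count cases for each (county FIPS, race) pair."""
    return Counter((case["fips"], case["race"]) for case in cases)


def rate_per_100k(count, population):
    """Cases per 100,000 residents of the matching population group."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


# Placeholder cases; the FIPS codes are not real counties.
cases = [
    {"fips": "00001", "race": "B"},
    {"fips": "00001", "race": "B"},
    {"fips": "00001", "race": "W"},
    {"fips": "00002", "race": "H"},
]

# Placeholder county populations by race, keyed the same way as the counts.
population = {
    ("00001", "B"): 50_000,
    ("00001", "W"): 150_000,
    ("00002", "H"): 20_000,
}

counts = cases_per_county_race(cases)
rates = {key: rate_per_100k(n, population[key]) for key, n in counts.items()}
```

Keying both tables on the same (FIPS, race) pair is what makes the denominator match the numerator, which is the concern raised above about mixed levels of geography.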

Another problem with both the script and the dataset is the place names. Because the place names in the CSV aren't always an exact match to what the Census uses, there are quite a few dropped cases that have to be manually checked.
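One possible way to cut down on the dropped cases, sketched here as an assumption rather than as what the script does: strip the type suffix the Census appends to place names (for example "Los Angeles city") and fall back to a fuzzy match with the standard library's `difflib`. The `match_place` helper, the suffix list, and the 0.85 cutoff are illustrative choices.

```python
import difflib

# Type suffixes the Census appends to place names; this list is partial.
SUFFIXES = (" city", " town", " village", " borough", " cdp")


def strip_suffix(census_name):
    """Lowercase a Census place name and drop one trailing type suffix.

    Limitation: a place whose own name ends in one of these words and
    carries no extra suffix would be over-stripped.
    """
    name = census_name.strip().lower()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def match_place(csv_name, census_names, cutoff=0.85):
    """Return the best-matching Census name, or None if nothing is close."""
    index = {strip_suffix(name): name for name in census_names}
    key = csv_name.strip().lower()
    if key in index:
        return index[key]
    # Fuzzy fallback catches small typos in the CSV name.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


census_names = ["Los Angeles city", "Knoxville city", "Kansas City city"]
print(match_place("Los Angeles", census_names))
print(match_place("Knoxvile", census_names))
print(match_place("Fuqua", census_names))
```

Names that still come back as None would go on the manual-check list. A high cutoff is deliberate, since a wrong match silently attaches the wrong population.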

@fine101w

fine101w commented Oct 5, 2016

Thanks! Appreciate the suggestions. Frankly, I have not gone the FIPS-joined-to-shapefile route. Will look into it.

Would love more granularity than county. And also wonder (as it appears others have) about any additional variables that WaPo has not incorporated.

As you note, it's a bit of a problematic data set, including the misnamed places (besides typos, found a ghost town in Texas when hunting for, I think, Fuqua).

But, it could be somewhat interesting once populated with a few of these additional race/ethnicity measures.

Other issues abound, e.g. 'armed'...many values make sense; a few, not so much. Have tried a recoded version...but critical dichotomy will be unarmed (yes/no). Problem is some of the 'armed' are rather weakly the case or strange (stapler?). Here's current categorization (the 'other' was Taser i think...will probably place those recs with firearm)

[image attachment: recoded categories for the 'armed' variable; not preserved]

thanks again,

df


@jmuyskens jmuyskens closed this Aug 17, 2020