Saturday, June 21, 2014

Another breakthrough in gathering pertinent data

While searching for my next-phase career, there is a useful database given to us: the motherlode of all motherlode databases of post-medical-school careers.

Their data is good, but to maximize the cost-benefit ratio of the effort invested in my application, I needed to cross-reference it with the population and income of the respective cities.

Population says whether a city is rural or urban,
Income says whether a city is affluent.

Naturally, rural, low-income cities would have a higher cost-benefit ratio (at this point the reader can infer the lackluster edge of my potential application).

I already had a CSV file, and the matter at hand was to scrub each address into a string (city, state initials) ready to feed into some census database. That took a little elbow grease, as my regex knowledge is not so good and I constantly have to remind myself of it. I succeeded, and thus three lines of address turned into a single "City, ST" string.
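
Roughly the idea, as a sketch: assuming the last line of each address ends in "City, ST ZIP" (the real layout differed a bit, and the example address is made up):

    import re

    # Assumes the final address line looks like "Springfield, IL 62704".
    CITY_STATE_RE = re.compile(r"([A-Za-z .'-]+),\s*([A-Z]{2})\s+\d{5}(?:-\d{4})?\s*$")

    def scrub_address(address):
        """Collapse a multi-line address down to a 'City, ST' string."""
        last_line = address.strip().splitlines()[-1]
        m = CITY_STATE_RE.search(last_line)
        if not m:
            return None
        return '%s, %s' % (m.group(1).strip(), m.group(2))

    print(scrub_address("123 Main St\nSuite 4\nSpringfield, IL 62704"))
    # Springfield, IL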

Good. On to the next phase.
The census database. My knee-jerk instinct was to use the data the US Census gives away for free. Alas, it's freaking complex, it only has population data, and the cities are so minutely subdivided that it started giving me headaches.

Scrubbing the CSV file I got from the US Census, I was able to get through some of the data, but due to minute discrepancies between city names, I ended up manually entering/fixing the rest. Mind you, by that point I had already given up on getting income.
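
The discrepancies were the kind that exact string matching can't survive. Something like this normalization pass would have caught a chunk of them (the substitutions below are illustrative, not the exact ones I needed):

    def norm(city_st):
        """Normalize a 'City, ST' key so near-identical spellings collide.

        The Census files have variants like 'St. Louis city, MO' where my
        list says 'Saint Louis, MO'; exact matching misses all of these.
        """
        key = city_st.lower().strip()
        key = key.replace('st.', 'saint').replace('ft.', 'fort')
        for suffix in (' city,', ' town,', ' village,', ' cdp,'):
            key = key.replace(suffix, ',')
        return ' '.join(key.split())

    print(norm('St. Louis city, MO') == norm('Saint Louis, MO'))  # True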

However, while manually entering the data, I realized that low population doesn't necessarily mean low competition. A low-population city could just as well be an affluent, gentrified one. Income data was becoming crucial.

I was splitting hairs, and I knew I needed a new breakthrough. I tried searching for Python wrappers of the US Census APIs, but that was shooting in the dark - I knew it was going to be a time sink.

Then I tried looking at websites that have this information. I came across a pretty good one: simple enough to be easily scraped, with pertinent, up-to-date information. Good. All that remained was finding a good way to auto-submit my "City, ST" string and scrape the data that came back.

I tried the requests module, but setting the headers and cookies in such a way as to fool the server into treating me like a normal user was, and still is, above my head. Mind you, I was already 12 hours into making this thing work.
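
For the record, the attempt looked roughly like this (the URL and form field are placeholders, not the actual site's):

    import requests

    session = requests.Session()
    # Headers meant to make the request look like an ordinary browser visit.
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
        'Referer': 'http://example.com/search',
    })
    resp = session.post('http://example.com/search',
                        data={'q': 'Springfield, IL'})
    print(resp.status_code)  # the site still saw through it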

Then something came up in the searches: Mechanize. My old Python friend from when I scraped webpages for stupid things in college. I followed the tutorial, and voila! It submitted my query and spat out workable HTML with the data I needed.
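
The flow, per the Mechanize tutorial - the URL and the form field name are stand-ins for the actual site:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # the site's robots.txt shooed bots away
    br.addheaders = [('User-Agent', 'Mozilla/5.0')]

    br.open('http://example.com/search')
    br.select_form(nr=0)         # first form on the page
    br['q'] = 'Springfield, IL'  # 'q' is a placeholder field name
    response = br.submit()
    html = response.read()       # workable HTML with the data I needed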

From there I wondered whether I should run BeautifulSoup on the data or just do a raw regex match on the strings containing the population/income figures. Luckily, the latter worked.
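
Something along these lines, assuming the page prints labels like "Population" and "Median household income" next to the numbers (the actual markup differed, but the idea is the same):

    import re

    POP_RE = re.compile(r'Population[^0-9]*([\d,]+)')
    INCOME_RE = re.compile(r'[Mm]edian household income[^$]*\$([\d,]+)')

    def extract(html):
        """Pull population and income straight out of the raw HTML."""
        pop = POP_RE.search(html)
        income = INCOME_RE.search(html)
        return (pop.group(1).replace(',', '') if pop else '',
                income.group(1).replace(',', '') if income else '')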

Now the cross-referencing is set. At the moment I'm running the motherlode CSV data against the census site, writing a new CSV with the new info appended as two columns at the end. Mmmm. I even went so far as to retry each submission at least twice, since I hit cases where the connection was reset by peer.
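
The driver loop, more or less - the file names are made up, lookup() stands in for the Mechanize submit-and-scrape step above, and the address is assumed to sit in the first column:

    import csv
    import time

    MAX_TRIES = 3  # original attempt plus at least two more tries

    def lookup_with_retries(city_st):
        for attempt in range(MAX_TRIES):
            try:
                return lookup(city_st)  # Mechanize submit + extract()
            except IOError:             # e.g. connection reset by peer
                time.sleep(2 ** attempt)
        return ('', '')

    with open('motherlode.csv') as src, open('motherlode_plus.csv', 'w') as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        for row in reader:
            pop, income = lookup_with_retries(scrub_address(row[0]))
            writer.writerow(row + [pop, income])  # two new columns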

What I'm seeing is that quite a few Californian cities have no population/income data. Weird... they aren't no-name cities, either.