[Cross-Posted at Digital Harrisburg]
As students have noted recently (here, here, and here), everyone feels a sense of accomplishment and relief in bringing to completion the “City Social” U.S. Census data project for the digital history class. What students have been able to accomplish in the last six weeks is truly remarkable both on an historical and logistical level. We’ve put some 60% of the population of Harrisburg in 1900 “on the map” and done so within the context of a couple of coterminous classes (Digital History and GIS) using over-the-counter software (MS Excel, Access, and ArcGIS) and scans of the original census records available via Ancestry. Every student in my digital history class spent about 20 hours keying this data, and students in GIS courses at Messiah College and Harrisburg University have put in their share of hours.
On the data entry side of the project, the records keyed by students passed through multiple checks for data consistency and integrity, which required fussy, tedious, mind-numbing scanning. Each student had the assignment of looking at another student’s 2,000 records to check and standardize the data. Here’s the workflow we adopted to move data from rough draft to final version:
Step 1: General Check for Completeness
This was a ‘big picture’ check to make sure all of the records were present and accounted for – to rule out the potentially cataclysmic possibility that a student assigned to digitize the records from, say, Ward 2, District 2, in the year 1900, accidentally keyed Ward 2, District 3 from 1920. In this step, students were required to:
Check ward, district, and sheet numbers on the 10 individual worksheets within each spreadsheet. Those numbers should agree with the file name of the excel spreadsheet and number sequentially from 1-10.
Check order and number of columns. There should be 25 columns (A-Y in Excel) representing 25 different census values, and follow the order we agreed on in class.
Check number of names per worksheet. There should be 10 worksheets each with 100 names. If not, double check against the original census sheets.
Scroll through each worksheet quickly to see if any data was left out
Where street names and/or numbers are absent, highlight cells in yellow
If major problems, note them in a document on Google Drive called “Troubleshooting” and return spreadsheet to the author for correction.
Step 2: Careful Data Check
After this general data check, I asked students to look at their classmate’s data more carefully:
Check the first name and last name on each worksheet against record #1 and 100 on the original census sheet
Check street name on each worksheet against the original census sheet
Run a full check on the first ten (1-10) and last ten (91-100) records on each sheet compared with the original census records. This constitutes a 20% quality check on all the data. Are there any mistakes? If so, highlight any cells with errors in blue and note the mistakes in the Troubleshooting document
This 20% quality check offered us a general sense of the data quality for each student, and how carefully we needed to recheck particular districts. In cases where multiple mistakes were found in this sample data check, we will run a more thorough quality check on 100% of the data.
Step 3: Aggregate and Normalize Data
At this step, I took over. I downloaded all spreadsheets from Google Drive to my computer and created a “Master” worksheet for every student combining individual census sheets into a continuous series of 1,000 names for the specific enumeration district. Scrolling through sets of 1,000 names, I was able to check for additional mistakes, missing data, or inconsistencies.
I then created a master Ward spreadsheet for every ward that aggregated all the names keyed for each ward. So, Ward 1 data keyed by one student was combined with Ward 1 data keyed by other students in a file called Ward1_Master. There were ten master spreadsheets, each with thousands of records.
To check data quality, I tried importing the records into Access. Because Access requires that you to categorize fields as text, number, etc.., it generates errors when one kind of data from Excel (e.g., text) doesn’t agree with the field’s properties (e.g., number). So, for example, an address value of “109” imports perfectly from Excel to Access when address is defined as “double” (number) in Access. But if the cell’s value in Excel has text (“BLANK”) instead of numbers, the import generates an error message. Doing this with data from each ward allowed me to highlight additional problem spots in the data entry, which I was then able to correct in the Excel Ward Master files.
Step 4: Create Unique IDs
At this stage, I created a Unique ID for the records in each ward. As I noted in an earlier post, every individual census record has to be tied to a unique identifying number that is machine-friendly and recognizable by ArcGIS. After considerable deliberation with Professors Jeff Erikson and Albert Sarvis about what this ID should look like, we decided that the best way to connect census data with GIS shape files was to create a connecting ID via the street address – with address number followed (without space) by the upper case spelling of the street name.
In the image of the spreadsheet below, the ID is shown in column G. The Spigelmyer family, for example, shown in Rows 4-6, lived at 1106 Ninth Street in the first ward. Their unique ID was 1106NINTH. That same ID is also given to 1106 Ninth St in the GIS. To create the unique ID in the census record, I copied column D into column E, and then removed the suffix “Street”, “Avenue”, “Lane”, etc… using Find / Replace. Then, I capitalized the street names in column F using the UPPER function, and combined the upper-case street name with street address (in column H) using the CONCATENATE function. The result is the ID you see in Column G.
When the two values match, the GIS will correctly import the full set of data from the census record. The unfortunate reality is that the two do not always match. Students working from census data used the inconsistent spellings of street names recorded by the census takers, while students working on GIS maps used the spelling of street names preserved on the 1901 Harrisburg Atlas. A census recorded labeled 101HANNAH will not relate to a shape file named 101HANNA – even though the two represent the same place on the map. Correcting these inconsistencies will require a good deal more fussy work.
The result of all of this work is a Microsoft Access database with 28,397 names, and dozens of GIS files. The following table shows the number of records keyed from each ward compared to the total number of individuals listed in each ward (the total number is actually an estimate of total number). As you can see, students have keyed 100% of Wards 1, 3, 4, and 10, about 70% of Wards 2 and 8, and about 25-50% of Wards 5,6, and 9. We can talk with total confidence about the social make-up of certain wards of the city and can talk reservedly about the rest.
|Ward||Number Keyed||Number Total||Percentage|
I am hoping we’ll have some concrete visuals to show in the next few weeks, and that my students will continue to discuss the census data as they have been this week and last. I’ll present some tabulation and analysis at some point as well. But we have another project to move forward: City Beautiful. Expect much more discussion now about that.
All of this marks only a start, of course. We’ll need all of next fall semester to finish the data entry for Harrisburg’s population in 1900, normalize, run integrity checks on the census records, and fix the wide range of inconsistencies between data entry and GIS. Fortunately, several students will continue working on this project five to ten hours per week in the fall.
Students who have contributed to this project via GIS and history classes at Messiah and Harrisburg University deserve this moment of accomplishment. In our first semester working on this project, we’ve put Harrisburg’s population in 1900 literally on the map. We’re looking forward to what this can now tell us about the city social and beautiful at the turn of the last century.