Incompetence Disclaimer: SAS code made available on this page implements theoretical and policy hypotheses that a prominent economist once described as suitable for "a motivated tenth-grader," using the programming skills of a first-grader. My (complete lack of) programming expertise was acquired through the curriculum of the Brute Force School of Mainframe Torture in the days of Fortran and Cobol. I wound up learning the statistical analysis system, SAS, in the days of green tractor-feed paper eagerly awaited at the "Output Window" of the campus computer center. Those were the days before SAS shifted from an emphasis on academic and scientific research towards elite business process applications, to become the "Cadillac" of corporate strategy and customer relationship management. See
Vertical replication: new construction in Sha Tin, Hong Kong SAR, February 2010
Gary King recommends a comprehensive suite of policies and practices known as the "replication standard." This principle requires researchers to provide sufficient information to allow others to trace their steps, and to reproduce analytical results used to draw inferences and make substantive conclusions. Independent replication is at the heart of the scientific enterprise. All claims and hypotheses are tentative and provisional at first. If they withstand repeated attempts at falsification, and if they receive consistent support through widespread replication by independent researchers, then hypotheses eventually become strong enough to be given the status of a scientific law.
Ah, that three-letter word: "law." You might recognize at this point the familiar history of the scientific method; and then, depending on your discipline and the intellectual currents that have inspired you, you may think of the history of positivism. Going all the way back to Comte, the original positivist, "law" has had a very particular meaning, shaped by what subsequent critics have attacked as "physics envy." In the early nineteenth century, Comte himself viewed the entire history of progress in the physical sciences -- physics, astronomy, chemistry -- as bringing civilization to a dramatic turning point; he suggested that the triumph of the positive method over metaphysical and theological modes of inquiry had finally created the conditions that would allow a new and definitive knowledge of society itself. Compared to the subjects tackled by physics or astronomy, Comte argued, the subjects of politics and society were far more complicated and difficult. But Comte was optimistic that once people understood the history of progress in observation and the positive method in the natural sciences, then a societal consensus could also be achieved in politics and social relations. The new field of inquiry that Comte constructed to sort out these issues is today recognized as a blend of sociology and political science.
But Comte himself called it "social physics." In the twentieth century, the metaphor of that phrase eventually evolved into a dominant, "status quo" social science. Talk of scientific laws often involved a lot of arrogance, and an obsession with mathematical elegance or statistical complexity. Before long, there was a backlash against the hegemonic mid-twentieth century incarnation of positivism, and Thomas Kuhn's Structure of Scientific Revolutions introduced the concept of a "paradigm shift" to diagnose the crisis of explanation and representation that was just beginning. We are still living with the consequences, both positive and negative, of this history today. This is why the mention of the word "law" is freighted with such intellectual and political baggage.
Another reading, however, opens more constructive possibilities. Instead of the false and arrogant assertion that we can uncover the universal, timeless, and perfect laws of social organization, consider the suggestion that we can work and mobilize to build laws that are better than what we have today. 'Law' becomes less an assertion of infallible scientific Truth, and more of an ongoing political struggle involving communities, interest groups, legislatures, lawyers, and judges. Instead of trying to uncover some universal scientific law of social relations, we try to organize for better laws to advance the cause of social justice. Can social science help? Can we provide evidence with rigor and integrity that can help the powerless mount a successful challenge, or help those in positions of power to make rational, well-informed decisions? Obviously, there is more than a little bit of naiveté here: power dynamics don't often allow much room for the careful consideration of objective, rigorous evidence. But sometimes they do. And even if evidence is ignored, we still have no excuse: the fact that political operatives now routinely use the scientific method known as Making Shit Up doesn't mean we should do this too. Scholars have a responsibility to open their methods and their evidence to scrutiny. We also have an obligation to draw clear distinctions between statements of scientific explanation, versus arguments of political, ideological, or ethical persuasion. Both elements are important, but they can be volatile and unstable when mixed without careful consideration.
Gary King's replication standard is a clear, simple set of principles and practices that address these concerns. King's central criterion is this: other researchers should be able to evaluate your work, or to build on your results, and you should provide sufficient information to allow them to do so. Most of the social sciences already operate on a somewhat informal basis, with codes of professional ethics that require researchers to share some information on request; sometimes the requirement is more explicit, such as in cases where taxpayer funds are used to create new, original data. But King's protocol goes further, and includes specific standards for teachers, students, dissertation writers, graduate programs, authors, reviewers, funding agencies, and journal and book editors. Most of his policies are crafted specifically to deal with the specialized features of projects involving the creation of customized primary databases, but the underlying principles can also be extended to research involving secondary data.
Most of my research relies on secondary datasets -- the Home Mortgage Disclosure Act, various products derived from the U.S. Census of Population and Housing, the New York City Housing and Vacancy Survey, etc. -- and this page provides some of the tools I've used to analyze these data sources. Most of the files are not documented to the same level of detail and rigor as original datasets submitted to ICPSR; for additional information, see the technical documentation of the original raw data sources as referenced below. For more information on the replication standard, see
Gary King (1995), "Replication, Replication." PS: Political Science & Politics, September, 444-452.
For a dissenting view, see
Paul S. Herrnson (1995). "Replication, Verification, Secondary Analysis, and Data Collection in Political Science." PS: Political Science & Politics, September, 452-455.
Note that all SAS files are presented here with .txt file extensions to permit display in a browser while avoiding unintentional invocations of SAS.
New Racial Meanings of Housing in America
Database compiled for American Quarterly article
To map the complex new realities of American housing, we need detailed information on the individuals and institutions involved in housing market relations that operate at multiple spatial scales. We exploit several under-utilized features of a widely-used data source, the Home Mortgage Disclosure Act. Each year, portions of the raw loan-application register (LAR) and institutional transmittal sheet (TS) data are disclosed under HMDA (FFIEC, annual). HMDA records provide a limited set of variables measuring the characteristics of loan applicants, the outcome of applications, and a proxy of subprime status based on a "rate spread" calculated from a benchmark of prevailing interest rates (FFIEC, 2006a, 2006b). These data are widely used to document various kinds of inequalities in the allocation of credit. We use the data for this purpose, but we also take advantage of the little-noticed possibilities for analyzing the characteristics of institutions in an industry that has undergone dramatic, turbulent innovation in recent years. We built several databases, one of them focused on the peak year of the subprime boom (2006), aggregating the 34.1 million applicant records to develop market specialization measures for each of the 8,886 separate organizations filing disclosure reports. These lender-level summaries are then merged with a more specialized institutional database compiled by the Federal Reserve (Avery, 2009) to track the increasingly complex structure of bank and financial holding companies and their many subsidiaries. Then we merge the detailed lender databases with the applicant records for conventional loan originations collateralized by single-family homes in the 1,086 metropolitan counties across the continental U.S. Finally, we enhance the database with the detailed analysis of state laws on subprime and predatory lending built by Bostic et al. (2008).
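The actual processing is done in the SAS programs linked on this page; as a language-neutral illustration of the two-pass logic -- summarize the loan-application register to one record per lender, then join holding-company attributes by identifier -- here is a minimal Python sketch. Field names such as lender_id, action, and rate_spread are hypothetical stand-ins, not the actual HMDA record layout.

```python
from collections import defaultdict

def lender_specialization(lar_rows):
    """Aggregate loan-application records to one summary per lender.

    Each row is a dict with hypothetical keys: 'lender_id', 'action'
    ('1' = originated), and 'rate_spread' ('NA' when below the
    reporting threshold -- the proxy for subprime status).
    """
    totals = defaultdict(lambda: {"apps": 0, "originations": 0, "subprime": 0})
    for row in lar_rows:
        t = totals[row["lender_id"]]
        t["apps"] += 1
        if row["action"] == "1":
            t["originations"] += 1
            if row["rate_spread"] != "NA":
                t["subprime"] += 1
    summaries = {}
    for lender, t in totals.items():
        share = t["subprime"] / t["originations"] if t["originations"] else 0.0
        summaries[lender] = {**t, "subprime_share": share}
    return summaries

def merge_with_institutions(summaries, institution_rows):
    """Attach institutional attributes (e.g., from the Avery file) by lender id."""
    inst = {r["lender_id"]: r for r in institution_rows}
    return {k: {**v, **inst.get(k, {})} for k, v in summaries.items()}
```

At full scale the same pattern runs over the 34.1 million applicant records and 8,886 reporting institutions described above.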
These databases provide an exceptionally detailed view of borrowers obtaining mortgage credit for homes in different cities and suburbs, and of the various lenders providing that credit -- banks, thrifts, and mortgage companies, and their "parent" conglomerates and bank holding companies. Since HMDA records also indicate whether a loan was sold in the same calendar year as origination, we also have a partial view of the securitization networks that were so decisive in transforming local mortgages into "electronic instruments" (Sassen, 2009) and "postindustrial widgets" (Newman, 2009) in an expanding transnational network of debt and investment (Gotham, 2009). The database is far from perfect: industry lobbyists never tire of pointing out that HMDA includes no measures of applicant creditworthiness (an absence that reflects the hard work of lobbyists who fought proposals to add credit history to HMDA several years ago; see Immergluck, 2004). Yet the database provides the broadest possible coverage of the market and some of the corporate actors involved in the "front end" of loan origination.
Bostic, R.W., K.C. Engel, P.A. McCoy, A. Pennington-Cross, and S.M. Wachter (2008) The Impact of State Anti-Predatory Lending Laws: Policy Implications and Insights (Cambridge, MA, Harvard University Joint Center for Housing Studies).
Federal Financial Institutions Examination Council (FFIEC) (Annual) Home Mortgage Disclosure Act, Raw Data (Washington, DC, Federal Financial Institutions Examination Council).
Gotham, K.F. (2009) Creating liquidity out of spatial fixity: The secondary circuit of capital and the subprime mortgage crisis, International Journal of Urban and Regional Research 33(2) pp. 355-371.
Immergluck, D. (2004) Credit to the Community (Armonk, NY, M.E. Sharpe).
Newman, K. (2009) Post-industrial widgets: Capital flows and the production of the urban, International Journal of Urban and Regional Research 33(2) pp. 314-331.
Sassen, S. (2009) When local housing becomes an electronic instrument: The global circulation of mortgages, International Journal of Urban and Regional Research 33(2) pp. 411-426.
The layout codes below are for the 2010 files; there are a few minor changes in the record layouts during this time period, but nothing as major as what happened with the release of the new pricing information in the 2004 disclosures. Curl up one night with the FFIEC website...
njshares.sas. Compare carefully with Phil's code. The tract tabulations I've provided do not screen out loans with quality or validity edit failures, and also impose no criteria on applicant income (other than excluding "NA" values).
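For readers without SAS, the inclusion rule just described can be expressed as a short Python sketch -- exclude only "NA" income, screen nothing else. Field names here are hypothetical, not the actual LAR layout.

```python
from collections import Counter

def keep_for_tract_tabulation(row):
    """Inclusion rule mirroring the njshares description: drop only
    records with missing applicant income; do NOT screen out records
    with quality or validity edit failures, and impose no other
    applicant-income criteria."""
    return row.get("applicant_income") not in ("NA", "", None)

def tract_counts(rows):
    """Count qualifying applications per census tract."""
    return Counter(r["tract"] for r in rows if keep_for_tract_tabulation(r))
```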
kathe.xls. Tabulations for each census tract.
SAS program editor batch file. Prerequisites: the three loan-application register exports from the 2004 HMDA national file, along with the export files of metropolitan area codes and labels, and extracts from Summary File 3 of the 2000 U.S. Census of Population and Housing. Revise04.sas is a substantially revised version of system04.sas, below, with OLS regressions of metropolitan-level mortgage flows and loan-level logistic regression models of prime/subprime mortgage market segmentation.
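The segmentation models themselves are estimated in SAS; purely as an illustration of the specification -- a binary subprime outcome modeled on applicant and neighborhood covariates -- here is a toy gradient-descent logistic regression in plain Python. This is a sketch of the model form, not the actual estimation routine or variables.

```python
import math

def predict(w, xi):
    """Probability of the subprime outcome, given fitted weights (intercept first)."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Plain gradient-descent logistic regression: y is 1 for subprime
    (rate-spread reportable), 0 for prime; X holds covariates."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w
```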
Extracts of housing and demographic information, aggregated to metropolitan area summary level, from SF 3 of the U.S. Census of Population and Housing. For variable labels, see revise04.sas. Metropolitan area codes are as defined in the December 2003 OMB definitions; more than four dozen new metropolitan areas cannot readily be matched to the metropolitan summary level provided for 2000 Census data.
SAS program editor batch file. Prerequisites: the three loan-application register exports from the 2004 HMDA national file, along with the export file of metropolitan area codes and labels. The file includes code to create composite variables from different elements of the HMDA fields; estimation of an instrumental variable based on a random sample of applications rejected specifically for reasons of bad credit; cluster analysis of metropolitan- and tract-level variables; and the implementation of a multidimensional scaling algorithm applied to state-level variables published in: Wei Li and Keith S. Ernst (2006). The Best Value in the Subprime Market: State Predatory Lending Reforms. Durham, NC: Center for Responsible Lending.
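One of the simpler steps -- building composite variables from separate HMDA fields -- can be sketched in Python. The codes ('1' = conventional loan type, '1' = one-to-four family property, '1' = home purchase) follow standard HMDA conventions, but treat the field names and the segment labels as hypothetical illustrations, not the actual program's variables.

```python
def market_segment(row):
    """Collapse separate HMDA fields into one composite segment code
    (hypothetical labels; the SAS program builds its own composites)."""
    conventional = row["loan_type"] == "1"
    single_family = row["property_type"] == "1"
    purchase = row["purpose"] == "1"
    if conventional and single_family and purchase:
        return "conv_sf_purchase"
    if conventional and single_family:
        return "conv_sf_other"
    return "other"
```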
Metropolitan-level aggregations of basic lending indicators for all metropolitan areas in the United States and Puerto Rico. This file is created by summing the individual loan-level records in system04.sas; see the DATA step, "data system04.msasum." Metropolitan areas are as defined in the December 2003 OMB revision to metropolitan definitions.
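The summing step behind these aggregations is a simple group-and-sum over loan-level records; a minimal Python sketch of the same pattern follows (field names and codes are hypothetical stand-ins; the real work happens in the SAS DATA step named above).

```python
from collections import defaultdict

def msa_summaries(loan_rows):
    """Sum loan-level records to one summary per metropolitan area,
    with a basic denial-rate indicator ('3' = application denied)."""
    out = defaultdict(lambda: {"apps": 0, "denials": 0})
    for r in loan_rows:
        s = out[r["msa"]]
        s["apps"] += 1
        if r["action"] == "3":
            s["denials"] += 1
    for s in out.values():
        s["denial_rate"] = s["denials"] / s["apps"]
    return dict(out)
```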
List of neighborhoods in 23 U.S. metropolitan areas that qualify as gentrified, according to fieldwork and statistical criteria developed in Daniel J. Hammel and Elvin K. Wyly (1996). "A Model for Identifying Gentrified Neighborhoods With Census Data." Urban Geography 17(3), 248-268, and refined in Elvin K. Wyly and Daniel J. Hammel (1998). "Modeling the Context and Contingency of Gentrification." Journal of Urban Affairs 20(3), 303-326, and Elvin K. Wyly and Daniel J. Hammel (1999). "Islands of Decay in Seas of Renewal: Housing Policy and the Resurgence of Gentrification." Housing Policy Debate 10(4), 711-771.
SAS program editor batch file to implement tract-level taxonomies shown in Tables 2.3 and 2.4 in the Atkinson and Bridge chapter: tracts are analyzed with a standard urban-ecological approach, and then classified in the spirit of market segmentation analyses to highlight trajectories of inner-city inequality. Interpretive labels attached to the final cluster solution: vanilla playgrounds, gold coast enclaves, racialized redevelopment, precarious diversity, latino frontier, loft lightning, cells and apartments, downtown sweep, yuppies in training, and elite polarization.
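The taxonomy itself comes from SAS clustering procedures; to show the core of the approach, here is a minimal k-means in plain Python. This is a sketch under stated assumptions -- tract variables would be standardized to z-scores first, and the actual program's algorithm, settings, and variables differ.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: assign each point to its nearest center, then
    recompute each center as its cluster mean. Points are tuples of
    standardized (z-scored) tract variables."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Recompute centers; keep the old center if a cluster empties out.
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters
```

Interpretive labels like "vanilla playgrounds" or "loft lightning" are then attached by inspecting each cluster's mean profile.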