There are several resources online to help you in working with these data.
Variable names in the longitudinal dataset are similar across the Baseline, First Follow-Up, Second Follow-Up, and Third Follow-Up Surveys. With the exception of MPRID and the credit risk variables, all variables in the data file have a suffix that corresponds to a round of data collection. Baseline variables have an _0 suffix, First Follow-Up variables an _1 suffix, Second Follow-Up variables an _2 suffix, and Third Follow-Up variables an _3 suffix.
*** Please note that these variable names differ from those in the dataset released in 2008, in which Baseline variables had no suffix, First Follow-Up variables had an _f suffix, and Second Follow-Up variables had an _s suffix. The new suffixes (_0 to _3) should make it easier to incorporate each new year of data as it becomes available (_4 to _7).
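If you have code written against the 2008 release, the suffix rules above can be sketched as a small renaming helper. This is only an illustration: the variable name "employees" is hypothetical, the set of unsuffixed variables lists only MPRID because the credit risk variable names are not given here, and a Baseline variable whose old name happened to end in _f or _s would be misread by this simple rule.

```python
# Sketch of the suffix convention described above. Names such as
# "employees" are hypothetical and are not actual KFS variable names.

# 2008-release suffixes mapped to the current round suffixes.
OLD_TO_NEW_SUFFIX = {"_f": "_1", "_s": "_2"}

# Variables that carry no round suffix. Only MPRID is named in the text;
# add the credit risk variable names from the codebook as needed.
UNSUFFIXED = {"MPRID"}


def to_new_name(old_name):
    """Translate a 2008-release variable name to the current convention."""
    if old_name in UNSUFFIXED:
        return old_name
    for old, new in OLD_TO_NEW_SUFFIX.items():
        if old_name.endswith(old):
            return old_name[: -len(old)] + new
    # Baseline variables had no suffix in the 2008 release.
    return old_name + "_0"


def split_round(name):
    """Split a current-style name into (base, round); round is None if absent."""
    base, sep, tail = name.rpartition("_")
    if sep and tail.isdigit():
        return base, int(tail)
    return name, None


if __name__ == "__main__":
    for old in ["MPRID", "employees", "employees_f", "employees_s"]:
        print(old, "->", to_new_name(old))
```

The second helper, split_round, may be handy when reshaping the file from one row per firm to one row per firm and survey round.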
You can see how those map by going to the Census Bureau website. The public-use file has two-digit NAICS codes, while the confidential version (accessed through the NORC Data Enclave) contains five-digit NAICS codes.
The study created the panel by using a random sample from the list of new businesses started in 2004 that were included in the Dun & Bradstreet (D&B) database, which totaled roughly 250,000 such businesses. In response to the Foundation’s interest in understanding the dynamics of high-technology businesses, the KFS oversampled these businesses based on the intensity of research and development employment in the businesses’ primary industries. Below, you'll see how the SIC codes map into the tech sampling strata.
Please see the baseline methodology report for more details. Since the sample was drawn, newer, more up-to-date definitions of high tech have been developed and are now included in the public-use data file. There are three flags: high-tech employers, high-tech generators, and high tech (which equals 1 if either of the other two flags equals 1).
We screened businesses on indicators of business activity and on whether each activity was conducted for the first time in the reference year (2004). These indicators included:
To be “eligible” for the KFS, at least one of these activities had to have been performed in 2004 and none performed in a prior year. Further details can also be found in the appendices of the overview report mentioned above.
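The eligibility rule stated above (at least one activity first performed in 2004, none performed earlier) can be written as a short check. This is a sketch under assumptions, not the study's screening code: the function name, the input layout, and the indicator labels "indicator_a" and "indicator_b" are placeholders, and the actual screener may handle missing or ambiguous answers differently.

```python
# Sketch of the eligibility rule described above. The indicator labels are
# placeholders; they are not the actual KFS screening items.

def is_eligible(first_year_by_indicator, reference_year=2004):
    """Return True if at least one activity was performed in the reference
    year and no activity was performed in an earlier year.

    first_year_by_indicator maps an indicator label to the year the
    activity was first performed, or None if it was never performed.
    """
    years = [y for y in first_year_by_indicator.values() if y is not None]
    performed_in_reference_year = any(y == reference_year for y in years)
    performed_earlier = any(y < reference_year for y in years)
    return performed_in_reference_year and not performed_earlier


if __name__ == "__main__":
    print(is_eligible({"indicator_a": 2004, "indicator_b": None}))
    print(is_eligible({"indicator_a": 2004, "indicator_b": 2003}))
```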
The target population is the population about which conclusions are drawn. For the KFS, the target population was all new businesses that were started in the 2004 calendar year in the United States (the 50 states plus the District of Columbia). This population excludes any branch or subsidiary owned by an existing business and any business inherited from someone else. The issue that arose immediately with this definition is the meaning of "started." For the study population, a business started in 2004 was defined as a new, independent business created by a single person or a team of people, or acquired through the purchase of an existing business or of a franchise. Businesses were excluded if they had an EIN, Schedule C income, or a legal form or had paid state unemployment insurance or federal Social Security taxes prior to or after 2004.
There is rich detail about the firm and the owner(s). The information is collected each year so that changes can be tracked over time. These are some of the types of variables available on the KFS dataset.
Each year, the survey instrument is reviewed, and occasionally questions are modified or added. The third follow-up questionnaire includes additional questions on comparative advantage, international trade and markets, and financial constraints. The fourth follow-up questionnaire includes additional questions on the net worth of the respondent, loan guarantees, the effect of the economic crisis, expected growth, business training, and personal outlook, as well as marital and family business status.
There are two different versions of the KFS dataset available to researchers, a public-use microdata set and a more detailed confidential microdata set. The main differences between the two are that the confidential dataset has five-digit industry (NAICS) codes, geographical detail such as zip code, state, and metropolitan statistical area (MSA), and various continuous variables that are not on the public-use file. The public-use file has two-digit industry codes and no geographical detail.
Yes! The KFS used a stratified sampling methodology, which oversampled high-tech firms. So you need to use weights to draw conclusions about the population the KFS represents. There are both cross-sectional and longitudinal weights that you should use. For example:
For more information, please see the overview report mentioned above.
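To see why the weights matter, here is a minimal sketch of a weighted mean in which each firm's value counts in proportion to its weight. The function and the numbers are made up for illustration and are not KFS data or KFS variable names. Use the cross-sectional or longitudinal weight that matches your analysis, and note that correct standard errors for a stratified sample need survey-aware software, which this sketch does not attempt.

```python
# Sketch of a weighted mean. All values and weights below are invented
# for illustration only.

def weighted_mean(values, weights):
    """Weighted mean that skips records with a missing value or weight."""
    pairs = [
        (v, w) for v, w in zip(values, weights)
        if v is not None and w is not None
    ]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        raise ValueError("no usable records with non-zero total weight")
    return sum(v * w for v, w in pairs) / total_weight


if __name__ == "__main__":
    # The first firm stands in for an oversampled firm, so it carries a
    # smaller weight than the other two.
    employees = [10, 2, 4]
    weights = [1.0, 4.0, 5.0]
    print("unweighted:", sum(employees) / len(employees))
    print("weighted:  ", weighted_mean(employees, weights))
```

In this toy example the unweighted mean is pulled toward the oversampled firm, while the weighted mean gives that firm only the share its weight represents.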
In order to ensure participation and honesty by participating businesses in the KFS, the Kauffman Foundation has promised confidentiality. Firm names and addresses are not available to researchers in order to protect this confidentiality. Geographical detail is available in the confidential microdata (zip code, state, MSA).
Certainly. With the public-use file, you could merge in other sources of data by industry. With the confidential microdata, you could merge in sources of data by geographical area (zip code, MSA, state), by detailed industry code, and by other characteristics in the data. Merging in data sources by firm (that is, by firm name) is not possible.
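A merge by industry code might look like the following sketch, which attaches industry-level columns to each firm record and keeps firms that have no match. The key name "naics2_0", the column "industry_measure", and all values are hypothetical placeholders; check the codebook for the actual industry variable name.

```python
# Sketch of a left join of firm records to an industry-level table.
# Column names and values are hypothetical placeholders.

def merge_by_key(firms, lookup, key):
    """Attach columns from lookup to each firm record that shares key.

    Firms with no match are kept unchanged. On a column-name collision
    the firm's own value wins.
    """
    lookup_by_key = {row[key]: row for row in lookup}
    merged = []
    for firm in firms:
        extra = lookup_by_key.get(firm.get(key), {})
        merged.append({**extra, **firm})
    return merged


if __name__ == "__main__":
    firms = [
        {"MPRID": "A1", "naics2_0": "23"},
        {"MPRID": "B2", "naics2_0": "99"},
    ]
    industry_table = [{"naics2_0": "23", "industry_measure": 1.5}]
    for row in merge_by_key(firms, industry_table, "naics2_0"):
        print(row)
```

The same pattern would apply to the confidential file with a geographic key such as state or MSA in place of the industry code.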
There is a list of researchers that are in the NORC Data Enclave using the confidential data and their respective research topics.