What resources are online to help me work with the KFS data?

There are several resources online to help you in working with these data.

  • There is a document called "Study Metadata Documentation", which gives a variable list with the variable name, its label/description, and information such as the type of variable and format. It is a data dictionary/codebook.
  • There are four annotated questionnaires, which contain the surveys themselves along with the variable name associated with each survey question. You can find these documents, and many others that you might find helpful, by clicking on SSRN.
  • Additionally, the Kauffman Foundation has sponsored a series of webinars which are available for researchers wanting to use the data: Common researcher questions and research presentations from scholars.

What are the endings on many of the variables?

Variable names in the longitudinal dataset are similar between the Baseline, First Follow-Up, Second Follow-Up, and Third Follow-Up Surveys. With the exception of MPRID and the credit risk variables, all variables in the data file will have a suffix that corresponds to a round of data collection. Baseline variables contain an _0 suffix, First Follow-Up variables contain an _1 suffix, Second Follow-Up variables contain an _2 suffix, and Third Follow-Up variables contain an _3 suffix.

*** Please note that these variable names differ from those in the dataset released in 2008, in which Baseline variables did not have a suffix, First Follow-Up variables had an _f suffix, and Second Follow-Up variables had an _s suffix. The new suffixes (_0 to _3) should allow easier incorporation of each new year of data as it becomes available (_4 to _7).
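
As a rough illustration of the suffix convention, the sketch below splits a variable name into its stem and survey round, and translates a 2008-release name into the newer scheme. The variable name emp_total used in the usage note is hypothetical (see the codebook for real names), and the credit risk variable names are not listed in this FAQ, so they would need to be added to UNSUFFIXED by hand.

```python
import re

# Round suffixes described above: _0 = Baseline, _1 to _3 = follow-ups.
ROUND_SUFFIX = re.compile(r"^(?P<stem>.+)_(?P<round>[0-9])$")

# Variables that carry no round suffix. MPRID is named in the text; the
# credit risk variable names are not given there, so none are listed here.
UNSUFFIXED = {"MPRID"}


def split_round(name):
    """Return (stem, round) for a suffixed variable, or (name, None)."""
    if name.upper() in UNSUFFIXED:
        return name, None
    match = ROUND_SUFFIX.match(name)
    if match is None:
        return name, None
    return match.group("stem"), int(match.group("round"))


def from_2008_name(name):
    """Translate a 2008-release name (_f, _s, or no suffix) to the new scheme.

    Assumption: no Baseline variable in the 2008 release happens to end
    in _f or _s on its own; check the codebook before relying on this.
    """
    if name.upper() in UNSUFFIXED:
        return name
    if name.endswith("_f"):
        return name[:-2] + "_1"
    if name.endswith("_s"):
        return name[:-2] + "_2"
    return name + "_0"
```

For example, split_round("emp_total_2") separates the stem emp_total from round 2, and from_2008_name("emp_total_f") yields emp_total_1.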

Where can I find the definitions of the industry codes (NAICS)?

Definitions of the NAICS codes are available on the Census Bureau website. The public-use file has 2-digit NAICS codes, while the confidential version (accessed through the NORC enclave) contains 5-digit NAICS codes.

What can you tell me about the high-tech oversample?

The study created the panel by using a random sample from the list of new businesses started in 2004 that were included in the Dun & Bradstreet (D&B) database, which totaled roughly 250,000 such businesses. In response to the Foundation’s interest in understanding the dynamics of high-technology businesses, the KFS oversampled these businesses based on the intensity of research and development employment in the businesses’ primary industries. Below, you'll see how the SIC codes map into the tech sampling strata.


  • 28 Chemicals and allied products
  • 35 Industrial machinery and equipment
  • 36 Electrical and electronic equipment
  • 38 Instruments and related products


  • 131 Crude petroleum and natural gas operations
  • 211 Cigarettes
  • 291 Petroleum refining
  • 299 Miscellaneous petroleum and coal products
  • 335 Non-ferrous rolling and drawing
  • 371 Motor vehicles and equipment
  • 372 Aircraft and parts
  • 376 Guided missiles, space vehicles, parts
  • 737 Computer and data processing services
  • 871 Engineering and architectural services
  • 873 Research and testing services
  • 874 Management and public relations
  • 899 Services, not elsewhere classified
  • 229 Miscellaneous textile goods
  • 261 Pulp mills
  • 267 Miscellaneous converted paper products
  • 348 Ordnance and accessories, not elsewhere classified
  • 379 Miscellaneous transportation equipment


  • All other industries
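
The lists above can be turned into a small lookup, sketched here. This is not the study's own code: it assumes SIC codes are supplied as four-digit codes (so the two-digit and three-digit groups are prefixes), and the labels "first_list", "second_list", and "all_other" are placeholders, because the stratum names are not reproduced in the lists above.

```python
# Two-digit SIC groups from the first list above.
TWO_DIGIT_CODES = {"28", "35", "36", "38"}

# Three-digit SIC groups from the second list above.
THREE_DIGIT_CODES = {
    "131", "211", "291", "299", "335", "371", "372", "376", "737",
    "871", "873", "874", "899", "229", "261", "267", "348", "379",
}


def sic_list_membership(sic_code):
    """Return a placeholder label for the list a four-digit SIC code falls in."""
    # Zero-pad so that a code stored as an integer (e.g. 281 for 0281)
    # is not mistaken for a code starting with "28".
    digits = str(sic_code).strip().zfill(4)
    if digits[:2] in TWO_DIGIT_CODES:
        return "first_list"
    if digits[:3] in THREE_DIGIT_CODES:
        return "second_list"
    return "all_other"
```

No three-digit group in the second list begins with one of the two-digit groups in the first list, so the order of the two checks does not change the result.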

Please see the baseline methodology report for more details. Since the sample was drawn, newer, more up-to-date definitions of high-tech have been developed and are now included on the public-use data file. There are three flags: high tech employers, high tech generators, and high tech (which is =1 if either of the other flags is =1).

How did you define “Business Start”?

We screened businesses for indicators of business activity and asked whether these activities were conducted for the first time in the reference year (2004). These indicators included:

  • Payment of state unemployment (UI) taxes
  • Payment of Federal Insurance Contributions Act (FICA) taxes
  • Presence of a legal status for the business
  • Use of an Employer Identification Number (EIN)
  • Use of Schedule C to report business income on a personal tax return

To be “eligible” for the KFS, at least one of these activities had to have been performed in 2004 and none performed in a prior year. Further details can be found in the appendices of the overview report mentioned above.
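
The eligibility rule stated here can be written out as a short check. This is only a sketch of the rule as worded in this answer, not the screener actually used in the study; the indicator keys and the input layout (first year each activity was performed, or None) are assumptions made for illustration.

```python
REFERENCE_YEAR = 2004

# Hypothetical keys for the five indicators listed above.
INDICATORS = ("state_ui_tax", "fica_tax", "legal_status", "ein", "schedule_c")


def is_eligible(first_year_by_indicator):
    """Apply the rule: at least one activity in 2004, none in a prior year.

    first_year_by_indicator maps an indicator key to the first year the
    activity was performed, or None (or a missing key) if never performed.
    """
    years = [first_year_by_indicator.get(key) for key in INDICATORS]
    performed = [year for year in years if year is not None]
    any_in_reference_year = any(year == REFERENCE_YEAR for year in performed)
    none_before = all(year >= REFERENCE_YEAR for year in performed)
    return any_in_reference_year and none_before
```

Under this reading, a business that obtained an EIN in 2004 passes, while one that obtained an EIN in 2003 and a legal status in 2004 does not.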

What target population does the KFS data represent?

The target population is the population about which conclusions are drawn. For the KFS, the target population was all new businesses that were started in the 2004 calendar year in the United States (the 50 states plus the District of Columbia). This population excludes any branch or subsidiary owned by an existing business or a business inherited from someone else. The issue that arose immediately with this target definition is the meaning of “started”. For the study population, a business started in 2004 was defined as a new, independent business that was created by a single person or a team of people, the purchase of an existing business, or the purchase of a franchise. Businesses were excluded if they had an EIN, Schedule C income, or a legal form or had paid state unemployment insurance or federal Social Security taxes prior to or after 2004.

What kind of information is available on the firms and owners? Is the information available for each year?

There is rich detail about the firm and the owner(s). The information is collected each year so that changes can be tracked over time. These are some of the types of variables available on the KFS dataset.

  • Firm characteristics: Industry, Legal Form, # of Owners, # of Employees (PT/FT), Types of Customers, Location
  • Firm strategy and innovation: Product/Service Offerings, Intellectual Property, Licensing In and Licensing Out, R&D
  • Detailed financial information: Equity & Debt Financing, Income Statement Info (Revenue, Expenses, Profits), Balance Sheet Info (Assets, Liabilities, Equity)
  • Employees: Types of Benefits Offered, Task/Work Structure
  • Owner characteristics and work behaviors (Information on up to 10 owners): Education, Age, Race, Ethnicity, Gender, Citizenship, Immigrant Status, Hours Worked, Previous Years of Work Experience, Previous Start-up Experience (same/different industry as this firm)

Each year, the survey instrument is reviewed, and occasionally questions must be modified or can be added. The third follow-up questionnaire shows additional questions on comparative advantage, international trade and markets, and financial constraints. The fourth follow-up questionnaire shows additional questions on net worth of the respondent, loan guarantees, effect of the economic crisis, expected growth, business training, personal outlook, as well as marital and family business status.

What is different about the confidential microdata, which are only available through the NORC data enclave?

There are two different versions of the KFS dataset available to researchers, a public-use microdata set and a more detailed confidential microdata set. The main differences between the two are that the confidential dataset has five-digit industry (NAICS) codes, geographical detail such as zip code, state, and metropolitan statistical area (MSA), and various continuous variables that are not on the public-use file. The public-use file has two-digit industry codes and no geographical detail.

Public-use Microdata

  • Available for free with registration on the web
  • Identifying features such as geographical location are omitted

Data Enclave

  • Available via remote access and by application only
  • More identifying variables available along with other matched variables such as credit scores
  • Teams of geographically distributed researchers can access and share code and results and work in a collaborative environment
  • Other datasets can be brought into the data enclave and linked to the KFS

Do I need to use survey weights when analyzing the KFS data?

Yes! The KFS used a stratified sampling methodology, which oversampled high-tech firms. So you need to use weights to draw conclusions about the population the KFS represents. There are both cross-sectional and longitudinal weights that you should use. For example:

  • Baseline survival rate defined as the % of firms that survived the first year: use the first follow-up final weight, weight_final_1
  • Baseline survival rate defined as the % of firms that survived the second year: use the second follow-up cross-sectional weight, weight_final_f2_2
  • Baseline survival rate defined as the % of firms that survived both the first and second year: use the second follow-up longitudinal weight, weight_final_f12_long_2
  • Baseline survival rate defined as the % of firms that survived the first, second, and third year: use the third follow-up longitudinal weight, weight_final_f123_long_3
  • Baseline survival rate defined as the % of firms that survived the first, second, third, and fourth year: use the fourth follow-up longitudinal weight, weight_final_f1234_long_4
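
To show why the weights matter, here is a minimal sketch of a weighted survival rate: the sum of weights for surviving firms divided by the sum of weights for all firms. The weight name weight_final_1 comes from the list above, but the survival flag survived_1 is a hypothetical name and the three records are made-up numbers, not KFS data.

```python
def weighted_rate(records, flag_key, weight_key):
    """Weighted share of records whose flag is true.

    Records with a missing (None) weight are skipped.
    """
    numerator = 0.0
    denominator = 0.0
    for record in records:
        weight = record.get(weight_key)
        if weight is None:
            continue
        denominator += weight
        if record[flag_key]:
            numerator += weight
    if denominator == 0:
        raise ValueError("no records with a usable weight")
    return numerator / denominator


# Made-up records for illustration only.
firms = [
    {"survived_1": True, "weight_final_1": 10.0},
    {"survived_1": True, "weight_final_1": 2.0},
    {"survived_1": False, "weight_final_1": 8.0},
]

unweighted = sum(f["survived_1"] for f in firms) / len(firms)
weighted = weighted_rate(firms, "survived_1", "weight_final_1")
```

In this toy case the unweighted rate is 2/3 while the weighted rate is 12/20 = 0.6, because the non-surviving firm carries a large weight. The sketch gives a point estimate only; standard errors for a stratified sample call for survey-aware estimation.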

For more information, please see the overview report mentioned above.

Can I get the location or name of the firm?

In order to ensure participation and honesty by participating businesses in the KFS, the Kauffman Foundation has promised confidentiality. Firm names and addresses are not available to researchers in order to protect this confidentiality. Geographical detail is available in the confidential microdata (zip code, state, MSA).

Can I merge other datasets with the KFS?

Certainly. With the public-use file, you could merge in other sources of data by industry. With the confidential microdata, you could merge in sources of data by geographical area (zip code, MSA, state), by detailed industry code, and by other characteristics in the data. Merging in data sources by firm (by firm name) is not possible.
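
An industry-level merge with the public-use file might look like the sketch below, which attaches industry-level values to each firm by two-digit code. The field names naics2 and sector_growth and the example values are hypothetical; only MPRID is a name taken from this FAQ.

```python
def to_two_digit(naics_code):
    """Truncate a more detailed NAICS code to its two-digit prefix."""
    return str(naics_code)[:2]


def merge_by_industry(firms, industry_table, code_key="naics2"):
    """Left-join industry-level values onto firm records.

    industry_table maps a two-digit code (as a string) to a dict of
    industry-level fields. Firms with no match are kept unchanged, and
    firm fields win if a field name appears on both sides.
    """
    merged = []
    for firm in firms:
        extra = industry_table.get(firm[code_key], {})
        merged.append({**extra, **firm})
    return merged


# Made-up records for illustration only.
firms = [
    {"MPRID": 1, "naics2": "54"},
    {"MPRID": 2, "naics2": "99"},
]
industry_table = {"54": {"sector_growth": 0.03}}
merged = merge_by_industry(firms, industry_table)
```

External sources published at a finer level of industry detail would first need to be aggregated to two-digit codes, for example by grouping on to_two_digit(code), before merging with the public-use file.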

Is there a list of what people are working on with the KFS data?

There is a list of researchers who are using the confidential data in the NORC Data Enclave, along with their respective research topics.