At least some of you are considering a quantitative analysis for your thesis! Depending on your level of experience, getting started with your own analysis can be pretty tricky. After all, you need to construct your very own data set.

In this post, I briefly discuss three tips on creating your own dataset that may prove helpful to you.

Starting point: a good base data set

The best starting point for any data set is a structure that is known to be correct. Possibly the most frequent “typical” analysis is a cross-section time series of countries, which can be used to compare countries at a point in time, or the development of countries over time. Often, this type of analysis uses indicators such as GDP/capita, GINI, HDI, etc. aggregated into annual values for each country, though there are datasets with more fine-grained information.

The best starting point for such a dataset is a list of all countries with either their start and end date (if you use data at sub-annual resolution) or a list of all countries in existence in any given year. Once you have such a list, you can easily add the indicators you would like to use while checking if there are any country-year combinations with missing values. Spotting such missing values is much trickier if you don’t have an overview of which countries and years to expect.

The Correlates of War (CoW) project maintains a State Membership dataset (v. 2016) that provides both the start and end dates for all countries in the world since 1816 (states2016.csv) and a dataset of country-year combinations (system2016.csv). See the codebook and FAQ for further information.
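If you start from a list of start and end dates rather than a ready-made country-year file, the expansion into country-years can be sketched as follows. This is a minimal sketch on toy data: the variable names ccode, styear and endyear are assumptions, and the three rows are typed in for illustration only, so check both against the actual file and codebook.

```stata
* Toy stand-in for a state-membership list: one row per country spell.
* Variable names and values are assumptions for illustration only.
clear
input int ccode int styear int endyear
2   1816 2016
260 1955 1990
265 1954 1990
end

* Create one observation per year of each spell
generate int nyears = endyear - styear + 1
expand nyears
bysort ccode styear: generate int year = styear + _n - 1

* Confirm that each country-year appears exactly once
isid ccode year
drop nyears styear endyear
```

Grouping by ccode and styear (rather than ccode alone) keeps the year counter correct for countries that appear with more than one spell.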

The CoW dataset has become a respected standard and many other datasets also use the identification numbers of the CoW project (“cow codes”) to identify countries. This can prove a time saver when merging in other indicators.

If you are working with country-year data, here’s how to prepare the dataset for the addition of new indicators:

* Importing the CSV file into Stata
* (on versions older than Stata 13, use: insheet using "system2016.csv", comma clear)
import delimited using "system2016.csv", clear

* Converting the year and country IDs to strings (temporarily)
tostring year, replace
tostring ccode, replace

* Generating a unique country-year identifier
generate yearc=year+ccode

* Placing the yearc variable first in the list of variables
order yearc

* Converting country, year and country-year IDs back to numbers
destring yearc, replace
destring year, replace
destring ccode, replace

* Adding a label to the new variable
label variable yearc "Country-Year ID"

* Sorting the dataset by yearc (needed for merging!)
sort yearc

The resulting yearc variable allows you to uniquely identify each country-year. Now, if you want to add a new indicator, you include a yearc variable in that dataset as well and merge the two datasets with the merge command.
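A merge on yearc might look like the sketch below. It uses two small toy datasets held in temporary files so that it runs on its own. The indicator name gdppc and all values are made up for illustration, and the yearc values simply follow the year-plus-ccode concatenation from above. The merge 1:1 syntax assumes Stata 11 or later, where the data no longer need to be sorted by hand before merging.

```stata
* Toy base data standing in for the prepared country-year list
clear
input long yearc int year int ccode
19902  1990 2
19912  1991 2
199020 1990 20
199120 1991 20
end
tempfile base
save `base'

* Toy indicator data (hypothetical variable gdppc, invented values);
* the 1991 observation for ccode 20 is deliberately left out
clear
input long yearc double gdppc
19902  100
19912  110
199020 90
end
tempfile indicator
save `indicator'

* Merge the indicator into the base data and inspect the result
use `base', clear
isid yearc
merge 1:1 yearc using `indicator'
tabulate _merge

* Country-years from the base list without a matching indicator value
list yearc year ccode if _merge == 1
```

Because the base list contains every expected country-year, observations with _merge equal to 1 show exactly where the new indicator has gaps.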

Adding data: useful sources

Next, you will need indicators for your analysis. Of course, each analysis differs in its needs, but there are four good sources of data that can fast-track you on your way to a complete dataset.

Quality of Government Data

The Quality of Government Institute at the University of Gothenburg produces a number of useful, aggregated datasets. The best starting point is their “standard dataset”, which covers over 600 socio-economic variables from dozens of sources and gives you a decent leg up on most standard indicators. With a bit of luck, this dataset might be all you need.

Their data comes with a handy visualization tool that allows you to view the data before downloading. If you do decide to use QoG, consider the following notes:

It is a matter of professionalism that you quote the QoG dataset and the source data sets for the variables that you use.
You can find the cow code in the variable ccodecow, and the year value in the appropriately named variable year.
When merging the data, watch out for inconsistencies in the cow codes for individual countries. For example, there are three codes for Germany for each year in the dataset (255 Germany, 260 West Germany, 265 East Germany), including the post-war period before West and East Germany became states and the 1990s and 2000s after the two states merged. Make sure that for each variable you actually use, you know which of the observations holds the proper data. (Variables from the same source dataset behave in the same way, so if you have figured out the World Bank’s political stability variable, you also know how the World Bank’s rule of law variable will behave.)
Make sure you read the codebook and run descriptives for all variables you are using to understand how they behave. The codebook provides a reference to further information on each datasource.
Optional, but recommended: the QoG dataset is large (almost 13’500 observations). Use drop to remove variables you don’t need, or keep to retain only those you do, to reduce the size of your dataset before doing time-intensive computations. In some cases, QoG’s “basic dataset” with over 100 variables already gives you all you need.
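The checks described in these notes can be sketched on toy data as follows. Only ccodecow and year are taken from the notes above. The variable stability and all of its values are hypothetical, and dropping the empty German rows is just one possible way to handle them, so adapt it to how your variable actually behaves.

```stata
* Toy extract mimicking three German codes per year
* (stability is a hypothetical indicator with invented values)
clear
input int ccodecow int year double stability
255 1989 .
260 1989 5
265 1989 3
255 1991 6
260 1991 .
265 1991 .
end

* Which German code carries data in which year?
tabulate ccodecow year if inlist(ccodecow, 255, 260, 265) & !missing(stability)

* One option: drop German rows that hold no data for this variable
drop if inlist(ccodecow, 255, 260, 265) & missing(stability)

* Keep only the identifiers and the variables you actually need
keep ccodecow year stability
isid ccodecow year
```

Running the tabulate step for each variable you use (or once per source dataset) shows which code holds the data before you decide what to drop or recode.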

EUGene

Like QoG, the EUGene dataset offers a number of useful variables on states (though substantially fewer), covering state capabilities, features of regimes, trade, etc. But the big advantage of the EUGene dataset is that it allows you to do analyses on the relations between states. Not only can it give you a list of all states in the vicinity for each country-year (whether they share a land border, are sufficiently close across a body of water, or are within a set distance of each other), it also has data on the relationship between states (has there been peace and for how long, are there any disputes between the states, what’s the level of trade, etc.).

Such data is very tricky to generate on one’s own, making EUGene an exceedingly useful tool. (It’s a pity it only runs on Windows, sorry to all Mac and Linux aficionados out there.) You can download EUGene and the corresponding documentation at the EUGene website.

Finding more data

A big problem in quantitative analysis is that the many available datasets are widely scattered. File types, availability and quality of documentation differ drastically, and websites disappear or change all the time, making it very hard to find what you need or even to know whether the data you need is out there. There are two prominent approaches that seek to solve this problem:

The Dataverse Network run by the Institute for Quantitative Social Science at Harvard offers scientists a platform where data can be deposited. Once the data has been uploaded, the Dataverse ensures its stable availability and will convert the data into a useful format for you (including text files, Stata and R). So if you are looking for data, visit the Dataverse and search or browse for what you need.
A newer, but potentially much larger repository is Quandl. They aim to do for data what Google has done for the web at large, and they have already indexed over 2 million (!) datasets. It’s a simple thing to head to their home page and search for individual datasets, and if you like, Quandl can even combine multiple datasets into a superset for you.

Ensuring Reproducibility: Data, Do Files And Logs

Finally, a brief reminder of three things that were covered in Econometrics as well:

Preserve your original data. This means downloading the data, the documentation and noting the source of the data, and then not changing any of these files. (Consider write-protecting or locking the files to prevent them from getting accidentally overwritten.) You should always be able to get back to the original starting point of your analysis in case you made any mistakes in between.
Use do-files for everything. Instead of manually preparing your datasets, e.g. by fixing individual values (typos, recording mistakes, etc.) or manually importing and merging your data, you should automate the process so that you can reproduce every step of the way to your result. Continuing the example above: some datasets continue to use the cow code for West Germany after 1990 since unified Germany is based on the institutions of West Germany, whereas other datasets revert to the pre-1945 code for Germany, a code that was also shared by the Third Reich and the Weimar Republic (the latter is the approach recommended by CoW, whereas the former makes more sense politically). If you merged two different datasets that handle Germany inconsistently, you will have duplicates or gaps in your data depending on how you handled the merger. If you manually prepare the data, you will a) have to redo it completely if you mess up at any point, and b) you won’t be able to spot what mistake you made since you cannot go through the process step by step. I recommend using:
1. one do file per dataset to prepare each individual dataset for the merger (label the variables, treat the cow codes, generate the country-year identifier)
2. one do file to merge all the datasets and produce the final, combined dataset for analysis
3. one do file for the analysis (or potentially two, with one for descriptives and exploration, and one for the final analysis)
Finally, keep logs of what you do, so you can follow your exploration of the data. If you put the command to open a log file at the beginning of each do file (and the corresponding command to close it at the end), you will always have a complete record. I recommend something along the lines of log using "analysis $S_DATE-$S_TIME.log", text which generates a new log file with the date and time in its file name every time you run the command, so you can easily keep the old log files around for future reference. (Note that $S_TIME contains colons, which Windows does not allow in file names, so you may need to replace them.) However, if it’s a one-time task that is only repeated to fix bugs (merging data files is such a task), then something simple like log using "merge.log", text replace will do; it only keeps the log of the most recent run on hand.
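One way to get a timestamped log name that also works on Windows is sketched below. It builds the name from Stata’s c(current_date) and c(current_time) values and swaps out the colons and spaces. The local macro name stamp and the file name prefix are my own choices for illustration.

```stata
* Build a file-name-safe timestamp: colons become hyphens,
* spaces become underscores
local stamp = trim("`c(current_date)' `c(current_time)'")
local stamp = subinstr("`stamp'", ":", "-", .)
local stamp = subinstr("`stamp'", " ", "_", .)

* Close any log left open by an earlier, interrupted run
capture log close

* Open a fresh log for this run
log using "analysis_`stamp'.log", text

* ... analysis commands go here ...

log close
```

The capture log close line is optional, but it prevents an error if a previous run stopped before reaching its own log close.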

Hopefully, this advice helps you in putting together your dataset and avoiding the most common hurdles along the way.

Lutz F. Krebs

Getting started with a quantitative analysis