Archive for the ‘Data analysis’ Category
Centers of gravity in Russia
Doing Business in Russia
The World Bank published a Doing Business in Russia report several weeks ago. It focuses on 4 main indicators: starting a business, getting a construction permit (which is a total nightmare in Russia), registering property and trading across borders. Each indicator is measured by the number of steps/documents, the time to complete the procedures and the costs as a percentage of national income. Strangely, the World Bank didn’t publish a simulator as it did for other countries (China, Mexico, etc.). Get the simulator as an Excel spreadsheet here: http://www.mediafire.com/?nwnwjz3ugwm (I use exactly the same methodology and formulas as the World Bank does in other reports).
A few words about the methodology:
The index is calculated as the simple average of a city’s percentile ranking on each of the 4 topics covered in the study (starting a business, getting a construction permit, registering property and trading across borders). The ranking on each topic is in turn the simple average of the percentile ranking on its component indicators. [The percentile rank is the percentage of values strictly below (<) or less than or equal to (<=) a given value, depending on the definition.]
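Here is a minimal Python sketch of that two-level average. The city names, indicator values and function names are made up for illustration, and the `inclusive` switch only covers the two percentile definitions mentioned above; it is not the World Bank's own code.

```python
def percentile_rank(values, x, inclusive=False):
    """Percentage of values below x (<), or below or equal to x (<=)."""
    if inclusive:
        count = sum(1 for v in values if v <= x)
    else:
        count = sum(1 for v in values if v < x)
    return 100.0 * count / len(values)


def mean(numbers):
    return sum(numbers) / len(numbers)


def topic_rank(city, indicators, inclusive=False):
    """Simple average of the city's percentile ranks on one topic's indicators."""
    return mean([
        percentile_rank(list(per_city.values()), per_city[city], inclusive)
        for per_city in indicators  # each indicator is a {city: value} dict
    ])


def index(city, topics, inclusive=False):
    """Simple average of the city's topic ranks."""
    return mean([topic_rank(city, indicators, inclusive) for indicators in topics])


# Hypothetical data: two topics, the first with two component indicators.
topics = [
    [{"A": 9, "B": 7, "C": 8}, {"A": 30, "B": 20, "C": 25}],
    [{"A": 5, "B": 5, "C": 10}],
]

for city in ("A", "B", "C"):
    print(city, round(index(city, topics), 1))
```

Ordering the cities by this index in ascending order then gives the final ranking.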
For example, in Moscow it takes 9 procedures, 30 days and 2.7% of annual income per capita to open a business. The minimum capital requirement amounts to 2.2% of annual income per capita. It means that on the 4 component indicators, Moscow ranks in the 0th (best), 67th, 100th (worst) and 0th percentile. On average, Moscow ranks in the 53rd percentile. It ranks in the 96th percentile on dealing with construction permits, 44th percentile on registering property and 67th percentile on trading across borders. The average of Moscow’s percentile rankings is 62%. If you now order all cities by their average percentile rank (ascending), Moscow gets the 10th and last place.
However, percentiles are totally meaningless in small samples, especially when the observations are about the same and you get “tied ranks” while ordering the data. When you have ten values, of which eight are the same, you might have to assign the 0th and 100th percentile to the lowest and the highest value respectively (some statisticians argue that the 0th and 100th percentile cannot be determined in a finite sample). The remaining ones could lie anywhere between the 20th and 80th percentile (depending on HOW you define the percentile rank).
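To see how much the definition matters with ties, here is a small Python sketch. The ten values are invented (eight of them identical), and only the two definitions quoted above are compared.

```python
def percentile_ranks(values, inclusive):
    """Percentile rank of every value, using < (strict) or <= (inclusive)."""
    n = len(values)
    ranks = []
    for x in values:
        if inclusive:
            count = sum(1 for v in values if v <= x)
        else:
            count = sum(1 for v in values if v < x)
        ranks.append(100 * count / n)
    return ranks


# Hypothetical sample: ten observations, eight of them tied.
values = [3, 5, 5, 5, 5, 5, 5, 5, 5, 9]

strict = percentile_ranks(values, inclusive=False)
weak = percentile_ranks(values, inclusive=True)

print("strict (<): ", strict)
print("weak   (<=):", weak)
```

With these made-up values the eight tied observations land on the 10th percentile under one definition and on the 90th under the other, which is the kind of swing that makes the averaged index so fragile.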
Now comes the problem: my calculations and the World Bank report’s result do not match, despite using exactly the same data and method. Let’s assume (out of goodwill) that the WB used a different method for calculating percentiles. I will post an update as soon as I get an answer from them.
How to make a Russian regional thematic map
Nathan at Flowing Data explained how to create a US county thematic map using free tools, and it helped me immensely in doing the same for Russian regions.
The Result
Step 1: Get the blank map
Unfortunately, we can’t take the blank Russia SVG map from Wikimedia Commons. First, it doesn’t take into account the latest mergers in 2008 (Irkutsk Oblast + Ust-Orda Buryat Autonomous Okrug = Irkutsk Oblast, Chita Oblast + Agin-Buryat Autonomous Okrug = Zabaykalsky Krai). Second, the regions aren’t identified in the SVG file by their ISO 3166-2:RU code, but by their concatenated transliterated names. Third, the Federal Service of State Statistics doesn’t publish figures for several regions (like Yamalo-Nenets Autonomous Okrug) because they fall under the administrative jurisdiction of other federal subjects of Russia (in this case Tyumen Oblast). It means that we have to group these regions together so that they can “inherit” the data visualization from their respective jurisdiction.
Get the new map I just uploaded to Wikimedia Commons here.
Step 2: Run the Python script
You can simply follow Nathan’s blog post and write it yourself, or use mine instead. I wrote it because I am too lazy to rewrite the script each time I want to colorize a map with different parameters.
Get it here (don’t forget to get BeautifulSoup first): http://pastie.textmate.org/713719
Sample usage: python chloropleth.py -d Inward_FDI_Performance_Index_2005_2007.csv -c RdYlBu6.dat -l IFPI.dat -i Russia2009blank.svg -o IFPI.svg
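If you would rather write it yourself, the core of such a script is just "find each region by its id and set a fill". Here is a rough sketch of that step. It is not my script: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, and it assumes that each region is an element whose `id` is its ISO 3166-2:RU code and that grouped regions sit inside a `<g>` element carrying the parent's code. The tiny SVG string and the function name `colorize` are made up for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def colorize(svg_text, colors_by_id):
    """Return SVG text with a fill set on every element whose id has a color."""
    ET.register_namespace("", SVG_NS)  # keep plain <svg>/<path> tag names on output
    root = ET.fromstring(svg_text)
    for element in root.iter():
        region_id = element.get("id")
        if region_id in colors_by_id:
            # Note: this replaces any existing style attribute on the element.
            element.set("style", "fill:" + colors_by_id[region_id])
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-region map: one plain path, one group with a nested path.
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="RU-MOW" d="M0 0L1 1"/>'
    '<g id="RU-TYU"><path id="RU-YAN" d="M2 2L3 3"/></g>'
    '</svg>'
)

result = colorize(sample, {"RU-MOW": "#D73027", "RU-TYU": "#4575B4"})
print(result)
```

A fill set on a `<g>` is inherited by the paths inside it as long as they don't define a fill of their own, which is how a grouped region can pick up the color of its parent.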
Parameters
- -d: Region-specific data. Remove the headers and save it as a CSV file. Put the ISO 3166-2:RU code into the first column, and the value into the second column
- -c: Color scheme. It’s a simple text file containing the hexadecimal colors in reverse order, e.g.
#D73027 #FC8D59 #FEE090 #E0F3F8 #91BFDB #4575B4
Use ColorBrewer to help you select the colors to use.
- -l: Legend file, e.g.
1 0.5 0.25 0.10 0 -0.1
The script will assign the last color in your list if the data point is above the first value in your legend file, and so on
- -i: Input SVG file, i.e. the blank map
- -o: Output SVG file
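The way the color and legend files pair up can be sketched in a few lines of Python. This is an illustration of the rule described above, not a copy of the script: the function name `pick_color` is made up, and returning `None` for values at or below the lowest threshold is my assumption for the sketch.

```python
def pick_color(value, colors, legend):
    """Pick a color for a value.

    colors: hex strings in the order of the -c file.
    legend: thresholds in the order of the -l file (highest first).
    """
    # First threshold pairs with the last color, second with the
    # second-to-last color, and so on.
    for threshold, color in zip(legend, reversed(colors)):
        if value > threshold:
            return color
    # Assumption: anything at or below the lowest threshold stays uncolored.
    return None


colors = ["#D73027", "#FC8D59", "#FEE090", "#E0F3F8", "#91BFDB", "#4575B4"]
legend = [1, 0.5, 0.25, 0.10, 0, -0.1]

for value in (1.2, 0.3, -0.05, -0.5):
    print(value, pick_color(value, colors, legend))
```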
Enjoy
Easily unpivot data using Python

Denormalized data in multiple columns
I was looking for an easy way to unpivot data, i.e. expand values from multiple columns in a single record into multiple records with the same values in a single column.

Normalized data in multiple records
Since the only tool available was SQL Server/SSIS, I wrote a short script in Python (less than 50 lines of code) to easily unpivot CSV data.
Get the code here: http://pastie.textmate.org/713615
Options:
- -v Verbose
- -i Input file
- -o Output file
- -c Number of columns to be “frozen”, the default is one
Sample usage:
python unpivot.py -v -i GRP.csv -o GRPunpivot.csv -c 1
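For the curious, the idea behind unpivoting fits in a handful of lines. The sketch below is not the script linked above: it assumes the first row of the CSV is a header, that the output layout is "frozen columns, column name, value", and the sample data and the function name `unpivot` are invented for the example.

```python
import csv
import io


def unpivot(rows, frozen=1):
    """Yield one record per non-frozen column of every data row.

    rows: iterable of lists, the first one being the header row.
    frozen: number of leading columns repeated on every output record.
    """
    rows = iter(rows)
    header = next(rows)
    for row in rows:
        keys = row[:frozen]
        for name, value in zip(header[frozen:], row[frozen:]):
            yield keys + [name, value]


# Hypothetical wide table: one row per region, one column per year.
wide = "Region,2006,2007\nMoscow,10,12\nTver,3,4\n"

long_rows = list(unpivot(csv.reader(io.StringIO(wide)), frozen=1))

writer = csv.writer(io.StringIO())
for record in long_rows:
    print(record)
    writer.writerow(record)
```

Each wide row with two year columns becomes two long records, so the frozen column is repeated and the year moves from the header into the data.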

