Exploratory Data Analysis

Understanding data from the Election Commission of India

The 2004 Lok Sabha Elections

The Lok Sabha is the lower house of the Parliament of India (the Rajya Sabha is the upper house). Candidates are elected directly. Seats are allotted to each state and territory in proportion to its population. The Lok Sabha website states that "the total elective membership is distributed among the States in such a way that the ratio between the number of seats allotted to each State and the population of the State is, so far as practicable, the same for all States". This site also makes use of State Assembly Election data and turnout data in order to better understand how women have fared in past elections.

Background - 2004 State Assembly Elections

Data for this example may be found here.

This quick and dirty exploration of State Assembly candidates in the 2004 elections (Andhra Pradesh, Karnataka, Orissa and Sikkim) investigates candidate gender, age and schedule across states. Data was collected from the Election Commission of India.

The example presented here uses the ggplot2 library, which helps us build attractive layered graphics in R. In the code below, we load the library into our session and use R's read.delim function to make the data available for manipulation. read.delim returns a data.frame, which we store in a variable named "elec". I use the "=" symbol for assignment throughout the code, but you may also use "<-", which McQuaid has suggested conveys the relationship more intuitively. We can then check the data with the summary() and class() functions. Calling class(elec) confirms that our data is in fact a data.frame.
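A minimal sketch of the checking steps described above, using a small made-up data frame in place of the Election Commission file. The rows and the names "toy" and "toy2" are hypothetical and exist only for illustration; the column names are borrowed from the R code further down this page.

```r
# Toy stand-in for the election data (hypothetical rows, not real candidates).
toy = data.frame(
  state_name    = c("Sikkim", "Karnataka", "Orissa"),
  candidate_sex = c("F", "M", "M"),
  candidate_age = c(34, 51, 46),
  stringsAsFactors = TRUE
)

# "=" and "<-" both perform assignment at the top level.
toy2 <- toy
identical(toy, toy2)   # TRUE

# Confirm the container type, then look at the per-column summaries.
class(toy)             # "data.frame"
summary(toy)
```

The same two calls, class(elec) and summary(elec), apply unchanged once the real file has been read in.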

The examples to the left illustrate three types of pie charts available to us in ggplot2: a standard pie chart, a bullseye chart and a racetrack chart. Hadley Wickham's documentation and tutorials on the coord_polar function can be found here.
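As a rough sketch of how the three shapes relate, the block below builds each one from the same bar geometry and changes only which axis coord_polar bends into the angle. The toy data frame and its values are hypothetical, and the block assumes a current ggplot2 release.

```r
library(ggplot2)

# Hypothetical toy data: one row per candidate.
toy = data.frame(
  age_group = factor(c("25-40", "25-40", "40-60", "40-60", "40-60", "60+")),
  category  = factor(c("GEN", "SC", "GEN", "GEN", "ST", "GEN"))
)

# A single stacked bar of counts, one segment per age group.
stacked = ggplot(toy, aes(x = factor(1), fill = age_group)) +
  geom_bar(width = 1)

# Standard pie: the counts (y axis) are mapped to the angle.
standard = stacked + coord_polar(theta = "y")

# Bullseye: the x axis is mapped to the angle, so the stacked
# counts turn into concentric rings.
bullseye = stacked + coord_polar(theta = "x")

# Racetrack: one bar per age group, with the counts bent around the circle.
racetrack = ggplot(toy, aes(x = age_group, fill = category)) +
  geom_bar(width = 0.9) +
  coord_polar(theta = "y")
```

Calling print(standard), print(bullseye) or print(racetrack) draws the corresponding chart.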

The Graphics

fig. 1 - Displays the total number of female candidates against males. Here it is interesting to note that the total number of female candidates competing for seats is at least a third. The women’s reservation bill allots 33% of seats to female candidates. To be clear, we are not looking at winning candidates.

fig. 2 - India’s reservation system provides that seats should be reserved for “other backward castes”, an amalgam of what we know electorally as Scheduled Tribes, Scheduled Castes, and here in the 2004 elections in Sikkim, a little-known ethnic group called Bhutia Lepcha. Simply put, we see that over two thirds of candidates competing for Assembly seats represent “depreciated classes”.

fig. 3 - An interesting story in the history of Indian elections is the growing number of candidates below the age of 40 entering politics. This is part of a development that Yogendra Yadav has commented on in India’s second democratic upsurge of the 1990s.

fig. 4 - Age visualized again as a bull’s eye plot.

fig. 5 - A racetrack display of age by candidate type. Racetrack graphics aren’t ideal. They require a lot of work to decipher. This could just as well be displayed by a bar chart.

fig. 6 - A histogram showing the same data more effectively.
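The listing on this page does not include the code for fig. 6, so the following is only a sketch of how such a histogram might be drawn, not the original code. The toy data frame, its values and the 10-year bin width are assumptions; the column names follow those used for the real data.

```r
library(ggplot2)

# Hypothetical toy data standing in for the elec data.frame.
toy = data.frame(
  candidate_age      = c(27, 33, 38, 45, 47, 52, 58, 63, 71),
  candidate_category = c("GEN", "SC", "GEN", "GEN", "ST", "GEN", "SC", "GEN", "GEN")
)

# Histogram of age in 10-year bins starting at 20, stacked by candidate category.
bars = ggplot(toy, aes(x = candidate_age, fill = candidate_category)) +
  geom_histogram(binwidth = 10, boundary = 20) +
  ggtitle("Candidate Age by Type (toy data)")
```

Because the bars share a common baseline, the age groups can be compared by height alone, which is what the racetrack display makes difficult.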

R Code

#=====================================
# Quick and simple pie with R and ggplot2
# by Clint Newsom
#
# hcnewsom@gmail.com
#=====================================

# read our tab separated .txt file.

elec = read.delim("http://hcnewsom.org/expldata/QryCandidates_AC.txt", sep="\t",header=FALSE, col.names=c("state_name","ac_number", "ac_name","schedule", "cand_sl_no", "candidate_name", "candidate_sex", "party_abbreviation", "party_name", "symbol_number", "symbol_description", "candidate_age", "candidate_category"))

# attach our data.frame

attach(elec)

# check our data.

head(elec)

# let's see how many states we're dealing with. Turns out we have 4 states that ran elections to state assemblies in 2004.

summary(elec)

# load ggplot2

library(ggplot2)

# now let's make a pie comparing male to female contestants. Generates fig. 1

pie = ggplot(elec, aes(x=factor(1), fill=candidate_sex)) + geom_bar(width=1) + ggtitle("Male vs. Female Contestants: 2004 State Assembly Elections")

pie + coord_polar(theta="y")

# save our output.

ggsave("images/male_v_female_pie.png")

# make a pie comparing schedules competing in each election. Generates fig. 2

pie = ggplot(elec, aes(x=factor(1), fill=schedule)) + geom_bar(width=1) + ggtitle("Schedule Comparison: 2004 State Assembly Elections")

pie + coord_polar(theta="y")

ggsave("images/schedules_pie.png")

# let's try one more pie based on candidate age. We need to establish a few cutpoints and groups.
# first, let's see where to start. Summary will give us a good idea.

summary(elec$candidate_age)

# we establish groups that will appear as labels in our legend.

groups=c("25-30","30-40", "40-50", "50-60", "60-70", "70-80", "80-87")

# create some cutpoints. Here I illustrate the use of min and max.

cutpoints=c(min(candidate_age),30,40,50,60,70,80,max(candidate_age))

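# now we use the cut function. To learn more about cut, type ?cut in the R console.
# include.lowest=TRUE keeps the youngest candidates, whose age equals the first cutpoint, from becoming NA.

divs=cut(candidate_age, breaks=cutpoints,labels=groups, include.lowest=TRUE)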

# build a pie chart. Generates fig. 3

pie = ggplot(elec, aes(x = factor(1), fill = factor(divs))) + geom_bar(width = 1) + ggtitle("Candidate Age: 2004 India State Assembly Elections")

pie + coord_polar(theta = "y")

ggsave("images/assembly_candidate_ages.png")

# this will output a bullseye chart.

pie + coord_polar(theta = "x")

ggsave("images/assembly_candidate_ages_bullseye.png")

# a racetrack chart of age group by candidate type. Generates fig. 5

pie = ggplot(elec, aes(x = divs, fill = factor(candidate_category))) + geom_bar(width = 0.9) + ggtitle("Candidate Age by Type: 2004 India State Assembly Elections")

pie + coord_polar(theta = "y")

ggsave("images/assembly_candidate_ages_by_type_racetrack.png")
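The age-binning step in the listing above relies on cut(), which by default builds intervals that are open on the left. A value equal to the lowest cutpoint therefore becomes NA unless include.lowest = TRUE is passed. The sketch below shows the difference on a handful of made-up ages; the vector "ages" and the names "default_bins" and "closed_bins" are hypothetical.

```r
# Hypothetical ages; 25 is the minimum and sits exactly on the lowest break.
ages = c(25, 29, 30, 41, 55, 68, 87)

cutpoints = c(min(ages), 30, 40, 50, 60, 70, 80, max(ages))
groups = c("25-30", "30-40", "40-50", "50-60", "60-70", "70-80", "80-87")

# Default: intervals are (low, high], so the minimum itself is left out.
default_bins = cut(ages, breaks = cutpoints, labels = groups)

# include.lowest = TRUE closes the first interval on the left as well.
closed_bins = cut(ages, breaks = cutpoints, labels = groups, include.lowest = TRUE)

sum(is.na(default_bins))   # 1: the youngest candidate is dropped
sum(is.na(closed_bins))    # 0

# Share of candidates in each age group.
round(prop.table(table(closed_bins)), 2)
```

Running sum(is.na(divs)) on the real data is a quick way to check whether any candidates fell outside the bins.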
