Demographics of Women in India (A data-centric view)

Sambit Mahapatra
Mar 7, 2020

India is among the fastest-growing economies and is the largest democracy in the world, with the potential to become the next superpower in the coming couple of decades. In a country like India, where women are the binding force of the very culture and society, the empowerment of women is the empowerment of the country. During the foreign invasions and British colonization, the country gradually became a men-centric one. But in the last couple of decades, we have been getting back on track.

Here we take a data-centric view of the demographics of women in India as of the 2011 census, and of their socio-economic growth compared with the previous census in 2001.

The data analysis behind these insights is divided into 3 major steps.

Data Collection

Before starting the analysis, we import the required libraries: pandas for data processing, and seaborn and matplotlib for plotting.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)

We got the raw data of Govt. of India Census, 2001 District-Wise and 2011 District-Wise from Kaggle at https://www.kaggle.com/bazuka/census2001 and https://www.kaggle.com/danofer/india-census/version/1#india-districts-census-2011.csv

Then we selected the data required for our analysis and did some cleansing to help with it.

# Read the data
df_2011 = pd.read_csv("india-districts-census-2011.csv")
df_2001 = pd.read_csv("india_census_2001.csv")
# Select the required columns
df_2011 = df_2011[['District code', 'State name', 'District name', 'Population', 'Male', 'Female', 'Literate', 'Male_Literate', 'Female_Literate', 'Workers', 'Male_Workers', 'Female_Workers', 'Age_Group_0_29', 'Age_Group_30_49', 'Age_Group_50']]
df_2001 = df_2001[['State', 'District', 'Persons', 'Males', 'Females', 'Persons..literate', 'Males..Literate', 'Females..Literate']]

The state names differed between the two datasets, so we had to map them to each other to make the state-wise analysis easier. The dataset now looks like:

2011 Census India
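The state-name mapping mentioned above is not shown in the post. A minimal sketch of one way to do it follows, assuming the names differ only by case, whitespace, and a few renames. The toy frames, the `state_name_map` dictionary, and the `normalise_state` helper are all hypothetical and are not taken from the original notebook.

```python
import pandas as pd

# Hypothetical toy frames; the real datasets have many more rows and columns,
# and their actual spellings may differ from the ones used here.
df_2001_toy = pd.DataFrame({"State": ["Orissa", "Pondicherry", "Kerala"],
                            "Persons": [100, 50, 80]})
df_2011_toy = pd.DataFrame({"State name": ["ODISHA", "PUDUCHERRY", "KERALA"],
                            "Population": [110, 60, 85]})

# Hypothetical rename map, applied after upper-casing and trimming.
state_name_map = {"ORISSA": "ODISHA", "PONDICHERRY": "PUDUCHERRY"}

def normalise_state(name: str) -> str:
    """Upper-case, trim, and apply the rename map."""
    cleaned = name.strip().upper()
    return state_name_map.get(cleaned, cleaned)

df_2001_toy["state"] = df_2001_toy["State"].map(normalise_state)
df_2011_toy["state"] = df_2011_toy["State name"].map(normalise_state)

# Names present in one census but not the other; empty once the map is complete.
unmatched = set(df_2001_toy["state"]) ^ set(df_2011_toy["state"])
print(unmatched)
```

Checking the symmetric difference of the two name sets is a cheap way to spot states that still fail to match before any state-wise comparison is made.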

We need to aggregate the existing stats state-wise, because we are doing a state-wise analysis but the dataset is district-wise. A code snippet:

#2011 data processing
df_2011_pr = pd.DataFrame()
df_2011_pr['state'] = df_2011['State name'].unique().tolist()
df_2011_pr['population'] = df_2011.groupby(['State name'], sort=False)['Population'].sum().tolist()
df_2011_pr['population_male'] = df_2011.groupby(['State name'], sort=False)['Male'].sum().tolist()
......
df_2001_pr = pd.DataFrame()
df_2001_pr['state'] = df_2001['State'].unique().tolist()
df_2001_pr['population'] = df_2001.groupby(['State'], sort=False)['Persons'].sum().tolist()
df_2001_pr['population_male'] = df_2001.groupby(['State'], sort=False)['Males'].sum().tolist()
df_2001_pr['population_female'] = df_2001.groupby(['State'], sort=False)['Females'].sum().tolist()
......

Now we have 35 rows in each data frame, corresponding to the census data of the 28 States + 7 Union Territories of India as they stood at the time of the census.
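The column-by-column aggregation above can also be written as a single `groupby(...).agg(...)` call, which keeps each state name attached to its own sums instead of relying on two lists coming out in the same order. This is only a sketch on a toy frame. The column names follow the 2011 dataset as used in the post, while the values and the `state_level` name are made up for illustration.

```python
import pandas as pd

# Toy district-level rows using the 2011 column names from the post.
districts = pd.DataFrame({
    "State name": ["A", "A", "B"],
    "Population": [100, 200, 50],
    "Male": [52, 101, 24],
    "Female": [48, 99, 26],
})

# Named aggregation: output column = (input column, aggregation function).
state_level = (
    districts.groupby("State name", sort=False)
    .agg(
        population=("Population", "sum"),
        population_male=("Male", "sum"),
        population_female=("Female", "sum"),
    )
    .reset_index()
    .rename(columns={"State name": "state"})
)
print(state_level)
```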

Data Processing

Now we will analyse the demographics of women in terms of population, literacy, and the working class as of 2011, and the corresponding change from 2001. For this, we calculated 7 factors from the existing data, including sex_ratio, sex_ratio_literate (sex ratio among literates), sex_ratio_workers (sex ratio among workers), literacy_rate_female (female literacy rate), and working_women_pr (female workers in the total working class). All these stats give the number of women per 1000 of the respective group. A code snippet:

#new columns for stat 2011
df_2011_pr['sex_ratio'] = df_2011_pr['population_female']*1000//df_2011_pr['population_male']
df_2011_pr['sex_ratio_literate'] = df_2011_pr['literate_female']*1000//df_2011_pr['literate_male']
.......
#new columns for stat 2001
df_2001_pr['sex_ratio'] = df_2001_pr['population_female']*1000//df_2001_pr['population_male']
df_2001_pr['sex_ratio_literate'] = df_2001_pr['literate_female']*1000//df_2001_pr['literate_male']
.......
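To make the arithmetic concrete, here is a small worked example of the per-1000 calculation on made-up numbers. The `per_thousand` helper is a hypothetical name introduced for illustration. It uses the same integer floor division (`//`) as the snippet above, so any fractional part is dropped rather than rounded.

```python
import pandas as pd

def per_thousand(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Units of `numerator` per 1000 units of `denominator`, floored."""
    return numerator * 1000 // denominator

# Made-up counts, not census figures.
toy = pd.DataFrame({
    "population_female": [990, 480],
    "population_male": [1000, 520],
})
toy["sex_ratio"] = per_thousand(toy["population_female"], toy["population_male"])
print(toy["sex_ratio"].tolist())  # [990, 923]
```

The second row works out to 480 * 1000 / 520 ≈ 923.08, which floor division reports as 923.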

Stats and Visualization

We got quite a few stats from the 2011 census and from its comparison with 2001. To see the full analysis, go through the GitHub link attached to the post. Here we will go through some of the major ones.

When it comes to sex ratio, we usually mean the number of women per 1000 men in the population. But the gender gap needs to close in the literate and working-class groups as well. Here are some interesting stats on the top 10 states with the highest sex ratio in India:

#Important stats
df_2011_pr.sort_values(by=['sex_ratio'], ascending=False).iloc[:10][['state','sex_ratio','sex_ratio_literate','sex_ratio_workers']]
Top 10 states in terms of sex ratio (2011)

As we see here, states like Kerala and Pondicherry have a high sex_ratio but need to do a lot more to attract women to the workforce. Another thing to notice is that the sex_ratio among literates is not that impressive: in Chhattisgarh, for example, there are 990 females for every 1000 males, but only 746 female literates for every 1000 male literates. An important observation is that north-eastern states like Manipur, which have seen very high GDP growth lately (30–40%), also have strong representation of women in their socio-economy.

df_temp = df_2011_pr.sort_values(by=['sex_ratio'], ascending=False).iloc[:5][['state','sex_ratio','working_women_pr','literacy_rate_female', 'literacy_rate']]
df_temp = pd.melt(df_temp, id_vars=['state'], value_vars=['literacy_rate_female', 'literacy_rate', 'working_women_pr'])
plt.figure(figsize=(12, 5))
sns.barplot(y="state", x="value", hue='variable', data=df_temp)
Top 5 states in sex ratio (2011)

An important point here is that the gender gap in literacy is diminishing, but India still has a lot to achieve in bringing women into the workforce to help the economy thrive.

We also wanted to look at the youth demographics. We treated people aged 0–29 as youth.

df_temp = pd.melt(df_2011_pr.sort_values(by=['sex_ratio'], ascending=False).iloc[:10], id_vars=['state'], value_vars=['literacy_rate_female','working_women_pr','youth_population'])
plt.figure(figsize=(18, 7))
sns.lineplot(x="state", y="value", hue='variable', data=df_temp)

An important trend here is that the larger the youth population, the higher the representation of women in the working class. This suggests that, with time, women are gradually playing a bigger part in India's workforce.

Next we wanted to compare the change in demographics with respect to population, literacy, and workforce. But we couldn't get workforce-related data from the 2001 census, so the comparison covers population and literacy only, and the results are interesting.
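A trend read off a line plot can be checked with a number. One option, not used in the original analysis, is the Pearson correlation between the two columns. The sketch below runs on made-up values, so it demonstrates only the call and not the census result. With only ten states in the plot, any such figure would be a rough indication at best.

```python
import pandas as pd

# Made-up values purely for illustration; not census figures.
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "youth_population": [50.0, 55.0, 60.0, 65.0],
    "working_women_pr": [250.0, 300.0, 320.0, 400.0],
})

# Pearson correlation between the two columns (ranges from -1 to 1).
corr = toy["youth_population"].corr(toy["working_women_pr"])
print(round(corr, 2))
```

A value near 1 would support the reading that the two quantities rise together, though it would not show that one causes the other.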

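# Sort both frames by state and reset the index, so the column arithmetic below
# pairs rows by position (pandas aligns on index labels, not on row order)
df_2001_pr = df_2001_pr.sort_values(by=['state']).reset_index(drop=True)
df_2011_pr = df_2011_pr.sort_values(by=['state']).reset_index(drop=True)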
df_comp = pd.DataFrame()
df_comp['state'] = list(df_2011_pr['state'])
df_comp['sex_ratio_change'] = df_2011_pr['sex_ratio'] - df_2001_pr['sex_ratio']
df_comp['sex_ratio_literate_change'] = df_2011_pr['sex_ratio_literate'] - df_2001_pr['sex_ratio_literate']
df_comp['literacy_rate_female_change'] = df_2011_pr['literacy_rate_female'] - df_2001_pr['literacy_rate_female']
df_comp['literacy_rate_change'] = df_2011_pr['literacy_rate'] - df_2001_pr['literacy_rate']
df_comp.sort_values(by=['sex_ratio_change'], ascending=False).iloc[:5,:3]
Top 5 states in terms of growth in sex ratio (2011)

The important observation here is that in 3 of the 5 states, the growth in sex ratio among literates is about double the growth in sex ratio of the overall population. This is as it should be, since the gender gap is much wider in literacy than in population.
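When comparing two census years, each state's 2011 row has to be paired with that same state's 2001 row. One way to guarantee this, sketched here on toy data with made-up values, is to join the two frames on the state name rather than rely on row order. The frame names `s2001`, `s2011`, and `both` are hypothetical, and the sketch assumes the state names have already been mapped to a common spelling.

```python
import pandas as pd

# Toy state-level frames; rows are deliberately in different orders.
s2001 = pd.DataFrame({"state": ["B", "A", "C"], "sex_ratio": [900, 950, 980]})
s2011 = pd.DataFrame({"state": ["A", "B", "C"], "sex_ratio": [960, 920, 985]})

# Join on the state name; validate raises if a state appears twice on either side.
both = s2011.merge(s2001, on="state", suffixes=("_2011", "_2001"),
                   validate="one_to_one")
both["sex_ratio_change"] = both["sex_ratio_2011"] - both["sex_ratio_2001"]
print(both[["state", "sex_ratio_change"]])
```

Because `merge` defaults to an inner join, a state missing from either year is dropped from the result without warning, so it is worth comparing the row count of `both` against the inputs.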

Now to conclude with some major stats:

female_2001 = round(df_2001['Females'].sum()*100/df_2001['Persons'].sum(), 2)
female_2011 = round(df_2011['Female'].sum()*100/df_2011['Population'].sum(), 2)
female_growth = round(female_2011 - female_2001, 2)
female_literates_2001 = round(df_2001['Females..Literate'].sum()*100/df_2001['Persons..literate'].sum(), 2)
female_literates_2011 = round(df_2011['Female_Literate'].sum()*100/df_2011['Literate'].sum(), 2)
female_literates_growth = round(female_literates_2011 - female_literates_2001, 2)
female_workers_2001 = 'NA'
female_workers_2011 = round(df_2011['Female_Workers'].sum()*100/df_2011['Workers'].sum(), 2)
female_workers_growth = 'NA'
df_stat = pd.DataFrame()
df_stat['name'] = ['population', 'Literacy', 'Workers']
df_stat['2001'] = [str(female_2001)+'%', str(female_literates_2001)+'%', female_workers_2001]
df_stat['2011'] = [str(female_2011)+'%', str(female_literates_2011)+'%', str(female_workers_2011)+'%']
df_stat['growth'] = ['+'+str(female_growth)+'%', '+'+str(female_literates_growth)+'%', female_workers_growth]

As we can see, India did impressively well in raising the female share of literates by 3.12 percentage points, while the female share of the population rose by a modest 0.27 percentage points. Women made up 31.12% of the workforce as of 2011. Since 2011 there has been a dramatic change in technology, economy, and culture, with more focus on gender equality, so we expect to see whopping growth in 2021.

As the above stats show, as of the 2011 census women had made great strides in many areas, with notable progress in reducing some gender gaps and strengthening the demographic and socio-economic picture of one of the oldest civilizations. We await the 2021 census to track the trend in the ongoing decade.

Sambit Mahapatra

Putting ML to Customer Support at CSAT.AI | Natural Language Processing | Full Stack Data Scientist (sambit9238@gmail.com)