Week 3 Assignment: Data Management

Variables Selected

As part of this assignment, I have selected the following three variables from the NESARC dataset to perform data manipulation operations:

Had hardening of arteries in last 12 months (S13Q6A1)
Had high blood pressure in last 12 months (S13Q6A2)
Had heart attack in last 12 months (S13Q6A7)

Python Program to perform Data Management

Below is the Python program that I used to perform data management operations on the selected variables:

import pandas
import numpy
# setting options so that pandas data do not get truncated
 
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
 
 
print (f"Total rows in NESARC dataset: {len(data)}") #number of observations (rows)
print (f"Total columns in NESARC dataset: {len(data.columns)}") # number of variables (columns)
 
# column names
colHardeningOfArteries = 'S13Q6A1'
colHighBloodPressure = 'S13Q6A2'
colHeartAttack = 'S13Q6A7'
 
#setting variables you will be working with to numeric (updated)
data[colHardeningOfArteries] = pandas.to_numeric(data[colHardeningOfArteries], errors='coerce')
data[colHighBloodPressure] = pandas.to_numeric(data[colHighBloodPressure], errors='coerce')
data[colHeartAttack] = pandas.to_numeric(data[colHeartAttack], errors='coerce')
 
recode = {1:1, 2:0}
 
# Frequency distribution for hardening of arteries without any data management
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries}")
c1 = data[colHardeningOfArteries].value_counts(sort=False)
print(c1)
 
# Frequency distribution for hardening of arteries by indicating missing data
data[colHardeningOfArteries] = data[colHardeningOfArteries].replace(9, numpy.nan)
 
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with missing data (9) recoded as NaN")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
 
# Frequency distribution with No mapped to 0
data[colHardeningOfArteries] = data[colHardeningOfArteries].map(recode)
 
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
 
# Frequency distribution for high blood pressure without any data management
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure}")
c2 = data[colHighBloodPressure].value_counts(sort=False)
print(c2)
 
# Frequency distribution for high blood pressure by indicating missing data
data[colHighBloodPressure] = data[colHighBloodPressure].replace(9, numpy.nan)
 
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
 
# Frequency distribution with No mapped to 0
data[colHighBloodPressure] = data[colHighBloodPressure].map(recode)
 
print(f"Counts of people having high blood pressure in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
 
# Frequency distribution for heart attack without any data management
print(f"Counts of people having heart attack in last 12 months: {colHeartAttack}")
c3 = data[colHeartAttack].value_counts(sort=False)
print(c3)
 
# Frequency distribution for heart attack by indicating missing data
data[colHeartAttack] = data[colHeartAttack].replace(9, numpy.nan)
 
print(f"Counts of people having heart attack in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)
 
# Frequency distribution with No mapped to 0
data[colHeartAttack] = data[colHeartAttack].map(recode)
 
print(f"Counts of people having heart attack in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)

Output of the Data Management Program

Below is the output from the data management Python program:

Total rows in NESARC dataset: 43093

Total columns in NESARC dataset: 3010

Counts of people having hardening of arteries in last 12 months: S13Q6A1

2 40917

1 911

9 1265

Name: S13Q6A1, dtype: int64

Counts of people having hardening of arteries in last 12 months: S13Q6A1 with missing data (9) recoded as NaN

2.0 40917

1.0 911

NaN 1265

Name: S13Q6A1, dtype: int64

Counts of people having hardening of arteries in last 12 months: S13Q6A1 with No (2) recoded as 0

0.0 40917

1.0 911

NaN 1265

Name: S13Q6A1, dtype: int64

Counts of people having high blood pressure in last 12 months: S13Q6A2

2 32828

1 9136

9 1129

Name: S13Q6A2, dtype: int64

Counts of people having high blood pressure in last 12 months: S13Q6A2 with missing data (9) recoded as NaN

2.0 32828

1.0 9136

NaN 1129

Name: S13Q6A2, dtype: int64

Counts of people having high blood pressure in last 12 months: S13Q6A1 with No (2) recoded as 0

0.0 32828

1.0 9136

NaN 1129

Name: S13Q6A2, dtype: int64

Counts of people having heart attack in last 12 months: S13Q6A7

2 41557

1 470

9 1066

Name: S13Q6A7, dtype: int64

Counts of people having heart attack in last 12 months: S13Q6A2 with missing data (9) recoded as NaN

2.0 41557

1.0 470

NaN 1066

Name: S13Q6A7, dtype: int64

Counts of people having heart attack in last 12 months: S13Q6A1 with No (2) recoded as 0

0.0 41557

1.0 470

NaN 1066

Name: S13Q6A7, dtype: int64

Data Management Steps Performed

I have performed the following data management operations on the selected variables:

As shown in the Python code above, as a first step I used the errors='coerce' parameter in the pandas.to_numeric method to convert invalid data to NaN
Next I used the replace method on the data set column to recode 9 to NaN. For all three selected variables, a value of 9 represented UNKNOWN response
Lastly, I used the map method on the data set column to map all 2 values to 0. For all three selected variables, a response of 2 represented No. Since in some programming languages False is also represented as 0, this value was more intuitive than 2.
For all the three variables 1 indicates YES, 2 (and 0) indicates NO and 9 (and NaN) indicates UNKNOWN response.

Search This Blog

Data Visualization

Week 3 Assignment: Data Management

Comments

Post a Comment

Popular posts from this blog

Data Management and Visualization - Research Topic

Week 4 Assignmemt: Graphing Decisions