Week 3 Assignment: Data Management
Variables Selected
As part of this assignment, I have selected the following three variables from the NESARC dataset to perform data manipulation operations:
- Had hardening of arteries in last 12 months (S13Q6A1)
- Had high blood pressure in last 12 months (S13Q6A2)
- Had heart attack in last 12 months (S13Q6A7)
Python Program to perform Data Management
Below is the Python program that I used to perform data management operations on the selected variables:
import pandas
import numpy
# setting options so that pandas data do not get truncated
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
print (f"Total rows in NESARC dataset: {len(data)}") #number of observations (rows)
print (f"Total columns in NESARC dataset: {len(data.columns)}") # number of variables (columns)
# column names
colHardeningOfArteries = 'S13Q6A1'
colHighBloodPressure = 'S13Q6A2'
colHeartAttack = 'S13Q6A7'
#setting variables you will be working with to numeric (updated)
data[colHardeningOfArteries] = pandas.to_numeric(data[colHardeningOfArteries], errors='coerce')
data[colHighBloodPressure] = pandas.to_numeric(data[colHighBloodPressure], errors='coerce')
data[colHeartAttack] = pandas.to_numeric(data[colHeartAttack], errors='coerce')
recode = {1:1, 2:0}
# Frequency distribution for hardening of arteries without any data management
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries}")
c1 = data[colHardeningOfArteries].value_counts(sort=False)
print(c1)
# Frequency distribution for hardening of arteries by indicating missing data
data[colHardeningOfArteries] = data[colHardeningOfArteries].replace(9, numpy.nan)
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with missing data (9) recoded as NaN")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
# Frequency distribution with No mapped to 0
data[colHardeningOfArteries] = data[colHardeningOfArteries].map(recode)
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0 ")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
# Frequency distribution for high blood pressure without any data management
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure}")
c2 = data[colHighBloodPressure].value_counts(sort=False)
print(c2)
# Frequency distribution for high blood pressure by indicating missing data
data[colHighBloodPressure] = data[colHighBloodPressure].replace(9, numpy.nan)
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
# Frequency distribution with No mapped to 0
data[colHighBloodPressure] = data[colHighBloodPressure].map(recode)
print(f"Counts of people having high blood pressure in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0 ")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
# Frequency distribution for heart attack without any data management
print(f"Counts of people having heart attack in last 12 months: {colHeartAttack}")
c3 = data[colHeartAttack].value_counts(sort=False)
print(c3)
# Frequency distribution for heart attack by indicating missing data
data[colHeartAttack] = data[colHeartAttack].replace(9, numpy.nan)
print(f"Counts of people having heart attack in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)
# Frequency distribution with No mapped to 0
data[colHeartAttack] = data[colHeartAttack].map(recode)
print(f"Counts of people having heart attack in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0 ")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)Output of the Data Management Program
Below is the output from the data management Python program:
Total rows in NESARC dataset: 43093
Total columns in NESARC dataset: 3010
Counts of people having hardening of arteries in last 12 months: S13Q6A1
2 40917
1 911
9 1265
Name: S13Q6A1, dtype: int64
Counts of people having hardening of arteries in last 12 months: S13Q6A1 with missing data (9) recoded as NaN
2.0 40917
1.0 911
NaN 1265
Name: S13Q6A1, dtype: int64
Counts of people having hardening of arteries in last 12 months: S13Q6A1 with No (2) recoded as 0
0.0 40917
1.0 911
NaN 1265
Name: S13Q6A1, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A2
2 32828
1 9136
9 1129
Name: S13Q6A2, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A2 with missing data (9) recoded as NaN
2.0 32828
1.0 9136
NaN 1129
Name: S13Q6A2, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A1 with No (2) recoded as 0
0.0 32828
1.0 9136
NaN 1129
Name: S13Q6A2, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A7
2 41557
1 470
9 1066
Name: S13Q6A7, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A2 with missing data (9) recoded as NaN
2.0 41557
1.0 470
NaN 1066
Name: S13Q6A7, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A1 with No (2) recoded as 0
0.0 41557
1.0 470
NaN 1066
Name: S13Q6A7, dtype: int64
Data Management Steps Performed
I have performed the following data management operations on the selected variables:
- As shown in the Python code above, as a first step I used the errors='coerce' parameter in the pandas.to_numeric method to convert invalid data to NaN
- Next I used the replace method on the data set column to recode 9 to NaN. For all three selected variables, a value of 9 represented UNKNOWN response
- Lastly, I used the map method on the data set column to map all 2 values to 0. For all three selected variables, a response of 2 represented No. Since in some programming languages False is also represented as 0, this value was more intuitive than 2.
- For all the three variables 1 indicates YES, 2 (and 0) indicates NO and 9 (and NaN) indicates UNKNOWN response.
Comments
Post a Comment