Week 3 Assignment: Data Management

Variables Selected

As part of this assignment, I have selected the following three variables from the NESARC dataset to perform data manipulation operations:

  • Had hardening of arteries in last 12 months (S13Q6A1)
  • Had high blood pressure in last 12 months (S13Q6A2)
  • Had heart attack in last 12 months (S13Q6A7)

Python Program to perform Data Management
Below is the Python program that I used to perform data management operations on the selected variables:

import pandas
import numpy
# setting options so that pandas data do not get truncated
 
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
 
 
print (f"Total rows in NESARC dataset: {len(data)}") #number of observations (rows)
print (f"Total columns in NESARC dataset: {len(data.columns)}") # number of variables (columns)
 
# column names
colHardeningOfArteries = 'S13Q6A1'
colHighBloodPressure = 'S13Q6A2'
colHeartAttack = 'S13Q6A7'
 
#setting variables you will be working with to numeric (updated)
data[colHardeningOfArteries] = pandas.to_numeric(data[colHardeningOfArteries], errors='coerce')
data[colHighBloodPressure] = pandas.to_numeric(data[colHighBloodPressure], errors='coerce')
data[colHeartAttack] = pandas.to_numeric(data[colHeartAttack], errors='coerce')
 
recode = {1:1, 2:0}
 
# Frequency distribution for hardening of arteries without any data management
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries}")
c1 = data[colHardeningOfArteries].value_counts(sort=False)
print(c1)
 
# Frequency distribution for hardening of arteries by indicating missing data
data[colHardeningOfArteries] = data[colHardeningOfArteries].replace(9, numpy.nan)
 
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with missing data (9) recoded as NaN")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
 
# Frequency distribution with No mapped to 0
data[colHardeningOfArteries] = data[colHardeningOfArteries].map(recode)
 
print(f"Counts of people having hardening of arteries in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c1 = data[colHardeningOfArteries].value_counts(sort=False, dropna=False)
print(c1)
 
# Frequency distribution for high blood pressure without any data management
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure}")
c2 = data[colHighBloodPressure].value_counts(sort=False)
print(c2)
 
# Frequency distribution for high blood pressure by indicating missing data
data[colHighBloodPressure] = data[colHighBloodPressure].replace(9, numpy.nan)
 
print(f"Counts of people having high blood pressure in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
 
# Frequency distribution with No mapped to 0
data[colHighBloodPressure] = data[colHighBloodPressure].map(recode)
 
print(f"Counts of people having high blood pressure in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c2 = data[colHighBloodPressure].value_counts(sort=False, dropna=False)
print(c2)
 
# Frequency distribution for heart attack without any data management
print(f"Counts of people having heart attack in last 12 months: {colHeartAttack}")
c3 = data[colHeartAttack].value_counts(sort=False)
print(c3)
 
# Frequency distribution for heart attack by indicating missing data
data[colHeartAttack] = data[colHeartAttack].replace(9, numpy.nan)
 
print(f"Counts of people having heart attack in last 12 months: {colHighBloodPressure} with missing data (9) recoded as NaN")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)
 
# Frequency distribution with No mapped to 0
data[colHeartAttack] = data[colHeartAttack].map(recode)
 
print(f"Counts of people having heart attack in last 12 months: {colHardeningOfArteries} with No (2) recoded as 0  ")
c3 = data[colHeartAttack].value_counts(sort=False, dropna=False)
print(c3)


Output of the Data Management Program
Below is the output from the data management Python program:

Total rows in NESARC dataset: 43093
Total columns in NESARC dataset: 3010
Counts of people having hardening of arteries in last 12 months: S13Q6A1
2    40917
1      911
9     1265
Name: S13Q6A1, dtype: int64
Counts of people having hardening of arteries in last 12 months: S13Q6A1 with missing data (9) recoded as NaN
2.0    40917
1.0      911
NaN     1265
Name: S13Q6A1, dtype: int64
Counts of people having hardening of arteries in last 12 months: S13Q6A1 with No (2) recoded as 0  
0.0    40917
1.0      911
NaN     1265
Name: S13Q6A1, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A2
2    32828
1     9136
9     1129
Name: S13Q6A2, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A2 with missing data (9) recoded as NaN
2.0    32828
1.0     9136
NaN     1129
Name: S13Q6A2, dtype: int64
Counts of people having high blood pressure in last 12 months: S13Q6A1 with No (2) recoded as 0  
0.0    32828
1.0     9136
NaN     1129
Name: S13Q6A2, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A7
2    41557
1      470
9     1066
Name: S13Q6A7, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A2 with missing data (9) recoded as NaN
2.0    41557
1.0      470
NaN     1066
Name: S13Q6A7, dtype: int64
Counts of people having heart attack in last 12 months: S13Q6A1 with No (2) recoded as 0  
0.0    41557
1.0      470
NaN     1066
Name: S13Q6A7, dtype: int64


Data Management Steps Performed
I have performed the following data management operations on the selected variables:
  • As shown in the Python code above, as a first step I used the errors='coerce' parameter in the pandas.to_numeric method to convert invalid data to NaN
  • Next I used the replace method on the data set column to recode 9 to NaN. For all three selected variables, a value of 9 represented UNKNOWN response
  • Lastly, I used the map method on the data set column to map all 2 values to 0. For all three selected variables, a response of 2 represented No. Since in some programming languages False is also represented as 0, this value was more intuitive than 2.
  • For all the three variables 1 indicates YES, 2 (and 0) indicates NO and 9 (and NaN) indicates UNKNOWN response.

Comments

Popular posts from this blog

Data Management and Visualization - Research Topic

Week 4 Assignmemt: Graphing Decisions