Week 4 Assignment: Graphing Decisions
NESARC Dataset: Questions Selected
For this assignment I examine the association between how frequently a person drank alcohol during the period of alcohol abuse (S2AQ21A: Explanatory Variable) and whether or not the person suffered from any of the following cardiovascular medical conditions in the last 12 months:
- Hardening of Arteries (S13Q6A1: Response Variable)
- High Blood Pressure (S13Q6A2: Response Variable)
- Heart Attack (S13Q6A7: Response Variable)
All four variables are categorical.
Univariate Bar Graphs
Since all variables are categorical, I have used Bar Graphs for Univariate plots. Below are the bar graphs for the four variables:
Descriptions for the selected variables
I have used the Pandas describe() method to describe the variables. Below is the output of the Python program:
Total rows in NESARC dataset: 43093
Total columns in NESARC dataset: 3010
Description for how frequently people drank alcohol during the time of alcohol abuse: S2AQ21A
count 34331.000000
unique 10.000000
top 1.000000
freq 5215.000000
Name: S2AQ21A, dtype: float64
Description for if people suffered from hardening of arteries in last 12 months: S13Q6A1
count 41828.000000
unique 2.000000
top 0.000000
freq 40917.000000
Name: S13Q6A1, dtype: float64
Description for if people suffered from high blood pressure in last 12 months: S13Q6A2
count 41964.000000
unique 2.000000
top 0.000000
freq 32828.000000
Name: S13Q6A2, dtype: float64
Description for if people suffered from heart attack in last 12 months: S13Q6A7
count 42027.000000
unique 2.000000
top 0.000000
freq 41557.000000
Name: S13Q6A7, dtype: float64
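The describe() output above reports only the most common value and its frequency. For the three response variables that value is 0 ("no") and there are only two unique values, so the number of "yes" answers can be derived by subtraction. The following is a minimal sketch of that arithmetic; the dictionary and variable names are illustrative and not part of the original program, and the numbers are copied from the output above.

```python
# (count, freq of the top value) copied from the describe() output above.
# In all three cases top == 0 ("no") and unique == 2, so the remaining
# non-missing answers must be 1 ("yes").
described = {
    "S13Q6A1": (41828, 40917),  # hardening of arteries
    "S13Q6A2": (41964, 32828),  # high blood pressure
    "S13Q6A7": (42027, 41557),  # heart attack
}

yes_count = {}
yes_share = {}
for name, (count, no_freq) in described.items():
    yes_count[name] = count - no_freq
    yes_share[name] = yes_count[name] / count
    print(f"{name}: {yes_count[name]} yes answers, "
          f"{yes_share[name]:.2%} of valid responses")
```

By this arithmetic roughly 2.2% of valid responses report hardening of arteries, 21.8% report high blood pressure, and 1.1% report a heart attack, which gives a sense of the scale to expect on the y-axis of the bivariate plots.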
Bivariate Graphs
I have used seaborn.catplot() to plot the association between the explanatory variable and each of the three response variables. Below are the resulting bivariate bar graphs:
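The bar height in these plots is the mean of the response variable within each frequency category. Because each response is coded 0/1, that mean equals the proportion of respondents in the category who reported the condition. The sketch below shows the same calculation with a pandas groupby on made-up toy data; the column names freq and condition and all values are hypothetical, not NESARC data.

```python
import pandas as pd

# Hypothetical toy data, not NESARC values: 'freq' stands in for the recoded
# drinking-frequency category and 'condition' for a 0/1 response variable.
toy = pd.DataFrame({
    "freq": [1, 1, 1, 1, 10, 10],
    "condition": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column within each category is the proportion of 1s
# in that category, which is what a bar plot of the mean displays.
proportions = toy.groupby("freq")["condition"].mean()
print(proportions)
```

In this toy frame one of four respondents in category 1 and one of two in category 10 has the condition, so the two bars would have heights 0.25 and 0.5.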
Python Code
Below is the Python / Pandas code used to generate the above plots:
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#Set PANDAS to show all columns in DataFrame
pandas.set_option('display.max_columns', None)
#Set PANDAS to show all rows in DataFrame
pandas.set_option('display.max_rows', None)
# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)
# Explanatory variable
colFreqAlcoholDuringAbuse = 'S2AQ21A'
# response variables
colHardeningOfArteries = 'S13Q6A1'
colHighBloodPressure = 'S13Q6A2'
colHeartAttack = 'S13Q6A7'
data = pandas.read_csv('nesarc_pds.csv', low_memory=False)
print (f"Total rows in NESARC dataset: {len(data)}") #number of observations (rows)
print (f"Total columns in NESARC dataset: {len(data.columns)}") # number of variables (columns)
#setting variables you will be working with to numeric (updated)
data[colFreqAlcoholDuringAbuse] = pandas.to_numeric(data[colFreqAlcoholDuringAbuse], errors='coerce')
data[colHardeningOfArteries] = pandas.to_numeric(data[colHardeningOfArteries], errors='coerce')
data[colHighBloodPressure] = pandas.to_numeric(data[colHighBloodPressure], errors='coerce')
data[colHeartAttack] = pandas.to_numeric(data[colHeartAttack], errors='coerce')
# Data Management Step 1: Recode UNKNOWN (99 for the explanatory variable, 9 for the response variables) to NaN
data[colFreqAlcoholDuringAbuse] = data[colFreqAlcoholDuringAbuse].replace(99, numpy.nan)
data[colHardeningOfArteries] = data[colHardeningOfArteries].replace(9, numpy.nan)
data[colHighBloodPressure] = data[colHighBloodPressure].replace(9, numpy.nan)
data[colHeartAttack] = data[colHeartAttack].replace(9, numpy.nan)
# Data Management Step 2: Recode NO (2) for response variable to 0
data[colHardeningOfArteries] = data[colHardeningOfArteries].replace(2, 0)
data[colHighBloodPressure] = data[colHighBloodPressure].replace(2, 0)
data[colHeartAttack] = data[colHeartAttack].replace(2, 0)
# Data Management Step 3: Reverse the alcohol drinking frequency categories
# Currently 1 is most frequent (every day of the week) and 10 is least frequent (once or twice a year), so the order is reversed
recode = {1:10, 2:9, 3:8, 4:7, 5:6, 6:5, 7:4, 8:3, 9:2, 10:1}
data[colFreqAlcoholDuringAbuse] = data[colFreqAlcoholDuringAbuse].map(recode)
# Make all the variables as categorical
data[colFreqAlcoholDuringAbuse] = data[colFreqAlcoholDuringAbuse].astype('category')
data[colHardeningOfArteries] = data[colHardeningOfArteries].astype('category')
data[colHighBloodPressure] = data[colHighBloodPressure].astype('category')
data[colHeartAttack] = data[colHeartAttack].astype('category')
# Univariate bar graphs for categorical variables
freqAlcoholPlot = seaborn.countplot(x=colFreqAlcoholDuringAbuse, data=data)
plt.xlabel('Alcohol Drinking Frequency')
plt.title('Frequency of drinking alcohol during alcohol abuse')
fig = freqAlcoholPlot.get_figure()
fig.savefig('AlcoholFrequencyBarChart.png')
# Close the figure so the next countplot is drawn on fresh axes instead of on top of this one
plt.close(fig)
hardeningOfArteriesPlot = seaborn.countplot(x=colHardeningOfArteries, data=data)
plt.xlabel('People suffered Hardening of Arteries')
plt.title('People who suffered Hardening of Arteries in last 12 months')
fig = hardeningOfArteriesPlot.get_figure()
fig.savefig('ArteriesHardeningBarChart.png')
# Close the figure so the next countplot is drawn on fresh axes
plt.close(fig)
highBPPlot = seaborn.countplot(x=colHighBloodPressure, data=data)
plt.xlabel('People suffered High Blood Pressure')
plt.title('People who suffered High Blood Pressure in last 12 months')
fig = highBPPlot.get_figure()
fig.savefig('HighBloodPressureBarChart.png')
# Close the figure so the next countplot is drawn on fresh axes
plt.close(fig)
heartAttackPlot = seaborn.countplot(x=colHeartAttack, data=data)
plt.xlabel('People suffered Heart Attack')
plt.title('People who suffered Heart Attack in last 12 months')
fig = heartAttackPlot.get_figure()
fig.savefig('HeartAttackBarChart.png')
# Close the figure now that it has been saved
plt.close(fig)
# describing the variables
print(f"Description for how frequently people drank alcohol during the time of alcohol abuse: {colFreqAlcoholDuringAbuse}")
desc1 = data[colFreqAlcoholDuringAbuse].describe()
print(desc1)
print(f"Description for if people suffered from hardening of arteries in last 12 months: {colHardeningOfArteries}")
desc2 = data[colHardeningOfArteries].describe()
print(desc2)
print(f"Description for if people suffered from high blood pressure in last 12 months: {colHighBloodPressure}")
desc3 = data[colHighBloodPressure].describe()
print(desc3)
print(f"Description for if people suffered from heart attack in last 12 months: {colHeartAttack}")
desc4 = data[colHeartAttack].describe()
print(desc4)
# Bivariate Plots Categorical to Categorical Bar Charts
data[colHardeningOfArteries] = pandas.to_numeric(data[colHardeningOfArteries], errors='coerce')
causal_hardeningOfArteries_plot = seaborn.catplot(x=colFreqAlcoholDuringAbuse, y=colHardeningOfArteries, data=data, kind="bar", ci=None)
plt.xlabel('Alcohol Frequency')
plt.ylabel('Hardening of Arteries Proportion ')
# fig = causal_heartattack_plot.get_figure()
plt.savefig('Causal_ArteriesHardening.png')
# Bivariate Plots Categorical to Categorical Bar Charts
data[colHighBloodPressure] = pandas.to_numeric(data[colHighBloodPressure], errors='coerce')
causal_highBP_plot = seaborn.catplot(x=colFreqAlcoholDuringAbuse, y=colHighBloodPressure, data=data, kind="bar", ci=None)
plt.xlabel('Alcohol Frequency')
plt.ylabel('High Blood Pressure Proportion ')
# fig = causal_heartattack_plot.get_figure()
plt.savefig('Causal_HighBloodPressure.png')
# Bivariate Plots Categorical to Categorical Bar Charts
data[colHeartAttack] = pandas.to_numeric(data[colHeartAttack], errors='coerce')
causal_heartattack_plot = seaborn.catplot(x=colFreqAlcoholDuringAbuse, y=colHeartAttack, data=data, kind="bar", ci=None)
plt.xlabel('Alcohol Frequency')
plt.ylabel('Heart Attack Proportion ')
# fig = causal_heartattack_plot.get_figure()
plt.savefig('Causal_HeartAttack.png')
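The three data management steps in the code above can be checked on a small frame before running them on the full dataset. The following is a minimal sketch under stated assumptions: the toy rows are hypothetical and not NESARC data, and the scale reversal is written as 11 - x, which gives the same result as the recode dictionary for the codes 1 to 10.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not NESARC data; the codes follow the scheme
# described in the comments of the program above.
toy = pd.DataFrame({
    "S2AQ21A": [1, 10, 5, 99],  # 1 = most frequent, 10 = least frequent, 99 = unknown
    "S13Q6A1": [1, 2, 9, 2],    # 1 = yes, 2 = no, 9 = unknown
})

# Step 1: unknown codes become missing values.
toy["S2AQ21A"] = toy["S2AQ21A"].replace(99, np.nan)
toy["S13Q6A1"] = toy["S13Q6A1"].replace(9, np.nan)

# Step 2: "no" (2) becomes 0, so the column mean is the proportion of "yes".
toy["S13Q6A1"] = toy["S13Q6A1"].replace(2, 0)

# Step 3: reverse the 1..10 scale so larger numbers mean more frequent drinking.
# Missing values stay missing under the subtraction.
toy["S2AQ21A"] = 11 - toy["S2AQ21A"]

print(toy)
```

After these steps the frequency column holds 10, 1, 6 and a missing value, and the response column holds 1, 0, a missing value and 0.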
Summary
- The above bivariate plots have the frequency of alcohol drinking during alcohol abuse on the x-axis and whether or not the person suffered from Hardening of Arteries, High Blood Pressure or Heart Attack in the last 12 months on the y-axis.
- Moving right along the x-axis, the frequency of drinking alcohol increases from 1 (once or twice a year) to 10 (every day of the week).
- On the y-axis of each of the three bivariate plots, we have the proportion of survey respondents who suffered from the corresponding cardiovascular medical condition in the last 12 months.
- As is apparent from the bivariate plots, there doesn't seem to be a direct relationship between the frequency of drinking alcohol and a person suffering from hardening of arteries, high blood pressure or heart attack.
- The results are similar to the literature research performed in Assignment 1.






