We start by importing the modules that provide the basic functionality for data manipulation and visualization.
Here we import the modules that support plotting.
The following command indicates that we want the plots to be shown "inline" (in the web page).
%matplotlib inline
Then we import "matplotlib".
- This is a module that provides basic and advanced plotting functionality.
- More specifically, we import its "pylab" submodule.
- We assign it the alias "plt", to make it easier for us to type subsequent commands.
from matplotlib import pylab as plt
We now import the "pandas" module.
- It provides easy-to-use features for performing data analysis.
- It is well integrated with the plotting module.
- It provides support for reading data from many different file formats.
- We assign it the alias "pd", to make it easier for us to type commands later.
import pandas as pd
We use the "read_csv" function from the pandas module to read a CSV ("comma-separated values") file containing the health indicators data.
df = pd.read_csv("indicators.csv")
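As a minimal sketch of what "read_csv" does, the example below parses a tiny CSV held in memory instead of a file on disk. The three-column sample text is an assumption made for illustration (the real "indicators.csv" has 17 columns); the values are copied from the first rows shown later in this notebook.

```python
import io
import pandas as pd

# Hypothetical sample: only three of the real file's columns, two rows.
csv_text = (
    "County Name,County Code,Percentage/Rate\n"
    "Cayuga,5,301.3\n"
    "Cortland,11,241.3\n"
)

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the header line
```

The first line of the text is used as the header, and numeric columns are converted to numbers automatically.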
Now we import the "numpy" module.
- It provides support for numerical computation.
- In particular, it supports processing of vectors and matrices.
- We assign it the alias "np", to make it easier for us to type commands later.
import numpy as np
The pandas module reads data into a "Data Frame", which is essentially a matrix where the columns are identified by names and the rows are identified by indices.
Here we start by taking a look at the columns of the data frame to see how it is organized. This is a typical first step when we need to get familiar with the content of a new data file.
df.columns
Index([u'County Name', u'County Code', u'Region Name', u'Indicator Number', u'Indicator', u'Total Event Counts', u'Denominator', u'Denominator Note', u'Measure Unit', u'Percentage/Rate', u'95% CI', u'Data Comments', u'Data Years', u'Data Sources', u'Quartile', u'Mapping Distribution', u'Location'], dtype='object')
We can also look at the names of the rows, also known as the "index". In this case, the index is a simple numerical ordering.
df.index
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
The pandas module provides a default pretty-print functionality that is also very helpful for getting a sense of the content of the data frame. Its drawback is that, for data frames with many rows or many columns, scrolling may be needed to get a full picture of the data set. We invoke the pretty print by simply typing the name of the data frame.
df
County Name County Code Region Name Indicator Number \
0 Cayuga 5 Central New York d1
1 Cortland 11 Central New York d1
2 Herkimer 21 Central New York d1
3 Jefferson 22 Central New York d1
4 Lewis 23 Central New York d1
5 Madison 25 Central New York d1
6 Oneida 30 Central New York d1
7 Onondaga 31 Central New York d1
8 Oswego 35 Central New York d1
9 St. Lawrence 40 Central New York d1
10 Tompkins 50 Central New York d1
11 Central New York 103 Central New York d1
12 Chemung 7 Finger Lakes d1
13 Livingston 24 Finger Lakes d1
14 Monroe 26 Finger Lakes d1
15 Ontario 32 Finger Lakes d1
16 Schuyler 44 Finger Lakes d1
17 Seneca 45 Finger Lakes d1
18 Steuben 46 Finger Lakes d1
19 Wayne 54 Finger Lakes d1
20 Yates 57 Finger Lakes d1
21 Finger Lakes 102 Finger Lakes d1
22 Dutchess 13 Hudson Valley d1
23 Orange 33 Hudson Valley d1
24 Putnam 37 Hudson Valley d1
25 Rockland 39 Hudson Valley d1
26 Sullivan 48 Hudson Valley d1
27 Ulster 51 Hudson Valley d1
28 Westchester 55 Hudson Valley d1
29 Hudson Valley 106 Hudson Valley d1
30 Nassau 28 Nassau-Suffolk d1
31 Suffolk 47 Nassau-Suffolk d1
32 Nassau-Suffolk 108 Nassau-Suffolk d1
33 Bronx 58 New York City d1
34 Kings 59 New York City d1
35 New York 60 New York City d1
36 Queens 61 New York City d1
37 Richmond 62 New York City d1
38 New York City 107 New York City d1
39 Broome 3 New York-Penn d1
40 Chenango 8 New York-Penn d1
41 Tioga 49 New York-Penn d1
42 New York-Penn 104 New York-Penn d1
43 New York State 999 New York State d1
44 Albany 1 Northeastern New York d1
45 Clinton 9 Northeastern New York d1
46 Columbia 10 Northeastern New York d1
47 Delaware 12 Northeastern New York d1
48 Essex 15 Northeastern New York d1
49 Franklin 16 Northeastern New York d1
50 Fulton 17 Northeastern New York d1
51 Greene 19 Northeastern New York d1
52 Hamilton 20 Northeastern New York d1
53 Montgomery 27 Northeastern New York d1
54 Otsego 36 Northeastern New York d1
55 Rensselaer 38 Northeastern New York d1
56 Saratoga 41 Northeastern New York d1
57 Schenectady 42 Northeastern New York d1
58 Schoharie 43 Northeastern New York d1
59 Warren 52 Northeastern New York d1
... ... ... ...
Indicator Total Event Counts \
0 Cardiovascular disease mortality rate per 100,000 721
1 Cardiovascular disease mortality rate per 100,000 354
2 Cardiovascular disease mortality rate per 100,000 769
3 Cardiovascular disease mortality rate per 100,000 1023
4 Cardiovascular disease mortality rate per 100,000 238
5 Cardiovascular disease mortality rate per 100,000 522
6 Cardiovascular disease mortality rate per 100,000 2644
7 Cardiovascular disease mortality rate per 100,000 3607
8 Cardiovascular disease mortality rate per 100,000 1003
9 Cardiovascular disease mortality rate per 100,000 1019
10 Cardiovascular disease mortality rate per 100,000 593
11 Cardiovascular disease mortality rate per 100,000 12493
12 Cardiovascular disease mortality rate per 100,000 885
13 Cardiovascular disease mortality rate per 100,000 487
14 Cardiovascular disease mortality rate per 100,000 5766
15 Cardiovascular disease mortality rate per 100,000 935
16 Cardiovascular disease mortality rate per 100,000 204
17 Cardiovascular disease mortality rate per 100,000 274
18 Cardiovascular disease mortality rate per 100,000 943
19 Cardiovascular disease mortality rate per 100,000 721
20 Cardiovascular disease mortality rate per 100,000 223
21 Cardiovascular disease mortality rate per 100,000 10438
22 Cardiovascular disease mortality rate per 100,000 2293
23 Cardiovascular disease mortality rate per 100,000 2396
24 Cardiovascular disease mortality rate per 100,000 707
25 Cardiovascular disease mortality rate per 100,000 2305
26 Cardiovascular disease mortality rate per 100,000 709
27 Cardiovascular disease mortality rate per 100,000 1520
28 Cardiovascular disease mortality rate per 100,000 7449
29 Cardiovascular disease mortality rate per 100,000 17379
30 Cardiovascular disease mortality rate per 100,000 14166
31 Cardiovascular disease mortality rate per 100,000 11956
32 Cardiovascular disease mortality rate per 100,000 26122
33 Cardiovascular disease mortality rate per 100,000 10056
34 Cardiovascular disease mortality rate per 100,000 19966
35 Cardiovascular disease mortality rate per 100,000 10816
36 Cardiovascular disease mortality rate per 100,000 18237
37 Cardiovascular disease mortality rate per 100,000 4531
38 Cardiovascular disease mortality rate per 100,000 63606
39 Cardiovascular disease mortality rate per 100,000 2268
40 Cardiovascular disease mortality rate per 100,000 697
41 Cardiovascular disease mortality rate per 100,000 364
42 Cardiovascular disease mortality rate per 100,000 3329
43 Cardiovascular disease mortality rate per 100,000 164204
44 Cardiovascular disease mortality rate per 100,000 2669
45 Cardiovascular disease mortality rate per 100,000 562
46 Cardiovascular disease mortality rate per 100,000 715
47 Cardiovascular disease mortality rate per 100,000 646
48 Cardiovascular disease mortality rate per 100,000 347
49 Cardiovascular disease mortality rate per 100,000 478
50 Cardiovascular disease mortality rate per 100,000 620
51 Cardiovascular disease mortality rate per 100,000 510
52 Cardiovascular disease mortality rate per 100,000 58
53 Cardiovascular disease mortality rate per 100,000 694
54 Cardiovascular disease mortality rate per 100,000 618
55 Cardiovascular disease mortality rate per 100,000 1547
56 Cardiovascular disease mortality rate per 100,000 1664
57 Cardiovascular disease mortality rate per 100,000 1547
58 Cardiovascular disease mortality rate per 100,000 265
59 Cardiovascular disease mortality rate per 100,000 580
... ...
Denominator Denominator Note Measure Unit Percentage/Rate \
0 79763 Average annual population Rate 301.3
1 48898 Average annual population Rate 241.3
2 63638 Average annual population Rate 402.8
3 117619 Average annual population Rate 289.9
4 26772 Average annual population Rate 296.3
5 72254 Average annual population Rate 240.8
6 233403 Average annual population Rate 377.6
7 462913 Average annual population Rate 259.7
8 121905 Average annual population Rate 274.3
9 111116 Average annual population Rate 305.7
10 101689 Average annual population Rate 194.4
11 1439971 Average annual population Rate 289.2
12 88667 Average annual population Rate 332.7
13 64445 Average annual population Rate 251.9
14 741224 Average annual population Rate 259.3
15 107369 Average annual population Rate 290.3
16 18475 Average annual population Rate 368.1
17 34833 Average annual population Rate 262.2
18 98192 Average annual population Rate 320.1
19 92833 Average annual population Rate 258.9
20 25095 Average annual population Rate 296.2
21 1271131 Average annual population Rate 273.7
22 296350 Average annual population Rate 257.9
23 377072 Average annual population Rate 211.8
24 99636 Average annual population Rate 236.5
25 309006 Average annual population Rate 248.6
26 76758 Average annual population Rate 307.9
27 182127 Average annual population Rate 278.2
28 953658 Average annual population Rate 260.4
29 2294607 Average annual population Rate 252.5
30 1347132 Average annual population Rate 350.5
31 1503547 Average annual population Rate 265.1
32 2850679 Average annual population Rate 305.4
33 1391466 Average annual population Rate 240.9
34 2534814 Average annual population Rate 262.6
35 1605625 Average annual population Rate 224.5
36 2261761 Average annual population Rate 268.8
37 476976 Average annual population Rate 316.6
38 8270641 Average annual population Rate 256.4
39 198087 Average annual population Rate 381.7
40 50405 Average annual population Rate 460.9
41 50744 Average annual population Rate 239.1
42 299236 Average annual population Rate 370.8
43 19461584 Average annual population Rate 281.2
44 302018 Average annual population Rate 294.6
45 81897 Average annual population Rate 228.7
46 62421 Average annual population Rate 381.8
47 47018 Average annual population Rate 458.0
48 38746 Average annual population Rate 298.5
49 51141 Average annual population Rate 311.6
50 55255 Average annual population Rate 374.0
51 49041 Average annual population Rate 346.7
52 4851 Average annual population Rate 398.6
53 49584 Average annual population Rate 466.6
54 61926 Average annual population Rate 332.7
55 158122 Average annual population Rate 326.1
56 220186 Average annual population Rate 251.9
57 153985 Average annual population Rate 334.9
58 32285 Average annual population Rate 273.6
59 65853 Average annual population Rate 293.6
... ... ... ...
95% CI Data Comments Data Years \
0 NaN NaN 2009-2011
1 NaN NaN 2009-2011
2 NaN NaN 2009-2011
3 NaN NaN 2009-2011
4 NaN NaN 2009-2011
5 NaN NaN 2009-2011
6 NaN NaN 2009-2011
7 NaN NaN 2009-2011
8 NaN NaN 2009-2011
9 NaN NaN 2009-2011
10 NaN NaN 2009-2011
11 NaN NaN 2009-2011
12 NaN NaN 2009-2011
13 NaN NaN 2009-2011
14 NaN NaN 2009-2011
15 NaN NaN 2009-2011
16 NaN NaN 2009-2011
17 NaN NaN 2009-2011
18 NaN NaN 2009-2011
19 NaN NaN 2009-2011
20 NaN NaN 2009-2011
21 NaN NaN 2009-2011
22 NaN NaN 2009-2011
23 NaN NaN 2009-2011
24 NaN NaN 2009-2011
25 NaN NaN 2009-2011
26 NaN NaN 2009-2011
27 NaN NaN 2009-2011
28 NaN NaN 2009-2011
29 NaN NaN 2009-2011
30 NaN NaN 2009-2011
31 NaN NaN 2009-2011
32 NaN NaN 2009-2011
33 NaN NaN 2009-2011
34 NaN NaN 2009-2011
35 NaN NaN 2009-2011
36 NaN NaN 2009-2011
37 NaN NaN 2009-2011
38 NaN NaN 2009-2011
39 NaN NaN 2009-2011
40 NaN NaN 2009-2011
41 NaN NaN 2009-2011
42 NaN NaN 2009-2011
43 NaN NaN 2009-2011
44 NaN NaN 2009-2011
45 NaN NaN 2009-2011
46 NaN NaN 2009-2011
47 NaN NaN 2009-2011
48 NaN NaN 2009-2011
49 NaN NaN 2009-2011
50 NaN NaN 2009-2011
51 NaN NaN 2009-2011
52 NaN NaN 2009-2011
53 NaN NaN 2009-2011
54 NaN NaN 2009-2011
55 NaN NaN 2009-2011
56 NaN NaN 2009-2011
57 NaN NaN 2009-2011
58 NaN NaN 2009-2011
59 NaN NaN 2009-2011
... ... ...
Data Sources Quartile \
0 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
1 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
2 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
3 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
4 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
5 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
6 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
7 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
8 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
9 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
10 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
11 2009-2011 Vital Statistics Data as of February... NaN
12 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
13 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
14 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
15 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
16 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
17 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
18 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
19 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
20 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
21 2009-2011 Vital Statistics Data as of February... NaN
22 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
23 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
24 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
25 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
26 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
27 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
28 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
29 2009-2011 Vital Statistics Data as of February... NaN
30 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
31 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
32 2009-2011 Vital Statistics Data as of February... NaN
33 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
34 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
35 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
36 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
37 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
38 2009-2011 Vital Statistics Data as of February... NaN
39 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
40 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
41 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
42 2009-2011 Vital Statistics Data as of February... NaN
43 2009-2011 Vital Statistics Data as of February... NaN
44 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
45 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
46 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
47 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
48 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
49 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
50 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
51 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
52 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
53 2009-2011 Vital Statistics Data as of February... 350.5 + : Q4
54 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
55 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
56 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
57 2009-2011 Vital Statistics Data as of February... 296.3 - < 350.5 : Q3
58 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
59 2009-2011 Vital Statistics Data as of February... 0 - < 296.3 : Q1 & Q2
... ...
Mapping Distribution Location
0 2 (42.940095, -76.560755)
1 1 (42.597101, -76.143291)
2 3 (43.070026, -74.994246)
3 1 (44.019295, -75.898971)
4 2 (43.785537, -75.446296)
5 1 (42.986917, -75.720031)
6 3 (43.149482, -75.361773)
7 1 (43.065629, -76.168033)
8 1 (43.39123, -76.31133)
9 2 (44.689468, -75.242045)
10 1 (42.461024, -76.478784)
11 NaN NaN
12 2 (42.116644, -76.812331)
13 1 (42.763754, -77.765392)
14 1 (43.161748, -77.620143)
15 1 (42.894571, -77.252045)
16 3 (42.38593, -76.872032)
17 1 (42.833627, -76.82753)
18 2 (42.270053, -77.324618)
19 1 (43.144336, -77.117995)
20 1 (42.634338, -77.078311)
21 NaN NaN
22 1 (41.686216, -73.840468)
23 1 (41.422459, -74.241929)
24 1 (41.41131, -73.717443)
25 1 (41.127287, -74.017033)
26 2 (41.705166, -74.711705)
27 1 (41.848374, -74.099412)
28 1 (41.039278, -73.805386)
29 NaN NaN
30 2 (40.715749, -73.601185)
31 1 (40.820237, -73.119032)
32 NaN NaN
33 1 (40.85589, -73.868294)
34 1 (40.65642, -73.950691)
35 1 (40.726966, -74.005966)
36 1 (40.749338, -73.789673)
37 2 (40.566763, -74.148102)
38 NaN NaN
39 3 (42.122015, -75.933191)
40 3 (42.481798, -75.570013)
41 1 (42.120252, -76.29595)
42 NaN NaN
43 NaN NaN
44 1 (42.678066, -73.814233)
45 1 (44.731944, -73.548883)
46 3 (42.276913, -73.682168)
47 3 (42.242972, -74.997944)
48 2 (44.166026, -73.685145)
49 2 (44.705699, -74.340621)
50 3 (43.06014, -74.331296)
51 2 (42.298326, -73.973376)
52 3 (43.618468, -74.395268)
53 3 (42.933637, -74.341972)
54 2 (42.564852, -75.060334)
55 2 (42.70098, -73.628669)
56 1 (43.00894, -73.786779)
57 2 (42.809233, -73.946838)
58 1 (42.643426, -74.434606)
59 1 (43.403024, -73.716044)
... ...
[2733 rows x 17 columns]
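When the full print-out is too long to scan, a more compact overview can help. The sketch below is one possible approach, not part of the original notebook: it uses a small made-up data frame named "toy" (its values are copied from the first rows above) and shows the "shape" and "dtypes" attributes together with a count of missing values per column.

```python
import pandas as pd

# Small stand-in for the real data frame (hypothetical subset of columns).
toy = pd.DataFrame({
    "County Name": ["Cayuga", "Cortland", "Herkimer"],
    "County Code": [5, 11, 21],
    "Percentage/Rate": [301.3, 241.3, 402.8],
})

# Size of the data frame without printing its content.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# One line per column, with the type pandas inferred for it.
print(toy.dtypes)

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

The same three calls can be applied to "df" to summarize all 17 columns in a few lines.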
It is common to find that not all columns of a data frame are relevant or interesting for our data analysis. It is therefore desirable to extract the interesting columns from the data frame, in order to focus on them in the subsequent steps.
Column extraction is made by passing to the data frame a list of column names.
selectedColumns = ['County Name', 'County Code', 'Indicator Number', 'Indicator', 'Percentage/Rate']
df = df[selectedColumns]
The data frame head() function returns by default the first 5 records of the dataset. It is a useful way to take a look at how the data is organized.
df.head()
County Name County Code Indicator Number \
0 Cayuga 5 d1
1 Cortland 11 d1
2 Herkimer 21 d1
3 Jefferson 22 d1
4 Lewis 23 d1
Indicator Percentage/Rate
0 Cardiovascular disease mortality rate per 100,000 301.3
1 Cardiovascular disease mortality rate per 100,000 241.3
2 Cardiovascular disease mortality rate per 100,000 402.8
3 Cardiovascular disease mortality rate per 100,000 289.9
4 Cardiovascular disease mortality rate per 100,000 296.3
[5 rows x 5 columns]
The head() function also takes as an argument the number of rows to return.
df.head(3)
County Name County Code Indicator Number \
0 Cayuga 5 d1
1 Cortland 11 d1
2 Herkimer 21 d1
Indicator Percentage/Rate
0 Cardiovascular disease mortality rate per 100,000 301.3
1 Cardiovascular disease mortality rate per 100,000 241.3
2 Cardiovascular disease mortality rate per 100,000 402.8
[3 rows x 5 columns]
The pandas unique() function can be used to examine the unique values for a column. Let's illustrate this here, by first extracting a column (named 'Indicator') from the data frame, and then reducing the content of that column to the set of unique values.
The unique() function returns an array, so we use it to create a Series object, and then take advantage of its pretty-print functionality.
indicators = pd.Series(df['Indicator'].unique())
indicators
0 Cardiovascular disease mortality rate per 100,000
1 Cerebrovascular disease (stroke) mortality rat...
2 Age-adjusted cerebrovascular disease (stroke) ...
3 Age-adjusted cardiovascular disease mortality ...
4 Cirrhosis mortality rate per 100,000
5 Age-adjusted cirrhosis mortality rate per 100,000
6 Diabetes mortality rate per 100,000
7 Age-adjusted diabetes mortality rate per 100,000
8 Age-adjusted percentage of adults with physici...
9 Age-adjusted percentage of adults with physici...
10 Percentage of pregnant women in WIC who were p...
11 Percentage of pregnant women in WIC who were p...
12 Percentage of WIC mothers breastfeeding at lea...
13 Percentage overweight but not obese (85th-<95t...
14 Percentage obese (95th percentile or higher) -...
15 Percentage overweight or obese (85th percentil...
16 Percentage overweight but not obese (85th-<95t...
17 Percentage obese (95th percentile or higher ) ...
18 Percentage overweight or obese (85th percentil...
19 Percentage overweight but not obese (85th-<95t...
20 Percentage obese (95th percentile or higher ) ...
21 Percentage overweight or obese (85th percentil...
22 Percentage obese (95th percentile or higher) c...
23 Percentage of children (aged 2-4 years) enroll...
24 Age-adjusted percentage of adults overweight o...
25 Age-adjusted percentage of adults obese (BMI 3...
26 Age-adjusted percentage of adults who did not ...
27 Age-adjusted percentage of adults eating 5 or ...
28 Cardiovascular disease hospitalization rate pe...
29 Cirrhosis hospitalization rate per 10,000
30 Age-adjusted cirrhosis hospitalization rate pe...
31 Diabetes hospitalization rate per 10,000 (prim...
32 Age-adjusted diabetes hospitalization rate per...
33 Diabetes hospitalization rate per 10,000 (any ...
34 Age-adjusted diabetes hospitalization rate per...
35 Age-adjusted cardiovascular disease hospitaliz...
36 Diabetes short-term complications hospitalizat...
37 Diabetes Short-term Complications Hospitalizat...
38 Cerebrovascular disease (stroke) hospitalizati...
39 Age-adjusted cerebrovascular disease (stroke) ...
dtype: object
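As a small illustration of unique(), and of the related value_counts() function that also reports how often each value occurs, here is a sketch on a made-up Series. The codes "d2" and "d3" are hypothetical; only "d1" appears in the output shown earlier.

```python
import pandas as pd

# Hypothetical column of indicator codes with repeated values.
toy = pd.Series(["d1", "d1", "d2", "d1", "d3"])

# unique() returns a NumPy array, in order of first appearance.
unique_values = toy.unique()
print(list(unique_values))

# value_counts() returns a Series mapping each value to its number of rows.
counts = toy.value_counts()
print(counts["d1"])
```

value_counts() is worth considering when it matters how many records exist per indicator, not only which indicators exist.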
Now that we have become familiar with the content of the data frame, we can proceed to generate plots of its values.
We start our plotting exercise by selecting two specific health indicators. For each of them, we extract the rate of occurrence for all the counties in the data set, and then we plot, per county, the rate of one indicator versus the other.
This is what would typically be done to explore potential correlations between two health indicators.
From the list of unique health indicators, we choose two indicators of interest. We can do this by referring to them just by number.
indicator1 = indicators[0]
indicator2 = indicators[9]
We then print them out, to verify which indicators we are plotting.
print("1:", indicator1)
print("2:", indicator2)
1: Cardiovascular disease mortality rate per 100,000
2: Age-adjusted percentage of adults with physician diagnosed diabetes
Using each one of these indicators, we can now filter the data frame to extract the rows that refer to that specific indicator. We do this by taking advantage of the boolean indexing provided by the pandas module.
data1 = df[df['Indicator'] == indicator1]
data2 = df[df['Indicator'] == indicator2]
We are essentially asking the data frame to return the list of records for which the value in the 'Indicator' column matches the specific value of one of our health indicators.
This produces two data frames, each one containing the data for that specific health indicator.
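The filtering step can be sketched on a small made-up data frame. The indicator labels "A" and "B" are placeholders chosen for this example; the rates are copied from the Cayuga and Cortland rows of this notebook.

```python
import pandas as pd

# Hypothetical records: two counties, two indicators each.
toy = pd.DataFrame({
    "County Name": ["Cayuga", "Cortland", "Cayuga", "Cortland"],
    "Indicator": ["A", "A", "B", "B"],
    "Percentage/Rate": [301.3, 241.3, 9.5, 10.5],
})

# The comparison yields one True/False value per row.
mask = toy["Indicator"] == "A"
print(mask.tolist())

# Indexing with that boolean Series keeps only the rows marked True.
only_a = toy[mask]
print(only_a)
```

The original row labels (0 and 1 here) are preserved in the filtered result, which is one reason the records are re-indexed by county name in the next step.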
From them, we are particularly interested in the columns 'County Name' and 'Percentage/Rate'.
We want to explore, for every county, how the rate of occurrence of one health indicator relates to the rate of occurrence of the other. Again, we are exploring potential correlations between two health indicators.
Since the data comes from a collection of records in a database, we must first extract the records and index them by county, so that when we take a value for one indicator, we can be sure that we are matching it to the value of the other indicator for the same county.
data1_by_county = data1.set_index(data1['County Name'])
data2_by_county = data2.set_index(data2['County Name'])
Now that the indices are common, we can extract the column of data as a Series, for both indicators.
rate1 = data1_by_county['Percentage/Rate']
rate2 = data2_by_county['Percentage/Rate']
Then we join both Series into a new Data Frame, and rename its columns to preserve the meaning of the health indicators.
name1 = 'Cardiovascular'
name2 = 'Diabetes'
rates = pd.DataFrame([rate1, rate2], [name1, name2]).T
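To see why indexing by county matters, the sketch below combines two short Series whose rows are deliberately listed in a different order. It builds the data frame from a dictionary, which is an alternative to the list-and-transpose form used above, shown here only for illustration; the values are taken from the Cayuga and Cortland rows.

```python
import pandas as pd

# Two Series indexed by county name, listed in different orders on purpose.
s1 = pd.Series([301.3, 241.3], index=["Cayuga", "Cortland"])
s2 = pd.Series([10.5, 9.5], index=["Cortland", "Cayuga"])

# pandas aligns the values by index label, not by position.
combined = pd.DataFrame({"Cardiovascular": s1, "Diabetes": s2})
print(combined)
```

Each row of "combined" holds both values for the same county, even though the inputs were ordered differently.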
Real-life data is messy. Here we can see that for some counties we have missing data for one or both of the health indicators. We can drop those records from our analysis, using the dropna() function as follows.
rates = rates.dropna()
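The effect of dropna() can be checked on a small example. The missing entries below are invented for illustration and do not reflect which counties actually lack data in the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one gap in each column (NaN marks a missing value).
toy = pd.DataFrame(
    {
        "Cardiovascular": [301.3, 241.3, np.nan],
        "Diabetes": [9.5, np.nan, 10.4],
    },
    index=["Cayuga", "Cortland", "Lewis"],
)

# By default dropna() removes every row that has at least one missing value.
complete = toy.dropna()

print(len(toy), len(complete))
print(complete)
```

Only rows with a value in every column survive, so both columns of the result have the same length and can be plotted against each other.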
Finally, our data is organized in a consistent way, with two columns of data ready to be plotted against each other in a scatter plot.
rates
Cardiovascular Diabetes
Albany 294.6 8.6
Allegany 313.9 8.7
Bronx 240.9 11.3
Broome 381.7 8.6
Cattaraugus 446.1 10.9
Cayuga 301.3 9.5
Chautauqua 378.6 11.2
Chemung 332.7 11.3
Chenango 460.9 12.1
Clinton 228.7 10.0
Columbia 381.8 6.6
Cortland 241.3 10.5
Delaware 458.0 8.7
Dutchess 257.9 9.7
Erie 351.1 10.5
Essex 298.5 10.4
Franklin 311.6 11.7
Fulton 374.0 8.0
Genesee 340.6 13.2
Greene 346.7 8.7
Hamilton 398.6 8.0
Herkimer 402.8 11.2
Jefferson 289.9 10.7
Kings 262.6 10.5
Lewis 296.3 10.4
Livingston 251.9 9.9
Madison 240.8 7.4
Monroe 259.3 8.9
Montgomery 466.6 7.7
Nassau 350.5 5.9
New York 224.5 6.1
New York State 281.2 9.0
Niagara 415.4 10.2
Oneida 377.6 8.8
Onondaga 259.7 7.6
Ontario 290.3 7.4
Orange 211.8 6.9
Orleans 352.0 8.1
Oswego 274.3 9.9
Otsego 332.7 6.6
Putnam 236.5 6.4
Queens 268.8 11.0
Rensselaer 326.1 9.3
Richmond 316.6 8.5
Rockland 248.6 8.0
Saratoga 251.9 8.4
Schenectady 334.9 9.4
Schoharie 273.6 7.6
Schuyler 368.1 10.3
Seneca 262.2 10.7
St. Lawrence 305.7 10.8
Steuben 320.1 7.9
Suffolk 265.1 9.0
Sullivan 307.9 10.4
Tioga 239.1 10.7
Tompkins 194.4 7.4
Ulster 278.2 8.0
Warren 293.6 9.8
Washington 280.2 8.1
Wayne 258.9 8.6
... ...
[63 rows x 2 columns]
From our prepared Data Frame, we now extract the respective columns as Series
rate1 = rates[name1]
rate2 = rates[name2]
and pass them as arguments to the Matplotlib scatter function, which generates the scatter plot.
plt.scatter(rate1,rate2)
<matplotlib.collections.PathCollection at 0x7f92ef730d90>
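A scatter plot gives a visual impression of a relationship; a correlation coefficient puts a number on it. The sketch below is an addition to the original analysis: it uses the first five rows of the table above as sample data and computes the Pearson correlation with the Series corr() function. Five rows are far too few to draw conclusions, so the example only shows the mechanics.

```python
import pandas as pd

# First five rows of the table above, used as a small sample.
sample = pd.DataFrame(
    {
        "Cardiovascular": [294.6, 313.9, 240.9, 381.7, 446.1],
        "Diabetes": [8.6, 8.7, 11.3, 8.6, 10.9],
    },
    index=["Albany", "Allegany", "Bronx", "Broome", "Cattaraugus"],
)

# Pearson correlation between the two columns: a value between -1 and 1.
r = sample["Cardiovascular"].corr(sample["Diabetes"])
print(r)

# The same number appears in the full correlation matrix of the data frame.
print(sample.corr())
```

Applied to the full data, the equivalent call would be rates[name1].corr(rates[name2]).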
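We can also compare the values by using a bubble chart, in which the size and color of each marker encode the rate of the first indicator.
# Convert the rates to a NumPy array and discard any remaining NaN values
array1 = np.array(rate1.tolist())
array1 = array1[~np.isnan(array1)]
# Shift the values so that the smallest rate maps to a marker size of 10
min_rate = array1.min()
array_for_bubble = (array1 - min_rate + 1) * 10
plt.scatter(rate1, rate2, s=array_for_bubble, marker='o', c=array_for_bubble)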
<matplotlib.collections.PathCollection at 0x7f92ef5cddd0>
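The marker-size formula used for the bubble chart can be checked by hand on a few numbers. The three rates below are made up for illustration.

```python
import numpy as np

# Hypothetical rates, chosen so the arithmetic is easy to follow.
values = np.array([200.0, 250.0, 300.0])

# Subtracting the minimum and adding 1 maps the smallest rate to 1;
# multiplying by 10 then gives it a marker size of 10.
sizes = (values - values.min() + 1) * 10
print(sizes)   # [  10.  510. 1010.]
```

In Matplotlib's scatter function, the "s" argument is the marker area in points squared, so a rate that is 50 units above the minimum produces a marker with 51 times the area of the smallest one.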