Project 01: Exploratory Data Analysis

Zillow Housing Prices Exploratory Data Analysis

Load Data

import pandas as pd
import numpy as np
import seaborn as sns
state_data = pd.read_csv("zecon/State_time_series.csv",parse_dates=True)
print(len(state_data.columns))
print(len(state_data))
82
13212
state_data = state_data[['Date','RegionName','ZHVI_1bedroom','ZHVI_2bedroom','ZHVI_3bedroom','ZHVI_4bedroom','ZHVI_BottomTier','ZHVI_CondoCoop','ZHVI_MiddleTier','ZHVI_SingleFamilyResidence','ZHVI_TopTier']]
state_data["Date"] = pd.to_datetime(state_data["Date"])
state_data['Year'] = state_data['Date'].dt.year
state_data['Month'] = state_data['Date'].dt.month
cross_data = pd.read_csv("zecon/cities_crosswalk.csv")
#cross_data.head()
cities_data = pd.read_csv("zecon/City_time_series.csv")
cities_data.head()
Date RegionName InventorySeasonallyAdjusted_AllHomes InventoryRaw_AllHomes MedianListingPricePerSqft_1Bedroom MedianListingPricePerSqft_2Bedroom MedianListingPricePerSqft_3Bedroom MedianListingPricePerSqft_4Bedroom MedianListingPricePerSqft_5BedroomOrMore MedianListingPricePerSqft_AllHomes ... ZHVI_BottomTier ZHVI_CondoCoop ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier ZRI_AllHomes ZRI_AllHomesPlusMultifamily ZriPerSqft_AllHomes Zri_MultiFamilyResidenceRental Zri_SingleFamilyResidenceRental
0 1996-04-30 abbottstownadamspa NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 108700.0 NaN NaN NaN NaN NaN
1 1996-04-30 aberdeenbinghamid NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 168400.0 NaN NaN NaN NaN NaN
2 1996-04-30 aberdeenharfordmd NaN NaN NaN NaN NaN NaN NaN NaN ... 81300.0 137900.0 109600.0 108600.0 147900.0 NaN NaN NaN NaN NaN
3 1996-04-30 aberdeenmonroems NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 74500.0 NaN NaN NaN NaN NaN
4 1996-04-30 aberdeenmoorenc NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 131100.0 NaN NaN NaN NaN NaN

5 rows × 81 columns

print(len(cities_data.columns))
print(len(cities_data))
81
3762566
cities_data["Date"] = pd.to_datetime(cities_data["Date"])
cities_data['Year'] = cities_data['Date'].dt.year
cities_data['Month'] = cities_data['Date'].dt.month
city_state_data = cities_data.join(cross_data.set_index('Unique_City_ID'), on='RegionName')

Michigan

data_michigan = state_data.loc[state_data['RegionName'] == 'Michigan']
year_plot = sns.barplot(x="Year", y="ZHVI_MiddleTier", data=data_michigan)
year_plot.set_xticklabels(year_plot.get_xticklabels(), rotation=90)
[Text(0, 0, '1996'),
 Text(0, 0, '1997'),
 Text(0, 0, '1998'),
 Text(0, 0, '1999'),
 Text(0, 0, '2000'),
 Text(0, 0, '2001'),
 Text(0, 0, '2002'),
 Text(0, 0, '2003'),
 Text(0, 0, '2004'),
 Text(0, 0, '2005'),
 Text(0, 0, '2006'),
 Text(0, 0, '2007'),
 Text(0, 0, '2008'),
 Text(0, 0, '2009'),
 Text(0, 0, '2010'),
 Text(0, 0, '2011'),
 Text(0, 0, '2012'),
 Text(0, 0, '2013'),
 Text(0, 0, '2014'),
 Text(0, 0, '2015'),
 Text(0, 0, '2016'),
 Text(0, 0, '2017')]

png

This graph shows the median housing price within michigan over the course of the dataset. It accurately shows the 2008 recession and therefore is fairly reasonable to believe.

Examining the Country in 2017

country_df = state_data[["RegionName","ZHVI_MiddleTier","Year"]]
country_df = country_df.loc[country_df["Year"]==2017]
country_df = country_df.dropna()
#country_df
state_plot = sns.barplot(x="RegionName", y="ZHVI_MiddleTier", data=country_df)
state_plot.set_xticklabels(state_plot.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'Alabama'),
 Text(0, 0, 'Alaska'),
 Text(0, 0, 'Arizona'),
 Text(0, 0, 'Arkansas'),
 Text(0, 0, 'California'),
 Text(0, 0, 'Colorado'),
 Text(0, 0, 'Connecticut'),
 Text(0, 0, 'Delaware'),
 Text(0, 0, 'DistrictofColumbia'),
 Text(0, 0, 'Florida'),
 Text(0, 0, 'Georgia'),
 Text(0, 0, 'Hawaii'),
 Text(0, 0, 'Idaho'),
 Text(0, 0, 'Illinois'),
 Text(0, 0, 'Indiana'),
 Text(0, 0, 'Iowa'),
 Text(0, 0, 'Kansas'),
 Text(0, 0, 'Kentucky'),
 Text(0, 0, 'Maine'),
 Text(0, 0, 'Maryland'),
 Text(0, 0, 'Massachusetts'),
 Text(0, 0, 'Michigan'),
 Text(0, 0, 'Minnesota'),
 Text(0, 0, 'Mississippi'),
 Text(0, 0, 'Missouri'),
 Text(0, 0, 'Montana'),
 Text(0, 0, 'Nebraska'),
 Text(0, 0, 'Nevada'),
 Text(0, 0, 'NewHampshire'),
 Text(0, 0, 'NewJersey'),
 Text(0, 0, 'NewMexico'),
 Text(0, 0, 'NewYork'),
 Text(0, 0, 'NorthCarolina'),
 Text(0, 0, 'NorthDakota'),
 Text(0, 0, 'Ohio'),
 Text(0, 0, 'Oklahoma'),
 Text(0, 0, 'Oregon'),
 Text(0, 0, 'Pennsylvania'),
 Text(0, 0, 'RhodeIsland'),
 Text(0, 0, 'SouthCarolina'),
 Text(0, 0, 'SouthDakota'),
 Text(0, 0, 'Tennessee'),
 Text(0, 0, 'Texas'),
 Text(0, 0, 'Utah'),
 Text(0, 0, 'Vermont'),
 Text(0, 0, 'Virginia'),
 Text(0, 0, 'Washington'),
 Text(0, 0, 'WestVirginia'),
 Text(0, 0, 'Wisconsin'),
 Text(0, 0, 'Wyoming')]

png

This graph makes intuitive sense as the states as the top are Hawaii and California along with Washington D.C. The cheapes states are West Virginia, Mississippi, and Oklahoma.


QUESTION ONE: I’d like to find out where is the most expensive city to live in 2017 and where is the least expensive.

Take the values for the state data set prices and determine which are the max and min prices overall. I can then visualize these through creating a scatter plot in plot.ly or with d3 that maps the max and min for each state’s prices to show the spread and differences for each of the states

max_min_cities = city_state_data
q1_a = max_min_cities[["Date","RegionName","ZHVIPerSqft_AllHomes","ZHVI_BottomTier","ZHVI_MiddleTier","ZHVI_SingleFamilyResidence","ZHVI_TopTier","Month","Year","City","County","State"]]
q1_a = q1_a.dropna()
#specific data for question
q1_a = q1_a.loc[q1_a['Year'] == 2017]
q1_a = q1_a.loc[q1_a['Month']== 5]
q1_a.head()
Date RegionName ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier Month Year City County State
3630515 2017-05-31 abbottstownadamspa 121.0 133100.0 143800.0 143800.0 166700.0 5 2017 Abbottstown Adams PA
3630518 2017-05-31 aberdeengrays_harborwa 82.0 79100.0 123700.0 123800.0 199200.0 5 2017 Aberdeen Grays Harbor WA
3630519 2017-05-31 aberdeenharfordmd 142.0 144300.0 204100.0 202500.0 298300.0 5 2017 Aberdeen Harford MD
3630523 2017-05-31 abilenedickinsonks 72.0 65000.0 96700.0 96700.0 146100.0 5 2017 Abilene Dickinson KS
3630525 2017-05-31 abingdonharfordmd 151.0 184400.0 238600.0 241200.0 317000.0 5 2017 Abingdon Harford MD
q1_a['graph'] = q1_a["City"] + " " + q1_a["State"]
q1_a.head()
Date RegionName ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier Month Year City County State graph
3630515 2017-05-31 abbottstownadamspa 121.0 133100.0 143800.0 143800.0 166700.0 5 2017 Abbottstown Adams PA Abbottstown PA
3630518 2017-05-31 aberdeengrays_harborwa 82.0 79100.0 123700.0 123800.0 199200.0 5 2017 Aberdeen Grays Harbor WA Aberdeen WA
3630519 2017-05-31 aberdeenharfordmd 142.0 144300.0 204100.0 202500.0 298300.0 5 2017 Aberdeen Harford MD Aberdeen MD
3630523 2017-05-31 abilenedickinsonks 72.0 65000.0 96700.0 96700.0 146100.0 5 2017 Abilene Dickinson KS Abilene KS
3630525 2017-05-31 abingdonharfordmd 151.0 184400.0 238600.0 241200.0 317000.0 5 2017 Abingdon Harford MD Abingdon MD
#q1_a_vis = q1_a.sort_values("ton",ascending=False).head(5)
q1_a_vis = q1_a[["ZHVI_MiddleTier","City","County","State","graph"]]
q1_a_vis.sort_values("ZHVI_MiddleTier",ascending=False)
ZHVI_MiddleTier City County State graph
3631033 6399100.0 Atherton San Mateo CA Atherton CA
3638742 4421300.0 Los Altos Hills Santa Clara CA Los Altos Hills CA
3636938 4243800.0 Hillsborough San Mateo CA Hillsborough CA
3646055 4099600.0 Water Mill Suffolk NY Water Mill NY
3642104 3797800.0 Portola Valley San Mateo CA Portola Valley CA
3631519 3562400.0 Belvedere Marin CA Belvedere CA
3646916 3395200.0 Woodside San Mateo CA Woodside CA
3636873 3183000.0 Hidden Hills Los Angeles CA Hidden Hills CA
3631662 3042500.0 Beverly Hills Los Angeles CA Beverly Hills CA
3644200 3020100.0 Stanford Santa Clara CA Stanford CA
3639005 2892100.0 Malibu Los Angeles CA Malibu CA
3638743 2891400.0 Los Altos Santa Clara CA Los Altos CA
3639885 2881600.0 Monte Sereno Santa Clara CA Monte Sereno CA
3641384 2645700.0 Palo Alto Santa Clara CA Palo Alto CA
3642330 2594900.0 Rancho Santa Fe San Diego CA Rancho Santa Fe CA
3647004 2558600.0 Yarrow Point King WA Yarrow Point WA
3639434 2541700.0 Medina King WA Medina WA
3643292 2518200.0 Saratoga Santa Clara CA Saratoga CA
3644754 2507100.0 Tiburon Marin CA Tiburon CA
3633293 2428000.0 Clyde Hill King WA Clyde Hill WA
3643225 2285300.0 Sands Point Nassau NY Sands Point NY
3643202 2244100.0 San Marino Los Angeles CA San Marino CA
3639048 2242500.0 Manhattan Beach Los Angeles CA Manhattan Beach CA
3641066 2225600.0 Old Westbury Nassau NY Old Westbury NY
3641733 2145900.0 Piedmont Alameda CA Piedmont CA
3634079 2107900.0 Del Mar San Diego CA Del Mar CA
3632968 2055600.0 Cherry Hills Village Arapahoe CO Cherry Hills Village CO
3637730 2053800.0 Kentfield Marin CA Kentfield CA
3641886 2020800.0 Plandome Nassau NY Plandome NY
3641391 2011400.0 Palos Verdes Estates Los Angeles CA Palos Verdes Estates CA
... ... ... ... ... ...
3639045 48700.0 Mangum Greer OK Mangum OK
3639365 48600.0 McDonald Chapel Jefferson AL McDonald Chapel AL
3634409 48300.0 Duquesne Allegheny PA Duquesne PA
3639349 47800.0 McAdoo Schuylkill PA McAdoo PA
3635506 47700.0 Frackville Schuylkill PA Frackville PA
3638956 47400.0 Madison Madison IL Madison IL
3638559 47400.0 Lipscomb Jefferson AL Lipscomb AL
3635101 47100.0 Fairmont City Saint Clair IL Fairmont City IL
3640833 46800.0 Northridge Montgomery OH Northridge OH
3634676 46400.0 Edgewater Jefferson AL Edgewater AL
3645715 46400.0 Venice Madison IL Venice IL
3642349 45700.0 Rankin Allegheny PA Rankin PA
3640478 45600.0 New Philadelphia Schuylkill PA New Philadelphia PA
3645984 45500.0 Warren Trumbull OH Warren OH
3635049 44400.0 Experiment Spalding GA Experiment GA
3632854 44300.0 Centreville Saint Clair IL Centreville IL
3635433 44000.0 Fort McKinley Montgomery OH Fort McKinley OH
3639752 43300.0 Minersville Schuylkill PA Minersville PA
3634580 42800.0 East Saint Louis Saint Clair IL East Saint Louis IL
3632527 42300.0 Campbell Mahoning OH Campbell OH
3644592 41900.0 Tamaqua Schuylkill PA Tamaqua PA
3634178 41400.0 Detroit Wayne MI Detroit MI
3643602 39800.0 Shenandoah Heights Schuylkill PA Shenandoah Heights PA
3633307 38300.0 Coaldale Schuylkill PA Coaldale PA
3643039 37600.0 Saint Clair Schuylkill PA Saint Clair PA
3637573 37600.0 Johnstown Cambria PA Johnstown PA
3635895 32900.0 Girardville Schuylkill PA Girardville PA
3630999 31600.0 Ashland Schuylkill PA Ashland PA
3643607 31100.0 Shenandoah Schuylkill PA Shenandoah PA
3638981 30100.0 Mahanoy City Schuylkill PA Mahanoy City PA

9886 rows × 5 columns

q1_a_vis = q1_a_vis.head(20)
q1_a_plot = sns.barplot(x="graph", y="ZHVI_MiddleTier", data=q1_a_vis)
q1_a_plot.set_xticklabels(q1_a_plot.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'Abbottstown PA'),
 Text(0, 0, 'Aberdeen WA'),
 Text(0, 0, 'Aberdeen MD'),
 Text(0, 0, 'Abilene KS'),
 Text(0, 0, 'Abingdon MD'),
 Text(0, 0, 'Abingdon VA'),
 Text(0, 0, 'Abington PA'),
 Text(0, 0, 'Abington MA'),
 Text(0, 0, 'Absecon NJ'),
 Text(0, 0, 'Accident MD'),
 Text(0, 0, 'Accokeek MD'),
 Text(0, 0, 'Acres Green CO'),
 Text(0, 0, 'Acton CA'),
 Text(0, 0, 'Acton MA'),
 Text(0, 0, 'Acton ME'),
 Text(0, 0, 'Acushnet MA'),
 Text(0, 0, 'Acworth GA'),
 Text(0, 0, 'Acworth NH'),
 Text(0, 0, 'Ada OH'),
 Text(0, 0, 'Adair OK')]

png

q1_b_vis = q1_a.sort_values("ZHVI_MiddleTier",ascending=False).tail(20)
q1_b_vis[["ZHVI_MiddleTier","City","County","State"]]
ZHVI_MiddleTier City County State
3645715 46400.0 Venice Madison IL
3642349 45700.0 Rankin Allegheny PA
3640478 45600.0 New Philadelphia Schuylkill PA
3645984 45500.0 Warren Trumbull OH
3635049 44400.0 Experiment Spalding GA
3632854 44300.0 Centreville Saint Clair IL
3635433 44000.0 Fort McKinley Montgomery OH
3639752 43300.0 Minersville Schuylkill PA
3634580 42800.0 East Saint Louis Saint Clair IL
3632527 42300.0 Campbell Mahoning OH
3644592 41900.0 Tamaqua Schuylkill PA
3634178 41400.0 Detroit Wayne MI
3643602 39800.0 Shenandoah Heights Schuylkill PA
3633307 38300.0 Coaldale Schuylkill PA
3643039 37600.0 Saint Clair Schuylkill PA
3637573 37600.0 Johnstown Cambria PA
3635895 32900.0 Girardville Schuylkill PA
3630999 31600.0 Ashland Schuylkill PA
3643607 31100.0 Shenandoah Schuylkill PA
3638981 30100.0 Mahanoy City Schuylkill PA
q1_b_plot = sns.barplot(x="graph", y="ZHVI_MiddleTier", data=q1_b_vis)
q1_b_plot.set_xticklabels(q1_b_plot.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'Venice IL'),
 Text(0, 0, 'Rankin PA'),
 Text(0, 0, 'New Philadelphia PA'),
 Text(0, 0, 'Warren OH'),
 Text(0, 0, 'Experiment GA'),
 Text(0, 0, 'Centreville IL'),
 Text(0, 0, 'Fort McKinley OH'),
 Text(0, 0, 'Minersville PA'),
 Text(0, 0, 'East Saint Louis IL'),
 Text(0, 0, 'Campbell OH'),
 Text(0, 0, 'Tamaqua PA'),
 Text(0, 0, 'Detroit MI'),
 Text(0, 0, 'Shenandoah Heights PA'),
 Text(0, 0, 'Coaldale PA'),
 Text(0, 0, 'Saint Clair PA'),
 Text(0, 0, 'Johnstown PA'),
 Text(0, 0, 'Girardville PA'),
 Text(0, 0, 'Ashland PA'),
 Text(0, 0, 'Shenandoah PA'),
 Text(0, 0, 'Mahanoy City PA')]

png

QUESTION 2: Explore each states’ cities and figure out where the most expensive and least expensive places to live are for 2017 in each state.

Take the values for the city data set and determine which city has the highest price per square foot in each state. Then create scatterplot with matplotlib to measure each states’ individual cities and show which are the most expensive and least expensive parts of the state.

q2_a = city_state_data.loc[city_state_data['Year'] == 2017]
q2_a = q2_a.loc[q2_a['Month'] == 5]
q2_a = q2_a[["Date","ZHVIPerSqft_AllHomes","Month","Year","City","County","State"]]
q2_a = q2_a.dropna()
q2_a.head()
Date ZHVIPerSqft_AllHomes Month Year City County State
3630513 2017-05-31 92.0 5 2017 Abbeville Lafayette MS
3630515 2017-05-31 121.0 5 2017 Abbottstown Adams PA
3630516 2017-05-31 90.0 5 2017 Aberdeen Bingham ID
3630518 2017-05-31 82.0 5 2017 Aberdeen Grays Harbor WA
3630519 2017-05-31 142.0 5 2017 Aberdeen Harford MD
q2_max = q2_a.groupby(q2_a["State"],sort=False)["ZHVIPerSqft_AllHomes"].transform(max) == q2_a["ZHVIPerSqft_AllHomes"]
q2_a[q2_max]
Date ZHVIPerSqft_AllHomes Month Year City County State
3630933 2017-05-31 463.0 5 2017 Arlington Arlington VA
3631011 2017-05-31 1132.0 5 2017 Aspen Pitkin CO
3631033 2017-05-31 1746.0 5 2017 Atherton San Mateo CA
3631451 2017-05-31 363.0 5 2017 Belle Meade Davidson TN
3631463 2017-05-31 172.0 5 2017 Bellefonte New Castle DE
3631707 2017-05-31 251.0 5 2017 Birmingham Oakland MI
3632352 2017-05-31 200.0 5 2017 Burlington Chittenden VT
3632520 2017-05-31 147.0 5 2017 Cammack Village Pulaski AR
3632944 2017-05-31 137.0 5 2017 Cheat Lake Monongalia WV
3633019 2017-05-31 647.0 5 2017 Chevy Chase Village Montgomery MD
3633408 2017-05-31 192.0 5 2017 Colonial Pine Hills Pennington SD
3634333 2017-05-31 279.0 5 2017 Druid Hills Dekalb GA
3635022 2017-05-31 166.0 5 2017 Evansville Natrona WY
3635274 2017-05-31 1098.0 5 2017 Fisher Island Miami-Dade FL
3635520 2017-05-31 264.0 5 2017 Frankfort Franklin KY
3635710 2017-05-31 210.0 5 2017 Gallatin Gateway Gallatin MT
3635938 2017-05-31 426.0 5 2017 Glenbrook Douglas NV
3636293 2017-05-31 579.0 5 2017 Greenwich Fairfield CT
3636423 2017-05-31 704.0 5 2017 Haleiwa Honolulu HI
3637616 2017-05-31 210.0 5 2017 Juneau Juneau AK
3637706 2017-05-31 417.0 5 2017 Kenilworth Cook IL
3637754 2017-05-31 388.0 5 2017 Ketchum Blaine ID
3637984 2017-05-31 265.0 5 2017 Ladue Saint Louis MO
3638509 2017-05-31 215.0 5 2017 Lincoln Burleigh ND
3638957 2017-05-31 110.0 5 2017 Madison Madison MS
3639099 2017-05-31 207.0 5 2017 Maple Bluff Dane WI
3639434 2017-05-31 764.0 5 2017 Medina King WA
3639508 2017-05-31 175.0 5 2017 Meridian Hills Marion IN
3639784 2017-05-31 244.0 5 2017 Mission Hills Johnson KS
3640113 2017-05-31 295.0 5 2017 Mount Olympus Salt Lake UT
3640405 2017-05-31 426.0 5 2017 New Castle Rockingham NH
3640473 2017-05-31 156.0 5 2017 New Orleans Orleans LA
3640498 2017-05-31 590.0 5 2017 New Shoreham Washington RI
3640604 2017-05-31 171.0 5 2017 Nichols Hills Oklahoma OK
3641023 2017-05-31 330.0 5 2017 Ogunquit York ME
3641134 2017-05-31 265.0 5 2017 Orange Beach Baldwin AL
3641419 2017-05-31 377.0 5 2017 Paradise Valley Maricopa AZ
3642101 2017-05-31 330.0 5 2017 Portland Multnomah OR
3642301 2017-05-31 256.0 5 2017 Radnor Delaware PA
3642803 2017-05-31 414.0 5 2017 Rollingwood Travis TX
3643266 2017-05-31 195.0 5 2017 Santa Fe Santa Fe NM
3644306 2017-05-31 763.0 5 2017 Stone Harbor Cape May NJ
3644416 2017-05-31 606.0 5 2017 Sullivans Island Charleston SC
3644695 2017-05-31 262.0 5 2017 The Village of Indian Hill Hamilton OH
3644830 2017-05-31 353.0 5 2017 Tonka Bay Hennepin MN
3645147 2017-05-31 781.0 5 2017 Town of Nantucket Nantucket MA
3645562 2017-05-31 187.0 5 2017 University Heights Johnson IA
3646033 2017-05-31 493.0 5 2017 Washington District of Columbia DC
3646055 2017-05-31 1185.0 5 2017 Water Mill Suffolk NY
3646070 2017-05-31 158.0 5 2017 Waterloo Douglas NE
3646964 2017-05-31 555.0 5 2017 Wrightsville Beach New Hanover NC
max_graph = sns.barplot(x="State", y="ZHVIPerSqft_AllHomes", data=q2_a[q2_max])
max_graph.set_xticklabels(max_graph.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'VA'),
 Text(0, 0, 'CO'),
 Text(0, 0, 'CA'),
 Text(0, 0, 'TN'),
 Text(0, 0, 'DE'),
 Text(0, 0, 'MI'),
 Text(0, 0, 'VT'),
 Text(0, 0, 'AR'),
 Text(0, 0, 'WV'),
 Text(0, 0, 'MD'),
 Text(0, 0, 'SD'),
 Text(0, 0, 'GA'),
 Text(0, 0, 'WY'),
 Text(0, 0, 'FL'),
 Text(0, 0, 'KY'),
 Text(0, 0, 'MT'),
 Text(0, 0, 'NV'),
 Text(0, 0, 'CT'),
 Text(0, 0, 'HI'),
 Text(0, 0, 'AK'),
 Text(0, 0, 'IL'),
 Text(0, 0, 'ID'),
 Text(0, 0, 'MO'),
 Text(0, 0, 'ND'),
 Text(0, 0, 'MS'),
 Text(0, 0, 'WI'),
 Text(0, 0, 'WA'),
 Text(0, 0, 'IN'),
 Text(0, 0, 'KS'),
 Text(0, 0, 'UT'),
 Text(0, 0, 'NH'),
 Text(0, 0, 'LA'),
 Text(0, 0, 'RI'),
 Text(0, 0, 'OK'),
 Text(0, 0, 'ME'),
 Text(0, 0, 'AL'),
 Text(0, 0, 'AZ'),
 Text(0, 0, 'OR'),
 Text(0, 0, 'PA'),
 Text(0, 0, 'TX'),
 Text(0, 0, 'NM'),
 Text(0, 0, 'NJ'),
 Text(0, 0, 'SC'),
 Text(0, 0, 'OH'),
 Text(0, 0, 'MN'),
 Text(0, 0, 'MA'),
 Text(0, 0, 'IA'),
 Text(0, 0, 'DC'),
 Text(0, 0, 'NY'),
 Text(0, 0, 'NE'),
 Text(0, 0, 'NC')]

png

q2_min = q2_a.groupby(q2_a["State"],sort=False)["ZHVIPerSqft_AllHomes"].transform(min) == q2_a["ZHVIPerSqft_AllHomes"]
q2_a[q2_min]
Date ZHVIPerSqft_AllHomes Month Year City County State
3631340 2017-05-31 44.0 5 2017 Bayard Grant NM
3631591 2017-05-31 43.0 5 2017 Berlin Coos NH
3632027 2017-05-31 33.0 5 2017 Breckenridge Stephens TX
3632101 2017-05-31 41.0 5 2017 Brighton Jefferson AL
3632436 2017-05-31 41.0 5 2017 Cahokia Saint Clair IL
3632779 2017-05-31 87.0 5 2017 Cedar City Iron UT
3632890 2017-05-31 36.0 5 2017 Chanute Neosho KS
3632938 2017-05-31 48.0 5 2017 Chattahoochee Gadsden FL
3633738 2017-05-31 57.0 5 2017 Creighton Knox NE
3633946 2017-05-31 56.0 5 2017 Danville Danville City VA
3634102 2017-05-31 121.0 5 2017 Dell Rapids Minnehaha SD
3634275 2017-05-31 57.0 5 2017 Douglas Cochise AZ
3634697 2017-05-31 33.0 5 2017 Edison Calhoun GA
3635064 2017-05-31 123.0 5 2017 Fairbanks Fairbanks North Star AK
3635104 2017-05-31 39.0 5 2017 Fairmont Robeson NC
3635682 2017-05-31 56.0 5 2017 Gaffney Cherokee SC
3636099 2017-05-31 109.0 5 2017 Grand Forks Grand Forks ND
3636251 2017-05-31 38.0 5 2017 Greenfield Weakley TN
3636302 2017-05-31 43.0 5 2017 Greenwood Leflore MS
3636319 2017-05-31 85.0 5 2017 Greybull Big Horn WY
3636585 2017-05-31 101.0 5 2017 Harrington Kent DE
3636616 2017-05-31 39.0 5 2017 Hartford City Blackford IN
3636618 2017-05-31 78.0 5 2017 Hartford Hartford CT
3637116 2017-05-31 70.0 5 2017 Hoquiam Grays Harbor WA
3637281 2017-05-31 78.0 5 2017 Idaho Falls Bonneville ID
3637539 2017-05-31 47.0 5 2017 Jennings Saint Louis MO
3638201 2017-05-31 41.0 5 2017 Lanett Chambers AL
3638251 2017-05-31 103.0 5 2017 Laughlin Clark NV
3638457 2017-05-31 84.0 5 2017 Libby Lincoln MT
3638672 2017-05-31 34.0 5 2017 Lonaconing Allegany MD
3638981 2017-05-31 23.0 5 2017 Mahanoy City Schuylkill PA
3639221 2017-05-31 48.0 5 2017 Marshalltown Marshall IA
3639806 2017-05-31 81.0 5 2017 Mojave Kern CA
3639900 2017-05-31 43.0 5 2017 Montevideo Chippewa MN
3640275 2017-05-31 45.0 5 2017 Naoma Raleigh WV
3640339 2017-05-31 58.0 5 2017 Nekoosa Wood WI
3640658 2017-05-31 86.0 5 2017 North Adams Berkshire MA
3640769 2017-05-31 121.0 5 2017 North Providence Providence RI
3640855 2017-05-31 56.0 5 2017 Norway Orangeburg SC
3640884 2017-05-31 76.0 5 2017 Nyssa Malheur OR
3641323 2017-05-31 161.0 5 2017 Pahoa Hawaii HI
3641765 2017-05-31 48.0 5 2017 Pine Bluff Jefferson AR
3642182 2017-05-31 41.0 5 2017 Prichard Mobile AL
3642633 2017-05-31 35.0 5 2017 River Rouge Wayne MI
3642657 2017-05-31 47.0 5 2017 Riverview Saint Louis MO
3642768 2017-05-31 57.0 5 2017 Rocky Ford Otero CO
3643131 2017-05-31 52.0 5 2017 Salem Salem NJ
3643139 2017-05-31 87.0 5 2017 Salina Sevier UT
3643199 2017-05-31 57.0 5 2017 San Manuel Pinal AZ
3645011 2017-05-31 30.0 5 2017 Town of Friendship Allegany NY
3645060 2017-05-31 51.0 5 2017 Town of Houlton Aroostook ME
3645262 2017-05-31 79.0 5 2017 Town of Springfield Windsor VT
3645379 2017-05-31 43.0 5 2017 Trimont Martin MN
3645822 2017-05-31 57.0 5 2017 Vivian Caddo LA
3646033 2017-05-31 493.0 5 2017 Washington District of Columbia DC
3646484 2017-05-31 53.0 5 2017 Westwood Boyd KY
3646497 2017-05-31 35.0 5 2017 Wewoka Seminole OK
3646983 2017-05-31 101.0 5 2017 Wyoming Kent DE
3647038 2017-05-31 30.0 5 2017 Youngstown Mahoning OH
min_graph = sns.barplot(x="State", y="ZHVIPerSqft_AllHomes", data=q2_a[q2_min])
min_graph.set_xticklabels(min_graph.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'NM'),
 Text(0, 0, 'NH'),
 Text(0, 0, 'TX'),
 Text(0, 0, 'AL'),
 Text(0, 0, 'IL'),
 Text(0, 0, 'UT'),
 Text(0, 0, 'KS'),
 Text(0, 0, 'FL'),
 Text(0, 0, 'NE'),
 Text(0, 0, 'VA'),
 Text(0, 0, 'SD'),
 Text(0, 0, 'AZ'),
 Text(0, 0, 'GA'),
 Text(0, 0, 'AK'),
 Text(0, 0, 'NC'),
 Text(0, 0, 'SC'),
 Text(0, 0, 'ND'),
 Text(0, 0, 'TN'),
 Text(0, 0, 'MS'),
 Text(0, 0, 'WY'),
 Text(0, 0, 'DE'),
 Text(0, 0, 'IN'),
 Text(0, 0, 'CT'),
 Text(0, 0, 'WA'),
 Text(0, 0, 'ID'),
 Text(0, 0, 'MO'),
 Text(0, 0, 'NV'),
 Text(0, 0, 'MT'),
 Text(0, 0, 'MD'),
 Text(0, 0, 'PA'),
 Text(0, 0, 'IA'),
 Text(0, 0, 'CA'),
 Text(0, 0, 'MN'),
 Text(0, 0, 'WV'),
 Text(0, 0, 'WI'),
 Text(0, 0, 'MA'),
 Text(0, 0, 'RI'),
 Text(0, 0, 'OR'),
 Text(0, 0, 'HI'),
 Text(0, 0, 'AR'),
 Text(0, 0, 'MI'),
 Text(0, 0, 'CO'),
 Text(0, 0, 'NJ'),
 Text(0, 0, 'NY'),
 Text(0, 0, 'ME'),
 Text(0, 0, 'VT'),
 Text(0, 0, 'LA'),
 Text(0, 0, 'DC'),
 Text(0, 0, 'KY'),
 Text(0, 0, 'OK'),
 Text(0, 0, 'OH')]

png


QUESTION 3: Determine which cities have the highest overall increase in housing prices over the last 10, 5, and 1 year. (Going back from the most current year in the data - 2017)

Calculate the difference between square foot price for each city from 2016 – 2017, 2012 – 2017, and 2007 – 2017. Which cities have the largest increases? Which have the least? Plot in seaborn for each of the time series in question.

q3 = city_state_data[["Date","RegionName","ZHVIPerSqft_AllHomes","Year","Month","City","County","State"]]
q3 = q3.dropna()
q3.head()
Date RegionName ZHVIPerSqft_AllHomes Year Month City County State
1 1996-04-30 aberdeenbinghamid 96.0 1996 4 Aberdeen Bingham ID
2 1996-04-30 aberdeenharfordmd 76.0 1996 4 Aberdeen Harford MD
5 1996-04-30 abernathyhaletx 48.0 1996 4 Abernathy Hale TX
7 1996-04-30 abingdonharfordmd 77.0 1996 4 Abingdon Harford MD
9 1996-04-30 abingdonwashingtonva 57.0 1996 4 Abingdon Washington VA
sqft_2017a = q3.loc[q3['Year']==2017]
sqft_2017 = sqft_2017a.loc[sqft_2017a['Month']==5]
sqft_2016a = q3.loc[q3['Year']==2016]
sqft_2016 = sqft_2016a.loc[sqft_2016a['Month']==5]
sqft_2017.head()
Date RegionName ZHVIPerSqft_AllHomes Year Month City County State
3630513 2017-05-31 abbevillelafayettems 92.0 2017 5 Abbeville Lafayette MS
3630515 2017-05-31 abbottstownadamspa 121.0 2017 5 Abbottstown Adams PA
3630516 2017-05-31 aberdeenbinghamid 90.0 2017 5 Aberdeen Bingham ID
3630518 2017-05-31 aberdeengrays_harborwa 82.0 2017 5 Aberdeen Grays Harbor WA
3630519 2017-05-31 aberdeenharfordmd 142.0 2017 5 Aberdeen Harford MD
sqft_2016.head()
Date RegionName ZHVIPerSqft_AllHomes Year Month City County State
3431864 2016-05-31 abbevillelafayettems 84.0 2016 5 Abbeville Lafayette MS
3431866 2016-05-31 abbottstownadamspa 117.0 2016 5 Abbottstown Adams PA
3431867 2016-05-31 aberdeenbinghamid 81.0 2016 5 Aberdeen Bingham ID
3431869 2016-05-31 aberdeengrays_harborwa 69.0 2016 5 Aberdeen Grays Harbor WA
3431870 2016-05-31 aberdeenharfordmd 138.0 2016 5 Aberdeen Harford MD
#x = sqft_2017.head(200)
#x.sort_values(by="ZHVIPerSqft_AllHomes",ascending=False)
#sqft_2016.head()
comparison_test = pd.merge(sqft_2017,sqft_2016, on='RegionName')
#comparison_test.head(100)
comparison_a = pd.merge(sqft_2017,sqft_2016, on='RegionName')
#comparison_a = comparison_a[["RegionName_x","ZHVIPerSqft_AllHomes_x","Year_x","City","State_x","ZHVIPerSqft_AllHomes_y","Year_y"]]
comparison_a['Dif'] = comparison_a["ZHVIPerSqft_AllHomes_x"] - comparison_a["ZHVIPerSqft_AllHomes_y"]
comparison_a.head()
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
0 2017-05-31 abbevillelafayettems 92.0 2017 5 Abbeville Lafayette MS 2016-05-31 84.0 2016 5 Abbeville Lafayette MS 8.0
1 2017-05-31 abbottstownadamspa 121.0 2017 5 Abbottstown Adams PA 2016-05-31 117.0 2016 5 Abbottstown Adams PA 4.0
2 2017-05-31 aberdeenbinghamid 90.0 2017 5 Aberdeen Bingham ID 2016-05-31 81.0 2016 5 Aberdeen Bingham ID 9.0
3 2017-05-31 aberdeengrays_harborwa 82.0 2017 5 Aberdeen Grays Harbor WA 2016-05-31 69.0 2016 5 Aberdeen Grays Harbor WA 13.0
4 2017-05-31 aberdeenharfordmd 142.0 2017 5 Aberdeen Harford MD 2016-05-31 138.0 2016 5 Aberdeen Harford MD 4.0
comparison_a.iloc[comparison_a["Dif"].idxmax()]
Date_x                     2017-05-31 00:00:00
RegionName                stinson_beachmarinca
ZHVIPerSqft_AllHomes_x                    1709
Year_x                                    2017
Month_x                                      5
City_x                           Stinson Beach
County_x                                 Marin
State_x                                     CA
Date_y                     2016-05-31 00:00:00
ZHVIPerSqft_AllHomes_y                    1575
Year_y                                    2016
Month_y                                      5
City_y                           Stinson Beach
County_y                                 Marin
State_y                                     CA
Dif                                        134
Name: 10096, dtype: object
comparison_a_graph = comparison_a.sort_values(by=["Dif"],ascending=False)
comparison_a_graph = comparison_a_graph.head(20)
comparison_a_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
10096 2017-05-31 stinson_beachmarinca 1709.0 2017 5 Stinson Beach Marin CA 2016-05-31 1575.0 2016 5 Stinson Beach Marin CA 134.0
352 2017-05-31 athertonsan_mateoca 1746.0 2017 5 Atherton San Mateo CA 2016-05-31 1633.0 2016 5 Atherton San Mateo CA 113.0
11451 2017-05-31 water_millsuffolkny 1185.0 2017 5 Water Mill Suffolk NY 2016-05-31 1082.0 2016 5 Water Mill Suffolk NY 103.0
10367 2017-05-31 telluridesan_miguelco 667.0 2017 5 Telluride San Miguel CO 2016-05-31 566.0 2016 5 Telluride San Miguel CO 101.0
9892 2017-05-31 southamptonsuffolkny 744.0 2017 5 Southampton Suffolk NY 2016-05-31 643.0 2016 5 Southampton Suffolk NY 101.0
1585 2017-05-31 cayucossan_luis_obispoca 712.0 2017 5 Cayucos San Luis Obispo CA 2016-05-31 612.0 2016 5 Cayucos San Luis Obispo CA 100.0
340 2017-05-31 aspenpitkinco 1132.0 2017 5 Aspen Pitkin CO 2016-05-31 1039.0 2016 5 Aspen Pitkin CO 93.0
2899 2017-05-31 east_palo_altosan_mateoca 689.0 2017 5 East Palo Alto San Mateo CA 2016-05-31 596.0 2016 5 East Palo Alto San Mateo CA 93.0
5966 2017-05-31 los_altos_hillssanta_claraca 1336.0 2017 5 Los Altos Hills Santa Clara CA 2016-05-31 1244.0 2016 5 Los Altos Hills Santa Clara CA 92.0
7954 2017-05-31 palo_altosanta_claraca 1574.0 2017 5 Palo Alto Santa Clara CA 2016-05-31 1485.0 2016 5 Palo Alto Santa Clara CA 89.0
11144 2017-05-31 vaileagleco 631.0 2017 5 Vail Eagle CO 2016-05-31 542.0 2016 5 Vail Eagle CO 89.0
4405 2017-05-31 harrisonhudsonnj 707.0 2017 5 Harrison Hudson NJ 2016-05-31 621.0 2016 5 Harrison Hudson NJ 86.0
5372 2017-05-31 la_hondasan_mateoca 718.0 2017 5 La Honda San Mateo CA 2016-05-31 636.0 2016 5 La Honda San Mateo CA 82.0
11546 2017-05-31 weehawkenhudsonnj 416.0 2017 5 Weehawken Hudson NJ 2016-05-31 335.0 2016 5 Weehawken Hudson NJ 81.0
7572 2017-05-31 noyacksuffolkny 636.0 2017 5 Noyack Suffolk NY 2016-05-31 556.0 2016 5 Noyack Suffolk NY 80.0
5581 2017-05-31 larkspurmarinca 840.0 2017 5 Larkspur Marin CA 2016-05-31 762.0 2016 5 Larkspur Marin CA 78.0
10505 2017-05-31 topangalos_angelesca 787.0 2017 5 Topanga Los Angeles CA 2016-05-31 710.0 2016 5 Topanga Los Angeles CA 77.0
6466 2017-05-31 medinakingwa 764.0 2017 5 Medina King WA 2016-05-31 687.0 2016 5 Medina King WA 77.0
760 2017-05-31 berkeleyalamedaca 765.0 2017 5 Berkeley Alameda CA 2016-05-31 688.0 2016 5 Berkeley Alameda CA 77.0
4960 2017-05-31 invernessmarinca 726.0 2017 5 Inverness Marin CA 2016-05-31 650.0 2016 5 Inverness Marin CA 76.0
len(comparison_a_graph)
20
comparison_a_graph.to_csv("comparison_a_graph.csv",encoding='utf-8', index=False)
sns.lineplot(x="State_x", y="ZHVIPerSqft_AllHomes_x",data=comparison_a_graph)
<matplotlib.axes._subplots.AxesSubplot at 0x1a31652cc0>

png

test_graph = sns.barplot(x="State_x", y="ZHVIPerSqft_AllHomes_x", data=comparison_a_graph)
test_graph.set_xticklabels(max_graph.get_xticklabels(), rotation=90,size=7)
[Text(0, 0, 'VA'),
 Text(0, 0, 'CO'),
 Text(0, 0, 'CA'),
 Text(0, 0, 'TN'),
 Text(0, 0, 'DE')]

png

comparison_a.iloc[comparison_a["Dif"].idxmin()]
Date_x                          2017-05-31 00:00:00
RegionName                fisher_islandmiami_dadefl
ZHVIPerSqft_AllHomes_x                         1098
Year_x                                         2017
Month_x                                           5
City_x                                Fisher Island
County_x                                 Miami-Dade
State_x                                          FL
Date_y                          2016-05-31 00:00:00
ZHVIPerSqft_AllHomes_y                         1183
Year_y                                         2016
Month_y                                           5
City_y                                Fisher Island
County_y                                 Miami-Dade
State_y                                          FL
Dif                                             -85
Name: 3426, dtype: object
comparison_a2_graph = comparison_a.sort_values(by=["Dif"],ascending=True)
comparison_a2_graph = comparison_a2_graph.head(20)
comparison_a2_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
3426 2017-05-31 fisher_islandmiami_dadefl 1098.0 2017 5 Fisher Island Miami-Dade FL 2016-05-31 1183.0 2016 5 Fisher Island Miami-Dade FL -85.0
9735 2017-05-31 snowmass_villagepitkinco 634.0 2017 5 Snowmass Village Pitkin CO 2016-05-31 679.0 2016 5 Snowmass Village Pitkin CO -45.0
8269 2017-05-31 piney_pointharristx 307.0 2017 5 Piney Point Harris TX 2016-05-31 347.0 2016 5 Piney Point Harris TX -40.0
10845 2017-05-31 town_of_scarboroughcumberlandme 276.0 2017 5 Town of Scarborough Cumberland ME 2016-05-31 314.0 2016 5 Town of Scarborough Cumberland ME -38.0
9909 2017-05-31 southside_placeharristx 296.0 2017 5 Southside Place Harris TX 2016-05-31 329.0 2016 5 Southside Place Harris TX -33.0
4043 2017-05-31 grand_forksgrand_forksnd 109.0 2017 5 Grand Forks Grand Forks ND 2016-05-31 137.0 2016 5 Grand Forks Grand Forks ND -28.0
5249 2017-05-31 key_biscaynemiami_dadefl 664.0 2017 5 Key Biscayne Miami-Dade FL 2016-05-31 688.0 2016 5 Key Biscayne Miami-Dade FL -24.0
9009 2017-05-31 rooseveltduchesneut 100.0 2017 5 Roosevelt Duchesne UT 2016-05-31 120.0 2016 5 Roosevelt Duchesne UT -20.0
8558 2017-05-31 prophetstownwhitesideil 71.0 2017 5 Prophetstown Whiteside IL 2016-05-31 89.0 2016 5 Prophetstown Whiteside IL -18.0
5200 2017-05-31 kenilworthcookil 417.0 2017 5 Kenilworth Cook IL 2016-05-31 435.0 2016 5 Kenilworth Cook IL -18.0
11967 2017-05-31 wiltonmuscatineia 111.0 2017 5 Wilton Muscatine IA 2016-05-31 129.0 2016 5 Wilton Muscatine IA -18.0
8587 2017-05-31 put_in_bayottawaoh 195.0 2017 5 Put in Bay Ottawa OH 2016-05-31 213.0 2016 5 Put in Bay Ottawa OH -18.0
4695 2017-05-31 holdernessgraftonnh 173.0 2017 5 Holderness Grafton NH 2016-05-31 189.0 2016 5 Holderness Grafton NH -16.0
10537 2017-05-31 town_of_almondalleganyny 50.0 2017 5 Town of Almond Allegany NY 2016-05-31 65.0 2016 5 Town of Almond Allegany NY -15.0
11135 2017-05-31 uticalickingoh 70.0 2017 5 Utica Licking OH 2016-05-31 84.0 2016 5 Utica Licking OH -14.0
6540 2017-05-31 mettawalakeil 188.0 2017 5 Mettawa Lake IL 2016-05-31 202.0 2016 5 Mettawa Lake IL -14.0
10182 2017-05-31 sugarcreek_townshiparmstrongpa 95.0 2017 5 Sugarcreek Township Armstrong PA 2016-05-31 108.0 2016 5 Sugarcreek Township Armstrong PA -13.0
7127 2017-05-31 nehalemtillamookor 224.0 2017 5 Nehalem Tillamook OR 2016-05-31 237.0 2016 5 Nehalem Tillamook OR -13.0
11777 2017-05-31 westportfairfieldct 402.0 2017 5 Westport Fairfield CT 2016-05-31 415.0 2016 5 Westport Fairfield CT -13.0
9102 2017-05-31 rushfordalleganyny 65.0 2017 5 Rushford Allegany NY 2016-05-31 78.0 2016 5 Rushford Allegany NY -13.0
comparison_a2_graph.to_csv("comparison_a2_graph.csv",encoding='utf-8', index=False)
sns.lineplot(x="State_x", y="ZHVIPerSqft_AllHomes_x",data=comparison_a2_graph)
<matplotlib.axes._subplots.AxesSubplot at 0x1a354ae780>

png

sqft_2017b = sqft_2017
sqft_2012a = q3.loc[q3['Year']==2012]
sqft_2012 = sqft_2012a.loc[sqft_2012a['Month']==5]
comparison_b = pd.merge(sqft_2017b,sqft_2012, on='RegionName')
#comparison_b = comparison_b[["RegionName_x","ZHVIPerSqft_AllHomes_x","Year_x","City","State_x","ZHVIPerSqft_AllHomes_y","Year_y"]]
comparison_b['Dif'] = comparison_b["ZHVIPerSqft_AllHomes_x"] - comparison_b["ZHVIPerSqft_AllHomes_y"]
#comparison_b.head()
comparison_b.iloc[comparison_b["Dif"].idxmax()]
Date_x                       2017-05-31 00:00:00
RegionName                palo_altosanta_claraca
ZHVIPerSqft_AllHomes_x                      1574
Year_x                                      2017
Month_x                                        5
City_x                                 Palo Alto
County_x                             Santa Clara
State_x                                       CA
Date_y                       2012-05-31 00:00:00
ZHVIPerSqft_AllHomes_y                       801
Year_y                                      2012
Month_y                                        5
City_y                                 Palo Alto
County_y                             Santa Clara
State_y                                       CA
Dif                                          773
Name: 7859, dtype: object
comparison_b_graph = comparison_b.sort_values(by=["Dif"],ascending=False)
comparison_b_graph = comparison_b_graph.head(20)
comparison_b_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
7859 2017-05-31 palo_altosanta_claraca 1574.0 2017 5 Palo Alto Santa Clara CA 2012-05-31 801.0 2012 5 Palo Alto Santa Clara CA 773.0
349 2017-05-31 athertonsan_mateoca 1746.0 2017 5 Atherton San Mateo CA 2012-05-31 977.0 2012 5 Atherton San Mateo CA 769.0
9980 2017-05-31 stinson_beachmarinca 1709.0 2017 5 Stinson Beach Marin CA 2012-05-31 1040.0 2012 5 Stinson Beach Marin CA 669.0
9919 2017-05-31 stanfordsanta_claraca 1286.0 2017 5 Stanford Santa Clara CA 2012-05-31 715.0 2012 5 Stanford Santa Clara CA 571.0
5896 2017-05-31 los_altossanta_claraca 1271.0 2017 5 Los Altos Santa Clara CA 2012-05-31 724.0 2012 5 Los Altos Santa Clara CA 547.0
5895 2017-05-31 los_altos_hillssanta_claraca 1336.0 2017 5 Los Altos Hills Santa Clara CA 2012-05-31 797.0 2012 5 Los Altos Hills Santa Clara CA 539.0
11973 2017-05-31 woodsidesan_mateoca 1375.0 2017 5 Woodside San Mateo CA 2012-05-31 838.0 2012 5 Woodside San Mateo CA 537.0
8386 2017-05-31 portola_valleysan_mateoca 1376.0 2017 5 Portola Valley San Mateo CA 2012-05-31 864.0 2012 5 Portola Valley San Mateo CA 512.0
6428 2017-05-31 menlo_parksan_mateoca 1175.0 2017 5 Menlo Park San Mateo CA 2012-05-31 681.0 2012 5 Menlo Park San Mateo CA 494.0
6920 2017-05-31 mountain_viewsanta_claraca 995.0 2017 5 Mountain View Santa Clara CA 2012-05-31 501.0 2012 5 Mountain View Santa Clara CA 494.0
4604 2017-05-31 hillsboroughsan_mateoca 1224.0 2017 5 Hillsborough San Mateo CA 2012-05-31 745.0 2012 5 Hillsborough San Mateo CA 479.0
10120 2017-05-31 sunnyvalesanta_claraca 940.0 2017 5 Sunnyvale Santa Clara CA 2012-05-31 466.0 2012 5 Sunnyvale Santa Clara CA 474.0
7338 2017-05-31 north_fair_oakssan_mateoca 821.0 2017 5 North Fair Oaks San Mateo CA 2012-05-31 354.0 2012 5 North Fair Oaks San Mateo CA 467.0
817 2017-05-31 beverly_hillslos_angelesca 1140.0 2017 5 Beverly Hills Los Angeles CA 2012-05-31 677.0 2012 5 Beverly Hills Los Angeles CA 463.0
2868 2017-05-31 east_palo_altosan_mateoca 689.0 2017 5 East Palo Alto San Mateo CA 2012-05-31 250.0 2012 5 East Palo Alto San Mateo CA 439.0
713 2017-05-31 belvederemarinca 1333.0 2017 5 Belvedere Marin CA 2012-05-31 896.0 2012 5 Belvedere Marin CA 437.0
6124 2017-05-31 manhattan_beachlos_angelesca 1039.0 2017 5 Manhattan Beach Los Angeles CA 2012-05-31 605.0 2012 5 Manhattan Beach Los Angeles CA 434.0
2364 2017-05-31 cupertinosanta_claraca 1009.0 2017 5 Cupertino Santa Clara CA 2012-05-31 576.0 2012 5 Cupertino Santa Clara CA 433.0
4582 2017-05-31 highlands_baywood_parksan_mateoca 963.0 2017 5 Highlands-Baywood Park San Mateo CA 2012-05-31 533.0 2012 5 Highlands-Baywood Park San Mateo CA 430.0
9040 2017-05-31 sag_harborsuffolkny 1029.0 2017 5 Sag Harbor Suffolk NY 2012-05-31 604.0 2012 5 Sag Harbor Suffolk NY 425.0
comparison_b_graph.to_csv("comparison_b_graph.csv",encoding='utf-8', index=False)
comparison_b2_graph = comparison_b.sort_values(by=["Dif"],ascending=True)
comparison_b2_graph = comparison_b2_graph.head(20)
comparison_b2_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
8903 2017-05-31 rooseveltduchesneut 100.0 2017 5 Roosevelt Duchesne UT 2012-05-31 160.0 2012 5 Roosevelt Duchesne UT -60.0
9792 2017-05-31 southportlincolnme 302.0 2017 5 Southport Lincoln ME 2012-05-31 331.0 2012 5 Southport Lincoln ME -29.0
11102 2017-05-31 versaillesdarkeoh 83.0 2017 5 Versailles Darke OH 2012-05-31 106.0 2012 5 Versailles Darke OH -23.0
4121 2017-05-31 greenvilledarkeoh 69.0 2017 5 Greenville Darke OH 2012-05-31 90.0 2012 5 Greenville Darke OH -21.0
2606 2017-05-31 dickinsonstarknd 172.0 2017 5 Dickinson Stark ND 2012-05-31 193.0 2012 5 Dickinson Stark ND -21.0
258 2017-05-31 arcanumdarkeoh 70.0 2017 5 Arcanum Darke OH 2012-05-31 90.0 2012 5 Arcanum Darke OH -20.0
11080 2017-05-31 ventnor_cityatlanticnj 164.0 2017 5 Ventnor City Atlantic NJ 2012-05-31 181.0 2012 5 Ventnor City Atlantic NJ -17.0
5747 2017-05-31 linwooddelawarepa 64.0 2017 5 Linwood Delaware PA 2012-05-31 79.0 2012 5 Linwood Delaware PA -15.0
3707 2017-05-31 galenajo_daviessil 90.0 2017 5 Galena Jo Daviess IL 2012-05-31 104.0 2012 5 Galena Jo Daviess IL -14.0
5090 2017-05-31 kankakeekankakeeil 58.0 2017 5 Kankakee Kankakee IL 2012-05-31 72.0 2012 5 Kankakee Kankakee IL -14.0
1027 2017-05-31 bradleykankakeeil 80.0 2017 5 Bradley Kankakee IL 2012-05-31 93.0 2012 5 Bradley Kankakee IL -13.0
910 2017-05-31 blue_earthfaribaultmn 91.0 2017 5 Blue Earth Faribault MN 2012-05-31 104.0 2012 5 Blue Earth Faribault MN -13.0
5973 2017-05-31 lumbertonlamarms 59.0 2017 5 Lumberton Lamar MS 2012-05-31 72.0 2012 5 Lumberton Lamar MS -13.0
8347 2017-05-31 port_isabelcamerontx 137.0 2017 5 Port Isabel Cameron TX 2012-05-31 150.0 2012 5 Port Isabel Cameron TX -13.0
201 2017-05-31 andrewsandrewstx 96.0 2017 5 Andrews Andrews TX 2012-05-31 109.0 2012 5 Andrews Andrews TX -13.0
9073 2017-05-31 saint_henrymerceroh 104.0 2017 5 Saint Henry Mercer OH 2012-05-31 116.0 2012 5 Saint Henry Mercer OH -12.0
11023 2017-05-31 vails_gateorangeny 133.0 2017 5 Vails Gate Orange NY 2012-05-31 145.0 2012 5 Vails Gate Orange NY -12.0
5745 2017-05-31 linwoodatlanticnj 121.0 2017 5 Linwood Atlantic NJ 2012-05-31 133.0 2012 5 Linwood Atlantic NJ -12.0
2191 2017-05-31 corollacurritucknc 185.0 2017 5 Corolla Currituck NC 2012-05-31 197.0 2012 5 Corolla Currituck NC -12.0
10811 2017-05-31 trainerdelawarepa 71.0 2017 5 Trainer Delaware PA 2012-05-31 82.0 2012 5 Trainer Delaware PA -11.0
comparison_b2_graph.to_csv("comparison_b2_graph.csv",encoding='utf-8', index=False)
sqft_2017c = sqft_2017
sqft_2007a = q3.loc[q3['Year']==2007]
sqft_2007 = sqft_2007a.loc[sqft_2007a['Month']==5]
comparison_c = pd.merge(sqft_2017c,sqft_2007, on='RegionName')
#comparison_c = comparison_c[["RegionName_x","ZHVIPerSqft_AllHomes_x","Year_x","City","State_x","ZHVIPerSqft_AllHomes_y","Year_y"]]
comparison_c['Dif'] = comparison_c["ZHVIPerSqft_AllHomes_x"] - comparison_c["ZHVIPerSqft_AllHomes_y"]
comparison_c.iloc[comparison_c["Dif"].idxmax()]
Date_x                       2017-05-31 00:00:00
RegionName                palo_altosanta_claraca
ZHVIPerSqft_AllHomes_x                      1574
Year_x                                      2017
Month_x                                        5
City_x                                 Palo Alto
County_x                             Santa Clara
State_x                                       CA
Date_y                       2007-05-31 00:00:00
ZHVIPerSqft_AllHomes_y                       753
Year_y                                      2007
Month_y                                        5
City_y                                 Palo Alto
County_y                             Santa Clara
State_y                                       CA
Dif                                          821
Name: 7262, dtype: object
comparison_c_graph = comparison_c.sort_values(by=["Dif"],ascending=False)
comparison_c_graph = comparison_c_graph.head(20)
comparison_c_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
7262 2017-05-31 palo_altosanta_claraca 1574.0 2017 5 Palo Alto Santa Clara CA 2007-05-31 753.0 2007 5 Palo Alto Santa Clara CA 821.0
320 2017-05-31 athertonsan_mateoca 1746.0 2017 5 Atherton San Mateo CA 2007-05-31 1058.0 2007 5 Atherton San Mateo CA 688.0
9242 2017-05-31 stinson_beachmarinca 1709.0 2017 5 Stinson Beach Marin CA 2007-05-31 1060.0 2007 5 Stinson Beach Marin CA 649.0
5440 2017-05-31 los_altossanta_claraca 1271.0 2017 5 Los Altos Santa Clara CA 2007-05-31 725.0 2007 5 Los Altos Santa Clara CA 546.0
9185 2017-05-31 stanfordsanta_claraca 1286.0 2017 5 Stanford Santa Clara CA 2007-05-31 751.0 2007 5 Stanford Santa Clara CA 535.0
5439 2017-05-31 los_altos_hillssanta_claraca 1336.0 2017 5 Los Altos Hills Santa Clara CA 2007-05-31 871.0 2007 5 Los Altos Hills Santa Clara CA 465.0
6386 2017-05-31 mountain_viewsanta_claraca 995.0 2017 5 Mountain View Santa Clara CA 2007-05-31 546.0 2007 5 Mountain View Santa Clara CA 449.0
5925 2017-05-31 menlo_parksan_mateoca 1175.0 2017 5 Menlo Park San Mateo CA 2007-05-31 728.0 2007 5 Menlo Park San Mateo CA 447.0
11098 2017-05-31 woodsidesan_mateoca 1375.0 2017 5 Woodside San Mateo CA 2007-05-31 958.0 2007 5 Woodside San Mateo CA 417.0
2169 2017-05-31 cupertinosanta_claraca 1009.0 2017 5 Cupertino Santa Clara CA 2007-05-31 594.0 2007 5 Cupertino Santa Clara CA 415.0
4218 2017-05-31 highlands_baywood_parksan_mateoca 963.0 2017 5 Highlands-Baywood Park San Mateo CA 2007-05-31 550.0 2007 5 Highlands-Baywood Park San Mateo CA 413.0
9370 2017-05-31 sunnyvalesanta_claraca 940.0 2017 5 Sunnyvale Santa Clara CA 2007-05-31 534.0 2007 5 Sunnyvale Santa Clara CA 406.0
1164 2017-05-31 burlingamesan_mateoca 1034.0 2017 5 Burlingame San Mateo CA 2007-05-31 640.0 2007 5 Burlingame San Mateo CA 394.0
7754 2017-05-31 portola_valleysan_mateoca 1376.0 2017 5 Portola Valley San Mateo CA 2007-05-31 987.0 2007 5 Portola Valley San Mateo CA 389.0
8359 2017-05-31 sag_harborsuffolkny 1029.0 2017 5 Sag Harbor Suffolk NY 2007-05-31 647.0 2007 5 Sag Harbor Suffolk NY 382.0
650 2017-05-31 belvederemarinca 1333.0 2017 5 Belvedere Marin CA 2007-05-31 958.0 2007 5 Belvedere Marin CA 375.0
642 2017-05-31 belmontsan_mateoca 907.0 2017 5 Belmont San Mateo CA 2007-05-31 537.0 2007 5 Belmont San Mateo CA 370.0
8564 2017-05-31 saratogasanta_claraca 1025.0 2017 5 Saratoga Santa Clara CA 2007-05-31 680.0 2007 5 Saratoga Santa Clara CA 345.0
741 2017-05-31 beverly_hillslos_angelesca 1140.0 2017 5 Beverly Hills Los Angeles CA 2007-05-31 797.0 2007 5 Beverly Hills Los Angeles CA 343.0
8471 2017-05-31 san_carlossan_mateoca 929.0 2017 5 San Carlos San Mateo CA 2007-05-31 594.0 2007 5 San Carlos San Mateo CA 335.0
comparison_c_graph.to_csv("comparison_c_graph.csv",encoding='utf-8', index=False)
sns.lineplot(x="State_x", y="ZHVIPerSqft_AllHomes_x",data=comparison_c_graph)
<matplotlib.axes._subplots.AxesSubplot at 0x1a357b9c50>

png

comparison_c2_graph = comparison_c.sort_values(by=["Dif"],ascending=True)
comparison_c2_graph = comparison_c2_graph.head(21)
comparison_c2_graph
Date_x RegionName ZHVIPerSqft_AllHomes_x Year_x Month_x City_x County_x State_x Date_y ZHVIPerSqft_AllHomes_y Year_y Month_y City_y County_y State_y Dif
4844 2017-05-31 kirkwoodalpineca 283.0 2017 5 Kirkwood Alpine CA 2007-05-31 528.0 2007 5 Kirkwood Alpine CA -245.0
11130 2017-05-31 wrightsville_beachnew_hanovernc 555.0 2017 5 Wrightsville Beach New Hanover NC 2007-05-31 758.0 2007 5 Wrightsville Beach New Hanover NC -203.0
4678 2017-05-31 jupiter_islandmartinfl 923.0 2017 5 Jupiter Island Martin FL 2007-05-31 1114.0 2007 5 Jupiter Island Martin FL -191.0
940 2017-05-31 bradleysan_luis_obispoca 246.0 2017 5 Bradley San Luis Obispo CA 2007-05-31 426.0 2007 5 Bradley San Luis Obispo CA -180.0
2741 2017-05-31 edisto_beachcolletonsc 221.0 2017 5 Edisto Beach Colleton SC 2007-05-31 391.0 2007 5 Edisto Beach Colleton SC -170.0
4491 2017-05-31 indian_beachcarteretnc 221.0 2017 5 Indian Beach Carteret NC 2007-05-31 389.0 2007 5 Indian Beach Carteret NC -168.0
326 2017-05-31 atlantic_beachcarteretnc 205.0 2017 5 Atlantic Beach Carteret NC 2007-05-31 371.0 2007 5 Atlantic Beach Carteret NC -166.0
5630 2017-05-31 mammoth_lakesmonoca 296.0 2017 5 Mammoth Lakes Mono CA 2007-05-31 447.0 2007 5 Mammoth Lakes Mono CA -151.0
10797 2017-05-31 westwoodplumasca 167.0 2017 5 Westwood Plumas CA 2007-05-31 316.0 2007 5 Westwood Plumas CA -149.0
6988 2017-05-31 ocean_isle_beachbrunswicknc 193.0 2017 5 Ocean Isle Beach Brunswick NC 2007-05-31 338.0 2007 5 Ocean Isle Beach Brunswick NC -145.0
7538 2017-05-31 pine_knoll_shorescarteretnc 196.0 2017 5 Pine Knoll Shores Carteret NC 2007-05-31 336.0 2007 5 Pine Knoll Shores Carteret NC -140.0
2018 2017-05-31 corollacurritucknc 185.0 2017 5 Corolla Currituck NC 2007-05-31 323.0 2007 5 Corolla Currituck NC -138.0
7134 2017-05-31 ortley_beachoceannj 285.0 2017 5 Ortley Beach Ocean NJ 2007-05-31 419.0 2007 5 Ortley Beach Ocean NJ -134.0
2099 2017-05-31 crescent_beachsaint_johnsfl 279.0 2017 5 Crescent Beach Saint Johns FL 2007-05-31 411.0 2007 5 Crescent Beach Saint Johns FL -132.0
1014 2017-05-31 brigantineatlanticnj 203.0 2017 5 Brigantine Atlantic NJ 2007-05-31 335.0 2007 5 Brigantine Atlantic NJ -132.0
8241 2017-05-31 roselandnelsonva 142.0 2017 5 Roseland Nelson VA 2007-05-31 265.0 2007 5 Roseland Nelson VA -123.0
3347 2017-05-31 freedomsanta_cruzca 356.0 2017 5 Freedom Santa Cruz CA 2007-05-31 478.0 2007 5 Freedom Santa Cruz CA -122.0
8657 2017-05-31 seasidemontereyca 393.0 2017 5 Seaside Monterey CA 2007-05-31 513.0 2007 5 Seaside Monterey CA -120.0
2869 2017-05-31 emerald_islecarteretnc 232.0 2017 5 Emerald Isle Carteret NC 2007-05-31 350.0 2007 5 Emerald Isle Carteret NC -118.0
5415 2017-05-31 longportatlanticnj 467.0 2017 5 Longport Atlantic NJ 2007-05-31 585.0 2007 5 Longport Atlantic NJ -118.0
1607 2017-05-31 chesterplumasca 155.0 2017 5 Chester Plumas CA 2007-05-31 272.0 2007 5 Chester Plumas CA -117.0
sns.scatterplot(y="Dif", x="City_y", data=comparison_c2_graph)
<matplotlib.axes._subplots.AxesSubplot at 0x1a3898f7b8>

png

sns.lineplot(x="State_x", y="ZHVIPerSqft_AllHomes_x",data=comparison_c2_graph)
<matplotlib.axes._subplots.AxesSubplot at 0x1a35799780>

png

comparison_c2_graph.to_csv("comparison_c2_graph.csv",encoding='utf-8', index=False)

QUESTION 4: For a city of choice, (most likely Ann Arbor) have prices for each size of home changed in similar amounts over the course of the datasets recordings or are there some sizes that are more affordable or less affordable within the city?

Calculate the mean increase for each type of home (1 bedroom, 2 bedroom, etc.) in Ann Arbor. Calculate the average price change for each of these types of homes year to year. Compare the average price change, is it similar or different? I think a histogram will be a good way to visualize this, but I could potentially create a story in Tableau which might walk the viewer through the information.

q4 = city_state_data
q4 = q4[["Date","Year","Month","RegionName","ZHVI_1bedroom","ZHVI_2bedroom","ZHVI_3bedroom","ZHVI_4bedroom","ZHVI_5BedroomOrMore","ZHVIPerSqft_AllHomes","ZHVI_BottomTier","ZHVI_MiddleTier","ZHVI_SingleFamilyResidence","ZHVI_TopTier","State","County","City"]]
q4 = q4.dropna()
q4.head()
Date Year Month RegionName ZHVI_1bedroom ZHVI_2bedroom ZHVI_3bedroom ZHVI_4bedroom ZHVI_5BedroomOrMore ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier State County City
10 1996-04-30 1996 4 abingtonmontgomerypa 47500.0 102200.0 122000.0 157600.0 219000.0 78.0 102000.0 128700.0 130900.0 189400.0 PA Montgomery Abington
18 1996-04-30 1996 4 actonmiddlesexma 52400.0 122200.0 212500.0 288100.0 325500.0 118.0 128200.0 230500.0 253900.0 330100.0 MA Middlesex Acton
21 1996-04-30 1996 4 acworthcobbga 62200.0 75500.0 100800.0 155900.0 201300.0 60.0 83900.0 112900.0 114800.0 174000.0 GA Cobb Acworth
44 1996-04-30 1996 4 agawamhampdenma 35800.0 82100.0 107400.0 114100.0 127300.0 71.0 68600.0 101500.0 106800.0 124700.0 MA Hampden Agawam
55 1996-04-30 1996 4 akronsummitoh 41600.0 49100.0 59500.0 70000.0 74900.0 49.0 42800.0 57100.0 56300.0 88000.0 OH Summit Akron
ann_arbor = q4.loc[q4["RegionName"]=="ann_arborwashtenawmi"]
ann_arbor.head()
Date Year Month RegionName ZHVI_1bedroom ZHVI_2bedroom ZHVI_3bedroom ZHVI_4bedroom ZHVI_5BedroomOrMore ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier State County City
209 1996-04-30 1996 4 ann_arborwashtenawmi 90300.0 107300.0 129000.0 188500.0 220200.0 95.0 106900.0 152000.0 157300.0 248200.0 MI Washtenaw Ann Arbor
11639 1996-05-31 1996 5 ann_arborwashtenawmi 87500.0 107500.0 130400.0 188600.0 219300.0 96.0 107900.0 153100.0 158500.0 249500.0 MI Washtenaw Ann Arbor
23211 1996-06-30 1996 6 ann_arborwashtenawmi 84100.0 107700.0 131500.0 188800.0 220700.0 97.0 108300.0 154200.0 159700.0 250900.0 MI Washtenaw Ann Arbor
34842 1996-07-31 1996 7 ann_arborwashtenawmi 81100.0 108600.0 132200.0 189100.0 222600.0 97.0 108900.0 154900.0 160300.0 251100.0 MI Washtenaw Ann Arbor
46479 1996-08-31 1996 8 ann_arborwashtenawmi 80700.0 109200.0 133200.0 189500.0 224800.0 97.0 109300.0 155300.0 160500.0 251500.0 MI Washtenaw Ann Arbor
ann_arbor2017 = ann_arbor.loc[ann_arbor["Year"]==2017]
ann_arbor2017["ZHVI_1bedroom"].median()
171350.0
ann_arbor2017["ZHVI_2bedroom"].median()
233000.0
ann_arbor2017["ZHVI_3bedroom"].median()
303150.0
ann_arbor2017["ZHVI_4bedroom"].median()
408050.0
ann_arbor2017["ZHVI_5BedroomOrMore"].median()
568850.0
ann_arbor2012 = ann_arbor.loc[ann_arbor["Year"]==2012]
ann_arbor2012["ZHVI_1bedroom"].median()
142800.0
ann_arbor2012["ZHVI_2bedroom"].median()
157750.0
ann_arbor2012["ZHVI_3bedroom"].median()
204450.0
ann_arbor2012["ZHVI_4bedroom"].median()
279500.0
ann_arbor2012["ZHVI_5BedroomOrMore"].median()
365800.0
ann_arbor2007 = ann_arbor.loc[ann_arbor["Year"]==2007]
ann_arbor2007["ZHVI_1bedroom"].median()
160400.0
ann_arbor2007["ZHVI_2bedroom"].median()
173050.0
ann_arbor2007["ZHVI_3bedroom"].median()
220000.0
ann_arbor2007["ZHVI_4bedroom"].median()
295500.0
ann_arbor2007["ZHVI_5BedroomOrMore"].median()
403150.0
ann_arbor2002 = ann_arbor.loc[ann_arbor["Year"]==2002]
ann_arbor2002["ZHVI_1bedroom"].median()
131750.0
ann_arbor2002["ZHVI_2bedroom"].median()
173650.0
ann_arbor2002["ZHVI_3bedroom"].median()
219950.0
ann_arbor2002["ZHVI_4bedroom"].median()
299700.0
ann_arbor2002["ZHVI_5BedroomOrMore"].median()
376850.0

QUESTION 5: (Extra) Determine which cities are likely to experience an increase in housing prices in the coming years.

(Extra) Use machine learning regression to determine which city is likely to have the largest price increase over the coming 1, 5 and 10 years. (Where does it make sense to invest?) Plot the results in a scatterplot with seaborn?

q5 = city_state_data
len(q5)
3762566
q5 = q5.loc[q5["State"]=='MI']
q5 = q5[["Date","Year","Month","RegionName","ZHVI_1bedroom","ZHVI_2bedroom","ZHVI_3bedroom","ZHVI_4bedroom","ZHVI_5BedroomOrMore","ZHVIPerSqft_AllHomes","ZHVI_BottomTier","ZHVI_MiddleTier","ZHVI_SingleFamilyResidence","ZHVI_TopTier","State","County","City"]]
q5 = q5.dropna()
q5.head()
Date Year Month RegionName ZHVI_1bedroom ZHVI_2bedroom ZHVI_3bedroom ZHVI_4bedroom ZHVI_5BedroomOrMore ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier State County City
209 1996-04-30 1996 4 ann_arborwashtenawmi 90300.0 107300.0 129000.0 188500.0 220200.0 95.0 106900.0 152000.0 157300.0 248200.0 MI Washtenaw Ann Arbor
844 1996-04-30 1996 4 bloomfield_hillsoaklandmi 61500.0 117800.0 215800.0 315100.0 528700.0 108.0 170000.0 270000.0 299500.0 541900.0 MI Oakland Bloomfield Hills
3183 1996-04-30 1996 4 ferndaleoaklandmi 60700.0 63000.0 75800.0 95700.0 95100.0 71.0 58600.0 71000.0 71000.0 101200.0 MI Oakland Ferndale
3787 1996-04-30 1996 4 grand_rapidskentmi 71700.0 63100.0 71500.0 78300.0 86200.0 50.0 45900.0 65800.0 64300.0 105600.0 MI Kent Grand Rapids
4188 1996-04-30 1996 4 hazel_parkoaklandmi 42000.0 43800.0 54600.0 64100.0 71100.0 53.0 42300.0 51500.0 51500.0 64500.0 MI Oakland Hazel Park
len(q5)
4470
q5_2017 = q5.loc[q5["Year"]==2017]
q5_2017 = q5_2017.loc[q5_2017["Month"]==5]
#q5_2017.head()
q5_2015 = q5.loc[q5["Year"]==2015]
q5_2015 = q5_2015.loc[q5_2015["Month"]==5]
#q5_2017.head()
q5_a = pd.merge(q5_2017,q5_2015, on='RegionName')
q5_a = q5_a[["RegionName","Year_x","ZHVIPerSqft_AllHomes_x","Year_y","ZHVIPerSqft_AllHomes_y"]]
q5_a['Dif'] = q5_a["ZHVIPerSqft_AllHomes_x"] - q5_a["ZHVIPerSqft_AllHomes_y"]
q5_a
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif
0 ann_arborwashtenawmi 2017 208.0 2015 184.0 24.0
1 bloomfield_hillsoaklandmi 2017 170.0 2015 161.0 9.0
2 brightonlivingstonmi 2017 143.0 2015 129.0 14.0
3 detroitwaynemi 2017 40.0 2015 34.0 6.0
4 farmington_hillsoaklandmi 2017 124.0 2015 113.0 11.0
5 ferndaleoaklandmi 2017 136.0 2015 108.0 28.0
6 flintgeneseemi 2017 40.0 2015 32.0 8.0
7 grand_rapidskentmi 2017 101.0 2015 79.0 22.0
8 hazel_parkoaklandmi 2017 67.0 2015 55.0 12.0
9 hollandottawami 2017 126.0 2015 107.0 19.0
10 livoniawaynemi 2017 134.0 2015 117.0 17.0
11 milford_townshipoaklandmi 2017 148.0 2015 134.0 14.0
12 muskegon_townshipmuskegonmi 2017 60.0 2015 51.0 9.0
13 norton_shoresmuskegonmi 2017 97.0 2015 82.0 15.0
14 rochester_hillsoaklandmi 2017 137.0 2015 126.0 11.0
15 saginawsaginawmi 2017 47.0 2015 43.0 4.0
16 saint_clair_shoresmacombmi 2017 116.0 2015 97.0 19.0
17 south_lyonoaklandmi 2017 145.0 2015 131.0 14.0
18 troyoaklandmi 2017 147.0 2015 133.0 14.0
19 waterfordoaklandmi 2017 113.0 2015 98.0 15.0
20 west_bloomfieldoaklandmi 2017 126.0 2015 116.0 10.0
len(q5_a)
21
q5_a['Label'] = np.where(q5_a['Dif']>=10 ,'yes','no')
q5_a
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label
0 ann_arborwashtenawmi 2017 208.0 2015 184.0 24.0 yes
1 bloomfield_hillsoaklandmi 2017 170.0 2015 161.0 9.0 no
2 brightonlivingstonmi 2017 143.0 2015 129.0 14.0 yes
3 detroitwaynemi 2017 40.0 2015 34.0 6.0 no
4 farmington_hillsoaklandmi 2017 124.0 2015 113.0 11.0 yes
5 ferndaleoaklandmi 2017 136.0 2015 108.0 28.0 yes
6 flintgeneseemi 2017 40.0 2015 32.0 8.0 no
7 grand_rapidskentmi 2017 101.0 2015 79.0 22.0 yes
8 hazel_parkoaklandmi 2017 67.0 2015 55.0 12.0 yes
9 hollandottawami 2017 126.0 2015 107.0 19.0 yes
10 livoniawaynemi 2017 134.0 2015 117.0 17.0 yes
11 milford_townshipoaklandmi 2017 148.0 2015 134.0 14.0 yes
12 muskegon_townshipmuskegonmi 2017 60.0 2015 51.0 9.0 no
13 norton_shoresmuskegonmi 2017 97.0 2015 82.0 15.0 yes
14 rochester_hillsoaklandmi 2017 137.0 2015 126.0 11.0 yes
15 saginawsaginawmi 2017 47.0 2015 43.0 4.0 no
16 saint_clair_shoresmacombmi 2017 116.0 2015 97.0 19.0 yes
17 south_lyonoaklandmi 2017 145.0 2015 131.0 14.0 yes
18 troyoaklandmi 2017 147.0 2015 133.0 14.0 yes
19 waterfordoaklandmi 2017 113.0 2015 98.0 15.0 yes
20 west_bloomfieldoaklandmi 2017 126.0 2015 116.0 10.0 yes
import numpy as np
import pandas as pd
import scipy as sp
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import sklearn.ensemble as skens
import sklearn.metrics as skmetric
import sklearn.naive_bayes as sknb
import sklearn.tree as sktree
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style='white', color_codes=True, font_scale=1.3)
import sklearn.externals.six as sksix
import IPython.display as ipd
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import os
from sklearn.metrics import accuracy_score
df_q5_train,df_q5_test = train_test_split(q5_a, test_size=0.3)
gnb_model = sknb.GaussianNB()
# given sepal length, predict if setosa
gnb_model.fit(df_q5_train[['ZHVIPerSqft_AllHomes_y']],df_q5_train['Label'])
GaussianNB(priors=None, var_smoothing=1e-09)
y_pred = gnb_model.predict(df_q5_test[['ZHVIPerSqft_AllHomes_y']])
df_q5_test['predicted_nb'] = y_pred
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df_q5_test
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb
9 hollandottawami 2017 126.0 2015 107.0 19.0 yes yes
19 waterfordoaklandmi 2017 113.0 2015 98.0 15.0 yes yes
20 west_bloomfieldoaklandmi 2017 126.0 2015 116.0 10.0 yes yes
3 detroitwaynemi 2017 40.0 2015 34.0 6.0 no no
16 saint_clair_shoresmacombmi 2017 116.0 2015 97.0 19.0 yes yes
12 muskegon_townshipmuskegonmi 2017 60.0 2015 51.0 9.0 no no
13 norton_shoresmuskegonmi 2017 97.0 2015 82.0 15.0 yes yes
accuracy_score(df_q5_test["Label"], df_q5_test["predicted_nb"])
1.0
y_pred = gnb_model.predict(df_q5_test[['ZHVIPerSqft_AllHomes_x']])
df_q5_test['predicted_nb2'] = y_pred
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
accuracy_score(df_q5_test["Label"], df_q5_test["predicted_nb2"])
0.8571428571428571
df_q5_test
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb predicted_nb2
9 hollandottawami 2017 126.0 2015 107.0 19.0 yes yes yes
19 waterfordoaklandmi 2017 113.0 2015 98.0 15.0 yes yes yes
20 west_bloomfieldoaklandmi 2017 126.0 2015 116.0 10.0 yes yes yes
3 detroitwaynemi 2017 40.0 2015 34.0 6.0 no no no
16 saint_clair_shoresmacombmi 2017 116.0 2015 97.0 19.0 yes yes yes
12 muskegon_townshipmuskegonmi 2017 60.0 2015 51.0 9.0 no no yes
13 norton_shoresmuskegonmi 2017 97.0 2015 82.0 15.0 yes yes yes

Whole Country

q5_b = city_state_data
len(q5_b)
3762566
q5_b2 = q5_b[["Date","Year","Month","RegionName","ZHVI_1bedroom","ZHVI_2bedroom","ZHVI_3bedroom","ZHVI_4bedroom","ZHVI_5BedroomOrMore","ZHVIPerSqft_AllHomes","ZHVI_BottomTier","ZHVI_MiddleTier","ZHVI_SingleFamilyResidence","ZHVI_TopTier","State","County","City"]]
q5_b2 = q5_b2.dropna()
q5_b2.head()
Date Year Month RegionName ZHVI_1bedroom ZHVI_2bedroom ZHVI_3bedroom ZHVI_4bedroom ZHVI_5BedroomOrMore ZHVIPerSqft_AllHomes ZHVI_BottomTier ZHVI_MiddleTier ZHVI_SingleFamilyResidence ZHVI_TopTier State County City
10 1996-04-30 1996 4 abingtonmontgomerypa 47500.0 102200.0 122000.0 157600.0 219000.0 78.0 102000.0 128700.0 130900.0 189400.0 PA Montgomery Abington
18 1996-04-30 1996 4 actonmiddlesexma 52400.0 122200.0 212500.0 288100.0 325500.0 118.0 128200.0 230500.0 253900.0 330100.0 MA Middlesex Acton
21 1996-04-30 1996 4 acworthcobbga 62200.0 75500.0 100800.0 155900.0 201300.0 60.0 83900.0 112900.0 114800.0 174000.0 GA Cobb Acworth
44 1996-04-30 1996 4 agawamhampdenma 35800.0 82100.0 107400.0 114100.0 127300.0 71.0 68600.0 101500.0 106800.0 124700.0 MA Hampden Agawam
55 1996-04-30 1996 4 akronsummitoh 41600.0 49100.0 59500.0 70000.0 74900.0 49.0 42800.0 57100.0 56300.0 88000.0 OH Summit Akron
q5_b2_2017 = q5_b2.loc[q5_b2["Year"]==2017]
q5_b2_2017 = q5_b2_2017.loc[q5_b2_2017["Month"]==5]
#q5_2017.head()
q5_b2_2015 = q5_b2.loc[q5_b2["Year"]==2015]
q5_b2_2015 = q5_b2_2015.loc[q5_b2_2015["Month"]==5]
q5_b2_a = pd.merge(q5_b2_2017,q5_b2_2015, on='RegionName')
q5_b2_a = q5_b2_a[["RegionName","Year_x","ZHVIPerSqft_AllHomes_x","Year_y","ZHVIPerSqft_AllHomes_y"]]
q5_b2_a['Dif'] = q5_b2_a["ZHVIPerSqft_AllHomes_x"] - q5_b2_a["ZHVIPerSqft_AllHomes_y"]
conditions = [
    (q5_b2_a['Dif'] >= 100 ) & (q5_b2_a['Dif'] <= 500),
    (q5_b2_a['Dif'] < 100 ) ,
    (q5_b2_a['Dif'] > 500 )]
choices = ['yes', 'no', 'no']
q5_b2_a['Label'] = np.select(conditions, choices, default='null')
q5_b2_a.head()
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label
0 abingdonwashingtonva 2017 94.0 2015 89.0 5.0 no
1 abingtonmontgomerypa 2017 159.0 2015 150.0 9.0 no
2 actonmiddlesexma 2017 267.0 2015 241.0 26.0 no
3 acworthcobbga 2017 107.0 2015 93.0 14.0 no
4 agawamhampdenma 2017 144.0 2015 136.0 8.0 no
df_q5B_train,df_q5B_test = train_test_split(q5_b2_a, test_size=0.3)
gnb_model = sknb.GaussianNB()
# given sepal length, predict if setosa
gnb_model.fit(df_q5B_train[['ZHVIPerSqft_AllHomes_y']],df_q5B_train['Label'])
GaussianNB(priors=None, var_smoothing=1e-09)
y_pred = gnb_model.predict(df_q5B_test[['ZHVIPerSqft_AllHomes_y']])
df_q5B_test['predicted_nb'] = y_pred
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df_q5B_test
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb
778 sacoyorkme 2017 157.0 2015 144.0 13.0 no no
114 bristolhartfordct 2017 128.0 2015 126.0 2.0 no no
874 stamfordfairfieldct 2017 249.0 2015 251.0 -2.0 no no
75 bethlehemnorthamptonpa 2017 106.0 2015 100.0 6.0 no no
568 monroefairfieldct 2017 173.0 2015 175.0 -2.0 no no
141 cantonhaywoodnc 2017 112.0 2015 96.0 16.0 no no
2 actonmiddlesexma 2017 267.0 2015 241.0 26.0 no no
362 haikumauihi 2017 485.0 2015 443.0 42.0 no no
218 daly_citysan_mateoca 2017 693.0 2015 587.0 106.0 yes yes
546 middletown_townshipbuckspa 2017 169.0 2015 162.0 7.0 no no
207 cranstonprovidenceri 2017 166.0 2015 144.0 22.0 no no
685 palmettomanateefl 2017 128.0 2015 106.0 22.0 no no
144 carmelputnamny 2017 213.0 2015 201.0 12.0 no no
186 colorado_springsel_pasoco 2017 136.0 2015 116.0 20.0 no no
326 garden_groveorangeca 2017 418.0 2015 360.0 58.0 no no
152 celebrationosceolafl 2017 204.0 2015 192.0 12.0 no no
355 greergreenvillesc 2017 99.0 2015 88.0 11.0 no no
478 little_fallsmorrisonmn 2017 68.0 2015 60.0 8.0 no no
403 inmanspartanburgsc 2017 92.0 2015 81.0 11.0 no no
876 statesvilleiredellnc 2017 85.0 2015 74.0 11.0 no no
507 madisonnew_havenct 2017 194.0 2015 189.0 5.0 no no
699 pembroke_pinesbrowardfl 2017 168.0 2015 143.0 25.0 no no
562 minneapolishennepinmn 2017 210.0 2015 181.0 29.0 no no
551 milford_townshipoaklandmi 2017 148.0 2015 134.0 14.0 no no
17 altamonte_springsseminolefl 2017 117.0 2015 91.0 26.0 no no
189 columbiaboonemo 2017 93.0 2015 87.0 6.0 no no
194 concordcabarrusnc 2017 104.0 2015 92.0 12.0 no no
149 castro_valleyalamedaca 2017 467.0 2015 401.0 66.0 no no
124 buhltwin_fallsid 2017 84.0 2015 77.0 7.0 no no
96 bonita_springsleefl 2017 198.0 2015 181.0 17.0 no no
... ... ... ... ... ... ... ... ...
53 avonhartfordct 2017 157.0 2015 157.0 0.0 no no
487 long_beachlos_angelesca 2017 419.0 2015 365.0 54.0 no no
411 jasperduboisin 2017 92.0 2015 86.0 6.0 no no
821 schaumburgcookil 2017 171.0 2015 159.0 12.0 no no
345 grants_passjosephineor 2017 162.0 2015 133.0 29.0 no no
632 northampton_townshipbuckspa 2017 174.0 2015 171.0 3.0 no no
738 prospectjeffersonky 2017 147.0 2015 136.0 11.0 no no
319 fremontdodgene 2017 108.0 2015 97.0 11.0 no no
210 crestviewokaloosafl 2017 91.0 2015 84.0 7.0 no no
809 sangerfresnoca 2017 138.0 2015 117.0 21.0 no no
350 greensboroguilfordnc 2017 87.0 2015 80.0 7.0 no no
1019 westboroughworcesterma 2017 223.0 2015 201.0 22.0 no no
302 findlayhancockoh 2017 85.0 2015 80.0 5.0 no no
510 malvernchesterpa 2017 192.0 2015 183.0 9.0 no no
1050 woodburnmarionor 2017 168.0 2015 126.0 42.0 no no
978 vestavia_hillsjeffersonal 2017 151.0 2015 138.0 13.0 no no
406 island_parkfremontid 2017 210.0 2015 192.0 18.0 no no
588 napervilledu_pageil 2017 162.0 2015 158.0 4.0 no no
825 seasidemontereyca 2017 393.0 2015 325.0 68.0 no no
78 biddefordyorkme 2017 163.0 2015 149.0 14.0 no no
70 bellinghamwhatcomwa 2017 215.0 2015 179.0 36.0 no no
300 fernandina_beachnassaufl 2017 172.0 2015 151.0 21.0 no no
785 saint_georgewashingtonut 2017 127.0 2015 114.0 13.0 no no
964 twin_fallstwin_fallsid 2017 94.0 2015 85.0 9.0 no no
58 baltimorebaltimore_citymd 2017 107.0 2015 99.0 8.0 no no
306 fontanasan_bernardinoca 2017 219.0 2015 192.0 27.0 no no
252 east_greenwichkentri 2017 192.0 2015 166.0 26.0 no no
192 columbusfranklinoh 2017 93.0 2015 78.0 15.0 no no
201 coos_baycoosor 2017 134.0 2015 118.0 16.0 no no
1057 worcesterworcesterma 2017 150.0 2015 135.0 15.0 no no

321 rows × 8 columns

accuracy_score(df_q5B_test["Label"], df_q5B_test["predicted_nb"])
0.9844236760124611
# utility class to compare the predictions versus ground truth
def comparePlot(input_frame,real_column,predicted_column):
    df_a = input_frame.copy()
    df_b = input_frame.copy()
    df_a['label_source'] = 'ZHVIPerSqft_AllHomes_y'
    df_b['label_source'] = 'ZHVIPerSqft_AllHomes_x'
    df_a['label'] = df_a[real_column]
    df_b['label'] = df_b[predicted_column].apply(lambda x: 'Predict %s'%x)
    df_c = pd.concat((df_a, df_b), axis=0, ignore_index=True)
    sns.lmplot(x='ZHVIPerSqft_AllHomes_y', y='ZHVIPerSqft_AllHomes_x', col='label_source',
               hue='label', data=df_c, fit_reg=False, size=4);
comparePlot(df_q5B_test,"ZHVIPerSqft_AllHomes_y","predicted_nb")
/anaconda3/lib/python3.6/site-packages/seaborn/regression.py:546: UserWarning: The `size` paramter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)

png

y_pred = gnb_model.predict(df_q5B_test[['ZHVIPerSqft_AllHomes_x']])
df_q5B_test['predicted_nb2'] = y_pred
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df_q5B_test
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb predicted_nb2
778 sacoyorkme 2017 157.0 2015 144.0 13.0 no no no
114 bristolhartfordct 2017 128.0 2015 126.0 2.0 no no no
874 stamfordfairfieldct 2017 249.0 2015 251.0 -2.0 no no no
75 bethlehemnorthamptonpa 2017 106.0 2015 100.0 6.0 no no no
568 monroefairfieldct 2017 173.0 2015 175.0 -2.0 no no no
141 cantonhaywoodnc 2017 112.0 2015 96.0 16.0 no no no
2 actonmiddlesexma 2017 267.0 2015 241.0 26.0 no no no
362 haikumauihi 2017 485.0 2015 443.0 42.0 no no no
218 daly_citysan_mateoca 2017 693.0 2015 587.0 106.0 yes yes yes
546 middletown_townshipbuckspa 2017 169.0 2015 162.0 7.0 no no no
207 cranstonprovidenceri 2017 166.0 2015 144.0 22.0 no no no
685 palmettomanateefl 2017 128.0 2015 106.0 22.0 no no no
144 carmelputnamny 2017 213.0 2015 201.0 12.0 no no no
186 colorado_springsel_pasoco 2017 136.0 2015 116.0 20.0 no no no
326 garden_groveorangeca 2017 418.0 2015 360.0 58.0 no no no
152 celebrationosceolafl 2017 204.0 2015 192.0 12.0 no no no
355 greergreenvillesc 2017 99.0 2015 88.0 11.0 no no no
478 little_fallsmorrisonmn 2017 68.0 2015 60.0 8.0 no no no
403 inmanspartanburgsc 2017 92.0 2015 81.0 11.0 no no no
876 statesvilleiredellnc 2017 85.0 2015 74.0 11.0 no no no
507 madisonnew_havenct 2017 194.0 2015 189.0 5.0 no no no
699 pembroke_pinesbrowardfl 2017 168.0 2015 143.0 25.0 no no no
562 minneapolishennepinmn 2017 210.0 2015 181.0 29.0 no no no
551 milford_townshipoaklandmi 2017 148.0 2015 134.0 14.0 no no no
17 altamonte_springsseminolefl 2017 117.0 2015 91.0 26.0 no no no
189 columbiaboonemo 2017 93.0 2015 87.0 6.0 no no no
194 concordcabarrusnc 2017 104.0 2015 92.0 12.0 no no no
149 castro_valleyalamedaca 2017 467.0 2015 401.0 66.0 no no no
124 buhltwin_fallsid 2017 84.0 2015 77.0 7.0 no no no
96 bonita_springsleefl 2017 198.0 2015 181.0 17.0 no no no
... ... ... ... ... ... ... ... ... ...
53 avonhartfordct 2017 157.0 2015 157.0 0.0 no no no
487 long_beachlos_angelesca 2017 419.0 2015 365.0 54.0 no no no
411 jasperduboisin 2017 92.0 2015 86.0 6.0 no no no
821 schaumburgcookil 2017 171.0 2015 159.0 12.0 no no no
345 grants_passjosephineor 2017 162.0 2015 133.0 29.0 no no no
632 northampton_townshipbuckspa 2017 174.0 2015 171.0 3.0 no no no
738 prospectjeffersonky 2017 147.0 2015 136.0 11.0 no no no
319 fremontdodgene 2017 108.0 2015 97.0 11.0 no no no
210 crestviewokaloosafl 2017 91.0 2015 84.0 7.0 no no no
809 sangerfresnoca 2017 138.0 2015 117.0 21.0 no no no
350 greensboroguilfordnc 2017 87.0 2015 80.0 7.0 no no no
1019 westboroughworcesterma 2017 223.0 2015 201.0 22.0 no no no
302 findlayhancockoh 2017 85.0 2015 80.0 5.0 no no no
510 malvernchesterpa 2017 192.0 2015 183.0 9.0 no no no
1050 woodburnmarionor 2017 168.0 2015 126.0 42.0 no no no
978 vestavia_hillsjeffersonal 2017 151.0 2015 138.0 13.0 no no no
406 island_parkfremontid 2017 210.0 2015 192.0 18.0 no no no
588 napervilledu_pageil 2017 162.0 2015 158.0 4.0 no no no
825 seasidemontereyca 2017 393.0 2015 325.0 68.0 no no no
78 biddefordyorkme 2017 163.0 2015 149.0 14.0 no no no
70 bellinghamwhatcomwa 2017 215.0 2015 179.0 36.0 no no no
300 fernandina_beachnassaufl 2017 172.0 2015 151.0 21.0 no no no
785 saint_georgewashingtonut 2017 127.0 2015 114.0 13.0 no no no
964 twin_fallstwin_fallsid 2017 94.0 2015 85.0 9.0 no no no
58 baltimorebaltimore_citymd 2017 107.0 2015 99.0 8.0 no no no
306 fontanasan_bernardinoca 2017 219.0 2015 192.0 27.0 no no no
252 east_greenwichkentri 2017 192.0 2015 166.0 26.0 no no no
192 columbusfranklinoh 2017 93.0 2015 78.0 15.0 no no no
201 coos_baycoosor 2017 134.0 2015 118.0 16.0 no no no
1057 worcesterworcesterma 2017 150.0 2015 135.0 15.0 no no no

321 rows × 9 columns

invest = df_q5B_test.loc[df_q5B_test["predicted_nb2"]=="yes"]
invest.head()
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb predicted_nb2
218 daly_citysan_mateoca 2017 693.0 2015 587.0 106.0 yes yes yes
795 san_anselmomarinca 2017 661.0 2015 569.0 92.0 no yes yes
757 redondo_beachlos_angelesca 2017 579.0 2015 494.0 85.0 no no yes
417 kailuahonoluluhi 2017 604.0 2015 537.0 67.0 no no yes
509 malibulos_angelesca 2017 1208.0 2015 1087.0 121.0 yes yes yes

final_data = invest.sort_values(by='ZHVIPerSqft_AllHomes_x')
final_data.to_csv("Output.csv",encoding='utf-8', index=False)
invest.head()
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb predicted_nb2
218 daly_citysan_mateoca 2017 693.0 2015 587.0 106.0 yes yes yes
795 san_anselmomarinca 2017 661.0 2015 569.0 92.0 no yes yes
757 redondo_beachlos_angelesca 2017 579.0 2015 494.0 85.0 no no yes
417 kailuahonoluluhi 2017 604.0 2015 537.0 67.0 no no yes
509 malibulos_angelesca 2017 1208.0 2015 1087.0 121.0 yes yes yes
final_data.head()
RegionName Year_x ZHVIPerSqft_AllHomes_x Year_y ZHVIPerSqft_AllHomes_y Dif Label predicted_nb predicted_nb2
757 redondo_beachlos_angelesca 2017 579.0 2015 494.0 85.0 no no yes
417 kailuahonoluluhi 2017 604.0 2015 537.0 67.0 no no yes
392 honoluluhonoluluhi 2017 606.0 2015 556.0 50.0 no no yes
845 solana_beachsan_diegoca 2017 631.0 2015 578.0 53.0 no yes yes
795 san_anselmomarinca 2017 661.0 2015 569.0 92.0 no yes yes

Sample Data

sample_data = city_state_data.head(1000)
sample_data.to_csv("sample_data.csv",encoding='utf-8', index=False)


Tags:

Updated: