Unsupervised anomaly detection (fraud) algorithm

This project is a good example of a forensic-type analysis, looking for anomalies in a dataset. We first do some data cleaning (exclusions, imputation; we don't remove outliers, since outliers are exactly what we're looking for), then build variables designed to surface the kinds of anomalies we're interested in, in this case unusual property valuations.

After we build the variables we have many correlated columns and too many dimensions, so we need to remove the correlations and reduce the dimensionality. Since we don't have a dependent variable, the easiest useful thing to do is PCA. We z-scale (always z-scale before a PCA), run the PCA, keep the top few PCs, then z-scale again to make each retained PC equally important (an optional step; only do this if you keep just a few PCs).
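
The pipeline above can be sketched with scikit-learn (a minimal illustration, not the project's actual code; `pca_reduce` and the data sizes are made up for the example):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_reduce(X, n_keep=5):
    """z-scale, run PCA, keep the top PCs, then z-scale the kept PCs again."""
    Xz = StandardScaler().fit_transform(X)       # always z-scale before a PCA
    pcs = PCA(n_components=n_keep).fit_transform(Xz)
    # the second z-scaling makes each retained PC equally important
    return StandardScaler().fit_transform(pcs)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # stand-in for the built variables
Z = pca_reduce(X, n_keep=5)                      # one row per record, one column per kept PC
```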

We use two different anomaly detection (fraud) algorithms. The first looks for outliers in the final scaled PC space using a Minkowski distance from the origin. The second builds a simple autoencoder, and its fraud score is the reconstruction error. Note that either of these two methods would make a fine fraud score by itself.
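
Both scores can be sketched as follows (illustrative only; the names, the stand-in data, and the use of scikit-learn's `MLPRegressor` as a minimal autoencoder are my assumptions, not the project's actual implementation):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def minkowski_score(Z, p=2):
    """Score 1: Minkowski-p distance of each record from the origin in PC space."""
    return (np.abs(Z) ** p).sum(axis=1) ** (1.0 / p)

def autoencoder_score(Z, hidden=3, seed=0):
    """Score 2: train a small autoencoder on the PC space and use the
    per-record reconstruction error as the fraud score."""
    ae = MLPRegressor(hidden_layer_sizes=(hidden,), max_iter=500, random_state=seed)
    ae.fit(Z, Z)                        # learn to reproduce the inputs
    err = Z - ae.predict(Z)
    return (err ** 2).sum(axis=1)       # squared reconstruction error

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 5))
Z[0] += 10                              # plant one obvious outlier
s1, s2 = minkowski_score(Z), autoencoder_score(Z)
```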

Since we have two scores and we don't really know which one is better, we just average them. To do this we replace each score with its rank order and then average the rank-ordered scores to get our final score.
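
A toy sketch of the rank averaging (the score values here are made-up numbers just to show the mechanics):

```python
import pandas as pd

# hypothetical scores on very different scales, as the two methods would produce
scores = pd.DataFrame({'score1': [0.2, 5.0, 1.1, 0.9],
                       'score2': [10.0, 900.0, 55.0, 70.0]})

# replace each score with its rank order (1 = lowest), then average the ranks
ranks = scores.rank()
scores['final'] = ranks.mean(axis=1)
print(scores.sort_values('final', ascending=False))
```

Rank averaging makes the combination insensitive to each score's scale and heavy tails, which is why it is safer than averaging the raw scores.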

Finally we sort all the records by this final score and explore the top n records. To help the investigation we show which variables are driving these top-scoring records with a heat map of the variable z-scores, which can point the investigators to what is making each of these records score high.
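
A minimal version of that heat map, on synthetic data (the variable names, the stand-in score, and the figure details are illustrative assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')                    # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)), columns=['r1', 'r2', 'r3', 'r4'])
final_score = X.abs().sum(axis=1)        # stand-in for the combined rank score

top_n = 10
top = X.loc[final_score.nlargest(top_n).index]
zscores = (top - X.mean()) / X.std()     # z-score each variable vs. the full population

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(zscores, cmap='coolwarm', aspect='auto')
ax.set_xticks(range(len(X.columns)), X.columns)
ax.set_ylabel('top-scoring records')
fig.colorbar(im, label='variable z-score')
fig.savefig('top_records_heatmap.png')
```

A hot cell in row i, column j says variable j is what pushed record i's score up, which is exactly the pointer an investigator needs.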

This is an invented problem designed to demonstrate the process of building unsupervised fraud models. The data set is real and the invented problem is realistic. What's lacking most is the ability to interact with domain experts in order to do proper exclusions and design good, appropriate variables.

The data can be found here: https://data.cityofnewyork.us/Housing-Development/Property-Valuation-and-Assessment-Data/rgy2-tti8


data.head()
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... BLDFRONT BLDDEPTH AVLAND2 AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE
0 1 1000010101 1 1 101 NaN U S GOVT LAND & BLDGS P7 4 500 ... 0 0 3775500.0 8613000.0 3775500.0 8613000.0 NaN FINAL 2010/11 AC-TR
1 2 1000010201 1 1 201 NaN U S GOVT LAND & BLDGS Z9 4 27 ... 0 0 11111400.0 80690400.0 11111400.0 80690400.0 NaN FINAL 2010/11 AC-TR
2 3 1000020001 1 2 1 NaN DEPT OF GENERAL SERVI Y7 4 709 ... 709 564 32321790.0 40179510.0 32321790.0 40179510.0 NaN FINAL 2010/11 AC-TR
3 4 1000020023 1 2 23 NaN DEPARTMENT OF BUSINES T2 4 793 ... 85 551 13644000.0 15750000.0 13644000.0 15750000.0 NaN FINAL 2010/11 AC-TR
4 5 1000030001 1 3 1 NaN PARKS AND RECREATION Q1 4 323 ... 89 57 106348680.0 107758350.0 106348680.0 107758350.0 NaN FINAL 2010/11 AC-TR

5 rows × 32 columns

Remove some properties that we aren’t interested in

numrecords_orig = len(data)
numrecords = numrecords_orig
numrecords
1070994
# remove the records with a government easement type
data = data[data["EASEMENT"] != "U"].reset_index(drop=True)
numremoved = numrecords - len(data)
print('# records removed:', numremoved)
# records removed: 1
# keywords in the owner name that suggest a government agency or a cemetery
gov_list = ['DEPT ', 'DEPARTMENT', 'UNITED STATES','GOVERNMENT',' GOVT ', 'CEMETERY']

owner1 = list(set(data['OWNER'].to_list()))
owner = [item for item in owner1 if str(item) != 'nan']  # remove any NaNs

remove_list = []
print("Total owner number before removing is ", len(owner))

for i in owner:
    for g in gov_list:
        if g in i and 'STORES' not in i:
            remove_list.append(i)
Total owner number before removing is  863347
# Look at the most frequent owners. This might show some other properties we aren't interested in.
remove_list2 = data['OWNER'].value_counts().head(20).index.tolist()
remove_list2
['PARKCHESTER PRESERVAT',
 'PARKS AND RECREATION',
 'DCAS',
 'HOUSING PRESERVATION',
 'CITY OF NEW YORK',
 'DEPT OF ENVIRONMENTAL',
 'BOARD OF EDUCATION',
 'NEW YORK CITY HOUSING',
 'CNY/NYCTA',
 'NYC HOUSING PARTNERSH',
 'YORKVILLE TOWERS ASSO',
 'DEPARTMENT OF BUSINES',
 'DEPT OF TRANSPORTATIO',
 'MTA/LIRR',
 'PARCKHESTER PRESERVAT',
 'MH RESIDENTIAL 1, LLC',
 '434 M LLC',
 'LINCOLN PLAZA ASSOCIA',
 'DEUTSCHE BANK NATIONA',
 '561 11TH AVENUE TMG L']
# add some others to also be removed
remove_list2.append('THE CITY OF NEW YORK')
remove_list2.append('NYS URBAN DEVELOPMENT')
remove_list2.append('CULTURAL AFFAIRS')
remove_list2.append('NY STATE PUBLIC WORKS')
remove_list2.append("NYC DEP'T OF HIGHWAYS")
remove_list2.append('CITY WIDE ADMINISTRAT')
remove_list2.append('NEW YORK CITY')
remove_list2.append('THE PORT OFNY & NJ')
remove_list2.append('NEW YORK STATE DEPART')
remove_list2.append('CITY AND NON-CITY OWN')
remove_list2.append('SANITATION')
remove_list2.append('NYS DOT')
remove_list2.append('NEW YORK CITY TRANSIT')
remove_list2.append('PORT AUTHORITY OF NY')
remove_list2.append('NEW YORK STATE OWNED')
remove_list2.append('NYC PARK DEPT')
remove_list2.append('PORT OF NEW YORK AUTH')
remove_list2.append('NYC PARK DEPT')
remove_list2.append('LIRR')
remove_list2.append('NY STATE PUBLIC SERV')
remove_list2.append('STATE OF NEW YORK')
remove_list2.append('NYC HIGHWAY DEPT')
for i in remove_list2:
    if i not in remove_list:
        remove_list.append(i)
    else:
        print(i)
DEPT OF ENVIRONMENTAL
DEPARTMENT OF BUSINES
DEPT OF TRANSPORTATIO
NYC PARK DEPT
# remove some owners from the remove list: these look like legitimate private owners
remove_list.remove('YORKVILLE TOWERS ASSO')
remove_list.remove('434 M LLC')
remove_list.remove('DEUTSCHE BANK NATIONA')
remove_list.remove('561 11TH AVENUE TMG L')
remove_list.remove('MH RESIDENTIAL 1, LLC')
len(remove_list)
263
numrecords = len(data)
removed = data[data['OWNER'].isin(remove_list)].reset_index(drop=True)
data = data[~data['OWNER'].isin(remove_list)].reset_index(drop=True)
numremoved = numrecords - len(data)
print('# records removed:', numremoved)
# records removed: 26500
removed.shape
(26500, 32)
# any on this list that we shouldn't remove?
# plt.rcParams.update({'figure.figsize':(6,14)})
plt.figure(figsize=(6,14))
plt.xscale('log')
removed['OWNER'].value_counts().head(50).sort_values().plot(kind='barh')

data.shape
(1044493, 32)
# this is how many records we removed
numrecords_orig - len(data)
26501
data.head(10)
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... BLDFRONT BLDDEPTH AVLAND2 AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE
0 9 1000041001 1 4 1001 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 636093.0 2049290.0 NaN NaN NaN FINAL 2010/11 AC-TR
1 10 1000041002 1 4 1002 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 919276.0 2961617.0 NaN NaN NaN FINAL 2010/11 AC-TR
2 11 1000041003 1 4 1003 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 967500.0 5483912.0 NaN NaN NaN FINAL 2010/11 AC-TR
3 12 1000041004 1 4 1004 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 163174.0 525692.0 NaN NaN NaN FINAL 2010/11 AC-TR
4 13 1000041005 1 4 1005 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 373783.0 1204211.0 NaN NaN NaN FINAL 2010/11 AC-TR
5 14 1000041006 1 4 1006 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 353383.0 1138493.0 NaN NaN NaN FINAL 2010/11 AC-TR
6 15 1000041007 1 4 1007 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 1246572.0 4016063.0 NaN NaN NaN FINAL 2010/11 AC-TR
7 16 1000041008 1 4 1008 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 1213369.0 3909089.0 NaN NaN NaN FINAL 2010/11 AC-TR
8 17 1000041009 1 4 1009 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 1213369.0 3909089.0 NaN NaN NaN FINAL 2010/11 AC-TR
9 18 1000041010 1 4 1010 NaN TRZ HOLDINGS, LLC R5 4 0 ... 0 0 1213369.0 3909089.0 NaN NaN NaN FINAL 2010/11 AC-TR

10 rows × 32 columns

Fill in missing ZIP

# How many ZIPs are missing? Find the indices of the null-ZIP records and count them.
missing_zips = np.where(pd.isnull(data['ZIP']))[0]
num_missing_zips_orig = len(missing_zips)
num_missing_zips_orig
20431
sum(data['BORO'].isna())
0
sum(data['STADDR'].isna())
364
# concatenate the 'staddr' and 'boro' columns into a new 'staddr_boro' column 
data['staddr_boro'] = data[data['STADDR'].notnull()]['STADDR'] + '_' + data[data['BORO'].notnull()]['BORO'].astype(str)
data['staddr_boro']
0              1 WATER STREET_1
1              1 WATER STREET_1
2              1 WATER STREET_1
3              1 WATER STREET_1
4              1 WATER STREET_1
                   ...         
1044488    142 BENTLEY STREET_5
1044489    146 BENTLEY STREET_5
1044490    150 BENTLEY STREET_5
1044491    156 BENTLEY STREET_5
1044492    162 BENTLEY STREET_5
Name: staddr_boro, Length: 1044493, dtype: object
staddr_boro_zip = {}
for index, staddrboro in data['staddr_boro'].items():
    if staddrboro not in staddr_boro_zip:
        staddr_boro_zip[staddrboro] = data.loc[index, 'ZIP']

# fill in missing ZIPs by mapping from street address + borough
data['ZIP'] = data['ZIP'].fillna(data['staddr_boro'].map(staddr_boro_zip))
# how many missing zips did we fill in with this last step?
num_filled_in = num_missing_zips_orig - len(np.where(pd.isnull(data['ZIP']))[0])
num_filled_in
2832
# How many are still left to fill in?
missing_zips = np.where(pd.isnull(data['ZIP']))[0]
len(missing_zips)
17599
# Assume the data is already sorted by zip. If a zip is missing, 
# and the before and after zips are the same, fill in the zip with that value
# guard both ends of the frame so i-1 and i+1 stay in range
for i in missing_zips:
    if 0 < i < len(data) - 1 and data.loc[i + 1, 'ZIP'] == data.loc[i - 1, 'ZIP']:
        data.loc[i, 'ZIP'] = data.loc[i - 1, 'ZIP']
# how many missing zips did we fill in with this last step?
num_filled_in = len(missing_zips) - len(np.where(pd.isnull(data['ZIP']))[0])
num_filled_in
9491
# How many are still left to fill in?
missing_zips = np.where(pd.isnull(data['ZIP']))[0]
len(missing_zips)
8108
# For the remaining missing ZIPs, fill each from the previous record's ZIP.
# A forward fill does this in one vectorized step (the data is sorted by location).
data['ZIP'] = data['ZIP'].ffill()
missing_zips = np.where(pd.isnull(data['ZIP']))[0]
len(missing_zips)
0
data = data.drop('staddr_boro', axis=1)

FULLVAL, AVLAND, AVTOT

FULLVAL

len(data[data['FULLVAL']==0])
10025
data['FULLVAL'].isnull().sum()
0
data['FULLVAL'] = data['FULLVAL'].replace(0, np.nan)  # avoid inplace on a column view
data['FULLVAL'].isnull().sum()
10025
data["FULLVAL"] = data.\
                        groupby(['TAXCLASS','BORO','BLDGCL'])['FULLVAL'].transform(lambda x: x.fillna(x.mean()))
data['FULLVAL'].isnull().sum()
7307
data["FULLVAL"] = data.\
                        groupby(['TAXCLASS','BORO'])['FULLVAL'].transform(lambda x: x.fillna(x.mean()))
data['FULLVAL'].isnull().sum()
386
data["FULLVAL"] = data.\
                        groupby(['TAXCLASS'])['FULLVAL'].transform(lambda x: x.fillna(x.mean()))
data['FULLVAL'].isnull().sum()
0

AVLAND

len(data[data['AVLAND']==0])
10027
data['AVLAND'].isnull().sum()
0
data['AVLAND'] = data['AVLAND'].replace(0, np.nan)  # avoid inplace on a column view
data['AVLAND'].isnull().sum()
10027
data["AVLAND"] = data.\
                        groupby(['TAXCLASS','BORO','BLDGCL'])['AVLAND'].transform(lambda x: x.fillna(x.mean()))
data['AVLAND'].isnull().sum()
7307
data["AVLAND"] = data.\
                        groupby(['TAXCLASS','BORO'])['AVLAND'].transform(lambda x: x.fillna(x.mean()))
data['AVLAND'].isnull().sum()
386
data["AVLAND"] = data.\
                        groupby(['TAXCLASS'])['AVLAND'].transform(lambda x: x.fillna(x.mean()))
data['AVLAND'].isnull().sum()
0

AVTOT

len(data[data['AVTOT']==0])
10025
data['AVTOT'].isnull().sum()
0
data['AVTOT'] = data['AVTOT'].replace(0, np.nan)  # avoid inplace on a column view
data['AVTOT'].isnull().sum()
10025
data["AVTOT"] = data.\
                        groupby(['TAXCLASS','BORO','BLDGCL'])['AVTOT'].transform(lambda x: x.fillna(x.mean()))
data['AVTOT'].isnull().sum()
7307
data["AVTOT"] = data.\
                        groupby(['TAXCLASS','BORO'])['AVTOT'].transform(lambda x: x.fillna(x.mean()))
data['AVTOT'].isnull().sum()
386
data["AVTOT"] = data.\
                        groupby(['TAXCLASS'])['AVTOT'].transform(lambda x: x.fillna(x.mean()))
data['AVTOT'].isnull().sum()
0
data.head().transpose()
0 1 2 3 4
RECORD 9 10 11 12 13
BBLE 1000041001 1000041002 1000041003 1000041004 1000041005
BORO 1 1 1 1 1
BLOCK 4 4 4 4 4
LOT 1001 1002 1003 1004 1005
EASEMENT NaN NaN NaN NaN NaN
OWNER TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC
BLDGCL R5 R5 R5 R5 R5
TAXCLASS 4 4 4 4 4
LTFRONT 0 0 0 0 0
LTDEPTH 0 0 0 0 0
EXT NaN NaN NaN NaN NaN
STORIES 50.0 50.0 50.0 50.0 50.0
FULLVAL 3944762.0 5700930.0 10600000.0 1011928.0 2318026.0
AVLAND 636093.0 919276.0 967500.0 163174.0 373783.0
AVTOT 1775143.0 2565419.0 4770000.0 455368.0 1043112.0
EXLAND 0.0 0.0 0.0 0.0 0.0
EXTOT 0.0 0.0 0.0 0.0 0.0
EXCD1 NaN NaN NaN NaN NaN
STADDR 1 WATER STREET 1 WATER STREET 1 WATER STREET 1 WATER STREET 1 WATER STREET
ZIP 10004.0 10004.0 10004.0 10004.0 10004.0
EXMPTCL NaN NaN NaN NaN NaN
BLDFRONT 0 0 0 0 0
BLDDEPTH 0 0 0 0 0
AVLAND2 636093.0 919276.0 967500.0 163174.0 373783.0
AVTOT2 2049290.0 2961617.0 5483912.0 525692.0 1204211.0
EXLAND2 NaN NaN NaN NaN NaN
EXTOT2 NaN NaN NaN NaN NaN
EXCD2 NaN NaN NaN NaN NaN
PERIOD FINAL FINAL FINAL FINAL FINAL
YEAR 2010/11 2010/11 2010/11 2010/11 2010/11
VALTYPE AC-TR AC-TR AC-TR AC-TR AC-TR

Fill in the missing STORIES

data['STORIES'].isnull().sum()
42030
modes = data.groupby(['BORO', 'BLDGCL'])['STORIES'] \
         .transform(lambda x: x.mode(dropna=False).iloc[0])
data['STORIES'] = data['STORIES'].fillna(modes)
data['STORIES'].isnull().sum()
37922
data["STORIES"] = data.\
                        groupby(['TAXCLASS'])['STORIES'].transform(lambda x: x.fillna(x.mean()))
data['STORIES'].isnull().sum()
0
data.head().transpose()
0 1 2 3 4
RECORD 9 10 11 12 13
BBLE 1000041001 1000041002 1000041003 1000041004 1000041005
BORO 1 1 1 1 1
BLOCK 4 4 4 4 4
LOT 1001 1002 1003 1004 1005
EASEMENT NaN NaN NaN NaN NaN
OWNER TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC TRZ HOLDINGS, LLC
BLDGCL R5 R5 R5 R5 R5
TAXCLASS 4 4 4 4 4
LTFRONT 0 0 0 0 0
LTDEPTH 0 0 0 0 0
EXT NaN NaN NaN NaN NaN
STORIES 50.0 50.0 50.0 50.0 50.0
FULLVAL 3944762.0 5700930.0 10600000.0 1011928.0 2318026.0
AVLAND 636093.0 919276.0 967500.0 163174.0 373783.0
AVTOT 1775143.0 2565419.0 4770000.0 455368.0 1043112.0
EXLAND 0.0 0.0 0.0 0.0 0.0
EXTOT 0.0 0.0 0.0 0.0 0.0
EXCD1 NaN NaN NaN NaN NaN
STADDR 1 WATER STREET 1 WATER STREET 1 WATER STREET 1 WATER STREET 1 WATER STREET
ZIP 10004.0 10004.0 10004.0 10004.0 10004.0
EXMPTCL NaN NaN NaN NaN NaN
BLDFRONT 0 0 0 0 0
BLDDEPTH 0 0 0 0 0
AVLAND2 636093.0 919276.0 967500.0 163174.0 373783.0
AVTOT2 2049290.0 2961617.0 5483912.0 525692.0 1204211.0
EXLAND2 NaN NaN NaN NaN NaN
EXTOT2 NaN NaN NaN NaN NaN
EXCD2 NaN NaN NaN NaN NaN
PERIOD FINAL FINAL FINAL FINAL FINAL
YEAR 2010/11 2010/11 2010/11 2010/11 2010/11
VALTYPE AC-TR AC-TR AC-TR AC-TR AC-TR

Fill in LTFRONT, LTDEPTH, BLDDEPTH, BLDFRONT with averages by TAXCLASS

# These 4 fields have no NAs, only invalid values to replace.
# We think 0 and 1 are invalid for these fields, so replace them with NA
# so they are not counted when calculating the groupwise means.
for col in ['LTFRONT', 'LTDEPTH', 'BLDFRONT', 'BLDDEPTH']:
    data.loc[data[col].isin([0, 1]), col] = np.nan
data.head()
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... BLDFRONT BLDDEPTH AVLAND2 AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE
0 9 1000041001 1 4 1001 NaN TRZ HOLDINGS, LLC R5 4 NaN ... NaN NaN 636093.0 2049290.0 NaN NaN NaN FINAL 2010/11 AC-TR
1 10 1000041002 1 4 1002 NaN TRZ HOLDINGS, LLC R5 4 NaN ... NaN NaN 919276.0 2961617.0 NaN NaN NaN FINAL 2010/11 AC-TR
2 11 1000041003 1 4 1003 NaN TRZ HOLDINGS, LLC R5 4 NaN ... NaN NaN 967500.0 5483912.0 NaN NaN NaN FINAL 2010/11 AC-TR
3 12 1000041004 1 4 1004 NaN TRZ HOLDINGS, LLC R5 4 NaN ... NaN NaN 163174.0 525692.0 NaN NaN NaN FINAL 2010/11 AC-TR
4 13 1000041005 1 4 1005 NaN TRZ HOLDINGS, LLC R5 4 NaN ... NaN NaN 373783.0 1204211.0 NaN NaN NaN FINAL 2010/11 AC-TR

5 rows × 32 columns

LTFRONT

data['LTFRONT'].isnull().sum()
161133
data["LTFRONT"] = data.\
                        groupby(['TAXCLASS','BORO'])['LTFRONT'].transform(lambda x: x.fillna(x.mean()))
data[data['LTFRONT'].isnull()]
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... BLDFRONT BLDDEPTH AVLAND2 AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE
126002 127752 1018259034 1 1825 9034 NaN NaN V0 1B NaN ... NaN NaN NaN NaN NaN NaN NaN FINAL 2010/11 AC-TR
126003 127753 1018259036 1 1825 9036 NaN NaN V0 1B NaN ... NaN NaN NaN NaN NaN NaN NaN FINAL 2010/11 AC-TR

2 rows × 32 columns

data["LTFRONT"] = data.\
                        groupby(['TAXCLASS'])['LTFRONT'].transform(lambda x: x.fillna(x.mean()))
data['LTFRONT'].isnull().sum()
0

LTDEPTH

data['LTDEPTH'].isnull().sum()
161715
data["LTDEPTH"] = data.\
                        groupby(['TAXCLASS','BORO'])['LTDEPTH'].transform(lambda x: x.fillna(x.mean()))
data[data['LTDEPTH'].isnull()]
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... BLDFRONT BLDDEPTH AVLAND2 AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE
126002 127752 1018259034 1 1825 9034 NaN NaN V0 1B 45.465048 ... NaN NaN NaN NaN NaN NaN NaN FINAL 2010/11 AC-TR
126003 127753 1018259036 1 1825 9036 NaN NaN V0 1B 45.465048 ... NaN NaN NaN NaN NaN NaN NaN FINAL 2010/11 AC-TR

2 rows × 32 columns

data["LTDEPTH"] = data.\
                        groupby(['TAXCLASS'])['LTDEPTH'].transform(lambda x: x.fillna(x.mean()))
data['LTDEPTH'].isnull().sum()
0

BLDFRONT

data['BLDFRONT'].isnull().sum()
206926
data['BLDFRONT'] = data.\
                        groupby(['TAXCLASS','BORO','BLDGCL'])['BLDFRONT'].transform(lambda x: x.fillna(x.mean()))
data['BLDFRONT'].isnull().sum()
18672
data['BLDFRONT'] = data.\
                        groupby(['TAXCLASS','BORO'])['BLDFRONT'].transform(lambda x: x.fillna(x.mean()))
data['BLDFRONT'].isnull().sum()
15718
data['BLDFRONT'] = data.\
                        groupby(['TAXCLASS'])['BLDFRONT'].transform(lambda x: x.fillna(x.mean()))
data['BLDFRONT'].isnull().sum()
0

BLDDEPTH

data['BLDDEPTH'].isnull().sum()
206944
data['BLDDEPTH'] = data.\
                        groupby(['TAXCLASS','BORO','BLDGCL'])['BLDDEPTH'].transform(lambda x: x.fillna(x.mean()))
data['BLDDEPTH'].isnull().sum()
15804
data['BLDDEPTH'] = data.\
                        groupby(['TAXCLASS','BORO'])['BLDDEPTH'].transform(lambda x: x.fillna(x.mean()))
data['BLDDEPTH'].isnull().sum()
12830
data['BLDDEPTH'] = data.\
                        groupby(['TAXCLASS'])['BLDDEPTH'].transform(lambda x: x.fillna(x.mean()))
data['BLDDEPTH'].isnull().sum()
0
data.dtypes
RECORD        int64
BBLE         object
BORO          int64
BLOCK         int64
LOT           int64
EASEMENT     object
OWNER        object
BLDGCL       object
TAXCLASS     object
LTFRONT     float64
LTDEPTH     float64
EXT          object
STORIES     float64
FULLVAL     float64
AVLAND      float64
AVTOT       float64
EXLAND      float64
EXTOT       float64
EXCD1       float64
STADDR       object
ZIP         float64
EXMPTCL      object
BLDFRONT    float64
BLDDEPTH    float64
AVLAND2     float64
AVTOT2      float64
EXLAND2     float64
EXTOT2      float64
EXCD2       float64
PERIOD       object
YEAR         object
VALTYPE      object
dtype: object
# convert ZIP to a string rather than a float
# We call the first three digits of the zip zip3
data['ZIP'] = data['ZIP'].astype(str)
data['zip3'] = data['ZIP'].str[:3]
cols = data.columns
print(cols)
Index(['RECORD', 'BBLE', 'BORO', 'BLOCK', 'LOT', 'EASEMENT', 'OWNER', 'BLDGCL',
       'TAXCLASS', 'LTFRONT', 'LTDEPTH', 'EXT', 'STORIES', 'FULLVAL', 'AVLAND',
       'AVTOT', 'EXLAND', 'EXTOT', 'EXCD1', 'STADDR', 'ZIP', 'EXMPTCL',
       'BLDFRONT', 'BLDDEPTH', 'AVLAND2', 'AVTOT2', 'EXLAND2', 'EXTOT2',
       'EXCD2', 'PERIOD', 'YEAR', 'VALTYPE', 'zip3'],
      dtype='object')

Now build variables that try to find properties that are unusual in ways we’re interested in

# epsilon is an arbitrary small number to make sure we don't divide by zero
epsilon = .0001
data['ltsize'] = data['LTFRONT'] * data['LTDEPTH'] + epsilon
data['bldsize'] = data['BLDFRONT'] * data['BLDDEPTH'] + epsilon
data['bldvol'] = data['bldsize'] * data['STORIES'] + epsilon
data['r1'] = data['FULLVAL'] / data['ltsize']
data['r2'] = data['FULLVAL'] / data['bldsize']
data['r3'] = data['FULLVAL'] / data['bldvol']
data['r4'] = data['AVLAND'] / data['ltsize']
data['r5'] = data['AVLAND'] / data['bldsize']
data['r6'] = data['AVLAND'] / data['bldvol']
data['r7'] = data['AVTOT'] / data['ltsize']
data['r8'] = data['AVTOT'] / data['bldsize']
data['r9'] = data['AVTOT'] / data['bldvol']
data.describe()
RECORD BORO BLOCK LOT LTFRONT LTDEPTH STORIES FULLVAL AVLAND AVTOT ... bldvol r1 r2 r3 r4 r5 r6 r7 r8 r9
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 ... 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 5.368069e+05 3.220281e+00 4.756777e+03 3.509013e+02 5.045457e+01 1.073812e+02 4.969850e+00 8.166002e+05 6.654802e+04 1.999379e+05 ... 6.143487e+04 2.107194e+02 5.350157e+02 2.451426e+02 9.965350e+00 2.336084e+01 1.025281e+01 2.534315e+01 4.970893e+01 1.930086e+01
std 3.080025e+05 1.199074e+00 3.677416e+03 8.267095e+02 5.999584e+01 5.153451e+01 8.225039e+00 6.399805e+06 2.012308e+06 5.392440e+06 ... 2.325184e+06 4.571358e+02 9.277953e+02 4.564148e+02 6.515434e+01 3.670539e+02 1.592434e+02 1.593337e+02 9.083669e+02 3.132694e+02
min 9.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.000000e+00 1.000000e+00 4.000000e+00 1.000000e+00 1.000000e+00 ... 8.000200e+00 3.614764e-04 8.696149e-04 6.269316e-05 8.333333e-06 9.662388e-05 7.432606e-06 8.333333e-06 2.000000e-04 3.716303e-05
25% 2.729100e+05 3.000000e+00 1.542000e+03 2.300000e+01 2.100000e+01 1.000000e+02 2.000000e+00 3.181550e+05 9.679000e+03 1.892600e+04 ... 1.408000e+03 7.566667e+01 2.064777e+02 8.013468e+01 2.370750e+00 5.980342e+00 2.170500e+00 5.378469e+00 1.736622e+01 7.240779e+00
50% 5.387720e+05 3.000000e+00 4.078000e+03 4.900000e+01 3.000000e+01 1.000000e+02 2.000000e+00 4.540000e+05 1.387800e+04 2.579100e+04 ... 2.210000e+03 1.533719e+02 5.000000e+02 2.391826e+02 4.590250e+00 1.504083e+01 7.180850e+00 8.619412e+00 2.700291e+01 1.272639e+01
75% 8.022750e+05 4.000000e+00 6.920000e+03 1.400000e+02 6.000000e+01 1.120598e+02 4.000000e+00 6.240000e+05 1.998000e+04 4.724400e+04 ... 7.035000e+03 2.430000e+02 6.853332e+02 3.364317e+02 7.215789e+00 2.035416e+01 1.006597e+01 1.352434e+01 3.602222e+01 1.759953e+01
max 1.070994e+06 5.000000e+00 1.635000e+04 9.450000e+03 9.999000e+03 9.619000e+03 1.190000e+02 1.663775e+09 1.792809e+09 4.668309e+09 ... 2.205711e+09 8.586666e+04 2.855351e+05 2.275000e+05 3.728371e+04 3.331136e+05 1.110379e+05 3.864000e+04 8.673969e+05 2.891323e+05

8 rows × 32 columns

I want outliers in these 9 variables, either very high or very low. Very high is easy to find, but very low values cluster near zero and probably aren't many standard deviations below the average. A simple way to find the very-low outliers is to also include 1 over each of these variables, which becomes a very large outlier when the variable is very low. First I scale each variable by its median so they all have median 1.

vars9 = ['r1','r2','r3','r4','r5','r6','r7','r8','r9']
for var in vars9:
    data[var] = data[var] / data[var].median()
    
data.describe()
RECORD BORO BLOCK LOT LTFRONT LTDEPTH STORIES FULLVAL AVLAND AVTOT ... bldvol r1 r2 r3 r4 r5 r6 r7 r8 r9
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 ... 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 5.368069e+05 3.220281e+00 4.756777e+03 3.509013e+02 5.045457e+01 1.073812e+02 4.969850e+00 8.166002e+05 6.654802e+04 1.999379e+05 ... 6.143487e+04 1.373911e+00 1.070031e+00 1.024918e+00 2.170982e+00 1.553161e+00 1.427799e+00 2.940242e+00 1.840873e+00 1.516602e+00
std 3.080025e+05 1.199074e+00 3.677416e+03 8.267095e+02 5.999584e+01 5.153451e+01 8.225039e+00 6.399805e+06 2.012308e+06 5.392440e+06 ... 2.325184e+06 2.980571e+00 1.855591e+00 1.908227e+00 1.419407e+01 2.440382e+01 2.217612e+01 1.848544e+01 3.363960e+01 2.461574e+01
min 9.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.000000e+00 1.000000e+00 4.000000e+00 1.000000e+00 1.000000e+00 ... 8.000200e+00 2.356863e-06 1.739230e-06 2.621142e-07 1.815442e-06 6.424104e-06 1.035059e-06 9.668100e-07 7.406610e-06 2.920156e-06
25% 2.729100e+05 3.000000e+00 1.542000e+03 2.300000e+01 2.100000e+01 1.000000e+02 2.000000e+00 3.181550e+05 9.679000e+03 1.892600e+04 ... 1.408000e+03 4.933542e-01 4.129555e-01 3.350355e-01 5.164751e-01 3.976071e-01 3.022622e-01 6.239949e-01 6.431241e-01 5.689580e-01
50% 5.387720e+05 3.000000e+00 4.078000e+03 4.900000e+01 3.000000e+01 1.000000e+02 2.000000e+00 4.540000e+05 1.387800e+04 2.579100e+04 ... 2.210000e+03 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
75% 8.022750e+05 4.000000e+00 6.920000e+03 1.400000e+02 6.000000e+01 1.120598e+02 4.000000e+00 6.240000e+05 1.998000e+04 4.724400e+04 ... 7.035000e+03 1.584384e+00 1.370666e+00 1.406589e+00 1.571982e+00 1.353260e+00 1.401780e+00 1.569056e+00 1.334013e+00 1.382917e+00
max 1.070994e+06 5.000000e+00 1.635000e+04 9.450000e+03 9.999000e+03 9.619000e+03 1.190000e+02 1.663775e+09 1.792809e+09 4.668309e+09 ... 2.205711e+09 5.598592e+02 5.710702e+02 9.511558e+02 8.122370e+03 2.214728e+04 1.546305e+04 4.482904e+03 3.212235e+04 2.271912e+04

8 rows × 32 columns

# add in the inverses of the 9 primary variables
for var in vars9:
    data[var + 'inv'] = 1 / (data[var] + epsilon)
data.head()
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... r9 r1inv r2inv r3inv r4inv r5inv r6inv r7inv r8inv r9inv
0 9 1000041001 1 4 1001 NaN TRZ HOLDINGS, LLC R5 4 70.813856 ... 0.188672 0.319454 1.873784 44.626014 0.059294 0.349613 8.339012 0.039897 0.224915 5.297406
1 10 1000041002 1 4 1002 NaN TRZ HOLDINGS, LLC R5 4 70.813856 ... 0.272666 0.221048 1.296641 30.921506 0.041028 0.241917 5.771662 0.027607 0.155631 3.666141
2 11 1000041003 1 4 1003 NaN TRZ HOLDINGS, LLC R5 4 70.813856 ... 0.506981 0.118886 0.697406 16.654116 0.038983 0.229860 5.484137 0.014848 0.083703 1.972072
3 12 1000041004 1 4 1004 NaN TRZ HOLDINGS, LLC R5 4 70.813856 ... 0.048399 1.245200 7.300537 171.742678 0.231138 1.362741 32.429174 0.155527 0.876721 20.619011
4 13 1000041005 1 4 1005 NaN TRZ HOLDINGS, LLC R5 4 70.813856 ... 0.110867 0.543627 3.188341 75.706405 0.100904 0.594947 14.182787 0.067895 0.382749 9.011649

5 rows × 54 columns

Now I want the large outliers where the variables are either very low or very high, so I keep only one of the two, r or its inverse, whichever is larger. This finds both the very-low and the very-high outliers.

for var in vars9:
    data[var] = data[[var, var + 'inv']].max(axis=1)

Now I can remove the inverse columns since I have the 9 variables that I need

for var in vars9:
    data.drop(columns=(var + 'inv'), inplace=True)

data.describe()
RECORD BORO BLOCK LOT LTFRONT LTDEPTH STORIES FULLVAL AVLAND AVTOT ... bldvol r1 r2 r3 r4 r5 r6 r7 r8 r9
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 ... 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 5.368069e+05 3.220281e+00 4.756777e+03 3.509013e+02 5.045457e+01 1.073812e+02 4.969850e+00 8.166002e+05 6.654802e+04 1.999379e+05 ... 6.143487e+04 1.294444e+01 2.499720e+01 1.127279e+02 8.519128e+00 1.485009e+01 5.837383e+01 5.746233e+00 9.022489e+00 2.764315e+01
std 3.080025e+05 1.199074e+00 3.677416e+03 8.267095e+02 5.999584e+01 5.153451e+01 8.225039e+00 6.399805e+06 2.012308e+06 5.392440e+06 ... 2.325184e+06 1.088717e+02 1.815853e+02 5.091489e+02 7.580573e+01 1.304015e+02 3.641357e+02 5.249273e+01 1.047840e+02 2.144214e+02
min 9.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 2.000000e+00 1.000000e+00 4.000000e+00 1.000000e+00 1.000000e+00 ... 8.000200e+00 9.999570e-01 9.999517e-01 9.999559e-01 9.999539e-01 9.999507e-01 9.999511e-01 9.999514e-01 9.999527e-01 9.999501e-01
25% 2.729100e+05 3.000000e+00 1.542000e+03 2.300000e+01 2.100000e+01 1.000000e+02 2.000000e+00 3.181550e+05 9.679000e+03 1.892600e+04 ... 1.408000e+03 1.269474e+00 1.192974e+00 1.210624e+00 1.262922e+00 1.191561e+00 1.215178e+00 1.233189e+00 1.164291e+00 1.178231e+00
50% 5.387720e+05 3.000000e+00 4.078000e+03 4.900000e+01 3.000000e+01 1.000000e+02 2.000000e+00 4.540000e+05 1.387800e+04 2.579100e+04 ... 2.210000e+03 1.701746e+00 1.488500e+00 1.560963e+00 1.682360e+00 1.474922e+00 1.577504e+00 1.585997e+00 1.403326e+00 1.470000e+00
75% 8.022750e+05 4.000000e+00 6.920000e+03 1.400000e+02 6.000000e+01 1.120598e+02 4.000000e+00 6.240000e+05 1.998000e+04 4.724400e+04 ... 7.035000e+03 3.207889e+00 3.313753e+00 3.685617e+00 3.260447e+00 4.274797e+00 4.964164e+00 2.710417e+00 2.769308e+00 3.372197e+00
max 1.070994e+06 5.000000e+00 1.635000e+04 9.450000e+03 9.999000e+03 9.619000e+03 1.190000e+02 1.663775e+09 1.792809e+09 4.668309e+09 ... 2.205711e+09 9.769741e+03 9.829050e+03 9.973857e+03 9.821693e+03 2.214728e+04 1.546305e+04 9.904245e+03 3.212235e+04 2.271912e+04

8 rows × 32 columns

Now I add more variables that standardize each of these 9 basic variables by a few logical groupings. For example, is a property's value of r1 typical for its zip code? For its tax class?

# Standardized variables by appropriate logical group
zip5_mean = data.groupby('ZIP')[vars9].mean()
taxclass_mean = data.groupby('TAXCLASS')[vars9].mean()
data = data.join(zip5_mean, on='ZIP', rsuffix='_zip5')
data = data.join(taxclass_mean, on='TAXCLASS', rsuffix='_taxclass')
rsuffix = ['_zip5', '_taxclass']
for var in vars9:
    for r in rsuffix:
        data[str(var)+r] = data[var] / data[str(var)+r]
# include two more possibly interesting variables
data['value_ratio'] = data['FULLVAL']/(data['AVLAND']+data['AVTOT'])
data['value_ratio'] = data['value_ratio']/data['value_ratio'].mean()
# again, use 1/variable if that's larger, in order to find the low outliers
data['value_ratio'] = np.where(data['value_ratio'] < 1, 1/(data['value_ratio']+epsilon), data['value_ratio'])
data['size_ratio'] = data['bldsize'] / (data['ltsize']+1)
data.head().transpose()
0 1 2 3 4
RECORD 9 10 11 12 13
BBLE 1000041001 1000041002 1000041003 1000041004 1000041005
BORO 1 1 1 1 1
BLOCK 4 4 4 4 4
LOT 1001 1002 1003 1004 1005
... ... ... ... ... ...
r7_taxclass 1.288115 1.861571 3.461304 0.330433 0.756924
r8_taxclass 0.177858 0.257039 0.477925 0.045625 0.104513
r9_taxclass 0.056876 0.039362 0.021173 0.221379 0.096755
value_ratio 8.056705 8.056707 7.135024 8.056717 8.056713
size_ratio 1.799293 1.799293 1.799293 1.799293 1.799293

65 rows × 5 columns

data.columns
Index(['RECORD', 'BBLE', 'BORO', 'BLOCK', 'LOT', 'EASEMENT', 'OWNER', 'BLDGCL',
       'TAXCLASS', 'LTFRONT', 'LTDEPTH', 'EXT', 'STORIES', 'FULLVAL', 'AVLAND',
       'AVTOT', 'EXLAND', 'EXTOT', 'EXCD1', 'STADDR', 'ZIP', 'EXMPTCL',
       'BLDFRONT', 'BLDDEPTH', 'AVLAND2', 'AVTOT2', 'EXLAND2', 'EXTOT2',
       'EXCD2', 'PERIOD', 'YEAR', 'VALTYPE', 'zip3', 'ltsize', 'bldsize',
       'bldvol', 'r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9',
       'r1_zip5', 'r2_zip5', 'r3_zip5', 'r4_zip5', 'r5_zip5', 'r6_zip5',
       'r7_zip5', 'r8_zip5', 'r9_zip5', 'r1_taxclass', 'r2_taxclass',
       'r3_taxclass', 'r4_taxclass', 'r5_taxclass', 'r6_taxclass',
       'r7_taxclass', 'r8_taxclass', 'r9_taxclass', 'value_ratio',
       'size_ratio'],
      dtype='object')
save_record = data['RECORD']
save_record.head()
0     9
1    10
2    11
3    12
4    13
Name: RECORD, dtype: int64
data['lot_bldsize'] = data['ltsize'] / data['bldsize']
# Flags and Indicators
high_market_value_threshold = data['FULLVAL'].quantile(0.90)
data['high_market_value_indicator'] = data['FULLVAL'] > high_market_value_threshold

high_building_density_threshold = 0.70
data['high_building_density_indicator'] = data['size_ratio'] > high_building_density_threshold

# Interaction Terms
data['lot_building_interaction'] = data['ltsize'] * data['bldsize']
data['market_assessed_value_interaction'] = data['FULLVAL'] * data['AVTOT']
dropcols = ['RECORD','BBLE', 'BORO', 'BLOCK', 'LOT', 'EASEMENT',
       'OWNER', 'BLDGCL', 'TAXCLASS', 'LTFRONT', 'LTDEPTH', 'EXT', 'STORIES',
       'FULLVAL', 'AVLAND', 'AVTOT', 'EXLAND', 'EXTOT', 'EXCD1', 'STADDR',
       'ZIP', 'EXMPTCL', 'BLDFRONT', 'BLDDEPTH', 'AVLAND2', 'AVTOT2',
       'EXLAND2', 'EXTOT2', 'EXCD2', 'PERIOD', 'YEAR', 'VALTYPE', 'zip3','ltsize','bldsize','bldvol']
data = data.drop(columns = dropcols)
data.shape
(1044493, 33)
# this dataframe is now just the variables for our unsupervised fraud models
data.head().transpose()
0 1 2 3 4
r1 3.130243 4.523796 8.411301 1.2452 1.839398
r2 1.873784 1.296641 1.433786 7.300537 3.188341
r3 44.626014 30.921506 16.654116 171.742678 75.706405
r4 16.865039 24.373205 25.651791 4.32631 9.910288
r5 2.860204 4.133542 4.350382 1.362741 1.680722
r6 8.339012 5.771662 5.484137 32.429174 14.182787
r7 25.064485 36.222944 67.350964 6.429659 14.728427
r8 4.446018 6.42534 11.946927 1.140513 2.612575
r9 5.297406 3.666141 1.972072 20.619011 9.011649
r1_zip5 0.344811 0.498317 0.926544 0.137165 0.202618
r2_zip5 0.068016 0.047066 0.052044 0.264999 0.115732
r3_zip5 0.111838 0.077493 0.041737 0.430409 0.18973
r4_zip5 1.040068 1.503097 1.581947 0.266804 0.611168
r5_zip5 0.140338 0.202815 0.213455 0.066864 0.082466
r6_zip5 0.027176 0.01881 0.017872 0.105685 0.046221
r7_zip5 1.368515 1.977764 3.677347 0.351058 0.804169
r8_zip5 0.536412 0.775217 1.441397 0.137603 0.315207
r9_zip5 0.104537 0.072347 0.038916 0.406889 0.177833
r1_taxclass 0.058539 0.0846 0.1573 0.023286 0.034399
r2_taxclass 0.016154 0.011178 0.012361 0.062937 0.027486
r3_taxclass 0.132343 0.091701 0.04939 0.509322 0.224516
r4_taxclass 0.545548 0.788422 0.829781 0.139947 0.320577
r5_taxclass 0.054123 0.078218 0.082321 0.025787 0.031804
r6_taxclass 0.044412 0.030738 0.029207 0.17271 0.075534
r7_taxclass 1.288115 1.861571 3.461304 0.330433 0.756924
r8_taxclass 0.177858 0.257039 0.477925 0.045625 0.104513
r9_taxclass 0.056876 0.039362 0.021173 0.221379 0.096755
value_ratio 8.056705 8.056707 7.135024 8.056717 8.056713
size_ratio 1.799293 1.799293 1.799293 1.799293 1.799293
lot_bldsize 0.555706 0.555706 0.555706 0.555706 0.555706
high_market_value_indicator True True True True True
lot_building_interaction 121492255.305798 121492255.305798 121492255.305798 121492255.305798 121492255.305798
market_assessed_value_interaction 7002516650966.0 14625274139670.0 50562000000000.0 460799629504.0 2417960736912.0
# Calculate and write the basic statistics of all the variables to check if everything looks OK
stats = data.describe().transpose()
stats
count mean std min 25% 50% 75% max
r1 1044493.0 1.294444e+01 1.088717e+02 0.999957 1.269474e+00 1.701746e+00 3.207889e+00 9.769741e+03
r2 1044493.0 2.499720e+01 1.815853e+02 0.999952 1.192974e+00 1.488500e+00 3.313753e+00 9.829050e+03
r3 1044493.0 1.127279e+02 5.091489e+02 0.999956 1.210624e+00 1.560963e+00 3.685617e+00 9.973857e+03
r4 1044493.0 8.519128e+00 7.580573e+01 0.999954 1.262922e+00 1.682360e+00 3.260447e+00 9.821693e+03
r5 1044493.0 1.485009e+01 1.304015e+02 0.999951 1.191561e+00 1.474922e+00 4.274797e+00 2.214728e+04
r6 1044493.0 5.837383e+01 3.641357e+02 0.999951 1.215178e+00 1.577504e+00 4.964164e+00 1.546305e+04
r7 1044493.0 5.746233e+00 5.249273e+01 0.999951 1.233189e+00 1.585997e+00 2.710417e+00 9.904245e+03
r8 1044493.0 9.022489e+00 1.047840e+02 0.999953 1.164291e+00 1.403326e+00 2.769308e+00 3.212235e+04
r9 1044493.0 2.764315e+01 2.144214e+02 0.999950 1.178231e+00 1.470000e+00 3.372197e+00 2.271912e+04
r1_zip5 1044493.0 1.000000e+00 8.217521e+00 0.005040 1.619956e-01 3.215630e-01 6.642092e-01 2.284920e+03
r2_zip5 1044493.0 1.000000e+00 8.869314e+00 0.001689 1.020206e-01 2.270835e-01 5.228364e-01 1.365401e+03
r3_zip5 1044493.0 1.000000e+00 9.074389e+00 0.000491 5.121857e-02 1.354225e-01 4.224967e-01 1.454222e+03
r4_zip5 1044493.0 1.000000e+00 6.586892e+00 0.007198 2.092768e-01 3.997583e-01 7.210783e-01 2.092640e+03
r5_zip5 1044493.0 1.000000e+00 9.721280e+00 0.003136 1.464844e-01 2.348185e-01 4.425110e-01 2.998695e+03
r6_zip5 1044493.0 1.000000e+00 9.493043e+00 0.000569 8.694206e-02 1.559555e-01 3.436608e-01 1.838606e+03
r7_zip5 1044493.0 1.000000e+00 7.881081e+00 0.009727 2.953191e-01 4.733899e-01 7.430448e-01 2.296383e+03
r8_zip5 1044493.0 1.000000e+00 1.234179e+01 0.006126 1.632185e-01 2.475734e-01 4.167635e-01 3.005149e+03
r9_zip5 1044493.0 1.000000e+00 1.152124e+01 0.001153 9.615886e-02 1.617427e-01 2.970505e-01 2.177199e+03
r1_taxclass 1044493.0 1.000000e+00 3.494666e+00 0.007712 5.608901e-01 7.193174e-01 9.924283e-01 1.013376e+03
r2_taxclass 1044493.0 1.000000e+00 5.572151e+00 0.008621 6.723179e-01 8.026218e-01 1.011107e+00 3.755317e+03
r3_taxclass 1044493.0 1.000000e+00 6.074469e+00 0.002193 6.655659e-01 8.122948e-01 1.064785e+00 4.260404e+03
r4_taxclass 1044493.0 1.000000e+00 4.296360e+00 0.026635 5.986215e-01 7.583626e-01 1.017097e+00 1.303541e+03
r5_taxclass 1044493.0 1.000000e+00 5.350972e+00 0.008257 5.863012e-01 7.962156e-01 9.913575e-01 3.133052e+03
r6_taxclass 1044493.0 1.000000e+00 5.316970e+00 0.004834 6.154172e-01 7.804270e-01 1.024000e+00 3.567660e+03
r7_taxclass 1044493.0 1.000000e+00 4.675335e+00 0.023186 6.099768e-01 7.581443e-01 1.000533e+00 2.082944e+03
r8_taxclass 1044493.0 1.000000e+00 6.165441e+00 0.005117 7.017349e-01 8.178881e-01 9.923429e-01 3.790923e+03
r9_taxclass 1044493.0 1.000000e+00 5.973456e+00 0.003115 6.826764e-01 8.073178e-01 1.023067e+00 4.239516e+03
value_ratio 1044493.0 3.255193e+00 1.823361e+01 0.999901 1.119627e+00 1.284058e+00 6.390237e+00 1.000271e+04
size_ratio 1044493.0 6.551120e-01 1.261914e+01 0.000027 2.541492e-01 3.811814e-01 6.245891e-01 1.019954e+04
lot_bldsize 1044493.0 4.244950e+00 6.185213e+01 0.000098 1.600000e+00 2.622768e+00 3.933566e+00 3.705596e+04
lot_building_interaction 1044493.0 1.066826e+08 1.306322e+10 3600.012000 1.716000e+06 3.130380e+06 1.085600e+07 1.031513e+13
market_assessed_value_interaction 1044493.0 2.004535e+13 2.637493e+15 40.000000 6.506798e+09 1.159500e+10 2.365120e+10 1.746040e+18
data.isna().sum().sum()
0
# zscale all the variables
data_zs = (data - data.mean()) / data.std()
data_zs_save = data_zs.copy()
data_zs.describe().transpose()
count mean std min 25% 50% 75% max
r1 1044493.0 4.530633e-18 1.0 -0.109712 -0.107236 -0.103265 -0.089431 89.617375
r2 1044493.0 -9.066709e-17 1.0 -0.132154 -0.131091 -0.129464 -0.119412 53.991444
r3 1044493.0 -2.068037e-17 1.0 -0.219441 -0.219027 -0.218339 -0.214166 19.367869
r4 1044493.0 1.768716e-18 1.0 -0.099190 -0.095721 -0.090188 -0.069370 129.451612
r5 1044493.0 -5.904789e-18 1.0 -0.106211 -0.104742 -0.102569 -0.081098 169.725243
r6 1044493.0 -2.737428e-17 1.0 -0.157562 -0.156971 -0.155976 -0.146675 42.304772
r7 1044493.0 2.552393e-17 1.0 -0.090418 -0.085975 -0.079254 -0.057833 188.568939
r8 1044493.0 -1.129257e-17 1.0 -0.076563 -0.074994 -0.072713 -0.059677 306.471822
r9 1044493.0 1.455789e-18 1.0 -0.124256 -0.123425 -0.122064 -0.113193 105.826575
r1_zip5 1044493.0 1.171434e-17 1.0 -0.121078 -0.101978 -0.082560 -0.040863 277.932913
r2_zip5 1044493.0 -4.839478e-17 1.0 -0.112558 -0.101246 -0.087145 -0.053799 153.833918
r3_zip5 1044493.0 -2.387086e-17 1.0 -0.110146 -0.104556 -0.095277 -0.063641 160.145446
r4_zip5 1044493.0 4.955125e-17 1.0 -0.150724 -0.120045 -0.091127 -0.042345 317.545865
r5_zip5 1044493.0 -4.581654e-17 1.0 -0.102545 -0.087799 -0.078712 -0.057347 308.364256
r6_zip5 1044493.0 1.185720e-17 1.0 -0.105280 -0.096182 -0.088912 -0.069139 193.573973
r7_zip5 1044493.0 3.945596e-19 1.0 -0.125652 -0.089414 -0.066820 -0.032604 291.252317
r8_zip5 1044493.0 -3.524846e-17 1.0 -0.080529 -0.067801 -0.060966 -0.047257 243.412712
r9_zip5 1044493.0 -5.727918e-17 1.0 -0.086696 -0.078450 -0.072758 -0.061013 188.885835
r1_taxclass 1044493.0 -7.925207e-18 1.0 -0.283944 -0.125651 -0.080317 -0.002167 289.691763
r2_taxclass 1044493.0 -7.393231e-17 1.0 -0.177917 -0.058807 -0.035422 0.001993 673.764395
r3_taxclass 1044493.0 -3.470084e-17 1.0 -0.164262 -0.055056 -0.030901 0.010665 701.197810
r4_taxclass 1044493.0 2.733346e-17 1.0 -0.226556 -0.093423 -0.056242 0.003979 303.173221
r5_taxclass 1044493.0 1.123815e-17 1.0 -0.185339 -0.077313 -0.038084 -0.001615 585.323933
r6_taxclass 1044493.0 -9.251743e-19 1.0 -0.187168 -0.072331 -0.041297 0.004514 670.807012
r7_taxclass 1044493.0 4.419068e-17 1.0 -0.208929 -0.083421 -0.051730 0.000114 445.303652
r8_taxclass 1044493.0 -2.258514e-18 1.0 -0.161364 -0.048377 -0.029538 -0.001242 614.704237
r9_taxclass 1044493.0 3.466002e-18 1.0 -0.166886 -0.053122 -0.032256 0.003862 709.558482
value_ratio 1044493.0 -3.327907e-17 1.0 -0.123689 -0.117123 -0.108105 0.171938 548.408183
size_ratio 1044493.0 2.881646e-17 1.0 -0.051912 -0.031774 -0.021708 -0.002419 808.207306
lot_bldsize 1044493.0 -3.417703e-17 1.0 -0.068629 -0.042762 -0.026227 -0.005034 599.037066
high_market_value_indicator 1044493.0 7.662620e-17 1.0 -0.500292 -0.500292 -0.500292 -0.500292 1.998831
lot_building_interaction 1044493.0 -2.948993e-18 1.0 -0.008166 -0.008035 -0.007927 -0.007336 789.623089
market_assessed_value_interaction 1044493.0 -3.387771e-18 1.0 -0.007600 -0.007598 -0.007596 -0.007591 662.000028
# do a complete PCA and look at the scree and cumulative variance plots
pca = PCA(n_components = .99, svd_solver = 'full')
pca.fit(data_zs)
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('Number of components minus 1')
plt.ylabel('PC variance')
plt.xticks(np.arange(0, 36, step=2))
plt.axvline(x=4, linestyle='--')

# redo the same PCA; show the scree plot as a bar chart plus the cumulative variance plot
pca = PCA(n_components = .99, svd_solver = 'full')
pca.fit(data_zs)
plt.axvline(x=4, linestyle='--')
plt.bar(np.arange(len(pca.explained_variance_ratio_)), pca.explained_variance_ratio_)
plt.plot(np.arange(len(pca.explained_variance_ratio_)), pca.explained_variance_ratio_, color='black')
plt.xlabel('Number of components minus 1')
plt.ylabel('PC variance')
plt.xticks(np.arange(0, 10, step=1))

plt.show()

plt.xlabel('Number of components minus 1')
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.ylabel('PC cumulative variance')
plt.yticks(np.arange(0.05, 1.1, step=.1))
plt.xticks(np.arange(0, 36, step=2))
plt.axvline(x=4, linestyle='--')
plt.ylim(0,1)

%%time
# now redo the PCA but just keep the top few PCs
data_zs = data_zs_save.copy()
pca = PCA(n_components = 5, svd_solver = 'full')
princ_comps = pca.fit_transform(data_zs)
pca.n_components_
CPU times: user 8.3 s, sys: 732 ms, total: 9.03 s
Wall time: 2.41 s
5
print(np.cumsum(pca.explained_variance_ratio_))
[0.3194038  0.46603662 0.57642861 0.64734677 0.70235809]
data_pca = pd.DataFrame(princ_comps, columns = ['PC' + str(i) for i in range(1, pca.n_components_+1)])
data_pca.shape
(1044493, 5)
data_pca.head(5)
PC1 PC2 PC3 PC4 PC5
0 -0.287502 -0.312171 0.021055 -0.171276 0.190024
1 -0.167137 -0.399473 0.172508 -0.057031 0.245359
2 0.069719 -0.575129 0.451087 0.203608 0.298239
3 -0.387076 -0.137734 -0.249001 -0.485465 0.059842
4 -0.381933 -0.226063 -0.122581 -0.298735 0.132512
data_pca.describe()
PC1 PC2 PC3 PC4 PC5
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 4.468048e-17 7.164659e-17 -3.877569e-17 4.680294e-17 2.031302e-17
std 3.246587e+00 2.199746e+00 1.908648e+00 1.529804e+00 1.347358e+00
min -7.175446e-01 -2.136625e+02 -3.459215e+02 -3.316447e+01 -1.712929e+02
25% -4.594581e-01 -3.344164e-02 -6.328371e-02 1.412898e-01 -4.141070e-03
50% -4.062916e-01 8.377637e-02 -1.950388e-03 2.171742e-01 4.098485e-02
75% -2.631572e-01 1.474075e-01 6.211204e-02 2.489887e-01 6.809073e-02
max 7.937746e+02 1.384273e+03 3.553147e+02 3.896609e+02 3.495221e+02
# zscale the PCs.
# I do this (make all the retained PCs equally important) if I only keep a small number of PCs.
# Alternatively you can keep up to 6-8 PCs or so and skip this second z-scale.
# I prefer to keep a somewhat small number of PCs and then make them all equally important via z-scaling.
# This second z-scale makes the later Minkowski distance similar to a Mahalanobis distance.
# Many people don't do this second z-scaling, but I like to do it.

data_pca_zs = (data_pca - data_pca.mean()) / data_pca.std()
data_pca_zs.describe()
PC1 PC2 PC3 PC4 PC5
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean -3.068041e-17 -4.081651e-20 -2.598651e-18 -7.564661e-18 3.959202e-18
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -2.210151e-01 -9.713052e+01 -1.812391e+02 -2.167891e+01 -1.271324e+02
25% -1.415203e-01 -1.520250e-02 -3.315631e-02 9.235811e-02 -3.073474e-03
50% -1.251442e-01 3.808456e-02 -1.021869e-03 1.419621e-01 3.041868e-02
75% -8.105657e-02 6.701115e-02 3.254243e-02 1.627586e-01 5.053648e-02
max 2.444951e+02 6.292876e+02 1.861604e+02 2.547130e+02 2.594129e+02
data_pca_zs.shape
(1044493, 5)
data_pca_zs.head(5)
PC1 PC2 PC3 PC4 PC5
0 -0.088555 -0.141912 0.011031 -0.111960 0.141034
1 -0.051481 -0.181599 0.090382 -0.037280 0.182104
2 0.021475 -0.261452 0.236338 0.133094 0.221351
3 -0.119225 -0.062614 -0.130459 -0.317338 0.044414
4 -0.117641 -0.102768 -0.064224 -0.195277 0.098350
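The comment above claims that this second z-scale makes a plain Euclidean (Minkowski p=2) distance in PC space behave like a Mahalanobis distance. A quick self-contained check on toy data (not the property data) illustrates why: PCA scores are uncorrelated, so dividing each PC by its standard deviation makes the Euclidean distance coincide with the Mahalanobis distance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated toy data

pcs = PCA(n_components=3).fit_transform(X)  # PC scores are mean-zero and uncorrelated

# Mahalanobis distance from the origin in PC space
cov_inv = np.linalg.inv(np.cov(pcs, rowvar=False))
d_maha = np.sqrt(np.einsum('ij,jk,ik->i', pcs, cov_inv, pcs))

# Euclidean distance after z-scaling each PC (the second z-scale step)
pcs_zs = pcs / pcs.std(axis=0, ddof=1)
d_eucl = np.sqrt((pcs_zs**2).sum(axis=1))

print(np.allclose(d_maha, d_eucl))  # True: the two distances agree
```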

Now calculate two unsupervised fraud scores

# Set the powers for the two scores: p1 for the Minkowski distance (score 1) and p2 for the autoencoder error norm (score 2).
# The final results are relatively insensitive to these choices; good choices are anywhere from 1 to about 4.
p1 = 2
p2 = 2
ntop = 10000

Calculate score 1

oop1 = 1/p1
score1 = (((data_pca_zs).abs()**p1).sum(axis=1))**oop1
score1.head(10)
0    0.246025
1    0.279909
2    0.437475
3    0.371256
4    0.276281
5    0.281636
6    0.371264
7    0.360560
8    0.360560
9    0.360560
dtype: float64
data_pca_zs.head(10)
PC1 PC2 PC3 PC4 PC5
0 -0.088555 -0.141912 0.011031 -0.111960 0.141034
1 -0.051481 -0.181599 0.090382 -0.037280 0.182104
2 0.021475 -0.261452 0.236338 0.133094 0.221351
3 -0.119225 -0.062614 -0.130459 -0.317338 0.044414
4 -0.117641 -0.102768 -0.064224 -0.195277 0.098350
5 -0.119354 -0.099474 -0.070257 -0.203236 0.094537
6 -0.006353 -0.226469 0.181305 0.043025 0.227564
7 -0.011020 -0.221945 0.172113 0.035079 0.223089
8 -0.011020 -0.221945 0.172113 0.035079 0.223089
9 -0.011020 -0.221945 0.172113 0.035079 0.223089
score1.max()
726.0134522130637
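As a sanity check (not part of the original notebook): with p1 = 2 the score1 formula is just the ordinary Euclidean distance from the origin, a special case of the Minkowski distance, which can be cross-checked against SciPy on a toy point.

```python
import numpy as np
from scipy.spatial.distance import minkowski

# score1's formula applied to a single toy row reduces to a
# Minkowski distance from the origin
v = np.array([3.0, -4.0, 12.0])
p = 2
by_hand = float((np.abs(v) ** p).sum() ** (1 / p))
by_scipy = minkowski(v, np.zeros_like(v), p)
print(by_hand, by_scipy)  # both 13.0
```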

Autoencoder for score 2

NNmodel = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic', max_iter=50, random_state=1)
NNmodel.fit(data_pca_zs,data_pca_zs)
MLPRegressor(activation='logistic', hidden_layer_sizes=3, max_iter=50,
             random_state=1)
# calculate score 2 as the error of an autoencoder
pca_out = NNmodel.predict(data_pca_zs)
error = pca_out - data_pca_zs
oop2 = 1/p2
score2 = ((error.abs()**p2).sum(axis=1))**oop2
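To see why reproduction error works as an anomaly score, here is a small self-contained demo on synthetic data (same MLPRegressor recipe as above, hypothetical toy inputs): an underfit bottleneck autoencoder reproduces the bulk of the data roughly but fails badly on a planted outlier.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))   # inlier cloud
X[0] = 25.0                     # plant one obvious outlier in row 0

# small bottleneck autoencoder, same recipe as above (it need not
# converge; an underfit model is fine for scoring)
ae = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic',
                  max_iter=50, random_state=1)
ae.fit(X, X)

recon_err = np.sqrt(((ae.predict(X) - X) ** 2).sum(axis=1))
print(recon_err.argmax())  # the planted outlier should score highest
```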
scores = pd.DataFrame(score1)
scores.columns=['score1']
scores['score2'] = score2
scores['RECORD'] = save_record
scores.head(10)
score1 score2 RECORD
0 0.246025 0.222947 9
1 0.279909 0.258002 10
2 0.437475 0.373022 11
3 0.371256 0.265296 12
4 0.276281 0.227088 13
5 0.281636 0.229178 14
6 0.371264 0.329990 15
7 0.360560 0.321740 16
8 0.360560 0.321740 17
9 0.360560 0.321740 18
scores['score1 rank'] = scores['score1'].rank()
scores['score2 rank'] = scores['score2'].rank()
scores.head(20)
score1 score2 RECORD score1 rank score2 rank
0 0.246025 0.222947 9 678868.0 808100.0
1 0.279909 0.258002 10 765499.0 856994.0
2 0.437475 0.373022 11 875948.0 963320.0
3 0.371256 0.265296 12 850129.0 870794.0
4 0.276281 0.227088 13 759084.0 813206.0
5 0.281636 0.229178 14 768573.0 815829.0
6 0.371264 0.329990 15 850134.0 937470.0
7 0.360560 0.321740 16 844765.0 931056.0
8 0.360560 0.321740 17 844764.0 931058.0
9 0.360560 0.321740 18 844769.0 931061.0
10 0.360560 0.321740 19 844770.0 931062.0
11 0.360560 0.321740 20 844767.0 931060.0
12 0.360560 0.321740 21 844768.0 931057.0
13 0.360560 0.321740 22 844766.0 931059.0
14 0.360409 0.321624 23 844681.0 930980.0
15 0.355454 0.317796 24 841943.0 928028.0
16 0.355454 0.317796 25 841942.0 928027.0
17 0.367873 0.327379 26 848537.0 935575.0
18 0.367873 0.327379 27 848535.0 935573.0
19 0.367873 0.327379 28 848534.0 935572.0
# calculate the final score as the weighted average of the two rank-ordered scores
weight = .5
scores['final'] = (weight*scores['score1 rank'] + (1-weight)*scores['score2 rank'])
scores_sorted = scores.sort_values(by='final', ascending=False)
scores_sorted.head(20)
score1 score2 RECORD score1 rank score2 rank final
934505 726.013452 724.824555 956520 1044493.0 1044493.0 1044493.0
641107 658.286442 657.201701 658933 1044492.0 1044492.0 1044492.0
897398 358.381867 329.739441 917942 1044491.0 1044491.0 1044491.0
632014 251.986026 215.525578 649717 1044490.0 1044490.0 1044490.0
960230 226.307024 213.092588 982930 1044488.0 1044489.0 1044488.5
443552 241.769839 201.444558 459429 1044489.0 1044487.0 1044488.0
957706 224.150706 211.014789 980276 1044487.0 1044488.0 1044487.5
973751 206.165107 164.724296 996722 1044486.0 1044486.0 1044486.0
320488 159.593777 160.141393 333412 1044485.0 1044485.0 1044485.0
230836 159.122510 131.377231 241946 1044484.0 1044484.0 1044484.0
110100 132.861721 130.441746 111420 1044482.0 1044483.0 1044482.5
1028942 141.101680 115.443716 1053859 1044483.0 1044481.0 1044482.0
956476 131.662481 122.488266 979038 1044481.0 1044482.0 1044481.5
1024406 126.321406 102.640900 1048771 1044476.0 1044480.0 1044478.0
1024115 126.171272 101.988095 1048475 1044475.0 1044479.0 1044477.0
212609 130.105159 95.853362 223485 1044480.0 1044466.0 1044473.0
678408 128.442574 94.071228 696562 1044479.0 1044464.0 1044471.5
1020880 120.657288 95.991879 1045012 1044472.0 1044467.0 1044469.5
7005 117.979749 101.199214 7040 1044465.0 1044473.0 1044469.0
7006 117.979749 101.199214 7041 1044465.0 1044473.0 1044469.0
scores_sorted.tail(10)
score1 score2 RECORD score1 rank score2 rank final
443730 0.092243 0.022357 459607 130.0 477.0 303.5
408048 0.104202 0.020191 423613 292.0 233.0 262.5
428809 0.092621 0.020661 444598 134.0 273.0 203.5
437408 0.102844 0.016975 453273 266.0 64.0 165.0
511325 0.100810 0.017665 527820 232.0 80.0 156.0
408873 0.102089 0.012103 424441 256.0 13.0 134.5
408049 0.101055 0.012903 423614 236.0 17.0 126.5
445557 0.094642 0.018453 461446 146.0 104.0 125.0
308515 0.082448 0.019058 321161 80.0 138.0 109.0
182515 0.081563 0.015904 186718 75.0 40.0 57.5
scores.describe()
score1 score2 RECORD score1 rank score2 rank final
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 4.861766e-01 2.176407e-01 5.368069e+05 5.222470e+05 5.222470e+05 5.222470e+05
std 2.182575e+00 1.563808e+00 3.080025e+05 3.015193e+05 3.015193e+05 2.926665e+05
min 4.918587e-02 8.101042e-03 9.000000e+00 1.000000e+00 1.000000e+00 5.750000e+01
25% 2.089560e-01 6.090604e-02 2.729100e+05 2.611240e+05 2.611240e+05 2.638430e+05
50% 2.204428e-01 9.609887e-02 5.387720e+05 5.222470e+05 5.222470e+05 5.045235e+05
75% 2.907654e-01 1.986070e-01 8.022750e+05 7.833700e+05 7.833700e+05 7.812585e+05
max 7.260135e+02 7.248246e+02 1.070994e+06 1.044493e+06 1.044493e+06 1.044493e+06
scores_sorted.describe()
score1 score2 RECORD score1 rank score2 rank final
count 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06 1.044493e+06
mean 4.861766e-01 2.176407e-01 5.368069e+05 5.222470e+05 5.222470e+05 5.222470e+05
std 2.182575e+00 1.563808e+00 3.080025e+05 3.015193e+05 3.015193e+05 2.926665e+05
min 4.918587e-02 8.101042e-03 9.000000e+00 1.000000e+00 1.000000e+00 5.750000e+01
25% 2.089560e-01 6.090604e-02 2.729100e+05 2.611240e+05 2.611240e+05 2.638430e+05
50% 2.204428e-01 9.609887e-02 5.387720e+05 5.222470e+05 5.222470e+05 5.045235e+05
75% 2.907654e-01 1.986070e-01 8.022750e+05 7.833700e+05 7.833700e+05 7.812585e+05
max 7.260135e+02 7.248246e+02 1.070994e+06 1.044493e+06 1.044493e+06 1.044493e+06
scores_sorted.set_index('RECORD', drop=True, inplace=True)
scores_sorted.head(10)
score1 score2 score1 rank score2 rank final
RECORD
956520 726.013452 724.824555 1044493.0 1044493.0 1044493.0
658933 658.286442 657.201701 1044492.0 1044492.0 1044492.0
917942 358.381867 329.739441 1044491.0 1044491.0 1044491.0
649717 251.986026 215.525578 1044490.0 1044490.0 1044490.0
982930 226.307024 213.092588 1044488.0 1044489.0 1044488.5
459429 241.769839 201.444558 1044489.0 1044487.0 1044488.0
980276 224.150706 211.014789 1044487.0 1044488.0 1044487.5
996722 206.165107 164.724296 1044486.0 1044486.0 1044486.0
333412 159.593777 160.141393 1044485.0 1044485.0 1044485.0
241946 159.122510 131.377231 1044484.0 1044484.0 1044484.0
sc1max = int(score1.max())
plt.hist(score1, bins =100, range=(0,sc1max+1))
plt.yscale('log')
plt.ylim(ymin=.1)

sc2max = int(score2.max())
sc2max
724
sc2max = int(score2.max())
print(sc2max)
plt.hist(score2, bins =100, range=(0,sc2max+1))
plt.yscale('log')
plt.ylim(ymin=.1)
724

The flatter the next plot, the more similar the two scores are: if the two scores give nearly the same rank order, averaging the ranks barely changes them, so the final score stays close to uniform and its histogram is flat.
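This flatness effect can be demonstrated on synthetic scores (a toy illustration, not the property data): when two scores agree in rank order the averaged rank is exactly uniform, while an unrelated second score pulls the average toward the middle and reduces its spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10000
s1 = pd.Series(rng.lognormal(size=n))

# a monotone transform preserves the rank order, so the averaged
# rank is just the rank itself: a flat (uniform) histogram
final_same = 0.5 * s1.rank() + 0.5 * (s1 * 2).rank()

# an unrelated second score pulls averaged ranks toward the middle,
# so the histogram humps instead of staying flat
s2 = pd.Series(rng.lognormal(size=n))
final_diff = 0.5 * s1.rank() + 0.5 * s2.rank()

print(final_same.std() > final_diff.std())  # True: less spread when scores disagree
```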

sns.displot(scores['final'])

top_records = scores_sorted.head(ntop).index
print(top_records)
Index([ 956520,  658933,  917942,  649717,  982930,  459429,  980276,  996722,
        333412,  241946,
       ...
         98965, 1059396,  956149,  779437, 1008025,  161344,   51125,  929750,
         58145,  944706],
      dtype='int64', name='RECORD', length=10000)
data_zs['RECORD'] = save_record
data_zs.set_index('RECORD', inplace=True, drop=True)
data_zs.head()
r1 r2 r3 r4 r5 r6 r7 r8 r9 r1_zip5 ... r6_taxclass r7_taxclass r8_taxclass r9_taxclass value_ratio size_ratio lot_bldsize high_market_value_indicator lot_building_interaction market_assessed_value_interaction
RECORD
9 -0.090145 -0.127342 -0.133756 0.110096 -0.091946 -0.137407 0.368018 -0.043675 -0.104214 -0.079731 ... -0.179724 0.061624 -0.133347 -0.157886 0.263333 0.09067 -0.059646 1.998831 0.001134 -0.004945
10 -0.077345 -0.130520 -0.160673 0.209141 -0.082181 -0.144458 0.580589 -0.024786 -0.111822 -0.061050 ... -0.182296 0.184280 -0.120504 -0.160818 0.263333 0.09067 -0.059646 1.998831 0.001134 -0.002055
11 -0.041637 -0.129765 -0.188695 0.226007 -0.080518 -0.145247 1.173586 0.027909 -0.119723 -0.008939 ... -0.182584 0.526444 -0.084678 -0.163863 0.212785 0.09067 -0.059646 1.998831 0.001134 0.011570
12 -0.107459 -0.097456 0.115909 -0.055310 -0.103429 -0.071250 0.013019 -0.075221 -0.032759 -0.104999 ... -0.155594 -0.143213 -0.154794 -0.130347 0.263334 0.09067 -0.059646 1.998831 0.001134 -0.007425
13 -0.102001 -0.120103 -0.072713 0.018352 -0.100991 -0.121359 0.171113 -0.061173 -0.086892 -0.097034 ... -0.173871 -0.051991 -0.145243 -0.151210 0.263334 0.09067 -0.059646 1.998831 0.001134 -0.006683

5 rows × 33 columns

scores.set_index('RECORD',inplace=True)
scores.drop(columns=['score1','score2'],inplace=True)
scores.head(30)
score1 rank score2 rank final
RECORD
9 678868.0 808100.0 743484.0
10 765499.0 856994.0 811246.5
11 875948.0 963320.0 919634.0
12 850129.0 870794.0 860461.5
13 759084.0 813206.0 786145.0
14 768573.0 815829.0 792201.0
15 850134.0 937470.0 893802.0
16 844765.0 931056.0 887910.5
17 844764.0 931058.0 887911.0
18 844769.0 931061.0 887915.0
19 844770.0 931062.0 887916.0
20 844767.0 931060.0 887913.5
21 844768.0 931057.0 887912.5
22 844766.0 931059.0 887912.5
23 844681.0 930980.0 887830.5
24 841943.0 928028.0 884985.5
25 841942.0 928027.0 884984.5
26 848537.0 935575.0 892056.0
27 848535.0 935573.0 892054.0
28 848534.0 935572.0 892053.0
29 848536.0 935574.0 892055.0
30 848072.0 935039.0 891555.5
31 719090.0 828360.0 773725.0
32 719092.0 828358.0 773725.0
33 719096.0 828366.0 773731.0
34 719087.0 828357.0 773722.0
35 719089.0 828359.0 773724.0
36 719095.0 828365.0 773730.0
37 719091.0 828363.0 773727.0
38 719094.0 828364.0 773729.0
scores.tail(30)
score1 rank score2 rank final
RECORD
1070965 522201.0 502692.0 512446.5
1070966 836834.0 898565.0 867699.5
1070967 824253.0 848050.0 836151.5
1070968 891054.0 980773.0 935913.5
1070969 830636.0 862357.0 846496.5
1070970 743884.0 767988.0 755936.0
1070971 820033.0 838925.0 829479.0
1070972 736017.0 764968.0 750492.5
1070973 761012.0 777574.0 769293.0
1070974 757279.0 776538.0 766908.5
1070975 559461.0 576011.0 567736.0
1070976 791591.0 801969.0 796780.0
1070977 739251.0 769153.0 754202.0
1070978 579983.0 558380.0 569181.5
1070979 432258.0 447401.0 439829.5
1070980 1037809.0 1041573.0 1039691.0
1070981 594367.0 592528.0 593447.5
1070982 623164.0 646634.0 634899.0
1070983 693762.0 722095.0 707928.5
1070984 806874.0 818237.0 812555.5
1070985 657713.0 695697.0 676705.0
1070986 726138.0 755905.0 741021.5
1070987 391803.0 435637.0 413720.0
1070988 220064.0 216836.0 218450.0
1070989 864500.0 934193.0 899346.5
1070990 633654.0 663809.0 648731.5
1070991 906686.0 991818.0 949252.0
1070992 861146.0 938972.0 900059.0
1070993 658402.0 684300.0 671351.0
1070994 310196.0 257939.0 284067.5
NY_data_with_scores = NY_data_orig.join(scores, on='RECORD')
NY_data_with_scores['final'] = NY_data_with_scores['final'].fillna(1)
NY_data_with_scores
RECORD BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT ... AVTOT2 EXLAND2 EXTOT2 EXCD2 PERIOD YEAR VALTYPE score1 rank score2 rank final
0 1 1000010101 1 1 101 NaN U S GOVT LAND & BLDGS P7 4 500 ... 8613000.0 3775500.0 8613000.0 NaN FINAL 2010/11 AC-TR NaN NaN 1.0
1 2 1000010201 1 1 201 NaN U S GOVT LAND & BLDGS Z9 4 27 ... 80690400.0 11111400.0 80690400.0 NaN FINAL 2010/11 AC-TR NaN NaN 1.0
2 3 1000020001 1 2 1 NaN DEPT OF GENERAL SERVI Y7 4 709 ... 40179510.0 32321790.0 40179510.0 NaN FINAL 2010/11 AC-TR NaN NaN 1.0
3 4 1000020023 1 2 23 NaN DEPARTMENT OF BUSINES T2 4 793 ... 15750000.0 13644000.0 15750000.0 NaN FINAL 2010/11 AC-TR NaN NaN 1.0
4 5 1000030001 1 3 1 NaN PARKS AND RECREATION Q1 4 323 ... 107758350.0 106348680.0 107758350.0 NaN FINAL 2010/11 AC-TR NaN NaN 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1070989 1070990 5080500083 5 8050 83 NaN TOBIN, GALE A1 1 60 ... NaN NaN NaN NaN FINAL 2010/11 AC-TR 633654.0 663809.0 648731.5
1070990 1070991 5080500086 5 8050 86 NaN SHERRI MILINAZZO A1 1 62 ... NaN NaN NaN NaN FINAL 2010/11 AC-TR 906686.0 991818.0 949252.0
1070991 1070992 5080500089 5 8050 89 NaN JOHN GERVASI A1 1 53 ... NaN NaN NaN 1017.0 FINAL 2010/11 AC-TR 861146.0 938972.0 900059.0
1070992 1070993 5080500092 5 8050 92 NaN RITA M MOOG A1 1 52 ... NaN NaN NaN NaN FINAL 2010/11 AC-TR 658402.0 684300.0 671351.0
1070993 1070994 5080500094 5 8050 94 NaN EDWARD DONOHUE A1 1 50 ... NaN NaN NaN NaN FINAL 2010/11 AC-TR 310196.0 257939.0 284067.5

1070994 rows × 35 columns

NY_data_scored_and_sorted = NY_data_with_scores.sort_values(by=['final','RECORD'], ascending = [False,True])
NY_data_scored_zs = NY_data_with_scores.join(data_zs, on='RECORD')
NY_data_scored_zs.set_index('RECORD',inplace=True)
NY_data_scored_zs.head(20)
BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT LTDEPTH ... r6_taxclass r7_taxclass r8_taxclass r9_taxclass value_ratio size_ratio lot_bldsize high_market_value_indicator lot_building_interaction market_assessed_value_interaction
RECORD
1 1000010101 1 1 101 NaN U S GOVT LAND & BLDGS P7 4 500 1046 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1000010201 1 1 201 NaN U S GOVT LAND & BLDGS Z9 4 27 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1000020001 1 2 1 NaN DEPT OF GENERAL SERVI Y7 4 709 564 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1000020023 1 2 23 NaN DEPARTMENT OF BUSINES T2 4 793 551 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1000030001 1 3 1 NaN PARKS AND RECREATION Q1 4 323 1260 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 1000030002 1 3 2 NaN PARKS AND RECREATION Q1 4 496 76 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 1000030003 1 3 3 NaN PARKS AND RECREATION Q1 4 180 370 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 1000030010 1 3 10 NaN DEPT RE-CITY OF NY Z9 4 362 177 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 1000041001 1 4 1001 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.179724 0.061624 -0.133347 -0.157886 0.263333 0.09067 -0.059646 1.998831 0.001134 -0.004945
10 1000041002 1 4 1002 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.182296 0.184280 -0.120504 -0.160818 0.263333 0.09067 -0.059646 1.998831 0.001134 -0.002055
11 1000041003 1 4 1003 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.182584 0.526444 -0.084678 -0.163863 0.212785 0.09067 -0.059646 1.998831 0.001134 0.011570
12 1000041004 1 4 1004 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.155594 -0.143213 -0.154794 -0.130347 0.263334 0.09067 -0.059646 1.998831 0.001134 -0.007425
13 1000041005 1 4 1005 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.173871 -0.051991 -0.145243 -0.151210 0.263334 0.09067 -0.059646 1.998831 0.001134 -0.006683
14 1000041006 1 4 1006 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.173052 -0.060826 -0.146168 -0.150276 0.263333 0.09067 -0.059646 1.998831 0.001134 -0.006781
15 1000041007 1 4 1007 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183813 0.326043 -0.105661 -0.162547 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002596
16 1000041008 1 4 1008 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183696 0.311661 -0.107167 -0.162414 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002060
17 1000041009 1 4 1009 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183696 0.311661 -0.107167 -0.162414 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002060
18 1000041010 1 4 1010 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183696 0.311661 -0.107167 -0.162414 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002060
19 1000041011 1 4 1011 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183696 0.311661 -0.107167 -0.162414 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002060
20 1000041012 1 4 1012 NaN TRZ HOLDINGS, LLC R5 4 0 0 ... -0.183696 0.311661 -0.107167 -0.162414 0.263333 0.09067 -0.059646 1.998831 0.001134 0.002060

20 rows × 67 columns

NY_data_scored_zs_sorted = NY_data_scored_zs.sort_values(by=['final','RECORD'], ascending = [False,True])
NY_data_top_n = NY_data_scored_zs_sorted.head(ntop)
NY_data_top_n
BBLE BORO BLOCK LOT EASEMENT OWNER BLDGCL TAXCLASS LTFRONT LTDEPTH ... r6_taxclass r7_taxclass r8_taxclass r9_taxclass value_ratio size_ratio lot_bldsize high_market_value_indicator lot_building_interaction market_assessed_value_interaction
RECORD
956520 5006590012 5 659 12 NaN TROMPETA RIZALINA A1 1 25 91 ... 670.807012 -0.075673 614.704237 709.558482 -0.102717 316.656659 -0.068627 1.998831 1.575971 -0.007597
658933 4029060054 4 2906 54 NaN WAN CHIU CHEUNG C0 1 25 100 ... 636.374724 0.066416 509.715716 607.332283 -0.110012 443.540911 -0.068628 1.998831 2.671111 -0.007586
917942 4142600001 4 14260 1 NaN LOGAN PROPERTY, INC. T1 4 4910 0 ... 15.300514 9.478741 208.261239 40.667816 12.040692 -0.051220 1.776500 -0.500292 0.244888 662.000028
649717 4025270002 4 2527 2 NaN 57-43 LLC V1 4 51 940 ... 2.396351 88.289486 12.295657 6.686020 0.478378 -0.050460 0.812136 -0.500292 -0.004937 -0.007600
982930 5020900016 5 2090 16 NaN FOREST VIEW HOMEOWNER Z0 1 371 211 ... 11.027945 445.303652 18.567554 18.446932 -0.085063 -0.051105 1.514320 -0.500292 -0.003375 -0.007600
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
161344 2028830060F 2 2883 60 F NaN V1 4 308 414 ... -0.175918 0.119849 -0.108352 -0.128775 0.471817 -0.045016 0.117096 -0.500292 0.100182 -0.007599
51125 1009130040 1 913 40 NaN RICHARD A CIRILLO B9 1 25 98 ... 0.931872 2.670813 1.486738 0.706372 -0.113786 -0.013116 -0.035622 -0.500292 -0.007942 -0.005410
929750 4161220056 4 16122 56 NaN GRADY, DIANA L C2 2A 40 97 ... -0.139221 0.433725 0.341563 0.139585 0.455862 -0.037621 0.020984 -0.500292 -0.007959 -0.007600
58145 1010114004 1 1011 4004 NaN CENTRAL PARK REALTY H R4 2 0 0 ... 1.577151 0.393144 1.562417 4.051715 0.314487 0.065690 -0.057737 1.998831 0.013932 -0.007600
944706 5002580005 5 258 5 NaN NaN V0 1B 47 840 ... -0.093423 1.306094 -0.081544 -0.072015 7.276806 -0.044005 0.093359 -0.500292 0.003742 -0.007599

10000 rows × 67 columns
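As a compact complement to the heatmaps used later, the driving variables for each top record can also be listed directly: for each record, take the k variables with the largest absolute z-scores. A sketch with a few hand-copied z-score values (the record indices and `r1`–`r3` columns here are just illustrative):

```python
import pandas as pd

# Hypothetical z-scored variables for three top records
zs = pd.DataFrame({'r1': [0.1, 89.0, 15.6],
                   'r2': [31.0, 36.6, 0.2],
                   'r3': [12.6, 16.3, -0.1]},
                  index=[956520, 649717, 982930])

# For each record, the k variables with the largest |z| are the likely drivers
k = 2
drivers = zs.abs().apply(lambda row: list(row.nlargest(k).index), axis=1)
```

This gives investigators a per-record shortlist of variables to check first, without having to read the full heatmap.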

NY_data_top_n['OWNER'].head(40)
RECORD
956520         TROMPETA RIZALINA
658933           WAN CHIU CHEUNG
917942      LOGAN PROPERTY, INC.
649717                 57-43 LLC
982930     FOREST VIEW HOMEOWNER
459429                       NaN
980276     WOODMONT WEST HOA INC
996722     IMPERIAL COURT HOMEOW
333412            SPOONER ALSTON
241946       RUFFALO ENTERPRISES
111420     BOXWOOD FLTD PARNTERS
1053859                      NaN
979038           CITY WEST H.O.A
1048771     HUGUENOT VILLAGE H O
1048475    PARK VILLAGE RESIDENT
223485             ASSET HLDG CP
696562                       NaN
1045012           LINDA VITALONE
7033                         NaN
7034              HSIA, JONATHAN
7035                NOVARO, HUGO
7036             LECH, JOHN PAUL
7037                 HANLON, AMY
7038               OWEN, MICHAEL
7039              LAMBERT, PETER
7040            BILINKAS, EDWARD
7041                         NaN
7042           SWENDSRUD, MONICA
7043                CHOI, JOSEPH
844895                XEDIT CORP
146231              NY NH & H RR
373564               J BRENOWITZ
524297             MARTIN OLINER
871048       VALERIE SHAKESPEARE
114694                       NaN
539374         CONS LAND INV INC
435070               KLEIN HARRY
718883           GARDEN VIEW LTD
877038              DASH MARLENE
1000954    MATIONAL COLLECTORS &
Name: OWNER, dtype: object
# You can review this list and add any of these owners to the exclusions if warranted
plt.figure(figsize=(6,14))
NY_data_top_n['OWNER'].value_counts().head(50).sort_values().plot(kind='barh')

NY_data_top_n.shape
(10000, 67)
# Restrict the top records to those with nonzero lot frontage and building depth
NY_top_lotsize_ne_0 = NY_data_top_n[NY_data_top_n['LTFRONT'] != 0]
NY_top_sizes_ne_0 = NY_top_lotsize_ne_0[NY_top_lotsize_ne_0['BLDDEPTH'] != 0]
nfields = 34  # number of leading raw data fields; the built variables start here
data_base_vars = NY_data_top_n.iloc[:, nfields:nfields+9]  # the nine base ratio variables r1-r9
data_base_vars.head()
r1 r2 r3 r4 r5 r6 r7 r8 r9
RECORD
956520 -0.109693 31.053329 12.586493 -0.092675 35.717026 15.129681 -0.089171 51.484331 28.990580
658933 -0.100307 25.979800 10.859515 -0.081413 33.529659 14.345071 -0.068305 42.678673 24.796319
917942 -0.082428 0.627763 -0.031184 8.275884 169.725243 42.304772 16.688598 306.471822 105.826575
649717 89.063881 36.576029 16.284157 90.587261 8.849828 6.925421 153.273624 18.237425 17.653725
982930 15.633665 0.239314 -0.085664 28.627389 0.594399 0.095272 65.313551 1.484807 0.634812
# Heatmaps of the absolute z-scores are good for seeing which variables are driving the high scores
data_heatmap = data_base_vars.abs().head(23)
plt.rcParams['figure.figsize'] = (20,10)
ax = sns.heatmap(data_heatmap, center=0, vmin=0, vmax=50, cmap='Reds')
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
plt.xticks(rotation=90)

data_all_vars = NY_data_top_n.iloc[:,nfields:]
data_all_vars.head()
r1 r2 r3 r4 r5 r6 r7 r8 r9 r1_zip5 ... r6_taxclass r7_taxclass r8_taxclass r9_taxclass value_ratio size_ratio lot_bldsize high_market_value_indicator lot_building_interaction market_assessed_value_interaction
RECORD
956520 -0.109693 31.053329 12.586493 -0.092675 35.717026 15.129681 -0.089171 51.484331 28.990580 -0.120225 ... 670.807012 -0.075673 614.704237 709.558482 -0.102717 316.656659 -0.068627 1.998831 1.575971 -0.007597
658933 -0.100307 25.979800 10.859515 -0.081413 33.529659 14.345071 -0.068305 42.678673 24.796319 -0.110202 ... 636.374724 0.066416 509.715716 607.332283 -0.110012 443.540911 -0.068628 1.998831 2.671111 -0.007586
917942 -0.082428 0.627763 -0.031184 8.275884 169.725243 42.304772 16.688598 306.471822 105.826575 0.084077 ... 15.300514 9.478741 208.261239 40.667816 12.040692 -0.051220 1.776500 -0.500292 0.244888 662.000028
649717 89.063881 36.576029 16.284157 90.587261 8.849828 6.925421 153.273624 18.237425 17.653725 277.932913 ... 2.396351 88.289486 12.295657 6.686020 0.478378 -0.050460 0.812136 -0.500292 -0.004937 -0.007600
982930 15.633665 0.239314 -0.085664 28.627389 0.594399 0.095272 65.313551 1.484807 0.634812 9.801218 ... 11.027945 445.303652 18.567554 18.446932 -0.085063 -0.051105 1.514320 -0.500292 -0.003375 -0.007600

5 rows × 33 columns

data_heatmap = data_all_vars.abs().head(30)
plt.rcParams['figure.figsize'] = (20,10)
ax = sns.heatmap(data_heatmap, center=0, vmin=0, vmax=50, cmap='Reds')
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
plt.xticks(rotation=90)

top_records_df = pd.DataFrame(top_records)
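The top-scoring records are typically handed off to investigators; a minimal export sketch, assuming the `NY_data_top_n` frame built above (the stand-in frame and file name here are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for NY_data_top_n: final score plus owner
top = pd.DataFrame({'final': [9.8, 9.5],
                    'OWNER': ['A LLC', 'B CORP']},
                   index=[1, 2])

# Write the top-scoring records out for the investigation team,
# keeping the RECORD index as an explicit column
top.to_csv('top_n_for_review.csv', index_label='RECORD')
```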