In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stat

from sklearn.ensemble import RandomForestRegressor
from sklearn import cross_validation as cv
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

from imblearn.combine import SMOTEENN

import matplotlib.pyplot as plt
%matplotlib inline

1. Introduction

1.1 Background

Fannie Mae is a government-sponsored corporation whose primary purpose is to purchase mortgage loans from primary lenders such as big commercial banks, for example Chase. It then sells those loans as securities in the bond market. Fannie Mae's government-backed status benefits the corporation in several ways. First, the federal backing gives it access to more funding resources than most local commercial banks. Second, the securities sold by Fannie Mae on the bond market receive a higher credit ranking, since many long-term investors believe a security is not likely to default if it is somehow related to the federal government. Fannie Mae claims that they "expand the secondary mortgage market by securitizing mortgages in the form of mortgage-backed securities, allowing lenders to reinvest their assets into more lending and in effect increasing the number of lenders in the mortgage market by reducing the reliance on locally based savings and loan associations". In short, Fannie Mae connects investors all around the world to the U.S. mortgage market: it provides funding liquidity to local lenders and gives global investors access to the U.S. mortgage market.
Even though Fannie Mae is supported by the federal government, the performance of its mortgage-backed securities is still strongly tied to the repayment of the loans Fannie Mae purchases. No primary lender can guarantee that the borrowers it picks will never default. When purchasing mortgages from primary lenders, Fannie Mae therefore needs to avoid the borrowers who are very likely to default in the end.

1.2 Project Goal

This project analyses the Fannie Mae Single-Family Loan Performance Data. Fannie Mae releases acquisition and performance data for its loans every quarter. The dataset contains a large amount of information recorded about borrowers over the course of repaying their loans. We try to see whether this data can be used to predict the risk of default for a given borrower.

2. Data Gathering.

The Fannie Mae data are located at https://loanperformancedata.fanniemae.com/lppub/index.html. The data are free to the public and anyone can download them after registering with Fannie Mae. Fannie Mae also provides instructions on how to interpret the format of the dataset.

2.1 Import Fannie Mae Data

The Fannie Mae data consist of two separate txt files for each quarter. The first is the "Acquisition" file, which contains the static information recorded when Fannie Mae acquired a single-family loan from a commercial bank. The second is the "Performance" file, which contains time series information on the payment history of a specific loan. Notice that these two files do not contain any column names, so we need to supply the names specified in the instructions provided by Fannie Mae.

In [2]:
col_acq = ['LoanID','Channel','SellerName','OrInterestRate','OrUnpaidPrinc','OrLoanTerm',
        'OrDate','FirstPayment','OrLTV','OrCLTV','NumBorrow','DTIRat','CreditScore',
        'FTHomeBuyer','LoanPurpose','PropertyType','NumUnits','OccStatus','PropertyState',
        'Zip','MortInsPerc','ProductType','CoCreditScore','MortInsType','RelMortInd'];

col_per = ['LoanID','MonthRep','Servicer','CurrInterestRate','CAUPB','LoanAge','MonthsToMaturity',
          'AdMonthsToMaturity','MaturityDate','MSA','CLDS','ModFlag','ZeroBalCode','ZeroBalDate',
          'LastInstallDate','ForeclosureDate','DispositionDate','PPRC','AssetRecCost','MHRC',
          'ATFHP','NetSaleProceeds','CreditEnhProceeds','RPMWP','OFP','NIBUPB','PFUPB','RMWPF',
          'FPWA','ServicingIndicator'];

df_acq = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Acquisition_2007Q4.txt', sep='|', names=col_acq, index_col=False)
df_perf = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Performance_2007Q4.txt', sep='|', names=col_per, usecols=[0, 15], index_col=False)

2.2 Data Wrangling

In this project, the question we want to answer is whether a single-family loan will end up in default. We therefore need a way to investigate the 'Acquisition' and 'Performance' files at the same time.
As a starting point, we keep only the last row of the loan payment history under each distinct LoanID. We do it this way because a loan is considered to have defaulted once it fails to fulfill the payment requirement during any time period. We then retrieve the ForeclosureDate column from the 'Performance' file and merge it with the 'Acquisition' file on LoanID. We also rename ForeclosureDate to Default, fill the null values with 0 and set all non-null values to 1. When Default equals 0, the borrower behind that LoanID is still repaying the mortgage; when Default equals 1, the borrower has stopped repaying it.

In [3]:
df_perf.drop_duplicates(subset='LoanID', keep='last', inplace=True)
df = pd.merge(df_acq, df_perf, on='LoanID', how='inner')

df.rename(index=str, columns={"ForeclosureDate": 'Default'}, inplace=True)
df['Default'].fillna(0, inplace=True)
df.loc[df['Default'] != 0, 'Default'] = 1

df['Default'] = df['Default'].astype(int)

Now let's take a look at the data frame we have.

In [4]:
print("number of attributes:",len(df.columns))
print("number of records:",len(df))
number of attributes: 26
number of records: 391209
In [5]:
df
Out[5]:
LoanID Channel SellerName OrInterestRate OrUnpaidPrinc OrLoanTerm OrDate FirstPayment OrLTV OrCLTV ... NumUnits OccStatus PropertyState Zip MortInsPerc ProductType CoCreditScore MortInsType RelMortInd Default
0 100002296854 C BANK OF AMERICA, N.A. 6.625 343000 360 10/2007 12/2007 86 86.0 ... 1 P CO 809 25.0 FRM 756.0 2.0 N 0
1 100006876815 C BANK OF AMERICA, N.A. 6.250 400000 360 10/2007 12/2007 62 62.0 ... 1 P CA 920 NaN FRM 790.0 NaN N 0
2 100008184591 B OTHER 6.625 81000 360 11/2007 01/2008 64 82.0 ... 1 P LA 708 NaN FRM NaN NaN N 0
3 100008870761 B AMTRUST BANK 6.500 119000 360 11/2007 01/2008 71 71.0 ... 1 P IL 600 NaN FRM NaN NaN N 0
4 100013284158 B BANK OF AMERICA, N.A. 6.625 205000 360 10/2007 12/2007 27 27.0 ... 1 P CA 907 NaN FRM NaN NaN N 0
5 100016840211 R PNC BANK, N.A. 6.375 210000 360 08/2007 10/2007 80 93.0 ... 1 P PA 160 NaN FRM NaN NaN N 0
6 100023772994 R JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 6.750 172000 360 09/2007 11/2007 95 95.0 ... 1 P IL 605 30.0 FRM NaN 1.0 N 0
7 100026162343 B BANK OF AMERICA, N.A. 6.875 108000 360 11/2007 01/2008 90 90.0 ... 2 I MO 631 25.0 FRM 668.0 1.0 N 1
8 100030319603 C JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 6.250 175000 180 10/2007 12/2007 70 70.0 ... 1 P IL 601 NaN FRM NaN NaN N 0
9 100031046995 C BANK OF AMERICA, N.A. 6.500 224000 360 10/2007 12/2007 62 62.0 ... 1 P AZ 850 NaN FRM NaN NaN N 0
10 100036338486 B BANK OF AMERICA, N.A. 7.000 47000 360 08/2007 11/2007 67 67.0 ... 1 P IN 464 NaN FRM NaN NaN N 0
11 100039990921 C BANK OF AMERICA, N.A. 6.500 166000 360 10/2007 12/2007 95 95.0 ... 1 S VA 240 30.0 FRM 666.0 1.0 N 0
12 100043379676 R BANK OF AMERICA, N.A. 6.125 70000 360 11/2007 01/2008 77 77.0 ... 1 P SC 298 NaN FRM 686.0 NaN N 0
13 100043973870 B BANK OF AMERICA, N.A. 7.250 147000 360 08/2007 10/2007 90 90.0 ... 1 I IL 601 25.0 FRM NaN 1.0 N 1
14 100046332832 B SUNTRUST MORTGAGE INC. 6.500 45000 360 11/2007 01/2008 57 57.0 ... 1 P WV 257 NaN FRM 739.0 NaN N 0
15 100051174956 B FIRST TENNESSEE BANK NATIONAL ASSOCIATION 5.625 204000 360 10/2007 12/2007 75 75.0 ... 1 I MA 21 NaN FRM NaN NaN N 1
16 100053251572 R WELLS FARGO BANK, N.A. 6.375 120000 360 10/2007 12/2007 80 80.0 ... 1 S FL 327 NaN FRM 676.0 NaN N 0
17 100053457392 C FLAGSTAR CAPITAL MARKETS CORPORATION 7.375 111000 360 08/2007 10/2007 75 93.0 ... 1 S TX 775 NaN FRM 713.0 NaN N 1
18 100055723684 C FLAGSTAR CAPITAL MARKETS CORPORATION 6.875 331000 360 09/2007 11/2007 71 71.0 ... 1 P NY 109 NaN FRM 609.0 NaN N 0
19 100057662068 R OTHER 6.625 372000 360 09/2007 11/2007 49 49.0 ... 1 P WA 981 NaN FRM 673.0 NaN N 0
20 100058998523 B FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB 6.500 250000 360 12/2007 02/2008 50 50.0 ... 1 P CA 917 NaN FRM NaN NaN N 0
21 100059647383 C BANK OF AMERICA, N.A. 6.625 159000 360 11/2007 12/2007 89 89.0 ... 1 P VA 244 25.0 FRM 785.0 1.0 N 1
22 100064375524 C WELLS FARGO BANK, N.A. 6.750 224000 360 08/2007 10/2007 70 70.0 ... 1 P MO 631 NaN FRM NaN NaN N 0
23 100064846848 C CITIMORTGAGE, INC. 6.250 156000 360 11/2007 12/2007 80 80.0 ... 1 P WV 253 NaN FRM 683.0 NaN N 0
24 100065596836 C AMTRUST BANK 6.375 127000 360 08/2007 11/2007 55 55.0 ... 1 P IN 474 NaN FRM 693.0 NaN N 0
25 100066447645 C BANK OF AMERICA, N.A. 7.250 264000 360 09/2007 11/2007 80 80.0 ... 1 I MD 207 NaN FRM 724.0 NaN N 0
26 100067374219 C WELLS FARGO BANK, N.A. 6.750 268000 360 07/2007 09/2007 95 95.0 ... 1 P FL 323 25.0 FRM NaN 1.0 N 0
27 100069119582 C BANK OF AMERICA, N.A. 7.250 233000 360 12/2007 02/2008 92 92.0 ... 1 P IL 622 30.0 FRM 632.0 2.0 N 0
28 100070906603 B SUNTRUST MORTGAGE INC. 6.625 332000 360 08/2007 10/2007 80 80.0 ... 1 P VA 231 NaN FRM 773.0 NaN N 0
29 100072621645 C JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 6.250 313000 360 10/2007 12/2007 56 56.0 ... 1 P NY 109 NaN FRM NaN NaN N 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
391179 999921876990 B FIRST TENNESSEE BANK NATIONAL ASSOCIATION 8.000 109000 360 09/2007 11/2007 75 75.0 ... 1 I UT 840 NaN FRM 613.0 NaN N 0
391180 999922546207 R OTHER 6.625 280000 360 11/2007 01/2008 80 80.0 ... 1 P GA 300 NaN FRM NaN NaN N 0
391181 999927406299 R OTHER 7.375 44000 360 10/2007 12/2007 80 80.0 ... 1 I IN 473 NaN FRM 770.0 NaN N 0
391182 999929944792 R OTHER 6.625 350000 360 08/2007 10/2007 75 75.0 ... 1 P FL 333 NaN FRM NaN NaN N 1
391183 999930157069 B BANK OF AMERICA, N.A. 6.875 261000 360 10/2007 12/2007 95 95.0 ... 1 P NV 891 30.0 FRM 584.0 1.0 N 0
391184 999933620064 B PNC BANK, N.A. 6.375 143000 360 11/2007 01/2008 56 56.0 ... 1 P MD 217 NaN FRM 755.0 NaN N 0
391185 999934164645 R PNC BANK, N.A. 6.125 105000 180 11/2007 01/2008 80 80.0 ... 1 S IN 460 NaN FRM 739.0 NaN N 0
391186 999934665412 B BANK OF AMERICA, N.A. 5.875 178000 360 10/2007 12/2007 75 75.0 ... 1 P AZ 859 NaN FRM NaN NaN N 0
391187 999942349981 B SUNTRUST MORTGAGE INC. 6.750 120000 360 11/2007 01/2008 89 89.0 ... 1 P MA 11 25.0 FRM NaN 1.0 N 0
391188 999946001004 C WELLS FARGO BANK, N.A. 6.500 267000 360 08/2007 10/2007 94 94.0 ... 1 P TX 750 25.0 FRM NaN 1.0 N 0
391189 999947275603 B SUNTRUST MORTGAGE INC. 6.750 149000 360 09/2007 11/2007 70 70.0 ... 2 I OR 974 NaN FRM NaN NaN N 0
391190 999947372245 B FIRST TENNESSEE BANK NATIONAL ASSOCIATION 6.500 412000 360 09/2007 11/2007 80 95.0 ... 1 P TX 761 NaN FRM 763.0 NaN N 0
391191 999951031363 B JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 6.125 375000 360 10/2007 12/2007 79 79.0 ... 1 P TX 773 NaN FRM NaN NaN N 0
391192 999954821017 C JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 6.625 118000 360 09/2007 11/2007 80 80.0 ... 1 P MI 494 NaN FRM NaN NaN N 0
391193 999957860102 C FLAGSTAR CAPITAL MARKETS CORPORATION 6.500 202000 360 09/2007 11/2007 89 89.0 ... 1 P WI 532 17.0 FRM 705.0 1.0 N 0
391194 999958944118 R OTHER 6.875 101000 360 11/2007 01/2008 90 90.0 ... 1 I OH 442 25.0 FRM 791.0 2.0 N 0
391195 999960451839 C BANK OF AMERICA, N.A. 6.250 160000 360 11/2007 01/2008 77 77.0 ... 1 P NJ 88 NaN FRM NaN NaN N 0
391196 999961761904 R WELLS FARGO BANK, N.A. 6.500 176000 360 09/2007 11/2007 90 90.0 ... 1 P GA 300 25.0 FRM NaN 1.0 N 0
391197 999963866940 C BANK OF AMERICA, N.A. 6.500 212000 360 10/2007 12/2007 67 67.0 ... 1 P AL 354 NaN FRM 727.0 NaN N 0
391198 999964289708 C CITIMORTGAGE, INC. 6.000 253000 360 11/2007 01/2008 56 56.0 ... 1 P DE 198 NaN FRM 806.0 NaN N 0
391199 999965448735 C CITIMORTGAGE, INC. 5.750 225000 360 10/2007 12/2007 70 90.0 ... 1 P MA 10 NaN FRM NaN NaN N 0
391200 999970019289 B FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB 6.000 350000 360 09/2007 11/2007 75 75.0 ... 1 P CA 902 NaN FRM NaN NaN N 0
391201 999974637553 C BANK OF AMERICA, N.A. 6.750 215000 360 09/2007 11/2007 70 70.0 ... 1 P MD 210 NaN FRM 713.0 NaN N 0
391202 999977642464 C GMAC MORTGAGE, LLC 6.500 223000 360 08/2007 11/2007 88 88.0 ... 1 P IL 604 25.0 FRM 783.0 1.0 N 0
391203 999977655671 B AMTRUST BANK 6.000 273000 360 12/2007 02/2008 54 58.0 ... 1 P VA 201 NaN FRM 788.0 NaN N 0
391204 999984651015 R OTHER 6.500 139000 360 11/2007 01/2008 58 58.0 ... 1 P MO 641 NaN FRM NaN NaN N 0
391205 999986532145 R OTHER 6.375 47000 180 01/2007 03/2007 76 76.0 ... 1 P OH 441 NaN FRM NaN NaN N 0
391206 999988683412 C CITIMORTGAGE, INC. 6.125 135000 360 11/2007 01/2008 56 56.0 ... 1 I CA 913 NaN FRM NaN NaN N 0
391207 999999388912 B BANK OF AMERICA, N.A. 6.250 344000 360 11/2007 01/2008 80 80.0 ... 1 P PA 155 NaN FRM 771.0 NaN N 0
391208 999999565669 B PNC BANK, N.A. 6.750 360000 360 09/2007 11/2007 38 38.0 ... 1 P FL 330 NaN FRM NaN NaN N 0

391209 rows × 26 columns

We can see that the dataframe has 391,209 rows and 26 columns, and contains several different types of information. There are numerical data such as interest rate and credit score, textual data such as the seller's name and the property state, and time stamp data. We can also notice that many columns contain missing values that need to be either filled in or removed before we do any analysis.

3. Data Cleaning and Transformation.

Data cleaning and transformation is an important step before we build our predictive models. As shown in the previous section, there are two main problems. First, the Fannie Mae dataset contains a lot of missing values that need to be handled. Second, there are many categorical features whose values are strings. We need to transform the type of those features so that they can be processed by popular Python machine learning packages.

3.1 Data Cleaning

Let's now check how many missing values there are for each feature. We can see below that 'MortInsPerc', 'CoCreditScore' and 'MortInsType' have too many null values. It would not be meaningful to fill in those null values since more than half of the observations lack that information, so we simply drop those columns.

In [6]:
df.apply(lambda x: x.isnull().sum(), axis=0)
Out[6]:
LoanID                 0
Channel                0
SellerName             0
OrInterestRate         1
OrUnpaidPrinc          0
OrLoanTerm             0
OrDate                 0
FirstPayment           0
OrLTV                  0
OrCLTV                32
NumBorrow              6
DTIRat             11481
CreditScore          626
FTHomeBuyer            0
LoanPurpose            0
PropertyType           0
NumUnits               0
OccStatus              0
PropertyState          0
Zip                    0
MortInsPerc       306284
ProductType            0
CoCreditScore     234970
MortInsType       306284
RelMortInd             0
Default                0
dtype: int64

We also checked the number of unique values of each feature. Notice that the column "ProductType" has only 1 unique value, which means it affects all observations in the same way, so we drop that column in further analysis. In contrast, the column "LoanID" has a different value for every observation, since it is required to be distinct as a key. It is obvious that "LoanID" doesn't have any effect on the default probability of a loan, so we can drop it as well.

In [7]:
df.apply(lambda x: x.nunique(), axis=0)
Out[7]:
LoanID            391209
Channel                3
SellerName            14
OrInterestRate       528
OrUnpaidPrinc        649
OrLoanTerm           140
OrDate                53
FirstPayment          53
OrLTV                 97
OrCLTV               120
NumBorrow              9
DTIRat                64
CreditScore          386
FTHomeBuyer            3
LoanPurpose            3
PropertyType           5
NumUnits               4
OccStatus              3
PropertyState         54
Zip                  893
MortInsPerc           18
ProductType            1
CoCreditScore        359
MortInsType            2
RelMortInd             2
Default                2
dtype: int64

Now let's drop all the features mentioned above.

In [8]:
df.drop(['LoanID','MortInsPerc','MortInsType','CoCreditScore','ProductType'], axis=1, inplace=True)

Here we fill in the missing values for all numerical features. Instead of simply filling NaN with the mean value, we use a regression technique: for each numerical column with missing entries, we train a random forest regressor on the rows where that column is known and use it to predict the missing entries.

In [9]:
nan_columns = df.columns[df.isnull().any()]
for name in nan_columns:
    if df[name].dtypes != 'object':
        df_number = df.select_dtypes(exclude=['object'])
        y = df_number.loc[df[name].notnull(), name].values
        X = df_number.loc[df[name].notnull()].drop(nan_columns, axis=1).values
        X_test = df_number.loc[df[name].isnull()].drop(nan_columns, axis=1).values
        print("Filling Column:",name)
        model = RandomForestRegressor(n_estimators=400, max_depth=3)
        model.fit(X, y)
        df.loc[df[name].isnull(), name] = model.predict(X_test)
        print("Filling Finish")
Filling Column: OrInterestRate
Filling Finish
Filling Column: OrCLTV
Filling Finish
Filling Column: NumBorrow
Filling Finish
Filling Column: DTIRat
Filling Finish
Filling Column: CreditScore
Filling Finish

Let's make sure that our filling procedure worked well and that there are no missing values left in the data frame.

In [10]:
#df.to_csv('2007Q4_cleaned.txt', header=True, index=False, sep='\t', mode='a')
df.apply(lambda x: x.isnull().sum(), axis=0)
Out[10]:
Channel           0
SellerName        0
OrInterestRate    0
OrUnpaidPrinc     0
OrLoanTerm        0
OrDate            0
FirstPayment      0
OrLTV             0
OrCLTV            0
NumBorrow         0
DTIRat            0
CreditScore       0
FTHomeBuyer       0
LoanPurpose       0
PropertyType      0
NumUnits          0
OccStatus         0
PropertyState     0
Zip               0
RelMortInd        0
Default           0
dtype: int64

From the result above we can be sure that we no longer need to worry about missing values in our further analysis.

3.2 Data Transformation

Recall that the features in the Fannie Mae dataset belong to different data types. Mixing different types of data in one predictive model usually causes trouble, so let's first check what types of features we have in our data frame.

In [11]:
df.dtypes
Out[11]:
Channel            object
SellerName         object
OrInterestRate    float64
OrUnpaidPrinc       int64
OrLoanTerm          int64
OrDate             object
FirstPayment       object
OrLTV               int64
OrCLTV            float64
NumBorrow         float64
DTIRat            float64
CreditScore       float64
FTHomeBuyer        object
LoanPurpose        object
PropertyType       object
NumUnits            int64
OccStatus          object
PropertyState      object
Zip                 int64
RelMortInd         object
Default             int32
dtype: object

It is shown that all features in this Pandas dataframe are stored as either numbers or objects. The features with 'object' type are actually categorical variables. We replicate this dataframe into a new dataframe and use simple natural numbers to represent the different values of the categorical variables. For example, if "BoA" and "Chase" appear as loan sellers, we may use 1 for "BoA" and 2 for "Chase". The newly created dataframe is more suitable for training a predictive model.
Notice that we also treat the zip code as a categorical feature. Although zip codes appear to be numbers, it is meaningless to apply arithmetic rules to them, so zip code is indeed a categorical feature.

In [12]:
df_2 = df.copy()

num_columns = ['OrInterestRate','OrUnpaidPrinc','OrLoanTerm','OrLTV','OrCLTV',
               'NumBorrow','DTIRat','CreditScore','NumUnits'];
cat_columns = ['Channel','SellerName','FTHomeBuyer','LoanPurpose','PropertyType',
               'OccStatus','PropertyState','RelMortInd','Zip'];

for column in cat_columns:
    df_2[column]=df_2[column].astype("category").cat.codes

We performed another potentially useful pre-processing step: we split up the date columns into their month and year components. This may be effective since the time stamp may carry some seasonal information.

In [13]:
df_2['OrDateMonth'] = df_2['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['OrDateYear'] = df_2['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_2['FirstMonth'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['FirstYear'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_2.drop(['OrDate','FirstPayment'], axis=1, inplace=True)

Now all features in the new dataframe, called df_2, are represented as numbers.

In [14]:
df_2.dtypes
Out[14]:
Channel              int8
SellerName           int8
OrInterestRate    float64
OrUnpaidPrinc       int64
OrLoanTerm          int64
OrLTV               int64
OrCLTV            float64
NumBorrow         float64
DTIRat            float64
CreditScore       float64
FTHomeBuyer          int8
LoanPurpose          int8
PropertyType         int8
NumUnits            int64
OccStatus            int8
PropertyState        int8
Zip                 int16
RelMortInd           int8
Default             int32
OrDateMonth         int32
OrDateYear          int32
FirstMonth          int32
FirstYear           int32
dtype: object

Another popular way to deal with categorical features is to create a dummy variable for each possible value of the feature. Considering that some of our categorical features have more than a dozen possible values (such as "Seller's Name"), we prefer not to use this method unless it is necessary for our predictive model to perform well.
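
To make the idea concrete, the following is a minimal sketch (not one of the executed cells in this notebook) of how pandas' get_dummies expands a single categorical column into 0/1 indicator columns; the toy seller names are made up for illustration.

import pandas as pd

# Toy example: one categorical column with three distinct seller names.
toy = pd.DataFrame({'SellerName': ['BoA', 'Chase', 'BoA', 'Other']})

# get_dummies creates one 0/1 column per distinct value;
# drop_first=True drops one redundant column to avoid perfect collinearity.
dummies = pd.get_dummies(toy['SellerName'], prefix='Seller', drop_first=True)
print(dummies)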

4. Simple Indicators Exploration.

In this section we try to see whether some simple indicators can be useful in the decision process when Fannie Mae purchases new loans. Remember that Fannie Mae buys these loans from banks and generates cash flow through the repayments of individual borrowers, so we believe we need to pay attention both to the banks that sell the loans and to the individual borrowers behind each loan.
Compared to local mortgage lenders, Fannie Mae cannot get in touch with borrowers face to face. In contrast, commercial banks have agents who can talk to borrowers and collect more detailed information when a borrower applies for a loan. If a commercial bank is more capable of picking borrowers who repay reliably, Fannie Mae should consider buying loans from that bank first.
We also want to explore whether any visible difference exists between reliable borrowers and defaulting borrowers. We hope that reliable borrowers differ significantly from defaulting borrowers on some indicators.

In summary, we decided to explore the following questions:

1. Which banks are less likely to sell loans that end up defaulting?
2. Is there any difference in any feature between good borrowers and bad borrowers? If yes, which feature is the best indicator for distinguishing good borrowers from bad ones?

4.1 Which banks are less likely to sell loans that end up defaulting?

In this section we visualize the percentage of defaulted loans sold by each bank. We also compare the default rate of each bank with the average default rate of the whole market and try to find banks with better-than-average performance.

In [15]:
xtab = pd.pivot_table(df, index='SellerName', columns='Default', aggfunc='size')
xtab[0]['Overall']=xtab[0].sum()
xtab[1]['Overall']=xtab[1].sum()

df_seller = pd.concat([xtab[0],xtab[1]], axis=1)
df_seller
Out[15]:
0 1
SellerName
AMTRUST BANK 10051 989
BANK OF AMERICA, N.A. 99410 12218
CHASE HOME FINANCE, LLC 6223 607
CITIMORTGAGE, INC. 33972 3216
FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB 7730 1598
FIRST TENNESSEE BANK NATIONAL ASSOCIATION 7067 617
FLAGSTAR CAPITAL MARKETS CORPORATION 11812 1438
GMAC MORTGAGE, LLC 20755 2107
HSBC BANK USA, NATIONAL ASSOCIATION 532 5
JPMORGAN CHASE BANK, NATIONAL ASSOCIATION 36786 2410
OTHER 55376 3235
PNC BANK, N.A. 25126 2232
SUNTRUST MORTGAGE INC. 14156 1793
WELLS FARGO BANK, N.A. 27420 2328
Overall 356416 34793
In [16]:
#xtab = pd.pivot_table(df, index='SellerName', columns='Default', aggfunc='size')
df_seller = df_seller.div(df_seller.sum(axis=1), axis=0)
df_seller = df_seller.sort_values((0))


df_seller.plot.barh(stacked=True, figsize=(6,4))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.xlabel('Fraction of Borrowers')
plt.ylabel('Sellers')
Out[16]:
Text(0,0.5,'Sellers')

From the figure above, we can see that HSBC Bank has extraordinarily good performance: nearly no loan that came from HSBC Bank ended up in default. However, we must keep in mind that HSBC only sold a little more than five hundred loans to Fannie Mae in this quarter. This number is really small compared to the other banks, so HSBC's low default rate may be biased by the small amount of data.
Loans purchased from the following major banks have a lower default rate than the overall market: JP Morgan Chase, Wells Fargo, First Tennessee Bank, PNC Bank, CitiMortgage and Chase Home Finance. Considering that these are already large commercial banks, we are confident in saying that they show stronger capability at risk management.
On the other hand, some major banks such as Bank of America perform worse than the overall average. We suggest that Fannie Mae reduce the number of loans purchased from these banks.
We also noticed an interesting phenomenon related to the smaller banks that are not listed individually in the dataset. The average default rate of these smaller banks is lower than that of all the major banks except HSBC. Our guess is that those smaller banks have more local connections to their customers, and those connections help them know the borrowers better during the mortgage application process.

Conclusion for section 4.1

In summary, we have three pieces of advice for Fannie Mae.

1. Fannie Mae should give higher preference to the following large commercial banks: JP Morgan Chase, Wells Fargo, First Tennessee Bank, PNC Bank, CitiMortgage and Chase Home Finance.
2. Fannie Mae should reduce the number of loans purchased from the following banks: AmTrust Bank, GMAC Mortgage, Flagstar Capital, Bank of America and SunTrust Mortgage.
3. Fannie Mae should consider exploring more opportunities to work with smaller local banks.

4.2 Is there any difference in any feature between good borrowers and bad borrowers?

To answer this question, we explore several features using box plots. In each subplot, the blue box shows the distribution of that feature's values for the default class, while the orange box shows the values for the non-default class. For the sake of simplicity, we focus on only 6 numerical features: original interest rate, original unpaid principal, original LTV, original CLTV, debt-to-income ratio and credit score.

In [17]:
num_columns_2 = ['OrInterestRate','OrUnpaidPrinc','OrLTV','OrCLTV','DTIRat','CreditScore'];

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(8,6))
plt.tight_layout(w_pad=4.0, h_pad=4.0)

for i, column in zip(range(1, len(num_columns_2)+1), num_columns_2):
    plt.subplot(2,3,i)
    sns.boxplot(x="Default", y=column, data=df, linewidth=0.5)
    plt.xlabel('Default')

The figures above show that, on average, defaulters tend to have a higher original interest rate, a higher debt-to-income ratio and a lower credit score. From the figure, credit score looks like the feature with the largest difference in both distribution and average value, so we decided to verify the assumption that defaulters and non-defaulters differ significantly in credit score.

Do defaulters and non-defaulters behave differently with respect to credit score?

To answer this question, we first look numerically at some key statistics of the credit score distribution for both defaulters and non-defaulters.

In [18]:
crdscr_cmp=pd.concat([df[['CreditScore']][df['Default']==0].describe(),df[['CreditScore']][df['Default']==1].describe()],axis=1)
crdscr_cmp.columns = ['CreditScore when Default=0', 'CreditScore when Default=1']
crdscr_cmp
Out[18]:
CreditScore when Default=0 CreditScore when Default=1
count 356416.000000 34793.000000
mean 722.983789 690.826426
std 60.983618 56.766786
min 392.000000 437.000000
25% 677.000000 649.000000
50% 730.000000 687.000000
75% 776.000000 733.000000
max 843.000000 827.000000

The credit scores of non-defaulters are clearly higher than those of defaulters on most statistics, such as the mean, the 25% quantile, the median and the 75% quantile. To verify that non-defaulters indeed tend to have higher credit scores, we further run the following hypothesis test. Let $\mu_0$ be the average credit score of non-defaulters and $\mu_1$ be the average credit score of defaulters. The hypothesis test is formulated as $$ H_0: \mu_0 \leq \mu_1 $$ $$ H_1: \mu_0 > \mu_1 $$ We perform a one-sided t-test for this hypothesis.

In [19]:
result = stat.ttest_ind(df[df['Default'] == 0]['CreditScore'],df[df['Default'] == 1]['CreditScore'],equal_var = False)
result
Out[19]:
Ttest_indResult(statistic=100.17299990501223, pvalue=0.0)

Recall that the scipy function ttest_ind used above performs a two-sided test and reports a signed test statistic. Given the t and p values from the two-tailed test, we should reject the null hypothesis of a greater-than test when p/2 < alpha and t > 0, and of a less-than test when p/2 < alpha and t < 0. From the result above, we have $$ t > 0 $$ $$ p/2 < .05 $$ So we are confident in rejecting the null hypothesis and saying that the average credit score of non-defaulters is higher than the average credit score of defaulters.
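
As a small sketch of the rule above (assuming the `result` object returned by ttest_ind in the previous cell), the one-sided decision can be written out explicitly:

# Sketch: turning scipy's two-sided t-test output into a one-sided (greater-than) decision.
alpha = 0.05
t_stat, p_two_sided = result.statistic, result.pvalue

# For H1: mu_0 > mu_1, halve the two-sided p-value and require a positive t statistic.
if t_stat > 0 and p_two_sided / 2 < alpha:
    print("Reject H0: non-defaulters have a higher mean credit score")
else:
    print("Fail to reject H0")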

We also visualize the density of the credit score distribution for both defaulters and non-defaulters. We try to see whether there is a boundary that we can use to distinguish defaulters from non-defaulters.

In [20]:
cs_nondefault=df['CreditScore'][df['Default']==0].dropna()
cs_default=df['CreditScore'][df['Default']==1].dropna()

sns.set()
sns.distplot( cs_nondefault , color="skyblue", label="Default=0")
sns.distplot( cs_default , color="red", label="Default=1")
plt.legend()
Out[20]:
<matplotlib.legend.Legend at 0x1c00ac0d978>

From the density plot, we can see that the distribution of credit scores for non-defaulters is clearly skewed toward higher values. Although the two distributions overlap heavily, we may say that the boundary between a good and a bad credit score is around 730.
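
As a rough sanity check of that 730 boundary (a sketch using the merged dataframe df built earlier, not a claim about an exact threshold), one can compare the observed default rates on either side of the cutoff:

# Sketch: observed default rate above and below a credit score of 730.
above = df.loc[df['CreditScore'] >= 730, 'Default'].mean()
below = df.loc[df['CreditScore'] < 730, 'Default'].mean()
print("Default rate with CreditScore >= 730:", above)
print("Default rate with CreditScore <  730:", below)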

Conclusion for section 4.2

In summary, we can say that

1. Non-defaulters tend to have higher credit scores than defaulters.
2. If a borrower's credit score is higher than 730, s/he probably won't default.

5. Predicting Bad Loans

In this section we try to build a more sophisticated predictive model that can make use of all the information hidden in every feature for Fannie Mae. Predicting whether a loan will end up in default is a binary classification problem. Before training our model, we first need to pre-process the data further.

5.1 Data Preprocessing for Training

Here we carried out 3 data preprocessing steps:

1. We scaled the numerical features.
2. We resampled the dataset to correct the class imbalance in the response feature "Default".
3. We split the dataset into training and testing sets.
In [21]:
#Step 1: rescaling
scaler = StandardScaler()
df_2[num_columns] = scaler.fit_transform(df_2[num_columns])

#Step 2: resampling
sm = SMOTEENN()

y = df_2['Default'].values
X = df_2.drop(['Default'], axis=1).values

X_resampled, y_resampled = sm.fit_sample(X, y)

#Step 3: splitting
X_train, X_test, y_train, y_test = cv.train_test_split(X_resampled, y_resampled, test_size = 0.25, random_state=0)

5.2 Training with Logistic Regression

Figuring out whether a loan will default is a typical two-class classification problem. Logistic regression is one of the most popular classification methods since it is powerful yet simple. We start by feeding our data into a logistic regression model to see whether it works well.

In [22]:
model_fitted=LogisticRegressionCV(random_state=1,cv=3).fit(X_train,y_train)
predictions = model_fitted.predict(X_test)
In [23]:
# Functions for model evaluation
def compute_false_negatives(target,predictions):
    false_negatives=pd.DataFrame({"target":target,"predictions":predictions})
    neg_rate=false_negatives[(false_negatives["target"]==1) & (false_negatives["predictions"]==0)].shape[0]/(false_negatives[(false_negatives["target"]==1)].shape[0]+1)
    return neg_rate

def compute_false_positive(target,predictions):
    false_positives=pd.DataFrame({"target":target,"predictions":predictions})
    pos_rate=false_positives[(false_positives["target"]==0) & (false_positives["predictions"]==1)].shape[0]/(false_positives[(false_positives["target"]==0)].shape[0]+1)
    return pos_rate
In [24]:
# Model evaluation
print(metrics.classification_report(y_test, predictions))

accuracy=metrics.accuracy_score(y_test,predictions)
FN=compute_false_negatives(y_test,predictions)
FP=compute_false_positive(y_test,predictions)
print("Accuracy of the model:{}".format(accuracy))
print("False Negatives Rates:{}".format(FN))
print("False Positive Rates:{}".format(FP))
             precision    recall  f1-score   support

          0       0.79      0.69      0.74     60637
          1       0.81      0.88      0.84     88294

avg / total       0.80      0.80      0.80    148931

Accuracy of the model:0.802196990552672
False Negatives Rates:0.12292881816637409
False Positive Rates:0.30682080543553547

It can be seen that our model has an accuracy of almost 80% on the testing data, so at least it shows some power over random guessing. However, recall that the current default rate in the data set is 34793/391209 = 8.89%. If Fannie Mae used this logistic model, the default rate among the loans it accepts would be $$1 - \text{precision of non-defaulters} = 1 - 0.79 = 21\%$$ which is far worse than Fannie Mae's current performance, so the logistic regression model needs to be improved. We believe one weakness of this model is that it is still too "nice" to potential defaulters: about 12 percent of defaulters are classified as non-defaulters, while at the same time more than 30 percent of reliable borrowers are classified as defaulters. We checked the ROC curve to see whether moving the decision boundary could be helpful.
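
The quantity used above, one minus the precision on the non-default class, can be read off the confusion matrix directly. The following sketch (assuming y_test and predictions from the cells above) makes that computation explicit:

# Sketch: default rate among loans the model would approve (predicted non-default),
# i.e. 1 - precision of class 0, taken from the confusion matrix.
from sklearn import metrics

tn, fp, fn, tp = metrics.confusion_matrix(y_test, predictions).ravel()
default_rate_among_approved = fn / (tn + fn)
print("Default rate among approved loans:", default_rate_among_approved)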

In [34]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, model_fitted.predict_proba(X_test)[:,1])
roc_auc = metrics.roc_auc_score(y_test, predictions)

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

The AUC value is 0.71, which is not bad but also not outstanding.

Improvement directions

We decided to improve our model in two directions. The first is to improve the data transformation step. The second is to use a more sophisticated classification model.

For the data transformation step, instead of encoding the categorical features as integer codes as in section 3.2, we created dummy variables for each possible value of every categorical feature. This transformation method is more precise in theory and is widely used in several statistical tasks such as experimental design.

In [28]:
# Replace each categorical (object) column with 0/1 dummy columns; numeric columns are kept unchanged.
def getdummies(df):
    columns = df.columns[df.isnull().any()]
    nan_cols = df[columns]

    df.drop(nan_cols.columns, axis=1, inplace=True)

    cat = df.select_dtypes(include=['object'])
    num = df.drop(cat.columns, axis=1)

    data = pd.DataFrame()
    for i in cat.columns:
        tmp = pd.get_dummies(cat[i], drop_first=True)
        data = pd.concat([data, tmp], axis=1)

    df = pd.concat([num,data,nan_cols], axis=1).reset_index(drop=True)
    return df
In [29]:
df_3 = df.copy()

df_3['OrDateMonth'] = df_3['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['OrDateYear'] = df_3['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_3['FirstMonth'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['FirstYear'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_3.drop(['OrDate','FirstPayment'], axis=1, inplace=True)

df_3 = getdummies(df_3)

df_3.dtypes
Out[29]:
OrInterestRate                               float64
OrUnpaidPrinc                                  int64
OrLoanTerm                                     int64
OrLTV                                          int64
OrCLTV                                       float64
NumBorrow                                    float64
DTIRat                                       float64
CreditScore                                  float64
NumUnits                                       int64
Zip                                            int64
Default                                        int32
OrDateMonth                                    int32
OrDateYear                                     int32
FirstMonth                                     int32
FirstYear                                      int32
C                                              uint8
R                                              uint8
BANK OF AMERICA, N.A.                          uint8
CHASE HOME FINANCE, LLC                        uint8
CITIMORTGAGE, INC.                             uint8
FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB       uint8
FIRST TENNESSEE BANK NATIONAL ASSOCIATION      uint8
FLAGSTAR CAPITAL MARKETS CORPORATION           uint8
GMAC MORTGAGE, LLC                             uint8
HSBC BANK USA, NATIONAL ASSOCIATION            uint8
JPMORGAN CHASE BANK, NATIONAL ASSOCIATION      uint8
OTHER                                          uint8
PNC BANK, N.A.                                 uint8
SUNTRUST MORTGAGE INC.                         uint8
WELLS FARGO BANK, N.A.                         uint8
                                              ...   
MO                                             uint8
MS                                             uint8
MT                                             uint8
NC                                             uint8
ND                                             uint8
NE                                             uint8
NH                                             uint8
NJ                                             uint8
NM                                             uint8
NV                                             uint8
NY                                             uint8
OH                                             uint8
OK                                             uint8
OR                                             uint8
PA                                             uint8
PR                                             uint8
RI                                             uint8
SC                                             uint8
SD                                             uint8
TN                                             uint8
TX                                             uint8
UT                                             uint8
VA                                             uint8
VI                                             uint8
VT                                             uint8
WA                                             uint8
WI                                             uint8
WV                                             uint8
WY                                             uint8
Y                                              uint8
Length: 94, dtype: object

We then go through the same preprocessing steps and build the new training and testing data sets.

In [31]:
#Step 1: rescaling
scaler = StandardScaler()
df_3[num_columns] = scaler.fit_transform(df_3[num_columns])

#Step 2: resampling
sm = SMOTEENN()

y3 = df_3['Default'].values
X3 = df_3.drop(['Default'], axis=1).values

X3_resampled, y3_resampled = sm.fit_sample(X3, y3)

#Step 3: splitting
X3_train, X3_test, y3_train, y3_test = cv.train_test_split(X3_resampled, y3_resampled, test_size = 0.25, random_state=0)

We also switched our model from logistic regression to random forests. Random forests are a relatively recent method that has been shown to be powerful in classification tasks.

In [32]:
model_rf = RandomForestClassifier(n_estimators=250)
model_rf = model_rf.fit(X3_train, y3_train)
predictions_rf = model_rf.predict(X3_test)
In [33]:
# Model evaluation
print(metrics.classification_report(y3_test, predictions_rf))

accuracy_rf=metrics.accuracy_score(y3_test,predictions_rf)
FN_rf=compute_false_negatives(y3_test,predictions_rf)
FP_rf=compute_false_positive(y3_test,predictions_rf)
print("Accuracy of the model:{}".format(accuracy_rf))
print("False Negatives Rates:{}".format(FN_rf))
print("False Positive Rates:{}".format(FP_rf))
             precision    recall  f1-score   support

          0       0.94      0.96      0.95     58991
          1       0.97      0.96      0.97     88705

avg / total       0.96      0.96      0.96    147696

Accuracy of the model:0.9611296175928935
False Negatives Rates:0.03889252136270376
False Positive Rates:0.0388357743422837

The new predictive model shows a huge improvement. If Fannie Mae used the random forest model, its overall default rate would decrease from 8.89% to $$1 - \text{precision of non-defaulters} = 1 - 0.94 = 6\%,$$ which would improve the performance of Fannie Mae's mortgage-backed securities on the bond market.
We further checked the ROC curve.

In [35]:
fpr, tpr, thresholds = metrics.roc_curve(y3_test, model_rf.predict_proba(X3_test)[:,1])
roc_auc = metrics.roc_auc_score(y3_test, predictions_rf)

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

The AUC value is also far better than that of the logistic regression model.

Conclusion for section 5

After thorough exploration, we believe that the random forest classifier, trained on the data set that uses dummy variables for categorical features, brings an impressive improvement to Fannie Mae's current risk management process. Fannie Mae could benefit from a decrease of nearly 3 percentage points in its default rate if it put this classifier into use.

6. Conclusions

After exploring Fannie Mae's loan acquisition and performance dataset, we provide the following results that may help Fannie Mae decrease the default rate of the loans it purchases.

  1. Fannie Mae should give higher preference to commercial banks with a good performance history, including JP Morgan Chase, Wells Fargo, First Tennessee Bank, PNC Bank, CitiMortgage and Chase Home Finance. Meanwhile, Fannie Mae should reduce the number of loans purchased from banks that perform below average, including AmTrust Bank, GMAC Mortgage, Flagstar Capital, Bank of America and SunTrust Mortgage. Fannie Mae should also consider exploring more opportunities to work with smaller local banks.
  2. Fannie Mae should give higher preference to loans carried by borrowers with a high credit score, probably borrowers whose credit score is higher than 730.
  3. We delivered a random forest classifier that can help Fannie Mae decrease its default rate from 8.89 percent to about 6 percent. Although a decrease of nearly 3 percentage points may not sound significant, it can save Fannie Mae a lot of money considering that it purchases more than 300 thousand loans per quarter.
