import seaborn as sns
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stat

from sklearn.ensemble import RandomForestRegressor
from sklearn import cross_validation as cv
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

from imblearn.combine import SMOTEENN

import matplotlib.pyplot as plt
%matplotlib inline

1. Introduction¶

1.1 Background¶

Fannie Mae is the government spnsored corporation whose primary purpose is to purchase mortgage loans from primary lenders such as big commercial banks, say Chase. It then sells those loans as securities in the bond market. Fannie Mae's goverment-backed backgruond benifits the corporation in several ways. First, the federal background exposure it to more funding resources than most local commercial banks. Second, the securities sold by Fannie Mae on the bond market have higher credit ranking since many long term investers believe a security is not liekly to default if it is somehow related to federal goverment. Fannie Mae claims that they "expand the secondary mortgage market by securitizing mortgages in the form of mortgage-backed securities, allowing lenders to reinvest their assets into more lending and in effect increasing the number of lenders in the mortgage market by reducing the reliance on locally based savings and loan associations". In short words, Fannie Mae connects investors all around the world to U.S.'s mortgage market. It provides funding liquidity to local lenders and provides U.S. martgage market access to global investers.
Even though Fannie Mae is supported by the federal goverment, the performance of its mortgage-backed securities is still strongly related to the repayment of loans purchased by Fannie Mae. No primary lenders can guarantee that the borrowers they pick will never default. For Fannie Mae, it need to avoid those borrowers who are very likely to default at the end when they purchase mortgages from primary lenders.

1.2 Project Goal¶

This project is to analyse the Fannie Mae Single-Family Loan Performance Data. Fannie Mae release data for acquisition and performance of loans every quarter. This dataset contains a lot of information and records from the borrowers during their loan-repay processing. We try to see whether we can predict from this data about the risk of defaulting for given borrowers.

2. Data Gathering.¶

The Fannie Mae data are located at :https://loanperformancedata.fanniemae.com/lppub/index.html. It is free for public and anyone can download it after register with Fanie Mae. Fannie Mae also provides an instruction on how to understand the format of the dataset.

2.1 Import Fannie Mae Data¶

The Fannie Mae data consists of two separate txt files for each quater. The first one is the "Acuisition" file, which contains static informaion when Fannie Mae acquired a single family loan from a commercial bank. The second one is the "Performance" file, which contains time series information of the payment history of a specific loan. Notice taht those two files do not contain any columns names so we need to provide the name specified in the instruction provided by Fannie Mae.

col_acq = ['LoanID','Channel','SellerName','OrInterestRate','OrUnpaidPrinc','OrLoanTerm',
        'OrDate','FirstPayment','OrLTV','OrCLTV','NumBorrow','DTIRat','CreditScore',
        'FTHomeBuyer','LoanPurpose','PropertyType','NumUnits','OccStatus','PropertyState',
        'Zip','MortInsPerc','ProductType','CoCreditScore','MortInsType','RelMortInd'];

col_per = ['LoanID','MonthRep','Servicer','CurrInterestRate','CAUPB','LoanAge','MonthsToMaturity',
          'AdMonthsToMaturity','MaturityDate','MSA','CLDS','ModFlag','ZeroBalCode','ZeroBalDate',
          'LastInstallDate','ForeclosureDate','DispositionDate','PPRC','AssetRecCost','MHRC',
          'ATFHP','NetSaleProceeds','CreditEnhProceeds','RPMWP','OFP','NIBUPB','PFUPB','RMWPF',
          'FPWA','ServicingIndicator'];

df_acq = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Acquisition_2007Q4.txt', sep='|', names=col_acq, index_col=False)
df_perf = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Performance_2007Q4.txt', sep='|', names=col_per, usecols=[0, 15], index_col=False)

2.2 Data Wrangling¶

In this project, the question we want to answer is that whether a single family loan will ended up as default. We need to find a way to investigate 'Acquisition' and 'Performance' at the same time.
As a starting point, we only keep the last row of loan payment history under each distinct LoanID. We do it that way since a loan is considered default at the end once it can not fullfill the payment requirement during any time period. We then retrieve column ForecolsureDate from 'Performance' file and merge it with 'Acuisition' file by LoanID. Also, we rename the Foreclosure Date as Default and we fill the 0 to the null values and fill the 1 to all values which is not the 0. when default still equal 0, it means that the Load ID still repay their Loan mortgage.when default=1, it means that the Load ID stop repay their Loan mortgage.

df_perf.drop_duplicates(subset='LoanID', keep='last', inplace=True)
df = pd.merge(df_acq, df_perf, on='LoanID', how='inner')

df.rename(index=str, columns={"ForeclosureDate": 'Default'}, inplace=True)
df['Default'].fillna(0, inplace=True)
df.loc[df['Default'] != 0, 'Default'] = 1

df['Default'] = df['Default'].astype(int)

Now lets take a look at the data frame we have.

print("number of attributes:",len(df.columns))
print("number of records:",len(df))

number of attributes: 26
number of records: 391209

df

We can see that the dataframe has 391,209 rows and 26 columns, and contains several dooferent type of information. There are numerical data like Interest Rate and Credit Score. There are textual data such as Seller's Name and Property State. There are also time stamp data. We can notice that many columns contain missing values that need to be either filled in or removed before we do any analysis.

3. Data Cleaning and Transformation.¶

Data cleaning and transformation is an important step before we go to build our predicting models. As shown in previuos section, there are two main problems. First, the Fannie Mae dataset contains a lot of missing values that need to be handled. Second, there are alot of categorical features whose values are strings. We need to transform those features' type such that they can be processed in popular python machine learning packages.

3.1 Data Cleanning¶

Let's now check how many how many missing values are there for each feature. We can see from below that 'MortInsPerc', 'CoCreditScore' and 'MortInsType' have too many null values. It will not be meaning to fullfill those null values since more than 50% observations doesn't have those information. We can simply drop them.

df.apply(lambda x: x.isnull().sum(), axis=0)

LoanID                 0
Channel                0
SellerName             0
OrInterestRate         1
OrUnpaidPrinc          0
OrLoanTerm             0
OrDate                 0
FirstPayment           0
OrLTV                  0
OrCLTV                32
NumBorrow              6
DTIRat             11481
CreditScore          626
FTHomeBuyer            0
LoanPurpose            0
PropertyType           0
NumUnits               0
OccStatus              0
PropertyState          0
Zip                    0
MortInsPerc       306284
ProductType            0
CoCreditScore     234970
MortInsType       306284
RelMortInd             0
Default                0
dtype: int64

We also checked the number of unique values of each feature. Notice that column "ProductType" has only 1 unique value, which means it has the same effect on all observations. So we drop that column in further analysis. In contrary, the column "LoanID" has different value for every observation since it is required to be distinct as an key value. It is obvious that "LoanID" doesn't has any effect on the default probability of a loan and we can drop it.

df.apply(lambda x: x.nunique(), axis=0)

LoanID            391209
Channel                3
SellerName            14
OrInterestRate       528
OrUnpaidPrinc        649
OrLoanTerm           140
OrDate                53
FirstPayment          53
OrLTV                 97
OrCLTV               120
NumBorrow              9
DTIRat                64
CreditScore          386
FTHomeBuyer            3
LoanPurpose            3
PropertyType           5
NumUnits               4
OccStatus              3
PropertyState         54
Zip                  893
MortInsPerc           18
ProductType            1
CoCreditScore        359
MortInsType            2
RelMortInd             2
Default                2
dtype: int64

Now let's drop all the features mentioned above.

df.drop(['LoanID','MortInsPerc','MortInsType','CoCreditScore','ProductType'], axis=1, inplace=True)

Here we fill the missing value for all numerical features. Instead of simpling filling NAN with mean value, we use regression techniques.

nan_columns = df.columns[df.isnull().any()]
for name in nan_columns:
    if df[name].dtypes != 'object':
        df_number = df.select_dtypes(exclude=['object'])
        y = df_number.loc[df[name].notnull(), name].values
        X = df_number.loc[df[name].notnull()].drop(nan_columns, axis=1).values
        X_test = df_number.loc[df[name].isnull()].drop(nan_columns, axis=1).values
        print("Filling Column:",name)
        model = RandomForestRegressor(n_estimators=400, max_depth=3)
        model.fit(X, y)
        df.loc[df[name].isnull(), name] = model.predict(X_test)
        print("Filling Finish")

Filling Column: OrInterestRate
Filling Finish
Filling Column: OrCLTV
Filling Finish
Filling Column: NumBorrow
Filling Finish
Filling Column: DTIRat
Filling Finish
Filling Column: CreditScore
Filling Finish

Let's make sure that our filling in procedure works well and there is no missing value in the data frame anymore.

#df.to_csv('2007Q4_cleaned.txt', header=True, index=False, sep='\t', mode='a')
df.apply(lambda x: x.isnull().sum(), axis=0)

Channel           0
SellerName        0
OrInterestRate    0
OrUnpaidPrinc     0
OrLoanTerm        0
OrDate            0
FirstPayment      0
OrLTV             0
OrCLTV            0
NumBorrow         0
DTIRat            0
CreditScore       0
FTHomeBuyer       0
LoanPurpose       0
PropertyType      0
NumUnits          0
OccStatus         0
PropertyState     0
Zip               0
RelMortInd        0
Default           0
dtype: int64

From the result above we can guarantee that we don't need to worry about missing value in our further analysis anymore.

3.2 Data Transformation¶

Recall that the features we have in the Fannie Mae dataset belong to different data types. It usually will cause trouble if we put different types of data together into one predictive model. So let's check what type of features we have in our data frame first.

df.dtypes

Channel            object
SellerName         object
OrInterestRate    float64
OrUnpaidPrinc       int64
OrLoanTerm          int64
OrDate             object
FirstPayment       object
OrLTV               int64
OrCLTV            float64
NumBorrow         float64
DTIRat            float64
CreditScore       float64
FTHomeBuyer        object
LoanPurpose        object
PropertyType       object
NumUnits            int64
OccStatus          object
PropertyState      object
Zip                 int64
RelMortInd         object
Default             int32
dtype: object

Is is shown that all features in this Pandas dataframe are stored as either numbers or objects. Those features with 'object' type are actualy categorical variables. We replicate this dataframe to a new dataframe and used simple natrual numbers to represent different values of categorical variables. For example, suppose we have "BoA" and "Chase" as loan sellers, we may then use 1 for "BoA" and 2 for "Chase". The newly created dataframe is more suitable to train a predictive model.
Notice that we also treat zip code as a categorical feature. Although zip code appeared to be numbers but it is meaningless to apply arithmetic ruls on zip code. So zip code is indeed a categorical feature.

df_2 = df.copy()

num_columns = ['OrInterestRate','OrUnpaidPrinc','OrLoanTerm','OrLTV','OrCLTV',
               'NumBorrow','DTIRat','CreditScore','NumUnits'];
cat_columns = ['Channel','SellerName','FTHomeBuyer','LoanPurpose','PropertyType',
               'OccStatus','PropertyState','RelMortInd','Zip'];

for column in cat_columns:
    df_2[column]=df_2[column].astype("category").cat.codes

We performed another potentially pre-processing step, that is, we split up any date columns into their month and year components. These may be effective since the time stamp may include some seasonal information.

df_2['OrDateMonth'] = df_2['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['OrDateYear'] = df_2['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_2['FirstMonth'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['FirstYear'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_2.drop(['OrDate','FirstPayment'], axis=1, inplace=True)

Now all features in the new dataframe, called df_2, are represented as numbers.

df_2.dtypes

Channel              int8
SellerName           int8
OrInterestRate    float64
OrUnpaidPrinc       int64
OrLoanTerm          int64
OrLTV               int64
OrCLTV            float64
NumBorrow         float64
DTIRat            float64
CreditScore       float64
FTHomeBuyer          int8
LoanPurpose          int8
PropertyType         int8
NumUnits            int64
OccStatus            int8
PropertyState        int8
Zip                 int16
RelMortInd           int8
Default             int32
OrDateMonth         int32
OrDateYear          int32
FirstMonth          int32
FirstYear           int32
dtype: object

Another popular way to deal with categorical feature is to create s dummy variable for each possible value of the feature. Considering that some of our categoricla features have more than a doze posible values (such as "Seller's Name"), we prefer not to use this method unless it is neccessary for our predictive model to perform well.

4. Simple Indicators Exploration.¶

In this section we are trying to see that if some simple indicators can be useful in the decision process when Fannie Mae purchases new loans. Remember that Fannie Mae buys these loans from banks and generate cash flow through the repayment of individual borrower, so we believe we need to pay attention to banks who sell loans and to individual borrowers behind each loan.
Compared to local martgage lenders, Fannie Mae doesn't have the ability to get in touch with borrowers face by face. In contrast, commercial banks have agents who can talk to borrowers and collect more detailed information when a borrower applies for loan. If a commercial bank is more capable at picking borrowers with reliable repayment, Fannie Mae should consider buying loan from this bank first.
We also want to explore that if there is any difference that can be seen between realible borrowers and defaulting borrowers. We hope that reliable borrowers may perform significantly different at some indicators than defaulting borrowers.

In summary, we decided to explore the following questions:

1. Which bank are more unlikely to sell loans that end up defaulting?
2. Is there any difference at any features between good borrowers and bad borrowers? If yes, which feature is the best indicator to distinct good and bad indicator?

4.1 Which bank are more unlikely to sell loans that end up defaulting?¶

In this section we tried to visulize the percentage of default loans sold by each bank. We also compared the default rate of each bank with the avergae default rate of the whole market and tried to find banks that have better than average performance.

xtab = pd.pivot_table(df, index='SellerName', columns='Default', aggfunc='size')
xtab[0]['Overall']=xtab[0].sum()
xtab[1]['Overall']=xtab[1].sum()

df_seller = pd.concat([xtab[0],xtab[1]], axis=1)
df_seller

#xtab = pd.pivot_table(df, index='SellerName', columns='Default', aggfunc='size')
df_seller = df_seller.div(df_seller.sum(axis=1), axis=0)
df_seller = df_seller.sort_values((0))


df_seller.plot.barh(stacked=True, figsize=(6,4))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.xlabel('Fraction of Borrowers')
plt.ylabel('Sellers')

Text(0,0.5,'Sellers')

From the figure above, we can see that HSBC Bank has an extraordinary good performance that nearly no loan came from HSBC Bank was default at the end. However, we must keep in mind that HSBC only sold a little more than 5 hundreds loans to Finnie Mae in this quater. This number is really small compared to other banks. The low default rate of HSBC's loan may be biased since the data amount of HSBC may be not enough.
Loans purchased from the following major banks have lower default rate than the overall market: JP Morgan Chase, Wells Fargo, First Tennesse Bank, PNC Bank, Citi Mortgage and Chase Home Finance. Considering that these banks are already large commercial banks, we were confident to say that these banks show stronger capability at risk management.
On the other hand, some major banks such as Bank of American performs worse than overall average. We suggest that Fannie Mae should reduce the number of loans purchased from these banks.
We also noticed an interesting phenomenon related to all the smaller banks that are not listed specifically in the dataset. It can be seen that the average default rate of smaller banks are lower than all other major banks except HSBC. Our guess is that those smaller banks have more local connections to their customers and those connections help them know the borrowers better during mortgage application processes.

Conclusion for section 4.1¶

In summary, we have three advices for Fannie Mae.

1. Fannie Mae should give higher preference to the following large commercial banks: JP Morgan Chase, Wells Fargo, First Tennesse Bank, PNC Bank, Citi Mortgage and Chase Home Finance.
2. Fannie Mae should reduce number of loan purchased from the following banks: Amtrust Bank, GMAC Mortgage, Flagstar Capital, Bank of America and Suntrust Mortgage.
3. Fannie Mae should consider exploring more opportunities to work with smaller local banks.

4.2 Is there any difference at any features between good borrowers and bad borrowers??¶

To answer the quesion, we explore several features using box plots. In each sub plot, the blue box show the distribitnion of value of that in feature spanned by the default class, while the orange boxe show value spanned by the non-default class. For the sake of simplicity, we focused only on 6 numerical features: original interest rate, original unpaid principle, original LTV, original CLTV, debt to income rate and credit score.

num_columns_2 = ['OrInterestRate','OrUnpaidPrinc','OrLTV','OrCLTV','DTIRat','CreditScore'];

fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(8,6))
plt.tight_layout(w_pad=4.0, h_pad=4.0)

for i, column in zip(range(1,len(num_columns)+1), num_columns_2):
    plt.subplot(2,3,i)
    sns.boxplot(x="Default", y=column, data=df, linewidth=0.5)
    plt.xlabel('Default')

The above figures show that on average, defaulters tend to have higher original interest rate, higher date to income rate and lower credit score. From the figure, it looks like credit score is the feature that has the most difference on both distribution and average value. So we decide to verify the assumption that defaulters and non-defaulters behave significantly different at credit score.

Are defaulters and non-defaulters behave different at credit score?¶

To answer this question, we first explore some key statistics of the distribution of credit score numerically for both defaulters and non-defaulters.

crdscr_cmp=pd.concat([df[['CreditScore']][df['Default']==0].describe(),df[['CreditScore']][df['Default']==1].describe()],axis=1)
crdscr_cmp.columns = ['CreditScore when Default=0', 'CreditScore when Default=1']
crdscr_cmp

Credit score of non-defaulters are clearly higher than defaulters when compare most statistics such as mean, 25% quantile, median and 75% quantile. To verify that non-defaulters did tend to have higher credit score, we furtherly decide to run the following hypothesis testing. Let $\mu_0$ be the average credit score of non-defaulters, $\mu_1$ be the average credit score of defaulters. The hypothesis testing is formulated as $$ H_0: \mu_0 \leq \mu_1 $$ $$ H_1: \mu_0 > \mu_1 $$ We did one sided t-test for this hypothesis.

result = stat.ttest_ind(df[df['Default'] == 0]['CreditScore'],df[df['Default'] == 1]['CreditScore'],equal_var = False)
result

Ttest_indResult(statistic=100.17299990501223, pvalue=0.0)

Recall that the scipy function ttest_ind we used above always gives the test statistic as signed. Given t and p values from the two-tailed test, we should reject the null hypothesis of a greater-than test when p/2 < alpha and t > 0, and of a less-than test when p/2 < alpha and t < 0. From the above result, we have $$ t > 0 $$ $$ p/2 < .05 $$ So we are confident to reject the null hypothesis and say that average scredit score of non-defaulters are higher than average credit score of non0defaulters.

We also visulized the distribution density of credit score of both defaulters and non-defaulters. We tried to see that if there is a boundary that we can use to distinct defaulters from non-defaulters.

cs_nondefault=df['CreditScore'][df['Default']==0].dropna()
cs_default=df['CreditScore'][df['Default']==1].dropna()

sns.set()
sns.distplot( cs_nondefault , color="skyblue", label="Default=0")
sns.distplot( cs_default , color="red", label="Default=1")
plt.legend()

<matplotlib.legend.Legend at 0x1c00ac0d978>

From the distribution density plot, we can see that the distribution of credit score for non-defaulters is clearly skewed to higher value. Although the two distributions are heavily overlapped, we may say that the boundary between good and bad credit score is around 730.

Conclusion for section 4.2¶

In summary, we can say that

1. Non-defaulters tend to have higher credit score than defaulters.
2. If a borrower's credit score is higher than 730, s/he probably won't default.

5. Predicting Bad Loans¶

In this section we are trying to build a sophiscated predictive model that can make use of all the information hidden in every feature for Fannie Mae. Predicting whether a loan will end up as default is a binary clssification problem. Before trainning our model, we first need to furtherly pre-process the data.

5.1 Data Preprocessing for Training¶

Here we did 3 data preprocessing for steps:

1. We scaled the numerical fetures.
2. We resampled the dataset to solve the imbalance problem for the response feature "Default".
3. We splitted the dataset into training and testing datasets.

#Step 1: rescaling
scaler = StandardScaler()
df_2[num_columns] = scaler.fit_transform(df_2[num_columns])

#Step 2: resampling
sm = SMOTEENN()

y = df_2['Default'].values
X = df_2.drop(['Default'], axis=1).values

X_resampled, y_resampled = sm.fit_sample(X, y)

#Step 3: splitting
X_train, X_test, y_train, y_test = cv.train_test_split(X_resampled, y_resampled, test_size = 0.25, random_state=0)

5.2 Training with Logistic Regression¶

To figure out whether a loan will be default or not is a typical two classes classification problem. Logistic regression is one of the most popular classification method since it is powerful yet simple. We started with feeding our data into a logistic regression model to see if it works well.

model_fitted=LogisticRegressionCV(random_state=1,cv=3).fit(X_train,y_train)
predictions = model_fitted.predict(X_test)

# Functions for model evaluation
def compute_false_negatives(target,predictions):
    false_negatives=pd.DataFrame({"target":target,"predictions":predictions})
    neg_rate=false_negatives[(false_negatives["target"]==1) & (false_negatives["predictions"]==0)].shape[0]/(false_negatives[(false_negatives["target"]==1)].shape[0]+1)
    return neg_rate

def compute_false_positive(target,predictions):
    false_positives=pd.DataFrame({"target":target,"predictions":predictions})
    pos_rate=false_positives[(false_positives["target"]==0) & (false_positives["predictions"]==1)].shape[0]/(false_positives[(false_positives["target"]==0)].shape[0]+1)
    return pos_rate

# Model evaluation
print(metrics.classification_report(y_test, predictions))

accuracy=metrics.accuracy_score(y_test,predictions)
FN=compute_false_negatives(y_test,predictions)
FP=compute_false_positive(y_test,predictions)
print("Accuracy of the model:{}".format(accuracy))
print("False Negatives Rates:{}".format(FN))
print("False Positive Rates:{}".format(FP))

             precision    recall  f1-score   support

          0       0.79      0.69      0.74     60637
          1       0.81      0.88      0.84     88294

avg / total       0.80      0.80      0.80    148931

Accuracy of the model:0.802196990552672
False Negatives Rates:0.12292881816637409
False Positive Rates:0.30682080543553547

It can be seen that our model has a accuracy of almost 80% for testing data. At least we can say that our model show some power over random guess. However, recall that the current default rate in the data set is 34793/391209=8.89%. If Fannie Mae use this logistic model, its default rate will be $$1 - \text{precision of non-defaulters}=1-0.79=21\%$$ It is far more worse than Fannie Mae's current performance. The logistic regression model need to be improved. We belive that one of the weakness of this model is that it is still "nice" to the potential defaulters since 30 percent defaulters is classified as non-defaulters. We checked the AUC curve to see if moving the decision boundary can be helpful.

fpr, tpr, thresholds = metrics.roc_curve(y_test, model_fitted.predict_proba(X_test)[:,1])
roc_auc = metrics.roc_auc_score(y_test, predictions)

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

The AUC value is 0.71, which is not bad but also not outstanding.

Improvement directions¶

We decided to improve our models from two directions. The first one is to improve the data transformation step. The second one is to make use of a more sophiscated classification model.

categorical features, we created dummy variables for each possible value of every categorical feature. This transformation method is more precise in theory and is widely used in several statistical tasks such as experimental design.

def getdummies(df):
    columns = df.columns[df.isnull().any()]
    nan_cols = df[columns]

    df.drop(nan_cols.columns, axis=1, inplace=True)

    cat = df.select_dtypes(include=['object'])
    num = df.drop(cat.columns, axis=1)

    data = pd.DataFrame()
    for i in cat.columns:
        tmp = pd.get_dummies(cat[i], drop_first=True)
        data = pd.concat([data, tmp], axis=1)

    df = pd.concat([num,data,nan_cols], axis=1).reset_index(drop=True)
    return df

df_3 = df.copy()

df_3['OrDateMonth'] = df_3['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['OrDateYear'] = df_3['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_3['FirstMonth'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['FirstYear'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)

df_3.drop(['OrDate','FirstPayment'], axis=1, inplace=True)

df_3 = getdummies(df_3)

df_3.dtypes

OrInterestRate                               float64
OrUnpaidPrinc                                  int64
OrLoanTerm                                     int64
OrLTV                                          int64
OrCLTV                                       float64
NumBorrow                                    float64
DTIRat                                       float64
CreditScore                                  float64
NumUnits                                       int64
Zip                                            int64
Default                                        int32
OrDateMonth                                    int32
OrDateYear                                     int32
FirstMonth                                     int32
FirstYear                                      int32
C                                              uint8
R                                              uint8
BANK OF AMERICA, N.A.                          uint8
CHASE HOME FINANCE, LLC                        uint8
CITIMORTGAGE, INC.                             uint8
FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB       uint8
FIRST TENNESSEE BANK NATIONAL ASSOCIATION      uint8
FLAGSTAR CAPITAL MARKETS CORPORATION           uint8
GMAC MORTGAGE, LLC                             uint8
HSBC BANK USA, NATIONAL ASSOCIATION            uint8
JPMORGAN CHASE BANK, NATIONAL ASSOCIATION      uint8
OTHER                                          uint8
PNC BANK, N.A.                                 uint8
SUNTRUST MORTGAGE INC.                         uint8
WELLS FARGO BANK, N.A.                         uint8
                                              ...   
MO                                             uint8
MS                                             uint8
MT                                             uint8
NC                                             uint8
ND                                             uint8
NE                                             uint8
NH                                             uint8
NJ                                             uint8
NM                                             uint8
NV                                             uint8
NY                                             uint8
OH                                             uint8
OK                                             uint8
OR                                             uint8
PA                                             uint8
PR                                             uint8
RI                                             uint8
SC                                             uint8
SD                                             uint8
TN                                             uint8
TX                                             uint8
UT                                             uint8
VA                                             uint8
VI                                             uint8
VT                                             uint8
WA                                             uint8
WI                                             uint8
WV                                             uint8
WY                                             uint8
Y                                              uint8
Length: 94, dtype: object

We then go through the same preprocessing steps and built the new trainning and testing data sets.

#Step 1: rescaling
scaler = StandardScaler()
df_3[num_columns] = scaler.fit_transform(df_3[num_columns])

#Step 2: resampling
sm = SMOTEENN()

y3 = df_3['Default'].values
X3 = df_3.drop(['Default'], axis=1).values

X3_resampled, y3_resampled = sm.fit_sample(X3, y3)

#Step 3: splitting
X3_train, X3_test, y3_train, y3_test = cv.train_test_split(X3_resampled, y3_resampled, test_size = 0.25, random_state=0)

We also switched our model from logistic regressions to random forrests. Random forrests is realtively new but has been shown to be powerful in classification task.

model_rf = RandomForestClassifier(n_estimators=250)
model_rf = model_rf.fit(X3_train, y3_train)
predictions_rf = model_rf.predict(X3_test)

# Model evaluation
print(metrics.classification_report(y3_test, predictions_rf))

accuracy_rf=metrics.accuracy_score(y3_test,predictions_rf)
FN_rf=compute_false_negatives(y3_test,predictions_rf)
FP_rf=compute_false_positive(y3_test,predictions_rf)
print("Accuracy of the model:{}".format(accuracy_rf))
print("False Negatives Rates:{}".format(FN_rf))
print("False Positive Rates:{}".format(FP_rf))

             precision    recall  f1-score   support

          0       0.94      0.96      0.95     58991
          1       0.97      0.96      0.97     88705

avg / total       0.96      0.96      0.96    147696

Accuracy of the model:0.9611296175928935
False Negatives Rates:0.03889252136270376
False Positive Rates:0.0388357743422837

The new predictive model shows huge improvement. If Fannie Mae use the random forrest model, its overall default rate will decrease from 8.89% to $$1 - \text{precision of non-defaulters}=1-0.94=6\%$$ Which can increase Fannie Mae's mortgage backed securities' performance on the bond market.
We furtherly checked the AUC curve.

fpr, tpr, thresholds = metrics.roc_curve(y3_test, model_rf.predict_proba(X3_test)[:,1])
roc_auc = metrics.roc_auc_score(y3_test, predictions_rf)

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

The AUC value is 0.71, which is far more better.

Conclusion for section 5¶

After thorough exploration, we belive that the random forrest classifier trained with data set which uses dummy variables for categorical features brings impressive improvement to Fannie Mae's current risk management processes. Fannie Mae can benifit a 2 percent default rate decreasing if take this classifier into use.

6. Conclusions¶

After explore Fannie Mae's loan acquisition and performance dataset, we have provide the following results that may help Fannie decrease the default rate of the loans it purchases.

Fannie Mae should give higher preference to commercial banks with good performance history, including JP Morgan Chase, Wells Fargo, First Tennesse Bank, PNC Bank, Citi Mortgage and Chase Home Finance. Meanwhile, Fannie Mae should reduce number of loan purchased from banks which performs below average, including Amtrust Bank, GMAC Mortgage, Flagstar Capital, Bank of America and Suntrust Mortgage. Fannie Mae should consider exploring more opportunities to work with smaller local banks.
Fannie Mae should give higher preference to loans carried by borrowers with high credit score, probably to boorrower whose credit score is higher than 730.
We delivered a random forrest classifier that can help Fannie decrease its default rate from 8.89 percent to 6 percent. Although 2 percent doesn't sound like significant, but it can help Fannie Mae save a lot of money considering Fannie purchase more than 300k loans per quarter.

References¶

[1] Vanderplas, J. T. (2016). Python data science handbook: Tools and techniques for developers. Beijing: OReilly.
[2] Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M. Ingersoll. "An introduction to logistic regression analysis and reporting." The journal of educational research 96.1 (2002): 3-14.
[3] Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R news 2.3 (2002): 18-22.
[4] Dimitrios Tsementzis. Lecture 1-7.

	LoanID	Channel	SellerName	OrInterestRate	OrUnpaidPrinc	OrLoanTerm	OrDate	FirstPayment	OrLTV	OrCLTV	...	NumUnits	OccStatus	PropertyState	Zip	MortInsPerc	ProductType	CoCreditScore	MortInsType	RelMortInd	Default
0	100002296854	C	BANK OF AMERICA, N.A.	6.625	343000	360	10/2007	12/2007	86	86.0	...	1	P	CO	809	25.0	FRM	756.0	2.0	N	0
1	100006876815	C	BANK OF AMERICA, N.A.	6.250	400000	360	10/2007	12/2007	62	62.0	...	1	P	CA	920	NaN	FRM	790.0	NaN	N	0
2	100008184591	B	OTHER	6.625	81000	360	11/2007	01/2008	64	82.0	...	1	P	LA	708	NaN	FRM	NaN	NaN	N	0
3	100008870761	B	AMTRUST BANK	6.500	119000	360	11/2007	01/2008	71	71.0	...	1	P	IL	600	NaN	FRM	NaN	NaN	N	0
4	100013284158	B	BANK OF AMERICA, N.A.	6.625	205000	360	10/2007	12/2007	27	27.0	...	1	P	CA	907	NaN	FRM	NaN	NaN	N	0
5	100016840211	R	PNC BANK, N.A.	6.375	210000	360	08/2007	10/2007	80	93.0	...	1	P	PA	160	NaN	FRM	NaN	NaN	N	0
6	100023772994	R	JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	6.750	172000	360	09/2007	11/2007	95	95.0	...	1	P	IL	605	30.0	FRM	NaN	1.0	N	0
7	100026162343	B	BANK OF AMERICA, N.A.	6.875	108000	360	11/2007	01/2008	90	90.0	...	2	I	MO	631	25.0	FRM	668.0	1.0	N	1
8	100030319603	C	JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	6.250	175000	180	10/2007	12/2007	70	70.0	...	1	P	IL	601	NaN	FRM	NaN	NaN	N	0
9	100031046995	C	BANK OF AMERICA, N.A.	6.500	224000	360	10/2007	12/2007	62	62.0	...	1	P	AZ	850	NaN	FRM	NaN	NaN	N	0
10	100036338486	B	BANK OF AMERICA, N.A.	7.000	47000	360	08/2007	11/2007	67	67.0	...	1	P	IN	464	NaN	FRM	NaN	NaN	N	0
11	100039990921	C	BANK OF AMERICA, N.A.	6.500	166000	360	10/2007	12/2007	95	95.0	...	1	S	VA	240	30.0	FRM	666.0	1.0	N	0
12	100043379676	R	BANK OF AMERICA, N.A.	6.125	70000	360	11/2007	01/2008	77	77.0	...	1	P	SC	298	NaN	FRM	686.0	NaN	N	0
13	100043973870	B	BANK OF AMERICA, N.A.	7.250	147000	360	08/2007	10/2007	90	90.0	...	1	I	IL	601	25.0	FRM	NaN	1.0	N	1
14	100046332832	B	SUNTRUST MORTGAGE INC.	6.500	45000	360	11/2007	01/2008	57	57.0	...	1	P	WV	257	NaN	FRM	739.0	NaN	N	0
15	100051174956	B	FIRST TENNESSEE BANK NATIONAL ASSOCIATION	5.625	204000	360	10/2007	12/2007	75	75.0	...	1	I	MA	21	NaN	FRM	NaN	NaN	N	1
16	100053251572	R	WELLS FARGO BANK, N.A.	6.375	120000	360	10/2007	12/2007	80	80.0	...	1	S	FL	327	NaN	FRM	676.0	NaN	N	0
17	100053457392	C	FLAGSTAR CAPITAL MARKETS CORPORATION	7.375	111000	360	08/2007	10/2007	75	93.0	...	1	S	TX	775	NaN	FRM	713.0	NaN	N	1
18	100055723684	C	FLAGSTAR CAPITAL MARKETS CORPORATION	6.875	331000	360	09/2007	11/2007	71	71.0	...	1	P	NY	109	NaN	FRM	609.0	NaN	N	0
19	100057662068	R	OTHER	6.625	372000	360	09/2007	11/2007	49	49.0	...	1	P	WA	981	NaN	FRM	673.0	NaN	N	0
20	100058998523	B	FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB	6.500	250000	360	12/2007	02/2008	50	50.0	...	1	P	CA	917	NaN	FRM	NaN	NaN	N	0
21	100059647383	C	BANK OF AMERICA, N.A.	6.625	159000	360	11/2007	12/2007	89	89.0	...	1	P	VA	244	25.0	FRM	785.0	1.0	N	1
22	100064375524	C	WELLS FARGO BANK, N.A.	6.750	224000	360	08/2007	10/2007	70	70.0	...	1	P	MO	631	NaN	FRM	NaN	NaN	N	0
23	100064846848	C	CITIMORTGAGE, INC.	6.250	156000	360	11/2007	12/2007	80	80.0	...	1	P	WV	253	NaN	FRM	683.0	NaN	N	0
24	100065596836	C	AMTRUST BANK	6.375	127000	360	08/2007	11/2007	55	55.0	...	1	P	IN	474	NaN	FRM	693.0	NaN	N	0
25	100066447645	C	BANK OF AMERICA, N.A.	7.250	264000	360	09/2007	11/2007	80	80.0	...	1	I	MD	207	NaN	FRM	724.0	NaN	N	0
26	100067374219	C	WELLS FARGO BANK, N.A.	6.750	268000	360	07/2007	09/2007	95	95.0	...	1	P	FL	323	25.0	FRM	NaN	1.0	N	0
27	100069119582	C	BANK OF AMERICA, N.A.	7.250	233000	360	12/2007	02/2008	92	92.0	...	1	P	IL	622	30.0	FRM	632.0	2.0	N	0
28	100070906603	B	SUNTRUST MORTGAGE INC.	6.625	332000	360	08/2007	10/2007	80	80.0	...	1	P	VA	231	NaN	FRM	773.0	NaN	N	0
29	100072621645	C	JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	6.250	313000	360	10/2007	12/2007	56	56.0	...	1	P	NY	109	NaN	FRM	NaN	NaN	N	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
391179	999921876990	B	FIRST TENNESSEE BANK NATIONAL ASSOCIATION	8.000	109000	360	09/2007	11/2007	75	75.0	...	1	I	UT	840	NaN	FRM	613.0	NaN	N	0
391180	999922546207	R	OTHER	6.625	280000	360	11/2007	01/2008	80	80.0	...	1	P	GA	300	NaN	FRM	NaN	NaN	N	0
391181	999927406299	R	OTHER	7.375	44000	360	10/2007	12/2007	80	80.0	...	1	I	IN	473	NaN	FRM	770.0	NaN	N	0
391182	999929944792	R	OTHER	6.625	350000	360	08/2007	10/2007	75	75.0	...	1	P	FL	333	NaN	FRM	NaN	NaN	N	1
391183	999930157069	B	BANK OF AMERICA, N.A.	6.875	261000	360	10/2007	12/2007	95	95.0	...	1	P	NV	891	30.0	FRM	584.0	1.0	N	0
391184	999933620064	B	PNC BANK, N.A.	6.375	143000	360	11/2007	01/2008	56	56.0	...	1	P	MD	217	NaN	FRM	755.0	NaN	N	0
391185	999934164645	R	PNC BANK, N.A.	6.125	105000	180	11/2007	01/2008	80	80.0	...	1	S	IN	460	NaN	FRM	739.0	NaN	N	0
391186	999934665412	B	BANK OF AMERICA, N.A.	5.875	178000	360	10/2007	12/2007	75	75.0	...	1	P	AZ	859	NaN	FRM	NaN	NaN	N	0
391187	999942349981	B	SUNTRUST MORTGAGE INC.	6.750	120000	360	11/2007	01/2008	89	89.0	...	1	P	MA	11	25.0	FRM	NaN	1.0	N	0
391188	999946001004	C	WELLS FARGO BANK, N.A.	6.500	267000	360	08/2007	10/2007	94	94.0	...	1	P	TX	750	25.0	FRM	NaN	1.0	N	0
391189	999947275603	B	SUNTRUST MORTGAGE INC.	6.750	149000	360	09/2007	11/2007	70	70.0	...	2	I	OR	974	NaN	FRM	NaN	NaN	N	0
391190	999947372245	B	FIRST TENNESSEE BANK NATIONAL ASSOCIATION	6.500	412000	360	09/2007	11/2007	80	95.0	...	1	P	TX	761	NaN	FRM	763.0	NaN	N	0
391191	999951031363	B	JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	6.125	375000	360	10/2007	12/2007	79	79.0	...	1	P	TX	773	NaN	FRM	NaN	NaN	N	0
391192	999954821017	C	JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	6.625	118000	360	09/2007	11/2007	80	80.0	...	1	P	MI	494	NaN	FRM	NaN	NaN	N	0
391193	999957860102	C	FLAGSTAR CAPITAL MARKETS CORPORATION	6.500	202000	360	09/2007	11/2007	89	89.0	...	1	P	WI	532	17.0	FRM	705.0	1.0	N	0
391194	999958944118	R	OTHER	6.875	101000	360	11/2007	01/2008	90	90.0	...	1	I	OH	442	25.0	FRM	791.0	2.0	N	0
391195	999960451839	C	BANK OF AMERICA, N.A.	6.250	160000	360	11/2007	01/2008	77	77.0	...	1	P	NJ	88	NaN	FRM	NaN	NaN	N	0
391196	999961761904	R	WELLS FARGO BANK, N.A.	6.500	176000	360	09/2007	11/2007	90	90.0	...	1	P	GA	300	25.0	FRM	NaN	1.0	N	0
391197	999963866940	C	BANK OF AMERICA, N.A.	6.500	212000	360	10/2007	12/2007	67	67.0	...	1	P	AL	354	NaN	FRM	727.0	NaN	N	0
391198	999964289708	C	CITIMORTGAGE, INC.	6.000	253000	360	11/2007	01/2008	56	56.0	...	1	P	DE	198	NaN	FRM	806.0	NaN	N	0
391199	999965448735	C	CITIMORTGAGE, INC.	5.750	225000	360	10/2007	12/2007	70	90.0	...	1	P	MA	10	NaN	FRM	NaN	NaN	N	0
391200	999970019289	B	FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB	6.000	350000	360	09/2007	11/2007	75	75.0	...	1	P	CA	902	NaN	FRM	NaN	NaN	N	0
391201	999974637553	C	BANK OF AMERICA, N.A.	6.750	215000	360	09/2007	11/2007	70	70.0	...	1	P	MD	210	NaN	FRM	713.0	NaN	N	0
391202	999977642464	C	GMAC MORTGAGE, LLC	6.500	223000	360	08/2007	11/2007	88	88.0	...	1	P	IL	604	25.0	FRM	783.0	1.0	N	0
391203	999977655671	B	AMTRUST BANK	6.000	273000	360	12/2007	02/2008	54	58.0	...	1	P	VA	201	NaN	FRM	788.0	NaN	N	0
391204	999984651015	R	OTHER	6.500	139000	360	11/2007	01/2008	58	58.0	...	1	P	MO	641	NaN	FRM	NaN	NaN	N	0
391205	999986532145	R	OTHER	6.375	47000	180	01/2007	03/2007	76	76.0	...	1	P	OH	441	NaN	FRM	NaN	NaN	N	0
391206	999988683412	C	CITIMORTGAGE, INC.	6.125	135000	360	11/2007	01/2008	56	56.0	...	1	I	CA	913	NaN	FRM	NaN	NaN	N	0
391207	999999388912	B	BANK OF AMERICA, N.A.	6.250	344000	360	11/2007	01/2008	80	80.0	...	1	P	PA	155	NaN	FRM	771.0	NaN	N	0
391208	999999565669	B	PNC BANK, N.A.	6.750	360000	360	09/2007	11/2007	38	38.0	...	1	P	FL	330	NaN	FRM	NaN	NaN	N	0

	0	1
SellerName
AMTRUST BANK	10051	989
BANK OF AMERICA, N.A.	99410	12218
CHASE HOME FINANCE, LLC	6223	607
CITIMORTGAGE, INC.	33972	3216
FDIC, RECEIVER, INDYMAC FEDERAL BANK FSB	7730	1598
FIRST TENNESSEE BANK NATIONAL ASSOCIATION	7067	617
FLAGSTAR CAPITAL MARKETS CORPORATION	11812	1438
GMAC MORTGAGE, LLC	20755	2107
HSBC BANK USA, NATIONAL ASSOCIATION	532	5
JPMORGAN CHASE BANK, NATIONAL ASSOCIATION	36786	2410
OTHER	55376	3235
PNC BANK, N.A.	25126	2232
SUNTRUST MORTGAGE INC.	14156	1793
WELLS FARGO BANK, N.A.	27420	2328
Overall	356416	34793

	CreditScore when Default=0	CreditScore when Default=1
count	356416.000000	34793.000000
mean	722.983789	690.826426
std	60.983618	56.766786
min	392.000000	437.000000
25%	677.000000	649.000000
50%	730.000000	687.000000
75%	776.000000	733.000000
max	843.000000	827.000000