import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats as stat
from sklearn.ensemble import RandomForestRegressor
from sklearn import model_selection as cv  # provides train_test_split (replaces the removed cross_validation module)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTEENN
import matplotlib.pyplot as plt
%matplotlib inline
Fannie Mae is a government-sponsored corporation whose primary purpose is to purchase mortgage loans from primary lenders such as big commercial banks, say Chase. It then sells those loans as securities in the bond market. Fannie Mae's government backing benefits the corporation in several ways. First, the federal connection exposes it to more funding resources than most local commercial banks. Second, the securities sold by Fannie Mae on the bond market earn higher credit ratings, since many long-term investors believe a security is unlikely to default if it is somehow related to the federal government. Fannie Mae claims that they "expand the secondary mortgage market by securitizing mortgages in the form of mortgage-backed securities, allowing lenders to reinvest their assets into more lending and in effect increasing the number of lenders in the mortgage market by reducing the reliance on locally based savings and loan associations". In short, Fannie Mae connects investors all around the world to the U.S. mortgage market: it provides funding liquidity to local lenders and gives global investors access to the U.S. mortgage market.
Even though Fannie Mae is supported by the federal government, the performance of its mortgage-backed securities still depends strongly on the repayment of the loans it purchases. No primary lender can guarantee that the borrowers it picks will never default. Fannie Mae therefore needs to avoid borrowers who are very likely to default when it purchases mortgages from primary lenders.
This project analyzes the Fannie Mae Single-Family Loan Performance Data. Fannie Mae releases acquisition and performance data for its loans every quarter. The dataset contains a large amount of information recorded from borrowers during the loan repayment process. We try to see whether we can use this data to predict the default risk of a given borrower.
The Fannie Mae data are located at https://loanperformancedata.fanniemae.com/lppub/index.html. They are free to the public, and anyone can download them after registering with Fannie Mae. Fannie Mae also provides instructions on how to interpret the format of the dataset.
The Fannie Mae data consist of two separate txt files for each quarter. The first is the "Acquisition" file, which contains static information recorded when Fannie Mae acquired a single-family loan from a commercial bank. The second is the "Performance" file, which contains time-series information on the payment history of a specific loan. Notice that these two files do not contain any column names, so we need to provide the names specified in the instructions provided by Fannie Mae.
col_acq = ['LoanID','Channel','SellerName','OrInterestRate','OrUnpaidPrinc','OrLoanTerm',
'OrDate','FirstPayment','OrLTV','OrCLTV','NumBorrow','DTIRat','CreditScore',
'FTHomeBuyer','LoanPurpose','PropertyType','NumUnits','OccStatus','PropertyState',
'Zip','MortInsPerc','ProductType','CoCreditScore','MortInsType','RelMortInd'];
col_per = ['LoanID','MonthRep','Servicer','CurrInterestRate','CAUPB','LoanAge','MonthsToMaturity',
'AdMonthsToMaturity','MaturityDate','MSA','CLDS','ModFlag','ZeroBalCode','ZeroBalDate',
'LastInstallDate','ForeclosureDate','DispositionDate','PPRC','AssetRecCost','MHRC',
'ATFHP','NetSaleProceeds','CreditEnhProceeds','RPMWP','OFP','NIBUPB','PFUPB','RMWPF',
'FPWA','ServicingIndicator'];
df_acq = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Acquisition_2007Q4.txt', sep='|', names=col_acq, index_col=False)
df_perf = pd.read_csv('/Users/xu_zh/Dropbox/Tony_Rutgers/2018_1_Spring_Intepretation of Data/4-Homework/Performance_2007Q4.txt', sep='|', names=col_per, usecols=[0, 15], index_col=False)
In this project, the question we want to answer is whether a single-family loan will end up in default. We need a way to investigate the 'Acquisition' and 'Performance' files together.
As a starting point, we keep only the last row of the payment history under each distinct LoanID. We do it this way because a loan is ultimately considered in default once it fails to meet the payment requirement during any time period. We then retrieve the ForeclosureDate column from the 'Performance' file and merge it with the 'Acquisition' file by LoanID. We also rename ForeclosureDate to Default, fill 0 into the null values, and fill 1 into all values that are not 0. When Default equals 0, the borrower behind that LoanID is still repaying the mortgage; when Default equals 1, the borrower has stopped repaying the mortgage.
df_perf.drop_duplicates(subset='LoanID', keep='last', inplace=True)
df = pd.merge(df_acq, df_perf, on='LoanID', how='inner')
df.rename(index=str, columns={"ForeclosureDate": 'Default'}, inplace=True)
df['Default'].fillna(0, inplace=True)
df.loc[df['Default'] != 0, 'Default'] = 1
df['Default'] = df['Default'].astype(int)
Now let's take a look at the data frame we have.
print("number of attributes:",len(df.columns))
print("number of records:",len(df))
df
We can see that the dataframe has 391,209 rows and 26 columns, and contains several different types of information. There are numerical data such as Interest Rate and Credit Score. There are textual data such as Seller's Name and Property State. There are also time stamp data. We can notice that many columns contain missing values that need to be either filled in or removed before we do any analysis.
Data cleaning and transformation is an important step before we build our predictive models. As shown in the previous section, there are two main problems. First, the Fannie Mae dataset contains a lot of missing values that need to be handled. Second, there are a lot of categorical features whose values are strings. We need to transform those features so that they can be processed by popular Python machine learning packages.
Let's now check how many missing values there are for each feature. We can see below that 'MortInsPerc', 'CoCreditScore' and 'MortInsType' have too many null values. It would not be meaningful to fill in those null values since more than 50% of the observations lack that information. We can simply drop them.
df.apply(lambda x: x.isnull().sum(), axis=0)
We also checked the number of unique values of each feature. Notice that the column "ProductType" has only 1 unique value, which means it has the same effect on all observations, so we drop that column in further analysis. In contrast, the column "LoanID" has a different value for every observation since it is required to be distinct as a key. "LoanID" obviously has no effect on the default probability of a loan, so we can drop it as well.
df.apply(lambda x: x.nunique(), axis=0)
Now let's drop all the features mentioned above.
df.drop(['LoanID','MortInsPerc','MortInsType','CoCreditScore','ProductType'], axis=1, inplace=True)
Here we fill the missing values for all numerical features. Instead of simply filling NaN with the mean value, we use regression techniques.
nan_columns = df.columns[df.isnull().any()]
for name in nan_columns:
    if df[name].dtypes != 'object':
        # Use the complete numerical columns to predict the missing values of this column
        df_number = df.select_dtypes(exclude=['object'])
        y = df_number.loc[df[name].notnull(), name].values
        X = df_number.loc[df[name].notnull()].drop(nan_columns, axis=1).values
        X_test = df_number.loc[df[name].isnull()].drop(nan_columns, axis=1).values
        print("Filling Column:", name)
        model = RandomForestRegressor(n_estimators=400, max_depth=3)
        model.fit(X, y)
        df.loc[df[name].isnull(), name] = model.predict(X_test)
print("Filling Finished")
Let's make sure that our filling procedure worked and that there are no missing values left in the data frame.
#df.to_csv('2007Q4_cleaned.txt', header=True, index=False, sep='\t', mode='a')
df.apply(lambda x: x.isnull().sum(), axis=0)
From the result above we can confirm that we no longer need to worry about missing values in our further analysis.
Recall that the features in the Fannie Mae dataset belong to different data types. Putting different types of data together in one predictive model usually causes trouble, so let's first check what types of features we have in our data frame.
df.dtypes
It is shown that all features in this Pandas dataframe are stored as either numbers or objects. The features with 'object' type are actually categorical variables. We replicate this dataframe into a new dataframe and use simple natural numbers to represent the different values of the categorical variables. For example, suppose we have "BoA" and "Chase" as loan sellers; we may then use 0 for "BoA" and 1 for "Chase". The newly created dataframe is more suitable for training a predictive model.
Notice that we also treat zip code as a categorical feature. Although zip codes appear to be numbers, it is meaningless to apply arithmetic rules to them, so zip code is indeed a categorical feature.
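As a quick illustration of what this integer encoding produces, here is a minimal sketch using made-up seller values (the toy Series below is purely hypothetical):
# Toy example with hypothetical seller names: cat.codes assigns small integers starting from 0
toy_sellers = pd.Series(['BoA', 'Chase', 'BoA', 'Chase'])
print(toy_sellers.astype('category').cat.codes)   # 0, 1, 0, 1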
df_2 = df.copy()
num_columns = ['OrInterestRate','OrUnpaidPrinc','OrLoanTerm','OrLTV','OrCLTV',
'NumBorrow','DTIRat','CreditScore','NumUnits'];
cat_columns = ['Channel','SellerName','FTHomeBuyer','LoanPurpose','PropertyType',
'OccStatus','PropertyState','RelMortInd','Zip'];
for column in cat_columns:
    df_2[column] = df_2[column].astype("category").cat.codes
We performed another potentially useful pre-processing step: we split each date column into its month and year components. This may be helpful because the time stamp may carry some seasonal information.
df_2['OrDateMonth'] = df_2['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['OrDateYear'] = df_2['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)
df_2['FirstMonth'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_2['FirstYear'] = df_2['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)
df_2.drop(['OrDate','FirstPayment'], axis=1, inplace=True)
Now all features in the new dataframe, called df_2, are represented as numbers.
df_2.dtypes
Another popular way to deal with categorical features is to create a dummy variable for each possible value of the feature. Considering that some of our categorical features have more than a dozen possible values (such as "Seller's Name"), we prefer not to use this method unless it is necessary for our predictive model to perform well.
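For illustration only, here is a minimal sketch of what dummy encoding produces with pandas' get_dummies, again on hypothetical seller values (this is not the encoding we use at this stage):
# Toy example: one 0/1 indicator column per category level
toy_sellers = pd.Series(['BoA', 'Chase', 'BoA'], name='SellerName')
print(pd.get_dummies(toy_sellers, prefix='SellerName'))   # columns SellerName_BoA, SellerName_Chase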
In this section we try to see whether some simple indicators can be useful in the decision process when Fannie Mae purchases new loans. Remember that Fannie Mae buys these loans from banks and generates cash flow through the repayments of individual borrowers, so we believe we need to pay attention both to the banks that sell loans and to the individual borrowers behind each loan.
Compared to local mortgage lenders, Fannie Mae does not have the ability to meet borrowers face to face. Commercial banks, in contrast, have agents who can talk to borrowers and collect more detailed information when a borrower applies for a loan. If a commercial bank is more capable of picking borrowers who repay reliably, Fannie Mae should consider buying loans from that bank first.
We also want to explore whether there is any visible difference between reliable borrowers and defaulting borrowers. We hope that reliable borrowers differ significantly from defaulting borrowers on some indicators.
In summary, we decided to explore the following questions:
1. Which banks are less likely to sell loans that end up defaulting?
2. Do good and bad borrowers differ on any features? If so, which feature best distinguishes good borrowers from bad ones?
In this section we visualize the percentage of defaulted loans sold by each bank. We also compare the default rate of each bank with the average default rate of the whole market and try to find banks with better-than-average performance.
xtab = pd.pivot_table(df, index='SellerName', columns='Default', aggfunc='size')
xtab.loc['Overall'] = xtab.sum()   # append a row of market-wide totals for comparison
df_seller = xtab.copy()
df_seller
df_seller = df_seller.div(df_seller.sum(axis=1), axis=0)
df_seller = df_seller.sort_values((0))
df_seller.plot.barh(stacked=True, figsize=(6,4))
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.xlabel('Fraction of Borrowers')
plt.ylabel('Sellers')
From the figure above, we can see that HSBC Bank has extraordinarily good performance: nearly no loan purchased from HSBC ended up in default. However, we must keep in mind that HSBC sold only a little more than five hundred loans to Fannie Mae in this quarter. This number is very small compared to the other banks, so HSBC's low default rate may be biased by the small sample size.
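To see how thin HSBC's sample is relative to the other sellers, one can simply count the loans per seller; a minimal sketch (the exact counts depend on the quarter loaded):
# Number of acquired loans per seller, as a sanity check on sample sizes
df['SellerName'].value_counts()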
Loans purchased from the following major banks have lower default rates than the overall market: JP Morgan Chase, Wells Fargo, First Tennessee Bank, PNC Bank, Citi Mortgage and Chase Home Finance. Considering that these are already large commercial banks, we are confident in saying that they show stronger capability in risk management.
On the other hand, some major banks such as Bank of America perform worse than the overall average. We suggest that Fannie Mae reduce the number of loans purchased from these banks.
We also noticed an interesting phenomenon related to the smaller banks that are not listed individually in the dataset. The average default rate of these smaller banks is lower than that of all other major banks except HSBC. Our guess is that those smaller banks have more local connections to their customers, and those connections help them know the borrowers better during the mortgage application process.
In summary, we have three recommendations for Fannie Mae.
1. Fannie Mae should give higher preference to the following large commercial banks: JP Morgan Chase, Wells Fargo, First Tennessee Bank, PNC Bank, Citi Mortgage and Chase Home Finance.
2. Fannie Mae should reduce the number of loans purchased from the following banks: Amtrust Bank, GMAC Mortgage, Flagstar Capital, Bank of America and Suntrust Mortgage.
3. Fannie Mae should consider exploring more opportunities to work with smaller local banks.
To answer the second question, we explore several features using box plots. In each subplot, the blue box shows the distribution of that feature for the non-default class (Default=0), while the orange box shows the distribution for the default class (Default=1). For the sake of simplicity, we focus on only 6 numerical features: original interest rate, original unpaid principal, original LTV, original CLTV, debt-to-income ratio and credit score.
num_columns_2 = ['OrInterestRate','OrUnpaidPrinc','OrLTV','OrCLTV','DTIRat','CreditScore'];
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(8,6))
plt.tight_layout(w_pad=4.0, h_pad=4.0)
for i, column in zip(range(1, len(num_columns_2) + 1), num_columns_2):
    plt.subplot(2, 3, i)
    sns.boxplot(x="Default", y=column, data=df, linewidth=0.5)
    plt.xlabel('Default')
The figures above show that, on average, defaulters tend to have a higher original interest rate, a higher debt-to-income ratio and a lower credit score. From the figure, credit score appears to be the feature with the largest difference in both distribution and average value, so we decided to verify the assumption that defaulters and non-defaulters differ significantly in credit score.
To do so, we first examine some key statistics of the credit score distribution for both defaulters and non-defaulters.
crdscr_cmp=pd.concat([df[['CreditScore']][df['Default']==0].describe(),df[['CreditScore']][df['Default']==1].describe()],axis=1)
crdscr_cmp.columns = ['CreditScore when Default=0', 'CreditScore when Default=1']
crdscr_cmp
The credit scores of non-defaulters are clearly higher than those of defaulters on most statistics, such as the mean, 25% quantile, median and 75% quantile. To verify that non-defaulters indeed tend to have higher credit scores, we further decided to run the following hypothesis test. Let $\mu_0$ be the average credit score of non-defaulters and $\mu_1$ be the average credit score of defaulters. The hypothesis test is formulated as $$ H_0: \mu_0 \leq \mu_1 $$ $$ H_1: \mu_0 > \mu_1 $$ We performed a one-sided t-test for this hypothesis.
result = stat.ttest_ind(df[df['Default'] == 0]['CreditScore'],df[df['Default'] == 1]['CreditScore'],equal_var = False)
result
Recall that the scipy function ttest_ind used above always reports a signed test statistic for a two-tailed test. Given the t and p values from the two-tailed test, we should reject the null hypothesis of a greater-than test when p/2 < alpha and t > 0, and of a less-than test when p/2 < alpha and t < 0. From the result above we have $$ t > 0 $$ $$ p/2 < .05 $$ So we are confident in rejecting the null hypothesis and saying that the average credit score of non-defaulters is higher than the average credit score of defaulters.
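The decision rule can be spelled out explicitly as a small sketch using the `result` object returned by ttest_ind above (assuming a significance level of 0.05):
# One-sided (greater-than) test: reject H0 only if t > 0 and the halved two-sided p-value is below alpha
alpha = 0.05
t_stat, p_two_sided = result.statistic, result.pvalue
reject_H0 = (t_stat > 0) and (p_two_sided / 2 < alpha)
print("Reject H0 (non-defaulters have higher mean credit score):", reject_H0)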
We also visualized the distribution density of credit score for both defaulters and non-defaulters. We tried to see whether there is a boundary that we can use to distinguish defaulters from non-defaulters.
cs_nondefault=df['CreditScore'][df['Default']==0].dropna()
cs_default=df['CreditScore'][df['Default']==1].dropna()
sns.set()
sns.distplot( cs_nondefault , color="skyblue", label="Default=0")
sns.distplot( cs_default , color="red", label="Default=1")
plt.legend()
From the density plot, we can see that the credit score distribution of non-defaulters is clearly shifted toward higher values. Although the two distributions overlap heavily, we may say that the boundary between good and bad credit scores is around 730.
In summary, we can say that
1. Non-defaulters tend to have higher credit score than defaulters.
2. If a borrower's credit score is higher than about 730, s/he probably won't default (a quick check of this threshold is sketched below).
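As a rough sanity check of the 730 figure (an eyeballed boundary, not a fitted threshold), we can compare the empirical default rates on either side of it:
# Empirical default rate above vs. at or below the approximate 730 credit score boundary
above = df.loc[df['CreditScore'] > 730, 'Default'].mean()
below = df.loc[df['CreditScore'] <= 730, 'Default'].mean()
print("Default rate above 730: {:.2%}; at or below 730: {:.2%}".format(above, below))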
In this section we try to build a sophisticated predictive model that can make use of all the information hidden in every feature for Fannie Mae. Predicting whether a loan will end up in default is a binary classification problem. Before training our model, we first need to pre-process the data further.
Here we performed three preprocessing steps:
1. We scaled the numerical features.
2. We resampled the dataset to address the class imbalance in the response feature "Default".
3. We split the dataset into training and testing sets.
#Step 1: rescaling
scaler = StandardScaler()
df_2[num_columns] = scaler.fit_transform(df_2[num_columns])
#Step 2: resampling
sm = SMOTEENN()
y = df_2['Default'].values
X = df_2.drop(['Default'], axis=1).values
X_resampled, y_resampled = sm.fit_sample(X, y)
#Step 3: splitting
X_train, X_test, y_train, y_test = cv.train_test_split(X_resampled, y_resampled, test_size = 0.25, random_state=0)
Figuring out whether a loan will default is a typical two-class classification problem. Logistic regression is one of the most popular classification methods since it is powerful yet simple. We started by feeding our data into a logistic regression model to see how well it works.
model_fitted=LogisticRegressionCV(random_state=1,cv=3).fit(X_train,y_train)
predictions = model_fitted.predict(X_test)
# Functions for model evaluation
def compute_false_negatives(target, predictions):
    # Fraction of actual defaulters (target == 1) predicted as non-defaulters; the +1 avoids division by zero
    false_negatives = pd.DataFrame({"target": target, "predictions": predictions})
    neg_rate = false_negatives[(false_negatives["target"] == 1) & (false_negatives["predictions"] == 0)].shape[0] / (false_negatives[(false_negatives["target"] == 1)].shape[0] + 1)
    return neg_rate

def compute_false_positive(target, predictions):
    # Fraction of actual non-defaulters (target == 0) predicted as defaulters; the +1 avoids division by zero
    false_positives = pd.DataFrame({"target": target, "predictions": predictions})
    pos_rate = false_positives[(false_positives["target"] == 0) & (false_positives["predictions"] == 1)].shape[0] / (false_positives[(false_positives["target"] == 0)].shape[0] + 1)
    return pos_rate
# Model evaluation
print(metrics.classification_report(y_test, predictions))
accuracy=metrics.accuracy_score(y_test,predictions)
FN=compute_false_negatives(y_test,predictions)
FP=compute_false_positive(y_test,predictions)
print("Accuracy of the model:{}".format(accuracy))
print("False Negatives Rates:{}".format(FN))
print("False Positive Rates:{}".format(FP))
It can be seen that our model has an accuracy of almost 80% on the testing data, so at least it shows some power over random guessing. However, recall that the current default rate in the dataset is 34793/391209 = 8.89%. If Fannie Mae used this logistic model, its default rate would be $$1 - \text{precision of non-defaulters} = 1 - 0.79 = 21\%$$ which is far worse than Fannie Mae's current performance. The logistic regression model needs to be improved. We believe one of its weaknesses is that it is still too "nice" to potential defaulters, since about 30 percent of defaulters are classified as non-defaulters. We checked the ROC curve to see whether moving the decision boundary could help.
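Before looking at the ROC curve, here is a minimal sketch of what moving the decision boundary would mean in practice; the 0.3 threshold below is an arbitrary illustrative value, not a tuned one:
# Flag a loan as a defaulter whenever its predicted default probability exceeds a hand-picked threshold
threshold = 0.3   # hypothetical cut-off, for illustration only
proba_default = model_fitted.predict_proba(X_test)[:, 1]
predictions_strict = (proba_default > threshold).astype(int)
print("False Negative Rate at threshold 0.3:", compute_false_negatives(y_test, predictions_strict))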
fpr, tpr, thresholds = metrics.roc_curve(y_test, model_fitted.predict_proba(X_test)[:,1])
roc_auc = metrics.roc_auc_score(y_test, predictions)
plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')
The AUC value is 0.71, which is not bad but also not outstanding.
We decided to improve our model in two directions. The first is to improve the data transformation step; the second is to use a more sophisticated classification model.
For the categorical features, instead of integer codes we created dummy variables for each possible value of every categorical feature. This transformation is more precise in theory and is widely used in statistical tasks such as experimental design.
def getdummies(df):
    # Set aside any columns that still contain NaN so they are not dummified
    columns = df.columns[df.isnull().any()]
    nan_cols = df[columns]
    df.drop(nan_cols.columns, axis=1, inplace=True)
    # Split the remaining columns into categorical and numerical parts
    cat = df.select_dtypes(include=['object'])
    num = df.drop(cat.columns, axis=1)
    data = pd.DataFrame()
    for i in cat.columns:
        tmp = pd.get_dummies(cat[i], drop_first=True)
        data = pd.concat([data, tmp], axis=1)
    # Reassemble the numerical, dummy and NaN columns
    df = pd.concat([num, data, nan_cols], axis=1).reset_index(drop=True)
    return df
df_3 = df.copy()
df_3['OrDateMonth'] = df_3['OrDate'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['OrDateYear'] = df_3['OrDate'].apply(lambda x: x.split('/')[1].strip()).astype(int)
df_3['FirstMonth'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[0].strip()).astype(int)
df_3['FirstYear'] = df_3['FirstPayment'].apply(lambda x: x.split('/')[1].strip()).astype(int)
df_3.drop(['OrDate','FirstPayment'], axis=1, inplace=True)
df_3 = getdummies(df_3)
df_3.dtypes
We then went through the same preprocessing steps and built new training and testing datasets.
#Step 1: rescaling
scaler = StandardScaler()
df_3[num_columns] = scaler.fit_transform(df_3[num_columns])
#Step 2: resampling
sm = SMOTEENN()
y3 = df_3['Default'].values
X3 = df_3.drop(['Default'], axis=1).values
X3_resampled, y3_resampled = sm.fit_sample(X3, y3)
#Step 3: splitting
X3_train, X3_test, y3_train, y3_test = cv.train_test_split(X3_resampled, y3_resampled, test_size = 0.25, random_state=0)
We also switched from logistic regression to a random forest. Random forests are relatively new but have been shown to be powerful in classification tasks.
model_rf = RandomForestClassifier(n_estimators=250)
model_rf = model_rf.fit(X3_train, y3_train)
predictions_rf = model_rf.predict(X3_test)
# Model evaluation
print(metrics.classification_report(y3_test, predictions_rf))
accuracy_rf=metrics.accuracy_score(y3_test,predictions_rf)
FN_rf=compute_false_negatives(y3_test,predictions_rf)
FP_rf=compute_false_positive(y3_test,predictions_rf)
print("Accuracy of the model:{}".format(accuracy_rf))
print("False Negatives Rates:{}".format(FN_rf))
print("False Positive Rates:{}".format(FP_rf))
The new predictive model shows a huge improvement. If Fannie Mae used the random forest model, its overall default rate would decrease from 8.89% to
$$1 - \text{precision of non-defaulters} = 1 - 0.94 = 6\%$$
which should improve the performance of Fannie Mae's mortgage-backed securities on the bond market.
We further checked the ROC curve.
fpr, tpr, thresholds = metrics.roc_curve(y3_test, model_rf.predict_proba(X3_test)[:,1])
roc_auc = metrics.roc_auc_score(y3_test, predictions_rf)
plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Random Forest ROC')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')
The AUC value is clearly higher than that of the logistic regression model, which confirms the improvement.
After thorough exploration, we believe that the random forest classifier, trained on the dataset with dummy-encoded categorical features, brings an impressive improvement to Fannie Mae's current risk management process. Fannie Mae could see its default rate drop by roughly three percentage points (from 8.89% to about 6%) by putting this classifier into use.
After exploring Fannie Mae's loan acquisition and performance dataset, we have provided the results above, which may help Fannie Mae decrease the default rate of the loans it purchases.