Univariate Exploration

## [1] 113937     81

Taking a peak at the dimensions of the data we can see that this dataset has many (81) variables. I used a list of these variables to create a shorter set of variables of interest to explore. There are 113,937 observations (which makes it worrisome that the variable loan key is a factor with only 113066 levels!) Let’s explore that after seeing a summary of the dataset.

##                    ListingKey          Term      
##  17A93590655669644DB4C06:     6   Min.   :12.00  
##  349D3587495831350F0F648:     4   1st Qu.:36.00  
##  47C1359638497431975670B:     4   Median :36.00  
##  8474358854651984137201C:     4   Mean   :40.83  
##  DE8535960513435199406CE:     4   3rd Qu.:36.00  
##  04C13599434217079754AEE:     3   Max.   :60.00  
##  (Other)                :113912                  
##                  LoanStatus     BorrowerAPR       BorrowerRate   
##  Current              :56576   Min.   :0.00653   Min.   :0.0000  
##  Completed            :38074   1st Qu.:0.15629   1st Qu.:0.1340  
##  Chargedoff           :11992   Median :0.20976   Median :0.1840  
##  Defaulted            : 5018   Mean   :0.21883   Mean   :0.1928  
##  Past Due (1-15 days) :  806   3rd Qu.:0.28381   3rd Qu.:0.2500  
##  Past Due (31-60 days):  363   Max.   :0.51229   Max.   :0.4975  
##  (Other)              : 1108   NA's   :25                        
##   LenderYield      EstimatedEffectiveYield  ProsperScore  
##  Min.   :-0.0100   Min.   :-0.183          Min.   : 1.00  
##  1st Qu.: 0.1242   1st Qu.: 0.116          1st Qu.: 4.00  
##  Median : 0.1730   Median : 0.162          Median : 6.00  
##  Mean   : 0.1827   Mean   : 0.169          Mean   : 5.95  
##  3rd Qu.: 0.2400   3rd Qu.: 0.224          3rd Qu.: 8.00  
##  Max.   : 0.4925   Max.   : 0.320          Max.   :11.00  
##                    NA's   :29084           NA's   :29084  
##  ListingCategory..numeric. BorrowerState  
##  Min.   : 0.000            CA     :14717  
##  1st Qu.: 1.000            TX     : 6842  
##  Median : 1.000            NY     : 6729  
##  Mean   : 2.774            FL     : 6720  
##  3rd Qu.: 3.000            IL     : 5921  
##  Max.   :20.000                   : 5515  
##                            (Other):67493  
##                     Occupation         EmploymentStatus
##  Other                   :28617   Employed     :67322  
##  Professional            :13628   Full-time    :26355  
##  Computer Programmer     : 4478   Self-employed: 6134  
##  Executive               : 4311   Not available: 5347  
##  Teacher                 : 3759   Other        : 3806  
##  Administrative Assistant: 3688                : 2255  
##  (Other)                 :55456   (Other)      : 2718  
##  EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.00           Min.   :  0.0         Min.   : 19.0        
##  1st Qu.: 26.00           1st Qu.:660.0         1st Qu.:679.0        
##  Median : 67.00           Median :680.0         Median :699.0        
##  Mean   : 96.07           Mean   :685.6         Mean   :704.6        
##  3rd Qu.:137.00           3rd Qu.:720.0         3rd Qu.:739.0        
##  Max.   :755.00           Max.   :880.0         Max.   :899.0        
##  NA's   :7625             NA's   :591           NA's   :591          
##          IncomeRange    IncomeVerifiable LoanOriginalAmount
##  $25,000-49,999:32192   False:  8669     Min.   : 1000     
##  $50,000-74,999:31050   True :105268     1st Qu.: 4000     
##  $100,000+     :17337                    Median : 6500     
##  $75,000-99,999:16916                    Mean   : 8337     
##  Not displayed : 7741                    3rd Qu.:12000     
##  $1-24,999     : 7274                    Max.   :35000     
##  (Other)       : 1427                                      
##  MonthlyLoanPayment
##  Min.   :   0.0    
##  1st Qu.: 131.6    
##  Median : 217.7    
##  Mean   : 272.5    
##  3rd Qu.: 371.6    
##  Max.   :2251.5    
## 

This gives us all sorts of information there at a glance. We get to see an overview of ranges, maximums and counts. Large inconsistencies can be spotted at this stage, but it is a lot to take in at once, so let’s dive into the individual variables I singled out interesting.

I also chose to convert Prosper Score and Term into factors.

Listing Key

Number of listing keys:

## [1] 113937

Number of unique listing keys:

## [1] 113066

Here I wanted to see if the listing keys, which are supposed to be unique identifiers, really were unique. It seems like a lot of them are not.

times_listing_key_appears count
1 112239
2 790
3 32
4 4
6 1

So we can see that many of the keys appear multiple times, with one loan appearing a whopping six times. Let’s see that loan.

ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating..numeric. ProsperRating..Alpha. ProsperScore ListingCategory..numeric. BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent..percentage. TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 4 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 8 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 7 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 10 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 5 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96
17A93590655669644DB4C06 951186 2013-10-02 17:20:16.550000000 60 Current 0.16662 0.1435 0.1335 0.1264 0.0524 0.074 5 B 6 1 MD Other Employed 26 False False 2013-12-23 09:38:12 720 739 1986-12-26 00:00:00 12 12 20 6 348 0 5 0 0 0 0 0 14635 0.57 10865 17 1 0 0.41 $25,000-49,999 True 3000 CB1B37030986463208432A1 NA NA NA NA NA NA NA NA 0 NA 2 126059 10000 2014-01-13 00:00:00 Q1 2014 F80D3694083622957BA09F2 234.5 234.5 112.62 121.88 -8.49 0 0 0 0 1 0 0 0 96

It seems this loan is just repeated in the dataset 6 times. Here I run a function to see if any of the columns contains values that aren’t exact duplicates.

## [1] "ProsperScore"

There it is, this loan somehow as 6 different ProsperScores. Let’s see if ProsperScore is the only non-duplicated value in all of the other repeated loans.

It seems that it is. Are these Prosper Scores distributed in a predictable manner?

##            ind
## 1 ProsperScore

Taking a look at just a few loans we can see that the spread isn’t always the same between ProsperScores with the same listing ID. Because there doesn’t seem to be a pattern here I elected to eliminate all duplicate rows from my dataset

## [1] "Number of listing keys: "
## [1] 112239
## [1] "Number of unique listing keys: "
## [1] 112239

Now we’re cooking with gas.

Term

##    12    36    60 
##  1614 86694 23931

Seems like the minimum loan term is a year, but the mean is 40.83 months, a little over 3 years.

Here we can see that term is discrete, and that 3 year loans far outpace 5 year loans, with 1 year loans coming in a distant third.

## # A tibble: 3 × 3
##     Term count_per_term percentage
##   <fctr>          <int>      <dbl>
## 1     12           1614   1.438003
## 2     36          86694  77.240531
## 3     60          23931  21.321466

Here is the same information in a table with percentages.

Loan Status

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38048 
##                Current              Defaulted FinalPaymentInProgress 
##                  54928                   5018                    201 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    794                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    359                    309                    304

I don’t like that Past Due > 120 is out of order with the other past dues, so I changed the order of the levels.

A quick note on definitions via Investopedia.com – >Chargedoff : “A charge-off or chargeoff is the declaration by a creditor (usually a credit card account) that an amount of debt is unlikely to be collected. This occurs when a consumer becomes severely delinquent on a debt. Traditionally, creditors will make this declaration at the point of six months without payment.”

Default : “Default is the failure to pay interest or principal on a loan or security when due. Default occurs when a debtor is unable to meet the legal obligation of debt repayment … When an individual, a business or even a nation defaults on a debt obligation, the lender or investor has some recourse to reclaim the funds, but it varies based on the type of security involved.

Happily, the majority of the loans are either current or completed. If we were creating a model for this data it would likely be to predict if a new loan was going to become charged-off, defaulted, or past due. Cancelled, as we can see from our table, has just 5 loans.

Here we look at the distribution of Past Due loans.
It seems that most loans that are past due are only 1-15 days past Due. This suggests that many of these loans are will become current again. I’m not sure if loans need to go through the whole 120 days of being past due before the default or become charged off. It is also worth noting that some of the bins here are larger than others, making the graph slightly misleading. (First two are 15 days, next three are 30.)

Since there are just 5 cancelled loans, let’s examine them.

##                ListingKey Term LoanStatus BorrowerAPR BorrowerRate
## 1 093D3365430569291709120   36  Cancelled     0.25650       0.2375
## 2 F77F33664545520971829DC   36  Cancelled     0.14659       0.1395
## 3 09E1336627538645259AE42   36  Cancelled     0.20735       0.2000
## 4 29453366240667435235AEE   36  Cancelled     0.25650       0.2375
## 5 E77E34330098056070F5C30   36  Cancelled     0.16219       0.1075
##   LenderYield EstimatedEffectiveYield ProsperScore
## 1      0.2325                      NA         <NA>
## 2      0.1345                      NA         <NA>
## 3      0.1950                      NA         <NA>
## 4      0.2325                      NA         <NA>
## 5      0.0975                      NA         <NA>
##   ListingCategory..numeric. BorrowerState               Occupation
## 1                         0                                       
## 2                         0               Administrative Assistant
## 3                         0                                       
## 4                         0            NC                         
## 5                         1            FL                    Other
##   EmploymentStatus EmploymentStatusDuration CreditScoreRangeLower
## 1                                        NA                   520
## 2    Not available                       NA                   640
## 3                                        NA                    NA
## 4                                        NA                   500
## 5        Full-time                       30                   720
##   CreditScoreRangeUpper    IncomeRange IncomeVerifiable LoanOriginalAmount
## 1                   539  Not displayed            False               1000
## 2                   659  Not displayed             True               2500
## 3                    NA  Not displayed             True               3000
## 4                   519  Not displayed             True               1000
## 5                   739 $25,000-49,999             True               1000
##   MonthlyLoanPayment
## 1              39.10
## 2              85.38
## 3             111.49
## 4              39.10
## 5              32.62

Seems that these are loans that started to be funded and then were cancelled. There are only 5 of them, and they all seem to have been closed less than a month after creation.

Our original LoanStatus bar chart had a bit too much going on, so above I have simplified it. This chart gives us a better idea of how many loans are past due.

These LoanStatus’ are intersting, but a lot depends on the data. Is this a snapshot of a month, a year? If this was for all time Completed and Defaulted will eventually become a huge part of the mix as the amount of loan history outpaces loans being originated. This distribution makes me think that it is a snapshot of a certain time frame.

Borrower APR

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.21020 0.21910 0.28490 0.51230      25

The Borrower APR is interesting to me in several ways. First off, there are people with incredible deals and terrible deals. Our minimum is .6% - that’s a pretty great rate. I don’t know anyone getting a rate that rate cheap outside of the Fed. On the other end, there is a borrower paying 51% APR. Yikes. What I’m starting to like about this dataset is you can begin to see where the line is set for risk. Prosper, or their lending partners, thinks there is a positive return on someone so risky you need to charge them over half the principal per year to loan it.

Several interesting things there. First off it’s almost a normal distribution with an average around 20%, but it seems to have spikes near round APR’s, as well as a spike a bit under 30%. There is also a large spike at around 35%. This histogram is a little noisy, so let’s look at a density plot.

## Warning: Removed 25 rows containing non-finite values (stat_density).

The peaks are easier to see with a density plot. But it is easy to see that a large amount of loans hover around 20% APR.

## Warning: Removed 25 rows containing non-finite values (stat_boxplot).

It seems that a couple of datapoints at the top of the range could be considered outliers. Let’s see how many.

##   count
## 1     3
##                ListingKey ListingNumber           ListingCreationDate
## 1 844033650124564886B3EDC           690 2006-02-23 13:57:02.087000000
## 2 A79D33661366830833F3EF5          2231 2006-03-16 19:30:16.753000000
## 3 BBED336465905564254DC8B          1112 2006-03-02 19:00:17.593000000
##   CreditGrade Term LoanStatus          ClosedDate BorrowerAPR BorrowerRate
## 1          HR   36  Completed 2006-03-14 00:00:00     0.51229       0.4975
## 2          HR   36  Defaulted 2006-09-19 00:00:00     0.50633       0.4975
## 3          HR   36  Defaulted 2006-09-05 00:00:00     0.48873       0.4800
##   LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 1      0.4925                      NA            NA              NA
## 2      0.4800                      NA            NA              NA
## 3      0.4750                      NA            NA              NA
##   ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 1                      NA                               <NA>
## 2                      NA                               <NA>
## 3                      NA                               <NA>
##   ListingCategory..numeric. BorrowerState Occupation EmploymentStatus
## 1                         0                                          
## 2                         0                                          
## 3                         0                                          
##   EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 1                       NA               False            False
## 2                       NA               False            False
## 3                       NA               False            False
##                  GroupKey              DateCreditPulled
## 1                         2006-02-23 13:57:02.087000000
## 2 F555336594467179889CEFB 2006-03-16 18:39:30.963000000
## 3                         2006-03-02 11:32:57.040000000
##   CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## 1                    NA                    NA                        
## 2                    NA                    NA                        
## 3                    NA                    NA                        
##   CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## 1                 NA              NA                         NA
## 2                 NA              NA                         NA
## 3                 NA              NA                         NA
##   OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## 1                     0                           0                   NA
## 2                     0                           0                   NA
## 3                     0                           0                   NA
##   TotalInquiries CurrentDelinquencies AmountDelinquent
## 1             NA                   NA               NA
## 2             NA                   NA               NA
## 3             NA                   NA               NA
##   DelinquenciesLast7Years PublicRecordsLast10Years
## 1                      NA                       NA
## 2                      NA                       NA
## 3                      NA                       NA
##   PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## 1                        NA                     NA                  NA
## 2                        NA                     NA                  NA
## 3                        NA                     NA                  NA
##   AvailableBankcardCredit TotalTrades TradesNeverDelinquent..percentage.
## 1                      NA          NA                                 NA
## 2                      NA          NA                                 NA
## 3                      NA          NA                                 NA
##   TradesOpenedLast6Months DebtToIncomeRatio   IncomeRange IncomeVerifiable
## 1                      NA           0.02852 Not displayed             True
## 2                      NA           0.03909 Not displayed            False
## 3                      NA           0.10021 Not displayed             True
##   StatedMonthlyIncome                 LoanKey TotalProsperLoans
## 1            2838.000 08423365150780340A7980C                NA
## 2               0.000 51A6336556683591380DA02                NA
## 3            1583.333 66E93364923759158FBB4B6                NA
##   TotalProsperPaymentsBilled OnTimeProsperPayments
## 1                         NA                    NA
## 2                         NA                    NA
## 3                         NA                    NA
##   ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 1                                  NA                              NA
## 2                                  NA                              NA
## 3                                  NA                              NA
##   ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 1                       NA                          NA
## 2                       NA                          NA
## 3                       NA                          NA
##   ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 1                          NA                         0
## 2                          NA                       211
## 3                          NA                       225
##   LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## 1                            NA                         97         89
## 2                             6                         96        245
## 3                             6                         96        137
##   LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 1               1500 2006-02-27 00:00:00                Q1 2006
## 2               3000 2006-03-21 00:00:00                Q1 2006
## 3               3000 2006-03-07 00:00:00                Q1 2006
##                 MemberKey MonthlyLoanPayment LP_CustomerPayments
## 1 45F83366003403405932608              80.94             1530.67
## 2 9151336553609388240ECB8             161.88              158.70
## 3 0FA73365900322998FDD36E             158.66              158.66
##   LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 1                      1500.00              30.67          -0.29
## 2                        35.12             123.58          -1.27
## 3                        36.36             122.30          -1.27
##   LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 1                 0                  0.00                0.00
## 2                 0               2964.88             2964.88
## 3                 0               2963.64             2963.64
##   LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 1                               0             1               0
## 2                               0             1               0
## 3                               0             1               0
##   InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 1                          0                           0         9
## 2                          0                           0        21
## 3                          0                           0        20
##   simple.LoanStatus
## 1         Completed
## 2         Defaulted
## 3         Defaulted

Just because they’re outliers doesn’t mean they’re not valid. There are three loans with extremely high Borrower APR, and it seems that one of them was paid off and two defaulted. They were all made in Q1 2006 and were for $1,500 and $3,000. I wonder if this was a quirk that was later worked out?

BorrowerRate

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1342  0.1840  0.1931  0.2511  0.4975

Here we notice that the rate is NOT the same thing as the APR. Looking at the definitions document we’re simply told that the APR is the Annual Percentage Rate and that the Borrow’s Rate is the borrower’s interest rate for the loan. Some research tells us that these are calculated by (from Investopedia):

APR - Annual Percentage Rate. A percentage that represents the actual yearly cost of funds over the term of the loan. Includes additional costs or fees but does not take compounding into account.

Interest Rate - Refers only to the interst charged on the loan, doesn’t take other expenses into acount. Therefore, APR tends to be higher than the loan’s nominal interst rate.

APY or EAR - Annual Percentage Yield or Effective Annual Rate takes compound interest into account. Tends to be larger than APR on the same loan.

From researching the issue it seems that APR can be misleading when it comes to actual cost, and is best used when the loan is long-term, otherwise it does not account properly for costs and fees. However, for our purposes it is probably a good way to measure what a borrower is being charged.

Looks like the same 3 outliers from BorrowerAPR on the high end here.

Estimated Effective Yield

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -0.183   0.117   0.162   0.169   0.225   0.320   29084

You can see that this distribution mimics the distribution of APR’s but moved slightly lower, with some tail to the left as well. Let’s see what’s causing negative estimated yield loans.

##                    ListingKey  Term                      LoanStatus 
##  009B3495741061236C0C927:  1   12:  0   Completed             :146  
##  02283482060850607BC27FE:  1   36:190   Chargedoff            : 42  
##  02933496318097934A683F4:  1   60:  0   Defaulted             :  2  
##  029A3499706893097DB1393:  1            Cancelled             :  0  
##  038535015703591930CD2E0:  1            Current               :  0  
##  03AA34675425410661C8213:  1            FinalPaymentInProgress:  0  
##  (Other)                :184            (Other)               :  0  
##   BorrowerAPR       BorrowerRate     LenderYield    
##  Min.   :0.08999   Min.   :0.0693   Min.   :0.0593  
##  1st Qu.:0.21306   1st Qu.:0.1833   1st Qu.:0.1733  
##  Median :0.29031   Median :0.2500   Median :0.2400  
##  Mean   :0.29372   Mean   :0.2607   Mean   :0.2507  
##  3rd Qu.:0.37453   3rd Qu.:0.3500   3rd Qu.:0.3400  
##  Max.   :0.41355   Max.   :0.3500   Max.   :0.3400  
##                                                     
##  EstimatedEffectiveYield  ProsperScore ListingCategory..numeric.
##  Min.   :-0.18270        1      :105   Min.   :1.000            
##  1st Qu.:-0.06338        5      : 30   1st Qu.:1.000            
##  Median :-0.03420        3      : 17   Median :2.000            
##  Mean   :-0.04802        2      : 16   Mean   :3.237            
##  3rd Qu.:-0.01787        4      : 12   3rd Qu.:6.000            
##  Max.   :-0.00060        8      :  5   Max.   :7.000            
##                          (Other):  5                            
##  BorrowerState             Occupation      EmploymentStatus
##  CA     :33    Other            :50   Full-time    :107    
##  MI     :14    Professional     :23   Employed     : 33    
##  FL     :13    Military Enlisted:14   Self-employed: 33    
##  GA     :13    Teacher          :10   Retired      :  8    
##  CO     : 9    Sales - Retail   : 7   Not employed :  5    
##  TX     : 9    Analyst          : 6   Part-time    :  4    
##  (Other):99    (Other)          :80   (Other)      :  0    
##  EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
##  Min.   :  0.00           Min.   :600.0         Min.   :619.0        
##  1st Qu.: 26.25           1st Qu.:620.0         1st Qu.:639.0        
##  Median : 65.50           Median :640.0         Median :659.0        
##  Mean   : 84.34           Mean   :657.6         Mean   :676.6        
##  3rd Qu.:123.00           3rd Qu.:680.0         3rd Qu.:699.0        
##  Max.   :303.00           Max.   :820.0         Max.   :839.0        
##                                                                      
##          IncomeRange IncomeVerifiable LoanOriginalAmount
##  $25,000-49,999:67   False: 55        Min.   : 1000     
##  $50,000-74,999:56   True :135        1st Qu.: 1500     
##  $1-24,999     :21                    Median : 2438     
##  $100,000+     :20                    Mean   : 3369     
##  $75,000-99,999:20                    3rd Qu.: 4000     
##  Not employed  : 5                    Max.   :15000     
##  (Other)       : 1                                      
##  MonthlyLoanPayment
##  Min.   :  0.00    
##  1st Qu.: 54.83    
##  Median : 90.68    
##  Mean   :135.49    
##  3rd Qu.:175.07    
##  Max.   :678.37    
## 

It seems that some of the EstimatedEffectiveYields have a negative value. I will explore this in the Bivariate Exploration.

ProsperScore

##     1     2     3     4     5     6     7     8     9    10    11  NA's 
##   917  5632  7456 12300  9661 12064 10392 11875  6813  4637  1408 29084

Here you can see that ProsperScore is discrete, and potentially bimodal? Let’s see what the scale is here.

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

1 to 11? That’s a strange scale for scoring. Maybe it’s analogous to 0 to 10, but they don’t want to make people feel bad by giving them 0’s.

Listing Category

In our dataset the Listing Categories are numeric, so I wrote a short function to convert them to their string values.

It looks like Debt Consolidation is the most popular reason to apply for a loan.

Borrower State

This bar chart doesn’t actually tell us much because we would have to weight these by the population of each state to see if there are real differences between them. In fact the first four are ordered exactly as they would be ordered by population.

Occupation

## # A tibble: 10 × 2
##                  Occupation count
##                      <fctr> <int>
## 1                     Other 28242
## 2              Professional 13440
## 3       Computer Programmer  4407
## 4                 Executive  4250
## 5                   Teacher  3699
## 6  Administrative Assistant  3666
## 7                   Analyst  3554
## 8        Sales - Commission  3384
## 9            Accountant/CPA  3185
## 10                 Clerical  3108

Again, to see if any occupation actually had an undue number of loans we would have to get an estimate of how many people had each job in the population. Still, categorical variables like this are always interesting to look at since they put a human face on our data.

Employment Status

## # A tibble: 9 × 2
##   EmploymentStatus count
##             <fctr> <int>
## 1         Employed 65912
## 2        Full-time 26353
## 3    Self-employed  5973
## 4    Not available  5347
## 5            Other  3681
## 6                   2255
## 7        Part-time  1088
## 8     Not employed   835
## 9          Retired   795

Oh, so many questions here.
Are some of these categories subsets of other categories? (ie Does ‘Employed = Full-time + Part-Time + Self Employed’? Let’s try to find out.

##   Self_Full_Or_Part_time
## 1                  33414
##   Employed
## 1    65912

Even with the addition of ‘Other’ and the NULLs the categories would not approach the number of employment. The way these categories are split up makes this variable less usefull.

Employment Status Duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   26.00   67.00   96.05  137.00  755.00    7625

##   status.mean status.median status.min status.max
## 1    8.004111      5.583333          0   62.91667

Looks like someone claims to have worked somewhere for almost 63 years! Maybe they’re saying the’ve been employed for 63 years straight. This distribution is what we would expect to see, with the exception of the responders who sayed their duration was 0 months.

Credit Score

Credit Range Upper and Lower. I would like to know the interval between the scores for each person.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   699.0   704.5   739.0   899.0     591

Interesting that there are Credit Scores under 300, which is the lowest Credit Score you can have.

##   int_mean int_median int_max int_min
## 1       19         19      19      19

Surprisingly the interval between CreditScoreRangeUpper and CreditScoreRangeLower isthe same for every loan, which saves me the trouble of constructing a variable to represent the most likely single score.

## Warning: Removed 591 rows containing non-finite values (stat_bin).

The two bumps in credit score here remind me of the bumps in APR/Rate. Potentially there is correlation there - more people with the same credit score more people with the same rate. I wonder how Prosper aquires their score, is it just Fico or something more? Also it would be interesting to see what credit score distribution looks like in the general population versus in this dataset.

Lender Yield

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1246  0.1740  0.1831  0.2411  0.4925

The same spike that exists in the other Rate plots exists her at around 30%. Hard to know if this is a pricing strategy, or a rate that is popular for other reasons. It is roughly equivalent to a credit card. The max is slightly under the max borrower rate, so fees paid to Prosper must not be included in this number.

Income Range

Income Verifiable

Looks like most loans go to people with verifiable incomes.

The distribution of IncomeRanges for non-verified incomes is slightly different from the verified incomes. Seems like $100k+ gets a bump upwards in particular.

Monthly Loan Payment

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   130.6   216.6   271.4   369.7  2252.0

Looks like there’s a big outlier there with our max 2252. But our histogram makes it hard to see this. Let’s zoom in on the bulk of the payments.

After 200 there is a sharp drop off in monthly payments. Perhaps there is some sort of sales tactic there, or a correlation with a common loan amount.

Loan Original Amount

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6100    8293   12000   35000

We can see that Loan Amounts tend to be in whole numbers. Let’s see the most common loan amounts.

## # A tibble: 9 × 2
##   LoanOriginalAmount count
##                <int> <int>
## 1               4000 14085
## 2              15000 12065
## 3              10000 10816
## 4               5000  6918
## 5               2000  6020
## 6               3000  5709
## 7              25000  3548
## 8               1000  3206
## 9              20000  3178

Lots of outliers here at the upper range of Loan Original Amount. It makes sense that there would be less people with the high incomes that would qualify them for these large loans.

Percent Funded

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7000  1.0000  1.0000  0.9986  1.0000  1.0120

OK, so it looks like the vast percentage were 100% funded (or they probably wouldn’t be in this dataset!) Let’s take a closer look at those that weren’t.
Also, it looks like minimally 70% of the loans in the set were funded. Maybe that is Prosper’s cutoff?

The only thing I see for sure here is that of the loans that do not get 100% funded, many are 70% funded, and the amount at each precentage from 70-99% trends downward until it hits 100%, where most of the loans reside. This could be interpreted as the higher your funding percentage the more likely you are to be funded 100%.

Number of Investors

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00   45.00   81.26  116.00 1189.00

Should be related to the size of the loan.

Wow, on a a platform that lets you pool your risk with other investors, a vast majority of loans still only have 1 investor.

## # A tibble: 10 × 2
##    Investors count
##        <int> <int>
## 1          1 26579
## 2          2  1364
## 3          3   967
## 4          4   817
## 5          5   747
## 6          6   721
## 7          7   690
## 8          8   745
## 9          9   717
## 10        10   720

Univariate Analysis

What is the structure of your dataset?

There are 112,239 unique loans in this dataset, with 81 features about both the loans (APR, Term, Status) and the Borrower (Occupation, Income Range, Credit Score).

What is/are the main feature(s) of interest in your dataset?

The main features of interest here are:
* What makes someone credit-worthy? / What factors determine how much you pay to borrow money? This is reflected in BorrowerApr. * What determines how much you can borrow? (OriginalLoanAmount) * What loans are likely to go bad? (LoanStatus) * and, for me, ‘What can we learn about the people these loans were made to?’ (Occupation, State, etc)

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think Credit Score and Income range are likely to be large factors in credit-worthyness, and how much you can borrow. Employment, ProsperScore, and Credit Score (which is already an aggregate of many credit-related data points) should be good indicators for rate. And variables like State, Occupation and Employment Duration will let me know more about the borrowers.

Did you create any new variables from existing variables in the dataset?

  • simple.LoanStatus - A simplified LoanStatus column called that combines all the ‘Past Due’ Loans into one factor, eliminates the small amount of cancelled and final payment loans.
  • ListingCategory - String representation of ListingCategory..numeric.
  • For the bi/multi variate exploration I will probably simplify the factors of Employment.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The biggest cleaning opporation I had to undertake was eliminating all of the duplicate loans that shared unique identifiers (ListingKey) and were identical except for Prosper Score.

I will be filtering out the loans with CreditScoreRangeUpper ‘19’ as they seem to represent a credit-not-available value.

Bivariate Plots Section

Correlations

## Warning in ggcorr(., hjust = 0.95, size = 5, color = "black", layout.exp =
## 2): data in column(s) 'ListingKey', 'Term', 'LoanStatus', 'ProsperScore',
## 'BorrowerState', 'Occupation', 'EmploymentStatus', 'IncomeRange',
## 'IncomeVerifiable' are not numeric and were ignored

This is a nice preview of things to come. Rates to the borrower and yield for the lender are positively correlated. Both Yield and Rate have a slight negative correlation to variables that provide credit worthiness, such as Credit Score and Employment Status Duration. And, Prosper Score seems to have very little correlation with the rest of the variables.

Proportion per Term by Loan Status

Seems like there are different patterns for different Income Ranges. For $1-24,999 and $25,000-49,999 the preference seems to be for 3 year term loans. For $50,000-74,999 long term 5-year loans are more popular. In the $100k and above range, short term one-year loans are the most popular, followed by 5 year-loans.

Here we can see that 3-year loans are more likely to default than others.

BorrowerApr by Term

## Warning: Removed 25 rows containing non-finite values (stat_boxplot).

Median APR is practically lat over all three terms. However, longest term loans tend to be in a tighter, lower APR range.

Estimated Effective Yield by Term

## Warning: Removed 29084 rows containing non-finite values (stat_boxplot).

It makes intuitive sense that 5-year loans would have a higher yield, since they have longer to collect interest. We know that 3-year loans have the largest range because there are more loans in this category, and we’ve already explored the negative yield, closed-out loans.

Estimated Effective Yield and Loan Status

In our univariate plots we noticed that there are some negative expected yields. Let’s take a deeper look at that.

Looks like all of the 184 loans that have a negative EstimatedEffectiveYield are not current, but have either been completed, chargedoff, or defaulted. This probably represents either the loss incurred from the loans that where completed ahead of schedule, or had to be taken as a loss. It’s not really an ‘estimate’ for any of these statuses because the loss has been realized.

Original Loan Amount by Term

Seems that in general the longer the term the higher the median amount of the loan. Interesting to note the hard ceiling of 35k.

## [1] 35000

Borrower APR by Loan Status

Completed and Current loans tend to have a lower APR than loans that are Past Due or that have Defaulted. A hopeful thing I notice here is that Past Due > 120 Days has a higher median APR than the short past due categories, so perhaps only the higher APR loans of those groups will default. Unfortunately, they have roughly the same APR as the chargedoff loans, so maybe things aren’t that much sunnier.

Listing Category and Loan Status

So there is a lot we can’t see in these charts, but mostly we’re mostly checking if there is any Listing Category that consistently leads to bad outcomes. It seems that Personal Loan, Not Available, and Student use result in a higher amount of bad outcome loans. Loans for Baby&Adoption and Wedding Loans tend to stay current.

LoanStatus and IncomeRange

Here we see that people in the $0 and Not employed income ranges have a higher proportion of Chargedoff, Defaulted, and PastDue Loans. Charge offs and defaults decrease with higher IncomeRange, though Past Due tends to stay flat.

For this graph I eliminated ‘cancelled’ and ‘final payment in progress’ loans because they are such a small part of the mix. It is also important to note that Completed, Chargedoff and Defaulted loans should grow with the amount of time included with this snapshot. If we had 20 years of loan data they would be many times the amount of currently active loans.

Borrower APR and Credit Score

This is some of what we would expect to see. The higher your Credit Score the lower your APR, in general.

It seems that even if you have a lower credit score you can get a loan at the high interest rage of .3 or .375. I wonder if the high interest loans made to borrowers with a high credit score are low amounts? I’ll explore this in the multivariate section.

We see some lines at APR’s that were present in our overall BorrowerAPR distribution. Especially that line at 30%. I wonder if there is legal lending limits or psycological pricing is at work there.

Looks like there is something that causes some people with no credit to be listed with a Lower Credit Score Range of 0 and an Upper of 19. It’s interesting to me that loans can be made to these types of users. Might be worth exploring this population further.

Borrower APR and Listing Category

Mean Borrower APR and Occupation

On a superficial level it seems that highly paid professions get lower average APR’s. But that just reiterates an obvious relationship between high income, good credit, and the ability to repay loans. I did smile when I saw that judges in particular have a low average apr. On further investigation there are 22 people who claimed Judge as their occupation in our dataset.

##   Judge_Count
## 1          22

Borrower APR vs Employment Duration

It seems like being ‘Not Employed’ results in a higher mean APR.

Employment Status Duration and BorrowerApr

It doesnt seem that duration of employment status for people who are employed has much of an effect on Borrower APR.

Prosper Score and Borrower APR

ProsperScore seems to be the best indicator for APR’s that we’ve explored - of course ProsperScore is a variable created by Prosper out of all the other variables they care about, so it’s completely surprising. You can see that in each of the scores there are outliers, so ProsperScore isn’t a total predictor, and there are probably other variables, such as Loan Amount, that will change the APR even if you have a certain score.

## # A tibble: 4 × 2
##   EmploymentStatus.simple `n()`
##                     <chr> <int>
## 1                Employed 98238
## 2            Not employed   835
## 3               Part-time  1088
## 4                 Retired   795

I like this scatter plot with jitter because you can both see that there aren’t many loans in the extremes of either range, and that there is a linear, positive relationship between Credit Score and Prosper Score, which is what we would expect. Distractingly, loans that do not have a Prosper Score have a wide range of Credit Scores.

Listing Category and Borrower State

## Source: local data frame [52 x 3]
## Groups: BorrowerState [52]
## 
##    BorrowerState    ListingCategory count
##           <fctr>              <chr> <int>
## 1                     Not Available  5474
## 2             AK Debt Consolidation   102
## 3             AL Debt Consolidation   817
## 4             AR Debt Consolidation   485
## 5             AZ Debt Consolidation   913
## 6             CA Debt Consolidation  7509
## 7             CO Debt Consolidation  1198
## 8             CT Debt Consolidation   946
## 9             DC Debt Consolidation   230
## 10            DE Debt Consolidation   183
## # ... with 42 more rows

I explored the relationship between state and Listing Category, but the overwhelming majority of loans are for Debt Consolidation or have a category of ‘Not Available’ so there was nothing surprising here.

Listing Category and Credit Score

## Source: local data frame [5 x 5]
## Groups: bin [5]
## 
##     bin    ListingCategory count minCredit maxCredit
##   <int>              <chr> <int>     <int>     <int>
## 1     1      Not Available  8994       379       659
## 2     2 Debt Consolidation 12590       659       699
## 3     3 Debt Consolidation 13408       699       719
## 4     4 Debt Consolidation 13111       719       759
## 5     5 Debt Consolidation 10703       759       899

Pretty all income ranges are listing for Debt Consolidation, or the info is unavailable.

Borrower State and most common Income Range

## Source: local data frame [53 x 3]
## Groups: BorrowerState [52]
## 
##    BorrowerState    IncomeRange count
##           <fctr>         <fctr> <int>
## 1                 Not displayed  3813
## 2             AK $50,000-74,999    71
## 3             AL $25,000-49,999   529
## 4             AR $25,000-49,999   326
## 5             AZ $25,000-49,999   594
## 6             CA $50,000-74,999  3858
## 7             CO $50,000-74,999   650
## 8             CT $50,000-74,999   466
## 9             DC      $100,000+   100
## 10            DE $25,000-49,999    93
## # ... with 43 more rows

Credit Score and Income Range

Interesting to see that all our out of range Credit Scores are in the ‘Not Displayed’ Income Range. There seems to be a positive correlation between haveing a higher Income Range and having a high credit score, but we can also see that most of the loans have been made to people with a Credit Score centered around 700, with a floor at about 525 (except for Not Displayed Income Range)

Income Range and Monthly Loan Payment

In general, if you have income, the higher your income range the higher your monthly loan payment can be.

Loan Original Amount and Income Range

The higher the income, the higher the median Loan Amount.

Bivariate Analysis.

How did the feature(s) of interest vary with other features in the dataset?

APR

  • Loan Terms don’t change APR much, but longer term loans tend to be in a tighter, lower APR range.

  • Having a higher Credit Score results in a lower APR in general, with some interesting reoccuring interest rates.

  • Having a high credit score and high Income Range are correlated.

  • Being unemployed correlates with having a higher APR, though employment status being ‘Not Available’ had the lowest mean apr.

  • ProsperScore good APR indicator. but indicator of inditcators.

Loan Amount

  • The longer the loan term the higher the median amount of loans - with the maximum loan amount being 35k.

  • Higher Income Ranges also tend to have higher monthly payments, probably because they tend to be taking out larger loans.

Possibility of bad outcome

  • Loan category does seem to have an effect on whether or not loans go bad (chargeoff, default).

  • Chargeoffs and defaults decrease with higher IncomeRange, though the amount of Past Due loans remains relatively flat.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Professionals tend to have low APRs with Judges having the lowest.
  • The reasons for wanting loans remains constant across states.
  • The top occupation in each state tends to vary.
  • DC, treated as its own state, has the highest most frequent income range (100k+).

What was the strongest relationship you found?

The relationship between ProsperScore and APR is very strong, though that is to be expected since ProsperScore is an amalgam of all the variables Prosper finds to be import, and is probably part of the process of assigning an APR.

Multivariate Plots Section

There is some overplotting here, but there is also a lot of information. We see that really large loans are only made to people that are in the $100,000+ Income Range. This makes sense not just because it’s the largest income but because as a category it extends to every possible Income Range above it. For people with an IncomeRange below 100k it seems that regardless of credit score their maximum loan amount is $25k. Also, lines emerge for Original Loan Amounts at easily chosen amounts - 10k, 15k, 20k. It seems that for most borrowers 25k is a hard ceiling unless you have a high income. It seems that for income ranges above $25k the relationship between Original Loan Amount and Credit Score is relatively constant amount income ranges. Below $25k a year, loans in the $0 bin have a loan amount that increases more quickly than others given a higher credit score, and the ‘Not employed’ bin experiences a steeper increase in Loan Amount than people in the $1-24,999 bin across the same Credit Scores.

My intuition was right about lower Original Loan Amounts resulting in high APRs. From this chart we can see that the best APRs seem to go to people with hgh credit scores, borrowing medium amounts of money.

It seems that, for people with income, larger loans amount are more likely to stay current. This surprises me, but may show that, in general, if Prosper is willing to loan you a lot of money they are very certain you will not become delinquent.

Here, each band is a term, and you can see that it tends to be the Loans with higher monthly payments within their term that have bad outcomes. This is deffinitely not causality, those loans had higher APR’s because the borrowers were deamed riskier.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Seeing Loan Amount, Credit Score, and Income range together strengthed what I knew about their relationships and highlighted the role Income Range plays, especially in large loans.

By combining APR, Credit Score and Loan Amount we can see that, in general, larger loans have lower APRs they don’t have the lowest, and that the largest loans are centered in a credit score of 750-850 and an APR between 10 and 20%.

Outcome by Income Range and Loan Amount was surprising - but reinforces the fact that this dataset is of loans that were deemed worthy to be made.

Were there any interesting or surprising interactions between features?

Deffinitely the last boxplot that shows Larger Loans being paradoxically more likely to have good outcomes was surprising to me, since larger loans are more difficult to pay back - but ability to repay is related to income and other factors.

It is also important to remember that this is a set of loans that were made. What we don’t see is loans that fell outside of the parameters of Prosper’s lending. All of these loans were deemed by lenders to be of acceptable risk.

Final Plots and Summary

Plot One

Description One

One of my first stops with this dataset was to look at the distribution of rates for borrowers, and the potential yields for Lenders. When looked at together Estimated Effective Yield has almost the same distribution, shifted lower on the scale of percentage rate. This make sense, because the yield will be lower than borrower is paying, and since the yield is estimated it is a little smoother than the actual borrower rate distribution of all of the loans in the dataset.

Plot Two

Description Two

One of my most surprising findings was that, on a platform for P2P lending, a 31% of the loans are originated by just one investor. If it was small, individual investors in these loans looking for maximum diversification I think there would be a mean around the average size of a loan divided by the minimum investment per loan ($25). It’s possible that institutional investors account for the large amount of these solo-investor loans. There were a lot of articles several years ago about how large investors, like hedge funds and banks, were getting into the P2P space, and there are even bonds that are bonds backed by P2P loans. In fact, I believe the prefered nomenclature is now ‘Marketplace Lending’, not P2P.

##   sum_count
## 1 0.3102848

Plot Three

Description Three

I included this plot because it both surprised me, and illustrated the strength of the Income variable in this dataset. It suggests that, for people with income, larger loans are more likely to stay current, and avoid becoming Past Due, Defaulted or Charged Off. My intuition is that Prosper is more willing to make larger loans to people who are more capable of paying them back. Also, our ‘Income Range’ variable includes some large buckets, and we can’t discount the fact that making $1 is very different than making $24,999 and that people making $75k are better off that people making $50k, but in both of these cases the loans are in the same Income Range bucket.


Reflection

Exploring this complex dataset was very enjoyable. It was hard to not explore each of the 81 variables. The data often had surprising quirks, like the duplicate listing keys and the 0-19 Credit Score Range, but mostly it told the story of how Prosper vets high quality borrowers.

I think a lot about the data for the loans that were not made and are not present. In many ways this data set is a picture of what lenders consider acceptable risk for a positive expected outcome. That being said, these loans are made in a pretty wide band, and I felt like I was able to tease out some of the levers and filters that were used to make that descision.

Initially, I struggled with the switch to R from python, then I got to know (and love) dplyr and the rest of the tidyverse. Once I got the hang of piping commands and ggplot2 I felt like I could explore without worrying too much about syntax, which felt very freeing and helped with the exploratory nature of this project.

I made the mistake of begining this project in a jupyter notebook, thinking it would be easy to convert it to R Markdown later. I was very wrong about that. Also, some of the quirks of knitting have taken a lot of my time - like trying to get my tables to be readable, but I think the kable package provides a workable solution, and some of the tables turned out to be unnecessary.

Overall, I was surprised how Credit Score was not the main indicator of a loan’s attributes. It was a big one, but in thinking through the data I guessed that it would be heavily weighted, since it is itself composed of a large number of historically tracked data points. Aside from this, the relationship between loan amount and outcome was unanticipated. However, it was fairly easy to see things from the lender’s point of view so many of the relationships were predictable.

I think in the future for this dataset I would like to re-create the model Prosper uses to assign its ‘Prosper Score’. Since we have the data, and the labels, I should be able to create a reasonable fascimile. I also would like to eventually examine at all of the variables. One of the reasons that this project took me so long is that I wanted to thoroughly examine every part of this data. I think using techniques like regression I could identify other variables worth exploring.

Overall, this project has made me an R enthusiast and gave me a better understanding of how to crack open a dataset and begin the process of finding all the insights it has to offer.