## [1] 113937 81
Taking a peak at the dimensions of the data we can see that this dataset has many (81) variables. I used a list of these variables to create a shorter set of variables of interest to explore. There are 113,937 observations (which makes it worrisome that the variable loan key is a factor with only 113066 levels!) Let’s explore that after seeing a summary of the dataset.
## ListingKey Term
## 17A93590655669644DB4C06: 6 Min. :12.00
## 349D3587495831350F0F648: 4 1st Qu.:36.00
## 47C1359638497431975670B: 4 Median :36.00
## 8474358854651984137201C: 4 Mean :40.83
## DE8535960513435199406CE: 4 3rd Qu.:36.00
## 04C13599434217079754AEE: 3 Max. :60.00
## (Other) :113912
## LoanStatus BorrowerAPR BorrowerRate
## Current :56576 Min. :0.00653 Min. :0.0000
## Completed :38074 1st Qu.:0.15629 1st Qu.:0.1340
## Chargedoff :11992 Median :0.20976 Median :0.1840
## Defaulted : 5018 Mean :0.21883 Mean :0.1928
## Past Due (1-15 days) : 806 3rd Qu.:0.28381 3rd Qu.:0.2500
## Past Due (31-60 days): 363 Max. :0.51229 Max. :0.4975
## (Other) : 1108 NA's :25
## LenderYield EstimatedEffectiveYield ProsperScore
## Min. :-0.0100 Min. :-0.183 Min. : 1.00
## 1st Qu.: 0.1242 1st Qu.: 0.116 1st Qu.: 4.00
## Median : 0.1730 Median : 0.162 Median : 6.00
## Mean : 0.1827 Mean : 0.169 Mean : 5.95
## 3rd Qu.: 0.2400 3rd Qu.: 0.224 3rd Qu.: 8.00
## Max. : 0.4925 Max. : 0.320 Max. :11.00
## NA's :29084 NA's :29084
## ListingCategory..numeric. BorrowerState
## Min. : 0.000 CA :14717
## 1st Qu.: 1.000 TX : 6842
## Median : 1.000 NY : 6729
## Mean : 2.774 FL : 6720
## 3rd Qu.: 3.000 IL : 5921
## Max. :20.000 : 5515
## (Other):67493
## Occupation EmploymentStatus
## Other :28617 Employed :67322
## Professional :13628 Full-time :26355
## Computer Programmer : 4478 Self-employed: 6134
## Executive : 4311 Not available: 5347
## Teacher : 3759 Other : 3806
## Administrative Assistant: 3688 : 2255
## (Other) :55456 (Other) : 2718
## EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.00 Min. : 0.0 Min. : 19.0
## 1st Qu.: 26.00 1st Qu.:660.0 1st Qu.:679.0
## Median : 67.00 Median :680.0 Median :699.0
## Mean : 96.07 Mean :685.6 Mean :704.6
## 3rd Qu.:137.00 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :755.00 Max. :880.0 Max. :899.0
## NA's :7625 NA's :591 NA's :591
## IncomeRange IncomeVerifiable LoanOriginalAmount
## $25,000-49,999:32192 False: 8669 Min. : 1000
## $50,000-74,999:31050 True :105268 1st Qu.: 4000
## $100,000+ :17337 Median : 6500
## $75,000-99,999:16916 Mean : 8337
## Not displayed : 7741 3rd Qu.:12000
## $1-24,999 : 7274 Max. :35000
## (Other) : 1427
## MonthlyLoanPayment
## Min. : 0.0
## 1st Qu.: 131.6
## Median : 217.7
## Mean : 272.5
## 3rd Qu.: 371.6
## Max. :2251.5
##
This gives us all sorts of information there at a glance. We get to see an overview of ranges, maximums and counts. Large inconsistencies can be spotted at this stage, but it is a lot to take in at once, so let’s dive into the individual variables I singled out interesting.
I also chose to convert Prosper Score and Term into factors.
Number of listing keys:
## [1] 113937
Number of unique listing keys:
## [1] 113066
Here I wanted to see if the listing keys, which are supposed to be unique identifiers, really were unique. It seems like a lot of them are not.
times_listing_key_appears | count |
---|---|
1 | 112239 |
2 | 790 |
3 | 32 |
4 | 4 |
6 | 1 |
So we can see that many of the keys appear multiple times, with one loan appearing a whopping six times. Let’s see that loan.
ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating..numeric. | ProsperRating..Alpha. | ProsperScore | ListingCategory..numeric. | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent..percentage. | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 4 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 | |||
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 8 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 | |||
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 7 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 | |||
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 10 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 | |||
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 5 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 | |||
17A93590655669644DB4C06 | 951186 | 2013-10-02 17:20:16.550000000 | 60 | Current | 0.16662 | 0.1435 | 0.1335 | 0.1264 | 0.0524 | 0.074 | 5 | B | 6 | 1 | MD | Other | Employed | 26 | False | False | 2013-12-23 09:38:12 | 720 | 739 | 1986-12-26 00:00:00 | 12 | 12 | 20 | 6 | 348 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 14635 | 0.57 | 10865 | 17 | 1 | 0 | 0.41 | $25,000-49,999 | True | 3000 | CB1B37030986463208432A1 | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | 2 | 126059 | 10000 | 2014-01-13 00:00:00 | Q1 2014 | F80D3694083622957BA09F2 | 234.5 | 234.5 | 112.62 | 121.88 | -8.49 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 96 |
It seems this loan is just repeated in the dataset 6 times. Here I run a function to see if any of the columns contains values that aren’t exact duplicates.
## [1] "ProsperScore"
There it is, this loan somehow as 6 different ProsperScores. Let’s see if ProsperScore is the only non-duplicated value in all of the other repeated loans.
It seems that it is. Are these Prosper Scores distributed in a predictable manner?
## ind
## 1 ProsperScore
Taking a look at just a few loans we can see that the spread isn’t always the same between ProsperScores with the same listing ID. Because there doesn’t seem to be a pattern here I elected to eliminate all duplicate rows from my dataset
## [1] "Number of listing keys: "
## [1] 112239
## [1] "Number of unique listing keys: "
## [1] 112239
Now we’re cooking with gas.
## 12 36 60
## 1614 86694 23931
Seems like the minimum loan term is a year, but the mean is 40.83 months, a little over 3 years.
Here we can see that term is discrete, and that 3 year loans far outpace 5 year loans, with 1 year loans coming in a distant third.
## # A tibble: 3 × 3
## Term count_per_term percentage
## <fctr> <int> <dbl>
## 1 12 1614 1.438003
## 2 36 86694 77.240531
## 3 60 23931 21.321466
Here is the same information in a table with percentages.
## Cancelled Chargedoff Completed
## 5 11992 38048
## Current Defaulted FinalPaymentInProgress
## 54928 5018 201
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 794 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 359 309 304
I don’t like that Past Due > 120 is out of order with the other past dues, so I changed the order of the levels.
A quick note on definitions via Investopedia.com – >Chargedoff : “A charge-off or chargeoff is the declaration by a creditor (usually a credit card account) that an amount of debt is unlikely to be collected. This occurs when a consumer becomes severely delinquent on a debt. Traditionally, creditors will make this declaration at the point of six months without payment.”
Default : “Default is the failure to pay interest or principal on a loan or security when due. Default occurs when a debtor is unable to meet the legal obligation of debt repayment … When an individual, a business or even a nation defaults on a debt obligation, the lender or investor has some recourse to reclaim the funds, but it varies based on the type of security involved.
Happily, the majority of the loans are either current or completed. If we were creating a model for this data it would likely be to predict if a new loan was going to become charged-off, defaulted, or past due. Cancelled, as we can see from our table, has just 5 loans.
Here we look at the distribution of Past Due loans.
It seems that most loans that are past due are only 1-15 days past Due. This suggests that many of these loans are will become current again. I’m not sure if loans need to go through the whole 120 days of being past due before the default or become charged off. It is also worth noting that some of the bins here are larger than others, making the graph slightly misleading. (First two are 15 days, next three are 30.)
Since there are just 5 cancelled loans, let’s examine them.
## ListingKey Term LoanStatus BorrowerAPR BorrowerRate
## 1 093D3365430569291709120 36 Cancelled 0.25650 0.2375
## 2 F77F33664545520971829DC 36 Cancelled 0.14659 0.1395
## 3 09E1336627538645259AE42 36 Cancelled 0.20735 0.2000
## 4 29453366240667435235AEE 36 Cancelled 0.25650 0.2375
## 5 E77E34330098056070F5C30 36 Cancelled 0.16219 0.1075
## LenderYield EstimatedEffectiveYield ProsperScore
## 1 0.2325 NA <NA>
## 2 0.1345 NA <NA>
## 3 0.1950 NA <NA>
## 4 0.2325 NA <NA>
## 5 0.0975 NA <NA>
## ListingCategory..numeric. BorrowerState Occupation
## 1 0
## 2 0 Administrative Assistant
## 3 0
## 4 0 NC
## 5 1 FL Other
## EmploymentStatus EmploymentStatusDuration CreditScoreRangeLower
## 1 NA 520
## 2 Not available NA 640
## 3 NA NA
## 4 NA 500
## 5 Full-time 30 720
## CreditScoreRangeUpper IncomeRange IncomeVerifiable LoanOriginalAmount
## 1 539 Not displayed False 1000
## 2 659 Not displayed True 2500
## 3 NA Not displayed True 3000
## 4 519 Not displayed True 1000
## 5 739 $25,000-49,999 True 1000
## MonthlyLoanPayment
## 1 39.10
## 2 85.38
## 3 111.49
## 4 39.10
## 5 32.62
Seems that these are loans that started to be funded and then were cancelled. There are only 5 of them, and they all seem to have been closed less than a month after creation.
Our original LoanStatus bar chart had a bit too much going on, so above I have simplified it. This chart gives us a better idea of how many loans are past due.
These LoanStatus’ are intersting, but a lot depends on the data. Is this a snapshot of a month, a year? If this was for all time Completed and Defaulted will eventually become a huge part of the mix as the amount of loan history outpaces loans being originated. This distribution makes me think that it is a snapshot of a certain time frame.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.21020 0.21910 0.28490 0.51230 25
The Borrower APR is interesting to me in several ways. First off, there are people with incredible deals and terrible deals. Our minimum is .6% - that’s a pretty great rate. I don’t know anyone getting a rate that rate cheap outside of the Fed. On the other end, there is a borrower paying 51% APR. Yikes. What I’m starting to like about this dataset is you can begin to see where the line is set for risk. Prosper, or their lending partners, thinks there is a positive return on someone so risky you need to charge them over half the principal per year to loan it.
Several interesting things there. First off it’s almost a normal distribution with an average around 20%, but it seems to have spikes near round APR’s, as well as a spike a bit under 30%. There is also a large spike at around 35%. This histogram is a little noisy, so let’s look at a density plot.
## Warning: Removed 25 rows containing non-finite values (stat_density).
The peaks are easier to see with a density plot. But it is easy to see that a large amount of loans hover around 20% APR.
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).
It seems that a couple of datapoints at the top of the range could be considered outliers. Let’s see how many.
## count
## 1 3
## ListingKey ListingNumber ListingCreationDate
## 1 844033650124564886B3EDC 690 2006-02-23 13:57:02.087000000
## 2 A79D33661366830833F3EF5 2231 2006-03-16 19:30:16.753000000
## 3 BBED336465905564254DC8B 1112 2006-03-02 19:00:17.593000000
## CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate
## 1 HR 36 Completed 2006-03-14 00:00:00 0.51229 0.4975
## 2 HR 36 Defaulted 2006-09-19 00:00:00 0.50633 0.4975
## 3 HR 36 Defaulted 2006-09-05 00:00:00 0.48873 0.4800
## LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn
## 1 0.4925 NA NA NA
## 2 0.4800 NA NA NA
## 3 0.4750 NA NA NA
## ProsperRating..numeric. ProsperRating..Alpha. ProsperScore
## 1 NA <NA>
## 2 NA <NA>
## 3 NA <NA>
## ListingCategory..numeric. BorrowerState Occupation EmploymentStatus
## 1 0
## 2 0
## 3 0
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## 1 NA False False
## 2 NA False False
## 3 NA False False
## GroupKey DateCreditPulled
## 1 2006-02-23 13:57:02.087000000
## 2 F555336594467179889CEFB 2006-03-16 18:39:30.963000000
## 3 2006-03-02 11:32:57.040000000
## CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine
## 1 NA NA
## 2 NA NA
## 3 NA NA
## CurrentCreditLines OpenCreditLines TotalCreditLinespast7years
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## TotalInquiries CurrentDelinquencies AmountDelinquent
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## DelinquenciesLast7Years PublicRecordsLast10Years
## 1 NA NA
## 2 NA NA
## 3 NA NA
## PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## AvailableBankcardCredit TotalTrades TradesNeverDelinquent..percentage.
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable
## 1 NA 0.02852 Not displayed True
## 2 NA 0.03909 Not displayed False
## 3 NA 0.10021 Not displayed True
## StatedMonthlyIncome LoanKey TotalProsperLoans
## 1 2838.000 08423365150780340A7980C NA
## 2 0.000 51A6336556683591380DA02 NA
## 3 1583.333 66E93364923759158FBB4B6 NA
## TotalProsperPaymentsBilled OnTimeProsperPayments
## 1 NA NA
## 2 NA NA
## 3 NA NA
## ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
## 1 NA NA
## 2 NA NA
## 3 NA NA
## ProsperPrincipalBorrowed ProsperPrincipalOutstanding
## 1 NA NA
## 2 NA NA
## 3 NA NA
## ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent
## 1 NA 0
## 2 NA 211
## 3 NA 225
## LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber
## 1 NA 97 89
## 2 6 96 245
## 3 6 96 137
## LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter
## 1 1500 2006-02-27 00:00:00 Q1 2006
## 2 3000 2006-03-21 00:00:00 Q1 2006
## 3 3000 2006-03-07 00:00:00 Q1 2006
## MemberKey MonthlyLoanPayment LP_CustomerPayments
## 1 45F83366003403405932608 80.94 1530.67
## 2 9151336553609388240ECB8 161.88 158.70
## 3 0FA73365900322998FDD36E 158.66 158.66
## LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees
## 1 1500.00 30.67 -0.29
## 2 35.12 123.58 -1.27
## 3 36.36 122.30 -1.27
## LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss
## 1 0 0.00 0.00
## 2 0 2964.88 2964.88
## 3 0 2963.64 2963.64
## LP_NonPrincipalRecoverypayments PercentFunded Recommendations
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## 1 0 0 9
## 2 0 0 21
## 3 0 0 20
## simple.LoanStatus
## 1 Completed
## 2 Defaulted
## 3 Defaulted
Just because they’re outliers doesn’t mean they’re not valid. There are three loans with extremely high Borrower APR, and it seems that one of them was paid off and two defaulted. They were all made in Q1 2006 and were for $1,500 and $3,000. I wonder if this was a quirk that was later worked out?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1342 0.1840 0.1931 0.2511 0.4975
Here we notice that the rate is NOT the same thing as the APR. Looking at the definitions document we’re simply told that the APR is the Annual Percentage Rate and that the Borrow’s Rate is the borrower’s interest rate for the loan. Some research tells us that these are calculated by (from Investopedia):
APR - Annual Percentage Rate. A percentage that represents the actual yearly cost of funds over the term of the loan. Includes additional costs or fees but does not take compounding into account.
Interest Rate - Refers only to the interst charged on the loan, doesn’t take other expenses into acount. Therefore, APR tends to be higher than the loan’s nominal interst rate.
APY or EAR - Annual Percentage Yield or Effective Annual Rate takes compound interest into account. Tends to be larger than APR on the same loan.
From researching the issue it seems that APR can be misleading when it comes to actual cost, and is best used when the loan is long-term, otherwise it does not account properly for costs and fees. However, for our purposes it is probably a good way to measure what a borrower is being charged.
Looks like the same 3 outliers from BorrowerAPR on the high end here.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.117 0.162 0.169 0.225 0.320 29084
You can see that this distribution mimics the distribution of APR’s but moved slightly lower, with some tail to the left as well. Let’s see what’s causing negative estimated yield loans.
## ListingKey Term LoanStatus
## 009B3495741061236C0C927: 1 12: 0 Completed :146
## 02283482060850607BC27FE: 1 36:190 Chargedoff : 42
## 02933496318097934A683F4: 1 60: 0 Defaulted : 2
## 029A3499706893097DB1393: 1 Cancelled : 0
## 038535015703591930CD2E0: 1 Current : 0
## 03AA34675425410661C8213: 1 FinalPaymentInProgress: 0
## (Other) :184 (Other) : 0
## BorrowerAPR BorrowerRate LenderYield
## Min. :0.08999 Min. :0.0693 Min. :0.0593
## 1st Qu.:0.21306 1st Qu.:0.1833 1st Qu.:0.1733
## Median :0.29031 Median :0.2500 Median :0.2400
## Mean :0.29372 Mean :0.2607 Mean :0.2507
## 3rd Qu.:0.37453 3rd Qu.:0.3500 3rd Qu.:0.3400
## Max. :0.41355 Max. :0.3500 Max. :0.3400
##
## EstimatedEffectiveYield ProsperScore ListingCategory..numeric.
## Min. :-0.18270 1 :105 Min. :1.000
## 1st Qu.:-0.06338 5 : 30 1st Qu.:1.000
## Median :-0.03420 3 : 17 Median :2.000
## Mean :-0.04802 2 : 16 Mean :3.237
## 3rd Qu.:-0.01787 4 : 12 3rd Qu.:6.000
## Max. :-0.00060 8 : 5 Max. :7.000
## (Other): 5
## BorrowerState Occupation EmploymentStatus
## CA :33 Other :50 Full-time :107
## MI :14 Professional :23 Employed : 33
## FL :13 Military Enlisted:14 Self-employed: 33
## GA :13 Teacher :10 Retired : 8
## CO : 9 Sales - Retail : 7 Not employed : 5
## TX : 9 Analyst : 6 Part-time : 4
## (Other):99 (Other) :80 (Other) : 0
## EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.00 Min. :600.0 Min. :619.0
## 1st Qu.: 26.25 1st Qu.:620.0 1st Qu.:639.0
## Median : 65.50 Median :640.0 Median :659.0
## Mean : 84.34 Mean :657.6 Mean :676.6
## 3rd Qu.:123.00 3rd Qu.:680.0 3rd Qu.:699.0
## Max. :303.00 Max. :820.0 Max. :839.0
##
## IncomeRange IncomeVerifiable LoanOriginalAmount
## $25,000-49,999:67 False: 55 Min. : 1000
## $50,000-74,999:56 True :135 1st Qu.: 1500
## $1-24,999 :21 Median : 2438
## $100,000+ :20 Mean : 3369
## $75,000-99,999:20 3rd Qu.: 4000
## Not employed : 5 Max. :15000
## (Other) : 1
## MonthlyLoanPayment
## Min. : 0.00
## 1st Qu.: 54.83
## Median : 90.68
## Mean :135.49
## 3rd Qu.:175.07
## Max. :678.37
##
It seems that some of the EstimatedEffectiveYields have a negative value. I will explore this in the Bivariate Exploration.
## 1 2 3 4 5 6 7 8 9 10 11 NA's
## 917 5632 7456 12300 9661 12064 10392 11875 6813 4637 1408 29084
Here you can see that ProsperScore is discrete, and potentially bimodal? Let’s see what the scale is here.
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
1 to 11? That’s a strange scale for scoring. Maybe it’s analogous to 0 to 10, but they don’t want to make people feel bad by giving them 0’s.
In our dataset the Listing Categories are numeric, so I wrote a short function to convert them to their string values.
It looks like Debt Consolidation is the most popular reason to apply for a loan.
This bar chart doesn’t actually tell us much because we would have to weight these by the population of each state to see if there are real differences between them. In fact the first four are ordered exactly as they would be ordered by population.
## # A tibble: 10 × 2
## Occupation count
## <fctr> <int>
## 1 Other 28242
## 2 Professional 13440
## 3 Computer Programmer 4407
## 4 Executive 4250
## 5 Teacher 3699
## 6 Administrative Assistant 3666
## 7 Analyst 3554
## 8 Sales - Commission 3384
## 9 Accountant/CPA 3185
## 10 Clerical 3108
Again, to see if any occupation actually had an undue number of loans we would have to get an estimate of how many people had each job in the population. Still, categorical variables like this are always interesting to look at since they put a human face on our data.
## # A tibble: 9 × 2
## EmploymentStatus count
## <fctr> <int>
## 1 Employed 65912
## 2 Full-time 26353
## 3 Self-employed 5973
## 4 Not available 5347
## 5 Other 3681
## 6 2255
## 7 Part-time 1088
## 8 Not employed 835
## 9 Retired 795
Oh, so many questions here.
Are some of these categories subsets of other categories? (ie Does ‘Employed = Full-time + Part-Time + Self Employed’? Let’s try to find out.
## Self_Full_Or_Part_time
## 1 33414
## Employed
## 1 65912
Even with the addition of ‘Other’ and the NULLs the categories would not approach the number of employment. The way these categories are split up makes this variable less usefull.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.05 137.00 755.00 7625
## status.mean status.median status.min status.max
## 1 8.004111 5.583333 0 62.91667
Looks like someone claims to have worked somewhere for almost 63 years! Maybe they’re saying the’ve been employed for 63 years straight. This distribution is what we would expect to see, with the exception of the responders who sayed their duration was 0 months.
Credit Range Upper and Lower. I would like to know the interval between the scores for each person.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.5 739.0 899.0 591
Interesting that there are Credit Scores under 300, which is the lowest Credit Score you can have.
## int_mean int_median int_max int_min
## 1 19 19 19 19
Surprisingly the interval between CreditScoreRangeUpper and CreditScoreRangeLower isthe same for every loan, which saves me the trouble of constructing a variable to represent the most likely single score.
## Warning: Removed 591 rows containing non-finite values (stat_bin).
The two bumps in credit score here remind me of the bumps in APR/Rate. Potentially there is correlation there - more people with the same credit score more people with the same rate. I wonder how Prosper aquires their score, is it just Fico or something more? Also it would be interesting to see what credit score distribution looks like in the general population versus in this dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1246 0.1740 0.1831 0.2411 0.4925
The same spike that exists in the other Rate plots exists her at around 30%. Hard to know if this is a pricing strategy, or a rate that is popular for other reasons. It is roughly equivalent to a credit card. The max is slightly under the max borrower rate, so fees paid to Prosper must not be included in this number.
Looks like most loans go to people with verifiable incomes.
The distribution of IncomeRanges for non-verified incomes is slightly different from the verified incomes. Seems like $100k+ gets a bump upwards in particular.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 130.6 216.6 271.4 369.7 2252.0
Looks like there’s a big outlier there with our max 2252. But our histogram makes it hard to see this. Let’s zoom in on the bulk of the payments.
After 200 there is a sharp drop off in monthly payments. Perhaps there is some sort of sales tactic there, or a correlation with a common loan amount.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6100 8293 12000 35000
We can see that Loan Amounts tend to be in whole numbers. Let’s see the most common loan amounts.
## # A tibble: 9 × 2
## LoanOriginalAmount count
## <int> <int>
## 1 4000 14085
## 2 15000 12065
## 3 10000 10816
## 4 5000 6918
## 5 2000 6020
## 6 3000 5709
## 7 25000 3548
## 8 1000 3206
## 9 20000 3178
Lots of outliers here at the upper range of Loan Original Amount. It makes sense that there would be less people with the high incomes that would qualify them for these large loans.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7000 1.0000 1.0000 0.9986 1.0000 1.0120
OK, so it looks like the vast percentage were 100% funded (or they probably wouldn’t be in this dataset!) Let’s take a closer look at those that weren’t.
Also, it looks like minimally 70% of the loans in the set were funded. Maybe that is Prosper’s cutoff?
The only thing I see for sure here is that of the loans that do not get 100% funded, many are 70% funded, and the amount at each precentage from 70-99% trends downward until it hits 100%, where most of the loans reside. This could be interpreted as the higher your funding percentage the more likely you are to be funded 100%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 45.00 81.26 116.00 1189.00
Should be related to the size of the loan.
Wow, on a a platform that lets you pool your risk with other investors, a vast majority of loans still only have 1 investor.
## # A tibble: 10 × 2
## Investors count
## <int> <int>
## 1 1 26579
## 2 2 1364
## 3 3 967
## 4 4 817
## 5 5 747
## 6 6 721
## 7 7 690
## 8 8 745
## 9 9 717
## 10 10 720
There are 112,239 unique loans in this dataset, with 81 features about both the loans (APR, Term, Status) and the Borrower (Occupation, Income Range, Credit Score).
The main features of interest here are:
* What makes someone credit-worthy? / What factors determine how much you pay to borrow money? This is reflected in BorrowerApr. * What determines how much you can borrow? (OriginalLoanAmount) * What loans are likely to go bad? (LoanStatus) * and, for me, ‘What can we learn about the people these loans were made to?’ (Occupation, State, etc)
I think Credit Score and Income range are likely to be large factors in credit-worthyness, and how much you can borrow. Employment, ProsperScore, and Credit Score (which is already an aggregate of many credit-related data points) should be good indicators for rate. And variables like State, Occupation and Employment Duration will let me know more about the borrowers.
The biggest cleaning opporation I had to undertake was eliminating all of the duplicate loans that shared unique identifiers (ListingKey) and were identical except for Prosper Score.
I will be filtering out the loans with CreditScoreRangeUpper ‘19’ as they seem to represent a credit-not-available value.
## Warning in ggcorr(., hjust = 0.95, size = 5, color = "black", layout.exp =
## 2): data in column(s) 'ListingKey', 'Term', 'LoanStatus', 'ProsperScore',
## 'BorrowerState', 'Occupation', 'EmploymentStatus', 'IncomeRange',
## 'IncomeVerifiable' are not numeric and were ignored
This is a nice preview of things to come. Rates to the borrower and yield for the lender are positively correlated. Both Yield and Rate have a slight negative correlation to variables that provide credit worthiness, such as Credit Score and Employment Status Duration. And, Prosper Score seems to have very little correlation with the rest of the variables.
Seems like there are different patterns for different Income Ranges. For $1-24,999 and $25,000-49,999 the preference seems to be for 3 year term loans. For $50,000-74,999 long term 5-year loans are more popular. In the $100k and above range, short term one-year loans are the most popular, followed by 5 year-loans.
Here we can see that 3-year loans are more likely to default than others.
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).
Median APR is practically lat over all three terms. However, longest term loans tend to be in a tighter, lower APR range.
## Warning: Removed 29084 rows containing non-finite values (stat_boxplot).
It makes intuitive sense that 5-year loans would have a higher yield, since they have longer to collect interest. We know that 3-year loans have the largest range because there are more loans in this category, and we’ve already explored the negative yield, closed-out loans.
In our univariate plots we noticed that there are some negative expected yields. Let’s take a deeper look at that.
Looks like all of the 184 loans that have a negative EstimatedEffectiveYield are not current, but have either been completed, chargedoff, or defaulted. This probably represents either the loss incurred from the loans that where completed ahead of schedule, or had to be taken as a loss. It’s not really an ‘estimate’ for any of these statuses because the loss has been realized.
Seems that in general the longer the term the higher the median amount of the loan. Interesting to note the hard ceiling of 35k.
## [1] 35000
Completed and Current loans tend to have a lower APR than loans that are Past Due or that have Defaulted. A hopeful thing I notice here is that Past Due > 120 Days has a higher median APR than the short past due categories, so perhaps only the higher APR loans of those groups will default. Unfortunately, they have roughly the same APR as the chargedoff loans, so maybe things aren’t that much sunnier.
So there is a lot we can’t see in these charts, but mostly we’re mostly checking if there is any Listing Category that consistently leads to bad outcomes. It seems that Personal Loan, Not Available, and Student use result in a higher amount of bad outcome loans. Loans for Baby&Adoption and Wedding Loans tend to stay current.
Here we see that people in the $0 and Not employed income ranges have a higher proportion of Chargedoff, Defaulted, and PastDue Loans. Charge offs and defaults decrease with higher IncomeRange, though Past Due tends to stay flat.
For this graph I eliminated ‘cancelled’ and ‘final payment in progress’ loans because they are such a small part of the mix. It is also important to note that Completed, Chargedoff and Defaulted loans should grow with the amount of time included with this snapshot. If we had 20 years of loan data they would be many times the amount of currently active loans.
This is some of what we would expect to see. The higher your Credit Score the lower your APR, in general.
It seems that even if you have a lower credit score you can get a loan at the high interest rage of .3 or .375. I wonder if the high interest loans made to borrowers with a high credit score are low amounts? I’ll explore this in the multivariate section.
We see some lines at APR’s that were present in our overall BorrowerAPR distribution. Especially that line at 30%. I wonder if there is legal lending limits or psycological pricing is at work there.
Looks like there is something that causes some people with no credit to be listed with a Lower Credit Score Range of 0 and an Upper of 19. It’s interesting to me that loans can be made to these types of users. Might be worth exploring this population further.
On a superficial level it seems that highly paid professions get lower average APR’s. But that just reiterates an obvious relationship between high income, good credit, and the ability to repay loans. I did smile when I saw that judges in particular have a low average apr. On further investigation there are 22 people who claimed Judge as their occupation in our dataset.
## Judge_Count
## 1 22
It seems like being ‘Not Employed’ results in a higher mean APR.
It doesnt seem that duration of employment status for people who are employed has much of an effect on Borrower APR.
ProsperScore seems to be the best indicator for APR’s that we’ve explored - of course ProsperScore is a variable created by Prosper out of all the other variables they care about, so it’s completely surprising. You can see that in each of the scores there are outliers, so ProsperScore isn’t a total predictor, and there are probably other variables, such as Loan Amount, that will change the APR even if you have a certain score.
## # A tibble: 4 × 2
## EmploymentStatus.simple `n()`
## <chr> <int>
## 1 Employed 98238
## 2 Not employed 835
## 3 Part-time 1088
## 4 Retired 795
I like this scatter plot with jitter because you can both see that there aren’t many loans in the extremes of either range, and that there is a linear, positive relationship between Credit Score and Prosper Score, which is what we would expect. Distractingly, loans that do not have a Prosper Score have a wide range of Credit Scores.
Listing Category and Borrower State
## Source: local data frame [52 x 3]
## Groups: BorrowerState [52]
##
## BorrowerState ListingCategory count
## <fctr> <chr> <int>
## 1 Not Available 5474
## 2 AK Debt Consolidation 102
## 3 AL Debt Consolidation 817
## 4 AR Debt Consolidation 485
## 5 AZ Debt Consolidation 913
## 6 CA Debt Consolidation 7509
## 7 CO Debt Consolidation 1198
## 8 CT Debt Consolidation 946
## 9 DC Debt Consolidation 230
## 10 DE Debt Consolidation 183
## # ... with 42 more rows
I explored the relationship between state and Listing Category, but the overwhelming majority of loans are for Debt Consolidation or have a category of ‘Not Available’ so there was nothing surprising here.
## Source: local data frame [5 x 5]
## Groups: bin [5]
##
## bin ListingCategory count minCredit maxCredit
## <int> <chr> <int> <int> <int>
## 1 1 Not Available 8994 379 659
## 2 2 Debt Consolidation 12590 659 699
## 3 3 Debt Consolidation 13408 699 719
## 4 4 Debt Consolidation 13111 719 759
## 5 5 Debt Consolidation 10703 759 899
Pretty all income ranges are listing for Debt Consolidation, or the info is unavailable.
## Source: local data frame [53 x 3]
## Groups: BorrowerState [52]
##
## BorrowerState IncomeRange count
## <fctr> <fctr> <int>
## 1 Not displayed 3813
## 2 AK $50,000-74,999 71
## 3 AL $25,000-49,999 529
## 4 AR $25,000-49,999 326
## 5 AZ $25,000-49,999 594
## 6 CA $50,000-74,999 3858
## 7 CO $50,000-74,999 650
## 8 CT $50,000-74,999 466
## 9 DC $100,000+ 100
## 10 DE $25,000-49,999 93
## # ... with 43 more rows
Interesting to see that all our out of range Credit Scores are in the ‘Not Displayed’ Income Range. There seems to be a positive correlation between haveing a higher Income Range and having a high credit score, but we can also see that most of the loans have been made to people with a Credit Score centered around 700, with a floor at about 525 (except for Not Displayed Income Range)
In general, if you have income, the higher your income range the higher your monthly loan payment can be.
The higher the income, the higher the median Loan Amount.
Loan Terms don’t change APR much, but longer term loans tend to be in a tighter, lower APR range.
Having a higher Credit Score results in a lower APR in general, with some interesting reoccuring interest rates.
Having a high credit score and high Income Range are correlated.
Being unemployed correlates with having a higher APR, though employment status being ‘Not Available’ had the lowest mean apr.
ProsperScore good APR indicator. but indicator of inditcators.
The longer the loan term the higher the median amount of loans - with the maximum loan amount being 35k.
Higher Income Ranges also tend to have higher monthly payments, probably because they tend to be taking out larger loans.
Loan category does seem to have an effect on whether or not loans go bad (chargeoff, default).
Chargeoffs and defaults decrease with higher IncomeRange, though the amount of Past Due loans remains relatively flat.
The relationship between ProsperScore and APR is very strong, though that is to be expected since ProsperScore is an amalgam of all the variables Prosper finds to be import, and is probably part of the process of assigning an APR.
There is some overplotting here, but there is also a lot of information. We see that really large loans are only made to people that are in the $100,000+ Income Range. This makes sense not just because it’s the largest income but because as a category it extends to every possible Income Range above it. For people with an IncomeRange below 100k it seems that regardless of credit score their maximum loan amount is $25k. Also, lines emerge for Original Loan Amounts at easily chosen amounts - 10k, 15k, 20k. It seems that for most borrowers 25k is a hard ceiling unless you have a high income. It seems that for income ranges above $25k the relationship between Original Loan Amount and Credit Score is relatively constant amount income ranges. Below $25k a year, loans in the $0 bin have a loan amount that increases more quickly than others given a higher credit score, and the ‘Not employed’ bin experiences a steeper increase in Loan Amount than people in the $1-24,999 bin across the same Credit Scores.
My intuition was right about lower Original Loan Amounts resulting in high APRs. From this chart we can see that the best APRs seem to go to people with hgh credit scores, borrowing medium amounts of money.
It seems that, for people with income, larger loans amount are more likely to stay current. This surprises me, but may show that, in general, if Prosper is willing to loan you a lot of money they are very certain you will not become delinquent.
Here, each band is a term, and you can see that it tends to be the Loans with higher monthly payments within their term that have bad outcomes. This is deffinitely not causality, those loans had higher APR’s because the borrowers were deamed riskier.
Seeing Loan Amount, Credit Score, and Income range together strengthed what I knew about their relationships and highlighted the role Income Range plays, especially in large loans.
By combining APR, Credit Score and Loan Amount we can see that, in general, larger loans have lower APRs they don’t have the lowest, and that the largest loans are centered in a credit score of 750-850 and an APR between 10 and 20%.
Outcome by Income Range and Loan Amount was surprising - but reinforces the fact that this dataset is of loans that were deemed worthy to be made.
Deffinitely the last boxplot that shows Larger Loans being paradoxically more likely to have good outcomes was surprising to me, since larger loans are more difficult to pay back - but ability to repay is related to income and other factors.
It is also important to remember that this is a set of loans that were made. What we don’t see is loans that fell outside of the parameters of Prosper’s lending. All of these loans were deemed by lenders to be of acceptable risk.
One of my first stops with this dataset was to look at the distribution of rates for borrowers, and the potential yields for Lenders. When looked at together Estimated Effective Yield has almost the same distribution, shifted lower on the scale of percentage rate. This make sense, because the yield will be lower than borrower is paying, and since the yield is estimated it is a little smoother than the actual borrower rate distribution of all of the loans in the dataset.
One of my most surprising findings was that, on a platform for P2P lending, a 31% of the loans are originated by just one investor. If it was small, individual investors in these loans looking for maximum diversification I think there would be a mean around the average size of a loan divided by the minimum investment per loan ($25). It’s possible that institutional investors account for the large amount of these solo-investor loans. There were a lot of articles several years ago about how large investors, like hedge funds and banks, were getting into the P2P space, and there are even bonds that are bonds backed by P2P loans. In fact, I believe the prefered nomenclature is now ‘Marketplace Lending’, not P2P.
## sum_count
## 1 0.3102848
I included this plot because it both surprised me, and illustrated the strength of the Income variable in this dataset. It suggests that, for people with income, larger loans are more likely to stay current, and avoid becoming Past Due, Defaulted or Charged Off. My intuition is that Prosper is more willing to make larger loans to people who are more capable of paying them back. Also, our ‘Income Range’ variable includes some large buckets, and we can’t discount the fact that making $1 is very different than making $24,999 and that people making $75k are better off that people making $50k, but in both of these cases the loans are in the same Income Range bucket.
Exploring this complex dataset was very enjoyable. It was hard to not explore each of the 81 variables. The data often had surprising quirks, like the duplicate listing keys and the 0-19 Credit Score Range, but mostly it told the story of how Prosper vets high quality borrowers.
I think a lot about the data for the loans that were not made and are not present. In many ways this data set is a picture of what lenders consider acceptable risk for a positive expected outcome. That being said, these loans are made in a pretty wide band, and I felt like I was able to tease out some of the levers and filters that were used to make that descision.
Initially, I struggled with the switch to R from python, then I got to know (and love) dplyr and the rest of the tidyverse. Once I got the hang of piping commands and ggplot2 I felt like I could explore without worrying too much about syntax, which felt very freeing and helped with the exploratory nature of this project.
I made the mistake of begining this project in a jupyter notebook, thinking it would be easy to convert it to R Markdown later. I was very wrong about that. Also, some of the quirks of knitting have taken a lot of my time - like trying to get my tables to be readable, but I think the kable package provides a workable solution, and some of the tables turned out to be unnecessary.
Overall, I was surprised how Credit Score was not the main indicator of a loan’s attributes. It was a big one, but in thinking through the data I guessed that it would be heavily weighted, since it is itself composed of a large number of historically tracked data points. Aside from this, the relationship between loan amount and outcome was unanticipated. However, it was fairly easy to see things from the lender’s point of view so many of the relationships were predictable.
I think in the future for this dataset I would like to re-create the model Prosper uses to assign its ‘Prosper Score’. Since we have the data, and the labels, I should be able to create a reasonable fascimile. I also would like to eventually examine at all of the variables. One of the reasons that this project took me so long is that I wanted to thoroughly examine every part of this data. I think using techniques like regression I could identify other variables worth exploring.
Overall, this project has made me an R enthusiast and gave me a better understanding of how to crack open a dataset and begin the process of finding all the insights it has to offer.