college-scorecard-project.Rmd

---
title: "College Scorecard Analytics Project"
author: "Thomas P Kiehne"
date: "December 8, 2015"
output: html_document
---
```{r global_options, include=FALSE, echo=FALSE}
options(width = 116, scipen = 5, digits = 2)
knitr::opts_chunk$set(comment=NA, fig.width=9, fig.height=6, warning=FALSE, message=FALSE)
rm( list = ls())  # Clear environment

#install.packages("stringr")
#install.packages("plyr")
#install.packages("reshape2")
#install.packages("corrgram")
#install.packages("ggplot2")
#install.packages('caret', dependencies = TRUE)
#install.packages("pROC")
library(stringr)
library(plyr)
library(reshape2)
library(corrgram)
library(ggplot2)
library(caret)
library(pROC)
```

About The Data
--------------

[The College Scorecard project](https://collegescorecard.ed.gov) aggregates data that are provided through federal reporting from institutions, data on federal financial aid, and tax information.  These data provide insights into the performance of schools that receive federal financial aid dollars, and the outcomes of the students of those schools. The project intends to provide data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals.

In this document we will collect and analyze the most recent Scorecard data to attempt to determine the following:

* What colleges provide the highest lifetime value their graduates, and
* What attributes of a college contribute the most to the chances of student graduation.


Data Collection and Cleaning
----------------------------

This section outlines the steps taken to retrieve the Scorecard data and prepare it for analysis

The most recent data are available for download in CSV format.  We download the most recent published data and read the data into memory as shown below.  One particular detail is that some NSLDS and Treasury elements are protected for privacy purposes; those data are shown as PrivacySuppressed.  This is not consistent between institutions; some variables may be PrivacySuppressed for one instition while not for other institutions.  Aside from the obvious bias this introduces for those variables, for our purposes these fields are effectively missing and are coded as NA.


```{r retreiveDataRecent}
# get data file if not present
# ref: https://collegescorecard.ed.gov/data/
if(!file.exists("./Most+Recent+Cohorts+(All+Data+Elements).csv") &
   file.exists("./Most+Recent+Cohorts+(All+Data+Elements).zip")) {
  unzip("./Most+Recent+Cohorts+(All+Data+Elements).zip", "./Most+Recent+Cohorts+(All+Data+Elements).csv")
}
if(!file.exists("./Most+Recent+Cohorts+(All+Data+Elements).csv")) {
  download.file("https://s3.amazonaws.com/ed-college-choice-public/Most+Recent+Cohorts+(All+Data+Elements).csv", "./Most+Recent+Cohorts+(All+Data+Elements).csv", mode="wb", method="curl")
}

data <- read.csv("./Most+Recent+Cohorts+(All+Data+Elements).csv", stringsAsFactors=F, na.strings=c("NULL","PrivacySuppressed"))
```


The [Data Dictionary](https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary-09-08-2015.csv) defines codes for a number of coded variables.  In this section we translate those variables into factors and provide the code labels where appropriate.

```{r datacleaning}
# Predominant Degree
data$PREDDEG <- factor(data$PREDDEG, 0:4, labels=c("Not classified", 
                               "Predominantly certificate-degree granting", 
                               "Predominantly associate's-degree granting", 
                               "Predominantly bachelor's-degree granting", 
                               "Entirely graduate-degree granting"))

# Highest Degree
data$HIGHDEG <- factor(data$HIGHDEG, 0:4, labels=c("Non-degree-granting", 
                               "Certificate degree", 
                               "Associate degree", 
                               "Bachelor's degree", 
                               "Graduate degree"))

# Convert state abbreviaions to factor
data$STABBR <- factor(data$STABBR)

# Translate FIPS into State names
data$st_fips <- factor(data$st_fips,
                       c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,
                         19,20,21,22,23,24,25,26,27,28,29,30,31,
                         32,33,34,35,36,37,38,39,40,41,42,44,45,
                         46,47,48,49,50,51,53,54,55,56,60,64,66,
                         69,70,72,78),
                       labels=c("Alabama","Alaska","Arizona","Arkansas","California","Colorado",
                                "Connecticut","Delaware","District of Columbia",
                                "Florida","Georgia", "Hawaii","Idaho","Illinois",
                                "Indiana","Iowa","Kansas","Kentucky","Louisiana", "Maine",
                                "Maryland","Massachusetts","Michigan","Minnesota","Mississippi",
                                "Missouri","Montana","Nebraska","Nevada","New Hampshire",
                                "New Jersey", "New Mexico","New York","North Carolina",
                                "North Dakota","Ohio","Oklahoma", "Oregon","Pennsylvania",
                                "Rhode Island","South Carolina","South Dakota", "Tennessee",
                                "Texas","Utah","Vermont","Virginia","Washington",
                                "West Virginia", "Wisconsin","Wyoming","American Samoa",
                                "Federated States of Micronesia", "Guam",
                                "Northern Mariana Islands","Palau","Puerto Rico",
                                "Virgin Islands")
)

# Institutional control / type
data$CONTROL <- factor(data$CONTROL, 1:3, c("Public","Private nonprofit","Private for-profit"))

# Region
data$region <- factor(data$region, 0:9, c("U.S. Service Schools",
                                "New England",
                                "Mid East",
                                "Great Lakes",
                                "Plains",
                                "Southeast",
                                "Southwest",
                                "Rocky Mountains",
                                "Far West",
                                "Outlying Areas"))

# Locale description
data$LOCALE <- factor(data$LOCALE, c(11,12,13,21,22,23,31,32,33,41,42,43), c("City: Large",
                                 "City: Midsize",
                                 "City: Small",
                                 "Suburb: Large",
                                 "Suburb: Midsize",
                                 "Suburb: Small",
                                 "Town: Fringe",
                                 "Town: Distant",
                                 "Town: Remote",
                                 "Rural: Fringe",
                                 "Rural: Distant",
                                 "Rural: Remote"))

# Carnegie basic classicifcation
data$CCBASIC <- factor(data$CCBASIC, 0:33, c("Not classified",
                                "Associate's-Public Rural-serving Small",
                                "Associate's-Public Rural-serving Medium",
                                "Associate's-Public Rural-serving Large",
                                "Associate's-Public Suburban-serving Single Campus",
                                "Associate's-Public Suburban-serving Multicampus",
                                "Associate's-Public Urban-serving Single Campus",
                                "Associate's-Public Urban-serving Multicampus",
                                "Associate's-Public Special Use",
                                "Associate's-Private Not-for-profit",
                                "Associate's-Private For-profit",
                                "Associate's-Public 2-year colleges under 4-year universities",
                                "Associate's-Public 4-year Primarily Associate's",
                                "Associate's-Private Not-for-profit 4-year Primarily Associate's",
                                "Associate's-Private For-profit 4-year Primarily Associate's",
                                "Research Universities (very high research activity)",
                                "Research Universities (high research activity)",
                                "Doctoral/Research Universities",
                                "Master's Colleges and Universities (larger programs)",
                                "Master's Colleges and Universities (medium programs)",
                                "Master's Colleges and Universities (smaller programs)",
                                "Baccalaureate Colleges--Arts & Sciences",
                                "Baccalaureate Colleges--Diverse Fields",
                                "Baccalaureate/Associate's Colleges",
                                "Special Focus Institutions--Theological seminaries, Bible colleges, and other faith-related institutions",
                                "Special Focus Institutions--Medical schools and medical centers",
                                "Special Focus Institutions--Other health professions schools",
                                "Special Focus Institutions--Schools of engineering",
                                "Special Focus Institutions--Other technology-related schools",
                                "Special Focus Institutions--Schools of business and management",
                                "Special Focus Institutions--Schools of art, music, and design",
                                "Special Focus Institutions--Schools of law",
                                "Special Focus Institutions--Other special-focus institutions",
                                "Tribal Colleges"))

# Carnegie Undergraduate profile
data$CCUGPROF <- factor(data$CCUGPROF, 1:14, c("Higher part-time two-year",
                                "Mixed part/full-time two-year",
                                "Medium full-time two-year",
                                "Higher full-time two-year",
                                "Higher part-time four-year",
                                "Medium full-time four-year, inclusive",
                                "Medium full-time four-year, selective, lower transfer-in",
                                "Medium full-time four-year, selective, higher transfer-in",
                                "Full-time four-year, inclusive",
                                "Full-time four-year, selective, lower transfer-in",
                                "Full-time four-year, selective, higher transfer-in",
                                "Full-time four-year, more selective, lower transfer-in",
                                "Full-time four-year, more selective, higher transfer-in",
                                "Not classified or not applicable"))

# Carnegie size classification
data$CCSIZSET <- factor(data$CCSIZSET, 1:18, c("Very small 2-year",
                                "Small 2-year",
                                "Medium 2-year",
                                "Large 2-year",
                                "Very large 2-year",
                                "Very small 4-year, primarily nonresidential",
                                "Very small 4-year, primarily residential",
                                "Very small 4-year, highly residential",
                                "Small 4-year, primarily nonresidential",
                                "Small 4-year, primarily residential",
                                "Small 4-year, highly residential",
                                "Medium 4-year, primarily nonresidential",
                                "Medium 4-year, primarily residential",
                                "Medium 4-year, highly residential",
                                "Large 4-year, primarily nonresidential",
                                "Large 4-year, primarily residential",
                                "Large 4-year, highly residential",
                                "Not applicable, special-focus institution"))
```


The data contain legacy entries for institutions that are currently no longer operating.  We purge these records with the understanding that no students will be admitted to these schools in the future.


```{r datapurge}
# Convert Still Operating flag into logical
data$CURROPER <- as.logical(data$CURROPER)

# Purge institutions that are no longer operating
data <- data[data$CURROPER == TRUE,]
```


Next we look at the numerous variables pertaining to degree fields and types.  Using the Classification of Instructional Programs and degree level codes from the data dictionary we create a set of objects to be used for translating these variables into aggregates as needed.

```{r related.data}
# Classification of Instructional Programs
cip.nums <- c(1,3,4,5,9,10,11,12,13,14,15,16,19,22,
              23,24,25,26,27,29,30,31,38,39,40,41,42,
              43,44,45,46,47,48,49,50,51,52,54)
cip.names <- c("Agriculture and Related Sciences","Natural Resources and Conservation",
               "Architecture and Related Services","Ethnic, Cultural, and Gender Studies",
               "Communication and Journalism","Communications Technologies",
               "Computer and Information Sciences","Personal and Culinary Services",
               "Education","Engineering","Engineering Technologies",
               "Languages, Literature, and Linguistics","Family and Consumer Sciences",
               "Legal Professions and Studies","English Language and Literature",
               "Liberal Arts and Humanities","Library Science",
               "Biological and Biomedical Sciences","Mathematics and Statistics",
               "Military Technologies and Applied Sciences",
               "Multi/Interdisciplinary Studies","Parks, Recreation, and Fitness Studies",
               "Philosophy and Religious Studies","Theology and Religious Vocations",
               "Physical Sciences","Science Technologies/Technicians","Psychology",
               "Law Enforcement and Firefighting Services",
               "Public Administration and Social Services","Social Sciences",
               "Construction Trades","Mechanic and Repair Technologies",
               "Precision Production","Transportation","Visual and Performing Arts",
               "Health Professions","Business, Management, and Marketing","History")

# Vector of variable names that contain flags for specific disciplines
cip.codes <- data.frame(code=cip.nums, name=cip.names, stringsAsFactors=F)

# Vector of variable names that contain percentage of degrees awarded by CIP class
pcip.codes <- with(cip.codes, paste0("PCIP",str_pad(code, 2, pad="0"))) 

# Program levels 
prog.levels <- c("CERT1","CERT2","ASSOC","CERT4","BACHL")

# Vector of variable names that contain flags for specific programs
prog.codes <- as.vector(sapply(cip.codes$code, FUN=function(s){paste0("CIP",str_pad(s, 2, pad="0"),prog.levels)}),mode="character") 

# Matrix for lookup of variable names per program level
prog.level.codes <- sapply(prog.levels, FUN=function(s){paste0("CIP",str_pad(cip.codes$code, 2, pad="0"),s)}) 
```


Next, we use create a data frame that is a subset of the full data set which contains only the variables needed for subsequent analysis.  Variable names are reformatted to become more intuitive and several aggregate measures are created to simplify analysis. Several notes on the decisions made:

* The type or level of residency per institution is not relevant to our analaysis; removing residency classifications allows us to compact the Carnegie Size Classifications into a more usable form;
* The number of degrees offered is summed per degree level; no distinction is made between distance and in-person programs;
* Mean family incomes are represented for both dependent and independent students and an additional mean of these is also created;
* 105% and 200% completion levels are set for either 4-year or less than 4-year programs or the mean where both are present;
* Repayment progress and earnings are maintained on a cohort year basis as well as represented in a separate variable that takes the mean over the cohort years;
* Several of the Carnegie Size Classifications are NA; upon a manual review of some of these rows it seems that most, if not all, are special purpose institutions and are coded as such; It is acknowledged that this assumption may slightly skew the number of special purpose institutions.


```{r reformat.data}
data.min <- data.frame(stringsAsFactors=F,
  UnitID = data$UNITID,
  Name = data$INSTNM,
  City = data$CITY,
  State = data$STABBR,
  ZIP = data$ZIP,
  Type = data$CONTROL,
  Region = data$region,
  Locale = data$LOCALE,
  Size = mapvalues(data$CCSIZSET, 
         from=c("Very small 2-year",
                "Small 2-year",
                "Medium 2-year",
                "Large 2-year",
                "Very large 2-year",
                "Very small 4-year, primarily nonresidential",
                "Very small 4-year, primarily residential",
                "Very small 4-year, highly residential",
                "Small 4-year, primarily nonresidential",
                "Small 4-year, primarily residential",
                "Small 4-year, highly residential",
                "Medium 4-year, primarily nonresidential",
                "Medium 4-year, primarily residential",
                "Medium 4-year, highly residential",
                "Large 4-year, primarily nonresidential",
                "Large 4-year, primarily residential",
                "Large 4-year, highly residential",
                "Not applicable, special-focus institution"),
         to=c("Very small 2-year",
                "Small 2-year",
                "Medium 2-year",
                "Large 2-year",
                "Very large 2-year",
                "Very small 4-year",
                "Very small 4-year",
                "Very small 4-year",
                "Small 4-year",
                "Small 4-year",
                "Small 4-year",
                "Medium 4-year",
                "Medium 4-year",
                "Medium 4-year",
                "Large 4-year",
                "Large 4-year",
                "Large 4-year",
                "Special-focus institution")),
  Degree.mode = data$PREDDEG,
  Degree.high = data$HIGHDEG,
  Degree.cert.1yr = (rowSums(data[,prog.level.codes[,c("CERT1")]] > 0) > 0),
  Degree.cert.2yr = (rowSums(data[,prog.level.codes[,c("CERT2")]] > 0) > 0),
  Degree.assoc = (rowSums(data[,prog.level.codes[,c("ASSOC")]] > 0) > 0),
  Degree.cert.4yr = (rowSums(data[,prog.level.codes[,c("CERT4")]] > 0) > 0),
  Degree.bachl = (rowSums(data[,prog.level.codes[,c("BACHL")]] > 0) > 0),
  Cohort.size.earnings = data$count_ed,
  Admission.rate = data$ADM_RATE,
  SAT.mean = data$SAT_AVG,
  Income.dep = data$DEP_INC_AVG,
  Income.indep = data$IND_INC_AVG,
  Income.mean = rowMeans(data[,c("DEP_INC_AVG","IND_INC_AVG")], na.rm=T),
  Race.white = data$UG_WHITENH,
  Race.black = data$UG_BLACKNH,
  Race.pacific = data$UG_API,
  Race.aboriginal = data$UG_AIANOld,
  Race.hispanic = data$UG_HISPOld,
  Comp.150 = rowMeans(data[,c("C150_4","C150_L4")], na.rm=T),
  Comp.200 = rowMeans(data[,c("C200_4","C200_L4")], na.rm=T),
  Comp.yr8 = data$COMP_ORIG_YR8_RT,
  Cost.attend.aca = data$COSTT4_A,
  Cost.attend.prog = data$COSTT4_P,
  Cost.instate = data$TUITIONFEE_IN,
  Cost.outstate = data$TUITIONFEE_OUT,
  Cost.prog = data$TUITIONFEE_PROG,
  Loan.percent = data$PCTFLOAN,
  Debt.median = data$GRAD_DEBT_MDN,
  Repay.1yr = data$RPY_1YR_RT,
  Repay.3yr = data$RPY_3YR_RT,
  Repay.5yr = data$RPY_5YR_RT,
  Repay.7yr = data$RPY_7YR_RT,
  Repay.7yr.mean = rowMeans(data[,c("RPY_1YR_RT","RPY_3YR_RT","RPY_5YR_RT","RPY_7YR_RT")], na.rm=T),
  Earn.6yr = data$mn_earn_wne_p6,
  Earn.7yr = data$mn_earn_wne_p7,
  Earn.8yr = data$mn_earn_wne_p8,
  Earn.9yr = data$mn_earn_wne_p9,
  Earn.10yr = data$mn_earn_wne_p10,
  Earn.10yr.mean = rowMeans(data[,c("mn_earn_wne_p6","mn_earn_wne_p7","mn_earn_wne_p8","mn_earn_wne_p9","mn_earn_wne_p10")], na.rm=T)
)
# Presume NAs are special institutions
data.min[is.na(data.min$Size),c("Size")] <- "Special-focus institution"
```


Exploratory Analysis - Institutions
-----------------------------------

In this section we look closely at the compiled data to determine trends, features, and other aspects that might be useful in answering our questions.

Using the converted Carnegie Size Classifications we can see the distribution of schools by size. Excluding the special purpose schools, the distribution appears to be bivariate with peaks among the 2-year and 4-year institutions, respectively.


```{r explore.school}
# Institution size
size.agg <- aggregate(data.min$UnitID, by=list(data.min$Size), FUN=length)
par.old <- par(mar=c(10,4,2,2))
barplot(size.agg$x, 
        names.arg=size.agg$Group.1, 
        col=topo.colors(nrow(size.agg)), 
        main="Distribution of Institution Sizes", 
        las=2, 
        cex.names=0.8)
```


Next, the institutions are aggregated by U.S. region and highest degree offered. Certificate programs are the most frequent (likely due to the large number of special purpose schools found above), followed by Bachelors and Associates degrees. There is significant variance in the number of institutions across different regions, but the same general curve of degree level is generally consistent.

After this the same data is aggregated by institution size instead of region. Special purpose schools again dominate in numbers while the bivariate peaks show Associates and Bachelors degrees are the most common among 2 and 4-year institutions respectively.


```{r explore.school.2}
par(par.old)

# Count of colleges By highest degree offered
barplot(height=tapply(data.min[["UnitID"]], data.min[,c("Degree.high", "Region")], length), 
        beside=T, 
        legend.text=levels(data.min$Degree.high), 
        cex.names=0.6, 
        col=heat.colors(length(levels(data.min$Degree.high))), 
        args.legend = list(x="topright", cex = 0.5), 
        main="Number of Institutions by Highest Degree Offered and Region")

# ... By degree and size
barplot(height=tapply(data.min[["UnitID"]], data.min[,c("Degree.high", "Size")], length),
        beside=T, 
        legend.text=levels(data.min$Degree.high), 
        cex.names=0.6, 
        col=heat.colors(length(levels(data.min$Degree.high))), 
        args.legend = list(x="top", cex = 0.5), 
        main="Number of Institutions by Highest Degree Offered and Size")
```


Next we look at degree levels and subject areas of study.  We show the percent of degrees awarded by discipline.  There is a wide range of distributions, but the most common areas stand out as being more common amongst institutions: Health Professions, Business Management & Marketing, and Personal & Culinary Services. 


```{r explore.school.3}
# Degree types awarded - none above 100% - ***Uses original data set
par.old <- par(mar=c(11,5,1,1))
boxplot(data[,pcip.codes], 
        names=cip.codes$name, 
        las=2, 
        pars=list(cex.axis=0.6),
        main="% Degree Types Awarded for All Institutions")
```


Unfortunately, this is as far as we can go with per-degree analysis.  The data documentation makes this explicit: "the data are not yet available to produce program-level earnings data."  Briefly, this means that though we can see what types of degrees institutions are granting or offering in aggregaete, we cannot make any relationships to other aggregate data such as costs, demographics, etc.

Next we look at the cost data.  Costs are compiled for all full-time, first-time, degree-/certificate-seeking undergraduates who receive Title IV aid, which means that costs relative to students who do not receive (or need) financial aid are not included.  This indicates a possible skew away from more affluent students.

First we compare the different types of costs for all institutions.  Predictably, the cost of tuition comes in below the overall costs of attendance and there are a large number of outliers, which is expected. 

Second, we divide academic and program year costs by institution type.  Private nonprofit institutions (which includes the likes of Harvard, Yale, et al.) dominates the Academic program costs while program year costs (for certificates, and professional education) are more consistent across institution types.  

The third and fourth graphs distribute academic and program year costs by institution type and size.  Program year data are sparse, demonstrating that not all types of institutions, particularly the larger and 4-year schools, offer non-academic programs.  Academic year costs are clearly not normally distributed across institution types, let alone sizes. In general, private institutions, both for and non-profit, tend to cost more than public institutions.


```{r explore.school.4}
par(par.old)

# Costs of attendance, tuition, and fees - no extreme outliers
par.old <- par(mar=c(8,4,2,2))
boxplot(data.min[,c("Cost.attend.aca","Cost.attend.prog","Cost.instate",
                    "Cost.outstate","Cost.prog")], 
        names=c("Avg. COA, Academic","Avg. COA, Program","In-state Tuition",
                "Out-of-state Tuition","Program Tuition"), 
        las=2, 
        pars=list(cex.axis=0.8), 
        main="Range of Costs for All Institutions")
par(par.old)

# ... by Type
par.old <- par(mfrow=c(1,2),oma=c(0,0,2,0))
boxplot(Cost.attend.aca~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        ylim=c(0,round(max(data.min$Cost.attend.aca, data.min$Cost.attend.prog, na.rm=T), digits=-4)), 
        main="Academic Year Institutions")

boxplot(Cost.attend.prog~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        ylim=c(0,round(max(data.min$Cost.attend.aca, data.min$Cost.attend.prog, na.rm=T), digits=-4)), 
        main="Program Year Institutions")

mtext("Total Costs Of Attendance", outer=T, cex=1.5)
par(par.old)

# ... and Size
qplot(Type, 
      Cost.attend.aca, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot", 
      ymin=0, 
      ymax=max(Cost.attend.aca, na.rm=T)) + 
  labs(x="Institution Type", 
       y="Academic-year Costs", 
       title="Costs vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

qplot(Type, 
      Cost.attend.prog, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot", 
      ymin=0, 
      ymax=max(Cost.attend.prog, na.rm=T)) + 
  labs(x="Institution Type", 
       y="Program-year Costs", 
       title="Costs vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

```

Exploratory Analysis - Students
-------------------------------

Student demographics and statistics are represented in aggregate overall.  First we look at admissions rates by institution then by school type and size.  Both public and private nonprofit cover a wide range of rates, but private for-profit shows a marked increase in admission rates.  Being that for-profit institutions have a warrant to admit as many students as possible to maximize profits, their standards are clearly less rigorous than the other types of institutions. This trend generally holds when looking at different school sizes.  Of note is the fact that large 4-year private non-profit institutions have significantly lower admission rates, likely due to the presence of elite and ivy-league schools in that class.


```{r explore.students}
# Admissions rates - none above 100%
boxplot(Admission.rate~Type, 
        data=data.min, 
        main="Admissions Rates by Institution Type")

# ... by type and size
qplot(Type, 
      Admission.rate, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot") + 
  labs(x="Institution Type", 
       y="Admissions Rate", 
       title="Admissions Rate vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


Entrance exams are widely considered to be a strong indicator of admissions and student performance.  Below we see that private nonprofit institutions generally have aggregate student SAT scores that are higher than other classes of institutions.  The range of scores is quite wide, however.


```{r explore.students.2}
# SAT scores - below 1600
boxplot(SAT.mean~Type, 
        data=data.min, 
        main="SAT Scores by Institution Type")

# ... by type and size
qplot(Type, 
      SAT.mean, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot") + 
  labs(x="Institution Type", 
       y="SAT Score", 
       title="SAT Scores vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


Family income could be a predictor of various types of student performance as well as havign a relation to financial aid and costs.  Again, private nonprofit institutions tend to have greater mean family income levels than other classes.


```{r explore.students.3}
# Demographics - family income
boxplot(Income.mean~Type, 
        data=data.min, 
        main="Average Family Income by Institution Type")

# ... by type and size
qplot(Type, 
      Income.mean, 
      data=data.min, 
      facets=.~Size, geom="boxplot") + 
  labs(x="Institution Type", 
       y="Average Family Income", 
       title="Average Family Income vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


Race demographics are difficult to establish from the given data.  At best we have the proportion of each institution that is a given race.  Perhaps unsurprisingly white students tend to dominate, though there are some exclusively black and hispanic colleges.

The two graphs that follow show a trend for program completion rates and earnings for different races.  Apparently, the more homogeneous towards white or pacific islander a school tends toward, the higher the rate.  Conversely, the more black, native american, or hispanic a school is the lower the rate, the exception being among small 2-year schools, likely some of which are traditionally black colleges. Aggregates such as these are probably unreliable as there are many factors that could account for the trends.


```{r explore.students.4}
# Demographics - racial makeup
par.old <- par(mfrow=c(1,3),oma=c(0,0,2,0))
boxplot(Race.white~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        ylim=c(0,1), 
        main="% White")

boxplot(Race.black~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        ylim=c(0,1), 
        main="% Black")

boxplot(Race.hispanic~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        ylim=c(0,1), 
        main="% Hispanic")

mtext("Racial Distribution", outer=T, cex=1.5)
par(par.old)

# Race comparisons
# Completion
par.old <- par(mfrow=c(1,5),oma=c(0,0,2,0))
ylim.race <- c(0,round(max(data.min$Comp.200, na.rm=T), digits=2))

plot(data.min$Race.white, 
     data.min$Comp.200, 
     xlab="%", 
     ylab="%", 
     ylim=ylim.race, 
     main="White")
abline(lm(Comp.200 ~ Race.white, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.black, 
     data.min$Comp.200, 
     xlab="%", 
     ylab="%", 
     ylim=ylim.race, 
     main="Black")
abline(lm(Comp.200 ~ Race.black, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.pacific, 
     data.min$Comp.200, 
     xlab="%", 
     ylab="%", 
     ylim=ylim.race, 
     main="Pacific Islander")
abline(lm(Comp.200 ~ Race.pacific, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.aboriginal, 
     data.min$Comp.200, 
     xlab="%", 
     ylab="%", 
     ylim=ylim.race, 
     main="Am. Indian")
abline(lm(Comp.200 ~ Race.aboriginal, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.hispanic, 
     data.min$Comp.200, 
     xlab="%", 
     ylab="%", 
     ylim=ylim.race, 
     main="Hispanic")
abline(lm(Comp.200 ~ Race.hispanic, data=data.min, na.action="na.exclude"), 
       col="red")

mtext("200% Time Completion Rate by Undergraduate Race", outer=T, cex=1.5)
par(par.old)

# Earnings
par.old <- par(mfrow=c(1,5),oma=c(0,0,2,0))
ylim.race <- c(10000,round(max(data.min$Earn.10yr.mean, na.rm=T), digits=-4))

plot(data.min$Race.white, 
     data.min$Earn.10yr.mean, 
     xlab="%", 
     ylab="$", 
     ylim=ylim.race, 
     main="White")
abline(lm(Earn.10yr.mean ~ Race.white, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.black, 
     data.min$Earn.10yr.mean, 
     xlab="%", 
     ylab="$", 
     ylim=ylim.race, 
     main="Black")
abline(lm(Earn.10yr.mean ~ Race.black, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.pacific, 
     data.min$Earn.10yr.mean, 
     xlab="%", 
     ylab="$", 
     ylim=ylim.race, 
     main="Pacific Islander")
abline(lm(Earn.10yr.mean ~ Race.pacific, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.aboriginal, 
     data.min$Earn.10yr.mean, 
     xlab="%", 
     ylab="$", 
     ylim=ylim.race, 
     main="Am. Indian")
abline(lm(Earn.10yr.mean ~ Race.aboriginal, data=data.min, na.action="na.exclude"), 
       col="red")

plot(data.min$Race.hispanic, 
     data.min$Earn.10yr.mean, 
     xlab="%", 
     ylab="$", 
     ylim=ylim.race, 
     main="Hispanic")
abline(lm(Earn.10yr.mean ~ Race.hispanic, data=data.min, na.action="na.exclude"), 
       col="red")

mtext("Earnings by Undergraduate Race", outer=T, cex=1.5)
par(par.old)
```


Next we look at aggregate completion rates.  The data show no discernable trends with a wide variety between school types.  The data are not complete, however.  From the data documentation: "Exclusion of part-time students, transfer students, and students who do not start during the fall from IPEDS completion rates. First-time full-time students make up fewer than half of all college students, or even less in some sectors of institutions (e.g. community colleges). Furthermore, although schools have the option to report transfer outcomes for first-time full-time students, many choose not to."


```{r explore.students.5}
# Completion Rates
par.old <- par(mfrow=c(1,2),oma=c(0,0,2,0))
boxplot(Comp.150~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        main="150% of Expected Time")

boxplot(Comp.200~Type, 
        data=data.min, 
        pars=list(cex.axis=0.7), 
        main="200% of Expected Time")

mtext("Completion Rates by Institution Type", outer=T, cex=1.5)
par(par.old)

# ... by type and size
qplot(Type, 
      Comp.200, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot") + 
  labs(x="Institution Type", 
       y="Completion Rate", 
       title="200% Completion Rates for Full-Time Students vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# 8 years completion at same school
boxplot(Comp.yr8~Type, 
        data=data.min, 
        main="% Completed Within 8 Yrs. at Original Institution")
```

The next section describe various student financial aid variables.  As with costs, this data only reflects Title IV recipients, thus excluding more affluent students who may not need loans or fund their education privately. Predictably, more students take loans for the more expensive private schools, with a slight trend towards more accumulated debt.

Repayment depicts the fraction of borrowers at an institution who have not defaulted on their federal loans and who are making progress in paying them down.  Positive repayment rates ioncrease over time which reflects the increase in salary that results from being in the workforce for a longer period of time.


```{r explore.students.6}
# Loans
boxplot(Loan.percent~Type, 
        data=data.min, 
        main="% Students with Federal Loans by Institution Type")

# Debt & Repayment
boxplot(Debt.median~Type, 
        data=data.min, 
        main="Median Accumulated Graduate Debt by Institution Type")

par.old <- par(mar=c(7,4,2,2))
boxplot(data.min[,c("Repay.1yr","Repay.3yr","Repay.5yr","Repay.7yr")], 
        names=c("1 Year in Repay","3 Years in Repay","5 Years in Repay","7 Years in Repay"), 
        las=2, 
        pars=list(cex.axis=0.8), 
        main="% of Borrowers Progressing on Repayment")
par(par.old)
```


Earnings data include only Title IV-receiving students, so figures may not be representative of schools with a low proportion of Title IV-eligible students.  The earnings rates are generally consistent over the 10 year cohort period, with an increasing number of outliers as over time.  This might reflect the general tendency towards income inequality, though the data here are insufficient to determine any proximate cause.


```{r explore.students.7}
# Earnings trends
par.old <- par(mar=c(7,4,2,2))
boxplot(data.min[,c("Earn.6yr","Earn.7yr","Earn.8yr","Earn.9yr","Earn.10yr")], 
        names=c("6 Years Post-Entry","7 Years Post-Entry","8 Years Post-Entry",
                "9 Years Post-Entry","10 Years Post-Entry"), 
        las=2, pars=list(cex.axis=0.8), 
        main="Mean Earnings for Title IV Cohorts")
abline(h=25000, 
       col="red")
text(4.5,
     24000,
     "Threshold Earnings", 
     col="red", 
     cex=0.6)
par(par.old)

qplot(Type, 
      Earn.10yr.mean, 
      data=data.min, 
      facets=.~Size, 
      geom="boxplot") + 
  labs(x="Institution Type", 
       y="10 Year Average Earnings", 
       title="Mean Earnings for Title IV Cohorts vs Institution Type and Size") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

```


Observations from Data
----------------------

To recap some of the observations made above and from statements in the data definitions:

* Many variables are available only for Title IV recipients, or students who receive federal grants and loans;
* All NSLDS and Treasury elements are protected for privacy purposes; those data are shown as PrivacySuppressed;
* For most variables, data are pooled across two years of data to reduce year-over-year variability in figures;
* Costs are compiled for all full-time, first-time, degree-/certificate-seeking undergraduates who receive Title IV aid;
* Exclusion of part-time students, transfer students, and students who do not start during the fall from IPEDS completion rates. First-time full-time students make up fewer than half of all college students, or even less in some sectors of institutions (e.g. community colleges). Furthermore, although schools have the option to report transfer outcomes for first-time full-time students, many choose not to;
* The data are not yet available to produce program-level earnings data;
* Earnings data include only Title IV-receiving students, so figures may not be representative of schools with a low proportion of Title IV-eligible students;
* Earnings data are restricted to students who are not enrolled, so students who are currently enrolled in graduate school at the time of measurement are excluded;
* Repayment depicts the fraction of borrowers at an institution who have not defaulted on their federal loans and who are making progress in paying them down;
* Missing cohort sizes for some metrics so we are unable to calculate rates reliably.


Exploratory Analysis - Comparisons
----------------------------------

Using the above observations, we can develop a correlation matrix to compare how significant variables appear to affect each other

```{r exploratory.analysis}
# correlation matrix
corrgram(data.min[c("Admission.rate","SAT.mean","Income.mean","Comp.200",
                    "Cost.attend.aca","Loan.percent","Debt.median","Earn.10yr.mean",
                    "Repay.7yr.mean")], 
         order=TRUE, 
         lower.panel=panel.shade, 
         upper.panel=NULL, 
         text.panel=panel.txt, 
         diag.panel=panel.minmax, 
         main="Correlation Matrix for Selected Data")
```


With respect to our first question, we can take cost and earnings as a metric for institutional value. Below we compare cost and earnings in specific contexts:


```{r exploratory.analysis.2}
# Correlations for value metrics
cor(data.min[c("Comp.200","Cost.attend.aca","Income.mean","Admission.rate")],
    data.min[c("Debt.median","Earn.10yr.mean","Repay.7yr.mean")], 
    use="pairwise.complete.obs")

qplot(Cost.attend.aca, 
      Earn.10yr.mean, 
      data=data.min, 
      facets=Type~Region, 
      geom=c("point","smooth")) + 
  labs(x="Cost", 
       y="Earnings", 
       title="Cost vs Earnings by Type & Region") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

qplot(Cost.attend.aca, 
      Earn.10yr.mean, 
      data=data.min, 
      facets=Size~Region, 
      geom=c("point","smooth")) + 
  labs(x="Cost", 
       y="Earnings", 
       title="Cost vs Earnings by Size & Region") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

qplot(Cost.attend.aca, 
      Earn.10yr.mean, 
      data=data.min, 
      facets=Size~Type, 
      geom=c("point","smooth")) + 
  labs(x="Cost", 
       y="Earnings", 
       title="Cost vs Earnings by Size & Type") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```


Regarding our second question, we explore some variables likely to relate to completion rate:


```{r exploratory.analysis.3}
# Correlations for success metrics
cor(data.min[c("SAT.mean","Income.mean","Cost.attend.aca","Loan.percent",
               "Debt.median","Admission.rate")], 
    data.min$Comp.200, 
    use="pairwise.complete.obs")
```


Model Development - Institutional Value
------------------------------------

We now attempt to build a model for institutional value based on the following features:

* Insitution type - we have observed how earnings vary based on the type of institution;
* Institution size - we have seen how size of institution has an effect on cost and earnings;
* Admission rate - more rigorous or elite schools have lower admissions rates which affects the perception of a degree from those schools and can affect earnings potential.

To begin, we create a factor to indicate the relative value of a school, expressed as a ratio of earnings to cost.  The higher the ratio, the better the value.  We use a 300% ratio as a cutoff to determine high versus average or low value institutions. 

We analyzed only complete cases of the variables above.  This significantly limited the population for analysis but ensured better results.

A number of models were tried against these variables.  Most indicated the presence of some non-linear factors, but we were unable to compensate for those.  In the end, a General Linear Model provided the most consistent model for the analysis.

Unfortunately, the population contained few cases of "high" value subjects, so the results, despite an apparently strong accuracy rate, are somewhat ambiguous.  Undoubtedely the nonlinear aspects of the model will have to be adjusted or otherwise accounted for, or we would have to find better features to test on.

The top 10 institutions resulting from this model are listed below as are the significant p factor values.  The result is that a combination of certain sizes of 4-year institutions that are private nonprofits and have a low admission rate predict higher value.


```{r analysis.value}
# Create a value factor
data.min$Cost.earn.ratio <- cut(data.min$Earn.10yr.mean/data.min$Cost.attend.aca,
                                breaks=c(0,3,99), 
                                labels=c("Low","High"))

# complete.cases to avoid imputation
# contributing features: type, size, admissions rate 
data.comp.value <- data.min[complete.cases(data.min[,c("Cost.earn.ratio","Type","Size",
                                                       "Admission.rate")]),
                            c("UnitID","Name","Cost.earn.ratio","Type","Size","Admission.rate")]

rownames(data.comp.value) <- paste(data.comp.value$Name, data.comp.value$UnitID) # Identify rows for analysis

value.train <- createDataPartition(y=data.comp.value$Cost.earn.ratio, p=0.5, list=FALSE)
value.training<-data.comp.value[value.train,]
value.testing<-data.comp.value[-value.train,]

set.seed(14345)
value.fit<-train(Cost.earn.ratio ~ Size+Type+Admission.rate, data=value.training, method='glm')

par.old <- par(mfrow=c(2,2))
plot(value.fit$finalModel)
par(par.old)

# Top 10 earnings to cost ratio schools
head(sort(value.fit$finalModel$residuals, decreasing=T), 10)

value.pred <- predict(value.fit, newdata=value.testing)

confusionMatrix(value.pred, value.testing$Cost.earn.ratio)

value.p <- t <- coef(summary(value.fit))[,4]
value.p[value.p < 0.05]
```


Model Development - Graduation Potential
----------------------------------------

We now attempt to build a model for graduation potential based on the following features:

* SAT Scores
* Mean Family Income
* Costs of attendance

Again, we analyzed only complete cases of the variables above.  This significantly limited the population for analysis but ensured better results.

We selected a Linear Model with multiple regression.  Outside of a few outliers this model performed well.  The error rate is perhaps a bit high owing to these outliers and other factors.  The summary statistics for the model show that each variable tested is significant, meaning a high SAT score, high family income, and more expensive (meaning exclusive) schools result in higher completion outcomes.


```{r analysis.grad}

# complete.cases to avoid imputation
# Contributing features: SAT scores, family income, and program cost, Admission.rate (race%?)
data.comp.compl <- data.min[complete.cases(data.min[,c("SAT.mean","Income.mean",
                                                       "Cost.attend.aca","Comp.200")]),
                            c("UnitID","Name","SAT.mean","Income.mean","Cost.attend.aca",
                              "Admission.rate","Comp.200")]

rownames(data.comp.compl) <- paste(data.comp.compl$Name, data.comp.compl$UnitID) # Identify rows for analysis

compl.train <- createDataPartition(y=data.comp.compl$Comp.200, p=0.5, list=FALSE)
compl.training<-data.comp.compl[compl.train,]
compl.testing<-data.comp.compl[-compl.train,]

set.seed(14445)
compl.fit<-train(Comp.200 ~ SAT.mean+Income.mean+Cost.attend.aca, data=compl.training, method='lm')

par.old <- par(mfrow=c(2,2))
plot(compl.fit$finalModel)
par(par.old)

compl.pred <- predict(compl.fit, newdata=compl.testing)

thresh.recode <- function(d) ifelse(d > thresh.compl, 1, 0)
thresh.compl <- 0.6
confusionMatrix(thresh.recode(compl.pred), thresh.recode(compl.testing$Comp.200))

plot.roc(thresh.recode(compl.pred), thresh.recode(compl.testing$Comp.200))

compl.p <- t <- coef(summary(compl.fit))[,4]
compl.p[compl.p < 0.05]
```


---
(end)