-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcollege-scorecard-project.Rmd
992 lines (803 loc) · 46.7 KB
/
college-scorecard-project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
---
title: "College Scorecard Analytics Project"
author: "Thomas P Kiehne"
date: "December 8, 2015"
output: html_document
---
```{r global_options, include=FALSE, echo=FALSE}
options(width = 116, scipen = 5, digits = 2)
knitr::opts_chunk$set(comment=NA, fig.width=9, fig.height=6, warning=FALSE, message=FALSE)
rm( list = ls()) # Clear environment
#install.packages("stringr")
#install.packages("plyr")
#install.packages("reshape2")
#install.packages("corrgram")
#install.packages("ggplot2")
#install.packages('caret', dependencies = TRUE)
#install.packages("pROC")
library(stringr)
library(plyr)
library(reshape2)
library(corrgram)
library(ggplot2)
library(caret)
library(pROC)
```
About The Data
--------------
[The College Scorecard project](https://collegescorecard.ed.gov) aggregates data that are provided through federal reporting from institutions, data on federal financial aid, and tax information. These data provide insights into the performance of schools that receive federal financial aid dollars, and the outcomes of the students of those schools. The project intends to provide data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals.
In this document we will collect and analyze the most recent Scorecard data to attempt to determine the following:
* What colleges provide the highest lifetime value their graduates, and
* What attributes of a college contribute the most to the chances of student graduation.
Data Collection and Cleaning
----------------------------
This section outlines the steps taken to retrieve the Scorecard data and prepare it for analysis
The most recent data are available for download in CSV format. We download the most recent published data and read the data into memory as shown below. One particular detail is that some NSLDS and Treasury elements are protected for privacy purposes; those data are shown as PrivacySuppressed. This is not consistent between institutions; some variables may be PrivacySuppressed for one instition while not for other institutions. Aside from the obvious bias this introduces for those variables, for our purposes these fields are effectively missing and are coded as NA.
```{r retreiveDataRecent}
# get data file if not present
# ref: https://collegescorecard.ed.gov/data/
if(!file.exists("./Most+Recent+Cohorts+(All+Data+Elements).csv") &
file.exists("./Most+Recent+Cohorts+(All+Data+Elements).zip")) {
unzip("./Most+Recent+Cohorts+(All+Data+Elements).zip", "./Most+Recent+Cohorts+(All+Data+Elements).csv")
}
if(!file.exists("./Most+Recent+Cohorts+(All+Data+Elements).csv")) {
download.file("https://s3.amazonaws.com/ed-college-choice-public/Most+Recent+Cohorts+(All+Data+Elements).csv", "./Most+Recent+Cohorts+(All+Data+Elements).csv", mode="wb", method="curl")
}
data <- read.csv("./Most+Recent+Cohorts+(All+Data+Elements).csv", stringsAsFactors=F, na.strings=c("NULL","PrivacySuppressed"))
```
The [Data Dictionary](https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary-09-08-2015.csv) defines codes for a number of coded variables. In this section we translate those variables into factors and provide the code labels where appropriate.
```{r datacleaning}
# Predominant Degree
data$PREDDEG <- factor(data$PREDDEG, 0:4, labels=c("Not classified",
"Predominantly certificate-degree granting",
"Predominantly associate's-degree granting",
"Predominantly bachelor's-degree granting",
"Entirely graduate-degree granting"))
# Highest Degree
data$HIGHDEG <- factor(data$HIGHDEG, 0:4, labels=c("Non-degree-granting",
"Certificate degree",
"Associate degree",
"Bachelor's degree",
"Graduate degree"))
# Convert state abbreviaions to factor
data$STABBR <- factor(data$STABBR)
# Translate FIPS into State names
data$st_fips <- factor(data$st_fips,
c(1,2,4,5,6,8,9,10,11,12,13,15,16,17,18,
19,20,21,22,23,24,25,26,27,28,29,30,31,
32,33,34,35,36,37,38,39,40,41,42,44,45,
46,47,48,49,50,51,53,54,55,56,60,64,66,
69,70,72,78),
labels=c("Alabama","Alaska","Arizona","Arkansas","California","Colorado",
"Connecticut","Delaware","District of Columbia",
"Florida","Georgia", "Hawaii","Idaho","Illinois",
"Indiana","Iowa","Kansas","Kentucky","Louisiana", "Maine",
"Maryland","Massachusetts","Michigan","Minnesota","Mississippi",
"Missouri","Montana","Nebraska","Nevada","New Hampshire",
"New Jersey", "New Mexico","New York","North Carolina",
"North Dakota","Ohio","Oklahoma", "Oregon","Pennsylvania",
"Rhode Island","South Carolina","South Dakota", "Tennessee",
"Texas","Utah","Vermont","Virginia","Washington",
"West Virginia", "Wisconsin","Wyoming","American Samoa",
"Federated States of Micronesia", "Guam",
"Northern Mariana Islands","Palau","Puerto Rico",
"Virgin Islands")
)
# Institutional control / type
data$CONTROL <- factor(data$CONTROL, 1:3, c("Public","Private nonprofit","Private for-profit"))
# Region
data$region <- factor(data$region, 0:9, c("U.S. Service Schools",
"New England",
"Mid East",
"Great Lakes",
"Plains",
"Southeast",
"Southwest",
"Rocky Mountains",
"Far West",
"Outlying Areas"))
# Locale description
data$LOCALE <- factor(data$LOCALE, c(11,12,13,21,22,23,31,32,33,41,42,43), c("City: Large",
"City: Midsize",
"City: Small",
"Suburb: Large",
"Suburb: Midsize",
"Suburb: Small",
"Town: Fringe",
"Town: Distant",
"Town: Remote",
"Rural: Fringe",
"Rural: Distant",
"Rural: Remote"))
# Carnegie basic classicifcation
data$CCBASIC <- factor(data$CCBASIC, 0:33, c("Not classified",
"Associate's-Public Rural-serving Small",
"Associate's-Public Rural-serving Medium",
"Associate's-Public Rural-serving Large",
"Associate's-Public Suburban-serving Single Campus",
"Associate's-Public Suburban-serving Multicampus",
"Associate's-Public Urban-serving Single Campus",
"Associate's-Public Urban-serving Multicampus",
"Associate's-Public Special Use",
"Associate's-Private Not-for-profit",
"Associate's-Private For-profit",
"Associate's-Public 2-year colleges under 4-year universities",
"Associate's-Public 4-year Primarily Associate's",
"Associate's-Private Not-for-profit 4-year Primarily Associate's",
"Associate's-Private For-profit 4-year Primarily Associate's",
"Research Universities (very high research activity)",
"Research Universities (high research activity)",
"Doctoral/Research Universities",
"Master's Colleges and Universities (larger programs)",
"Master's Colleges and Universities (medium programs)",
"Master's Colleges and Universities (smaller programs)",
"Baccalaureate Colleges--Arts & Sciences",
"Baccalaureate Colleges--Diverse Fields",
"Baccalaureate/Associate's Colleges",
"Special Focus Institutions--Theological seminaries, Bible colleges, and other faith-related institutions",
"Special Focus Institutions--Medical schools and medical centers",
"Special Focus Institutions--Other health professions schools",
"Special Focus Institutions--Schools of engineering",
"Special Focus Institutions--Other technology-related schools",
"Special Focus Institutions--Schools of business and management",
"Special Focus Institutions--Schools of art, music, and design",
"Special Focus Institutions--Schools of law",
"Special Focus Institutions--Other special-focus institutions",
"Tribal Colleges"))
# Carnegie Undergraduate profile
data$CCUGPROF <- factor(data$CCUGPROF, 1:14, c("Higher part-time two-year",
"Mixed part/full-time two-year",
"Medium full-time two-year",
"Higher full-time two-year",
"Higher part-time four-year",
"Medium full-time four-year, inclusive",
"Medium full-time four-year, selective, lower transfer-in",
"Medium full-time four-year, selective, higher transfer-in",
"Full-time four-year, inclusive",
"Full-time four-year, selective, lower transfer-in",
"Full-time four-year, selective, higher transfer-in",
"Full-time four-year, more selective, lower transfer-in",
"Full-time four-year, more selective, higher transfer-in",
"Not classified or not applicable"))
# Carnegie size classification
data$CCSIZSET <- factor(data$CCSIZSET, 1:18, c("Very small 2-year",
"Small 2-year",
"Medium 2-year",
"Large 2-year",
"Very large 2-year",
"Very small 4-year, primarily nonresidential",
"Very small 4-year, primarily residential",
"Very small 4-year, highly residential",
"Small 4-year, primarily nonresidential",
"Small 4-year, primarily residential",
"Small 4-year, highly residential",
"Medium 4-year, primarily nonresidential",
"Medium 4-year, primarily residential",
"Medium 4-year, highly residential",
"Large 4-year, primarily nonresidential",
"Large 4-year, primarily residential",
"Large 4-year, highly residential",
"Not applicable, special-focus institution"))
```
The data contain legacy entries for institutions that are currently no longer operating. We purge these records with the understanding that no students will be admitted to these schools in the future.
```{r datapurge}
# Convert Still Operating flag into logical
data$CURROPER <- as.logical(data$CURROPER)
# Purge institutions that are no longer operating
data <- data[data$CURROPER == TRUE,]
```
Next we look at the numerous variables pertaining to degree fields and types. Using the Classification of Instructional Programs and degree level codes from the data dictionary we create a set of objects to be used for translating these variables into aggregates as needed.
```{r related.data}
# Classification of Instructional Programs
cip.nums <- c(1,3,4,5,9,10,11,12,13,14,15,16,19,22,
23,24,25,26,27,29,30,31,38,39,40,41,42,
43,44,45,46,47,48,49,50,51,52,54)
cip.names <- c("Agriculture and Related Sciences","Natural Resources and Conservation",
"Architecture and Related Services","Ethnic, Cultural, and Gender Studies",
"Communication and Journalism","Communications Technologies",
"Computer and Information Sciences","Personal and Culinary Services",
"Education","Engineering","Engineering Technologies",
"Languages, Literature, and Linguistics","Family and Consumer Sciences",
"Legal Professions and Studies","English Language and Literature",
"Liberal Arts and Humanities","Library Science",
"Biological and Biomedical Sciences","Mathematics and Statistics",
"Military Technologies and Applied Sciences",
"Multi/Interdisciplinary Studies","Parks, Recreation, and Fitness Studies",
"Philosophy and Religious Studies","Theology and Religious Vocations",
"Physical Sciences","Science Technologies/Technicians","Psychology",
"Law Enforcement and Firefighting Services",
"Public Administration and Social Services","Social Sciences",
"Construction Trades","Mechanic and Repair Technologies",
"Precision Production","Transportation","Visual and Performing Arts",
"Health Professions","Business, Management, and Marketing","History")
# Vector of variable names that contain flags for specific disciplines
cip.codes <- data.frame(code=cip.nums, name=cip.names, stringsAsFactors=F)
# Vector of variable names that contain percentage of degrees awarded by CIP class
pcip.codes <- with(cip.codes, paste0("PCIP",str_pad(code, 2, pad="0")))
# Program levels
prog.levels <- c("CERT1","CERT2","ASSOC","CERT4","BACHL")
# Vector of variable names that contain flags for specific programs
prog.codes <- as.vector(sapply(cip.codes$code, FUN=function(s){paste0("CIP",str_pad(s, 2, pad="0"),prog.levels)}),mode="character")
# Matrix for lookup of variable names per program level
prog.level.codes <- sapply(prog.levels, FUN=function(s){paste0("CIP",str_pad(cip.codes$code, 2, pad="0"),s)})
```
Next, we use create a data frame that is a subset of the full data set which contains only the variables needed for subsequent analysis. Variable names are reformatted to become more intuitive and several aggregate measures are created to simplify analysis. Several notes on the decisions made:
* The type or level of residency per institution is not relevant to our analaysis; removing residency classifications allows us to compact the Carnegie Size Classifications into a more usable form;
* The number of degrees offered is summed per degree level; no distinction is made between distance and in-person programs;
* Mean family incomes are represented for both dependent and independent students and an additional mean of these is also created;
* 105% and 200% completion levels are set for either 4-year or less than 4-year programs or the mean where both are present;
* Repayment progress and earnings are maintained on a cohort year basis as well as represented in a separate variable that takes the mean over the cohort years;
* Several of the Carnegie Size Classifications are NA; upon a manual review of some of these rows it seems that most, if not all, are special purpose institutions and are coded as such; It is acknowledged that this assumption may slightly skew the number of special purpose institutions.
```{r reformat.data}
data.min <- data.frame(stringsAsFactors=F,
UnitID = data$UNITID,
Name = data$INSTNM,
City = data$CITY,
State = data$STABBR,
ZIP = data$ZIP,
Type = data$CONTROL,
Region = data$region,
Locale = data$LOCALE,
Size = mapvalues(data$CCSIZSET,
from=c("Very small 2-year",
"Small 2-year",
"Medium 2-year",
"Large 2-year",
"Very large 2-year",
"Very small 4-year, primarily nonresidential",
"Very small 4-year, primarily residential",
"Very small 4-year, highly residential",
"Small 4-year, primarily nonresidential",
"Small 4-year, primarily residential",
"Small 4-year, highly residential",
"Medium 4-year, primarily nonresidential",
"Medium 4-year, primarily residential",
"Medium 4-year, highly residential",
"Large 4-year, primarily nonresidential",
"Large 4-year, primarily residential",
"Large 4-year, highly residential",
"Not applicable, special-focus institution"),
to=c("Very small 2-year",
"Small 2-year",
"Medium 2-year",
"Large 2-year",
"Very large 2-year",
"Very small 4-year",
"Very small 4-year",
"Very small 4-year",
"Small 4-year",
"Small 4-year",
"Small 4-year",
"Medium 4-year",
"Medium 4-year",
"Medium 4-year",
"Large 4-year",
"Large 4-year",
"Large 4-year",
"Special-focus institution")),
Degree.mode = data$PREDDEG,
Degree.high = data$HIGHDEG,
Degree.cert.1yr = (rowSums(data[,prog.level.codes[,c("CERT1")]] > 0) > 0),
Degree.cert.2yr = (rowSums(data[,prog.level.codes[,c("CERT2")]] > 0) > 0),
Degree.assoc = (rowSums(data[,prog.level.codes[,c("ASSOC")]] > 0) > 0),
Degree.cert.4yr = (rowSums(data[,prog.level.codes[,c("CERT4")]] > 0) > 0),
Degree.bachl = (rowSums(data[,prog.level.codes[,c("BACHL")]] > 0) > 0),
Cohort.size.earnings = data$count_ed,
Admission.rate = data$ADM_RATE,
SAT.mean = data$SAT_AVG,
Income.dep = data$DEP_INC_AVG,
Income.indep = data$IND_INC_AVG,
Income.mean = rowMeans(data[,c("DEP_INC_AVG","IND_INC_AVG")], na.rm=T),
Race.white = data$UG_WHITENH,
Race.black = data$UG_BLACKNH,
Race.pacific = data$UG_API,
Race.aboriginal = data$UG_AIANOld,
Race.hispanic = data$UG_HISPOld,
Comp.150 = rowMeans(data[,c("C150_4","C150_L4")], na.rm=T),
Comp.200 = rowMeans(data[,c("C200_4","C200_L4")], na.rm=T),
Comp.yr8 = data$COMP_ORIG_YR8_RT,
Cost.attend.aca = data$COSTT4_A,
Cost.attend.prog = data$COSTT4_P,
Cost.instate = data$TUITIONFEE_IN,
Cost.outstate = data$TUITIONFEE_OUT,
Cost.prog = data$TUITIONFEE_PROG,
Loan.percent = data$PCTFLOAN,
Debt.median = data$GRAD_DEBT_MDN,
Repay.1yr = data$RPY_1YR_RT,
Repay.3yr = data$RPY_3YR_RT,
Repay.5yr = data$RPY_5YR_RT,
Repay.7yr = data$RPY_7YR_RT,
Repay.7yr.mean = rowMeans(data[,c("RPY_1YR_RT","RPY_3YR_RT","RPY_5YR_RT","RPY_7YR_RT")], na.rm=T),
Earn.6yr = data$mn_earn_wne_p6,
Earn.7yr = data$mn_earn_wne_p7,
Earn.8yr = data$mn_earn_wne_p8,
Earn.9yr = data$mn_earn_wne_p9,
Earn.10yr = data$mn_earn_wne_p10,
Earn.10yr.mean = rowMeans(data[,c("mn_earn_wne_p6","mn_earn_wne_p7","mn_earn_wne_p8","mn_earn_wne_p9","mn_earn_wne_p10")], na.rm=T)
)
# Presume NAs are special institutions
data.min[is.na(data.min$Size),c("Size")] <- "Special-focus institution"
```
Exploratory Analysis - Institutions
-----------------------------------
In this section we look closely at the compiled data to determine trends, features, and other aspects that might be useful in answering our questions.
Using the converted Carnegie Size Classifications we can see the distribution of schools by size. Excluding the special purpose schools, the distribution appears to be bivariate with peaks among the 2-year and 4-year institutions, respectively.
```{r explore.school}
# Institution size
size.agg <- aggregate(data.min$UnitID, by=list(data.min$Size), FUN=length)
par.old <- par(mar=c(10,4,2,2))
barplot(size.agg$x,
names.arg=size.agg$Group.1,
col=topo.colors(nrow(size.agg)),
main="Distribution of Institution Sizes",
las=2,
cex.names=0.8)
```
Next, the institutions are aggregated by U.S. region and highest degree offered. Certificate programs are the most frequent (likely due to the large number of special purpose schools found above), followed by Bachelors and Associates degrees. There is significant variance in the number of institutions across different regions, but the same general curve of degree level is generally consistent.
After this the same data is aggregated by institution size instead of region. Special purpose schools again dominate in numbers while the bivariate peaks show Associates and Bachelors degrees are the most common among 2 and 4-year institutions respectively.
```{r explore.school.2}
par(par.old)
# Count of colleges By highest degree offered
barplot(height=tapply(data.min[["UnitID"]], data.min[,c("Degree.high", "Region")], length),
beside=T,
legend.text=levels(data.min$Degree.high),
cex.names=0.6,
col=heat.colors(length(levels(data.min$Degree.high))),
args.legend = list(x="topright", cex = 0.5),
main="Number of Institutions by Highest Degree Offered and Region")
# ... By degree and size
barplot(height=tapply(data.min[["UnitID"]], data.min[,c("Degree.high", "Size")], length),
beside=T,
legend.text=levels(data.min$Degree.high),
cex.names=0.6,
col=heat.colors(length(levels(data.min$Degree.high))),
args.legend = list(x="top", cex = 0.5),
main="Number of Institutions by Highest Degree Offered and Size")
```
Next we look at degree levels and subject areas of study. We show the percent of degrees awarded by discipline. There is a wide range of distributions, but the most common areas stand out as being more common amongst institutions: Health Professions, Business Management & Marketing, and Personal & Culinary Services.
```{r explore.school.3}
# Degree types awarded - none above 100% - ***Uses original data set
par.old <- par(mar=c(11,5,1,1))
boxplot(data[,pcip.codes],
names=cip.codes$name,
las=2,
pars=list(cex.axis=0.6),
main="% Degree Types Awarded for All Institutions")
```
Unfortunately, this is as far as we can go with per-degree analysis. The data documentation makes this explicit: "the data are not yet available to produce program-level earnings data." Briefly, this means that though we can see what types of degrees institutions are granting or offering in aggregaete, we cannot make any relationships to other aggregate data such as costs, demographics, etc.
Next we look at the cost data. Costs are compiled for all full-time, first-time, degree-/certificate-seeking undergraduates who receive Title IV aid, which means that costs relative to students who do not receive (or need) financial aid are not included. This indicates a possible skew away from more affluent students.
First we compare the different types of costs for all institutions. Predictably, the cost of tuition comes in below the overall costs of attendance and there are a large number of outliers, which is expected.
Second, we divide academic and program year costs by institution type. Private nonprofit institutions (which includes the likes of Harvard, Yale, et al.) dominates the Academic program costs while program year costs (for certificates, and professional education) are more consistent across institution types.
The third and fourth graphs distribute academic and program year costs by institution type and size. Program year data are sparse, demonstrating that not all types of institutions, particularly the larger and 4-year schools, offer non-academic programs. Academic year costs are clearly not normally distributed across institution types, let alone sizes. In general, private institutions, both for and non-profit, tend to cost more than public institutions.
```{r explore.school.4}
par(par.old)
# Costs of attendance, tuition, and fees - no extreme outliers
par.old <- par(mar=c(8,4,2,2))
boxplot(data.min[,c("Cost.attend.aca","Cost.attend.prog","Cost.instate",
"Cost.outstate","Cost.prog")],
names=c("Avg. COA, Academic","Avg. COA, Program","In-state Tuition",
"Out-of-state Tuition","Program Tuition"),
las=2,
pars=list(cex.axis=0.8),
main="Range of Costs for All Institutions")
par(par.old)
# ... by Type
par.old <- par(mfrow=c(1,2),oma=c(0,0,2,0))
boxplot(Cost.attend.aca~Type,
data=data.min,
pars=list(cex.axis=0.7),
ylim=c(0,round(max(data.min$Cost.attend.aca, data.min$Cost.attend.prog, na.rm=T), digits=-4)),
main="Academic Year Institutions")
boxplot(Cost.attend.prog~Type,
data=data.min,
pars=list(cex.axis=0.7),
ylim=c(0,round(max(data.min$Cost.attend.aca, data.min$Cost.attend.prog, na.rm=T), digits=-4)),
main="Program Year Institutions")
mtext("Total Costs Of Attendance", outer=T, cex=1.5)
par(par.old)
# ... and Size
qplot(Type,
Cost.attend.aca,
data=data.min,
facets=.~Size,
geom="boxplot",
ymin=0,
ymax=max(Cost.attend.aca, na.rm=T)) +
labs(x="Institution Type",
y="Academic-year Costs",
title="Costs vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
qplot(Type,
Cost.attend.prog,
data=data.min,
facets=.~Size,
geom="boxplot",
ymin=0,
ymax=max(Cost.attend.prog, na.rm=T)) +
labs(x="Institution Type",
y="Program-year Costs",
title="Costs vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Exploratory Analysis - Students
-------------------------------
Student demographics and statistics are represented in aggregate overall. First we look at admissions rates by institution then by school type and size. Both public and private nonprofit cover a wide range of rates, but private for-profit shows a marked increase in admission rates. Being that for-profit institutions have a warrant to admit as many students as possible to maximize profits, their standards are clearly less rigorous than the other types of institutions. This trend generally holds when looking at different school sizes. Of note is the fact that large 4-year private non-profit institutions have significantly lower admission rates, likely due to the presence of elite and ivy-league schools in that class.
```{r explore.students}
# Admissions rates - none above 100%
boxplot(Admission.rate~Type,
data=data.min,
main="Admissions Rates by Institution Type")
# ... by type and size
qplot(Type,
Admission.rate,
data=data.min,
facets=.~Size,
geom="boxplot") +
labs(x="Institution Type",
y="Admissions Rate",
title="Admissions Rate vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Entrance exams are widely considered to be a strong indicator of admissions and student performance. Below we see that private nonprofit institutions generally have aggregate student SAT scores that are higher than other classes of institutions. The range of scores is quite wide, however.
```{r explore.students.2}
# SAT scores - below 1600
boxplot(SAT.mean~Type,
data=data.min,
main="SAT Scores by Institution Type")
# ... by type and size
qplot(Type,
SAT.mean,
data=data.min,
facets=.~Size,
geom="boxplot") +
labs(x="Institution Type",
y="SAT Score",
title="SAT Scores vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Family income could be a predictor of various types of student performance as well as havign a relation to financial aid and costs. Again, private nonprofit institutions tend to have greater mean family income levels than other classes.
```{r explore.students.3}
# Demographics - family income
boxplot(Income.mean~Type,
data=data.min,
main="Average Family Income by Institution Type")
# ... by type and size
qplot(Type,
Income.mean,
data=data.min,
facets=.~Size, geom="boxplot") +
labs(x="Institution Type",
y="Average Family Income",
title="Average Family Income vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Race demographics are difficult to establish from the given data. At best we have the proportion of each institution that is a given race. Perhaps unsurprisingly white students tend to dominate, though there are some exclusively black and hispanic colleges.
The two graphs that follow show a trend for program completion rates and earnings for different races. Apparently, the more homogeneous towards white or pacific islander a school tends toward, the higher the rate. Conversely, the more black, native american, or hispanic a school is the lower the rate, the exception being among small 2-year schools, likely some of which are traditionally black colleges. Aggregates such as these are probably unreliable as there are many factors that could account for the trends.
```{r explore.students.4}
# Demographics - racial makeup
par.old <- par(mfrow=c(1,3),oma=c(0,0,2,0))
boxplot(Race.white~Type,
data=data.min,
pars=list(cex.axis=0.7),
ylim=c(0,1),
main="% White")
boxplot(Race.black~Type,
data=data.min,
pars=list(cex.axis=0.7),
ylim=c(0,1),
main="% Black")
boxplot(Race.hispanic~Type,
data=data.min,
pars=list(cex.axis=0.7),
ylim=c(0,1),
main="% Hispanic")
mtext("Racial Distribution", outer=T, cex=1.5)
par(par.old)
# Race comparisons
# Completion
par.old <- par(mfrow=c(1,5),oma=c(0,0,2,0))
ylim.race <- c(0,round(max(data.min$Comp.200, na.rm=T), digits=2))
plot(data.min$Race.white,
data.min$Comp.200,
xlab="%",
ylab="%",
ylim=ylim.race,
main="White")
abline(lm(Comp.200 ~ Race.white, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.black,
data.min$Comp.200,
xlab="%",
ylab="%",
ylim=ylim.race,
main="Black")
abline(lm(Comp.200 ~ Race.black, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.pacific,
data.min$Comp.200,
xlab="%",
ylab="%",
ylim=ylim.race,
main="Pacific Islander")
abline(lm(Comp.200 ~ Race.pacific, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.aboriginal,
data.min$Comp.200,
xlab="%",
ylab="%",
ylim=ylim.race,
main="Am. Indian")
abline(lm(Comp.200 ~ Race.aboriginal, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.hispanic,
data.min$Comp.200,
xlab="%",
ylab="%",
ylim=ylim.race,
main="Hispanic")
abline(lm(Comp.200 ~ Race.hispanic, data=data.min, na.action="na.exclude"),
col="red")
mtext("200% Time Completion Rate by Undergraduate Race", outer=T, cex=1.5)
par(par.old)
# Earnings
par.old <- par(mfrow=c(1,5),oma=c(0,0,2,0))
ylim.race <- c(10000,round(max(data.min$Earn.10yr.mean, na.rm=T), digits=-4))
plot(data.min$Race.white,
data.min$Earn.10yr.mean,
xlab="%",
ylab="$",
ylim=ylim.race,
main="White")
abline(lm(Earn.10yr.mean ~ Race.white, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.black,
data.min$Earn.10yr.mean,
xlab="%",
ylab="$",
ylim=ylim.race,
main="Black")
abline(lm(Earn.10yr.mean ~ Race.black, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.pacific,
data.min$Earn.10yr.mean,
xlab="%",
ylab="$",
ylim=ylim.race,
main="Pacific Islander")
abline(lm(Earn.10yr.mean ~ Race.pacific, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.aboriginal,
data.min$Earn.10yr.mean,
xlab="%",
ylab="$",
ylim=ylim.race,
main="Am. Indian")
abline(lm(Earn.10yr.mean ~ Race.aboriginal, data=data.min, na.action="na.exclude"),
col="red")
plot(data.min$Race.hispanic,
data.min$Earn.10yr.mean,
xlab="%",
ylab="$",
ylim=ylim.race,
main="Hispanic")
abline(lm(Earn.10yr.mean ~ Race.hispanic, data=data.min, na.action="na.exclude"),
col="red")
mtext("Earnings by Undergraduate Race", outer=T, cex=1.5)
par(par.old)
```
Next we look at aggregate completion rates. The data show no discernable trends with a wide variety between school types. The data are not complete, however. From the data documentation: "Exclusion of part-time students, transfer students, and students who do not start during the fall from IPEDS completion rates. First-time full-time students make up fewer than half of all college students, or even less in some sectors of institutions (e.g. community colleges). Furthermore, although schools have the option to report transfer outcomes for first-time full-time students, many choose not to."
```{r explore.students.5}
# Completion Rates
par.old <- par(mfrow=c(1,2),oma=c(0,0,2,0))
boxplot(Comp.150~Type,
data=data.min,
pars=list(cex.axis=0.7),
main="150% of Expected Time")
boxplot(Comp.200~Type,
data=data.min,
pars=list(cex.axis=0.7),
main="200% of Expected Time")
mtext("Completion Rates by Institution Type", outer=T, cex=1.5)
par(par.old)
# ... by type and size
qplot(Type,
Comp.200,
data=data.min,
facets=.~Size,
geom="boxplot") +
labs(x="Institution Type",
y="Completion Rate",
title="200% Completion Rates for Full-Time Students vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# 8 years completion at same school
boxplot(Comp.yr8~Type,
data=data.min,
main="% Completed Within 8 Yrs. at Original Institution")
```
The next section describe various student financial aid variables. As with costs, this data only reflects Title IV recipients, thus excluding more affluent students who may not need loans or fund their education privately. Predictably, more students take loans for the more expensive private schools, with a slight trend towards more accumulated debt.
Repayment depicts the fraction of borrowers at an institution who have not defaulted on their federal loans and who are making progress in paying them down. Positive repayment rates ioncrease over time which reflects the increase in salary that results from being in the workforce for a longer period of time.
```{r explore.students.6}
# Loans
boxplot(Loan.percent~Type,
data=data.min,
main="% Students with Federal Loans by Institution Type")
# Debt & Repayment
boxplot(Debt.median~Type,
data=data.min,
main="Median Accumulated Graduate Debt by Institution Type")
par.old <- par(mar=c(7,4,2,2))
boxplot(data.min[,c("Repay.1yr","Repay.3yr","Repay.5yr","Repay.7yr")],
names=c("1 Year in Repay","3 Years in Repay","5 Years in Repay","7 Years in Repay"),
las=2,
pars=list(cex.axis=0.8),
main="% of Borrowers Progressing on Repayment")
par(par.old)
```
Earnings data include only Title IV-receiving students, so figures may not be representative of schools with a low proportion of Title IV-eligible students. The earnings rates are generally consistent over the 10 year cohort period, with an increasing number of outliers as over time. This might reflect the general tendency towards income inequality, though the data here are insufficient to determine any proximate cause.
```{r explore.students.7}
# Earnings trends
par.old <- par(mar=c(7,4,2,2))
boxplot(data.min[,c("Earn.6yr","Earn.7yr","Earn.8yr","Earn.9yr","Earn.10yr")],
names=c("6 Years Post-Entry","7 Years Post-Entry","8 Years Post-Entry",
"9 Years Post-Entry","10 Years Post-Entry"),
las=2, pars=list(cex.axis=0.8),
main="Mean Earnings for Title IV Cohorts")
abline(h=25000,
col="red")
text(4.5,
24000,
"Threshold Earnings",
col="red",
cex=0.6)
par(par.old)
qplot(Type,
Earn.10yr.mean,
data=data.min,
facets=.~Size,
geom="boxplot") +
labs(x="Institution Type",
y="10 Year Average Earnings",
title="Mean Earnings for Title IV Cohorts vs Institution Type and Size") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Observations from Data
----------------------
To recap some of the observations made above and from statements in the data definitions:
* Many variables are available only for Title IV recipients, or students who receive federal grants and loans;
* All NSLDS and Treasury elements are protected for privacy purposes; those data are shown as PrivacySuppressed;
* For most variables, data are pooled across two years of data to reduce year-over-year variability in figures;
* Costs are compiled for all full-time, first-time, degree-/certificate-seeking undergraduates who receive Title IV aid;
* Exclusion of part-time students, transfer students, and students who do not start during the fall from IPEDS completion rates. First-time full-time students make up fewer than half of all college students, or even less in some sectors of institutions (e.g. community colleges). Furthermore, although schools have the option to report transfer outcomes for first-time full-time students, many choose not to;
* The data are not yet available to produce program-level earnings data;
* Earnings data include only Title IV-receiving students, so figures may not be representative of schools with a low proportion of Title IV-eligible students;
* Earnings data are restricted to students who are not enrolled, so students who are currently enrolled in graduate school at the time of measurement are excluded;
* Repayment depicts the fraction of borrowers at an institution who have not defaulted on their federal loans and who are making progress in paying them down;
* Missing cohort sizes for some metrics so we are unable to calculate rates reliably.
Exploratory Analysis - Comparisons
----------------------------------
Using the above observations, we can develop a correlation matrix to compare how significant variables appear to affect each other
```{r exploratory.analysis}
# correlation matrix
corrgram(data.min[c("Admission.rate","SAT.mean","Income.mean","Comp.200",
"Cost.attend.aca","Loan.percent","Debt.median","Earn.10yr.mean",
"Repay.7yr.mean")],
order=TRUE,
lower.panel=panel.shade,
upper.panel=NULL,
text.panel=panel.txt,
diag.panel=panel.minmax,
main="Correlation Matrix for Selected Data")
```
With respect to our first question, we can take cost and earnings as a metric for institutional value. Below we compare cost and earnings in specific contexts:
```{r exploratory.analysis.2}
# Correlations for value metrics
cor(data.min[c("Comp.200","Cost.attend.aca","Income.mean","Admission.rate")],
data.min[c("Debt.median","Earn.10yr.mean","Repay.7yr.mean")],
use="pairwise.complete.obs")
qplot(Cost.attend.aca,
Earn.10yr.mean,
data=data.min,
facets=Type~Region,
geom=c("point","smooth")) +
labs(x="Cost",
y="Earnings",
title="Cost vs Earnings by Type & Region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
qplot(Cost.attend.aca,
Earn.10yr.mean,
data=data.min,
facets=Size~Region,
geom=c("point","smooth")) +
labs(x="Cost",
y="Earnings",
title="Cost vs Earnings by Size & Region") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
qplot(Cost.attend.aca,
Earn.10yr.mean,
data=data.min,
facets=Size~Type,
geom=c("point","smooth")) +
labs(x="Cost",
y="Earnings",
title="Cost vs Earnings by Size & Type") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Regarding our second question, we explore some variables likely to relate to completion rate:
```{r exploratory.analysis.3}
# Correlations for success metrics
cor(data.min[c("SAT.mean","Income.mean","Cost.attend.aca","Loan.percent",
"Debt.median","Admission.rate")],
data.min$Comp.200,
use="pairwise.complete.obs")
```
Model Development - Institutional Value
------------------------------------
We now attempt to build a model for institutional value based on the following features:
* Insitution type - we have observed how earnings vary based on the type of institution;
* Institution size - we have seen how size of institution has an effect on cost and earnings;
* Admission rate - more rigorous or elite schools have lower admissions rates which affects the perception of a degree from those schools and can affect earnings potential.
To begin, we create a factor to indicate the relative value of a school, expressed as a ratio of earnings to cost. The higher the ratio, the better the value. We use a 300% ratio as a cutoff to determine high versus average or low value institutions.
We analyzed only complete cases of the variables above. This significantly limited the population for analysis but ensured better results.
A number of models were tried against these variables. Most indicated the presence of some non-linear factors, but we were unable to compensate for those. In the end, a General Linear Model provided the most consistent model for the analysis.
Unfortunately, the population contained few cases of "high" value subjects, so the results, despite an apparently strong accuracy rate, are somewhat ambiguous. Undoubtedely the nonlinear aspects of the model will have to be adjusted or otherwise accounted for, or we would have to find better features to test on.
The top 10 institutions resulting from this model are listed below as are the significant p factor values. The result is that a combination of certain sizes of 4-year institutions that are private nonprofits and have a low admission rate predict higher value.
```{r analysis.value}
# Create a value factor
data.min$Cost.earn.ratio <- cut(data.min$Earn.10yr.mean/data.min$Cost.attend.aca,
breaks=c(0,3,99),
labels=c("Low","High"))
# complete.cases to avoid imputation
# contributing features: type, size, admissions rate
data.comp.value <- data.min[complete.cases(data.min[,c("Cost.earn.ratio","Type","Size",
"Admission.rate")]),
c("UnitID","Name","Cost.earn.ratio","Type","Size","Admission.rate")]
rownames(data.comp.value) <- paste(data.comp.value$Name, data.comp.value$UnitID) # Identify rows for analysis
value.train <- createDataPartition(y=data.comp.value$Cost.earn.ratio, p=0.5, list=FALSE)
value.training<-data.comp.value[value.train,]
value.testing<-data.comp.value[-value.train,]
set.seed(14345)
value.fit<-train(Cost.earn.ratio ~ Size+Type+Admission.rate, data=value.training, method='glm')
par.old <- par(mfrow=c(2,2))
plot(value.fit$finalModel)
par(par.old)
# Top 10 earnings to cost ratio schools
head(sort(value.fit$finalModel$residuals, decreasing=T), 10)
value.pred <- predict(value.fit, newdata=value.testing)
confusionMatrix(value.pred, value.testing$Cost.earn.ratio)
value.p <- t <- coef(summary(value.fit))[,4]
value.p[value.p < 0.05]
```
Model Development - Graduation Potential
----------------------------------------
We now attempt to build a model for graduation potential based on the following features:
* SAT Scores
* Mean Family Income
* Costs of attendance
Again, we analyzed only complete cases of the variables above. This significantly limited the population for analysis but ensured better results.
We selected a Linear Model with multiple regression. Outside of a few outliers this model performed well. The error rate is perhaps a bit high owing to these outliers and other factors. The summary statistics for the model show that each variable tested is significant, meaning a high SAT score, high family income, and more expensive (meaning exclusive) schools result in higher completion outcomes.
```{r analysis.grad}
# complete.cases to avoid imputation
# Contributing features: SAT scores, family income, and program cost, Admission.rate (race%?)
data.comp.compl <- data.min[complete.cases(data.min[,c("SAT.mean","Income.mean",
"Cost.attend.aca","Comp.200")]),
c("UnitID","Name","SAT.mean","Income.mean","Cost.attend.aca",
"Admission.rate","Comp.200")]
rownames(data.comp.compl) <- paste(data.comp.compl$Name, data.comp.compl$UnitID) # Identify rows for analysis
compl.train <- createDataPartition(y=data.comp.compl$Comp.200, p=0.5, list=FALSE)
compl.training<-data.comp.compl[compl.train,]
compl.testing<-data.comp.compl[-compl.train,]
set.seed(14445)
compl.fit<-train(Comp.200 ~ SAT.mean+Income.mean+Cost.attend.aca, data=compl.training, method='lm')
par.old <- par(mfrow=c(2,2))
plot(compl.fit$finalModel)
par(par.old)
compl.pred <- predict(compl.fit, newdata=compl.testing)
thresh.recode <- function(d) ifelse(d > thresh.compl, 1, 0)
thresh.compl <- 0.6
confusionMatrix(thresh.recode(compl.pred), thresh.recode(compl.testing$Comp.200))
plot.roc(thresh.recode(compl.pred), thresh.recode(compl.testing$Comp.200))
compl.p <- t <- coef(summary(compl.fit))[,4]
compl.p[compl.p < 0.05]
```
---
(end)