Analysis of Computer Science Colleges

By: Nate Rose

Introduction

I chose to do an analysis on different computer science programs that colleges offer. Throughout this tutorial, my aim is to discover if there are relationships between how much money is put into the program and how much money students are earning coming out of the program.

Every year, the United States Department of Education releases a College Scorecard. This Scorecard is public data and contains metrics about each school. Some of these include location, admission rate, average cost, average debt, graduation rate, full-time enrollment, and more. They also release a similar dataset localized by Field of Study. This second dataset contains information specific to each field of study, or major, that each college provides. This dataset contains metrics for median earnings, monthly earnings, monthly loan payment, number of graduates, and more. We will strictly be looking at these metrics for computer science degrees.

Our goal for this tutorial is to read and clean this data so that we may be able to perform an analysis and find relationships within the data.

Setting Up Our Data

We will be using Python along with the following packages: pandas, matplotlib, seaborn, and numpy.

In [149]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np

Gathering the Data

The first thing we need to do is gather the data. For my specific topic, I need data that is reliable, so I gathered my information from the US Department of Education's public College Scorecard. I am gathering data on colleges as a whole and specific majors of each college during the 2017-2018 and 2018-2019 academic years. This data set contains almost 3,000 columns, so we will need to process the data in order to extract only what is necessary.

Reading the Data

After mounting my drive for this tutorial, we use pandas' read_csv method to read the data collected from the Department of Education into pandas dataframes.

In [150]:
college_data_1718_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/MERGED2017_18_PP.csv", low_memory=False)
college_data_1819_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/MERGED2018_19_PP.csv", low_memory=False)
field_of_study_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/FieldOfStudyData1718_1819_PP.csv")

college_data_1718_orig.head()
Out[150]:
UNITID OPEID OPEID6 INSTNM CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL ... COUNT_WNE_MALE1_P8 MD_EARN_WNE_MALE1_P8 GT_THRESHOLD_P10 MD_EARN_WNE_INC1_P10 MD_EARN_WNE_INC2_P10 MD_EARN_WNE_INC3_P10 MD_EARN_WNE_INDEP1_P10 MD_EARN_WNE_INDEP0_P10 MD_EARN_WNE_MALE0_P10 MD_EARN_WNE_MALE1_P10
0 100654 100200 1002 Alabama A & M University Normal AL 35762 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 100663 105200 1052 University of Alabama at Birmingham Birmingham AL 35294-0110 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 100690 2503400 25034 Amridge University Montgomery AL 36117-3553 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 100706 105500 1055 University of Alabama in Huntsville Huntsville AL 35899 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 100724 100500 1005 Alabama State University Montgomery AL 36104-0271 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 2989 columns

In [151]:
college_data_1819_orig.head()
Out[151]:
UNITID OPEID OPEID6 INSTNM CITY STABBR ZIP ACCREDAGENCY INSTURL NPCURL ... COUNT_WNE_MALE1_P8 MD_EARN_WNE_MALE1_P8 GT_THRESHOLD_P10 MD_EARN_WNE_INC1_P10 MD_EARN_WNE_INC2_P10 MD_EARN_WNE_INC3_P10 MD_EARN_WNE_INDEP1_P10 MD_EARN_WNE_INDEP0_P10 MD_EARN_WNE_MALE0_P10 MD_EARN_WNE_MALE1_P10
0 100654 100200 1002 Alabama A & M University Normal AL 35762 NaN NaN NaN ... 834.0 36639.0 0.6044 34076.0 35597.0 43145.0 40299.0 35424.0 36050.0 36377.0
1 100663 105200 1052 University of Alabama at Birmingham Birmingham AL 35294-0110 NaN NaN NaN ... 1233.0 49652.0 0.7472 42254.0 49817.0 51571.0 48182.0 46435.0 42007.0 56164.0
2 100690 2503400 25034 Amridge University Montgomery AL 36117-3553 NaN NaN NaN ... 78.0 50355.0 0.6286 36636.0 44836.0 NaN 39040.0 NaN 32311.0 49599.0
3 100706 105500 1055 University of Alabama in Huntsville Huntsville AL 35899 NaN NaN NaN ... 891.0 57542.0 0.7769 49469.0 60533.0 57411.0 56884.0 53803.0 45170.0 66070.0
4 100724 100500 1005 Alabama State University Montgomery AL 36104-0271 NaN NaN NaN ... 1077.0 32797.0 0.5178 30634.0 34533.0 38216.0 30602.0 32364.0 29836.0 35315.0

5 rows × 2989 columns

We can see from the above dataframes that there are 2,989 columns containing metrics and data. This is way too many to be effective! We will need to preprocess the data in order to produce a more manageable dataframe. Also, some of the columns' names don't make sense, so we will rename them to increase readability. I was able to find more information on the columns in the Department of Education's glossary of the College Scorecard.

In [152]:
field_of_study_orig.head()
Out[152]:
UNITID OPEID6 INSTNM CONTROL MAIN CIPCODE CIPDESC CREDLEV CREDDESC IPEDSCOUNT1 ... EARN_COUNT_WNE_3YR EARN_CNTOVER150_3YR EARN_COUNT_PELL_NE_3YR EARN_PELL_NE_MDN_3YR EARN_COUNT_NOPELL_NE_3YR EARN_NOPELL_NE_MDN_3YR EARN_COUNT_MALE_NE_3YR EARN_MALE_NE_MDN_3YR EARN_COUNT_NOMALE_NE_3YR EARN_NOMALE_NE_MDN_3YR
0 100654.0 1002 Alabama A & M University Public 1 100 Agriculture, General. 3 Bachelor’s Degree NaN ... PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed
1 100654.0 1002 Alabama A & M University Public 1 109 Animal Sciences. 3 Bachelor’s Degree 5.0 ... PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed
2 100654.0 1002 Alabama A & M University Public 1 110 Food Science and Technology. 3 Bachelor’s Degree 9.0 ... PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed
3 100654.0 1002 Alabama A & M University Public 1 110 Food Science and Technology. 5 Master's Degree 5.0 ... PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed
4 100654.0 1002 Alabama A & M University Public 1 110 Food Science and Technology. 6 Doctoral Degree 1.0 ... PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed PrivacySuppressed

5 rows × 121 columns

A similar issue arises with this dataset. We will need to pull only the columns that are necessary for analysis and rename them so that they make more sense. Explanations of some important columns were also found in the Department of Education's documentation for this dataset.
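Since only a handful of columns are needed, one option is to ask pandas to load just those columns in the first place. The sketch below is a minimal illustration, not what this tutorial actually ran: it uses a tiny in-memory CSV as a stand-in for a Scorecard file (the sample rows and the `wanted` list are made up), and relies on the standard `usecols` and `na_values` parameters of `read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for a Scorecard CSV (made-up rows); the real file is far wider.
sample_csv = io.StringIO(
    "UNITID,INSTNM,CITY,ADM_RATE,GRAD_DEBT_MDN_SUPP\n"
    "1,College A,Town A,0.90,34500\n"
    "2,College B,Town B,0.75,PrivacySuppressed\n"
)

# Hypothetical list of the columns we care about
wanted = ["UNITID", "INSTNM", "ADM_RATE", "GRAD_DEBT_MDN_SUPP"]

# usecols limits which columns are parsed;
# na_values converts the suppression marker to NaN while reading
small = pd.read_csv(sample_csv, usecols=wanted, na_values=["PrivacySuppressed"])
print(small.dtypes)
```

With the marker treated as missing at read time, the debt column comes back numeric, so no later string-to-float conversion is needed.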

Tidying the Data

Once we have our data, we need to prepare it for our analysis later on. One of the main issues we saw before was a dataframe that is too large. We need to pull only the necessary columns from that original dataframe in order to have one that is more manageable. For the college data, we will take the College ID, Name, 4-Year Public Tuition, 4-Year Private Tuition, Acceptance Rate, and Median Total Debt After Graduation. From the field of study data, we will take the College ID, Name, Median Salary 3 Years After Graduation, Degree Description, and Major Description.

In [153]:
college_data_1718 = college_data_1718_orig[["UNITID", "INSTNM", "NPT4_PUB", "NPT4_PRIV", "ADM_RATE", "GRAD_DEBT_MDN_SUPP"]]
college_data_1819 = college_data_1819_orig[["UNITID", "INSTNM", "NPT4_PUB", "NPT4_PRIV", "ADM_RATE", "GRAD_DEBT_MDN_SUPP"]]
field_of_study = field_of_study_orig[["UNITID", "INSTNM", "EARN_NE_MDN_3YR", "CREDDESC", "CIPDESC"]]

Since we are only looking at computer science programs and how they relate to salaries, we want to tidy the field of study data a bit further. We only want rows that contain data for the salary after three years for computer science graduates.

In [154]:
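# keep computer science rows with a reported salary; .copy() gives an independent dataframe we can safely modify later
computer_science_data = field_of_study[(field_of_study["EARN_NE_MDN_3YR"] != "PrivacySuppressed") & (field_of_study["CIPDESC"] == "Computer Science.")].copy()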
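As an alternative way to handle the "PrivacySuppressed" marker, the salary column can be converted to numbers first and the missing values filtered afterwards. This is only a sketch on a made-up miniature of the field-of-study table (the `toy` dataframe and its values are assumptions), using `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Hypothetical miniature of the field-of-study table
toy = pd.DataFrame({
    "CIPDESC": ["Computer Science.", "Computer Science.", "Animal Sciences."],
    "EARN_NE_MDN_3YR": ["65167", "PrivacySuppressed", "41000"],
})

# errors="coerce" turns anything non-numeric (here the suppression marker) into NaN
toy["EARN_NE_MDN_3YR"] = pd.to_numeric(toy["EARN_NE_MDN_3YR"], errors="coerce")

# Keep computer science rows that have a usable salary value
cs_rows = toy[(toy["CIPDESC"] == "Computer Science.") & toy["EARN_NE_MDN_3YR"].notna()]
print(cs_rows)
```

One advantage of this order is that rows whose salary was already missing are dropped along with the suppressed ones.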

Next, we want to rename the columns so that we can easily read and understand the dataframe and the data it contains.

In [155]:
# rename columns for readability
college_data_1718.columns = ["ID", "NAME", "17-18 AVG PUB TUIT", "17-18 AVG PRIV TUIT", "17-18 ACC RATE", "17-18 TOTAL DEBT"]
college_data_1819.columns = ["ID", "NAME", "18-19 AVG PUB TUIT", "18-19 AVG PRIV TUIT", "18-19 ACC RATE", "18-19 TOTAL DEBT"]
computer_science_data.columns = ["ID", "NAME", "MEDIAN SALARY", "DEGREE", "MAJOR"]

We now have three readable dataframes: one on college data in the 2017-2018 academic year, one on college data in the 2018-2019 academic year, and one on major-specific data from the 2017-2018 and 2018-2019 academic years. We want to merge all these into one dataframe that we can then conduct analysis on. We first start by merging the two dataframes on college data. This will produce one dataframe that contains information of colleges in the 2017-2018 and 2018-2019 academic years. We do this because it will match with the same years in our dataframe specific to fields of study.

In [156]:
college_data = pd.merge(college_data_1718, college_data_1819, how='inner', on=['ID', 'NAME'])
college_data.head()
Out[156]:
ID NAME 17-18 AVG PUB TUIT 17-18 AVG PRIV TUIT 17-18 ACC RATE 17-18 TOTAL DEBT 18-19 AVG PUB TUIT 18-19 AVG PRIV TUIT 18-19 ACC RATE 18-19 TOTAL DEBT
0 100654 Alabama A & M University 15184.0 NaN 0.9027 34500 14444.0 NaN 0.8986 33375
1 100663 University of Alabama at Birmingham 17535.0 NaN 0.9181 22500 17005.0 NaN 0.9211 22500
2 100690 Amridge University NaN 9649.0 NaN 25002 NaN 15322.0 NaN 27334
3 100706 University of Alabama in Huntsville 19986.0 NaN 0.8123 22021 20909.0 NaN 0.8087 21607
4 100724 Alabama State University 12874.0 NaN 0.9787 32637 13043.0 NaN 0.9774 32000
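An inner merge silently drops any college that appears in only one of the two years. To see how many rows that affects, a sketch like the following can help. It uses two made-up three-row dataframes (`left` and `right` are assumptions, not the real data) together with the `indicator` and `validate` options of `pd.merge`.

```python
import pandas as pd

# Made-up stand-ins for the two yearly dataframes
left = pd.DataFrame({"ID": [1, 2, 3], "NAME": ["A", "B", "C"],
                     "17-18 ACC RATE": [0.90, 0.80, 0.70]})
right = pd.DataFrame({"ID": [1, 2, 4], "NAME": ["A", "B", "D"],
                      "18-19 ACC RATE": [0.91, 0.79, 0.50]})

# how="outer" keeps unmatched rows, indicator=True records where each row came from,
# and validate="one_to_one" raises an error if a key is duplicated on either side
checked = pd.merge(left, right, how="outer", on=["ID", "NAME"],
                   validate="one_to_one", indicator=True)
print(checked["_merge"].value_counts())

# Rows present in both years are the ones an inner merge would have kept
both_years = checked[checked["_merge"] == "both"].drop(columns="_merge")
```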

The above shows the resulting dataframe. Instead of having tuition separated by private vs public and having tuition, acceptance rate, and total debt separated by year, we will take the average of the two years. This way, we can have a more accurate comparison with the computer science data that spans two years. Some of the data, specifically the Total Debt, needs to be cleaned in order to do this. We replace all values in both Total Debt columns that are "PrivacySuppressed" with NaN. We then convert all values in this column from strings to floats.

In [157]:
for index, row in college_data.iterrows():
  # public schools report a public tuition figure; average the two years
  if pd.notna(row["17-18 AVG PUB TUIT"]):
    avg_tuit = (row["17-18 AVG PUB TUIT"] + row["18-19 AVG PUB TUIT"]) / 2
    college_data.at[index, "AVG PUB TUIT"] = avg_tuit
    college_data.at[index, "SCHOOL TYPE"] = "PUBLIC"
  else:
    college_data.at[index, "AVG PUB TUIT"] = np.nan
  # private schools report a private tuition figure; average the two years
  if pd.notna(row["17-18 AVG PRIV TUIT"]):
    avg_tuit = (row["17-18 AVG PRIV TUIT"] + row["18-19 AVG PRIV TUIT"]) / 2
    college_data.at[index, "AVG PRIV TUIT"] = avg_tuit
    college_data.at[index, "SCHOOL TYPE"] = "PRIVATE"
  else:
    college_data.at[index, "AVG PRIV TUIT"] = np.nan
# mean(axis=1) skips NaN, so this averages whichever tuition columns are present
college_data["AVG TUITION"] = college_data[["17-18 AVG PUB TUIT", "18-19 AVG PUB TUIT", "17-18 AVG PRIV TUIT", "18-19 AVG PRIV TUIT"]].mean(axis=1)
college_data["AVG ACC RATE"] = college_data[["17-18 ACC RATE", "18-19 ACC RATE"]].mean(axis=1)
# replace the suppression marker with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0), then convert to floats
college_data["17-18 TOTAL DEBT"] = college_data["17-18 TOTAL DEBT"].replace("PrivacySuppressed", np.nan)
college_data["18-19 TOTAL DEBT"] = college_data["18-19 TOTAL DEBT"].replace("PrivacySuppressed", np.nan)
college_data["17-18 TOTAL DEBT"] = college_data["17-18 TOTAL DEBT"].astype(float)
college_data["18-19 TOTAL DEBT"] = college_data["18-19 TOTAL DEBT"].astype(float)
college_data["AVG TOTAL DEBT"] = college_data[["17-18 TOTAL DEBT", "18-19 TOTAL DEBT"]].mean(axis=1)

college_data = college_data[["ID", "NAME", "AVG TUITION", "AVG PUB TUIT", "AVG PRIV TUIT","AVG ACC RATE", "AVG TOTAL DEBT", "SCHOOL TYPE"]]
college_data.head()
Out[157]:
ID NAME AVG TUITION AVG PUB TUIT AVG PRIV TUIT AVG ACC RATE AVG TOTAL DEBT SCHOOL TYPE
0 100654 Alabama A & M University 14814.0 14814.0 NaN 0.90065 33937.5 PUBLIC
1 100663 University of Alabama at Birmingham 17270.0 17270.0 NaN 0.91960 22500.0 PUBLIC
2 100690 Amridge University 12485.5 NaN 12485.5 NaN 26168.0 PRIVATE
3 100706 University of Alabama in Huntsville 20447.5 20447.5 NaN 0.81050 21814.0 PUBLIC
4 100724 Alabama State University 12958.5 12958.5 NaN 0.97805 32318.5 PUBLIC
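The row-by-row averaging above can also be written without an explicit loop. The sketch below is one possible vectorized version on a two-row toy dataframe (the values are taken from the first rows shown earlier). Note two assumptions that differ slightly from the loop: `mean(axis=1)` skips missing values, so a school with only one year of data keeps that year's figure, and schools with neither tuition value are labeled "UNKNOWN" (a hypothetical label).

```python
import numpy as np
import pandas as pd

# Two-row toy version of the merged college dataframe
toy = pd.DataFrame({
    "17-18 AVG PUB TUIT": [15184.0, np.nan],
    "18-19 AVG PUB TUIT": [14444.0, np.nan],
    "17-18 AVG PRIV TUIT": [np.nan, 9649.0],
    "18-19 AVG PRIV TUIT": [np.nan, 15322.0],
})

# Row-wise means; NaN values are skipped, and all-NaN rows stay NaN
toy["AVG PUB TUIT"] = toy[["17-18 AVG PUB TUIT", "18-19 AVG PUB TUIT"]].mean(axis=1)
toy["AVG PRIV TUIT"] = toy[["17-18 AVG PRIV TUIT", "18-19 AVG PRIV TUIT"]].mean(axis=1)

# Label each school by which tuition column it reports
toy["SCHOOL TYPE"] = np.select(
    [toy["AVG PUB TUIT"].notna(), toy["AVG PRIV TUIT"].notna()],
    ["PUBLIC", "PRIVATE"],
    default="UNKNOWN",
)
print(toy[["AVG PUB TUIT", "AVG PRIV TUIT", "SCHOOL TYPE"]])
```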

The final step in combining the data is to merge the above dataframe with the computer science data and do any final cleaning. We need to convert the Median Salary field to floats and remove any rows that do not have data for the average tuition (this only removes one row from the data).

In [158]:
computer_science_college_data = pd.merge(college_data, computer_science_data, how='inner', on=['ID', 'NAME'])
# .copy() makes the filtered result an independent dataframe, so the assignment below does not trigger a SettingWithCopyWarning
computer_science_college_data = computer_science_college_data[computer_science_college_data["AVG TUITION"].notna()].copy()
computer_science_college_data["MEDIAN SALARY"] = computer_science_college_data["MEDIAN SALARY"].astype(float)
computer_science_college_data.head()
Out[158]:
ID NAME AVG TUITION AVG PUB TUIT AVG PRIV TUIT AVG ACC RATE AVG TOTAL DEBT SCHOOL TYPE MEDIAN SALARY DEGREE MAJOR
0 102094 University of South Alabama 14976.5 14976.5 NaN 0.80695 25000.0 PUBLIC 65167.0 Bachelor’s Degree Computer Science.
1 102845 Charter College 30133.0 NaN 30133.0 1.00000 14523.0 PRIVATE 36051.0 Associate's Degree Computer Science.
2 104151 Arizona State University Campus Immersion 11403.5 11403.5 NaN 0.84465 20313.5 PUBLIC 86662.0 Bachelor’s Degree Computer Science.
3 104151 Arizona State University Campus Immersion 11403.5 11403.5 NaN 0.84465 20313.5 PUBLIC 114816.0 Master's Degree Computer Science.
4 104179 University of Arizona 14687.0 14687.0 NaN 0.84000 20085.5 PUBLIC 86387.0 Bachelor’s Degree Computer Science.

Now we have a very manageable and readable dataframe that we can use! Woohoo!

Average Tuition in Relation to Salary

The first relationship we should look at is whether average tuition is related to median salary three years after graduation. We begin this by creating a scatter plot and regression line to visualize the relationship between the two metrics.

In [159]:
fig, ax = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", hue="DEGREE", data=computer_science_college_data)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=computer_science_college_data, scatter=False)
plt.title("Median Salary 3 Years After Graduation Compared to Average Tuition")
plt.xlabel("Average Tuition in US Dollars")
plt.ylabel("Median Salary")
plt.show()

From this scatter plot, we can see a general trend in which higher tuition is associated with a higher median salary, but there is still a lot of variation in the data. We can further analyze this by splitting the data by degree.

In [160]:
degrees = ["Associate's Degree", "Bachelor’s Degree", "Master's Degree", "Doctoral Degree"]
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary vs Average Tuition at Different Degree Levels")

for degree, ax in zip(degrees, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["DEGREE"] == degree]
  sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, ax=ax)
  sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
  ax.set_title(degree)
  ax.set_xlabel("Average Tuition")
  ax.set_ylabel("Median Salary")

plt.show()

From these graphs we can determine that the type of degree is an important factor when looking at the relationship between tuition and salary.

For starters, computer science graduates with an Associate's degree are less common and make less money soon after graduation. Also, excluding the outlier, the median salary for computer science graduates with an Associate's degree is generally the same no matter the school's tuition.

When looking at graduates with a Master's degree, there doesn't seem to be a relationship between tuition and salary soon after graduation. The linear regression line also suggests that average tuition has little bearing on median salary soon after graduation.

However, there does appear to be a positive relationship between tuition and salary soon after graduation for Bachelor's degree graduates. There is a general trend that the higher the average tuition, the higher the median salary soon after graduation.

Lastly, we cannot make any meaningful generalizations from computer science graduates of Doctoral programs. There is not enough data for this group, possibly because we removed rows that were privacy suppressed which could have removed a lot of schools from this group.
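To back up visual impressions like these with numbers, one could count the data points in each degree group and compute a correlation coefficient per group. The following is a rough sketch on made-up data (the `toy` dataframe, its simplified degree labels, and the cutoff of three points are all assumptions); it uses the Pearson correlation from `Series.corr`.

```python
import pandas as pd

# Made-up data with simplified degree labels
toy = pd.DataFrame({
    "DEGREE": ["Bachelor", "Bachelor", "Bachelor",
               "Master", "Master", "Master", "Doctoral"],
    "AVG TUITION": [10000, 20000, 30000, 10000, 20000, 30000, 25000],
    "MEDIAN SALARY": [60000, 70000, 80000, 90000, 90500, 89500, 120000],
})

rows = []
for degree, group in toy.groupby("DEGREE"):
    # A correlation from fewer than three points is not meaningful, so report NaN
    if len(group) > 2:
        r = group["AVG TUITION"].corr(group["MEDIAN SALARY"])
    else:
        r = float("nan")
    rows.append({"DEGREE": degree, "n": len(group), "pearson_r": r})

summary = pd.DataFrame(rows)
print(summary)
```

A value of `pearson_r` near 1 indicates a strong positive linear trend, a value near 0 indicates little linear relationship, and a small `n` is a warning that the group is too thin to generalize from.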

To make more sense of these graphs, we should look at the difference between salary and debt. One's salary can be very different depending on the amount of debt they've accumulated. To incorporate this into the graph, we can create a formula to calculate the total debt after 3 years of interest, and subtract this from the median salary. According to Credible, the average student loan interest was around 5%.

In [161]:
# grow the debt at 5% per year, compounded annually for 3 years, then subtract it from the salary
growth_factor = 1.05  # 1 + 5% annual interest
years = 3
debt_after_interest = computer_science_college_data["AVG TOTAL DEBT"] * (growth_factor ** years)
computer_science_college_data["MEDIAN SALARY AFTER DEBT"] = computer_science_college_data["MEDIAN SALARY"] - debt_after_interest

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary After Debt vs Average Tuition at Different Degree Levels")

for degree, ax in zip(degrees, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["DEGREE"] == degree]
  sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, ax=ax)
  sb.regplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
  ax.set_title(degree)
  ax.set_xlabel("Average Tuition")
  ax.set_ylabel("Median Salary After Debt")

plt.show()

After this change, the graphs still hold the same generalizations. However, now they hold a bit more meaning than the graphs above.
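For reference, the debt adjustment used here is plain annual compounding: the balance after n years is the original debt times (1 + rate) to the power n. The small helper below is a hypothetical illustration of that arithmetic (the function name and the example debt of 20,000 dollars are assumptions); like the calculation above, it assumes no payments are made during the three years.

```python
def debt_after_interest(principal, annual_rate=0.05, years=3):
    """Balance after `years` of annual compounding, assuming no payments are made."""
    return principal * (1 + annual_rate) ** years


# Made-up example: 20,000 dollars of debt at 5% for 3 years
example_debt = 20000.0
balance = debt_after_interest(example_debt)
print(round(balance, 2))
```

Three years at 5% multiplies the debt by about 1.158, so roughly 16% is added to the original balance before it is subtracted from the salary.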

Next, let's look at median salaries after splitting by private versus public schools.

In [162]:
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))

plt.suptitle("Median Salary vs Average Tuition at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
  sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, ax=ax)
  sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
  ax.set_title(school_type)
  ax.set_xlabel("Average Tuition")
  ax.set_ylabel("Median Salary")

plt.show()

These graphs are really cool! They show us that, in general, graduates from private and public schools earn similar salaries, but that there is more spread in how much graduates from private schools earn. They also show us the difference in the distribution of average tuition between private and public schools.

Let's again factor in student loan debts.

In [163]:
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))

plt.suptitle("Median Salary After Debt vs Average Tuition at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
  sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, ax=ax)
  sb.regplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
  ax.set_title(school_type)
  ax.set_xlabel("Average Tuition")
  ax.set_ylabel("Median Salary After Debt")

plt.show()

This is much more interesting. Both linear regression lines flattened, meaning that after debt is factored in, paying a higher tuition does not necessarily mean a higher salary.
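One way to put a number on how much a regression line "flattened" is to compare the fitted slopes before and after debt is subtracted. The sketch below does this with `np.polyfit` on made-up arrays (all values are assumptions chosen so that debt grows with tuition); it is not the fit seaborn produced above, just the same kind of straight-line fit.

```python
import numpy as np

# Made-up data: salary rises with tuition, and so does debt
tuition = np.array([10000.0, 20000.0, 30000.0, 40000.0])
salary = np.array([60000.0, 70000.0, 80000.0, 90000.0])
debt = np.array([10000.0, 18000.0, 26000.0, 34000.0])
salary_after_debt = salary - debt

# polyfit with deg=1 returns (slope, intercept) of the least-squares line
slope_before, _ = np.polyfit(tuition, salary, deg=1)
slope_after, _ = np.polyfit(tuition, salary_after_debt, deg=1)

print("slope before debt:", round(slope_before, 3))
print("slope after debt:", round(slope_after, 3))
```

A slope closer to zero after the adjustment means each extra dollar of tuition is associated with a smaller gain in salary once debt is taken into account.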

Let's now take a look at how the distributions of median salaries differ between public and private schools before and after debt is applied.

In [164]:
x_vals = ["MEDIAN SALARY", "MEDIAN SALARY AFTER DEBT"]
titles = ["Before Debt is Applied", "After Debt is Applied"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))

plt.suptitle("Distribution of Median Salary Before and After Debt is Applied")
for i, ax in zip(range(2), axs.ravel()):
  sb.boxplot(x=x_vals[i], y="SCHOOL TYPE", data=computer_science_college_data, ax=ax)
  ax.set_title(titles[i])
  ax.set_xlabel("Median Salary")
  ax.set_ylabel("School Type")

plt.show()

before_medians = computer_science_college_data.groupby(["SCHOOL TYPE"])["MEDIAN SALARY"].median()
after_medians = computer_science_college_data.groupby(["SCHOOL TYPE"])["MEDIAN SALARY AFTER DEBT"].median()

print("\nMedian Salaries Before Debt Is Applied")
print("PUBLIC: " + str(before_medians["PUBLIC"]))
print("PRIVATE: " + str(before_medians["PRIVATE"]))

print("\nMedian Salaries After Debt Is Applied")
print("PUBLIC: " + str(after_medians["PUBLIC"]))
print("PRIVATE: " + str(after_medians["PRIVATE"]))
Median Salaries Before Debt Is Applied
PUBLIC: 76432.0
PRIVATE: 83550.5

Median Salaries After Debt Is Applied
PUBLIC: 53461.171875
PRIVATE: 53426.089906249996

Wow! From this we can see that before we take debt into account, graduates from private schools tend to earn a higher salary soon after graduation. However, when we consider debt, the median salaries for private and public school graduates are almost the same! Again, we can see that the salary spread for graduates of private schools is much greater than that for graduates of public schools.
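The spread that the box plots show can also be summarized numerically. As a rough sketch, the code below computes the median and the interquartile range (the width of the box in a box plot) for each school type on made-up salaries (the `toy` dataframe and the `iqr` helper are assumptions).

```python
import pandas as pd

# Made-up salaries for three public and three private programs
toy = pd.DataFrame({
    "SCHOOL TYPE": ["PUBLIC"] * 3 + ["PRIVATE"] * 3,
    "MEDIAN SALARY": [70000.0, 76000.0, 80000.0, 50000.0, 84000.0, 120000.0],
})


def iqr(values):
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return values.quantile(0.75) - values.quantile(0.25)


# One row per school type, with a median column and an iqr column
spread = toy.groupby("SCHOOL TYPE")["MEDIAN SALARY"].agg(["median", iqr])
print(spread)
```

A larger `iqr` for one school type corresponds to a wider box, which is the "greater spread" described above.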

Type of Degree in Relation to Salary

Another interesting relationship we can look at is the distribution of median salaries per degree. For this, a box plot would be very useful. It will be able to show the median for each, as well as any potential outliers there may be. Since we now know we only have one data point for graduates of a Doctoral Degree, we will only look at Associate's, Bachelor's, and Master's Degrees.

In [165]:
no_doctoral = computer_science_college_data[computer_science_college_data["DEGREE"] != "Doctoral Degree"]

plt.figure(figsize=(10,6)) 
sb.boxplot(x="MEDIAN SALARY", y="DEGREE", data= no_doctoral)
plt.xlabel("Median Salary")
plt.ylabel("Degree Level")
plt.title("Distribution of Median Salary at Different Degree Levels")
plt.show()

This shows us a clear distinction of median salary between different degrees. Graduates of a computer science Associate's degree can expect to make less than graduates of a Bachelor's degree, who can expect to make less than graduates of a Master's degree.

Next we should look at the debt accumulated between different degrees.

In [166]:
plt.figure(figsize=(10,6)) 
sb.boxplot(x="AVG TOTAL DEBT", y="DEGREE", data= no_doctoral)
plt.xlabel("Average Total Debt")
plt.ylabel("Degree Level")
plt.title("Distribution of Average Total Debt at Different Degree Levels")
plt.show()

Interestingly, this shows us that the average total debt is very similar across the different degrees. Let's look at how the salary after debt differs between degrees.

In [167]:
plt.figure(figsize=(10,6)) 
sb.boxplot(x="MEDIAN SALARY AFTER DEBT", y="DEGREE", data= no_doctoral)
plt.xlabel("Median Salary After Debt")
plt.ylabel("Degree Level")
plt.title("Distribution of Median Salary After Debt at Different Degree Levels")
plt.show()

Each box changed slightly, but the overall generalization holds true.

Acceptance Rate in Relation to Salary

A final relationship we can look at is between a school's acceptance rate and computer science graduates' median salary three years after graduation. First, let's make a scatter plot and regression line to visualize this relationship.

In [168]:
fig, ax = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY", hue="DEGREE", data=computer_science_college_data)
sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY", data=computer_science_college_data, scatter=False)
plt.title("Median Salary Compared to Acceptance Rate")
plt.xlabel("Acceptance Rate")
plt.ylabel("Median Salary")
plt.show()

We can see a clear inverse relationship: the lower the acceptance rate, the higher the median salary will typically be. This indicates that both the average tuition and the acceptance rate may factor into the median salary soon after graduation.

Similar to before, let's try to split this by private and public schools.

In [169]:
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))

plt.suptitle("Median Salary vs Acceptance Rate at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
  sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY", hue="DEGREE", data=df, ax=ax)
  sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
  ax.set_title(school_type)
  ax.set_xlabel("Acceptance Rate")
  ax.set_ylabel("Median Salary")

plt.show()

Lastly, let's apply debt to this graph.

In [170]:
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))

plt.suptitle("Median Salary After Debt vs Acceptance Rate at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
  df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
  sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", hue="DEGREE", data=df, ax=ax)
  sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
  ax.set_title(school_type)
  ax.set_xlabel("Acceptance Rate")
  ax.set_ylabel("Median Salary After Debt")

plt.show()

Let's combine these two graphs to compare their regression lines.

In [145]:
school_types = ["PUBLIC", "PRIVATE"]

fig, ax = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", hue="SCHOOL TYPE", data=computer_science_college_data)
# draw one regression line per school type on the same axes
for school_type in school_types:
  df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
  sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False)
plt.title("Median Salary After Debt Compared to Acceptance Rate")
plt.xlabel("Acceptance Rate")
plt.ylabel("Median Salary After Debt")
plt.show()

Wow, look at that! It's very interesting to see how the graphs that split by school type differ from the graph that includes all school types! We can see that private schools have a much greater difference in median salary between a high acceptance rate and a low acceptance rate. The range is almost $100,000, while the range for public schools is only around $25,000. Another really interesting point is that, according to these regression lines, going to a private school only pays off if the acceptance rate is less than 60%. If the acceptance rate is greater than that, graduating from a public school would likely lead to a higher salary.
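The crossover point between two regression lines can be computed rather than read off the graph. The sketch below fits a line to each group with `np.polyfit` and solves for where the two lines meet. The arrays are made up (they are not the tutorial's data, and were chosen so that the lines cross at 0.6 purely for illustration).

```python
import numpy as np

# Made-up acceptance rates and salaries lying exactly on two straight lines
acc_rate = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
public_salary = 90000 - 25000 * acc_rate
private_salary = 135000 - 100000 * acc_rate

# Fit a straight line (slope, intercept) to each group
pub_slope, pub_intercept = np.polyfit(acc_rate, public_salary, deg=1)
priv_slope, priv_intercept = np.polyfit(acc_rate, private_salary, deg=1)

# Setting the two line equations equal and solving for x gives the crossover
crossover = (priv_intercept - pub_intercept) / (pub_slope - priv_slope)
print("lines cross at acceptance rate:", round(crossover, 3))
```

Below the crossover the private line is higher and above it the public line is higher, which is the comparison made in the paragraph above.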

Conclusion

We can determine from the data that there is a correlation between median salary and average tuition, acceptance rate, and degree level. Most of our data was made up of programs offering a Bachelor's degree, so our findings apply mainly to computer science degrees at the Bachelor's level. We found that, in general, a higher tuition is associated with a higher median salary soon after graduation. We also found that the median total debt that students accumulate tends to rise with the school's tuition, and so total debt does little to change the relationship between average tuition and median salary. We also found a strong correlation between median salary and acceptance rate. This makes sense, since schools that are harder to get into tend to have more name recognition, which correlates with higher earnings. While we had less data for Associate's Degrees and Master's Degrees, we still found a direct relationship between the degree level and median salary. What was most interesting, however, was the differences we saw when comparing public and private schools. They suggested that going to a private school is not always the best idea financially, and that one should consider all factors when deciding which program to attend.

Looking back, data specific to each program, rather than to the school as a whole, would have improved the analysis. For example, average tuition was for the entire school, not separated by degree level. Some of the schools in our data offer multiple computer science degrees. These degree levels could have different tuition, but our data did not reflect that. The same is true of the median total debt, as it was for the school as a whole and not separated by major or degree level. If we were able to gather data that is specific to computer science students at these schools, our data would reflect the relationships we found more accurately. This really only affects the relationships we determined for Master's Degree programs and possibly Associate's Degree programs. This tutorial covered just a small part of the relationships that we can find in the vast amount of data that colleges and universities provide. Hope you enjoyed it!