I chose to do an analysis on different computer science programs that colleges offer. Throughout this tutorial, my aim is to discover if there are relationships between how much money is put into the program and how much money students are earning coming out of the program.
Every year, the United States Department of Education releases a College Scorecard. This Scorecard is public data and contains metrics about each school. Some of these include location, admission rate, average cost, average debt, graduation rate, full-time enrollment, and more. They also release a similar dataset localized by Field of Study. This second dataset contains information specific to each field of study, or major, that each college provides. This dataset contains metrics for median earnings, monthly earnings, monthly loan payment, number of graduates, and more. We will strictly be looking at these metrics for computer science degrees.
Our goal for this tutorial is to read and clean this data so that we may be able to perform an analysis and find relationships within the data.
We will be using python alongside the vital imported packages: pandas, matplotlib, seaborn, and numpy.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import numpy as np
The first thing we need to do is gather the data. For my specific topic, I need data that is reliable, so I gathered my information from the US Department of Education's public College Scorecard. I am gathering data on colleges as a whole and specific majors of each college during the 2017-2018 and 2018-2019 academic years. This data set contains almost 3,000 columns, so we will need to process the data in order to extract only what is necessary.
After mounting my drive to tutorial, we use panda's read_csv method to read the data collected from the Department of Education into a panda's dataframe.
college_data_1718_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/MERGED2017_18_PP.csv", low_memory=False)
college_data_1819_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/MERGED2018_19_PP.csv", low_memory=False)
field_of_study_orig = pd.read_csv("/content/drive/MyDrive/Documents (1)/School/CMSC320/Final Tutorial/CollegeScorecard_Raw_Data_09012022/FieldOfStudyData1718_1819_PP.csv")
college_data_1718_orig.head()
college_data_1819_orig.head()
We can see from above the above dataframes that there are 2,989 columns containing metrics and data. This is way too many to be effective! We will need to preprocess the data in order to produce a more manageable dataframe. Also, some of the columns' names don't make sense, so we will rename them to increase readability. I was able to find more information on the columns in the Department of Education's glossary of the College Scorecard.
field_of_study_orig.head()
A similar issue arises with this dataset. We will need to only pull columns that are necessary for analysis and rename the columns so that they make more sense. Explanations of some important columns were also found here.
Once we have out data, we need to prepare it for our analysis later on. One of the main issues we saw from before was a dataframe that is too large. We need to only pull necessary columns from that original dataframe in order to have one that is more manageable. For the college data, we will take the College ID, Name, 4-Year Public Tuition, 4-Year Private Tuition, Acceptance Rate, and Median Total Debt After Graduation. From the field of study data, we will take the College ID, Name, Median Salary 3 Years After Graduation, Degree Description, and Major Description.
college_data_1718 = college_data_1718_orig[["UNITID", "INSTNM", "NPT4_PUB", "NPT4_PRIV", "ADM_RATE", "GRAD_DEBT_MDN_SUPP"]]
college_data_1819 = college_data_1819_orig[["UNITID", "INSTNM", "NPT4_PUB", "NPT4_PRIV", "ADM_RATE", "GRAD_DEBT_MDN_SUPP"]]
field_of_study = field_of_study_orig[["UNITID", "INSTNM", "EARN_NE_MDN_3YR", "CREDDESC", "CIPDESC"]]
Since we are only looking at computer science programs and how they relate to salaries, we want to tidy the field of study data a bit further. We only want rows that contain data for the salary after three years for computer science graduates.
computer_science_data = field_of_study[(field_of_study["EARN_NE_MDN_3YR"] != "PrivacySuppressed") & (field_of_study["CIPDESC"] == "Computer Science.")]
Next, we want to rename the columns so that we can easily read and understand the dataframe and the data it contains.
# rename columns for readability
college_data_1718.columns = ["ID", "NAME", "17-18 AVG PUB TUIT", "17-18 AVG PRIV TUIT", "17-18 ACC RATE", "17-18 TOTAL DEBT"]
college_data_1819.columns = ["ID", "NAME", "18-19 AVG PUB TUIT", "18-19 AVG PRIV TUIT", "18-19 ACC RATE", "18-19 TOTAL DEBT"]
computer_science_data.columns = ["ID", "NAME", "MEDIAN SALARY", "DEGREE", "MAJOR"]
We now have three readable dataframes: one on college data in the 2017-2018 academic year, one on college data in the 2018-2019 academic year, and one on major-specific data from the 2017-2018 and 2018-2019 academic years. We want to merge all these into one dataframe that we can then conduct analysis on. We first start by merging the two dataframes on college data. This will produce one dataframe that contains information of colleges in the 2017-2018 and 2018-2019 academic years. We do this because it will match with the same years in our dataframe specific to fields of study.
college_data = pd.merge(college_data_1718, college_data_1819, how='inner', left_on= ['ID', 'NAME'], right_on= ['ID', 'NAME'])
college_data.head()
The above shows the resulting dataframe. Instead of having tuition separated by private vs public and having tuition, acceptance rate, and total debt separated by year, we will take the average of the two years. This way, we can have a more accurate comparison with the computer science data that spans two years. Some of the data, specifically the Total Debt, needs to be cleaned in order to do this. We replace all values in both Total Debt columns that are "PrivacySuppressed" with NaN. We then convert all values in this column from strings to floats.
for index, row in college_data.iterrows():
if str(row["17-18 AVG PUB TUIT"]) != "nan":
avg_tuit = (row["17-18 AVG PUB TUIT"] + row["18-19 AVG PUB TUIT"]) / 2
college_data.at[index, "AVG PUB TUIT"] = avg_tuit
college_data.at[index, "SCHOOL TYPE"] = "PUBLIC"
else:
college_data.at[index, "AVG PUB TUIT"] = np.NaN
if str(row["17-18 AVG PRIV TUIT"]) != "nan":
avg_tuit = (row["17-18 AVG PRIV TUIT"] + row["18-19 AVG PRIV TUIT"]) / 2
college_data.at[index, "AVG PRIV TUIT"] = avg_tuit
college_data.at[index, "SCHOOL TYPE"] = "PRIVATE"
else:
college_data.at[index, "AVG PRIV TUIT"] = np.NaN
college_data["AVG TUITION"] = college_data[["17-18 AVG PUB TUIT", "18-19 AVG PUB TUIT", "17-18 AVG PRIV TUIT", "18-19 AVG PRIV TUIT"]].mean(axis=1)
college_data["AVG ACC RATE"] = college_data[["17-18 ACC RATE", "18-19 ACC RATE"]].mean(axis=1)
college_data["17-18 TOTAL DEBT"] = college_data["17-18 TOTAL DEBT"].replace("PrivacySuppressed", np.NaN)
college_data["18-19 TOTAL DEBT"] = college_data["18-19 TOTAL DEBT"].replace("PrivacySuppressed", np.NaN)
college_data["17-18 TOTAL DEBT"] = college_data["17-18 TOTAL DEBT"].astype(float)
college_data["18-19 TOTAL DEBT"] = college_data["18-19 TOTAL DEBT"].astype(float)
college_data["AVG TOTAL DEBT"] = college_data[["17-18 TOTAL DEBT", "18-19 TOTAL DEBT"]].mean(axis=1)
college_data = college_data[["ID", "NAME", "AVG TUITION", "AVG PUB TUIT", "AVG PRIV TUIT","AVG ACC RATE", "AVG TOTAL DEBT", "SCHOOL TYPE"]]
college_data.head()
The final step in combining the data is to merge the above dataframe with the computer science data and doing any final cleaning of the data. We need to convert the Median Salary field to floats and remove any rows that do not have data for the average tuition (this only removes one row from the data).
computer_science_college_data = pd.merge(college_data, computer_science_data, how='inner', left_on=['ID', 'NAME'], right_on=['ID', 'NAME'])
computer_science_college_data = computer_science_college_data[computer_science_college_data["AVG TUITION"].notna()]
computer_science_college_data["MEDIAN SALARY"] = computer_science_college_data["MEDIAN SALARY"].astype(float)
computer_science_college_data.head()
Now we have a very manageable and readable dataframe that we can use! Woohoo!
The first relationship we should look at is whether average tuition is related to median salary three years after graduation. We begin this by creating a scatter plot and regression line to visualize the relationship between the two metrics.
fig = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", hue="DEGREE", data=computer_science_college_data)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=computer_science_college_data, scatter=False)
plt.title("Median Salary 3 Years After Graduation Compared to Average Tuition")
plt.xlabel("Average Tuition in US Dollars")
plt.ylabel("Median Salary")
plt.show()
From this scatter plot, we can see that there is a general trend that higher tuition will lead to a higher median salary, but there is still lots of variation in the data. We can further analyze this by splitting the data by their degrees.
degrees = ["Associate's Degree", "Bachelor’s Degree", "Master's Degree", "Doctoral Degree"]
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary vs Average Tuition at Different Degree Levels")
for degree, ax in zip(degrees, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["DEGREE"] == degree]
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, ax=ax)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
ax.set_title(degree)
ax.set_xlabel("Average Tuition")
ax.set_ylabel("Median Salary")
plt.show()
From these graphs we can determine that the type of degree is an important factor when looking at the relationship between tuition and salary.
For starters, computer science graduates with an Associate's degree are less common and make less money soon after graduation. Also, excluding the outlier, the median salary for computer science graduates with an Associates degree is generally the same no matter the school's tuition.
When looking at graduates with a Master's degree, there doesn't seem to be a relationship between tuition and salary soon after graduation. The linear regression line also demonstrates that average tuition does not affect median salary soon after graduation.
However, there is definitely some sort of relationship between tuition and salary soon after graduation for Bachelor's degree graduates. There is a general trend that the higher the average tuition, the higher the median salary soon after graduation.
Lastly, we cannot make any meaningful generalizations from computer science graduates of Doctoral programs. There is not enough data for this group, possibly because we removed rows that were privacy suppressed which could have removed a lot of schools from this group.
To make more sense of these graphs, we should look at the difference between salary and debt. One's salary can be very different depending on the amount of debt they've accumulated. To incorporate this into the graph, we can create a formula to calculate the total debt after 3 years of interest, and subtract this from the median salary. According to Credible, the average student loan interest was around 5%.
for index, row in computer_science_college_data.iterrows():
debt = row["AVG TOTAL DEBT"]
interest_rate = 1.05
years = 3
debt_after_interest = debt * (interest_rate ** years)
computer_science_college_data.at[index, "MEDIAN SALARY AFTER DEBT"] = row["MEDIAN SALARY"] - debt_after_interest
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary After Debt vs Average Tuition at Different Degree Levels")
for degree, ax in zip(degrees, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["DEGREE"] == degree]
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, ax=ax)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
ax.set_title(degree)
ax.set_xlabel("Average Tuition")
ax.set_ylabel("Median Salary After Debt")
plt.show()
After this change, the graphs still hold the same generalizations. However, now they hold a bit more meaning than the graphs above.
Next, let's look at median salaries after splitting by private versus public schools.
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary vs Average Tuition at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, ax=ax)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
ax.set_title(school_type)
ax.set_xlabel("Average Tuition")
ax.set_ylabel("Median Salary")
plt.show()
These graphs are really cool! They show us that in general, graduates from private and public school earn similar salaries, but that there is more spread in how much graduates from private school earn. It also shows us the difference in distribution for average tuition between private and public schools.
Let's again factor in student loan debts.
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary After Debt vs Average Tuition at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
sb.scatterplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, ax=ax)
sb.regplot(x="AVG TUITION", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
ax.set_title(school_type)
ax.set_xlabel("Average Tuition")
ax.set_ylabel("Median Salary After Debt")
plt.show()
This is much more interesting. Both linear regression lines flattened, meaning that after debt is factored in, paying a higher tuition does not necessarily mean a higher salary.
Let's now take a look at how the distributions of median salaries differ between public and private schools before and after debt is applied.
x_vals = ["MEDIAN SALARY", "MEDIAN SALARY AFTER DEBT"]
titles = ["Before Debt is Applied", "After Debt is Applied"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))
plt.suptitle("Distribution of Median Salary Before and After Debt is Applied")
for i, ax in zip(range(2), axs.ravel()):
sb.boxplot(x=x_vals[i], y="SCHOOL TYPE", data=computer_science_college_data, ax=ax)
ax.set_title(titles[i])
ax.set_xlabel("Median Salary")
ax.set_ylabel("School Type")
plt.show()
before_medians = computer_science_college_data.groupby(["SCHOOL TYPE"])["MEDIAN SALARY"].median()
after_medians = computer_science_college_data.groupby(["SCHOOL TYPE"])["MEDIAN SALARY AFTER DEBT"].median()
print("\nMedian Salaries Before Debt Is Applied")
print("PUBLIC: " + str(before_medians["PUBLIC"]))
print("PRIVATE: " + str(before_medians["PRIVATE"]))
print("\nMedian Salaries After Debt Is Applied")
print("PUBLIC: " + str(after_medians["PUBLIC"]))
print("PRIVATE: " + str(after_medians["PRIVATE"]))
Wow! From this we can see that before we take debt into account, graduates from private schools tend to earn a higher salary soon after graduation. However, when we consider debt, the median salary for private and public school graduates are almost the same! Again we can see that the salary spread for graduates of private schools is much greater than that for graduates of public schools.
Another interesting relationship we can look at is the distribution of median salaries per degree. For this, a box plot would be very useful. It will be able to show the median for each, as well as any potential outliers there may be. Since we now know we only have one data point for graduates of a Doctoral Degree, we will only look at Associate's, Bachelor's, and Master's Degrees.
no_doctoral = computer_science_college_data[computer_science_college_data["DEGREE"] != "Doctoral Degree"]
plt.figure(figsize=(10,6))
sb.boxplot(x="MEDIAN SALARY", y="DEGREE", data= no_doctoral)
plt.xlabel("Median Salary")
plt.ylabel("Degree Level")
plt.title("Distribution of Median Salary at Different Degree Levels")
plt.show()
This shows us a clear distinction of median salary between different degrees. Graduates of a computer science Associate's degree can expect to make less than graduates of a Bachelor's degree, who can expect to make less than graduates of a Master's degree.
Next we should look at the debt accumulated between different degrees.
plt.figure(figsize=(10,6))
sb.boxplot(x="AVG TOTAL DEBT", y="DEGREE", data= no_doctoral)
plt.xlabel("Average Total Debt")
plt.ylabel("Degree Level")
plt.title("Distribution of Average Total Debt at Different Degree Levels")
plt.show()
Interestingly, this shows us that the average total debt among the different degrees are very similar. Let's look at how the salary after debt differentiates between degrees.
plt.figure(figsize=(10,6))
sb.boxplot(x="MEDIAN SALARY AFTER DEBT", y="DEGREE", data= no_doctoral)
plt.xlabel("Median Salary After Debt")
plt.ylabel("Degree Level")
plt.title("Distribution of Median Salary at Different Degree Levels")
plt.show()
Each box changed slightly, but the overall generalization holds true.
A final relationship we can look at is between a school's acceptance rate and computer science graduates' median salary three years after graduation. First, let's make a scatter plot and regression line to visualize this relationship.
fig = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY", hue="DEGREE", data=computer_science_college_data)
sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY", data=computer_science_college_data, scatter=False)
plt.title("Median Salary Compared to Acceptance Rate")
plt.xlabel("Acceptance Rate")
plt.ylabel("Median Salary")
plt.show()
We can see a clear inverse relationship, that the lower the acceptance right, the higher the median salary will typically be. This indicates that both the average tuition and acceptance rate may factor into the median salary soon after graduation.
Similar to before, let's try to split this by private and public schools.
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary vs Acceptance Rate at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY", hue="DEGREE", data=df, ax=ax)
sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY", data=df, scatter=False, ax=ax)
ax.set_title(school_type)
ax.set_xlabel("Acceptance Rate")
ax.set_ylabel("Median Salary")
plt.show()
Lastly, let's apply debt to this graph.
school_types = ["PUBLIC", "PRIVATE"]
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(20,12))
plt.suptitle("Median Salary After Debt vs Acceptance Rate at Different Types of Schools")
for school_type, ax in zip(school_types, axs.ravel()):
df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", hue="DEGREE", data=df, ax=ax)
sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False, ax=ax)
ax.set_title(school_type)
ax.set_xlabel("Acceptance Rate")
ax.set_ylabel("Median Salary After Debt")
plt.show()
Let's combine these two graphs to compare their regression lines.
school_types = ["PUBLIC", "PRIVATE"]
fig = plt.subplots(figsize=(10, 6))
sb.scatterplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", hue="SCHOOL TYPE", data=computer_science_college_data)
for school_type in school_types:
df = computer_science_college_data[computer_science_college_data["SCHOOL TYPE"] == school_type]
sb.regplot(x="AVG ACC RATE", y="MEDIAN SALARY AFTER DEBT", data=df, scatter=False)
plt.title("Median Salary Compared to Acceptance Rate")
plt.xlabel("Acceptance Rate")
plt.ylabel("Median Salary")
plt.show()
Wow look at that! It's very interesting to see how the graphs that split by school type differ from the graph that includes all school types! We can see that private schools have a much greater difference in median salary between a high acceptance rate and a low acceptance rate. The range is almost \$100,000, while the range for public schools is only around \$25,000. Another point that's really interesting is that going to a private school only pays off if the acceptance less than 60%. If the acceptance rate is greater than that, graduating from a public school would likely lead to a higher salary.
We can determine from the data that there is a correlation between median salary and average tuition, acceptance rate, and degree level. Most of our data was made up of programs offering a Bachelor's degree, so our findings should be more geared towards computer science degrees at the Bachelor's level. We found that, in general, a higher tuition could contribute to a higher median salary soon after graduation. We also found that the median total debt that students accumulate is also proportional to the school's tuition, and so total debt does not affect the relationship between average tuition and median salary. We also found a strong correlation between median salary and acceptance rate. This makes sense, since schools that are harder to get into tend to have more name recognition and correlate with higher earnings. While we had less data for Associate's Degrees and Master's Degrees, we still found a direct relationship between the degree level and median salary. What was most interesting, however, was the differences we saw when comparing public and private schools. It suggested that going to a private school is not always the best idea financially, and that one should consider all factors when deciding which program to go to.
While looking back at the data, more data specific to each program instead of the school as a whole would impact the data for the better. For example, average tuition was for the entire school, not separated by any degree level. Some of the schools in our data provide multiple degrees for computer science. These degree levels could have different tuition, but our data did not reflect that. This is similar to the total median debt, as it was for the school as a whole and not separated by major or degree level. If we were able to gather data that is specific to computer science students at these schools, our data would reflect the relationships we found more accurately. This really only affects the relationships we determined for Master's Degree programs and possibly Associate's Degree programs. This tutorial was just a small part of relationships that we can find in the vast amount of data that colleges and Universities provide. Hope you enjoyed it!