Online learning has become the default form of education in light of the COVID-19 pandemic. Many higher education institutions are not only converting their curricula to be delivered digitally but also releasing their course materials to the public through several open learning platforms. Unless students are seeking a verified certificate, all of these courses are free. Harvard is one of the institutions that has long been publishing educational content online.
At the time of our analysis (4/28/20), HarvardX offered 109 courses, with 12 session-based courses starting soon. As the edX course page turned out to be somewhat unintuitive to navigate, we collected the data by hand and recorded each course’s Enrollment, Subject, Course Length, Weekly Workload (Effort), Price for Certificate, and Level of Difficulty. Enrollment is a raw count of the user accounts registered for the course. Subject is a categorical variable with values including Archaeology, Art and Culture, Biology and Life Sciences, Business and Management, Chemistry, Computer Science, Data Analysis and Statistics, Education, Environmental Studies, Health and Safety, History, Humanities, Literature, Math, Medicine, Music, Physics, Science, and Social Sciences. Course Length is a discrete integer-valued variable in units of weeks. Effort is a range in hours per week, with integer lower and upper bounds. Price is a continuous variable measured in US dollars. Finally, Level of Difficulty is a categorical variable with values of Introductory, Intermediate, or Advanced.
We split the weekly workload variable, which was originally a range, into a lower limit (Effort Minimum), an upper limit (Effort Maximum), and an average (Effort Mean). To minimize the effect of potential outliers in the data, we also removed two courses that had very large enrollments and were each the only course offered in its field: Leaders of Learning (categorized as Communication) and The Architectural Imagination (categorized as Archaeology). Both were offered by Harvard Graduate Schools and were heavily marketed by edX. We also removed two additional courses with very high enrollments: CS50 and Mechanical Ventilation for COVID-19. CS50, one of the first open courses on the platform, has more than double the enrollment of the next most popular course, and the COVID course was a substantial outlier in its length, field, objective, and price.
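The split of the workload range described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name `split_effort`, the dictionary keys, and the assumed input format (a string such as "2-4 hours per week") are all hypothetical.

```python
import re


def split_effort(effort: str) -> dict:
    """Split a workload string into minimum, maximum, and mean hours per week.

    Assumes (hypothetically) that the listing is a string containing one or
    two integers, e.g. "2-4 hours per week" or "3 hours per week".
    """
    hours = [int(h) for h in re.findall(r"\d+", effort)]
    if not hours:
        raise ValueError(f"no hours found in {effort!r}")
    low, high = min(hours), max(hours)
    return {
        "effort_min": low,
        "effort_max": high,
        "effort_mean": (low + high) / 2,  # midpoint of the listed range
    }


print(split_effort("2-4 hours per week"))
```

A listing with a single value yields identical minimum, maximum, and mean, which keeps such courses comparable with the ranged ones.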
We ran a few regressions on the data to try to understand some of the enrollment trends. The model we ended up building explained only 22% of the variance in enrollment, but it did identify some predictors as significant, namely Course Length, Subject, and Price for Certificate.
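To make the "share of variance" figure concrete, here is a minimal sketch of how R² is computed for a least-squares fit. It uses a single predictor for simplicity, whereas the model above used several, and the function name `r_squared` and the toy numbers are placeholders rather than values from our dataset.

```python
def r_squared(x, y):
    """Return R^2 for a simple least-squares line fitted to y as a function of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Placeholder values, not real course data.
print(r_squared([1, 2, 3], [1, 3, 2]))  # 0.25
```

An R² of 0.22 therefore means that the fitted values account for roughly a fifth of the spread in enrollment around its mean, leaving most of the variation unexplained.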
First, we examined enrollment in relation to course length, which seemed to be one of the more explanatory variables.
From the graph, we see that there is a weak positive correlation between enrollment and length.
Next, we examined enrollment by subject, which also seemed to be one of the more influential variables.
The most popular categories are Business and Management, Computer Science, Biology and Life Sciences, Chemistry, and Data Analysis and Statistics.
In contrast, the least popular categories seem to be Music, Art and Culture, and some of the more theoretical natural sciences such as Physics. This may be because skills in these categories are less practically marketable and are more likely to be appreciated through formats other than massive open online courses.
Next, we looked at the price of the courses’ verified certificates.
The graph shows a weak positive correlation between price and enrollment.
The causation may run from popularity to price rather than the other way around: more valuable courses (e.g. those that teach more marketable skills) attract more learners and may also be priced higher.
We also decided to explore some variables that didn’t show up in linear modelling but might offer some insight into the shape of the data.
We looked at enrollment by course level and found that enrollment appears to be highest in Intermediate courses. While there weren’t enough Intermediate or Advanced courses to confirm that this relationship was significant, the “hottest” skills and courses were usually classified as Intermediate.
Finally, we looked at the maximum effort listed on each course description.
Between zero and nine hours per week, enrollment appears to rise roughly exponentially with maximum effort. This might be because people see the courses that involve more effort as more valuable to their careers. However, beyond ten hours, there is a sharp drop in enrollment. While this could simply be the result of a few outliers, another explanation might be that online learners do not have that much time each week to dedicate to their courses.
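A comparison like this one amounts to averaging enrollment within each value of maximum effort. The sketch below shows one way to do that; the function name `mean_enrollment_by`, the record keys, and the numbers are hypothetical placeholders, not our data.

```python
from collections import defaultdict


def mean_enrollment_by(courses, key):
    """Average the 'enrollment' field of each course, grouped by courses[key]."""
    totals = defaultdict(lambda: [0, 0])  # group value -> [enrollment sum, course count]
    for course in courses:
        group = totals[course[key]]
        group[0] += course["enrollment"]
        group[1] += 1
    return {value: total / count for value, (total, count) in sorted(totals.items())}


# Placeholder records, not real course data.
courses = [
    {"effort_max": 4, "enrollment": 100},
    {"effort_max": 4, "enrollment": 300},
    {"effort_max": 10, "enrollment": 50},
]
print(mean_enrollment_by(courses, "effort_max"))  # {4: 200.0, 10: 50.0}
```

Because some effort levels contain only a handful of courses, a group mean can be dominated by one or two listings, which is why the drop beyond ten hours should be read cautiously.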
There is still much to explore about what makes people enroll in online courses, from factors we have not accounted for in this analysis, such as enrollment trends over time, to breakdowns of how many enrollees actually follow through with their courses and how many of those earn verified certificates. However, this preliminary analysis does suggest that popular online courses don’t typically focus on traditionally academic subjects such as math, science, or history.