Open Data on Open Learning

Business and management courses top the list.

By Karina Halevy, Robert McKenzie & Neha Gupta

Online learning has become the default form of education in light of the COVID-19 pandemic. Many higher education institutions are not only converting their curricula to be delivered digitally but also releasing their course materials to the public through several open learning platforms. Unless students are seeking a verified certificate, all of these courses are free. Harvard is one of the institutions that has long been publishing educational content online.

In light of the pandemic, we set out to analyze some of the factors driving enrollment in the online courses that Harvard offers through the edX platform.


At the time of our analysis (April 28, 2020), HarvardX offered 109 courses, with 12 session-based courses starting soon. Because the edX course page turned out to be somewhat unintuitive to navigate, we collected the data manually, recording each course’s Enrollment, Subject, Course Length, Weekly Workload (Effort), Price for Certificate, and Level of Difficulty. Enrollment is the raw number of user accounts registered for the course. Subject is a categorical variable with values including Archaeology, Art and Culture, Biology and Life Sciences, Business and Management, Chemistry, Computer Science, Data Analysis and Statistics, Education, Environmental Studies, Health and Safety, History, Humanities, Literature, Math, Medicine, Music, Physics, Science, and Social Sciences. Course Length is an integer number of weeks. Effort is a range in hours per week, with integer lower and upper bounds. Price is a continuous variable measured in US dollars. Finally, Level of Difficulty is a categorical variable with values of Introductory, Intermediate, or Advanced.

We split the weekly workload variable, which was originally a range, into a lower limit (Effort Minimum), an upper limit (Effort Maximum), and an average (Effort Mean). To minimize the effect of potential outliers, we also removed two courses that had very large enrollments and were each the only course offered in their field: Leaders of Learning (categorized as Communication) and The Architectural Imagination (categorized as Archaeology). Both were offered by Harvard graduate schools and were heavily marketed by edX. We removed two additional courses with very high enrollments: CS50 and Mechanical Ventilation for COVID-19. CS50, one of the first open courses on the platform, has more than double the enrollment of the next most popular course, and the COVID course was a substantial outlier in its length, field, objective, and price.
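A minimal sketch of the workload split, assuming each workload was recorded as a string such as "2-4 hours per week". The function name, the dictionary keys, and the input format are hypothetical and are not taken from the original analysis.

```python
import re


def split_effort(effort: str) -> dict:
    """Split a workload range into its minimum, maximum, and mean (hours per week).

    Assumes the string contains one or two integers, e.g. "2-4 hours per week"
    or "3 hours per week". This input format is an assumption.
    """
    hours = [int(h) for h in re.findall(r"\d+", effort)]
    if not hours:
        raise ValueError(f"no hours found in {effort!r}")
    low, high = min(hours), max(hours)
    return {
        "effort_min": low,
        "effort_max": high,
        "effort_mean": (low + high) / 2,
    }


# Hypothetical example value, not a row from the real data set.
print(split_effort("2-4 hours per week"))
```

A course listing a single number rather than a range ends up with identical minimum, maximum, and mean values under this sketch.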

We ran a few regressions on the data to try to understand some of the enrollment trends. The model we ended up building explained only 22% of the variance in enrollment numbers, but it did identify some predictors as significant, namely Course Length, Field of Study, and Price for Certificate.
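For readers unfamiliar with the "share of variance explained", it is the R² statistic of a regression. The sketch below computes it for an ordinary least squares fit with a single predictor. This is a simplification: the model described above used several predictors, and the function name and the numbers here are made up for illustration.

```python
def r_squared(x, y):
    """Fit y = intercept + slope * x by least squares and return R^2.

    R^2 = 1 - SS_res / SS_tot, the fraction of the variance in y
    that the fitted line accounts for.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: variation left over after the fit.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares: variation around the mean of y.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up toy values (not the course data): weeks vs. enrollment in thousands.
weeks = [1, 2, 3]
enrollment = [1, 3, 2]
print(r_squared(weeks, enrollment))
```

An R² near 0.22 means that most of the variation in enrollment is left unexplained by the predictors in the model.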

Results and Discussion

A Few Explanatory Variables

First, we examined enrollment in relation to course length, which seemed to be one of the more explanatory variables.

From the graph, we see that there is a weak positive correlation between enrollment and length.

This may be because people see longer courses as more valuable and are more likely to at least start the courses by enrolling. It may also be the case that longer courses offer more solid training in marketable and employable skills, thus driving enrollment numbers up.

Next, we examined enrollment by subject, which also seemed to be one of the more influential variables.

The most popular categories are Business and Management, Computer Science, Biology and Life Sciences, Chemistry, and Data Analysis and Statistics.
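One way such a ranking could be produced is sketched below. It assumes popularity is measured by mean enrollment per subject, which the text does not specify; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def mean_enrollment_by_subject(courses):
    """Return (subject, mean enrollment) pairs, highest mean first.

    `courses` is an iterable of (subject, enrollment) pairs.
    """
    by_subject = defaultdict(list)
    for subject, enrollment in courses:
        by_subject[subject].append(enrollment)
    means = {s: sum(v) / len(v) for s, v in by_subject.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up rows for illustration only, not the real enrollment figures.
sample = [("Music", 10), ("Computer Science", 30), ("Music", 20)]
print(mean_enrollment_by_subject(sample))
```

Ranking by total enrollment instead of the mean would favor subjects with many courses, so the choice of metric matters when subjects differ in size.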

This may be explained by the fact that skills in these categories are in high demand (or are at least marketed to job-seekers as high-demand skills). It has been floated that data science is the hottest job of the 21st century, so that may be driving enrollment in data-related courses and courses in closely tied fields.

In contrast, the least popular categories seem to be music, art and culture, and some of the more theoretical natural sciences such as physics. This may also be explained by the fact that skills in these categories are not only less practically marketable but also more likely to be appreciated through formats that are not massive open online courses.

Next, we looked at the price of the courses’ verified certificates.

The graph shows a weak positive correlation between price and enrollment.

One explanation is that people perceive more expensive courses as more valuable and hence enroll in greater numbers.

The causation may also run the other way: more valuable courses (e.g. those that teach more marketable skills) may be priced higher.

We also decided to explore some variables that didn’t show up in linear modelling but might offer some insight into the shape of the data.

We looked at enrollment by course level and found that intermediate-level courses seem to have the highest enrollment. While there wasn’t enough Intermediate or Advanced data to confirm that this relationship was significant, the “hottest” skills and courses were usually classified as intermediate.

Finally, we looked at the maximum effort listed on each course description.

Between zero and nine hours per week, enrollment appears to rise roughly exponentially with maximum effort. This might be because people see the courses that involve more effort as more valuable to their careers. However, beyond ten hours, there is a sharp drop in enrollment. While this could simply be due to a few outliers, another explanation might be that online learners do not have that much time each week to dedicate to their courses.


There is still much to be explored about what makes people enroll in online courses, from factors we have not accounted for in this analysis, such as enrollment trends over time, to breakdowns of how many enrollees actually follow through with the courses and how many of those earn verified certificates. However, this preliminary analysis does show that popular online courses don’t typically focus on traditionally academic subjects such as math, science, or history.

Popular courses help thousands of learners get job-ready skills. With unemployment rates skyrocketing during the pandemic, these professionally oriented online courses can be even more important in helping learners get hired.

Harvard Open Data Project