“Proved impossible to answer the question”: a day at the SciencesPo #Datathon

ScPoDigital, Jan 24

India Kerle, Francesco Lanzone, Ricardo Zapata Lopera, Stephanie Tran, Gloriana Lang and Maximilian GAHNTZ faced a five-day datathon as part of the New Technologies and Public Policy course at Sciences Po.

We were tasked with solving a public policy issue using open data and our newly acquired R skills from days 1 and 2.

We chose to delve into the question, ‘do French students cost too much?’ We set out in a linear fashion: by scoping relevant datasets, cleaning and shaping the data, producing visualisations and developing a coherent solution.

The team started out strong, ready for the often hairy landscape of open data:

Stephanie, our resident already-R expert and engineer

We hit troubled waters soon after, owing to the difficulty of finding appropriate data and our collective, incipient R skills (bar Stephanie):

Day 1’s end: A rather demoralised crew of MPP students

We carried on nonetheless, exploring tertiary education spending data across French regions and across OECD countries.

We wanted to see if France spent considerably more per tertiary level student compared to other OECD countries and which French regions spent most per student.

The data we ultimately used for our analysis came from the OECD and French Ministry of Education, Research and Innovation.

The Ministry of Education, Research and Innovation provided us with two important datasets: one on tertiary spending at the French regional level and a second rich dataset with geo attributes on main institutions of French higher education.

Meanwhile, the OECD dataset provided us with tertiary level education spending across all OECD nations.

We managed to apply our recently acquired R skills and create the following figures.

For example, according to OECD data, as can be seen in graph #1, France's expenditure on education as a percentage of GDP is in line with the OECD average.

Graph №1. Source: computed by Sciences Po students.

com/transteph/DoFrenchStudentsCostTooMuch

Meanwhile, we generated Graph #2 and Maps #1 and #2 with data from the French Ministry of Education, Research and Innovation.

Pays de la Loire spends considerably more per tertiary student, approximately 1.77 times more than the regional average or 4.16 times more than the region spending the least.
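The two ratios quoted above are simple to reproduce once per-student spending is known for each region. The sketch below is written in Python for illustration (the team itself worked in R). The region names and euro amounts are invented placeholders, not figures from the Ministry dataset, and the "regional average" is assumed to be an unweighted mean across regions.

```python
from statistics import mean

# Hypothetical per-student spending by region, in euros (illustrative values,
# not the figures from the Ministry dataset).
spending_per_student = {
    "Region A": 900.0,
    "Region B": 300.0,
    "Region C": 600.0,
}


def spending_ratios(per_student, region):
    """Return (ratio to the unweighted regional average, ratio to the lowest spender)."""
    average = mean(per_student.values())
    lowest = min(per_student.values())
    value = per_student[region]
    return value / average, value / lowest


to_average, to_lowest = spending_ratios(spending_per_student, "Region A")
print(f"{to_average:.2f} times the regional average")
print(f"{to_lowest:.2f} times the lowest-spending region")
```

A student-weighted average would give a different baseline, so the choice of mean should be stated alongside any such ratio.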

Surprisingly, Ile-de-France, the region where Paris is located and otherwise a hotspot for higher education institutions, spends less than the regional average.

We know that many factors influence these numbers, such as the type of institutions in each region and the composition of the student body. Unfortunately, the data needed to make these distinctions was not easily available.

Graph №2. Source: computed by Sciences Po students.

com/transteph/DoFrenchStudentsCostTooMuch

Similarly to graph #2, map #1 showcases the data as an alternative visual.

It is easier to see which French regions spend more or less on tertiary-level students.

Map №1. Source: computed by Sciences Po students.

com/transteph/DoFrenchStudentsCostTooMuch

Map #2 shows the number of tertiary-level students per region.

The Ile-de-France region has far and away more students than any other.

Map №2. Source: computed by Sciences Po students.

com/transteph/DoFrenchStudentsCostTooMuch

What happened

We were aiming for a comprehensive dataset on the allocation of funds among French universities, but we did not find any such database.

What we found instead were reports from various sources, with information about the regions, or collectivités territoriales.

The collectivités territoriales account for 20% of what is spent on tertiary education, and another 54.6% comes from the French State. The rest is covered by students and private-sector contributors, for which we did not find detailed data.

Exploring the available datasets on enrolled students and the different sources of funding, we realised that they were often incomplete and misleadingly categorised.

After navigating and analysing them, we realised they did not contain what we needed to answer our question.

We cross-referenced from multiple sources and multiple databases, some of them containing conflicting information.

We returned to the data and attempted to retrieve new insights from what we had.

We had to make strong assumptions, including an estimation of the missing components of the dataset, such as the number of students where it was not reported or incomplete.

In some regions the percentage of universities with no data about enrolled students was close to 40% (link).

In others, the data was more complete and we were able to estimate the costs per student at the regional level.
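A check of this kind can be sketched as follows, in Python for illustration. It does not reproduce the team's own estimation of missing values, which the post does not detail. Instead it measures the share of institutions without enrolment figures in each region and computes a cost per student only where that share is small. The records, field layout and the 25% cut-off are all assumptions made for the example.

```python
# Hypothetical institution records:
# (region, total spending in euros, enrolled students or None if not reported).
institutions = [
    ("Region A", 1_000_000, 500),
    ("Region A", 2_000_000, None),  # enrolment not reported
    ("Region A", 3_000_000, 1_500),
    ("Region B", 4_000_000, 2_000),
    ("Region B", 1_000_000, 500),
]

MAX_MISSING_SHARE = 0.25  # assumed cut-off, not a value from the article


def summarise(records, max_missing_share=MAX_MISSING_SHARE):
    """Per region: share of institutions lacking enrolment data, plus a
    cost-per-student estimate when that share is small enough."""
    by_region = {}
    for region, spending, students in records:
        by_region.setdefault(region, []).append((spending, students))

    summary = {}
    for region, rows in by_region.items():
        reported = [(s, n) for s, n in rows if n is not None]
        missing_share = 1 - len(reported) / len(rows)
        total_students = sum(n for _, n in reported)
        estimate = None
        if total_students > 0 and missing_share <= max_missing_share:
            # Use only institutions that report both spending and enrolment.
            estimate = sum(s for s, _ in reported) / total_students
        summary[region] = {
            "missing_share": missing_share,
            "cost_per_student": estimate,
        }
    return summary


result = summarise(institutions)
for region, stats in result.items():
    print(region, stats)
```

Reporting the missing share next to each estimate makes it clear which regional figures rest on near-complete data and which do not.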

However, this process was hampered by the lack of a cohesive dataset covering regional, national and other sources of funding across universities.

We were unable to reproduce the figure indicated in the official OECD data.

At this point, we could not resolve the numerous issues we faced in this short amount of time.

We therefore thought of presenting the limitations of this research as an integral part of its findings.

Namely, open data on tertiary education can be improved in structure, completeness, clarity and cohesion.

The data was compartmentalised in different databases and needed extensive work to be properly integrated.

For example, we encountered situations like the following, where data was only available through a terrible user interface.

Source: https://data.



fr/FR/T445/P844/tableau_de_bord_financier_-_finance#TDB

Policy Proposals

The problems we experienced prompted two sets of policy proposals: the first concerns the specific data on education spending, and the second the general open data standards and guidelines.

Tidy Education spending data

Although education spending data is available, it does not fulfil open data standards.

Data users should be able to find one source with the consolidated data, not three (or even more) as is currently the case.

There are four critical pieces of information that should be together in order to make a complete and relevant database for the question we were dealing with:
- financial numbers (research, overhead and total spending),
- geographical granularity (city, department and regional levels, including geographic coordinates),
- student statistics (total, per gender numbers),
- institutional information.
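One way to picture such a consolidated dataset is as a single record type with one row per institution and year. The sketch below, in Python, is only an illustration of the four groups of fields listed above: every field name, the female/male split and the example values are hypothetical and do not come from any published schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstitutionYear:
    """One row per institution and year; all field names are hypothetical."""

    # Institutional information
    institution_name: str
    institution_type: str
    year: int
    # Geographical granularity
    city: str
    department: str
    region: str
    latitude: float
    longitude: float
    # Student statistics (None when not reported)
    students_total: Optional[int]
    students_female: Optional[int]
    students_male: Optional[int]
    # Financial numbers, in euros (None when not reported)
    spending_research: Optional[float]
    spending_overhead: Optional[float]
    spending_total: Optional[float]

    def cost_per_student(self) -> Optional[float]:
        """Total spending divided by total enrolment, or None if either is missing."""
        if not self.students_total or self.spending_total is None:
            return None
        return self.spending_total / self.students_total


# Placeholder values for illustration only.
example = InstitutionYear(
    institution_name="Example University",
    institution_type="university",
    year=2017,
    city="Example City",
    department="Example Department",
    region="Example Region",
    latitude=0.0,
    longitude=0.0,
    students_total=10_000,
    students_female=5_500,
    students_male=4_500,
    spending_research=30_000_000.0,
    spending_overhead=20_000_000.0,
    spending_total=100_000_000.0,
)
print(example.cost_per_student())
```

With all four groups in one row, a question like cost per student by region becomes a single aggregation rather than a join across several sources.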

Regarding the published datasets, we would recommend that the Ministry of Education:
- improve the “Principaux établissements d’enseignement supérieur” dataset with additional financial variables (columns), ideally with per-year information; and
- open the raw data that feeds the ‘Tableau de Bord Financier’.

It is a rich set of data that should be made available in an open format.

Although it only allows browsing public universities’ data, opening the database would allow data users and researchers to take a closer look at specific spending information, making it a valuable transparency tool.

Open Data Standards and Guidelines

Although implementing the previous recommendation would improve education data (including expenditures), structural problems remain with the quality of the published datasets.

This happens because the process of publishing datasets in data.


fr does not consistently follow open data quality standards, and the different institutions apply their own criteria.

As with the ‘Tableau de Bord Financier’, some data is public but not open.

Additionally, the open data platform requires advanced skills to harness its possibilities, segmenting the public that can use it.

Thus, we propose establishing a governance system for French open data portals that creates incentives to improve the overall quality of the published data.

It includes three components:

1. Data curation by a high-level institution
a. A high-level institution like Etalab could be in charge of curating datasets, guaranteeing that low-quality datasets are not published until they are improved. This would ease the search process.

2. Feedback system: a user rating system implemented in two phases
a. User rating for administrator feedback: if a dataset is poorly rated, administrators would be obliged to review it and fix the problems and/or take it down.
b. Public user ratings: once a user community has matured, ratings could be made visible to every user to provide information on the quality of the data they will find.

3. Reputation-based score: to benchmark ministries and public agencies, and taking as reference the “Excellence Certificate” of the Colombian Ministry of Communications and Technology, we suggest an open data score based on:
a. Relevance

Concluding thoughts

Our few days with R and open data proved rather tricky.

The exercise taught us that the open data landscape has a long way to go in terms of accuracy and usability.

However, equipped with our R skills, we feel better prepared to tackle forthcoming public policy issues as we enter our second semester.
