
The Bus Station That Didn’t Exist, and Other Data Epiphanies

“Data is multidisciplinary” is my mantra—it’s 2025, and I’ve now worked 20 years in every possible flavour of data—data visualization, open data advocacy, data pipelines in healthcare, data-driven national-scale services, AI innovation, and more. Whatever the application or project, my take on data literacy is the fundamental ability to challenge your own assumptions about the data you have (or don’t), the appropriateness of using it, and the ethics of your application, and to ask yourself: is there a different way, perhaps? Here is a gallery of some of my most treasured eureka moments working with data.

You have a clear purpose but the data isn’t quite right for it

I regularly walk through Turnpike Lane Bus Station; there’s a pretty big sign pointing to it. It’s a major node for North London public transport and yet, a few years back, I found out that it did not exist… in the data, at least. I used to run the official dataset of bus stops for the UK Government—a rather obscure dataset that made its way into powering a few popular journey planners like Google Maps and Citymapper.

This was 2020, during COVID, and one of my colleagues wanted a list of all bus stations in the country in order to send out posters advertising social distancing. While the dataset contained over 500,000 points, it did not contain this bus station. The problem was data definitions: the dataset listed bus stops, which are not the same thing as bus stations. While the words “bus station” have a common-sense meaning in our minds as a collection of bus stops, that meaning was not translated into the dataset. The individual bus stops making up the bus station were all in the dataset; there was just no way to group them together, other than trying to infer that they belong to the same bus station from their proximity.
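To give a flavour of what that inference looks like, here is a minimal sketch of grouping stops into candidate “stations” by proximity. The column names, coordinates, and the 100-metre threshold are my own illustrative assumptions, not the schema or methodology of the actual dataset.

```python
# A minimal sketch: cluster bus stops that sit within ~100 m of each other and
# treat each cluster as a candidate "bus station". All names and numbers here
# are illustrative assumptions, not the real dataset's schema.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

stops = pd.DataFrame({
    "stop_name": ["Turnpike Lane Stop A", "Turnpike Lane Stop B", "A stop elsewhere"],
    "latitude": [51.5904, 51.5906, 51.6050],
    "longitude": [-0.1030, -0.1028, -0.1200],
})

# DBSCAN with the haversine metric expects coordinates in radians;
# eps is the neighbourhood radius expressed as metres / Earth radius.
coords = np.radians(stops[["latitude", "longitude"]].to_numpy())
labels = DBSCAN(
    eps=100 / EARTH_RADIUS_M, min_samples=2, metric="haversine", algorithm="ball_tree"
).fit_predict(coords)

stops["candidate_station"] = labels  # -1 means the stop was not grouped with any other
print(stops)
```

Proximity is, of course, only a heuristic: two clusters of stops on opposite sides of a busy junction may or may not be the same “station”, which is exactly why the grouping should have been captured at source.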

I found other interesting issues in the dataset. Some were easy to spot, like bus stops sitting in the middle of the North Sea. Others were a few metres away from their real location, which would not have a huge impact unless we were trying to use the dataset to get self-driving buses to park automatically. So, why weren’t these groupings captured in the first place? The process that created and populated the data never prompted the question “are we capturing everything that we need about this bus stop?”. As a result, the dataset wasn’t quite fit for the purpose we were looking to deliver. Translating common-sense concepts into data definitions is a major element of making sure that a dataset is usable and stays current, and having a process that allows that question to emerge is an ingredient of good data management. At the time, to my surprise, we had neither.
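Some of those easy-to-spot issues can be caught with very simple checks. As an illustration only, here is a rough sketch that flags coordinates falling outside an approximate bounding box for Great Britain; the box and column names are assumptions of mine, not part of any official validation process.

```python
# A rough sanity check for obviously misplaced stops: flag any point that
# falls outside an approximate bounding box for the UK. The box below is an
# illustrative approximation, not an official boundary.
import pandas as pd

UK_BBOX = {"min_lat": 49.8, "max_lat": 60.9, "min_lon": -8.7, "max_lon": 1.8}

def flag_out_of_bounds(stops: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose coordinates fall outside the bounding box."""
    outside = (
        (stops["latitude"] < UK_BBOX["min_lat"])
        | (stops["latitude"] > UK_BBOX["max_lat"])
        | (stops["longitude"] < UK_BBOX["min_lon"])
        | (stops["longitude"] > UK_BBOX["max_lon"])
    )
    return stops[outside]

stops = pd.DataFrame({
    "stop_name": ["Plausible stop", "Stop in the North Sea"],
    "latitude": [51.59, 56.50],
    "longitude": [-0.10, 3.20],
})
print(flag_out_of_bounds(stops))
```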

Disappointingly, data that appears suitable for your purpose is not always so; and if you are in the fortunate position of owning a dataset, always ask: are there any use cases that would be out of scope for this dataset, and is it worth expanding?

Image credit: Giuseppe Sollazzo

Sometimes the data is really incomplete or missing

W.E.B. Du Bois is widely remembered for his infographics about the conditions of African Americans at the end of the nineteenth century. What I always hail him for is having shown that a lack of data should not stop a good data project, and that sometimes the hard work is putting the data together. When he realised the US Census lacked data about African Americans, he assembled his own survey and team, collecting data that resulted in his now famous infographics. Incomplete or missing data is something that I’ve regularly had to cope with, deciding whether to pursue the initial project or pivot to something different. Once again, during the pandemic, we were trying to see if there was a way to check the density of people on pavements, and went down a rabbit hole trying to find accurate measurements of pavements for the whole of the UK—an impossible task. This is when I realised that using a proxy would have given informative enough results, as The Economist did in the chart below, created by collecting, over time, Google Places “busy times” for major points of interest in major cities. Simple, effective, but not based on anything close to “complete” data.

Source: The Economist

Sometimes missing data should make us reflect. In one of my projects while working for public healthcare in the UK, a team of dermatologists came asking if my team could develop an AI algorithm to grade a type of skin condition. The intent was very positive: in their clinical research, they had realised human medics were biased, resulting in less accurate grading for people who are not white, and they were looking for AI to help correct that bias. We found that the collections of images we could find for this condition were themselves biased, so any AI model trained on them would not have addressed the issue. The image below captures what dermatologists call the Fitzpatrick scale—the standard classification of skin tones.

Source: Wikipedia

We realised we had as many images as we wanted for Fitzpatrick type I, and increasingly fewer as we went towards type VI. Developing a model would have been unethical, and we and the team of dermatologists agreed to pivot and start collecting an unbiased dataset. This experience taught me to reframe success: we successfully detected bias and used the tools at our disposal to try and correct it. Unbiased data means more ethical and accurate models. During this project I also learned a lot about the concept of intersectionality, thanks to data journalist and humanizer Donata Columbro, who suggests in her book “Dentro l’algoritmo” (2022) that, in order to understand and correct bias in data, you can apply intersectional thinking as a framework for analysing the injustice embedded in the administration of power.
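A check like the one we ran can be as simple as counting labelled images per skin type and looking at the shares. The numbers below are invented purely to illustrate the shape of the problem; the real collections we reviewed showed the same pattern of fewer and fewer images at higher Fitzpatrick types.

```python
# A minimal sketch of auditing an image collection for skin-tone balance.
# The labels and counts are invented for illustration only.
from collections import Counter

# Hypothetical dataset: one Fitzpatrick type (I-VI) recorded per image.
image_labels = ["I"] * 900 + ["II"] * 620 + ["III"] * 340 + ["IV"] * 150 + ["V"] * 60 + ["VI"] * 15

counts = Counter(image_labels)
total = sum(counts.values())
for fitzpatrick_type in ["I", "II", "III", "IV", "V", "VI"]:
    n = counts.get(fitzpatrick_type, 0)
    print(f"Type {fitzpatrick_type}: {n:4d} images ({n / total:.1%})")
```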

Just use the official data and you’ll be fine…

…or so they said. One possible critique of Du Bois’ work is that it wasn’t based—for good reasons, as we’ve seen—on the official data. We also use the word authoritative for data that comes from official administrations, suggesting it has superior authority; yet that’s not always the case. A personal example of this is the chart below, from the Financial Times, representing how UK parliamentarians increasingly mention the word “NHS” (National Health Service) in their speeches. It’s personally interesting for me because, if you look at the source, it references “Parli-N-Grams,” a website I built that allows that analysis. I always make the joke that I called my mum to say “mum, I’ve become a source.” What’s going on here, and what’s the lesson?

Well, while I’m obviously flattered that the Financial Times used my website for this analysis, it is remarkable that I’m not the official source of the data. In fact, I’m not even using the original source myself: I used the easily usable extracts released by TheyWorkForYou, a project of civic tech charity mySociety. This is because the original data is really hard to use, which discourages data journalists from using it. This is clearly a minor problem in an article like this, but it makes you think about the potential for news manipulation. In my job in the public sector, I always think about how I can make the data as usable as possible.

Beyond authoritative data not being ideal, there are cases where it’s simply wrong or misleading. The most famous case is that of borders and their representation in data and maps. There are plenty of border disputes around the world, including in unexpected places. For example, France and Italy are broadly at peace but take different views on where the border around Mont Blanc falls, to the point that the official map of Switzerland is more accurate than those of the two countries. And Switzerland itself has a three-way border dispute with Germany and Austria which has reached several courts. Representing all these fuzzy borders in data and maps can be difficult, and it can lead to litigation. Mikel Maron, once head of the OpenStreetMap Data Working Group, tried to address the issue of maps around Jerusalem, where there are famously different views about borders. His message “Jerusalem is an edge case of everything” sadly resonates, but I’d take the slightly less optimistic view that it is not just Jerusalem that has edge cases—our reality is full of them, and representing it in data is difficult and imperfect.

Communicating data should clarify reality

We use a lot of data to back our statements with evidence. But is data easy to understand for people who are not trained? And as data journalists and advocates, how can we make it easy to understand? Often this is, again, a problem of clarifying definitions. I often challenge my audiences to raise their hand if they know how to calculate the average; most people will raise their hand. I ask some people how, and they will tell me the well-known formula of the arithmetic average—sum all items and divide by the number of items. And here’s the catch: in many official statistics, we don’t actually use the arithmetic mean. 

In the UK, the Office for National Statistics uses the geometric mean to publish average house prices. Most people don’t remember its formula or its meaning; some might even be unaware of its existence. Other times, the definitions we use are disconnected from people’s reality and, while they make a lot of sense from a statistical point of view, they might not chime with people’s common-sense understanding of a problem. For example, the official definition of “employed” in the UK is “a person aged 16 years or over who did one hour or more of paid work in the previous week”. Having worked a single hour might not meet the definition of “employed” that someone without statistical training expects; this can trigger confusion. The lesson I’ve learned through using these examples is that communicating data requires us to clarify definitions, explain why they were chosen and why they were better than the alternatives, and clarify the meaning of the data we are publishing.
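To make the house-price example concrete, here is a tiny comparison of the two averages on a handful of made-up prices; the figures are invented for illustration, and this is not the ONS’s actual methodology, only a reminder that the choice of mean changes the headline number.

```python
# Arithmetic vs geometric mean on a small, invented set of house prices.
# The geometric mean multiplies the values and takes the n-th root, so it is
# far less sensitive to a single very expensive property.
from statistics import mean, geometric_mean

prices = [180_000, 220_000, 250_000, 300_000, 2_500_000]  # one outlier mansion

print(f"Arithmetic mean: £{mean(prices):,.0f}")            # ≈ £690,000
print(f"Geometric mean:  £{geometric_mean(prices):,.0f}")  # ≈ £375,000
```

Which of the two better represents a “typical” house price is exactly the kind of definitional choice an audience deserves to have explained.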

This also means, sometimes, recognising that the data is not neutral or impartial. See the two maps below. Would you imagine that they are based on the same data?

Source: Giuseppe Sollazzo

They are both based on data from the Italian General Election of 2018, a particularly contentious election with three coalitions in which, due to the intricacies of the electoral system, all outcomes were possible according to the polls.

The map on the left is one I made. It’s a dot-map, so it shows a dot for every several thousand votes—blue for the right-wing coalition, red for the left-wing coalition, yellow for the non-aligned “5 star” movement. The story I told with this map was that Italy was an electorally confused country; it does indeed look pretty confused on the map.

On the right you see the map that the Financial Times released, a much more beautiful map than mine! You see the stark difference—the FT team coloured the map based on the winner in each constituency, getting a country that is pretty much split into three parts: centre-right in the north, 5 star in the south, and the centre-left only featuring in its strongholds around Romagna and Tuscany and, less expectedly, in South Tyrol. This map illustrated a story about Italy being a divided country.

So which is it: was Italy divided or confused? The lesson here is that data-driven doesn’t mean neutral or impartial. There’s an agenda in every data use, sometimes an unwitting one. In this case, the “agenda” is the fact that the data is initially collected for one purpose—administering an election—which gives it an initial “spin”, and then we make a choice about which story to tell from the data, which gives it another.

Enter production: data-driven services

Creating live systems, where data or models contribute to delivering a service, has been one of my areas of work. Here, the basic problems are the same (using the right definitions, making sure we remove bias, and so on), but there are some lessons specific to the live nature of the data. For example, you might have heard of the concept of model decay: if you create a predictive model in a domain where new data is generated all the time, the model will slowly lose its predictive power as the world it describes changes. Correcting this requires strong collaboration between experts in the problem domain, data scientists, and data engineers. I saw this in an interesting project where we used neural networks to predict, at the moment of presentation, how long a patient was likely to stay in hospital. While the model was pretty accurate at the point of creation, having been trained on over one million patient admissions and capturing over 300 data points per admission, it would have quickly lost its predictive power had we not allowed for re-training in the live pipeline.
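What “allowing for re-training” looks like, conceptually, is a scheduled check of the live model against recently observed outcomes. The sketch below is illustrative only: the metric, threshold, and function names are assumptions of mine, not the actual hospital pipeline.

```python
# A minimal sketch of guarding against model decay in a live pipeline:
# score the deployed model on recent labelled admissions and flag when its
# error drifts past an acceptable threshold. All names and numbers are
# illustrative, not the real system.
from sklearn.metrics import mean_absolute_error

ACCEPTABLE_ERROR_DAYS = 1.5  # illustrative threshold on mean absolute error

def needs_retraining(model, recent_features, recent_lengths_of_stay) -> bool:
    """Compare predictions on recent admissions against observed lengths of stay."""
    predictions = model.predict(recent_features)
    error = mean_absolute_error(recent_lengths_of_stay, predictions)
    return error > ACCEPTABLE_ERROR_DAYS

# In a live pipeline this would run on a schedule (e.g. nightly):
# if needs_retraining(model, recent_features, recent_lengths_of_stay):
#     model = retrain_on_latest_data()  # placeholder for the training job
```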

This project was also remarkable for another lesson that became apparent: asking “so what?” of every data-driven project. The problem here was that, despite the model being very accurate, we had no idea how to use it in practice. What would doctors and nurses do differently thanks to the prediction? We couldn’t work it out. And therein lies the lesson: any data-driven model needs to be built together with those who will use it. Capturing user needs with adequate user research, and bringing subject matter experts together with data scientists, is the only recipe for success.

This brings me back to my initial point: data is multidisciplinary. The last example would have been much more successful if we had, from the outset, worked not just with data experts but also with those running the hospital operations—nurses and doctors—who would have guided our development of the model more effectively. 

I hope that sharing these lessons will help. Working in data, whether it is data visualization, data analysis, or creating data-driven systems, is fascinating, but it requires a lot of knowledge drawn from different areas: technology, maths and statistics, ethics, law. The best data practitioners use all of these to challenge their assumptions and become better at how they use data.

Giuseppe Sollazzo

Giuseppe Sollazzo was a judge in the 2025 IIB Awards. He sends the popular dataviz newsletter Quantum of Sollazzo. For almost a decade, Giuseppe has been leading data innovation initiatives for public sector organisations in the UK, and has been speaking extensively about “good” uses of data and AI.