Specialist article

Data Visualization

Visualise your Data Whenever Possible

What exactly is data visualisation? Every day, companies collect new data on a wide variety of processes. With so much data, it can be difficult to grasp what the data are actually saying. This is where data visualisation comes in. Modern data visualisations translate complex information into a visual form that makes relationships easier for the human brain to grasp and helps to draw important insights from the data. They show the “big picture” and thereby compensate for a weakness of statistical models and machine-learning methods, which reduce complex relationships to a handful of summary statistics. Not using data visualisation, or not exploiting its full potential, can mean overlooking insights or even drawing false conclusions. Data visualisation methods have developed steadily in recent years, and many new topics are emerging at the edges of the field, such as data democratisation, data storytelling and the use of artificial intelligence in data visualisation, to name just a few.

This article highlights the dangers of relying solely on statistics without data visualisation, as well as some technologies for visualising data. It also describes the latest developments in this area and concludes with an outlook on current trends and future topics.

I have found a significant, linear, strong correlation of r = 0.816 – so why visualise my data?
If you look only at data sets and their summary statistics, you can quickly reach the wrong conclusions. Anscombe (1973) demonstrated this impressively some 50 years ago: Figure 1 shows the “Anscombe quartet”, four scatter plots that all have the same correlation of r = 0.816 as well as the same means, variances and standard errors. However, the assumption that a linear relationship has been found only holds true for Figure 1(A).

Figure 1. The Anscombe quartet shows four different scatter plots with nearly identical statistical properties: mean of x = 9; mean of y = 7.5; variance of x = 11; variance of y ≈ 4.12; r = 0.816; p = .002.

Figure 1(B) shows a perfect quadratic relationship that is easy to recognise with the human eye, yet its summary statistics do not differ from those of the other scatter plots. Figure 1(C) shows that a single outlier can distort an otherwise perfect linear relationship. Figure 1(D), on the other hand, shows a strong linear correlation that is produced solely by one outlier, although in reality no relationship is recognisable (a potential Type I error).

In all of these cases, one would assume, based only on the summary statistics, that there is a strong, significant, linear relationship between the variables. A look at the scatter plots shows that this is correct in only one of the four cases, namely Figure 1(A).
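The point can be checked numerically. The following minimal sketch recomputes the summary statistics for the four data sets published by Anscombe (1973), using only the Python standard library. The panel letters A to D are assumed to follow Anscombe's original order I to IV, and the helper names (`pearson_r`, `summarise`) are illustrative rather than part of any library.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's (1973) four data sets; the first three share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def summarise(xs, ys):
    """Means, sample variances and correlation of one data set."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson_r(xs, ys),
    }


SUMMARY = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})

# Panel C without its single outlier (the point with y = 12.74):
xs_c, ys_c = QUARTET["C"]
kept = [(x, y) for x, y in zip(xs_c, ys_c) if y != 12.74]
R_C_WITHOUT_OUTLIER = pearson_r(*zip(*kept))
print("C without outlier:", round(R_C_WITHOUT_OUTLIER, 3))
```

All four rows print practically the same numbers, which is exactly why the table alone is not enough. Removing the one outlier from panel C pushes the correlation to almost 1, something the scatter plot reveals at a glance.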

Figure 2. Nine plots from the Datasaurus Dozen. Although they look different, every data set has the same summary statistics (mean of X = 54.26, mean of Y = 47.83, standard deviation of X = 16.76, standard deviation of Y = 26.93 and Pearson correlation = -0.06).

Matejka and Fitzmaurice (2017) took the Anscombe quartet as a model and generated even more striking examples, which they call the “Datasaurus Dozen”. They repeatedly modified an existing data set while keeping its summary statistics unchanged. When plotted, the data sets have nothing in common visually, yet statistically the twelve data sets share the same values for the standard statistics (see Figure 2).
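The core idea of modifying a data set while retaining its statistics can be sketched in a few lines. The example below is a heavily simplified illustration and not the authors' implementation: it nudges one random point at a time and keeps the move only if the means, standard deviations and correlation are unchanged when rounded to two decimals. The steering towards a target shape via simulated annealing, which the paper's title refers to, is left out. The starting data, the function names and all parameter values are assumptions made for this sketch.

```python
import random
from math import sqrt
from statistics import mean, stdev


def summary(points, digits=2):
    """Mean x, mean y, sd x, sd y and Pearson r, rounded to `digits` decimals."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)
    return tuple(round(v, digits) for v in (mx, my, stdev(xs), stdev(ys), r))


def perturb_keeping_stats(points, steps=2000, scale=0.1, seed=1):
    """Randomly move points, rejecting every move that changes the rounded summary."""
    rng = random.Random(seed)
    points = list(points)
    target = summary(points)
    accepted = 0
    for _ in range(steps):
        i = rng.randrange(len(points))
        old = points[i]
        points[i] = (old[0] + rng.gauss(0, scale), old[1] + rng.gauss(0, scale))
        if summary(points) == target:
            accepted += 1
        else:
            points[i] = old  # undo the move: a summary statistic would have changed
    return points, accepted


# Arbitrary starting cloud (invented data, fixed seed for reproducibility).
start_rng = random.Random(0)
START = [(start_rng.uniform(0, 100), start_rng.uniform(0, 100)) for _ in range(60)]
RESULT, ACCEPTED = perturb_keeping_stats(START)

print("before:", summary(START))
print("after: ", summary(RESULT))
print("accepted moves:", ACCEPTED)
```

The points end up in different places, but the printed summaries are identical. Adding a criterion that prefers moves towards a chosen shape turns this random drift into the kind of targeted transformation shown in Figure 2.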

Even though the Datasaurus and Anscombe data sets are artificially generated, they show that data visualisation is not just a nice extra but a safeguard against errors of interpretation. Similar cases occur regularly in real business life: financial satisfaction and income often correlate only weakly with each other. If you plot the data, however, you will typically see a clear relationship among low earners, which is “overshadowed” by the rest of the distribution.

Another example: in the countryside, older people are more likely to like a product, while in cities it is younger people who like it. Suppose, however, that the survey failed to record whether respondents live in the country or in the city. In this case, you get data like the “X-Shape” in Figure 2 (assuming X is age and Y is the product rating). The result would be a weak negative correlation of r = -0.06, but if you plot the data, you can see that something is wrong. You go in search of a moderating effect and, with a little skill, find that urbanity is an important factor that should be included in future surveys.
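A small simulation makes the moderating effect tangible. The sketch below generates an invented survey in which the rating rises with age for rural respondents and falls with age for urban respondents; the ages, the rating scale of roughly 1 to 10, the noise level and the seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_of(rows):
    """Correlation between age and rating for a list of (age, rating) rows."""
    ages, ratings = zip(*rows)
    return pearson_r(ages, ratings)


rng = random.Random(42)
AGES = list(range(20, 81, 5))

# Invented data: rural ratings increase with age, urban ratings decrease with age.
rural = [(age, 1 + 9 * (age - 20) / 60 + rng.gauss(0, 0.5)) for age in AGES]
urban = [(age, 10 - 9 * (age - 20) / 60 + rng.gauss(0, 0.5)) for age in AGES]
pooled = rural + urban  # what you see if "rural or urban" was never recorded

R_POOLED = r_of(pooled)
R_RURAL = r_of(rural)
R_URBAN = r_of(urban)

print(f"pooled: {R_POOLED:+.2f}  rural: {R_RURAL:+.2f}  urban: {R_URBAN:+.2f}")
```

The pooled correlation is close to zero, while each group on its own shows a strong relationship with opposite sign. A scatter plot of the pooled data shows the X shape immediately, whereas the single pooled coefficient hides it.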

This is where the particular strength of data visualisation comes into play, the “big picture”. Statistical or machine-learning methods can of course also identify outliers and uncover non-linear effects, but they need all the necessary variables to do so. Data visualisation enables us to recognise when something is wrong with our data and to actively search for explanations: Do I need other data? Can feature engineering help me? In other words, can I identify city dwellers based on other variables and take this into account in my model? How are univariate and bivariate outliers distributed and what could be behind them? Do I possibly have a data quality problem that hides potential effects?

Possibilities of modern visualisation methods
Traditional visualisation methods always had the disadvantage that they had to work on paper, especially in research, which is why scatter plots and bar charts were long the standard. Even classic three-dimensional scatter plots make little sense on a purely two-dimensional surface, as our brain lacks the depth information needed to interpret such a plot easily. Figure 3(A) shows a 3D coordinate grid, but only the rotation in Figures 3(C) to 3(E) makes clear where the data points are actually located in space. The great advantage of modern data visualisation is that visualisations can be interactive. For example, newer three-dimensional scatter plots allow the user to rotate, zoom or highlight points at will, which makes the point cloud much easier to interpret.

Figure 3 (labelling missing)

Today’s interactivity goes even further: several visualisations can exist side by side and be linked together. Figure 4.A shows an example with fictitious national subsidiaries and revenue per product. Initially, the revenue per product is aggregated across all subsidiaries; clicking on a subsidiary filters the linked chart so that it shows the revenue for that specific subsidiary (see Figure 4.B).

Figure 4.A. Interactive plot showing revenue by national subsidiary on the left and, on the right, revenue per product aggregated across all national subsidiaries.

Figure 4.B. The national subsidiary “Italy” has been selected in the pie chart (left); the bar chart (right) is filtered interactively and now shows only the revenue per product for “Italy”.
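Behind such linked views there is usually a simple piece of logic: the selection in one chart becomes a filter for the aggregation shown in the other. The following sketch illustrates only that data logic, without any plotting library. The records, product names and the `revenue_by` helper are invented for this example and do not reflect the data or tool behind Figure 4.

```python
from collections import defaultdict

# Fictitious records: (subsidiary, product, revenue). All values are invented.
SALES = [
    ("Italy", "Product A", 120),
    ("Italy", "Product B", 80),
    ("Germany", "Product A", 200),
    ("Germany", "Product B", 150),
    ("France", "Product A", 90),
    ("France", "Product C", 60),
]


def revenue_by(records, key_index, selected_subsidiary=None):
    """Sum revenue per subsidiary (key_index=0) or per product (key_index=1).

    If a subsidiary is selected, only its records are aggregated, which is
    what a click on a pie segment would trigger in a linked dashboard.
    """
    totals = defaultdict(int)
    for record in records:
        if selected_subsidiary is not None and record[0] != selected_subsidiary:
            continue
        totals[record[key_index]] += record[2]
    return dict(totals)


BY_SUBSIDIARY = revenue_by(SALES, key_index=0)   # left chart
ALL_PRODUCTS = revenue_by(SALES, key_index=1)    # right chart, nothing selected
ITALY_PRODUCTS = revenue_by(SALES, key_index=1, selected_subsidiary="Italy")

print(BY_SUBSIDIARY)
print(ALL_PRODUCTS)
print(ITALY_PRODUCTS)
```

Dashboard tools handle this wiring automatically, but it helps to keep in mind that every click simply re-runs an aggregation on a filtered subset of the same data.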

The possibilities of modern methods for summarising information go even further. The best-known example is probably the “bubble plot” from the Gapminder Foundation (Gapminder Foundation, 2021). The data points here are nations, the X-axis shows the average income, the Y-axis the average life expectancy, the size of the dots shows the population size and the colours indicate the regions. The highlight: the chart not only shows the current situation, but also has a slider that can be used to move interactively forwards and backwards in time. Individual countries can be selected in order to trace their development over any given period. Figure 5, for example, shows the development of Germany from 1950 to 2019, while the other countries are shown only for 2019.

Figure 5. Interactive bubble plot from the Gapminder Foundation (Gapminder Foundation, 2021): income vs. life expectancy over time.

Which tools are suitable for data visualisation?
There is now a myriad of data visualisation tools on the market, each with its own advantages and disadvantages. One of the best known is certainly Microsoft Power BI, which 29FORWARD also likes to work with, for example to provide a historic perfumery with dashboards on customer behaviour and sales performance.

We have also had very good experiences with SAS Visual Analytics, most recently in a customer project to visualise relevant key figures relating to the COVID-19 pandemic, including geovisualisation, time series, forecasts and simulations. This made it quicker and easier for the respective regions to recognise developments and hot spots and to initiate measures at an early stage.

Last but not least, open source solutions are widely popular. Tools such as Python, R and especially R-Shiny are very suitable for building dashboards. Some openly accessible examples from the developers can be found in the R-Shiny Gallery (2021).

A look into the future
It is clear that data visualisation tools have developed considerably in recent years. So what is still to come? One might think that data visualisation is currently on a plateau – hardly any new processes are emerging, but other topics relating to data visualisation are becoming increasingly relevant. Each point would be worth its own article, so we will only touch on them here.

Data democratisation describes the sharing of data with a large group of people, creating transparency on the one hand and enabling a broad group of people to generate insights from the data on the other. Data democratisation is also possible within companies. It definitely makes a difference whether a marketing department only ever receives ready-made results from the analytics team or whether the data basis is visualised interactively. In this way, employees who may have little experience with data but have sufficient specialised knowledge can gain further insights.

Data storytelling describes techniques that go beyond the mere presentation of data. Data storytelling is typically used when insights have already been generated but are still too abstract for the target group. Data storytelling embeds these insights in a context and – as the name suggests – primarily uses narrative methods. In a nutshell: All data tells a story and data storytelling is concerned with how this story can be presented in an interesting and understandable way.

Artificial intelligence (AI) and machine learning (ML) are also influencing the field of data visualisation. On the one hand, the question is how to make complex AI and ML models visually interpretable; on the other hand, AI methods also support semi-automated reporting. One example of this is SAS Visual Analytics for Viya, where an AI suggests which visualisations are particularly suitable based on the data that a user selects.

However, data visualisation is not only gaining importance in business. More and more artists are also becoming aware of it: the first data visualisation fashion collection by Giorgia Lupi (2021) was recently published under the name Data Fashion. Even if this is not 29FORWARD’s primary field of application, we are delighted that data visualisation is becoming more widespread in other areas of life. It may even be taught in schools and universities in the future, so that the next generation can make further progress in the field and, perhaps, break into the third dimension. This might happen by means of virtual or augmented reality; initial attempts are already being made under the term “immersive data visualisation”.

And now?
In conclusion, data visualisation should remain a central element of any statistical analysis, and the means to do it are better now than ever. Perhaps this article has inspired you to take a fresh look at your own data visualisations. You may already be using data visualisation software without exploiting its full potential. Alternatively, you can make your first attempts with open source solutions; it doesn’t always have to be the ultimate in software. Do you need support with the implementation? Talk to us, we will be happy to help.

Sources:

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17-21.

Matejka, J., & Fitzmaurice, G. (2017, May). Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (pp. 1290-1294).

Gapminder Foundation (2021). Interactive bubble plot. Retrieved 02/09/2021.

R-Shiny Gallery (2021). Retrieved 02/09/2021.

Lupi, G. (2021). Blog article: data-items-a-fashion-landscape-at-the-museum-of-modern-art. Retrieved 02/09/2021.
