Data in images: what tools for data visualization?

Producing aesthetic or interactive graphs

Don’t we say “a picture is worth a thousand words”? We could also say that chemistry and physics are visual sciences (1). A sketch of a reaction scheme in a laboratory notebook, tensile test curves in an article, or absorption spectra are all data visualisation techniques.

Presenting data well helps to communicate a result efficiently, yet researchers and their students rarely receive training in data visualisation.

Here you’ll find a series of tools to help you explore your data and enhance your articles or presentations.

While there is a multitude of specialised software, having a few programming skills ensures a high level of versatility. Python and R are two programming languages that can be used for a wide range of visualisations, from the simplest to the most complex.

Python or R: how to choose

The two programming languages Python and R are widely used in the scientific community, whatever the discipline. Most types of visualisation are possible with both Python and R. The choice of one language or another depends essentially on the community in which you wish to work, and the collaborations you wish to pursue.

Nevertheless, there are a few differences between the two languages that are worth considering.

R is first and foremost a data processing language. It was originally designed by and for statisticians, as an extension of the S language. As a result, R includes a number of fairly advanced statistical analysis functions and graphing modules, to which new ones can be added by downloading dedicated packages. Python is a general-purpose programming language. To perform statistical analysis and visualisations, you need to import additional libraries such as matplotlib. Nevertheless, Python is increasingly used in data science, particularly for machine learning, even though R, as a language designed for statistical data processing, is also a natural choice for machine learning.

Python is therefore used by a wider community than R (2), but largely for uses unrelated to statistics and data visualisation, whereas the R community is focused on data science.

Finally, it should be noted that the two languages are not mutually exclusive. On the one hand, they share a number of tools, such as plotly, a library that relies on JavaScript to create interactive graphics, or Shiny, for creating web applications. On the other hand, it is possible to use both Python and R when necessary, thanks to the reticulate package, which lets you call Python functions and libraries in R, and rpy2, which lets you call R elements in Python.

Promoting collaboration and reproducibility

A notebook is not necessary for writing Python or R, but it is particularly useful for data visualisation. It makes it possible to write documents in which the choices made for the graphical representation are explained in comment blocks, and to insert the images into a narrative, according to the principles of literate programming formulated by Donald E. Knuth (3).

This consists of combining blocks of text written in Markdown with blocks of code in the chosen language, executable using a dedicated kernel. When the code blocks are executed, their output (a graphic, for example) is inserted into the document, either in place of the code block or after it. There are many solutions available for writing in notebook format. They can be grouped into two main categories:

  • Those whose interface is displayed directly in a web browser, such as Jupyter Notebook
  • IDEs (integrated development environments) in which the appropriate library is loaded, for example R Markdown in RStudio or Jupyter Lab.

Combining Jupyter notebooks with Quarto allows them to be exported and viewed in a wide variety of formats, producing publication-quality documents very easily, in particular by using available templates that match the standards of major publishers such as PLOS, Elsevier or ACS. It is also possible to obtain a result in HTML, in order to create a version in which readers can interact with the tables and graphics. Solutions of this type make it possible to use the same document to obtain, with just a few commands, a version for publication in a journal, an interactive version, and a version to display on a website.

Data visualisation with Python

Importing and structuring data for visualisation

The easiest way to import data into Python is to build lists. In Python, a list is a series of values, possibly of different types, enclosed in square brackets and separated by commas. Lists can be built by hand or filled from various file formats, such as CSV, JSON or XML. Going beyond lists, two very common libraries make the task much easier: NumPy and pandas.
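As a minimal sketch of these two routes, the example below builds one list by hand and fills another from CSV text using Python's standard csv module. The sample values and the column names (sample, absorbance) are invented for illustration; with a real file, the text would come from open() rather than from a string.

```python
import csv
import io

# A list built by hand: values of different types, in square brackets.
measurement = ["sample_A", 0.42, 3, True]

# Hypothetical CSV content; with a real file you would pass open("data.csv").
csv_text = "sample,absorbance\nA,0.42\nB,0.57\nC,0.61\n"

# csv.DictReader yields one dictionary per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values are read as strings, so convert them before plotting.
absorbances = [float(row["absorbance"]) for row in rows]

print(measurement)
print(absorbances)
```

Note that the conversion to float has to be written explicitly here, which is one of the chores that NumPy and pandas take care of.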

Structuring tables with NumPy

The NumPy library (for Numerical Python) can be used to manipulate matrices. The basic element introduced by NumPy is the array. A one-dimensional array is similar to a list, but unlike a list, it can only contain objects of the same type (numeric, textual, etc.).

The first advantage of the array is that it can be multidimensional. It can be used to combine lists in rows and columns. But above all, the NumPy package can be used to apply a number of mathematical functions to arrays. It can be extended with the SciPy package (Scientific Python), which introduces additional possibilities for mathematical transformations.
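The short sketch below illustrates these points: two lists are combined into a two-dimensional array, and mathematical functions are then applied to whole rows at once. The variable names and values are illustrative only.

```python
import numpy as np

# Two lists of equal length (illustrative values).
time_s = [0.0, 1.0, 2.0, 3.0]
signal = [1.0, 2.0, 4.0, 8.0]

# Combine the lists as the rows of a two-dimensional array.
table = np.array([time_s, signal])
print(table.shape)  # (number of rows, number of columns)
print(table.dtype)  # a single type shared by every element

# Mathematical functions apply to the whole array at once, without a loop.
log_signal = np.log2(table[1])
mean_per_row = table.mean(axis=1)

print(log_signal)
print(mean_per_row)
```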

Structuring DataFrames with pandas

Another solution for structuring data in rows and columns is to use the pandas package. The use of pandas does not exclude the use of NumPy. Pandas allows you to create DataFrames that are compatible with other types of structured two-dimensional data, such as a SQL database or an Excel or CSV table (see the list of formats that can be read in DataFrame format). Pandas overcomes some of NumPy’s limitations, as a DataFrame can contain objects of different types.
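To make this concrete, here is a small sketch that reads CSV text into a DataFrame whose columns have different types. The column names (compound, mass_g, purified) and the values are made up for the example; pd.read_csv accepts a file path in the same way.

```python
import io

import pandas as pd

# Hypothetical CSV content; pd.read_csv also accepts a path to a file.
csv_text = "compound,mass_g,purified\nA,1.20,True\nB,0.85,False\nC,2.10,True\n"

df = pd.read_csv(io.StringIO(csv_text))

# Each column keeps its own type: text, decimal numbers, booleans.
print(df.dtypes)

# Columns are addressed by label, and rows can be filtered by condition.
purified_mass = df.loc[df["purified"], "mass_g"].sum()
print(purified_mass)

# A column converts to a NumPy array when needed.
masses = df["mass_g"].to_numpy()
```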

Interact directly with electronic laboratory notebooks. Most electronic laboratory notebooks have an API that makes it easy to interface the application with software using Python. For example, the elabapi Python library can be used to interact with the eLabFTW v2 notebook API, and the LabguruPython, elabjournal and rspace-client-python libraries with the corresponding electronic laboratory notebooks. Equivalent libraries also exist for R, such as LabguruR or elabjournal-r.

Python visualisation libraries

There are a lot of visualisation tools available in Python, many of them very specialised, community-specific tools. A very useful tool for browsing existing libraries is the PyViz platform. In particular, it provides a page of examples for specific subjects or sub-domains. Despite this abundance of information, a small number of general libraries can be used to produce a wide variety of graphics.

General libraries
  • Matplotlib : The original aim of the matplotlib library was to replicate the visualisation capabilities offered by MATLAB. This is an essential library for producing graphs with Python. It allows you to set a large number of parameters for a graphical representation: line layout, font, size, colour, style, etc. The central object created by matplotlib is the figure. This figure is at the top of a hierarchy of sub-objects: title, axes, curves, etc. These sub-objects themselves have ramifications: axis titles, graduations, markers, etc. This means that matplotlib can be used to plot functions as well as to create standard graphs (histograms, pie charts, box plots, scatter plots …) in two or three dimensions. But while each element of a figure can be configured, this sometimes comes at the cost of simple code. To write simpler code, other packages can be considered.
An example of weather data visualisation with matplotlib, created by Jan Kühn. The code used is explained and available on the Python Graph Gallery. This graph is published under the CC BY-NC-SA 4.0 licence.
  • Seaborn : If you don’t need to configure every detail of a graph, you can use a library such as Seaborn rather than matplotlib directly. Seaborn builds graphs directly from data structured using pandas. So, for example, rather than having to enter the axis titles in the code yourself, as you would have to do with matplotlib, seaborn uses the column labels from the DataFrame directly. Whenever possible, seaborn includes a representation of the confidence interval (95% by default, but this can be changed).
  • Bokeh : Bokeh can be used to create interactive visualisations that can be viewed in a web browser, allowing JavaScript to be used from Python without having to write JavaScript directly. The library works equally well with lists, NumPy arrays and pandas DataFrames.
  • Plotly : Like Bokeh, Plotly is a library for introducing interactivity into graphics productions. It is more versatile than Bokeh, in that it works with Python and R, as well as directly in JavaScript. It is also better adapted to 3D representations.
  • Streamlit and Dash : Streamlit and Dash can be used to create interactive web applications using plotly graphics.
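The hierarchy described for matplotlib (a figure containing axes, which in turn carry titles, labels and curves) can be sketched as follows. This is only an illustrative example: the plotted function, the labels and the styling are arbitrary choices, and the image is written to an in-memory buffer so the script runs anywhere; passing a file name to savefig would write it to disk instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# The figure is the top-level object; the axes hold the actual plot.
fig, ax = plt.subplots(figsize=(6, 4))

# Each call below configures one sub-object of the hierarchy.
(line,) = ax.plot(x, np.sin(x), color="tab:blue", linestyle="--", label="sin(x)")
ax.set_title("A sine curve")
ax.set_xlabel("x (rad)")
ax.set_ylabel("sin(x)")
ax.legend()

# Render the figure as PNG into memory (a file name works the same way).
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
```

Every element is set explicitly here, which shows both the fine control matplotlib offers and why a higher-level library can be more comfortable for routine graphs.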
More specialised libraries
  • ASE’s GUI for atomic visualisation : ASE, Atomic Simulation Environment, is a set of Python tools for performing atomic simulations, analysing and visualising them.
  • Astropy for astronomy and astrophysics : Astropy is a set of libraries useful in astronomy and astrophysics.
Visualisation of a simulation of the temperature of the planet TRAPPIST-1e using the astropy package. The code used is available on Wikimedia.
  • Biopython for biology and bioinformatics : Biopython is a library for working with sequences, such as nucleotide or amino acid sequences.
  • NetworkX : The NetworkX library can be used to represent complex networks.
  • Nglview for bioinformatics : Nglview can be used to visualise sequence or molecule data, produced using Biopython or RDKit for example.
  • NMRglue for NMR spectroscopy : This library can read Bruker, NMRPipe, Sparky, Varian and JCAMP files. It requires NumPy and SciPy to be installed, as well as Matplotlib for visualisation.
  • RDKit for cheminformatics : RDKit can be used to visualise molecules in two or three dimensions. The library works with representations of molecules in SMILES (Simplified Molecular-Input Line-Entry System), InChI (International Chemical Identifier) or MOL format. It can also be used to include a certain amount of metadata about the molecules and chemical reactions concerned in the output file.
  • SunPy : SunPy is a library designed to analyse solar data.
  • TomoPy : TomoPy can be used to reconstruct tomographic images.
  • yt : yt can be used to visualise astrophysical simulations.
Visualisation with yt of the magnetic field created during a simulation of the merger of two white dwarf stars, by Suoking Ji and Robert Fischer. The code is available on the yt-project blog.

Data structure in R

As with Python, data can be imported into R in CSV, Excel, Google Sheets, JSON or XML formats. Unlike Python, R has a built-in tabular structure: data is stored directly in tables called data frames, without any additional library.

Focus on the tidyverse

To import, manipulate and visualise data, a standard is spreading in the R community around the tidyverse. Data organised in long rather than wide format is referred to as “tidy”. In a tidy dataset:

  • each variable corresponds to a column
  • each observation corresponds to a row
  • each value corresponds to a cell
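The difference between a wide table and a tidy one can be sketched as follows. For consistency with the other examples in this article, the sketch uses Python and pandas rather than R; in the tidyverse, the pivot_longer() function of the tidyr package plays the equivalent role. The station names and temperature values are invented for illustration.

```python
import pandas as pd

# Wide layout: one column per year, so the variable "year" hides in the headers.
wide = pd.DataFrame(
    {
        "station": ["Paris", "Lyon"],
        "2021": [12.9, 13.4],
        "2022": [13.8, 14.2],
    }
)

# melt() moves the year headers into a column: one row per observation.
tidy = wide.melt(id_vars="station", var_name="year", value_name="mean_temp_c")
print(tidy)
```

In the tidy version, station, year and temperature are each a column, and every row is a single measurement.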

The tidyverse package includes a set of libraries for obtaining tidy data and manipulating it using a common logic.

Importing data: readr, readxl, xml2, httr, jsonlite, haven, rvest

A multitude of tidyverse packages can be used to import data from a wide range of formats. This data is imported in the form of tibbles. Each library can read and write a specific format:

  • readr for tabular files such as CSV, delimited or fixed-width
  • readxl for Excel files
  • xml2 for XML and HTML files
  • jsonlite for JSON files
  • haven for data from other statistical software such as SAS, SPSS or Stata.

It is also possible to import data via an API with the httr package or via web scraping with the rvest package.

Visualisation tools in R

While there are many data visualisation tools in Python, and they have to be added via complementary libraries, most of the visualisations they enable can be carried out in R with the ggplot2 library, which is part of the tidyverse.

ggplot2

ggplot2 implements the Grammar of Graphics theorised by Leland Wilkinson. As the package is part of the tidyverse, it works particularly well with data organised in tibbles, but it also works with data frames.

The versatility of ggplot2 lies in the fact that graphics are constructed using a succession of layers, added one after the other using the + symbol. So there are no predefined graphics. To create a plot, call the ggplot() function, giving it the arguments:

  • with data, the object containing the data
  • with mapping and the aes() function, an association between variables and graph attributes such as axes, point size and colour.

Next, use + to add further layers. It works in a similar way to the %>% or |> pipe. Only one of these layers is required: a geom, which specifies the type of graph (points, curves, bars, etc.). Other layers allow you to specify titles, modify axes, create graphs by subgroup (facets), add labels, etc.
Visualisation of temperature anomalies since 1850 by David Curran. The code is available on GitHub

ggplot2 cannot be used on its own to create interactive graphics. However, it can be combined with plotly, using the ggplotly() function.

Plotly, Bokeh and Shiny for interactive graphics

plotly is a library available in both R and Python. It can be used to create interactive graphs and display them in a web browser. You can also use Bokeh with R, using the rbokeh package. Shiny, which can be used directly in RStudio, allows you to build web applications. To get an idea of what R-Shiny can do, take a look at the Shiny gallery.

Which type of graph to choose?

If you’re a little lost when you have to choose the right type of visualisation for your data, there are a number of tools available to help you navigate through the different possibilities.

  • The Data Visualisation Catalogue, created by Severino Ribecca and available in English, French and four other languages, allows you to discover a wide variety of graph types and browse them by function.
  • The Data Viz Project also allows you to compare different types of visualisation by function, by type of input data or even by shape.
  • From Data to Viz allows you to decide on the most suitable graph using various filters and a network diagram. The site is full of resources explaining how to create graphs using R and Python and various libraries (including ggplot2 and plotly).

Open access resources

  1. Wu, Hsin-Kai, and Priti Shah. “Exploring Visuospatial Thinking in Chemistry Learning”. Science Education, vol. 88, no. 3, May 2004, p. 465, doi:10.1002/sce.10126.
  2. In November 2023, Python was in first place in the TIOBE Index, a monthly ranking of the popularity of programming languages, while R was ranked 19th. Similarly, the RedMonk ranking of programming languages, based on activity on GitHub and Stack Overflow, ranked Python 2nd and R 12th in January 2023.
  3. Knuth, Donald E., Literate Programming, Center for the Study of Language and Information Lecture Notes, no. 27, 1992, xvi+368p. ISBN 0-937073-80-6