MultiQC: A fresh coat of paint
The style and code powering the MultiQC HTML report have remained largely the same since it was first released in 2015. The world has moved on since then, and in 2024 we hope to start modernizing the report output and introducing some new features.
Since its inception, MultiQC has used the excellent HighCharts plotting library for nearly all of the interactive plots in reports. It’s a very elegant tool and has worked very well. However, we have wanted to replace it for some time due to a number of limitations. For example, MultiQC needs two sets of code for every plot type: interactive (HighCharts) and static image (Matplotlib). This is difficult to maintain, and some plots (heatmaps, beeswarm plots) have never had Matplotlib equivalents written. HighCharts has also historically lacked certain plot types and features, so MultiQC has had to carry its own custom implementations.
Whilst we haven’t made a big attempt to optimize the code (yet), we are already seeing that the new code is much faster. Reports feel much snappier when working with large sample numbers, and CLI run times are also quicker. For example, see the timings below for a moderately large report:
| CLI: Running modules | Browser: Time to load page | Browser: Highlighting samples |
|---|---|---|
| 22 seconds (2.3x faster) | 3 seconds (3.3x faster) | 2 seconds (6x faster) |
multiqc . --interactive -m fastqc --export
The first iteration of the new Plotly graphs has just been released in v1.20 of MultiQC. For now, the old HighCharts plots are still available in a highcharts template as a fallback. However, in v1.21 (March 2024) we will remove HighCharts entirely.
We’d love the MultiQC community to use the new Plotly code as extensively as possible, so that we can catch any bugs or regressions quickly before we roll out the new plots to all users.
For the most part, we have tried to reimplement the new MultiQC plots to look and feel just as they always have done. One exception is the beeswarm / dot plot. This plot type has never reached the maturity of the others: for example, it doesn’t have an equivalent flat-image plot, and large sample numbers can crash the browser. Perhaps more importantly, the plot often doesn’t do a good job of representing the data. Point jitter is imperfect and data easily overflows the axes, hiding the true extent of the sample count.
We have taken the new Plotly integration as an opportunity to improve the beeswarm visualisation substantially – we now use a violin plot to accurately show the data distribution. Individual samples are overlaid as points, keeping the previous behavior whereby hovering on a single point highlights the same sample across all rows. When a report contains large numbers of samples we show just the distribution, to reduce the amount of data embedded in the HTML and avoid overwhelming the browser.
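As a rough illustration of the "points for small reports, distribution only for large ones" behaviour described above, here is a minimal sketch that builds a Plotly-style violin trace as a plain dictionary. The function name `violin_trace`, the `MAX_SAMPLES_WITH_POINTS` cut-off and its value are assumptions made for this example and are not taken from the MultiQC code base; only the trace keys (`type`, `x`, `orientation`, `points`, `text`) follow Plotly's figure schema.

```python
# Hypothetical cut-off for this sketch; not MultiQC's real setting.
MAX_SAMPLES_WITH_POINTS = 500


def violin_trace(metric, values_by_sample, max_points=MAX_SAMPLES_WITH_POINTS):
    """Build a Plotly-style violin trace (as a plain dict) for one metric.

    `values_by_sample` maps a sample name to that sample's value.
    """
    names = list(values_by_sample)
    trace = {
        "type": "violin",
        "name": metric,
        "orientation": "h",  # one horizontal violin per table column
        "x": [values_by_sample[name] for name in names],
    }
    if len(names) <= max_points:
        # Few samples: overlay every sample as a point, labelled for hover
        trace["points"] = "all"
        trace["text"] = names
    else:
        # Many samples: draw the distribution only and skip per-point labels
        trace["points"] = False
    return trace
```

Calling `violin_trace("pct_gc", {"a": 41.0, "b": 43.5})` yields a trace with `points` set to `"all"`, whereas passing a small `max_points` switches the same data to a distribution-only trace.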
The beeswarm / violin plot is mostly used to replace tables once sample numbers exceed a certain threshold (at which point, tables are not practical for summarising all samples). However, although the underlying data is the same, it’s not been possible to switch between table and plot representations. As of MultiQC v1.20, any table can be shown as a violin plot with a simple toggle button in the report, no matter the number of samples.
div in the report and have to compose two PNG images for export!). All in all, changing the plotting library has been a huge effort by Vlad, with over 200k lines of new code written.
As sequencing projects get ever-larger and single-cell studies become the norm, MultiQC reports with hundreds of thousands of samples are becoming a lot more common. We’ve always strived to support both ends of the scale, with a useful report whether you have one sample or a million. For example, when a report gets to a certain size, MultiQC flips to using static-image plots in reports instead of interactive ones. This is done to avoid crashing the browser with huge report file sizes. However, even if it is technically possible to show a line graph with 300k lines on it, it’s not particularly useful. To this end, we are considering approaches to better render these kinds of data volumes.
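The size-based switch between interactive and static plots can be pictured with a small decision function. This is only a sketch: `choose_plot_mode`, its parameters and the default threshold of 100 series are hypothetical names and values chosen for illustration, not MultiQC's actual implementation. The `force_interactive` argument plays a role similar to the `--interactive` flag in the command shown earlier.

```python
def choose_plot_mode(n_series, flat_threshold=100,
                     force_flat=False, force_interactive=False):
    """Return "interactive" or "flat" for a plot with `n_series` data series.

    All names and the default threshold are assumptions for this sketch.
    """
    if force_flat and force_interactive:
        raise ValueError("cannot force both flat and interactive plots")
    if force_interactive:
        return "interactive"
    if force_flat:
        return "flat"
    # Above the threshold, fall back to a static image to keep the report light
    return "flat" if n_series > flat_threshold else "interactive"
```

With these defaults, a plot with 50 series stays interactive, one with 300,000 series becomes a static image, and an explicit override wins in either direction.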
The new violin plot showing distributions is a first step in this direction, and we are thinking of applying similar logic to other plot types. For example, if a line graph has more than a certain number of samples, we could plot just the median line together with shading for the variation. Other approaches may be better: for example, keeping individual lines but plotting them with a very low opacity so that a density effect is achieved (this would still require static image plots, to avoid crashing the browser). Bar plots could follow a similar pattern, disregarding individual samples and instead showing a single bar for each category with whiskers to show the distribution.
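To make the "median line with shading" idea concrete, the sketch below collapses many per-sample lines into three lines: the median and the lower and upper quartiles, which could be drawn as a line and a shaded band. The function name `summarise_lines` and the choice of the interquartile range as the measure of variation are assumptions for this example; the post does not say which statistic would be used.

```python
from statistics import median, quantiles


def summarise_lines(samples):
    """Collapse per-sample lines into a median line plus an interquartile band.

    `samples` maps a sample name to a list of y-values. Every list must have
    the same length (shared x positions) and there must be at least two samples.
    """
    lengths = {len(ys) for ys in samples.values()}
    if len(lengths) != 1:
        raise ValueError("all samples must have the same number of points")

    summary = {"median": [], "lower": [], "upper": []}
    # zip(*...) walks the lines one x position at a time
    for column in zip(*samples.values()):
        q1, _, q3 = quantiles(column, n=4, method="inclusive")
        summary["median"].append(median(column))
        summary["lower"].append(q1)
        summary["upper"].append(q3)
    return summary
```

However many samples go in, the output holds only three values per x position, so the data embedded in a report would no longer grow with the sample count. The same per-category summary could feed the whiskers of a collapsed bar plot.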
All in all, we want to work towards a MultiQC report that not only loads, but is genuinely useful for the end user, for any number of samples. We have a GitHub issue (MultiQC/MultiQC#2280) to discuss these ideas and would love to hear your feedback! Feel free to also let us know your thoughts in the community forum, on Twitter or on Mastodon.