vignettes/quality-control_visualisation.Rmd
quality-control_visualisation.Rmd
First we load the SPIAT library.
Here we present some quality control steps implemented in SPIAT to check for the quality of phenotyping, help detect uneven staining, and other potential technical artefacts.
In this vignette we will use an inForm data file that’s already been
formatted for SPIAT with format_image_to_spe()
, which we
can load with data()
. We will use
define_celltypes()
to define the cells with certain
combinations of markers.
data("simulated_image")
# define cell types
formatted_image <- define_celltypes(
simulated_image,
categories = c("Tumour_marker","Immune_marker1,Immune_marker2",
"Immune_marker1,Immune_marker3",
"Immune_marker1,Immune_marker2,Immune_marker4", "OTHER"),
category_colname = "Phenotype",
names = c("Tumour", "Immune1", "Immune2", "Immune3", "Others"),
new_colname = "Cell.Type")
Phenotyping of cells can be verified comparing marker intensities of
cells labelled positive and negative for a marker. Cells positive for a
marker should have high levels of the marker. An unclear separation of
marker intensities between positive and negative cells would suggest
phenotypes have not been accurately assigned. We can use
marker_intensity_boxplot()
to produce a boxplot for cells
phenotyped as being positive or negative for a marker.
marker_intensity_boxplot(formatted_image, "Immune_marker1")
Note that if phenotypes were obtained from software that uses machine learning to determine positive cells, which generally also take into account properties such as cell shape, nucleus size etc., rather than a strict threshold, some negative cells will have high marker intensities, and vice versa. In general, a limited overlap of whiskers or outlier points is tolerated and expected. However, overlapping boxplots suggests unreliable phenotyping.
Uneven marker staining or high background intensity can be identified
with plot_cell_marker_levels()
. This produces a scatter
plot of the intensity of a marker in each cell. This should be
relatively even across the image and all phenotyped cells. Cells that
were not phenotyped as being positive for the particular marker are
excluded.
plot_cell_marker_levels(formatted_image, "Immune_marker1")
For large images, there is also the option of ‘blurring’ the image,
where the image is split into multiple small areas, and marker
intensities are averaged within each. The image is blurred based on the
num_splits
parameter.
plot_marker_level_heatmap(formatted_image, num_splits = 100, "Tumour_marker")
We may see cells with biologically implausible combination of markers
present in the input data when using
unique(spe_object$Phenotype)
. For example, cells might be
incorrectly typed as positive for two markers known to not co-occur in a
single cell type. Incorrect cell phenotypes may be present due to low
cell segmentation quality, antibody ‘bleeding’ from one cell to another
or inadequate marker thresholding.
If the number of incorrectly phenotyped cells is small (<5%), we advise simply removing these cells (see below). If it is a higher proportion, we recommend checking the cell segmentation and phenotyping methods, as a more systematic problem might be present.
If you identify incorrect phenotypes or have any properties (columns)
that you want to exclude you can use select_celltypes()
.
Set celltypes
the values that you want to keep or exclude
for a column (feature_colname
). Set keep
as
TRUE
to include these cells and FALSE
to
exclude them.
data_subset <- select_celltypes(
formatted_image, keep=TRUE,
celltypes = c("Tumour_marker","Immune_marker1,Immune_marker3",
"Immune_marker1,Immune_marker2",
"Immune_marker1,Immune_marker2,Immune_marker4"),
feature_colname = "Phenotype")
# have a look at what phenotypes are present
unique(data_subset$Phenotype)
## [1] "Immune_marker1,Immune_marker2"
## [2] "Tumour_marker"
## [3] "Immune_marker1,Immune_marker2,Immune_marker4"
## [4] "Immune_marker1,Immune_marker3"
In this vignette we will work with all the original phenotypes
present in formatted_image
.
We can also check for specific misclassified cells using dimensionality reduction. SPIAT offers tSNE and UMAPs based on marker intensities to visualise cells. Cells of distinct types should be forming clearly different clusters.
The generated dimensionality reduction plots are interactive, and users can hover over each cell and obtain the cell ID. Users can then remove specific misclassified cells.
# First predict the phenotypes (this is for generating not 100% accurate phenotypes)
predicted_image2 <- predict_phenotypes(spe_object = simulated_image,
thresholds = NULL,
tumour_marker = "Tumour_marker",
baseline_markers = c("Immune_marker1",
"Immune_marker2",
"Immune_marker3",
"Immune_marker4"),
reference_phenotypes = FALSE)
## [1] "Tumour_marker threshold intensity: 0.445450443784465"
## [1] "Immune_marker1 threshold intensity: 0.116980867970434"
## [1] "Immune_marker2 threshold intensity: 0.124283809517202"
## [1] "Immune_marker3 threshold intensity: 0.0166413130263845"
## [1] "Immune_marker4 threshold intensity: 0.00989731350898589"
# Then define the cell types based on predicted phenotypes
predicted_image2 <- define_celltypes(
predicted_image2,
categories = c("Tumour_marker", "Immune_marker1,Immune_marker2",
"Immune_marker1,Immune_marker3",
"Immune_marker1,Immune_marker2,Immune_marker4"),
category_colname = "Phenotype",
names = c("Tumour", "Immune1", "Immune2", "Immune3"),
new_colname = "Cell.Type")
# Delete cells with unrealistic marker combinations from the dataset
predicted_image2 <-
select_celltypes(predicted_image2, "Undefined", feature_colname = "Cell.Type",
keep = FALSE)
# TSNE plot
g <- dimensionality_reduction_plot(predicted_image2, plot_type = "TSNE",
feature_colname = "Cell.Type")
Note that dimensionality_reduction_plot()
only prints a
static version of the UMAP or tSNE plot. If the user wants to interact
with this plot, they can pass the result to the ggplotly()
function from the plotly
package. Due to the file size
restriction, we only show a screenshot of the interactive tSNE plot.
plotly::ggplotly(g)
The plot shows that there are four clear clusters based on marker intensities. This is consistent with the cell definition from the marker combinations from the “Phenotype” column. (The interactive t-SNE plot would allow users to hover the cursor on the misclassified cells and see their cell IDs.) In this example, Cell_3302, Cell_4917, Cell_2297, Cell_488, Cell_4362, Cell_4801, Cell_2220, Cell_3431, Cell_533, Cell_4925, Cell_4719, Cell_469, Cell_1929, Cell_310, Cell_2536, Cell_321, and Cell_4195 are obviously misclassified according to this plot.
We can use select_celltypes()
to delete the
misclassified cells.
predicted_image2 <-
select_celltypes(predicted_image2, c("Cell_3302", "Cell_4917", "Cell_2297",
"Cell_488", "Cell_4362", "Cell_4801",
"Cell_2220", "Cell_3431", "Cell_533",
"Cell_4925", "Cell_4719", "Cell_469",
"Cell_1929", "Cell_310", "Cell_2536",
"Cell_321", "Cell_4195"),
feature_colname = "Cell.ID", keep = FALSE)
Then plot the TSNE again (not interactive). This time we see there are fewer misclassified cells.
# TSNE plot
g <- dimensionality_reduction_plot(predicted_image2, plot_type = "TSNE", feature_colname = "Cell.Type")
# plotly::ggplotly(g) # uncomment this code to interact with the plot
In addition to the marker level tissue plots for QC, SPIAT has other methods for visualising markers and phenotypes in tissues.
We can see the location of all cell types (or any column in the data)
in the tissue with plot_cell_categories()
. Each dot in the
plot corresponds to a cell and cells are coloured by cell type. Any cell
types present in the data but not in the cell types of interest will be
put in the category “OTHER” and coloured lightgrey.
my_colors <- c("red", "blue", "darkcyan", "darkgreen")
plot_cell_categories(spe_object = formatted_image,
categories_of_interest =
c("Tumour", "Immune1", "Immune2", "Immune3"),
colour_vector = my_colors, feature_colname = "Cell.Type")
plot_cell_categories()
also allows the users to plot the
categories layer by layer when there are too many cells by setting
layered
parameter as TRUE
. Then the cells will
be plotted in the order of categories_of_interest
layer by
layer. Users can also use cex
parameter to change the size
of the points.
We can visualise a selected marker in 3D with
marker_surface_plot()
. The image is blurred based on the
num_splits
parameter.
marker_surface_plot(formatted_image, num_splits=15, marker="Immune_marker1")
Due to the restriction of the file size, we have disabled the interactive plot in this vignette. Here only shows a screen shot. (You can interactively move the plot around to obtain a better view with the same code).
To visualise multiple markers in 3D in a single plot we can use
marker_surface_plot_stack()
. This shows normalised
intensity level of specified markers and enables the identification of
co-occurring and mutually exclusive markers.
marker_surface_plot_stack(formatted_image, num_splits=15, markers=c("Tumour_marker", "Immune_marker1"))
The stacked surface plots of the Tumour_marker and Immune_marker1 cells in this example shows how Tumour_marker and Immune_marker1 are mutually exclusive as the peaks and valleys are opposite. Similar to the previous plot, we have disabled the interactive plot in the vignette. (You can interactively move the plot around to obtain a better view with the same code.)
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SPIAT_1.6.3 SpatialExperiment_1.14.0
## [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [5] Biobase_2.64.0 GenomicRanges_1.56.1
## [7] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [9] S4Vectors_0.42.1 BiocGenerics_0.50.0
## [11] MatrixGenerics_1.16.0 matrixStats_1.3.0
## [13] BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2
## [4] dplyr_1.1.4 lazyeval_0.2.2 fastmap_1.2.0
## [7] pracma_2.4.4 spatstat.geom_3.3-2 spatstat.explore_3.3-1
## [10] digest_0.6.36 lifecycle_1.0.4 spatstat.data_3.1-2
## [13] magrittr_2.0.3 compiler_4.4.1 rlang_1.1.4
## [16] sass_0.4.9 tools_4.4.1 utf8_1.2.4
## [19] yaml_2.3.10 data.table_1.15.4 knitr_1.48
## [22] labeling_0.4.3 S4Arrays_1.4.1 htmlwidgets_1.6.4
## [25] DelayedArray_0.30.1 abind_1.4-5 Rtsne_0.17
## [28] purrr_1.0.2 withr_3.0.1 desc_1.4.3
## [31] grid_4.4.1 polyclip_1.10-7 fansi_1.0.6
## [34] colorspace_2.1-1 ggplot2_3.5.1 scales_1.3.0
## [37] spatstat.utils_3.1-0 cli_3.6.3 rmarkdown_2.28
## [40] crayon_1.5.3 ragg_1.3.2 generics_0.1.3
## [43] httr_1.4.7 rjson_0.2.21 cachem_1.1.0
## [46] zlibbioc_1.50.0 BiocManager_1.30.23 XVector_0.44.0
## [49] vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8
## [52] bookdown_0.40 tensor_1.5 crosstalk_1.2.1
## [55] systemfonts_1.1.0 magick_2.8.4 plotly_4.10.4
## [58] tidyr_1.3.1 spatstat.univar_3.0-0 jquerylib_0.1.4
## [61] goftest_1.2-3 glue_1.7.0 spatstat.random_3.3-1
## [64] pkgdown_2.1.0 gtable_0.3.5 deldir_2.0-4
## [67] UCSC.utils_1.0.0 munsell_0.5.1 tibble_3.2.1
## [70] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
## [73] R6_2.5.1 textshaping_0.4.0 evaluate_0.24.0
## [76] lattice_0.22-6 highr_0.11 mmand_1.6.3
## [79] bslib_0.8.0 Rcpp_1.0.13 gridExtra_2.3
## [82] SparseArray_1.4.8 nlme_3.1-164 spatstat.sparse_3.1-0
## [85] xfun_0.47 fs_1.6.4 pkgconfig_2.0.3