Clustering using Python tools#
Tested with QuPath 0.6.0rc3 on Win11
This section builds on the following resources:
Video: https://www.youtube.com/watch?v=984Jz2QCvng
Instructions: https://github.com/qupath/i2k-qupath-for-python-programmers
Documentation: https://qupath.github.io/qubalab-docs/
Thanks to Alan O’Callaghan for his work on these resources.
Create an environment, activate it, and install qubalab and the other packages we will need:
conda create -n yourEnvironmentName python=3.10
conda activate yourEnvironmentName
pip install --upgrade qubalab
pip install ipython leidenalg igraph umap-learn
pip install seaborn[stats]
Start QuPath, open your project and an image, and open a connection#
The .gif shows opening a connection, checking that the Python symbol has turned red, and closing the connection.
Do not close the connection until you are done!

from qubalab.qupath import qupath_gateway
from qubalab.images.qupath_server import QuPathServer
import matplotlib.pyplot as plt
#This uses the default port, 25333. Change it to match the port you set in QuPath if you altered it.
gateway = qupath_gateway.create_gateway(auth_token=None, port=25333)
gateway
qupath_server = QuPathServer(gateway) # read pixels from qupath via gateway
Test the connection!#
Check the state of QuPath by taking a snapshot of the interface - not necessary, but a neat trick to know.
plt.imshow(qupath_gateway.create_snapshot())
plt.axis(False)
plt.show()
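We can also read pixels through the qupath_server created above. A minimal sketch - the metadata and read_region calls below are assumptions based on the qubalab ImageServer interface, so check the qubalab documentation linked earlier if they do not match your version:

# Assumed qubalab API - verify against the qubalab docs linked above
print(qupath_server.metadata)  # name, shape, pixel size, ...

# Read a heavily downsampled version of the whole image (assumed signature)
region = qupath_server.read_region(downsample=32, x=0, y=0,
                                   width=qupath_server.metadata.width,
                                   height=qupath_server.metadata.height)
plt.imshow(region[0])  # assumes channels-first axes; show the first channel
plt.axis(False)
plt.show()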
In QuPath#
I used the following script to create cells; feel free to follow along if you have access to the sample data.
//QuPath script, do not run here!
setImageType('FLUORESCENCE');
createFullImageAnnotation(true)
qupath.ext.instanseg.core.InstanSeg.builder()
.modelPath("F:/Scripts/fluorescence_nuclei_and_cells")
.device("cpu")
.inputChannels([
        ColorTransforms.createChannelExtractor("Hoechst"),
        ColorTransforms.createChannelExtractor("CD31"),
        ColorTransforms.createChannelExtractor("CD11c"),
        ColorTransforms.createChannelExtractor("CD68"),
        ColorTransforms.createChannelExtractor("CD163"),
        ColorTransforms.createChannelExtractor("CD20"),
        ColorTransforms.createChannelExtractor("CD4"),
        ColorTransforms.createChannelExtractor("CD8a"),
        ColorTransforms.createChannelExtractor("CD45RO"),
        ColorTransforms.createChannelExtractor("PDL1"),
        ColorTransforms.createChannelExtractor("LAG3"),
        ColorTransforms.createChannelExtractor("PD1"),
        ColorTransforms.createChannelExtractor("CD45"),
        ColorTransforms.createChannelExtractor("S100a")
])
.outputChannels()
.tileDims(512)
.interTilePadding(32)
.nThreads(12)
.makeMeasurements(true)
.randomColors(false)
.build()
.detectObjects()
clearSelectedObjects(true);
Data cleaning in QuPath#
Optionally, run some filtering on the cells to remove outliers or objects we do not think fit our analysis. These scripts are for use in QuPath, not Python.
//QuPath script, do not run here
toRemove = []
//Cycle through all of the detected objects, and check for any that do not have a nucleus.
//There are a variety of ways of doing this, but one is checking for NaNs in measurements for nuclei.
for (cell in getDetectionObjects()) {
    if (cell.measurements.get("Nucleus: Solidity").isNaN()) {
        toRemove.add(cell)
    } else if (cell.measurements.get("Nucleus: Area µm^2") == cell.measurements.get("Cell: Area µm^2")) {
        //Additionally, remove any cells that have the exact same area of nucleus and cytoplasm.
        //This is mostly a demonstration of another way to remove certain kinds of detected outliers.
        toRemove.add(cell)
    }
}
print "Objects removed: " + toRemove.size()
removeObjects(toRemove, true)
Additionally, the UMAP function run below throws an error if ANY value is NaN. Since we are cycling through all of the “Mean” values, let’s go ahead and remove all cells where any measurement with “Mean” in the name has a NaN value.
//QuPath script do not run here
toRemove = []
// Iterate over all objects
for (def pathObject in getAllObjects()) {
    // Check if any measurement contains "Mean" and has a NaN value
    if (pathObject.measurements.any { key, value -> key.contains("Mean") && Double.isNaN(value) }) {
        toRemove.add(pathObject)
    }
}
// Remove the collected objects
removeObjects(toRemove, true)
Python#
Create a dataframe using cell measurements#
Generate a basic dataframe using pandas, and normalize it. Normalization is important when clustering, as different channels in a multiplex image can have vastly different value ranges yet represent the same amount of protein, or have similar impacts on classification.
There are many methods of normalization; this is a fairly simple one.
https://pmc.ncbi.nlm.nih.gov/articles/PMC4779207/
https://www.nature.com/articles/s41598-018-22489-1
Depending on the setting, variation in the samples (genetically identical mice vs genetically diverse humans, variation in human sample collection and fixation protocols) and other complications, different methods of normalization may be called for. Check with your local statistician and be ready to explain your sources of variation!
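As a toy illustration of the z-score normalization used below (hypothetical values, with measurement names modeled on the channels used later):

import pandas as pd

# Two channels on very different scales carrying the same relative pattern
toy = pd.DataFrame({"CD4: Cell: Mean": [100.0, 200.0, 300.0],
                    "CD8a: Cell: Mean": [1.0, 2.0, 3.0]})
# z-score: subtract each column's mean, divide by its standard deviation
print((toy - toy.mean()) / toy.std())  # both columns become [-1, 0, 1]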
Note that the method below makes the very simple assumption that ALL images in the project should be part of the analysis.#
Consider: Are the measurements in all channels relevant? Didn’t we basically turn off some channels right at the start?
Further cleaning of your data can both improve results and reduce processing time!
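For example, you could tighten the names filter used in the code below to keep only the channels you care about. A sketch - the subset and the assumption that measurement names begin with the channel name are both hypothetical, so check the contents of names in your own project:

# Drop-in replacement for the `names = [x for x in names if x.endswith("Mean")]` line below
channels_of_interest = ["CD4", "CD8a", "CD68", "CD20"]  # hypothetical subset
names = [x for x in names
         if x.endswith("Mean") and any(x.startswith(ch) for ch in channels_of_interest)]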
Beyond this point, make sure to check whether you are running the “Single image” or “Full project” code - they do not mix!#
#Full project
import pandas as pd

project = gateway.getProject()
detections = []
detection_to_image = {}

for imageEntry in project.getImageList():
    imageData = imageEntry.readImageData()
    hierarchy = imageData.getHierarchy()
    imageDetections = hierarchy.getDetectionObjects()
    #Extra processing here to make sure we can map detections back to the correct image
    for det in imageDetections:
        uuid = det.getID().toString()
        detections.append(det)
        detection_to_image[uuid] = imageEntry.getID()  # Store mapping of detection to image ID

print(len(detections))

names = detections[0].getMeasurementList().getNames()
names = [x for x in names if x.endswith("Mean")]

# Build the rows first, then construct the DataFrame in one go
# (faster and more idiomatic than appending row by row)
rows = [
    {
        "UUID": det.getID().toString(),
        "ImageID": detection_to_image[det.getID().toString()],
        **{name: det.getMeasurements().get(name) for name in names}
    }
    for det in detections
]
df = pd.DataFrame(rows, columns=["UUID", "ImageID"] + names)

# Standardize numeric columns (excluding UUID and ImageID)
numeric_df = df.drop(columns=["UUID", "ImageID"])
normalized_df = (numeric_df - numeric_df.mean()) / numeric_df.std()

#Check the state of your dataframes!
#print(df.head())
#print(normalized_df.head())
# Single image
import pandas as pd

detections = gateway.getDetectionObjects()

names = detections[0].getMeasurementList().getNames()
names = [x for x in names if x.endswith("Mean")]

# Build the rows first, then construct the DataFrame in one go
rows = [{name: det.getMeasurements().get(name) for name in names} for det in detections]
df = pd.DataFrame(rows, columns=names)

# standardise the columns to avoid scale and shift effects
normalized_df = (df - df.mean()) / df.std()
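Whichever path you took, it is worth sanity-checking the result - after z-scoring, every column should have a mean of roughly 0 and a standard deviation of roughly 1:

# Quick sanity check of the normalization
print(normalized_df.shape)
print(normalized_df.describe().loc[["mean", "std"]])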
Running UMAP#
https://umap-learn.readthedocs.io/en/latest/
We could now run a dimensionality reduction algorithm, which might help us to visualise classifiers or other analyses in QuPath.
We can also add these features back into QuPath.
# Full project
import umap

# Run UMAP dimensionality reduction
embedding = umap.UMAP().fit_transform(normalized_df)

# Create a dictionary mapping UUID to UMAP values
umap_results = {df.iloc[i]["UUID"]: (float(embedding[i, 0]), float(embedding[i, 1])) for i in range(len(df))}

for imageEntry in project.getImageList():
    imageData = imageEntry.readImageData()
    hierarchy = imageData.getHierarchy()
    imageDetections = hierarchy.getDetectionObjects()
    for det in imageDetections:
        uuid = det.getID().toString()
        image_id = imageEntry.getID()
        # Ensure we're updating only detections that belong to the current image
        if uuid in umap_results and detection_to_image[uuid] == image_id:
            det.getMeasurementList().put("UMAP1", umap_results[uuid][0])
            det.getMeasurementList().put("UMAP2", umap_results[uuid][1])
    imageEntry.saveImageData(imageData)
# Single image
import umap

# run a dimensionality reduction algorithm on the measurements
embedding = umap.UMAP().fit_transform(normalized_df)

# assign back to measurement list
for i in range(embedding.shape[0]):
    detections[i].getMeasurementList().put("UMAP1", float(embedding[i][0]))
    detections[i].getMeasurementList().put("UMAP2", float(embedding[i][1]))
Plot the UMAP using Python
# Extract UMAP1 and UMAP2 values from the embedding
umap1 = embedding[:, 0]
umap2 = embedding[:, 1]
# Create a scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(umap1, umap2, s=5, alpha=0.5) # s=5 sets small point size, alpha=0.5 for transparency
plt.xlabel("UMAP1")
plt.ylabel("UMAP2")
plt.title("UMAP Scatter Plot")
plt.grid(True)
# Show the plot
plt.show()
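It can also help to color the embedding by a single marker to see which parts of the UMAP it drives. A sketch - the measurement name is a hypothetical example, so pick a real one from names:

# Color points by one normalized marker intensity - adjust the column name!
marker = "CD20: Cell: Mean"  # hypothetical name; check `names` for the exact spelling
plt.figure(figsize=(8, 6))
plt.scatter(umap1, umap2, c=normalized_df[marker], s=5, alpha=0.5, cmap="viridis")
plt.colorbar(label=marker)
plt.xlabel("UMAP1")
plt.ylabel("UMAP2")
plt.title("UMAP colored by " + marker)
plt.show()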
Plot the UMAP using QuPath!
Run the following script in the QuPath script editor to plot the UMAP for the objects in the current image
//QuPath script, do not run here
import qupath.lib.gui.charts.Charts

nameOfActiveImage = getProjectEntry().getImageName()
Charts.scatterChart()
        .viewer(QuPathGUI.getInstance().getViewer())
        .title('UMAP for ' + nameOfActiveImage)
        .measurements(getDetectionObjects(), "UMAP1", "UMAP2")
        .show()
Plot the UMAP for all detections in the project
//QuPath script, do not run here
import qupath.lib.gui.charts.Charts

project = QP.getProject()
detections = []
for (imageEntry in project.getImageList()) {
    hierarchy = imageEntry.readHierarchy()
    detections += hierarchy.getDetectionObjects()
}
Charts.scatterChart()
        .viewer(QuPathGUI.getInstance().getViewer())
        .title('UMAP based on mean channel intensities')
        .measurements(detections, "UMAP1", "UMAP2")
        .show()
Preparing to cluster cells#
We can compute the K-nearest-neighbor (KNN) graph of the cells, and turn this into an adjacency matrix.
from sklearn.neighbors import kneighbors_graph
import igraph
# find KNN graph
A = kneighbors_graph(normalized_df, 50)
# convert matrix to adjacency graph
g = igraph.Graph.Adjacency((A > 0))
kneighbors_graph: What’s happening?#
We are building a network (graph) where each cell (data point) is connected to its 50 nearest neighbors based on their feature similarity.
Why 50?
This number controls how connected the graph is. A small value (e.g., 5-10) might not connect all data points well, while a large value (e.g., 100+) could blur the boundaries between clusters.
50 is a commonly used middle ground: it provides enough connections to detect meaningful groups while keeping clusters distinct; see the sketch below for one way to check this.
How does “nearest neighbors” work?
It measures similarity between cells based on their feature values (e.g., intensity measurements) and links each cell to its 50 most similar neighbors.
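If you want to see how the neighbor count affects connectivity, you can count the connected components of the resulting graph. A minimal sketch, using the same functions as the code above:

from sklearn.neighbors import kneighbors_graph
import igraph

# Compare how connected the graph is for a small vs. the chosen neighbor count
for k in (5, 50):
    A_k = kneighbors_graph(normalized_df, k)
    g_k = igraph.Graph.Adjacency((A_k > 0))
    n_components = len(g_k.connected_components(mode="weak"))
    print(f"{k} neighbors -> {n_components} connected component(s)")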
igraph.Graph.Adjacency: What’s happening?#
The KNN graph (A) is stored as a matrix where each row represents a cell and each column represents whether it’s connected to another cell.
The (A > 0) part checks whether there is a connection (1 for connected, 0 for not connected).
This converts the matrix into a proper graph structure using igraph, so that we can apply clustering.
Why use > 0?
This ensures that we only keep meaningful connections (i.e., cells that are actually linked in the KNN graph).
By default (mode="connectivity"), the matrix A stores a 1 for each neighbor link; with mode="distance" it would store the actual distances instead. Either way, all we care about here is whether a connection exists, which > 0 gives us.
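To see the difference between the two modes for yourself (a small sketch; mode="connectivity" is scikit-learn's default):

from sklearn.neighbors import kneighbors_graph

# mode="connectivity" stores 1.0 for each neighbor link;
# mode="distance" stores the actual distance to each neighbor instead
A_conn = kneighbors_graph(normalized_df, 50, mode="connectivity")
A_dist = kneighbors_graph(normalized_df, 50, mode="distance")
print(A_conn.data[:5])  # all 1.0
print(A_dist.data[:5])  # Euclidean distances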
Running clustering#
Then, we can cluster the adjacency graph. We could also have done K-means or something similar; a sketch of that alternative follows the clustering code below.
https://leidenalg.readthedocs.io/en/stable/reference.html
#For project
import leidenalg

# partition the KNN graph based on graph modularity
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)

# Create dictionary mapping UUIDs to cluster labels
cluster_results = {
    df.iloc[i]["UUID"]: f"Cluster {partition.membership[i]}"
    for i in range(len(df))
}
# assign the partitions as classes to the original QuPath objects
for imageEntry in project.getImageList():
    imageData = imageEntry.readImageData()
    hierarchy = imageData.getHierarchy()
    imageDetections = hierarchy.getDetectionObjects()
    for det in imageDetections:
        uuid = det.getID().toString()
        image_id = imageEntry.getID()
        # Ensure we're updating only detections that belong to the current image
        if uuid in cluster_results and detection_to_image[uuid] == image_id:
            det.setClassification(cluster_results[uuid])
    imageEntry.saveImageData(imageData)
#For single image
import leidenalg

# partition the KNN graph based on graph modularity
partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)

# assign the partitions as classes to the original QuPath objects
for i in range(embedding.shape[0]):
    detections[i].setClassification(f"Cluster {partition.membership[i]}")
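As mentioned above, K-means is one alternative to graph-based clustering. A minimal sketch using scikit-learn (1.2+) on the same normalized features; note that, unlike Leiden, the number of clusters must be chosen up front:

import pandas as pd
from sklearn.cluster import KMeans

# K-means directly on the normalized features; n_clusters is a free parameter
kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto")
kmeans_labels = kmeans.fit_predict(normalized_df)
print(pd.Series(kmeans_labels).value_counts())
# These labels could be written back to QuPath exactly like the Leiden clusters above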
find_partition: What’s happening?#
This groups the cells into clusters using the Leiden algorithm, which is designed to find highly connected groups in a graph.
The method used here is ModularityVertexPartition, which finds groups that maximize a measure called modularity.
What is “Modularity”?
Modularity measures how well-defined clusters are.
A high modularity score means nodes in the same cluster are highly connected to each other, but weakly connected to nodes in other clusters.
The algorithm keeps adjusting cluster assignments to maximize modularity, ensuring meaningful groups.
Why use ModularityVertexPartition?
This is a standard way of defining clusters in a network.
It ensures that groups are found based on connections, not just distance.
This method is commonly used in biological and medical data where we are dealing with complex, non-linear relationships between features.
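You can inspect the partition directly to see how many clusters were found, how large they are, and the modularity score achieved. A short sketch using the leidenalg/igraph objects created above (partition.quality() reports the optimized value for this partition type):

# Inspect the Leiden result
print(len(partition), "clusters")
print("cluster sizes:", partition.sizes())
print("modularity:", partition.quality())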
What is wrong with clustering from one image?#
How is classification different?#
Create classifiers from clustering results?#
Next we generate some heatmaps to inspect our clusters#
Clusters are great, but do we trust them? What do they mean? Are they just artifacts?
At this point the single-image and full-project code paths converge, but note that the heatmap code below assigns clusters via the UUID column and cluster_results mapping, which only exist in the full-project dataframe; for a single image, assign the labels directly instead (e.g. normalized_df["Cluster"] = [f"Cluster {m}" for m in partition.membership]). Either way, you should be doing this with multiple images if possible!
import seaborn as sns
import matplotlib.pyplot as plt
# Add cluster labels to the normalized DataFrame
normalized_df["Cluster"] = df["UUID"].map(cluster_results) # Assign clusters to data
# Group by cluster and compute the mean of each normalized feature
cluster_means = normalized_df.groupby("Cluster").mean(numeric_only=True)
# Plot heatmap with normalized values
plt.figure(figsize=(12, 6))
sns.heatmap(cluster_means, cmap="coolwarm", annot=False) # Set annot=True to show values
plt.title("Heatmap of Clustered Cells vs. Normalized Feature Measurements")
plt.xlabel("Feature Measurements (Normalized)")
plt.ylabel("Cluster")
plt.xticks(rotation=45)
plt.show()
What a mess!#
Let’s filter this, and rename some of the features so they do not clutter up our reading of the heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
# Set threshold for filtering features
threshold = 2  # Ignore features whose values stay between -2 and 2 in every cluster
# Add cluster labels to the normalized DataFrame
normalized_df["Cluster"] = df["UUID"].map(cluster_results) # Assign clusters to data
# Group by cluster and compute the mean of each normalized feature
cluster_means = normalized_df.groupby("Cluster").mean(numeric_only=True)
# Filter out features that do not exceed the threshold in any cluster
filtered_features = cluster_means.columns[
(cluster_means.max(axis=0) > threshold) | (cluster_means.min(axis=0) < -threshold)
]
cluster_means_filtered = cluster_means[filtered_features] # Keep only these features
# Function to clean feature names
def clean_feature_name(name):
    name = name.replace(": Mean", "")  # Remove ": Mean" suffix
    name = name.replace("Nucleus", "N").replace("Cytoplasm", "C").replace("Membrane", "M")  # Abbreviate
    return name
# Apply cleaning function to feature names
cleaned_feature_names = [clean_feature_name(name) for name in cluster_means_filtered.columns]
# Plot heatmap with filtered features
plt.figure(figsize=(12, 6))
ax = sns.heatmap(cluster_means_filtered, cmap="coolwarm", annot=False) # Set annot=True to show values
ax.set_xticklabels(cleaned_feature_names, rotation=45, ha="right") # Align right for better readability
plt.title("Heatmap of Clustered Cells vs. Normalized Feature Measurements")
plt.xlabel("Feature Measurements (Filtered)")
plt.ylabel("Cluster")
plt.show()
Patterns!#
The repeated patterns here suggest that we did not need all of the nuclear, cytoplasmic, and whole-cell measurements, as the same channels show the same patterns for each cluster in each cell compartment!
What are the clusters without any extreme measurements? What changes with Cluster 9 if we set the threshold to 1.8?
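A quick starting point for exploring those questions is to compare cluster sizes against the filtered heatmap - tiny clusters with no extreme features are prime suspects for artifacts:

# How many cells fell into each cluster?
print(normalized_df["Cluster"].value_counts().sort_index())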