In 2022, the SIAM Conference on Mathematics of Data Science took place in San Diego, CA. Anna Konstorum (Yale U.), Misha Kilmer (Tufts U.), and Shuchin Aeron (Tufts U.) organized a minisymposium that brought together applied mathematicians and bioinformatics scientists to explore how factorization methods are used in bioinformatics studies. The slides shared here have been made available by the speakers.

Abstract


The volume and complexity of bioinformatics datasets have increased to include temporal and spatial resolution, as well as multiple distinct feature sets that can encompass >100K dimensions. This makes it imperative to develop frameworks and algorithms that can help to structure such datasets and extract clinically and biologically relevant patterns from them. To this end, interpretable matrix, joint-matrix, and tensor decompositions, along with efficient large-scale implementations, are coming to the fore as frameworks and methods capable of addressing this challenge. This minisymposium will focus on recent developments and applications of factorization strategies to high-dimensional bioinformatics datasets.

Talks


Pattern Discovery in Time-Course Omics Data Using Non-Negative CP Tensor Decomposition (NCPD)

Speaker: Shoaib Bin Masud (Tufts University)

Slides


Abstract. Datasets associated with bioinformatics studies that include time-course analysis of omics data naturally lend themselves to a 3-mode tensor structure of features-by-subject-by-time. A non-negative CANDECOMP/PARAFAC (CP) decomposition (NCPD) of such datasets can reveal temporal patterns of feature (such as gene) expression that are associated with different subject groups, and these patterns can be assessed for association with clinical features. We show the application of an NCPD pipeline to reveal novel structure and biological observations from studies profiling the immune response to the influenza (flu) and Bordetella pertussis pathogens.
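To make the features-by-subject-by-time structure concrete, here is a minimal sketch of a non-negative CP decomposition using generic multiplicative updates on the Frobenius loss. It is an illustration under stated assumptions, not the speaker's pipeline: the function names (`khatri_rao`, `ncpd`), the update rule, the rank, and the synthetic data are all choices made for this example.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J x R), (K x R) -> (J*K x R)."""
    rank = B.shape[1]
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, rank)

def ncpd(X, rank, n_iter=500, eps=1e-12, seed=0):
    """Non-negative CP of a 3-mode tensor via multiplicative updates.

    Hypothetical helper for illustration; minimizes the Frobenius loss.
    Returns one non-negative factor matrix per mode.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Unfold X so rows index `mode` and columns the other two modes.
            Xn = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
            others = [factors[m] for m in range(3) if m != mode]
            KR = khatri_rao(others[0], others[1])
            A = factors[mode]
            # Multiplicative update keeps every entry non-negative.
            A *= (Xn @ KR) / (A @ (KR.T @ KR) + eps)
    return factors

# Synthetic example (assumed data): 30 features, 8 subjects, 5 time points.
rng = np.random.default_rng(1)
features = rng.random((30, 2))
subjects = np.zeros((8, 2))
subjects[:4, 0] = 1.0   # subject group 1 follows component 0
subjects[4:, 1] = 1.0   # subject group 2 follows component 1
times = np.stack([np.linspace(0.1, 1.0, 5),    # rising temporal pattern
                  np.linspace(1.0, 0.1, 5)],   # falling temporal pattern
                 axis=1)
X = np.einsum("ir,jr,kr->ijk", features, subjects, times)

A, B, C = ncpd(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this toy setup, the columns of `C` play the role of temporal patterns, the columns of `B` indicate which subjects express each pattern, and the columns of `A` weight the features that contribute to it.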

We further consider an extension of NCPD that can capture the underlying geometry of the data by using a Wasserstein distance in lieu of the Frobenius distance, which is used in the ALS and OPT objective functions of the original NCPD. We compare the performance of NCPD-F (NCPD with the Frobenius distance) to NCPD-W (NCPD with the Wasserstein distance) using both real and synthetic datasets, and discuss ongoing NCPD-W research to improve the quality and robustness of NCPD when applied to bioinformatics datasets.
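One way to see why a Wasserstein distance can capture geometry that the Frobenius distance misses is to compare the two on simple 1-D profiles. The sketch below is only an illustration of the two metrics, not of NCPD-W itself (the abstract does not specify how the Wasserstein term enters the objective). The helper names and the unit-mass "spike" profiles are assumptions made for this example.

```python
import numpy as np

def frobenius_distance(p, q):
    """Euclidean (Frobenius) distance between two profiles."""
    return float(np.linalg.norm(p - q))

def wasserstein1_1d(p, q, dx=1.0):
    """1-Wasserstein distance between two histograms with equal total mass
    on the same uniform 1-D grid, computed from the difference of their CDFs."""
    return float(np.sum(np.abs(np.cumsum(p - q))) * dx)

def spike(center, n=20):
    """Unit-mass profile concentrated at a single time index (toy data)."""
    h = np.zeros(n)
    h[center] = 1.0
    return h

base = spike(5)
for shift in (1, 3, 8):
    moved = spike(5 + shift)
    print(shift,
          round(frobenius_distance(base, moved), 3),
          wasserstein1_1d(base, moved))
```

For these spikes, the Frobenius distance is the same for every nonzero shift, while the Wasserstein distance grows with the size of the shift, so it distinguishes a response that peaks slightly later from one that peaks much later.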


Mechanistic and Data-Driven Dissection of Cell Communication Through Tensor Decompositions

Speaker: Aaron Meyer (UCLA)

Slides


Abstract. Studies of even simple cell responses to their environment are hindered by the multidimensional nature of those responses. For example, a simple receptor-ligand pathway can display differing responses based on timescale, cell type, stimulation, type of response measured, and context. Interrogating and manipulating these systems is thus almost always constrained by an incomplete view of the overall pathway.

Just as principal component analysis uses a low-rank approximation for dimensionality reduction of matrix-structured data, tensor generalizations provide solutions for pattern recognition in data with a higher-dimensional structure. Using several recent and unpublished applications, including engineering cell-type selective IL-2 therapies and serology analysis, I will describe some of the unique benefits of tensor-based analysis and the biological discoveries it has revealed. Specifically, tensor approximations enable more effective dimensionality reduction, separation of dimension-specific effects, and a natural, flexible solution to data integration. Finally, I will discuss some of the reasons tensor-based methods remain limited in their application to molecular biology. Resolving these limitations, and applying tensor methods in a more widespread manner, will help provide a complete view of cellular communication.
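The analogy between principal component analysis and tensor decompositions can be written out as follows. This is a generic illustration using the CP form; the symbols (the data matrix X, the tensor 𝒳 of size I × J × K, the rank R, and the factor vectors) are notation assumed for this example and do not come from the talk.

```latex
% Rank-R matrix approximation (as in PCA / truncated SVD):
X \;\approx\; \sum_{r=1}^{R} \sigma_r \, u_r v_r^{\top},
\qquad X \in \mathbb{R}^{I \times J}

% Rank-R CP approximation of a 3-mode tensor:
\mathcal{X} \;\approx\; \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r,
\qquad
x_{ijk} \;\approx\; \sum_{r=1}^{R} \lambda_r \, a_{ir} \, b_{jr} \, c_{kr}

% Factor entries for a rank-R fit (excluding the weights):
% CP on the I x J x K tensor:              R\,(I + J + K)
% matrix factorization of the I x JK unfolding:  R\,(I + JK)
```

Each CP component has one vector per mode, which is what allows effects to be read off dimension by dimension. The parameter counts show why keeping the tensor structure can give a more compact representation than flattening the data into a matrix first.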

Uncovering the Spatial Landscape of Molecular Interactions Within the Tumor Microenvironment Using Latent Spaces

Speaker: Atul Deshpande (Johns Hopkins University)

Slides


Abstract. Spatial transcriptomics (ST) technologies enable us to measure gene expression in tissue samples while retaining their spatial context. Such spatially resolved data enable in situ resolution of the regulatory pathways in the heterogeneous tumor and its microenvironment (TME). Direct characterization of cellular co-localization using spatial technologies enables quantification of molecular changes caused by direct cell-cell interaction, such as that seen in tumor-immune interactions. Spot-based ST technologies, however, do not measure gene expression in individual cells but rather in groups of 1-10 cells, obscuring the constituent cell types as well as their states. Matrix factorization methods can be used to deconvolve the ST data to infer constituent cell populations and cell activities in each spot. I will discuss how unsupervised nonnegative matrix factorization (NMF) of the ST data can be used to identify spatially resolved latent features that mirror pathologists’ annotations on the tissue samples. I will then describe our work to identify molecular changes and pathways associated with cell-cell interactions under the assumption that spatially overlapping latent features associated with different cell types interact in regions where they overlap. We apply these strategies to infer molecular changes from tumor-immune interactions in ST data from metastases, invasive and precursor lesions, and immunotherapy treatments.
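The idea of factorizing spot-level data into latent features and then looking for spots where features overlap can be sketched on toy data. The example below uses generic multiplicative-update NMF and is not the speaker's method. The `nmf` helper, the spots-by-genes orientation, the synthetic 10 × 10 grid, and the 0.5 overlap threshold are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-12, seed=0):
    """Generic NMF, V ~ W @ H, via multiplicative updates (Frobenius loss).

    Hypothetical helper for illustration.
    V: spots x genes, W: spots x rank (spatial weights),
    H: rank x genes (gene signatures).
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Toy tissue (assumed data): a 10 x 10 grid of spots and 30 genes.
side = 10
cols = np.tile(np.arange(side), side)          # grid column of each spot
W_true = np.stack([(cols <= 5).astype(float),  # feature 0 on the left
                   (cols >= 4).astype(float)], # feature 1 on the right
                  axis=1)                      # both present in columns 4-5
rng = np.random.default_rng(1)
H_true = rng.random((2, 30))
H_true[0, 20:] = 0.0   # last 10 genes are markers of feature 1 only
H_true[1, :10] = 0.0   # first 10 genes are markers of feature 0 only
V = W_true @ H_true

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)

# Flag spots where both latent features are strongly present.
W_scaled = W / W.max(axis=0)
overlap = (W_scaled > 0.5).all(axis=1)
print(f"relative reconstruction error: {rel_err:.3f}")
print(f"spots flagged as overlapping: {int(overlap.sum())}")
```

Under the interaction assumption described in the abstract, spots flagged in `overlap` would be the candidate regions in which to look for molecular changes. The threshold used to define overlap here is arbitrary and would need to be chosen with care on real data.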