Capricorn: a multi-view diffusion model for Hi-C contact matrix resolution enhancement

Paper: Capricorn: a multi-view diffusion model for Hi-C contact matrix resolution enhancement

Introduction

High-throughput Chromosome Conformation Capture (Hi-C) technology has revolutionized our understanding of the 3D architecture of the genome.

However, a significant challenge in utilizing Hi-C data is its requirement for high-coverage sequencing to achieve the resolution necessary for detecting fine-scale chromatin structures, such as chromatin loops. High-coverage Hi-C datasets demand extensive sequencing efforts, making them costly and time-consuming to produce.

  • According to Rao et al. (2014), the resolution of a Hi-C map is the smallest bin size at which 80% of loci have at least 1,000 contacts.
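The 80% / 1,000-contact criterion can be checked for one binned matrix with a few lines of NumPy. This is a minimal sketch: the function name is hypothetical, and counting contacts per locus as row sums of a single symmetric intrachromosomal matrix is an assumption made for illustration.

```python
import numpy as np

def meets_resolution_criterion(contact_matrix, min_contacts=1000, min_fraction=0.8):
    """Check the 80% / 1,000-contact rule for one bin size.

    Assumption: contacts per locus are the row sums of a symmetric matrix.
    Returns (passes, fraction_of_loci_with_enough_contacts).
    """
    m = np.asarray(contact_matrix, dtype=float)
    per_locus = m.sum(axis=1)                       # total contacts touching each bin
    fraction = float(np.mean(per_locus >= min_contacts))
    return fraction >= min_fraction, fraction
```

To estimate the resolution of a dataset, one would bin the same reads at several bin sizes and report the smallest size for which the check passes.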

This limitation has prompted the development of computational methods aimed at enhancing the resolution of low-coverage Hi-C data, thereby enabling more detailed genomic analyses at reduced costs.

Problem Setting

For a set of experimental cell lines $C$, interaction frequency matrices are available for chromosomes $\Xi$. The dataset consists of pairs of low- and high-coverage contact matrices $D = \left\{ (X^{(c)}_{\xi}, Y^{(c)}_{\xi}) \right\}$ for each cell line $c \in C$ and chromosome $\xi \in \Xi$, all at high resolution.

The goal is to develop a model $f$ such that $Y^{(c)}_{\xi} \approx f(X^{(c)}_{\xi})$ for all cell lines $c$ and chromosomes $\xi$. The model $f$ should also generalize well to new cell lines and different chromosomes.

Constraints

  • The focus is on enhancing intrachromosomal contacts within a 2 Mb range.
  • Each chromosome $\xi$ has $N_{\xi}$ total base pairs.
  • Fixed resolution $\Delta$, resulting in contact matrix shapes of $N_{\xi}/\Delta \times N_{\xi}/\Delta$.
  • The high-coverage version $Y^{(c)}_{\xi}$ contains $\lambda$-fold more contacts than its low-coverage counterpart.

Matrix Details

  • Each entry $X^{(c)}_{\xi}[i, j]$ represents a $\Delta \times \Delta = 10\,\text{kb} \times 10\,\text{kb}$ pixel of genomic interactions between the genomic regions $[10i\,\text{kb}, 10(i + 1)\,\text{kb})$ and $[10j\,\text{kb}, 10(j + 1)\,\text{kb})$, with $\Delta = 10\,\text{kb}$.
  • The matrices are tiled into $40 \times 40$ non-overlapping submatrices $X^{(c)}_{\xi}[i : i + 40, j : j + 40]$, each covering a $400\,\text{kb} \times 400\,\text{kb}$ region, aligning with existing Hi-C resolution enhancement frameworks.
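The tiling and the 2 Mb restriction can be sketched as follows. The function name and the exact boundary rule (a tile is kept when its row and column offsets differ by at most 200 bins, i.e. 2 Mb at 10 kb) are assumptions; the reference pipelines may treat the boundary and trailing bins differently.

```python
import numpy as np

def tile_near_diagonal(matrix, tile=40, max_bin_distance=200):
    """Split a square matrix into non-overlapping tile x tile blocks and keep
    the blocks whose offset from the diagonal is within max_bin_distance bins.

    Assumption: trailing bins that do not fill a whole tile are dropped.
    Returns (tiles, coords), where coords holds the top-left (i, j) of each tile.
    """
    m = np.asarray(matrix)
    n_tiles = m.shape[0] // tile
    tiles, coords = [], []
    for a in range(n_tiles):
        for b in range(n_tiles):
            i, j = a * tile, b * tile
            if abs(i - j) <= max_bin_distance:
                tiles.append(m[i:i + tile, j:j + tile])
                coords.append((i, j))
    if not tiles:
        return np.empty((0, tile, tile), dtype=m.dtype), coords
    return np.stack(tiles), coords
```

Keeping the coordinates alongside the tiles makes it possible to place enhanced tiles back into a full-size matrix after inference.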

Capricorn Model Overview

Capricorn is a machine learning model designed to enhance the resolution of Hi-C contact matrices, focusing on the accurate detection and representation of chromatin loops. Its approach diverges from traditional image resolution enhancement techniques by incorporating biological insights into the model and by using a diffusion probability model as its backbone. This section covers the methodology, the multi-view inputs, and how the diffusion backbone is adapted to Hi-C data.

Methodology

Capricorn addresses the challenge of enhancing low-coverage Hi-C data by integrating small-scale chromatin features (such as TADs and loops) as additional inputs, alongside the primary Hi-C contact matrix. This multi-view approach allows Capricorn to generate high-resolution contact matrices that more accurately reflect the underlying biological structures.

  1. Input Data Preparation: The model takes a low-coverage Hi-C contact matrix and enriches it with derived features, including:

    • Distance-corrected matrices: Adjust for bias based on inter-locus distance, normalizing contacts based on expected distances to handle the diagonal dominance in the contact matrices. The expected matrix $E(X)$ is computed by averaging contacts along each diagonal, and the observed matrix is normalized accordingly.
    • TAD scores: Utilize insulation scores (IS) for TAD detection, identifying insulated regions with low scores and within-TAD regions with high scores. The IS scores are smoothed over a 21-bin domain, and labels are assigned based on monotonically decreasing IS centered around some locus.
    • Loop p-value and ratio: Follow the Hi-C Computational Unbiased Peak Search (HiCCUPS) algorithm for loop detection, assessing whether measured contacts are significantly more frequent than expected. This involves combining a 10x10 donut kernel with other kernels centered at a specific locus and computing the loop ratio and p-values based on a distance-based expected matrix.
  2. Diffusion Model Backbone:
    Capricorn leverages conditional diffusion probability models, treating the combined contact matrix and its derived chromatin feature views as a multi-channel image. We use the conditional diffusion probability model Imagen as the resolution enhancement backbone model, updating the model to condition on low-coverage contact matrices rather than text. We choose Imagen rather than other image diffusion models due to its efficient U-Net architecture, which is faster and more memory efficient than other diffusion generators. This approach allows Capricorn to generate high-resolution contact matrices by effectively incorporating both the primary Hi-C matrix and additional biological views (TADs, loops, distance-normalized counts).

  3. Loss Function in Capricorn Model: In the Capricorn model, the inputs are derived directly from the low-coverage experimental matrix $X$, ensuring that there’s no prior knowledge of the high-coverage contact matrix $Y$. This approach maintains Capricorn’s utility in practical inference scenarios. The inputs and outputs are structured as follows:

    • Input: $$\widetilde{X}(X) = [X, X^{(oe)}, X^{(tad)}, X^{(loop-p)}, X^{(loop-r)}] \in \mathbb{R}^{5 \times N/\Delta \times N/\Delta}$$
    • Output: $$\widetilde{Y}(Y) = [Y, Y^{(oe)}, Y^{(tad)}, Y^{(loop-p)}, Y^{(loop-r)}] \in \mathbb{R}^{5 \times N/\Delta \times N/\Delta}$$

    To avoid the dominance of any single experimental view during model training, an initial exploratory version of Capricorn is trained end-to-end. The validation set loss, $L'_{val}(\widehat{\widetilde{Y}} \mid \widetilde{X})$, is then assessed to evaluate the contribution of each view to the overall loss. This analysis leads to the creation of a weight vector $\omega \in \mathbb{R}_+^5$, designed to balance each view’s contribution to the loss function. Consequently, during the full model training, each channel is weighted by $\sqrt{\omega}$ to ensure an equitable contribution to the input and output, optimizing the model’s performance and utility.

  4. Training and Inference: During training, Capricorn learns to predict the high-resolution contact matrix from its low-resolution counterpart and the additional chromatin features. The model is trained to minimize the mean squared error between its predictions and the ground truth high-resolution matrices, with adjustments to ensure that all input views contribute equally to the loss.
    (Figure: training procedure)
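Two pieces of the list above lend themselves to a short NumPy sketch: the distance-corrected (observed/expected) view, and the effect of scaling channels by $\sqrt{\omega}$ before a mean squared error loss. The function names and the `eps` guard are illustrative assumptions, not the authors' code. Because a squared error scales with the square of the channel scale, multiplying a channel by $\sqrt{\omega_k}$ weights its loss term by $\omega_k$.

```python
import numpy as np

def observed_over_expected(x, eps=1e-8):
    """Distance-normalize a square contact matrix.

    The expected value at genomic distance d is the mean of the entries with
    |i - j| = d, as described in the text. eps (an assumption) avoids
    division by zero on empty diagonals.
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    i, j = np.indices((n, n))
    dist = np.abs(i - j)
    sums = np.bincount(dist.ravel(), weights=x.ravel(), minlength=n)
    counts = np.bincount(dist.ravel(), minlength=n)
    expected = (sums / counts)[dist]        # broadcast each diagonal mean back to 2D
    return x / (expected + eps)

def weighted_channel_mse(pred, target, omega):
    """MSE over (channels, H, W) arrays after scaling channel k by sqrt(omega[k])."""
    w = np.sqrt(np.asarray(omega, dtype=float))[:, None, None]
    return float(np.mean((w * pred - w * target) ** 2))
```

With equally sized channels, `weighted_channel_mse` equals the mean over channels of `omega[k]` times the per-channel MSE, which is why the square root appears in the channel weighting.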

Performance Measures

Capricorn’s performance is evaluated using both image-based and biologically motivated metrics. The model’s predicted high-coverage contact matrix is compared against the true high-coverage matrix using mean squared error (MSE) and loop F1 score. The loop F1 score is calculated based on the number of correctly predicted loops, with a tolerance for positional discrepancies, to assess the model’s ability to accurately capture biologically relevant chromatin structures.

Loop F1 score: $$ F1 = \frac{TP}{TP+\frac{1}{2}(FP+FN) } $$ with a 5-pixel (50 kb) tolerance range.

The model’s performance has been validated across various cell lines and chromosomes, showcasing its general applicability and potential for uncovering new insights into genome architecture.
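A tolerance-aware loop F1 score can be sketched as below. The matching rule is an assumption made for illustration: each predicted loop is greedily paired with the nearest still-unmatched true loop whose row and column offsets are both within the tolerance. The source states only the formula and the 5-pixel tolerance, so the actual evaluation may match loops differently.

```python
def loop_f1(pred_loops, true_loops, tolerance=5):
    """F1 between two lists of (i, j) loop positions, in bins.

    Assumption: greedy one-to-one matching; a prediction counts as a true
    positive if an unmatched true loop lies within `tolerance` bins on both axes.
    """
    pred = [tuple(p) for p in pred_loops]
    unmatched_true = [tuple(t) for t in true_loops]
    tp = 0
    for pi, pj in pred:
        best_index, best_dist = None, None
        for k, (ti, tj) in enumerate(unmatched_true):
            d = max(abs(pi - ti), abs(pj - tj))
            if d <= tolerance and (best_dist is None or d < best_dist):
                best_index, best_dist = k, d
        if best_index is not None:
            unmatched_true.pop(best_index)   # each true loop is matched at most once
            tp += 1
    fp = len(pred) - tp
    fn = len(unmatched_true)
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0
```

For example, with predictions `[(10, 20), (50, 60), (100, 130)]` and true loops `[(12, 18), (50, 66), (200, 210)]`, only the first prediction is within 5 bins of a true loop, giving TP = 1, FP = 2, FN = 2 and an F1 of 1/3.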

Hi-C data preprocessing

  1. Dataset: The GM12878 Epstein-Barr virus-infected human lymphoblastoid cell line and the K562 human chronic myelogenous leukemia lymphoblast cell line, both from the Rao et al. (2014) dataset (accession code GSE63525).
  2. Preprocessing: The contact matrices are restricted to reads with mapping quality ≥ 30 and processed to 10 kilobase (kb) resolution, following previous work. We adopt the contact matrix preprocessing techniques from HiCARN and DeepHiC, including tiling the contact matrices into 40×40 submatrices, only retaining submatrices in the 2 megabase (Mb) region around the diagonal, clamping the high-coverage matrix to [0, 255] and then normalizing to [0, 1], and clamping the low-coverage matrix to [0, 100] and then normalizing to [0, 1].
  3. Downsample: In order to simulate low-coverage data, we randomly downsampled the GM12878 and K562 cell line Hi-C matrices to 1/16 of the original read count.
  4. Train/test split: We run two cross-cell-line experiments. In the first, we train on the GM12878 data and test the model on K562; in the second, we train on K562 data and test on GM12878. In both experiments, we withhold chromosomes 4, 5, 11, and 14 from the training cell line as our validation set.
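The downsampling and clamping steps above can be sketched as follows. Two assumptions are made here: the binned counts are thinned with a binomial draw (each contact is kept independently with probability 1/16), which approximates rather than reproduces read-level subsampling, and the function names are hypothetical. The clamp limits 255 and 100 come from the preprocessing description.

```python
import numpy as np

def downsample_counts(counts, ratio=16, seed=0):
    """Simulate low coverage by keeping each contact with probability 1/ratio.

    Assumption: binomial thinning of a symmetric integer count matrix. Only
    the upper triangle is thinned, then mirrored, so the result stays symmetric.
    """
    rng = np.random.default_rng(seed)
    upper = np.triu(np.asarray(counts, dtype=np.int64))
    thinned = rng.binomial(upper, 1.0 / ratio)
    return thinned + np.triu(thinned, k=1).T

def clamp_and_scale(matrix, max_value):
    """Clamp values to [0, max_value], then rescale to [0, 1]."""
    m = np.asarray(matrix, dtype=float)
    return np.clip(m, 0.0, max_value) / max_value
```

Under these assumptions, a training pair would be formed as `clamp_and_scale(downsample_counts(counts), 100.0)` for the low-coverage input and `clamp_and_scale(counts, 255.0)` for the high-coverage target.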

Results

Capricorn accurately enhances contact matrices and loop features

Capricorn improved both the enhanced Hi-C contact matrices and the chromatin loops identified from them. This was evident in cross-cell-line experiments, where Capricorn was trained on simulated low-coverage data from one cell line and tested on another. Across the GM12878 and K562 experiments, Capricorn achieved a higher average loop F1 score and a significantly lower mean squared error (MSE) than the comparison methods, indicating that its output more closely reflects the true high-coverage interactions.

Small-scale chromatin features are critical to model improvement and are model-agnostic

Further analysis revealed that incorporating small-scale chromatin features (e.g., TADs and loops) as additional inputs, and training the model to enhance those features in addition to the Hi-C matrices, significantly increased Capricorn’s performance. This approach improved not only the resolution enhancement task but also the model’s ability to identify structurally meaningful contacts from the enhanced matrices. Three settings were compared:

  1. Primary view: This setting uses the comparison models’ default resolution enhancement pipelines with the low- and high-coverage Hi-C matrices as input and output.
  2. Five-view input only: This setting uses the additional chromatin feature views as input to the model, but is still trained to only predict the high-coverage view.
  3. Full five-view model: This setting uses Capricorn’s complete multi-view setting, including all five chromatin feature views as input and training the model to enhance the small-scale chromatin features in addition to the Hi-C matrices.

Capricorn generalizes across chromosomes

A rigorous testing regime confirmed Capricorn’s ability to generalize its learning across different chromosomes and cell lines, underscoring its robustness. The model effectively transferred learned patterns to new genomic loci, outperforming comparison approaches in identifying accurate chromatin loops, further evidenced by cross-chromosome and intra-cell-line generalizations.


Discussion and Future Directions

Capricorn represents a significant advancement in the field of Hi-C resolution enhancement by explicitly modeling the biological underpinnings of contact matrices through the incorporation of small-scale chromatin features. This approach has not only demonstrated Capricorn’s superiority in enhancing Hi-C data resolution but also highlighted its applicability to a broader range of resolution enhancement problems beyond its initial scope.

Future Directions

Looking forward, the Capricorn framework offers ample room for expansion to incorporate further biological views tailored to specific downstream applications. For instance, integrating structural information covering larger genomic loci could enhance the identification of TADs and A/B compartments, thereby broadening the model’s applicability and utility in genomic research.

Additionally, exploring the impact of varying model backbones on the multi-view resolution enhancement problem presents an intriguing avenue for future research. This could involve assessing the model’s adaptability to different genomic datasets, including those from other species, thereby extending its utility to a cross-species transfer learning context.

Moreover, applying Capricorn to a wider array of Hi-C cell line data and investigating the effects of training data size could yield insights into optimizing the model’s performance. Furthermore, the potential to adapt Capricorn for use with other types of contact map data, such as micro-C, suggests opportunities for broadening the model’s applicability within genomic research.

Lastly, future studies might explore the performance of Capricorn across various loop calling methods, enhancing the model’s capacity to accurately identify biologically relevant chromatin structures. This would not only solidify Capricorn’s standing as a robust tool for Hi-C resolution enhancement but also contribute to our understanding of the complex architecture of the genome.