TCGA Lung Cancer Tutorial

This tutorial shows how to use Deep Latent Variable Path Modelling (DLVPM) to integrate multiple data types in a lung cancer dataset derived from The Cancer Genome Atlas (TCGA). The example links five modalities: histological image features, RNA‑seq, methylation, miRNA and somatic mutation data. Each view is encoded by a small residual fully connected network, and the views are connected via a five‑factor structural path model.

Prerequisites

Ensure that deep_lvpm is installed as described on the installation page. The tutorial uses the small sample datasets bundled with the package under deep_lvpm.data and can be run on CPU.
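Before running the tutorial, it can help to confirm that the required packages are importable in the active environment. The sketch below uses only the standard library; `is_installed` is a hypothetical helper introduced here for illustration and is not part of deep_lvpm.

```python
from importlib import util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return util.find_spec(package) is not None


# Report which of the tutorial's dependencies are visible to this interpreter
for name in ("numpy", "tensorflow", "deep_lvpm"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If any entry is reported as missing, install it before continuing.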

1. Load the multi‑omics dataset

We start by importing the necessary dependencies and loading the training data. The data files are packaged as NumPy archives inside the deep_lvpm.data module.

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers, optimizers
from importlib import resources

import deep_lvpm as DLVPM
from deep_lvpm.models.StructuralModel import StructuralModel

# Run tf.function code as compiled graphs rather than eagerly
# (this is the TensorFlow default; graph execution is faster)
tf.config.run_functions_eagerly(False)

# Load the training arrays (preserve order!)
with resources.as_file(resources.files("deep_lvpm.data") / "Lung_multiomics_sample_train.npz") as f:
    arrays = np.load(f)
    rnaseq      = arrays["rnaseq"]
    snv         = arrays["snv"]
    methylation = arrays["methylation"]
    mirna       = arrays["mirna"]
    histo20     = arrays["histo20"]

# Assemble the views in a fixed order; model_list below must follow the same order
X_arr = [histo20, rnaseq, methylation, mirna, snv]
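Because the views are matched row by row, every array should contain the same samples in the same order. The sketch below is one way to check this before training. It assumes each view is a 2-D array of shape (n_samples, n_features); `check_views` is a hypothetical helper, and the arrays are synthetic stand-ins so that the block runs on its own.

```python
import numpy as np


def check_views(views: dict) -> int:
    """Check that every view is a finite 2-D array with the same number of rows.

    Returns the shared number of samples, or raises ValueError.
    """
    if not views:
        raise ValueError("no views given")
    n_samples = None
    for name, arr in views.items():
        arr = np.asarray(arr)
        if arr.ndim != 2:
            raise ValueError(f"{name}: expected a 2-D array, got shape {arr.shape}")
        if not np.isfinite(arr).all():
            raise ValueError(f"{name}: contains NaN or infinite values")
        if n_samples is None:
            n_samples = arr.shape[0]
        elif arr.shape[0] != n_samples:
            raise ValueError(f"{name}: {arr.shape[0]} rows, expected {n_samples}")
    return n_samples


# Synthetic stand-ins for the tutorial arrays (shapes are illustrative only)
rng = np.random.default_rng(0)
demo_views = {
    "histo20": rng.normal(size=(8, 4)),
    "rnaseq": rng.normal(size=(8, 6)),
    "snv": rng.integers(0, 2, size=(8, 3)).astype("float32"),
}
print(check_views(demo_views))  # 8
```

With the tutorial data, the same check would be called as `check_views({"histo20": histo20, "rnaseq": rnaseq, "methylation": methylation, "mirna": mirna, "snv": snv})`.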

2. Define measurement models

Each data type is processed by a simple fully connected residual block. You can experiment with different architectures or hyperparameters.

def residual_block(
    input_dim: int,
    kernel_reg_l1: float = 0.01,
    kernel_reg_l2: float = 0.01,
    dropout_rate: float = 0.5,
    name: str = "residual_block",
) -> tf.keras.Model:
    """
    Builds a simple fully connected residual block.

    Parameters
    ----------
    input_dim : int
        Number of features in the (flat) input vector.
    kernel_reg_l1 : float
        L1 regularisation factor for dense layers.
    kernel_reg_l2 : float
        L2 regularisation factor for dense layers.
    dropout_rate : float
        Dropout rate applied after the residual connection.
    name : str
        Name for the returned Keras model.
    """
    inputs = keras.Input(shape=(input_dim,), name=f"{name}_in")

    x = layers.Dense(
        input_dim,
        activation="linear",
        kernel_initializer=keras.initializers.Identity(),
        kernel_regularizer=regularizers.l1_l2(l1=kernel_reg_l1, l2=kernel_reg_l2),
        name=f"{name}_dense1",
    )(inputs)
    x = layers.BatchNormalization(name=f"{name}_bn")(x)
    x = layers.ReLU(name=f"{name}_relu")(x)
    x = layers.Dense(
        input_dim,
        activation="linear",
        kernel_initializer=keras.initializers.Identity(),
        kernel_regularizer=regularizers.l1_l2(l1=kernel_reg_l1, l2=kernel_reg_l2),
        name=f"{name}_dense2",
    )(x)
    x = layers.Add(name=f"{name}_add")([inputs, x])
    x = layers.Dropout(dropout_rate, name=f"{name}_drop")(x)

    return keras.Model(inputs=inputs, outputs=x, name=name)

# Create an encoder for each modality
model_list = [
    residual_block(histo20.shape[1], name="histo20_enc"),
    residual_block(rnaseq.shape[1],  name="rnaseq_enc"),
    residual_block(methylation.shape[1], name="meth_enc"),
    residual_block(mirna.shape[1],   name="mirna_enc"),
    residual_block(snv.shape[1],     name="snv_enc"),
]
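To make the identity initialisation concrete, the following NumPy sketch reproduces what the block computes before any training step, in inference mode. It is an illustration under stated assumptions, not the Keras implementation: dense kernels equal the identity, biases are zero, batch normalisation uses the Keras defaults (moving mean 0, moving variance 1, epsilon 1e-3, scale 1, offset 0), and dropout is inactive. `residual_block_at_init` is a hypothetical name.

```python
import numpy as np


def residual_block_at_init(x: np.ndarray, bn_epsilon: float = 1e-3) -> np.ndarray:
    """Forward pass of the residual block at initialisation (inference mode)."""
    x = np.asarray(x, dtype="float64")
    identity = np.eye(x.shape[1])
    h = x @ identity                    # dense1 with identity kernel, zero bias
    h = h / np.sqrt(1.0 + bn_epsilon)   # batch norm with mean 0 and variance 1
    h = np.maximum(h, 0.0)              # ReLU
    h = h @ identity                    # dense2 with identity kernel, zero bias
    return x + h                        # skip connection; dropout is a no-op here


x = np.array([[-1.0, 2.0, 0.5]])
print(residual_block_at_init(x))
```

Under these assumptions, negative inputs pass through unchanged (the ReLU branch contributes zero) and positive inputs are roughly doubled, so the untrained encoder starts close to a simple rescaling of its input.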

3. Specify the structural path matrix

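For this example we use a five‑factor model. The matrix below defines which latent factors are linked: a 1 in row i, column j connects factor i to factor j. As written, the matrix is symmetric and star‑shaped, with the second factor acting as a hub that is linked to the other four.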

import numpy as np

ndims = 5  # number of latent factors

Path = np.array([
    # F1 F2 F3 F4 F5
    [0, 1, 0, 0, 0],  # F1 ← F2
    [1, 0, 1, 1, 1],  # F2 ← F1,F3,F4,F5
    [0, 1, 0, 0, 0],  # F3 ← F2
    [0, 1, 0, 0, 0],  # F4 ← F2
    [0, 1, 0, 0, 0],  # F5 ← F2
], dtype="float32")

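batch_size  = 256
epochs      = 300
# Keras runs ceil(n_samples / batch_size) steps per epoch; rounding up also
# keeps total_steps positive when the dataset is smaller than one batch
steps_per_epoch = int(np.ceil(rnaseq.shape[0] / batch_size))
total_steps     = steps_per_epoch * epochs

# Exponential decay from init_lr to final_lr over the whole training run
init_lr, final_lr = 1e-4, 1e-5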
lr_schedule = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=init_lr,
    decay_steps=total_steps,
    decay_rate=final_lr / init_lr,
    staircase=False,
)

# Total number of samples (needed by DLVPM for normalisation)
tot_num = rnaseq.shape[0]
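Two quick checks on the settings in this section can be sketched as follows. The first lists the linked pairs in the path matrix (copied from above) and tests whether it is symmetric. The second evaluates the closed form of an exponential decay schedule without staircase, `init_lr * decay_rate ** (step / decay_steps)`, to confirm that the learning rate ends at `final_lr`. The value of `total_steps` is illustrative, since the real value depends on the dataset size, and `exponential_decay` is a hypothetical helper that mirrors the formula rather than calling TensorFlow.

```python
import numpy as np

# Path matrix copied from the tutorial
path = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype="float32")

# Symmetry check and list of linked pairs (upper triangle, 0-based indices)
is_symmetric = bool(np.array_equal(path, path.T))
edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(np.triu(path)))]
print("symmetric:", is_symmetric)
print("linked pairs:", edges)


def exponential_decay(step, init_lr, decay_rate, decay_steps):
    """Closed form of a continuous (non-staircase) exponential decay schedule."""
    return init_lr * decay_rate ** (step / decay_steps)


init_lr, final_lr = 1e-4, 1e-5
total_steps = 600  # illustrative value only
for step in (0, total_steps // 2, total_steps):
    lr = exponential_decay(step, init_lr, final_lr / init_lr, total_steps)
    print(f"step {step}: lr = {lr:.2e}")
```

Setting `decay_rate` to `final_lr / init_lr` and `decay_steps` to the total number of training steps is what makes the schedule land on `final_lr` at the last step.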

4. Build and compile the model

We create a StructuralModel instance and provide regularisers for the projection layers. We then compile it with a list of optimisers, one per view.

from tensorflow.keras import regularizers

# Regularisers applied to each projection layer
regularizer_list = [
    regularizers.L1L2(l1=0.01, l2=0.01),
    regularizers.L1L2(l1=0.01, l2=0.01),
    regularizers.L1L2(l1=0.01, l2=0.01),
    regularizers.L1L2(l1=0.01, l2=0.01),
    regularizers.L1L2(l1=0.01, l2=0.01),
]

# Build the structural model
DLVPM_Structural_instance = StructuralModel(
    Path,
    model_list,
    regularizer_list,
    tot_num,
    ndims,
    momentum=0.95,
    epsilon=0.001,
    orthogonalization="Moore-Penrose",
)

# One optimizer per measurement model using the decaying learning rate
opt_list = [
    optimizers.Adam(learning_rate=lr_schedule) for _ in model_list
]

# Compile the model
DLVPM_Structural_instance.compile(optimizer=opt_list)

5. Train and evaluate

Training proceeds with the standard Keras fit interface. The evaluate method returns both the mean squared error and the mean correlation between connected data types.

# Train the model on the training data
DLVPM_Structural_instance.fit(
    X_arr,
    batch_size=batch_size,
    epochs=epochs,
    verbose=True,
)

# Evaluate on the training data; index 1 of the result holds the mean correlation
train_metrics = DLVPM_Structural_instance.evaluate(X_arr)
print(f"Mean correlation on training data: r={train_metrics[1]:.3f}")

6. Evaluate on the test set

We load the separate test dataset and compute the mean correlation of the learned deep latent variables (DLVs) between connected data types.

# Load the independent test dataset
with resources.as_file(resources.files("deep_lvpm.data") / "Lung_multiomics_sample_test.npz") as f:
    arrays = np.load(f)
    rnaseq_test      = arrays["rnaseq"]
    snv_test         = arrays["snv"]
    methylation_test = arrays["methylation"]
    mirna_test       = arrays["mirna"]
    histo20_test     = arrays["histo20"]

X_arr_test = [histo20_test, rnaseq_test, methylation_test, mirna_test, snv_test]

# Index 1 of the result holds the mean correlation
test_metrics = DLVPM_Structural_instance.evaluate(X_arr_test)
print(f"Mean correlation on test data: r={test_metrics[1]:.3f}")

7. Inspect the learned latent variables

To extract the latent factors for each view, call predict. This returns a tensor with shape (n_samples, ndims, n_views).

test_DLVs = DLVPM_Structural_instance.predict(X_arr_test)

# Correlation matrix of the first latent factor across views
corr_first = np.corrcoef(test_DLVs[:, 0, :].T)
print("Correlation matrix for latent factor 1:", corr_first)

# Correlation matrix of the second latent factor
corr_second = np.corrcoef(test_DLVs[:, 1, :].T)
print("Correlation matrix for latent factor 2:", corr_second)
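The per-factor correlation matrices can also be summarised as a single number per latent dimension by averaging over the linked view pairs only. The sketch below shows one way to do this. It assumes the (n_samples, ndims, n_views) layout described above and a path matrix indexed by view. `mean_linked_correlation` is a hypothetical helper and is not claimed to match how the library's evaluate method computes its metric. Synthetic data stands in for the output of predict.

```python
import numpy as np


def mean_linked_correlation(dlvs: np.ndarray, path: np.ndarray) -> np.ndarray:
    """Mean Pearson correlation over linked view pairs, one value per latent dimension.

    dlvs : array of shape (n_samples, ndims, n_views)
    path : array of shape (n_views, n_views); nonzero entries mark linked views
    """
    ndims = dlvs.shape[1]
    # Treat a link in either direction as one pair and count each pair once
    linked = np.triu((path + path.T) > 0, k=1)
    result = np.empty(ndims)
    for d in range(ndims):
        corr = np.corrcoef(dlvs[:, d, :].T)  # (n_views, n_views)
        result[d] = corr[linked].mean()
    return result


# Synthetic stand-in: three views that share a common signal plus noise
rng = np.random.default_rng(0)
n_samples, ndims, n_views = 200, 2, 3
shared = rng.normal(size=(n_samples, ndims, 1))
noise = rng.normal(size=(n_samples, ndims, n_views))
demo_dlvs = shared + 0.5 * noise
demo_path = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype="float32")

print(mean_linked_correlation(demo_dlvs, demo_path))
```

With the tutorial objects, the call would be `mean_linked_correlation(np.asarray(test_DLVs), Path)`. Values that drop sharply from the first latent dimension to later ones suggest that the later dimensions capture weaker shared structure.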

8. Save the model

Finally, save your trained model to disk in the .keras format:

DLVPM_Structural_instance.save("/path/to/output_folder/DLVPM_Model.keras")

This tutorial illustrates how DLVPM can be applied to real multi‑omics data. You can extend this example by changing the measurement models, experimenting with different regularisation schemes, or altering the structural path matrix to test different hypotheses about cross‑modal relationships.
