TCGA Lung Cancer Tutorial
=========================

This tutorial demonstrates how to use Deep Latent Variable Path Modelling (DLVPM) to integrate multiple data types in a lung cancer dataset derived from The Cancer Genome Atlas (TCGA).  In this example we link five modalities: histological image features, RNA‑seq, methylation, miRNA and somatic mutation data.  Each view is encoded by a small fully connected residual network, and the views are linked by a five‑factor structural path model.

Prerequisites
-------------

Ensure that :mod:`deep_lvpm` is installed as described on the :doc:`installation` page.  The tutorial uses the small sample datasets bundled with the package under ``deep_lvpm.data`` and can be run on CPU.

1. Load the multi‑omics dataset
-------------------------------

We start by importing the necessary dependencies and loading the training data.  The data files are packaged as NumPy archives inside the ``deep_lvpm.data`` module.

.. code-block:: python

   import numpy as np
   import tensorflow as tf
   from tensorflow import keras
   from tensorflow.keras import layers, regularizers, optimizers
   from importlib import resources

   import deep_lvpm as DLVPM
   from deep_lvpm.models.StructuralModel import StructuralModel

   # Run tf.function code as compiled graphs rather than eagerly
   # (this is the TensorFlow default and is faster than eager execution)
   tf.config.run_functions_eagerly(False)

   # Load the training arrays (preserve order!)
   with resources.as_file(resources.files("deep_lvpm.data") / "Lung_multiomics_sample_train.npz") as f:
       arrays = np.load(f)
       rnaseq      = arrays["rnaseq"]
       snv         = arrays["snv"]
       methylation = arrays["methylation"]
       mirna       = arrays["mirna"]
       histo20     = arrays["histo20"]

   # Assemble the list of data arrays; the encoders in model_list and the
   # rows/columns of the path matrix defined later follow this same order
   X_arr = [histo20, rnaseq, methylation, mirna, snv]
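
DLVPM links the views sample by sample, so each array is expected to hold one row per sample and all arrays should have the same number of rows.  The following is a minimal sketch of a shape check that can be run before training.  ``check_views`` is a hypothetical helper (it is not part of ``deep_lvpm``), and the random arrays are stand-ins for the real views.

.. code-block:: python

   import numpy as np

   def check_views(views, names):
       """Return the shared sample count, or raise if the views disagree."""
       for name, view in zip(names, views):
           if view.ndim != 2:
               raise ValueError(f"{name} should be 2-D (samples x features), got {view.ndim}-D")
       counts = {name: view.shape[0] for name, view in zip(names, views)}
       if len(set(counts.values())) != 1:
           raise ValueError(f"Views have different numbers of samples: {counts}")
       return next(iter(counts.values()))

   # Stand-in data: five views with 8 samples each and arbitrary feature counts
   rng = np.random.default_rng(0)
   demo_names = ["histo20", "rnaseq", "methylation", "mirna", "snv"]
   demo_views = [rng.normal(size=(8, d)) for d in (20, 30, 25, 10, 15)]

   n_samples = check_views(demo_views, demo_names)
   print(f"All views share {n_samples} samples")

With the tutorial data, the same call would be ``check_views(X_arr, demo_names)``.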

2. Define measurement models
----------------------------

Each data type is processed by a simple fully connected residual block.  You can experiment with different architectures or hyperparameters.

.. code-block:: python

   def residual_block(
       input_dim: int,
       kernel_reg_l1: float = 0.01,
       kernel_reg_l2: float = 0.01,
       dropout_rate: float = 0.5,
       name: str = "residual_block",
   ) -> tf.keras.Model:
       """
       Builds a simple fully connected residual block.

       Parameters
       ----------
       input_dim : int
           Number of features in the (flat) input vector.
       kernel_reg_l1 : float
           L1 regularisation factor for dense layers.
       kernel_reg_l2 : float
           L2 regularisation factor for dense layers.
       dropout_rate : float
           Dropout rate applied after the residual connection.
       name : str
           Name for the returned Keras model.
       """
       inputs = keras.Input(shape=(input_dim,), name=f"{name}_in")

       x = layers.Dense(
           input_dim,
           activation="linear",
           kernel_initializer=keras.initializers.Identity(),
           kernel_regularizer=regularizers.l1_l2(l1=kernel_reg_l1, l2=kernel_reg_l2),
           name=f"{name}_dense1",
       )(inputs)
       x = layers.BatchNormalization(name=f"{name}_bn")(x)
       x = layers.ReLU(name=f"{name}_relu")(x)
       x = layers.Dense(
           input_dim,
           activation="linear",
           kernel_initializer=keras.initializers.Identity(),
           kernel_regularizer=regularizers.l1_l2(l1=kernel_reg_l1, l2=kernel_reg_l2),
           name=f"{name}_dense2",
       )(x)
       x = layers.Add(name=f"{name}_add")([inputs, x])
       x = layers.Dropout(dropout_rate, name=f"{name}_drop")(x)

       return keras.Model(inputs=inputs, outputs=x, name=name)

   # Create an encoder for each modality
   model_list = [
       residual_block(histo20.shape[1], name="histo20_enc"),
       residual_block(rnaseq.shape[1],  name="rnaseq_enc"),
       residual_block(methylation.shape[1], name="meth_enc"),
       residual_block(mirna.shape[1],   name="mirna_enc"),
       residual_block(snv.shape[1],     name="snv_enc"),
   ]
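
To make the data flow through ``residual_block`` concrete, the sketch below reproduces its forward pass in plain NumPy.  It is a simplification under stated assumptions: batch normalisation and dropout are left out (as if both were inactive), biases are zero, and ``residual_forward`` is an illustrative name rather than part of the package.  With identity kernels, which is how the dense layers above are initialised, the simplified block reduces to ``x + relu(x)``.

.. code-block:: python

   import numpy as np

   def residual_forward(x, w1, w2):
       """Simplified forward pass: dense -> ReLU -> dense -> skip connection."""
       h = x @ w1              # first dense layer (linear activation, zero bias)
       h = np.maximum(h, 0.0)  # ReLU (batch normalisation omitted in this sketch)
       h = h @ w2              # second dense layer (linear activation, zero bias)
       return x + h            # residual (skip) connection

   x = np.array([[1.0, -2.0, 0.5]])
   eye = np.eye(3)  # identity kernels, mirroring the Identity initialiser

   out = residual_forward(x, eye, eye)
   print(out.tolist())  # x + relu(x) -> [[2.0, -2.0, 1.0]]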

3. Specify the structural path matrix
-------------------------------------

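For this example we use a five‑factor model with a hub‑and‑spoke layout: the second factor is linked to each of the other four, which are linked only to it.  The matrix below defines which latent factors influence each other; because every path is entered in both directions, the matrix is symmetric.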

.. code-block:: python

   import numpy as np

   ndims = 5  # number of latent factors

   Path = np.array([
       # F1 F2 F3 F4 F5
       [0, 1, 0, 0, 0],  # F1 ← F2
       [1, 0, 1, 1, 1],  # F2 ← F1,F3,F4,F5
       [0, 1, 0, 0, 0],  # F3 ← F2
       [0, 1, 0, 0, 0],  # F4 ← F2
       [0, 1, 0, 0, 0],  # F5 ← F2
   ], dtype="float32")

   batch_size  = 256
   epochs      = 300
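   # Keras runs ceil(n_samples / batch_size) steps per epoch; rounding down
   # would give zero steps (and an invalid schedule) if n_samples < batch_size
   total_steps = int(np.ceil(rnaseq.shape[0] / batch_size)) * epochs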

   # Exponential learning rate decay
   init_lr, final_lr = 1e-4, 1e-5
   lr_schedule = optimizers.schedules.ExponentialDecay(
       initial_learning_rate=init_lr,
       decay_steps=total_steps,
       decay_rate=final_lr / init_lr,
       staircase=False,
   )

   # Total number of samples (needed by DLVPM for normalisation)
   tot_num = rnaseq.shape[0]
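
Before training, it can help to verify that the path matrix says what you intend.  The sketch below assumes that row and column *i* of ``Path`` correspond to the *i*-th entry of ``X_arr`` and ``model_list``; this is an assumption based on the shared ordering used throughout this tutorial.  ``describe_paths`` and ``view_names`` are illustrative names, not part of ``deep_lvpm``.

.. code-block:: python

   import numpy as np

   # Assumed mapping from matrix index to view, following the order of X_arr
   view_names = ["histo20", "rnaseq", "methylation", "mirna", "snv"]

   path = np.array([
       [0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
   ], dtype="float32")

   def describe_paths(path, names):
       """Map each view name to the names of the views it is linked to."""
       return {
           names[i]: [names[j] for j in np.flatnonzero(path[i])]
           for i in range(len(names))
       }

   is_symmetric = bool(np.array_equal(path, path.T))
   has_self_loops = bool(np.any(np.diag(path) != 0))
   connections = describe_paths(path, view_names)

   print(f"symmetric: {is_symmetric}, self-loops: {has_self_loops}")
   for name, linked in connections.items():
       print(f"{name} <-> {', '.join(linked)}")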

4. Build and compile the model
-------------------------------

We create a :class:`StructuralModel` instance and provide regularisers for the projection layers.  We then compile it with a list of optimisers, one per view.

.. code-block:: python

   from tensorflow.keras import regularizers

   # Regularisers applied to each projection layer
   regularizer_list = [
       regularizers.L1L2(l1=0.01, l2=0.01),
       regularizers.L1L2(l1=0.01, l2=0.01),
       regularizers.L1L2(l1=0.01, l2=0.01),
       regularizers.L1L2(l1=0.01, l2=0.01),
       regularizers.L1L2(l1=0.01, l2=0.01),
   ]

   # Build the structural model
   DLVPM_Structural_instance = StructuralModel(
       Path,
       model_list,
       regularizer_list,
       tot_num,
       ndims,
       momentum=0.95,
       epsilon=0.001,
       orthogonalization="Moore-Penrose",
   )

   # One optimizer per measurement model using the decaying learning rate
   opt_list = [
       optimizers.Adam(learning_rate=lr_schedule) for _ in model_list
   ]

   # Compile the model
   DLVPM_Structural_instance.compile(optimizer=opt_list)
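
The schedule shared by the optimisers is set up so that the learning rate reaches ``final_lr`` once ``total_steps`` updates have been made.  The sketch below re-implements the non-staircase formula documented for ``ExponentialDecay``, ``initial_learning_rate * decay_rate ** (step / decay_steps)``, in plain Python.  The value ``total_steps = 1200`` is an arbitrary illustration and is not derived from the dataset.

.. code-block:: python

   def exponential_decay(step, init_lr, decay_rate, decay_steps):
       """Continuous (non-staircase) exponential decay of the learning rate."""
       return init_lr * decay_rate ** (step / decay_steps)

   init_lr, final_lr = 1e-4, 1e-5
   total_steps = 1200                # illustrative value only
   decay_rate = final_lr / init_lr   # same choice as in the tutorial

   lr_start = exponential_decay(0, init_lr, decay_rate, total_steps)
   lr_mid = exponential_decay(total_steps // 2, init_lr, decay_rate, total_steps)
   lr_end = exponential_decay(total_steps, init_lr, decay_rate, total_steps)

   print(f"start={lr_start:.2e}  midpoint={lr_mid:.2e}  end={lr_end:.2e}")

Because the decay is exponential, the midpoint value is the geometric mean of the initial and final rates rather than their arithmetic mean.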

5. Train and evaluate
---------------------

Training proceeds with the standard Keras ``fit`` interface.  The ``evaluate`` method returns both the mean squared error and the mean correlation between connected data types.

.. code-block:: python

   # Train the model on the training data
   DLVPM_Structural_instance.fit(
       X_arr,
       batch_size=batch_size,
       epochs=epochs,
       verbose=True,
   )

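   # Evaluate on the training data; ``evaluate`` returns the mean squared error
   # and the mean correlation, with the correlation at index 1
   train_metrics = DLVPM_Structural_instance.evaluate(X_arr)
   print(f"Mean correlation on training data: r={train_metrics[1]:.3f}")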

6. Evaluate on the test set
---------------------------

We load the separate test dataset and compute the mean correlation between the learned deep latent variables (DLVs) of connected data types.

.. code-block:: python

   # Load the independent test dataset
   with resources.as_file(resources.files("deep_lvpm.data") / "Lung_multiomics_sample_test.npz") as f:
       arrays = np.load(f)
       rnaseq_test      = arrays["rnaseq"]
       snv_test         = arrays["snv"]
       methylation_test = arrays["methylation"]
       mirna_test       = arrays["mirna"]
       histo20_test     = arrays["histo20"]

   X_arr_test = [histo20_test, rnaseq_test, methylation_test, mirna_test, snv_test]

   # As on the training data, index 1 of the result is the mean correlation
   test_metrics = DLVPM_Structural_instance.evaluate(X_arr_test)
   print(f"Mean correlation on test data: r={test_metrics[1]:.3f}")

7. Inspect the learned latent variables
---------------------------------------

To extract the latent factors for each view, call ``predict``.  This returns an array with shape ``(n_samples, ndims, n_views)``.

.. code-block:: python

   test_DLVs = DLVPM_Structural_instance.predict(X_arr_test)

   # Correlation matrix of the first latent factor across views
   corr_first = np.corrcoef(test_DLVs[:, 0, :].T)
   print("Correlation matrix for latent factor 1:", corr_first)

   # Correlation matrix of the second latent factor
   corr_second = np.corrcoef(test_DLVs[:, 1, :].T)
   print("Correlation matrix for latent factor 2:", corr_second)
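
A single summary number per factor is often easier to compare than a full matrix.  The sketch below averages the cross‑view correlations over the pairs of views that are linked in the path matrix.  It is one plausible way to compute such a summary and is not necessarily how ``evaluate`` derives its reported mean correlation.  ``mean_connected_corr`` is a hypothetical helper, and random data stands in for ``test_DLVs``.

.. code-block:: python

   import numpy as np

   def mean_connected_corr(dlvs, path):
       """Mean correlation over path-connected view pairs, one value per factor.

       dlvs has shape (n_samples, ndims, n_views); path is (n_views, n_views).
       """
       # Symmetrise, then keep the upper triangle so each linked pair counts once
       rows, cols = np.nonzero(np.triu(path + path.T, k=1))
       means = []
       for d in range(dlvs.shape[1]):
           corr = np.corrcoef(dlvs[:, d, :].T)  # (n_views, n_views)
           means.append(corr[rows, cols].mean())
       return np.array(means)

   # Same connectivity pattern as the tutorial's path matrix
   demo_path = np.array([
       [0, 1, 0, 0, 0],
       [1, 0, 1, 1, 1],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0],
   ], dtype="float32")

   # Stand-in DLVs: a signal shared by all five views plus a little noise
   rng = np.random.default_rng(0)
   shared = rng.normal(size=(200, 2, 1))
   demo_dlvs = shared + 0.1 * rng.normal(size=(200, 2, 5))

   per_factor = mean_connected_corr(demo_dlvs, demo_path)
   print(np.round(per_factor, 3))

With the tutorial's variables, the equivalent call would be ``mean_connected_corr(test_DLVs, Path)``.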

8. Save the model
-----------------

Finally, save your trained model to disk in the ``.keras`` format:

.. code-block:: python

   DLVPM_Structural_instance.save("/path/to/output_folder/DLVPM_Model.keras")
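
It can be convenient to create the output directory before saving, rather than relying on it already existing.  The sketch below uses only the standard library.  ``prepare_model_path`` is a hypothetical helper, and a temporary directory stands in for the real output folder.  ``pathlib.Path`` is imported under an alias because this tutorial already uses the name ``Path`` for the path matrix.

.. code-block:: python

   import tempfile
   from pathlib import Path as FsPath  # alias avoids shadowing the ``Path`` matrix

   def prepare_model_path(output_dir, filename="DLVPM_Model.keras"):
       """Create output_dir if needed and return the full path for the model file."""
       target = FsPath(output_dir) / filename
       if target.suffix != ".keras":
           raise ValueError(f"Expected a .keras file name, got {filename!r}")
       target.parent.mkdir(parents=True, exist_ok=True)
       return target

   # Stand-in location; replace with your own output folder
   demo_dir = FsPath(tempfile.mkdtemp()) / "output_folder"
   model_path = prepare_model_path(demo_dir)
   print(model_path.name)

The result can then be passed to ``save`` as ``str(model_path)``.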

This tutorial illustrates how DLVPM can be applied to real multi‑omics data.  You can extend this example by changing the measurement models, experimenting with different regularisation schemes, or altering the structural path matrix to test different hypotheses about cross‑modal relationships.

If this deep dive was useful, please `star <https://github.com/alexjamesing/Deep_LVPM>`_ the repository.  Community support signals that DLVPM matters and helps us justify the time invested in future improvements.