Use NMF to discover

Xun Zhu

This simple tutorial will guide you through a typical analysis workflow with NMFEM pipeline. At the end of this tutorial the reader should be able to do unsupervised clustering and important gene calling using non-negative matrix factorization, and important gene module calling using the spin-glass based algorithm FEM. Throughout the tutorial we will be using the mouse embryonic lung epithelial cell dataset (Treutlein et al. 2014).

Unsupervised Clustering and Important gene calling

Let’s begin by loading the NMFEM package:

library(NMFEM)

and loading the dataset attached in the package:

data(fpkm)

the unsupervised clustering and important genes calling is nicely packed in a single function. If you are running this on a multi-core computer, feel free to set a higher n_threads_ to take advantage of parallelism. Depending on your hardware this might take while so grab a cup of coffee while waiting for the results. :)

nmf_results <- nmf_subpopulation(fpkm, n_threads_ = 30)

The nmf_results contains various dataframes and ggplot objects for different aspect of the analysis. Let’s take a look at a few important ones. We begin by looking at the clustering result:

predict(nmf_results$nmf_result)
## SRX380620 SRX380631 SRX380635 SRX380623 SRX380629 SRX380636 SRX380639 
##         2         2         2         2         2         2         2 
## SRX380641 SRX380630 SRX380633 SRX380624 SRX380628 SRX380637 SRX380640 
##         2         2         2         2         1         2         1 
## SRX380643 SRX380632 SRX380626 SRX380619 SRX380634 SRX380638 SRX380642 
##         2         1         2         1         1         2         1 
## SRX380622 SRX380644 SRX380625 SRX380627 SRX380621 SRX380618 SRX380565 
##         1         1         2         2         1         1         1 
## SRX380551 SRX380557 SRX380558 SRX380560 SRX380570 SRX380536 SRX380552 
##         1         1         1         1         1         1         1 
## SRX380534 SRX380538 SRX380568 SRX380566 SRX380567 SRX380556 SRX380559 
##         1         1         2         1         1         1         1 
## SRX380527 SRX380535 SRX380555 SRX380530 SRX380532 SRX380562 SRX380542 
##         1         1         1         1         1         1         1 
## SRX380550 SRX380554 SRX380528 SRX380547 SRX380529 SRX380544 SRX380541 
##         1         1         1         1         1         1         1 
## SRX380563 SRX380548 SRX380549 SRX380561 SRX380543 SRX380546 SRX380540 
##         1         1         1         1         1         1         1 
## SRX380537 SRX380553 SRX380539 SRX380571 SRX380564 SRX380531 SRX380545 
##         1         1         1         1         1         1         1 
## SRX380533 
##         1 
## attr(,"what")
## [1] columns
## Levels: 1 2

Then we can ask “what are the genes that provided the evidence for the separation of these two clusters?”

nmf_results$gene_info
## # A tibble: 14,865 × 8
##    gene_name      basis_1      basis_2 basis_1_weighted basis_2_weighted
##        <chr>        <dbl>        <dbl>            <dbl>            <dbl>
## 1        Mgp 1.090587e+00 3.190557e+01     1.607912e-01     6.831103e+00
## 2       Mal2 5.313835e+01 8.484844e+00     7.834476e+00     1.816637e+00
## 3        Bgn 2.220446e-16 2.673308e+01     3.273725e-17     5.723652e+00
## 4     Col1a2 3.252298e-01 2.662793e+01     4.795040e-02     5.701139e+00
## 5   Hist1h4d 3.588441e+01 2.220446e-16     5.290634e+00     4.754058e-17
## 6      Mfap4 2.220446e-16 2.224263e+01     3.273725e-17     4.762231e+00
## 7       Idh1 3.035478e+01 2.220446e-16     4.475371e+00     4.754058e-17
## 8       Nrep 4.179132e+01 4.961695e+01     6.161522e+00     1.062318e+01
## 9       Igf1 1.758126e+00 2.184246e+01     2.592101e-01     4.676552e+00
## 10      Fhl1 7.258388e+00 2.559203e+01     1.070144e+00     5.479349e+00
## # ... with 14,855 more rows, and 3 more variables: d_score <dbl>,
## #   which_max <fctr>, rank <int>

One way to understand the connection between the called clusters and the called important genes is that these two clusters express two distinct expression patterns. That is, some genes are expressed highly only in one cluster while some other genes are expressed highly only in the other cluster.

Let’s see what these two expression patterns are:

nmf_results$coef_line_plot

We can also see the D-score distribution of these two patterns. This plot also gives you an idea the relative abundance of genes for each pattern.

nmf_results$d_score_frequency_plot

There are also other interesting plots that allow various insights into the clusters, important genes, and expression patterns. We encourage the reader to explore by looking into its struture (it’s just a list):

str(nmf_results, max.level=1)
## List of 12
##  $ nmf_result            :Formal class 'NMFfitX1' [package "NMF"] with 16 slots
##  $ gene_info             :Classes 'tbl_df', 'tbl' and 'data.frame':  14865 obs. of  8 variables:
##  $ d_score_frequency_plot:List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ ordered_sample_ids    : chr [1:71] "SRX380553" "SRX380621" "SRX380549" "SRX380545" ...
##  $ coef_line_dat         :Classes 'tbl_df', 'tbl' and 'data.frame':  142 obs. of  3 variables:
##  $ coef_line_plot        :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ path_dat              :Classes 'tbl_df', 'tbl' and 'data.frame':  71 obs. of  19 variables:
##  $ pca_path_plot         :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ mds_path_plot         :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ top_genes_dat         :Classes 'tbl_df', 'tbl' and 'data.frame':  7100 obs. of  6 variables:
##  $ top_genes_free_y_plot :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"
##  $ top_genes_fixed       :List of 9
##   ..- attr(*, "class")= chr [1:2] "gg" "ggplot"

By the way, all the ggplot objects in the list have dataframe counter-parts for easy access of the underlying data. For example, coef_line_dat has the data for coef_line_plot:

nmf_results$coef_line_dat
## # A tibble: 142 × 3
##     coef    sample        value
##    <int>    <fctr>        <dbl>
## 1      1 SRX380553 1.641486e-01
## 2      2 SRX380553 1.689362e-09
## 3      1 SRX380621 1.531187e-01
## 4      2 SRX380621 1.233965e-05
## 5      1 SRX380549 1.426344e-01
## 6      2 SRX380549 2.216304e-08
## 7      1 SRX380545 1.424453e-01
## 8      2 SRX380545 1.076828e-08
## 9      1 SRX380561 1.361543e-01
## 10     2 SRX380561 4.021645e-09
## # ... with 132 more rows

Important Module generation

Again, everything you need is nicely packed in the function spinglass_procedure:

module_results <- spinglass_procedure(fpkm, phe, leading_genes, mppi, 'mouse', n_threads_ = 30)

We can view the module graphs by:

module_results$graph_plot

A summary containing important information about these modules can be found using:

module_results$final_tb
##     seed size connectivity p_values
## 1 Col1a1   87     5.931034    0.001
## 2 Ndufs8   21     8.000000    0.006
## 3  Gsk3b   39     4.051282    0.008
## 4   Gnb1   25     4.400000    0.034
## 5   Wtap   12     4.000000    0.034
##                                  first_term first_qvalue
## 1                      extracellular matrix 7.961773e-54
## 2 mitochondrial respiratory chain complex I 2.663462e-12
## 3                                      <NA>           NA
## 4          heterotrimeric G-protein complex 1.500972e-20
## 5                             nuclear speck 1.854504e-03
##                          second_term second_qvalue
## 1 proteinaceous extracellular matrix  3.337247e-53
## 2         NADH dehydrogenase complex  2.663462e-12
## 3                               <NA>            NA
## 4    plasma membrane protein complex  4.172258e-16
## 5                               <NA>            NA

References

Treutlein, Barbara, Doug G Brownfield, Angela R Wu, Norma F Neff, Gary L Mantalas, F Hernan Espinoza, Tushar J Desai, Mark A Krasnow, and Stephen R Quake. 2014. “Reconstructing Lineage Hierarchies of the Distal Lung Epithelium Using Single-Cell RNA-Seq.” Nature 509 (7500). Nature Publishing Group: 371–75.