Contents

1 Overview

In many situations, marker genes for cell types are either known a priori as expert knowledge, or can be curated through databases such as the Cellmark database. Alternatively, if purified expression data exists (either in bulk or single-cell form), it is possible to quickly derive marker genes using the findMarkers function in the scran R package.

Below we detail a case study in deriving marker genes through a differential expression approach.

2 Data

2.1 Overview

We take bulk RNA-seq data from Holik et al. Nucleic Acids Research 2017 to derive marker genes for 3 different cell lines. This is packaged with cellassign as holik_data:

data(holik_data)

which contains a matrix of counts, where each row is a gene (index by entrez ID) and each column is a sample:

head(holik_data$counts[,1:2])
#>    RSCE_10_BC2CTUACXX_AGTTCC_L006_R1.bam
#> 1                                     13
#> 2                                      0
#> 9                                    346
#> 10                                     0
#> 12                                    10
#> 13                                     1
#>    RSCE_12_BC2CTUACXX_TGACCA_L005_R1.bam
#> 1                                     12
#> 2                                      0
#> 9                                    286
#> 10                                     1
#> 12                                    17
#> 13                                     0

as well as a vector with the cell line of origin for each sample:

head(holik_data$cell_line)
#> [1] "HCC827" "HCC827" "H2228"  "H2228"  "H838"   "H838"

2.2 Preparation

We first provide a map from entrez IDs to gene symbols:

entrez_map <- select(org.Hs.eg.db, 
                     as.character(rownames(holik_data$counts)), 
                     c("SYMBOL"), "ENTREZID")
#> 'select()' returned 1:1 mapping between keys and columns
gene_annotations <- entrez_map %>%
  dplyr::rename(GeneID=ENTREZID,
                Symbol=SYMBOL)

Then construct the DGEList object for input to limma voom, filtering out lowly expressed genes:

dge <- DGEList(counts = holik_data$counts, 
               group = holik_data$cell_line, 
               genes = gene_annotations, 
               remove.zeros = TRUE)
#> Removing 3620 rows with all zero counts
genes_to_keep <- rowSums(cpm(dge$counts) > 0.5) >= 2
dge_filt <- dge[genes_to_keep,]

and finally calculate the normalization factors:

dge_filt <- calcNormFactors(dge_filt, method="TMM")

3 Differential expression

We next perform differential expression using Limma Voom on a subset of 3 samples: HCC827, H2228, H1975:

dge_subset <- dge_filt[,dge_filt$samples$group %in% c("HCC827", "H2228", "H1975")]
design <- model.matrix(~ 0+dge_subset$samples$group)
colnames(design) <- levels(dge_subset$samples$group)
v <- voom(dge_subset, design)
fit <- lmFit(v, design)

Next, fit constrasts to find differentially expressed genes between cell types:

contrast.matrix <- makeContrasts(H2228 - H1975, 
                                 HCC827 - H1975, 
                                 HCC827 - H2228, 
                                 levels = design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2)

Finally, compute gene summary statistics and filter to only significantly differentially expressed geens (FDR < 0.05):

tt <- topTable(fit2, n=Inf)
tt_sig <- tt %>%
  dplyr::filter(adj.P.Val < 0.05)

head(tt_sig)
#>   GeneID    Symbol H2228...H1975 HCC827...H1975 HCC827...H2228  AveExpr
#> 1   1956      EGFR    0.68777340       4.303436       3.615662 9.208711
#> 2  23480    SEC61G   -0.40115418       4.414580       4.815734 7.784072
#> 3  81552     VOPP1   -0.09640406       4.347392       4.443796 6.998559
#> 4   1362       CPD    2.12265278      -2.103571      -4.226224 8.704180
#> 5 729086 LOC729086   -0.04338772       4.384976       4.428363 6.555193
#> 6  55915    LANCL2   -0.57454448       4.021963       4.596508 6.568625
#>          F      P.Value    adj.P.Val
#> 1 2026.616 3.447052e-12 2.982899e-08
#> 2 2003.024 3.623981e-12 2.982899e-08
#> 3 1766.548 6.199752e-12 3.232724e-08
#> 4 1671.376 7.854996e-12 3.232724e-08
#> 5 1506.362 1.224645e-11 3.628466e-08
#> 6 1479.497 1.322488e-11 3.628466e-08

4 Marker gene derivation

To derive marker genes, we first create a log fold change matrix using H1975 as the baseline expression:

lfc_table <- tt_sig[,c("H2228...H1975", "HCC827...H1975")]
lfc_table <- lfc_table %>%
  dplyr::mutate(H1975=0,
                H2228=H2228...H1975,
                HCC827=HCC827...H1975) %>%
  dplyr::select(H1975, H2228, HCC827)
rownames(lfc_table) <- tt_sig$GeneID

Then, for each gene, we subtract the minimum log fold change, as we care about overexpression of genes relative to some minimum expression level, as this defines a marker gene:

lfc_table <- as.matrix(lfc_table)
lfc_table <- lfc_table - rowMins(lfc_table)
lfc_table <- as.data.frame(lfc_table)

We now define a helper function for turning log fold changes into a binary matrix. This takes a matrix and a threshold, and any values less than or equal to the threshold are set to 0, and all others to 1:

binarize <- function(x, threshold) {
  x[x <= threshold] <- -Inf
  x[x > -Inf] <- 1
  x[x == -Inf] <- 0
  return(x)
}

Next, we implement a basic procedure for binarizing this matrix. Essentially, we look for the largest ‘gap’ in expression for each gene, and the cell types with expression above this gap are designated has having that gene as a marker:

# Find the biggest difference
maxdiffs <- apply(lfc_table, 1, function(x) max(diff(sort(x))))

#
thres_vals <- apply(lfc_table, 1, function(x) sort(x)[which.max(diff(sort(x)))])
expr_mat_thres <- plyr::rbind.fill(lapply(1:nrow(lfc_table), function(i) {
  binarize(lfc_table[i,], thres_vals[i])
}))
rownames(expr_mat_thres) <- rownames(lfc_table)
marker_gene_mat <- expr_mat_thres[(maxdiffs >= quantile(maxdiffs, c(.99))) 
                                  & (thres_vals <= log(2)),] %>%
  as.matrix

Finally, we add back in gene symbols rather than entrez ids:

suppressMessages({
  symbols <- plyr::mapvalues(
    rownames(marker_gene_mat),
    from = gene_annotations$GeneID,
    to = gene_annotations$Symbol
  )
})

is_na <- is.na(symbols)

marker_gene_mat <- marker_gene_mat[!is_na,]
rownames(marker_gene_mat) <- symbols[!is_na]

And there we have a marker gene matrix for our cell types:

head(marker_gene_mat)
#>          H1975 H2228 HCC827
#> NRK          0     0      1
#> MTAP         1     0      1
#> CP           0     1      0
#> HIST1H3F     1     0      1
#> HIST1H4E     1     0      1
#> ADIRF        1     1      0
pheatmap(t(marker_gene_mat))

Note that the expression data used for input to CellAssign should use only these as input.

5 Technical

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] cellassign_0.99.11   pheatmap_1.0.12      matrixStats_0.54.0  
#>  [4] edgeR_3.26.5         org.Hs.eg.db_3.8.2   AnnotationDbi_1.46.0
#>  [7] IRanges_2.18.1       S4Vectors_0.22.0     Biobase_2.44.0      
#> [10] BiocGenerics_0.30.0  limma_3.40.2         magrittr_1.5        
#> [13] BiocStyle_2.12.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] pkgload_1.0.2               bit64_0.9-7                
#>  [3] jsonlite_1.6                assertthat_0.2.1           
#>  [5] BiocManager_1.30.4          blob_1.1.1                 
#>  [7] GenomeInfoDbData_1.2.1      yaml_2.2.0                 
#>  [9] remotes_2.1.0               sessioninfo_1.1.1          
#> [11] pillar_1.4.2                RSQLite_2.1.1              
#> [13] backports_1.1.4             lattice_0.20-38            
#> [15] glue_1.3.1                  reticulate_1.12            
#> [17] digest_0.6.20               GenomicRanges_1.36.0       
#> [19] RColorBrewer_1.1-2          XVector_0.24.0             
#> [21] colorspace_1.4-1            plyr_1.8.4                 
#> [23] htmltools_0.3.6             Matrix_1.2-17              
#> [25] pkgconfig_2.0.2             devtools_2.0.2             
#> [27] bookdown_0.11               zlibbioc_1.30.0            
#> [29] purrr_0.3.2                 scales_1.0.0               
#> [31] processx_3.4.0              whisker_0.3-2              
#> [33] BiocParallel_1.18.0         tibble_2.1.3               
#> [35] usethis_1.5.1               withr_2.1.2                
#> [37] SummarizedExperiment_1.14.0 cli_1.1.0                  
#> [39] crayon_1.3.4                memoise_1.1.0              
#> [41] evaluate_0.14               ps_1.3.0                   
#> [43] fs_1.3.1                    pkgbuild_1.0.3             
#> [45] tools_3.6.0                 prettyunits_1.0.2          
#> [47] stringr_1.4.0               munsell_0.5.0              
#> [49] locfit_1.5-9.1              DelayedArray_0.10.0        
#> [51] callr_3.3.0                 compiler_3.6.0             
#> [53] GenomeInfoDb_1.20.0         rlang_0.4.0                
#> [55] grid_3.6.0                  RCurl_1.95-4.12            
#> [57] rstudioapi_0.10             bitops_1.0-6               
#> [59] base64enc_0.1-3             rmarkdown_1.13             
#> [61] testthat_2.1.1              gtable_0.3.0               
#> [63] DBI_1.0.0                   R6_2.4.0                   
#> [65] tfruns_1.4                  dplyr_0.8.3                
#> [67] knitr_1.23                  tensorflow_1.13.1          
#> [69] bit_1.1-14                  rprojroot_1.3-2            
#> [71] desc_1.2.0                  stringi_1.4.3              
#> [73] Rcpp_1.0.1                  tidyselect_0.2.5           
#> [75] xfun_0.8