Overview

In many situations, marker genes for cell types are either known a priori as expert knowledge, or can be curated through databases such as the Cellmark database. Alternatively, if purified expression data exists (either in bulk or single-cell form), it is possible to quickly derive marker genes using the findMarkers function in the scran R package.

Below we detail a case study in deriving marker genes through a differential expression approach.

Data

Overview

We take bulk RNA-seq data from Holik et al. Nucleic Acids Research 2017 to derive marker genes for 3 different cell lines. This is packaged with cellassign as holik_data:

data(holik_data)

which contains a matrix of counts, where each row is a gene (index by entrez ID) and each column is a sample:

as well as a vector with the cell line of origin for each sample:

Preparation

We first provide a map from entrez IDs to gene symbols:

Then construct the DGEList object for input to limma voom, filtering out lowly expressed genes:

and finally calculate the normalization factors:

Marker gene derivation

To derive marker genes, we first create a log fold change matrix using H1975 as the baseline expression:

lfc_table <- tt_sig[,c("H2228...H1975", "HCC827...H1975")]
lfc_table <- lfc_table %>%
  dplyr::mutate(H1975=0,
                H2228=H2228...H1975,
                HCC827=HCC827...H1975) %>%
  dplyr::select(H1975, H2228, HCC827)
rownames(lfc_table) <- tt_sig$GeneID

Then, for each gene, we subtract the minimum log fold change, as we care about overexpression of genes relative to some minimum expression level, as this defines a marker gene:

We now define a helper function for turning log fold changes into a binary matrix. This takes a matrix and a threshold, and any values less than or equal to the threshold are set to 0, and all others to 1:

Next, we implement a basic procedure for binarizing this matrix. Essentially, we look for the largest ‘gap’ in expression for each gene, and the cell types with expression above this gap are designated has having that gene as a marker:

Finally, we add back in gene symbols rather than entrez ids:

And there we have a marker gene matrix for our cell types:

pheatmap(t(marker_gene_mat))

Note that the expression data used for input to CellAssign should use only these as input.

Technical

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS  10.15.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] cellassign_0.99.16   pheatmap_1.0.12      matrixStats_0.55.0  
#>  [4] edgeR_3.26.8         org.Hs.eg.db_3.8.2   AnnotationDbi_1.46.1
#>  [7] IRanges_2.18.3       S4Vectors_0.22.1     Biobase_2.44.0      
#> [10] BiocGenerics_0.30.0  limma_3.40.6         magrittr_1.5        
#> [13] BiocStyle_2.12.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.3                  locfit_1.5-9.1             
#>  [3] lattice_0.20-38             assertthat_0.2.1           
#>  [5] zeallot_0.1.0               rprojroot_1.3-2            
#>  [7] digest_0.6.23               plyr_1.8.5                 
#>  [9] R6_2.4.1                    GenomeInfoDb_1.20.0        
#> [11] backports_1.1.5             RSQLite_2.2.0              
#> [13] evaluate_0.14               pillar_1.4.3               
#> [15] zlibbioc_1.30.0             tfruns_1.4                 
#> [17] rlang_0.4.2                 rstudioapi_0.10            
#> [19] whisker_0.4                 blob_1.2.0                 
#> [21] Matrix_1.2-18               reticulate_1.13            
#> [23] rmarkdown_2.0               pkgdown_1.4.1              
#> [25] desc_1.2.0                  BiocParallel_1.18.1        
#> [27] stringr_1.4.0               RCurl_1.95-4.13            
#> [29] bit_1.1-15.1                munsell_0.5.0              
#> [31] DelayedArray_0.10.0         compiler_3.6.0             
#> [33] xfun_0.12                   pkgconfig_2.0.3            
#> [35] base64enc_0.1-3             tensorflow_2.0.0           
#> [37] htmltools_0.4.0             tidyselect_0.2.5           
#> [39] SummarizedExperiment_1.14.1 tibble_2.1.3               
#> [41] GenomeInfoDbData_1.2.1      bookdown_0.17              
#> [43] dplyr_0.8.3                 crayon_1.3.4               
#> [45] MASS_7.3-51.5               bitops_1.0-6               
#> [47] grid_3.6.0                  jsonlite_1.6               
#> [49] gtable_0.3.0                lifecycle_0.1.0            
#> [51] DBI_1.1.0                   scales_1.1.0               
#> [53] stringi_1.4.5               XVector_0.24.0             
#> [55] fs_1.3.1                    vctrs_0.2.1                
#> [57] RColorBrewer_1.1-2          tools_3.6.0                
#> [59] bit64_0.9-7                 glue_1.3.1                 
#> [61] purrr_0.3.3                 yaml_2.2.0                 
#> [63] colorspace_1.4-1            BiocManager_1.30.10        
#> [65] GenomicRanges_1.36.1        memoise_1.1.0              
#> [67] knitr_1.27