Abstract

scTEI add any phylogenetically based transcriptome evolutionary index (TEI) to single-cell data objects

Introduction

The goal of scTEI is to provide easy functionality to add phylogenetically based transcriptome evolutionary index (TEI) to single-cell data objects.

For a comprehensive overview about the topic of gene age assignments, transcriptome age index (TAI) and its derivates (TDI, TPI, Adjusted SD, PhastCons, …) see e.g.:

  • myTAI - Introduction
  • TAI: (Domazet-Lošo, Brajković, and Tautz 2007)
  • TAI: (Domazet-Lošo and Tautz 2010)
  • TDI: (Quint et al. 2012)
  • TDI: (Drost et al. 2017)
  • TPI: (Gossmann et al. 2016)
  • Adjusted SD, PhastCons: (Liu et al. 2020)

Since all of these values deal with the topic of weigthing transcriptome data with an evolutionary index, I combine all of them under one term transcriptome evolutionary index, short TEI.

The following sections introduce main single-cell gene expression data analysis techniques implemented in scTEI, which basically update basic myTAI functions to deal with large single-cell data objects:

  • Adding transcriptome evolutionary index (TEI) to single-cell data objects

  • Seurat

  • monocle3

  • Visualizing TEI values

Adding transcriptome evolutionary index (TEI) to single-cell data objects

A variety of data objects are used to deal with sparse count data (dgCMatrix) from single-cell RNA sequencing (scRNA-seq).

Here, we focus on the common used R packages Seurat and monocle3, which both use a sparseMatrix object to store scRNA count data.

Note: Apparently, one important note is that count data or subsequent normalized counts or scaled data needs to be positive.

One should consider transforming scaled data to positive scale prior applying TEI calculations.

In this section we will introduce how to add TEI values to either a seurat or monocle3 cell data set.

Note that prior calculating TEI one needs to retrieve phylogenetic or taxonomic information for your focal species.

This might be a phylostratigraphic map (as introduced by (Domazet-Lošo, Brajković, and Tautz 2007)) or an ortho map (I call them, see e.g. (Julca et al. 2021) or (Cazet et al. 2022)), which can be obtained by assigning to each orthogroup (OG) or hierachical orthogroup (HOG) along a given species tree the ancestral node.

Please have a look at the introduction of the great myTAI package for possible sources, how to get such phylostratigraphic maps (myTAI - Introduction).

To create an ortho map, one simply needs to get OGs or HOGs with e.g. OrthoFinder ((Emms and Kelly 2019)) or Proteinortho ((Lechner et al. 2011)) or any other ortholog prediction tool (see (Linard et al. 2021)) using a set of species that cover the species range of your interest. Parse each OG or HOG for the oldest clade as compared to a species tree and your focal species of interest.

Or e.g. use pre-calculated OGs from e.g. https://omabrowser.org/oma/home ((Schneider, Dessimoz, and Gonnet 2007)) or https://bioinformatics.psb.ugent.be/plaza/ for plants ((Proost et al. 2009)).

Installation

Install - Seurat and monocle3

Please visit the following documentation to install Seurat and monocole3

Seurat

monocle3

Install org.Hs.eg.db and org.Mm.eg.db to be able to convert ensembl IDs into gene name alias

org.Hs.eg.db org.Mm.eg.db

BiocManager::install(
    c(
    "org.Hs.eg.db",
    "org.Mm.eg.db")
)

Install additional plotting packages

ggplot2 viridis cowplot ComplexHeatmap

install.packages("ggplot2")
install.packages("viridis")
install.packages("cowplot")
BiocManager::install("ComplexHeatmap")

Seurat - example

For Seurat, we use scRNA data obtained from SeuratData for the model organism Caenorhabditis elegans (Packer et al. 2019) and phylogenetic data obtained from (Sun, Rödelsperger, and Sommer 2021) (Sun2021_Supplemental_Table_S6).

Example: 6k C. elegans embryos from Packer and Zhu et al (2019)

Here, for demonstration purposes, we will be using the 6k Caenorhabditis elegans embryos object that is available via the SeuratData package.

# load Packer and Zhu et al (2019) data set
SeuratData::InstallData("celegans.embryo.SeuratData")
library(celegans.embryo.SeuratData)
data("celegans.embryo")
celegans <- UpdateSeuratObject(celegans.embryo)
#celegans <- SeuratData::LoadData("celegans.embryo")
dim(celegans)
## [1] 20222  6188
head(rownames(celegans))
## [1] "WBGene00010957" "WBGene00010958" "WBGene00010959" "WBGene00010960"
## [5] "WBGene00010961" "WBGene00000829"
# preprocess scRNA
all.genes <- rownames(celegans)
celegans <- Seurat::NormalizeData(
    celegans,
    normalization.method = "LogNormalize",
    scale.factor = 10000) |>
    Seurat::FindVariableFeatures(selection.method = "vst",
    nfeatures = 2000) |>
    Seurat::ScaleData(features = all.genes) |>
    Seurat::RunPCA(dims=50) |>
    Seurat::RunUMAP(dims = 1:10)
Seurat::Idents(celegans) <- "embryo.time.bin"
p1 <- DimPlot(celegans)
print(p1)

# overwrite NA in cell.type + add embryo.time.bin x cell.type
celegans@meta.data$cell.type[is.na(celegans@meta.data$cell.type)] <- "notClassified"
celegans@meta.data["embryo.time.bin.cell.type"] <- paste0(
    unlist(celegans@meta.data["embryo.time.bin"]),
    "-",
    unlist(celegans@meta.data["cell.type"]))

# load Caenorhabditis elegans gene age estimation
celegans_ps <- readr::read_tsv(file = system.file("extdata",
    "Sun2021_Orthomap.tsv", package = "scTEI"))
table(celegans_ps$Phylostratum)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13 
## 1290 5434 4666  603 1039  808  274  884  590  511  384 1113 1277 1167
# define Phylostratum
ps_vec <- setNames(as.numeric(celegans_ps$Phylostratum),
    celegans_ps$GeneID)
# add TEI values
celegans@meta.data["TEI"] <- TEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = ps_vec
)
## Use multiple threads to calculate TEI on sparseMatrix

celegans@meta.data["TEI"] <- TEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = ps_vec,
    split = 1000,
    threads = 2)
# make FeaturePlot
p2 <- FeaturePlot(
    object = celegans,
    features = "TEI",
    min.cutoff='q05',
    max.cutoff='q95',
    cols = viridis(3))
print(p2)

# make RidgePlot
p3 <- RidgePlot(object = celegans,
    features = "TEI",
    group.by = "embryo.time.bin")
print(p3)
## Picking joint bandwidth of 0.0512

# make RidgePlot by cell type
Seurat::Idents(celegans) <- "cell.type"
p4 <- RidgePlot(object = celegans,
    features = "TEI",
    group.by = "cell.type") +
    Seurat::NoLegend()
print(p4)
## Picking joint bandwidth of 0.054

# subset to specific cell type - ADF + notClassified
ADF <- subset(celegans,
    cells = c(grep("ADF",
        celegans@meta.data$cell.type),
    grep("notClassified",
         celegans@meta.data$cell.type)))
p5 <- DimPlot(object = ADF)
Seurat::Idents(ADF) <- "embryo.time.bin"
p6 <- DimPlot(object = ADF)
p7 <- RidgePlot(ADF, "TEI")
Seurat::Idents(ADF) <- "embryo.time.bin.cell.type"
p8 <- RidgePlot(object = ADF, features = "TEI") +
    Seurat::NoLegend()

# make grid plot
print(plot_grid(p5, p6))

print(plot_grid(p7, p8))
## Picking joint bandwidth of 0.0608
## Picking joint bandwidth of 0.0704

# use pMatrix as data to cluster

# Seurat v4

#celegans.TEI <- Seurat::CreateSeuratObject(counts = celegans@assays$RNA@counts,
#    meta.data = celegans@meta.data, row.names = rownames(celegans@assays$RNA@counts))
#celegans.TEI@assays$RNA@data <- pMatrixTEI(
#    ExpressionSet = celegans.TEI@assays$RNA@counts,
#    Phylostratum = ps_vec
#)

# Seurat v5

celegans.TEI <- Seurat::CreateSeuratObject(
    counts = GetAssayData(celegans, assay="RNA", layer="counts"),
    meta.data = celegans@meta.data)
celegans.TEI <- SetAssayData(
    object = celegans.TEI,
    layer = "data",
    new.data = pMatrixTEI(
        ExpressionSet = GetAssayData(celegans.TEI, assay="RNA", layer="counts"),
        Phylostratum = ps_vec
    )
)
all.genes <- rownames(GetAssayData(celegans.TEI, assay="RNA", layer="data"))
celegans.TEI <- Seurat::FindVariableFeatures(
    celegans.TEI,
    selection.method = "vst",
    nfeatures = 2000) %>%
    Seurat::ScaleData(do.scale = FALSE, do.center = FALSE,
    features = all.genes) %>%
    Seurat::RunPCA(dims=50) %>%
    Seurat::RunUMAP(dims = 1:20)
## Finding variable features for layer counts
## Warning in PrepDR5(object = object, features = features, layer = layer, : The
## following features were not available: WBGene00018440, WBGene00219374,
## WBGene00008328, WBGene00008881, WBGene00020009, WBGene00010424, WBGene00015276,
## WBGene00219372.
## PC_ 1 
## Positive:  WBGene00006399, WBGene00194983, WBGene00012214, WBGene00017584, WBGene00022648, WBGene00020261, WBGene00021804, WBGene00269431, WBGene00012254, WBGene00045208 
##     WBGene00015774, WBGene00012301, WBGene00219421, WBGene00006051, WBGene00219280, WBGene00016724, WBGene00044562, WBGene00018234, WBGene00044502, WBGene00045272 
##     WBGene00022395, WBGene00018850, WBGene00194680, WBGene00206383, WBGene00022339, WBGene00195086, WBGene00043054, WBGene00077773, WBGene00020775, WBGene00206507 
## Negative:  WBGene00002085, WBGene00009212, WBGene00018352, WBGene00015610, WBGene00022112, WBGene00044066, WBGene00018606, WBGene00016071, WBGene00012899, WBGene00013553 
##     WBGene00002253, WBGene00014149, WBGene00020537, WBGene00016711, WBGene00020205, WBGene00016457, WBGene00012420, WBGene00021299, WBGene00017713, WBGene00011156 
##     WBGene00019039, WBGene00000217, WBGene00077531, WBGene00009861, WBGene00014156, WBGene00000468, WBGene00010100, WBGene00013098, WBGene00009180, WBGene00019918 
## PC_ 2 
## Positive:  WBGene00002085, WBGene00002086, WBGene00015081, WBGene00019516, WBGene00001534, WBGene00006047, WBGene00006399, WBGene00194983, WBGene00012214, WBGene00022648 
##     WBGene00020261, WBGene00017584, WBGene00021804, WBGene00269431, WBGene00012254, WBGene00045208, WBGene00015774, WBGene00012301, WBGene00219280, WBGene00006051 
##     WBGene00219421, WBGene00044562, WBGene00016724, WBGene00018234, WBGene00044502, WBGene00022395, WBGene00045272, WBGene00018850, WBGene00206383, WBGene00194680 
## Negative:  WBGene00018352, WBGene00018606, WBGene00009212, WBGene00015610, WBGene00016071, WBGene00016711, WBGene00021915, WBGene00017611, WBGene00014149, WBGene00007641 
##     WBGene00017326, WBGene00019039, WBGene00016463, WBGene00016297, WBGene00007883, WBGene00012420, WBGene00011763, WBGene00020772, WBGene00044066, WBGene00016457 
##     WBGene00045386, WBGene00019482, WBGene00013553, WBGene00021272, WBGene00022112, WBGene00020537, WBGene00009983, WBGene00020205, WBGene00002253, WBGene00012899 
## PC_ 3 
## Positive:  WBGene00018352, WBGene00009212, WBGene00018606, WBGene00016071, WBGene00015610, WBGene00019039, WBGene00014149, WBGene00044066, WBGene00020205, WBGene00020537 
##     WBGene00022112, WBGene00013553, WBGene00017611, WBGene00009861, WBGene00077531, WBGene00002253, WBGene00021299, WBGene00019035, WBGene00017713, WBGene00012899 
##     WBGene00000217, WBGene00014156, WBGene00007641, WBGene00045183, WBGene00016578, WBGene00020362, WBGene00022557, WBGene00007934, WBGene00019591, WBGene00019703 
## Negative:  WBGene00045386, WBGene00016463, WBGene00003890, WBGene00007883, WBGene00194921, WBGene00009983, WBGene00045336, WBGene00011500, WBGene00016711, WBGene00017068 
##     WBGene00003642, WBGene00021915, WBGene00010864, WBGene00011763, WBGene00010848, WBGene00008484, WBGene00021272, WBGene00004346, WBGene00003934, WBGene00011869 
##     WBGene00016254, WBGene00044020, WBGene00022441, WBGene00003851, WBGene00019177, WBGene00005246, WBGene00019482, WBGene00021395, WBGene00002085, WBGene00015901 
## PC_ 4 
## Positive:  WBGene00018352, WBGene00002085, WBGene00017611, WBGene00021915, WBGene00007641, WBGene00016297, WBGene00011763, WBGene00020772, WBGene00016463, WBGene00019482 
##     WBGene00007883, WBGene00015062, WBGene00044068, WBGene00016254, WBGene00007490, WBGene00007675, WBGene00011500, WBGene00044419, WBGene00021272, WBGene00016532 
##     WBGene00009983, WBGene00019177, WBGene00022419, WBGene00010632, WBGene00017184, WBGene00044232, WBGene00003642, WBGene00020196, WBGene00016005, WBGene00004346 
## Negative:  WBGene00009212, WBGene00044066, WBGene00015610, WBGene00022112, WBGene00018606, WBGene00002253, WBGene00016071, WBGene00012899, WBGene00013553, WBGene00016457 
##     WBGene00020537, WBGene00021299, WBGene00014149, WBGene00011156, WBGene00019039, WBGene00020205, WBGene00017713, WBGene00014156, WBGene00007934, WBGene00010100 
##     WBGene00019918, WBGene00016578, WBGene00077531, WBGene00009861, WBGene00045183, WBGene00011475, WBGene00020362, WBGene00022557, WBGene00008510, WBGene00016520 
## PC_ 5 
## Positive:  WBGene00045386, WBGene00018352, WBGene00003890, WBGene00016071, WBGene00015610, WBGene00010848, WBGene00018606, WBGene00019039, WBGene00020205, WBGene00020537 
##     WBGene00005246, WBGene00194921, WBGene00019416, WBGene00013553, WBGene00009861, WBGene00022112, WBGene00010693, WBGene00017713, WBGene00019035, WBGene00077531 
##     WBGene00021299, WBGene00006778, WBGene00045183, WBGene00021773, WBGene00045247, WBGene00022617, WBGene00001242, WBGene00020785, WBGene00009862, WBGene00022063 
## Negative:  WBGene00016463, WBGene00045336, WBGene00016711, WBGene00007883, WBGene00021915, WBGene00009983, WBGene00011763, WBGene00016254, WBGene00017068, WBGene00044020 
##     WBGene00011500, WBGene00003934, WBGene00003642, WBGene00016297, WBGene00004346, WBGene00002105, WBGene00011869, WBGene00020582, WBGene00019177, WBGene00010864 
##     WBGene00022441, WBGene00019482, WBGene00003745, WBGene00018164, WBGene00017076, WBGene00164973, WBGene00017236, WBGene00003851, WBGene00044232, WBGene00010632
## 13:17:23 UMAP embedding parameters a = 0.9922 b = 1.112
## 13:17:23 Read 6188 rows and found 20 numeric columns
## 13:17:23 Using Annoy for neighbor search, n_neighbors = 30
## 13:17:23 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 13:17:23 Writing NN index file to temp file /var/folders/wy/d_ysgzdd3_bdgz02jfdy12z00000gp/T//RtmpqwpWAk/filee94538642fb1
## 13:17:23 Searching Annoy index using 1 thread, search_k = 3000
## 13:17:24 Annoy recall = 100%
## 13:17:24 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 13:17:25 Initializing from normalized Laplacian + noise (using RSpectra)
## 13:17:25 Commencing optimization for 500 epochs, with 252246 positive edges
## 13:17:25 Using rng type: pcg
## 13:17:31 Optimization finished
## Use multiple threads to calculate pMatrix

# Seurat v4

#celegans.TEI@assays$RNA@data <- pMatrixTEI(
#    ExpressionSet = celegans.TEI@assays$RNA@counts,
#    Phylostratum = ps_vec,
#    split = 1000,
#    threads = 2)

# Seurat v5

celegans.TEI <- SetAssayData(
    object = celegans.TEI,
    layer = "data",
    new.data = pMatrixTEI(
        ExpressionSet = GetAssayData(celegans.TEI, assay="RNA", layer="counts"),
        Phylostratum = ps_vec,
        split = 1000,
        threads = 2
    )
)
Seurat::Idents(celegans.TEI) <- "embryo.time.bin"
p9 <- DimPlot(celegans.TEI)
print(p9)

# make FeaturePlot
p10 <- FeaturePlot(
    object = celegans.TEI,
    features = "TEI",
    min.cutoff='q05',
    max.cutoff='q95',
    cols = viridis(3))
print(p10)

Seurat::Idents(celegans) <- "cell.type"
p11 <- DimPlot(celegans)
Seurat::Idents(celegans.TEI) <- "cell.type"
p12 <- DimPlot(celegans.TEI)
# make grid plot
print(plot_grid(p2, p11))

print(plot_grid(p10, p12))

# get TEI per strata
pS <- pStrataTEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = ps_vec
)
## Use multiple threads to calculate pStrata

# Seurat v4

#pS <- pStrataTEI(
#    ExpressionSet = celegans@assays$RNA@counts,
#    Phylostratum = 
#        setNames(as.numeric(celegans_ps$Phylostratum),
#        celegans_ps$GeneID),
#    split = 1000,
#    threads = 2)

# Seurat v5

pS <- pStrataTEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = 
        setNames(as.numeric(celegans_ps$Phylostratum),
        celegans_ps$GeneID),
    split = 1000,
    threads = 2)
# get permutations

# Seurat v4

#bM <- bootTEI(
#    ExpressionSet = celegans@assays$RNA@counts,
#    Phylostratum = ps_vec,
#    permutations = 100
#)

# Seurat v5

bM <- bootTEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = ps_vec,
    permutations = 100
)
## Use multiple threads to get permutations

# Seurat v4

#bM <- bootTEI(
#    ExpressionSet = celegans@assays$RNA@counts,
#    Phylostratum = ps_vec,
#    permutations = 100,
#    split = 1000,
#    threads = 2
#)

# Seurat v5

bM <- bootTEI(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="counts"),
    Phylostratum = ps_vec,
    permutations = 100,
    split = 1000,
    threads = 2
)
# get mean expression matrix

# Seurat v4

#meanMatrix <- REMatrix(
#    ExpressionSet = celegans@assays$RNA@data,
#    Phylostratum = ps_vec
#)

# Seurat v5

meanMatrix <- REMatrix(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="data"),
    Phylostratum = ps_vec
)
# get mean expression matrix with groups

cell_groups <- setNames(
    lapply(names(table(celegans@meta.data$cell.type)),
        function(x){which(celegans@meta.data$cell.type==x)}),
    names(table(celegans@meta.data$cell.type))
)

# Seurat v4

#meanMatrix_by_cell.type <- REMatrix(
#    ExpressionSet = celegans@assays$RNA@scale.data,
#    Phylostratum = ps_vec,
#    groups = cell_groups
#)

# Seurat v5

meanMatrix_by_cell.type <- REMatrix(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="scale.data"),
    Phylostratum = ps_vec,
    groups = cell_groups
)

ComplexHeatmap::Heatmap(meanMatrix_by_cell.type,
    cluster_rows = FALSE, cluster_columns = FALSE,
    col = viridis::viridis(3))

# computing relative expression profile over cell types

cell_groups <- setNames(
    lapply(names(table(celegans@meta.data$cell.type)),
        function(x){which(celegans@meta.data$cell.type==x)}),
    names(table(celegans@meta.data$cell.type))
)

# Seurat v4

#reMatrix_by_cell.type <- REMatrix(
#    ExpressionSet = celegans@assays$RNA@scale.data,
#    Phylostratum = ps_vec,
#    groups = cell_groups,
#    by = "row"
#)

# Seurat v5

reMatrix_by_cell.type <- REMatrix(
    ExpressionSet = GetAssayData(celegans, assay="RNA", layer="scale.data"),
    Phylostratum = ps_vec,
    groups = cell_groups,
    by = "row"
)

ComplexHeatmap::Heatmap(reMatrix_by_cell.type,
    cluster_rows = FALSE)

Monocle3 - example

For Monocle3, we again use scRNA data obtained from (Packer et al. 2019) for the model organism Caenorhabditis elegans (Packer et al. 2019) and phylogenetic data obtained from (Sun, Rödelsperger, and Sommer 2021) (Sun2021_Supplemental_Table_S6).

Example: 6k C. elegans embryos from Packer and Zhu et al (2019)

# load Packer and Zhu et al (2019) data set

#expression_matrix <- readRDS(
#    url(
#    paste0("http://staff.washington.edu/hpliner/data/",
#    "packer_embryo_expression.rds")
#    )
#)

#cell_metadata <- readRDS(
#    url(
#    paste0("http://staff.washington.edu/hpliner/data/",
#    "packer_embryo_colData.rds")
#    )
#)

#gene_annotation <- readRDS(
#    url(
#    paste0("http://staff.washington.edu/hpliner/data/",
#    "packer_embryo_rowData.rds")
#    )
#)

packer_embryo_expression_path <- system.file("extdata",
    "packer_embryo_expression.rds",
    package = "scTEI")
expression_matrix <- readRDS(packer_embryo_expression_path)

packer_embryo_colData_path <- system.file("extdata",
    "packer_embryo_colData.rds",
    package = "scTEI")
cell_metadata <- readRDS(packer_embryo_colData_path)

packer_embryo_rowData_path <- system.file("extdata",
    "packer_embryo_rowData.rds",
    package = "scTEI")
gene_annotation <- readRDS(packer_embryo_rowData_path)

cds <- new_cell_data_set(
    expression_data = expression_matrix,
    cell_metadata = cell_metadata,
    gene_metadata = gene_annotation
)
# preprocess scRNA
cds <- preprocess_cds(cds, num_dim = 50)
cds <- align_cds(cds, alignment_group = "batch",
    residual_model_formula_str = "~ bg.300.loading +
        bg.400.loading + bg.500.1.loading + bg.500.2.loading +
        bg.r17.loading + bg.b01.loading + bg.b02.loading")
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)
cds <- learn_graph(cds)
##   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
p1 <- plot_cells(cds,
    label_groups_by_cluster=FALSE,
    color_cells_by = "embryo.time.bin",
    group_label_size = 5,
    label_cell_groups = FALSE,
    label_leaves = TRUE,
    label_branch_points = TRUE,
    graph_label_size=1.5)
print(p1)

# order cells - select youngest time point according to embryo.time
# - center of the plot
cds <- order_cells(cds,
    root_cells = colnames(cds)[which(
    colData(cds)$embryo.time == min(colData(cds)$embryo.time))])
colData(cds)["pseudotime"] <- pseudotime(cds)

# plot by pseudotime
p2 <- plot_cells(cds,
    label_groups_by_cluster=FALSE,
    color_cells_by = "pseudotime",
    group_label_size = 5,
    label_cell_groups = FALSE,
    label_leaves = TRUE,
    label_branch_points = TRUE,
    graph_label_size=1.5)
print(p2)

# load Caenorhabditis elegans gene age estimation
celegans_ps <- readr::read_tsv(file = system.file("extdata",
    "Sun2021_Orthomap.tsv", package = "scTEI"))
## Rows: 20040 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): GeneID
## dbl (1): Phylostratum
## 
##  Use `spec()` to retrieve the full column specification for this data.
##  Specify the column types or set `show_col_types = FALSE` to quiet this message.
table(celegans_ps$Phylostratum)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13 
## 1290 5434 4666  603 1039  808  274  884  590  511  384 1113 1277 1167
# define Phylostratum
ps_vec <- setNames(as.numeric(celegans_ps$Phylostratum),
    celegans_ps$GeneID)
# add TEI values
colData(cds)["TEI"] <- TEI(
    ExpressionSet = counts(cds),
    Phylostratum = ps_vec
)
## Use multiple threads to calculate TEI on sparseMatrix

colData(cds)["TEI"] <- TEI(ExpressionSet = counts(cds),
    Phylostratum = 
        setNames(as.numeric(celegans_ps$Phylostratum),
        celegans_ps$GeneID),
    split = 1000,
    threads = 2)
# make FeaturePlot
p3 <- plot_cells(cds,
    label_groups_by_cluster=FALSE,
    color_cells_by = "TEI",
    group_label_size = 5,
    label_cell_groups = FALSE,
    label_leaves = TRUE,
    label_branch_points = TRUE,
    graph_label_size=1.5,
    cell_size = 1)
print(p3)

# make Boxplot
p4 <- ggplot2::ggplot(data.frame(colData(cds)),
    aes(x=embryo.time.bin, y=TEI, fill=embryo.time.bin)) +
    geom_violin() +
    geom_boxplot(width=0.1)
print(p4)

# make scatter plot - TEI vs pseudotime
p5 <- ggplot(data.frame(colData(cds)),
    aes(x=pseudotime, y=TEI, col=pseudotime)) +
    geom_point()
print(p5)

p6 <- ggplot(data.frame(colData(cds)),
    aes(x=pseudotime, y=TEI, col=embryo.time.bin)) +
    geom_point()
print(p6)

# make grid plot
print(plot_grid(p1, p2, p3, p4))

References

Cantalapiedra, Carlos P, Ana Hernández-Plaza, Ivica Letunic, Peer Bork, and Jaime Huerta-Cepas. 2021. “eggNOG-Mapper V2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale.” Molecular Biology and Evolution 38 (12): 5825–29.
Cao, Junyue, Malte Spielmann, Xiaojie Qiu, Xingfan Huang, Daniel M Ibrahim, Andrew J Hill, Fan Zhang, et al. 2019. “The Single-Cell Transcriptional Landscape of Mammalian Organogenesis.” Nature 566 (7745): 496–502.
Cazet, Jack, Stefan Siebert, Hannah Morris Little, Philip Bertemes, Abby S Primack, Peter Ladurner, Matthias Achrainer, et al. 2022. “New Hydra Genomes Reveal Conserved Principles of Hydrozoan Transcriptional Regulation.” bioRxiv.
Domazet-Lošo, Tomislav, Josip Brajković, and Diethard Tautz. 2007. “A Phylostratigraphy Approach to Uncover the Genomic History of Major Adaptations in Metazoan Lineages.” Trends in Genetics 23 (11): 533–39.
Domazet-Lošo, Tomislav, and Diethard Tautz. 2010. “A Phylogenetically Based Transcriptome Age Index Mirrors Ontogenetic Divergence Patterns.” Nature 468 (7325): 815–18.
Drost, Hajk-Georg, Philipp Janitza, Ivo Grosse, and Marcel Quint. 2017. “Cross-Kingdom Comparison of the Developmental Hourglass.” Current Opinion in Genetics & Development 45: 69–75.
Emms, David M, and Steven Kelly. 2019. “OrthoFinder: Phylogenetic Orthology Inference for Comparative Genomics.” Genome Biology 20 (1): 1–14.
Gossmann, Toni I, Dounia Saleh, Marc W Schmid, Michael A Spence, and Karl J Schmid. 2016. “Transcriptomes of Plant Gametophytes Have a Higher Proportion of Rapidly Evolving and Young Genes Than Sporophytes.” Molecular Biology and Evolution 33 (7): 1669–78.
Huerta-Cepas, Jaime, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, et al. 2019. “eggNOG 5.0: A Hierarchical, Functionally and Phylogenetically Annotated Orthology Resource Based on 5090 Organisms and 2502 Viruses.” Nucleic Acids Research 47 (D1): D309–14.
Julca, Irene, Camilla Ferrari, Marı́a Flores-Tornero, Sebastian Proost, Ann-Cathrin Lindner, Dieter Hackenberg, Lenka Steinbachová, et al. 2021. “Comparative Transcriptomic Analysis Reveals Conserved Programmes Underpinning Organogenesis and Reproduction in Land Plants.” Nature Plants 7 (8): 1143–59.
Lechner, Marcus, Sven Findeiß, Lydia Steiner, Manja Marz, Peter F Stadler, and Sonja J Prohaska. 2011. “Proteinortho: Detection of (Co-) Orthologs in Large-Scale Analysis.” BMC Bioinformatics 12 (1): 1–9.
Linard, Benjamin, Ingo Ebersberger, Shawn E McGlynn, Natasha Glover, Tomohiro Mochizuki, Mateus Patricio, Odile Lecompte, et al. 2021. “Ten Years of Collaborative Progress in the Quest for Orthologs.” Molecular Biology and Evolution 38 (8): 3033–45.
Liu, Jialin, Michael Frochaux, Vincent Gardeux, Bart Deplancke, and Marc Robinson-Rechavi. 2020. “Inter-Embryo Gene Expression Variability Recapitulates the Hourglass Pattern of Evo-Devo.” BMC Biology 18 (1): 1–12.
Packer, Jonathan S, Qin Zhu, Chau Huynh, Priya Sivaramakrishnan, Elicia Preston, Hannah Dueck, Derek Stefanik, et al. 2019. “A Lineage-Resolved Molecular Atlas of c. Elegans Embryogenesis at Single-Cell Resolution.” Science 365 (6459): eaax1971.
Proost, Sebastian, Michiel Van Bel, Lieven Sterck, Kenny Billiau, Thomas Van Parys, Yves Van de Peer, and Klaas Vandepoele. 2009. “PLAZA: A Comparative Genomics Resource to Study Gene and Genome Evolution in Plants.” The Plant Cell 21 (12): 3718–31.
Quint, Marcel, Hajk-Georg Drost, Alexander Gabel, Kristian Karsten Ullrich, Markus Bönn, and Ivo Grosse. 2012. “A Transcriptomic Hourglass in Plant Embryogenesis.” Nature 490 (7418): 98–101.
Satija, Rahul, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. 2015. “Spatial Reconstruction of Single-Cell Gene Expression Data.” Nature Biotechnology 33 (5): 495–502.
Schneider, Adrian, Christophe Dessimoz, and Gaston H Gonnet. 2007. “OMA Browser—Exploring Orthologous Relations Across 352 Complete Genomes.” Bioinformatics 23 (16): 2180–82.
Sun, Shuai, Christian Rödelsperger, and Ralf J Sommer. 2021. “Single Worm Transcriptomics Identifies a Developmental Core Network of Oscillating Genes with Deep Conservation Across Nematodes.” Genome Research 31 (9): 1590–1601.

Session Info

## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] monocle3_1.3.7                   SingleCellExperiment_1.28.1     
##  [3] SummarizedExperiment_1.36.0      GenomicRanges_1.58.0            
##  [5] GenomeInfoDb_1.42.3              IRanges_2.40.1                  
##  [7] S4Vectors_0.44.0                 MatrixGenerics_1.18.1           
##  [9] matrixStats_1.5.0                Biobase_2.66.0                  
## [11] BiocGenerics_0.52.0              celegans.embryo.SeuratData_0.1.0
## [13] SeuratData_0.2.2.9001            Seurat_5.2.1                    
## [15] SeuratObject_5.0.2               sp_2.2-0                        
## [17] ComplexHeatmap_2.22.0            ggplot2_3.5.1                   
## [19] cowplot_1.1.3                    viridis_0.6.5                   
## [21] viridisLite_0.4.2                readr_2.1.5                     
## [23] plyr_1.8.9                       dplyr_1.1.4                     
## [25] scTEI_0.1.1                      BiocStyle_2.34.0                
## 
## loaded via a namespace (and not attached):
##   [1] fs_1.6.5                  spatstat.sparse_3.1-0    
##   [3] httr_1.4.7                RColorBrewer_1.1-3       
##   [5] doParallel_1.0.17         tools_4.4.2              
##   [7] sctransform_0.4.1         ResidualMatrix_1.16.0    
##   [9] R6_2.6.1                  lazyeval_0.2.2           
##  [11] uwot_0.2.3                GetoptLong_1.0.5         
##  [13] withr_3.0.2               gridExtra_2.3            
##  [15] myTAI_2.0.0               progressr_0.15.1         
##  [17] taxizedb_0.3.1            cli_3.6.4                
##  [19] textshaping_1.0.0         Cairo_1.6-2              
##  [21] spatstat.explore_3.3-4    fastDummies_1.7.5        
##  [23] labeling_0.4.3            sass_0.4.9               
##  [25] spatstat.data_3.1-4       proxy_0.4-27             
##  [27] ggridges_0.5.6            pbapply_1.7-2            
##  [29] pkgdown_2.1.1             systemfonts_1.2.1        
##  [31] parallelly_1.42.0         limma_3.62.2             
##  [33] rstudioapi_0.17.1         RSQLite_2.3.9            
##  [35] generics_0.1.3            shape_1.4.6.1            
##  [37] ica_1.0-3                 spatstat.random_3.3-2    
##  [39] vroom_1.6.5               Matrix_1.7-2             
##  [41] ggbeeswarm_0.7.2          abind_1.4-8              
##  [43] lifecycle_1.0.4           yaml_2.3.10              
##  [45] SparseArray_1.6.2         Rtsne_0.17               
##  [47] blob_1.2.4                promises_1.3.2           
##  [49] crayon_1.5.3              miniUI_0.1.1.1           
##  [51] lattice_0.22-6            beachmat_2.22.0          
##  [53] magick_2.8.5              pillar_1.10.1            
##  [55] knitr_1.49                rjson_0.2.23             
##  [57] boot_1.3-31               future.apply_1.11.3      
##  [59] codetools_0.2-20          glue_1.8.0               
##  [61] leidenbase_0.1.32         spatstat.univar_3.1-1    
##  [63] data.table_1.17.0         vctrs_0.6.5              
##  [65] png_0.1-8                 spam_2.11-1              
##  [67] Rdpack_2.6.2              gtable_0.3.6             
##  [69] assertthat_0.2.1          cachem_1.1.0             
##  [71] xfun_0.51                 rbibutils_2.3            
##  [73] S4Arrays_1.6.0            mime_0.12                
##  [75] reformulas_0.4.0          survival_3.8-3           
##  [77] iterators_1.0.14          statmod_1.5.0            
##  [79] fitdistrplus_1.2-2        ROCR_1.0-11              
##  [81] nlme_3.1-167              bit64_4.6.0-1            
##  [83] RcppAnnoy_0.0.22          bslib_0.9.0              
##  [85] irlba_2.3.5.1             vipor_0.4.7              
##  [87] KernSmooth_2.23-26        colorspace_2.1-1         
##  [89] DBI_1.2.3                 ggrastr_1.0.2            
##  [91] tidyselect_1.2.1          bit_4.5.0.1              
##  [93] compiler_4.4.2            curl_6.2.1               
##  [95] BiocNeighbors_2.0.1       desc_1.4.3               
##  [97] DelayedArray_0.32.0       plotly_4.10.4            
##  [99] bookdown_0.42             scales_1.3.0             
## [101] lmtest_0.9-40             rappdirs_0.3.3           
## [103] stringr_1.5.1             digest_0.6.37            
## [105] goftest_1.2-3             spatstat.utils_3.1-2     
## [107] minqa_1.2.8               rmarkdown_2.29           
## [109] XVector_0.46.0            htmltools_0.5.8.1        
## [111] pkgconfig_2.0.3           lme4_1.1-36              
## [113] sparseMatrixStats_1.18.0  dbplyr_2.5.0             
## [115] fastmap_1.2.0             rlang_1.1.5              
## [117] GlobalOptions_0.1.2       htmlwidgets_1.6.4        
## [119] UCSC.utils_1.2.0          shiny_1.10.0             
## [121] DelayedMatrixStats_1.28.1 farver_2.1.2             
## [123] jquerylib_0.1.4           zoo_1.8-13               
## [125] jsonlite_1.9.0            BiocParallel_1.40.0      
## [127] BiocSingular_1.22.0       magrittr_2.0.3           
## [129] scuttle_1.16.0            GenomeInfoDbData_1.2.13  
## [131] dotCall64_1.2             patchwork_1.3.0          
## [133] munsell_0.5.1             Rcpp_1.0.14              
## [135] reticulate_1.41.0         stringi_1.8.4            
## [137] zlibbioc_1.52.0           MASS_7.3-64              
## [139] parallel_4.4.2            listenv_0.9.1            
## [141] ggrepel_0.9.6             deldir_2.0-4             
## [143] splines_4.4.2             tensor_1.5               
## [145] hms_1.1.3                 circlize_0.4.16          
## [147] igraph_2.1.4              spatstat.geom_3.3-5      
## [149] RcppHNSW_0.6.0            ScaledMatrix_1.14.0      
## [151] reshape2_1.4.4            evaluate_1.0.3           
## [153] hoardr_0.5.5              BiocManager_1.30.25      
## [155] batchelor_1.22.0          nloptr_2.1.1             
## [157] tzdb_0.4.0                foreach_1.5.2            
## [159] httpuv_1.6.15             RANN_2.6.2               
## [161] tidyr_1.3.1               purrr_1.0.4              
## [163] polyclip_1.10-7           future_1.34.0            
## [165] clue_0.3-66               scattermore_1.2          
## [167] rsvd_1.0.5                xtable_1.8-4             
## [169] RSpectra_0.16-2           later_1.4.1              
## [171] ragg_1.3.3                tibble_3.2.1             
## [173] memoise_2.0.1             beeswarm_0.4.0           
## [175] cluster_2.1.8             globals_0.16.3