Preprocessing

The ZEBRA pipeline consists of the following five main steps:
  1. First, we manually curate and standardize key metadata from the collected datasets.
  2. Second, we detect and remove doublets in each dataset separately using Scrublet [1].
  3. Afterwards, we merge the datasets, i.e., we map matching genes onto each other. This is possible because almost all datasets were generated using a similar reference genome.
  4. The merged expression matrix is then further preprocessed by removing low-abundance genes and cells with few detected genes. We filter the dataset using scanpy [3] and the following thresholds [2]:
    • Minimum number of genes per cell: 200
    • Minimum number of cells per gene: 3
    • Maximum number of genes per cell: 7500
    • Maximum percentage of mitochondrial reads: 5%
  5. After preprocessing, we use the scVI framework to integrate the merged dataset. Based on the integrated data, we cluster the cells using the Leiden algorithm and reassign a harmonized cell type annotation to each cell. The annotation is refined using a marker-gene-based approach derived from the literature [4], as well as the majority voting scheme described in our paper. Finally, we use a pseudo-bulk approach to compute differentially expressed genes (DEGs) across cell types and conditions using the edgeR [5] package.
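
The filtering thresholds of step 4 and the pseudo-bulk aggregation of step 5 can be illustrated with a minimal, dependency-free sketch. This is not the ZEBRA implementation: the pipeline itself uses scanpy and edgeR, and the function names filter_counts and pseudobulk, the "MT-" prefix for mitochondrial genes, the inclusive threshold comparisons, the order of the filters (cells first, then genes), and the grouping by sample and cell type are all assumptions made for illustration.

```python
# Thresholds as listed in step 4.
MIN_GENES_PER_CELL = 200
MIN_CELLS_PER_GENE = 3
MAX_GENES_PER_CELL = 7500
MAX_PCT_MITO = 5.0


def filter_counts(counts, gene_names,
                  min_genes=MIN_GENES_PER_CELL,
                  min_cells=MIN_CELLS_PER_GENE,
                  max_genes=MAX_GENES_PER_CELL,
                  max_pct_mito=MAX_PCT_MITO,
                  mito_prefix="MT-"):
    """Return (kept_cell_indices, kept_gene_indices).

    counts is a list of rows (one per cell) of raw integer counts,
    with one column per entry of gene_names.
    Assumption: mitochondrial genes are recognised by a name prefix.
    """
    prefix = mito_prefix.upper()
    is_mito = [name.upper().startswith(prefix) for name in gene_names]

    # Cell-level filters: number of detected genes and mitochondrial share.
    kept_cells = []
    for i, row in enumerate(counts):
        n_genes = sum(1 for c in row if c > 0)
        total = sum(row)
        mito = sum(c for c, m in zip(row, is_mito) if m)
        pct_mito = 100.0 * mito / total if total > 0 else 0.0
        if min_genes <= n_genes <= max_genes and pct_mito <= max_pct_mito:
            kept_cells.append(i)

    # Gene-level filter, evaluated on the cells that survived (assumed order).
    kept_genes = []
    for j in range(len(gene_names)):
        n_cells = sum(1 for i in kept_cells if counts[i][j] > 0)
        if n_cells >= min_cells:
            kept_genes.append(j)

    return kept_cells, kept_genes


def pseudobulk(counts, sample_ids, cell_types):
    """Sum raw counts over all cells sharing a (sample, cell type) pair.

    Returns a dict mapping (sample, cell_type) to a summed count vector.
    Only the aggregation is shown; the statistical testing is not.
    """
    sums = {}
    for row, sample, cell_type in zip(counts, sample_ids, cell_types):
        key = (sample, cell_type)
        if key not in sums:
            sums[key] = [0] * len(row)
        sums[key] = [a + b for a, b in zip(sums[key], row)]
    return sums
```

In scanpy, the first two thresholds correspond to sc.pp.filter_cells(adata, min_genes=200) and sc.pp.filter_genes(adata, min_cells=3). The summed count vectors returned by pseudobulk are the kind of per-sample input that a count-based method such as edgeR expects.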

The resulting dataset provides a reference for further analyses, such as classifying new datasets and detecting previously unobserved marker genes. It can also be used to benchmark new and existing tools.

Reference Organism Tissue #Raw cells Condition Cell/nucleus DOI Download links

Abbreviations and terminology

Abbreviation Explanation Species

Changelog

  • 2023-11-21
    1. Addressed performance issues.
    2. Fixed metadata in the Ayhan et al. study.

References

[1] Scrublet: Wolock, Samuel L., Romain Lopez, and Allon M. Klein. “Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data.” Cell Systems 8, no. 4 (April 24, 2019): 281-291.e9. https://doi.org/10.1016/j.cels.2018.11.005.

[2] Preprocessing: Luecken, Malte D, and Fabian J Theis. “Current Best Practices in Single-Cell RNA-Seq Analysis: A Tutorial.” Molecular Systems Biology 15, no. 6 (June 2019): e8746. https://doi.org/10.15252/msb.20188746.

[3] scanpy: Wolf, F. Alexander, Philipp Angerer, and Fabian J. Theis. “SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis.” Genome Biology 19, no. 1 (February 6, 2018): 15. https://doi.org/10.1186/s13059-017-1382-0.

[4] Markers: Yang, Andrew C., Fabian Kern, Patricia M. Losada, Maayan R. Agam, Christina A. Maat, Georges P. Schmartz, Tobias Fehlmann, et al. “Dysregulation of Brain and Choroid Plexus Cell Types in Severe COVID-19.” Nature 595, no. 7868 (July 22, 2021): 565–71. https://doi.org/10.1038/s41586-021-03710-0.

[5] edgeR: Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26, no. 1 (January 1, 2010): 139–40. https://doi.org/10.1093/bioinformatics/btp616.