Manage scRNA-seq datasets

This guide illustrates how to manage scRNA-seq datasets in the absence of a custom schema.

import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = 3  # show hints
lb.settings.auto_save_parents = (
    False  # don't recurse through ontology hierarchies to speed up CI
)

ln.track()

Preparation: registries

Let’s assume that this is not the first time we’ve worked with experimental entities, so our registries are already populated:

ln.view(schema="bionty", orms=["CellType", "ExperimentalFactor", "Tissue"])

Mouse lymph node cells: Detmar22

We’re working with mouse data:

lb.settings.species = "mouse"

Let’s look at a scRNA-seq count matrix in the form of an AnnData object:

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)

file.save()
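The report about unmapped categories can be pictured as a comparison between the values of a metadata column and the names already present in a registry. The following is a minimal sketch in plain Python, not lamindb's implementation; the helper `split_mapped`, the registry content, and the column values are all hypothetical.

```python
def split_mapped(values, registry_names):
    """Split the unique values of a column into mapped and unmapped names.

    Hypothetical helper for illustration; order of first appearance is kept.
    """
    unique_values = list(dict.fromkeys(values))  # deduplicate, keep order
    mapped = [v for v in unique_values if v in registry_names]
    unmapped = [v for v in unique_values if v not in registry_names]
    return mapped, unmapped


# Made-up registry content and made-up column values
registry_names = {"T cell", "B cell"}
column = ["T cell", "B cell", "T cell", "my custom cell type"]

mapped, unmapped = split_mapped(column, registry_names)
print(mapped)    # ['T cell', 'B cell']
print(unmapped)  # ['my custom cell type']
```

Values that end up in `unmapped` are the ones that would need curation before they can be linked as typed metadata.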

The file now has two linked feature sets:

file.features

Some of the metadata can be typed, that is, mapped to records in the lnschema_bionty registries:

species = lb.Species.from_bionty(name="mouse")
strains = lb.ExperimentalFactor.from_values(adata.obs["strain"], "name")
dev_stages = lb.ExperimentalFactor.from_values(adata.obs["developmental_stage"], "name")
cell_types = lb.CellType.from_values(adata.obs["cell_type"], "name")
tissues = lb.Tissue.from_values(adata.obs["tissue"], "name")

file.features.add_labels(species, feature="species")
file.features.add_labels(strains + dev_stages + tissues + cell_types)

Metadata that doesn’t have a corresponding ORM can be tracked with generic labels:

labels = ln.Label.from_values(adata.obs["sex"])
labels += ln.Label.from_values(adata.obs["age"])
labels += ln.Label.from_values(adata.obs["genotype"])
labels += ln.Label.from_values(adata.obs["immunophenotype"])

file.features.add_labels(labels)
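The pattern above builds one list of labels by concatenating the results for several columns. A rough sketch of that idea in plain Python follows, assuming that one record is created per unique value; `ToyLabel`, `labels_from_values`, and the annotation values are hypothetical stand-ins, not the lamindb classes or real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLabel:
    """Hypothetical stand-in for a label record; not the lamindb Label class."""

    name: str


def labels_from_values(values):
    """Create one toy label per unique value, in order of first appearance."""
    return [ToyLabel(name=str(v)) for v in dict.fromkeys(values)]


# Made-up per-cell annotations, one entry per observation
obs = {
    "sex": ["female", "female", "male"],
    "age": ["8 weeks", "8 weeks", "8 weeks"],
}

labels = labels_from_values(obs["sex"])
labels += labels_from_values(obs["age"])  # plain list concatenation

print([label.name for label in labels])  # ['female', 'male', '8 weeks']
```

Although the toy table has three observations and two columns, only three labels result, because each distinct value is represented once.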

The file is now queryable by everything we linked:

file.describe()
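One way to picture "queryable by everything we linked" is an inverted index from label names to files. The sketch below uses plain Python with invented file names and labels; it does not reflect how lamindb stores or queries these links, nor the actual content of the datasets in this guide.

```python
from collections import defaultdict

# Made-up files and the label names linked to each of them
file_labels = {
    "file A": {"mouse", "lymph node", "T cell"},
    "file B": {"human", "blood", "T cell"},
}

# Inverted index: label name -> names of the files carrying that label
files_by_label = defaultdict(set)
for file_name, label_names in file_labels.items():
    for label_name in label_names:
        files_by_label[label_name].add(file_name)

print(sorted(files_by_label["T cell"]))  # ['file A', 'file B']
print(sorted(files_by_label["mouse"]))   # ['file A']
```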

Human immune cells: Conde22

lb.settings.species = "human"

conde22 = ln.dev.datasets.anndata_human_immune_cells()

file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
file.save()

The file has the following linked features:

file.features

Let’s now link observational metadata.

cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")

file.features.add_labels([lb.settings.species], feature="species")
file.features.add_labels(cell_types + efs + tissues)

As neither the core schema nor lnschema_bionty has a Donor table, we’re using Label to track donor IDs:

donors = ln.Label.from_values(conde22.obs["donor_id"])
file.features.add_labels(donors)

file.describe()

A less well-curated dataset

Let’s now consider a dataset with less well-curated features:

pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols:

pbcm68k.var.index

Because a gene symbol doesn’t uniquely identify an Ensembl gene ID, we’re linking more feature records to this file than there are columns in the AnnData.
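To see how a symbol-indexed matrix can end up with more linked gene records than columns, consider this toy illustration in plain Python. The symbols and IDs are placeholders invented for the example, not real gene symbols or Ensembl IDs, and the lookup table is a hypothetical stand-in for a gene registry.

```python
# Made-up symbol-to-ID mapping; the IDs are placeholders, not real Ensembl IDs
symbol_to_ids = {
    "GENE_A": ["ID_0001"],
    "GENE_B": ["ID_0002", "ID_0003"],  # one symbol, two gene records
    "GENE_C": ["ID_0004"],
}

var_index = ["GENE_A", "GENE_B", "GENE_C"]  # columns of the toy matrix

# Collect every gene record that any column symbol could refer to
linked_records = [
    (symbol, gene_id)
    for symbol in var_index
    for gene_id in symbol_to_ids.get(symbol, [])
]

print(len(var_index))       # 3
print(len(linked_records))  # 4
```

A single ambiguous symbol is enough to make the number of linked records exceed the number of columns.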

Tip

Use Ensembl gene IDs rather than gene symbols to index genes.

file_pbcm68k = ln.File.from_anndata(
    pbcm68k, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
file_pbcm68k.save()

Link cell types:

cell_types = lb.CellType.from_values(pbcm68k.obs["cell_type"], "name")
file_pbcm68k.features.add_labels(cell_types)

file_pbcm68k.describe()

🎉 Now let’s continue with data integration: Integrate scRNA datasets based on shared features/metadata