Manage scRNA-seq datasets#

This illustrates how to manage scRNA-seq datasets in absence of a custom schema.

Hide code cell content
!lamin init --storage ./test-scrna --schema bionty
import lamindb as ln
import lnschema_bionty as lb

ln.settings.verbosity = 3  # show hints
lb.settings.auto_save_parents = (
    False  # don't recurse through ontology hierarchies to speed up CI
)
ln.track()

Preparation: registries#

Let’s assume that this is not the first time we work with experimental entities, and hence, our registries are already pre-populated:

Hide code cell content
# assume prepared registries

# strain
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0004472").save()
record = lb.ExperimentalFactor.filter(ontology_id="EFO:0004472").one()
record.add_synonym("C57BL/6N")

# developmental stage
lb.ExperimentalFactor.from_bionty(ontology_id="EFO:0001272").save()

# tissue
lb.Tissue.from_bionty(ontology_id="UBERON:0001542").save()

# cell types
ln.save(lb.CellType.from_values(["CL:0000115", "CL:0000738"], "ontology_id"))
ln.view(schema="bionty", orms=["CellType", "ExperimentalFactor", "Tissue"])

Mouse lymph node cells: Detmar22#

We’re working with mouse data:

lb.settings.species = "mouse"

Let’s look at a scRNA-seq count matrix in form of an AnnData object:

Hide code cell source
adata = ln.dev.datasets.anndata_mouse_sc_lymph_node()
# The column names are a bit lengthy, let's abbreviate them:
adata.obs.columns = (
    adata.obs.columns.str.replace("Sample Characteristic", "")
    .str.replace("Factor Value ", "Factor Value:", regex=True)
    .str.replace("Factor Value\[", "Factor Value:", regex=True)
    .str.replace(" Ontology Term\[", "ontology_id:", regex=True)
    .str.strip("[]")
    .str.replace("organism part", "tissue")
    .str.replace("organism", "species")
    .str.replace("developmental stage", "developmental_stage")
    .str.replace("cell type", "cell_type")
)

adata

When we create a File object from an AnnData, we’ll automatically link its feature sets and get information about unmapped categories:

file = ln.File.from_anndata(
    adata, description="Detmar22", var_ref=lb.Gene.ensembl_gene_id
)
file.save()

The file now has two linked feature sets:

file.features

Some of the metadata can be typed:

species = lb.Species.from_bionty(name="mouse")
strains = lb.ExperimentalFactor.from_values(adata.obs["strain"], "name")
dev_stages = lb.ExperimentalFactor.from_values(adata.obs["developmental_stage"], "name")
cell_types = lb.CellType.from_values(adata.obs["cell_type"], "name")
tissues = lb.Tissue.from_values(adata.obs["tissue"], "name")
file.features.add_labels(species, feature="species")
file.features.add_labels(strains + dev_stages + tissues + cell_types)

Metadata that doesn’t have corresponding ORMs:

labels = ln.Label.from_values(adata.obs["sex"])
labels += ln.Label.from_values(adata.obs["age"])
labels += ln.Label.from_values(adata.obs["genotype"])
labels += ln.Label.from_values(adata.obs["immunophenotype"])
file.features.add_labels(labels)

The file is now queryable by everything we linked:

file.describe()

Human immune cells: Conde22#

lb.settings.species = "human"
conde22 = ln.dev.datasets.anndata_human_immune_cells()
file = ln.File.from_anndata(
    conde22, description="Conde22", var_ref=lb.Gene.ensembl_gene_id
)
file.save()

The file has the following linked features:

file.features

Let’s now link observational metadata.

cell_types = lb.CellType.from_values(conde22.obs.cell_type, field="name")
efs = lb.ExperimentalFactor.from_values(conde22.obs.assay, field="name")
tissues = lb.Tissue.from_values(conde22.obs.tissue, field="name")
file.features.add_labels([lb.settings.species], feature="species")
file.features.add_labels(cell_types + efs + tissues)

As neither the core schema nor lnschema_bionty have a Donor table, we’re using Label to track donor ids:

donors = ln.Label.from_values(conde22.obs["donor_id"])
file.features.add_labels(donors)
file.describe()

A less well curated dataset#

Let’s now consider a dataset with less-well curated features:

pbcm68k = ln.dev.datasets.anndata_pbmc68k_reduced()

We see that this dataset is indexed by gene symbols:

pbcm68k.var.index

Because gene symbols don’t uniquely characterize an Ensembl ID, we’re linking more feature records to this file than columns in the AnnData.

Tip

Use Ensembl Gene IDs rather than gene Symbols to index genes.

file_pbcm68k = ln.File.from_anndata(
    pbcm68k, description="10x reference pbmc68k", var_ref=lb.Gene.symbol
)
file_pbcm68k.save()

Link cell types:

Hide code cell content
# inspect shows none of the terms are mappable
lb.CellType.inspect(pbcm68k.obs["cell_type"], "name")

# here we search the cell type names from the public ontology and grab the top match
# then add the cell type names from the pbcm68k as synonyms
celltype_bt = lb.CellType.bionty()
ontology_ids = []
for ct in pbcm68k.obs["cell_type"].unique():
    ontology_id = celltype_bt.search(ct).iloc[0].ontology_id
    record = lb.CellType.from_bionty(ontology_id=ontology_id)
    record.save()
    record.add_synonym(ct)
cell_types = lb.CellType.from_values(pbcm68k.obs["cell_type"], "name")
file_pbcm68k.features.add_labels(cell_types)
file_pbcm68k.describe()

🎉 Now let’s continue with data integration: Integrate scRNA datasets based on shared features/metadata