Data Modeling

LinkML-based data models that power interoperability across the BER data lakehouse — from sample context and sequencing metadata to credit and provenance.

Overview

Each BER resource has historically managed its metadata independently, with its own models and tools. The schemas below are the LinkML-based data models that BRIDGE partners are aligning to enable cross-lakehouse discovery, shared ETL tooling, and AI-assisted mapping between datasets. Every schema links back to its canonical site and repository for documentation, source, and issues.

BERtron Schema

The BERtron schema defines the shared metadata contract used by the BERtron federated search service to index data across partner lakehouses. It specifies how each lakehouse describes its holdings so that a single query can return results spanning EMSL, ESS-DIVE, JGI, KBase, and NMDC.

KBase Schemas

KBase contributes two complementary schemas to the BER data ecosystem. The Common Data Model (CDM) schema provides the core types and relationships KBase uses to represent biological entities, their annotations, and the analyses performed on them — giving downstream tools a stable, well-documented vocabulary for computational biology data. The CRediT metadata schema adopts the community-standard Contributor Roles Taxonomy to describe the specific contributions each person made to a dataset or analysis, supporting attribution, reproducibility, and credit across collaborative BER projects.

NMDC Schema

The NMDC Schema provides a common language for data exchange across systems and disciplines, which makes it a critical substrate for interoperability and collaboration. Within the NMDC, it supports the integration of microbiome data from medicine, agriculture, bioenergy, and environmental science into a cohesive platform.

This schema is organized into two modules:

  • A core set of elements, expressed in LinkML as classes, slots, types, and enumerations, that defines the structure of the NMDC schema.
  • A subset of the MIxS schema, developed by the Genomic Standards Consortium, that describes the environmental context of samples.
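
To make the four element kinds named above (classes, slots, types, and enumerations) more concrete, here is a minimal sketch in plain Python. It does not use the LinkML toolchain and is not taken from the NMDC schema: the `Sample` class, the `depth_m` and `state` slots, the `SampleStateEnum` enumeration, and the `validate` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical, heavily simplified schema. None of these names come from the
# NMDC schema; they only illustrate how the four element kinds relate.
schema = {
    # types: primitive value kinds that a slot can hold
    "types": {"string": str, "float": float},
    # enumerations: closed sets of permitted values
    "enums": {"SampleStateEnum": {"fresh", "frozen"}},
    # slots: named fields, each with a range (a type or an enumeration)
    "slots": {
        "id": {"range": "string", "required": True},
        "depth_m": {"range": "float", "required": False},
        "state": {"range": "SampleStateEnum", "required": False},
    },
    # classes: named groupings of slots
    "classes": {"Sample": {"slots": ["id", "depth_m", "state"]}},
}


def validate(record, class_name, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    allowed = schema["classes"][class_name]["slots"]

    # Report fields that the class does not declare.
    for name in record:
        if name not in allowed:
            problems.append(f"unknown slot: {name}")

    # Check each declared slot for presence and for a value in its range.
    for name in allowed:
        slot = schema["slots"][name]
        if name not in record:
            if slot["required"]:
                problems.append(f"missing required slot: {name}")
            continue
        value = record[name]
        slot_range = slot["range"]
        if slot_range in schema["enums"]:
            if value not in schema["enums"][slot_range]:
                problems.append(f"{name}: {value!r} not in {slot_range}")
        elif not isinstance(value, schema["types"][slot_range]):
            problems.append(f"{name}: expected {slot_range}")
    return problems


good = {"id": "s1", "depth_m": 0.5, "state": "frozen"}
bad = {"depth_m": "deep"}
print(validate(good, "Sample", schema))
print(validate(bad, "Sample", schema))
```

In a real LinkML schema these elements are declared in YAML and validated with the LinkML tooling. The dictionary form here only mirrors the relationships between them.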

BASIN-3D

BASIN-3D (Broker for Assimilation, Synthesis and Integration of eNvironmental Diverse, Distributed Datasets) is a data brokering framework that integrates environmental data from heterogeneous remote sources into a common model. Rather than replicating data, BASIN-3D synthesizes it on demand, giving researchers a unified view over sources like ESS-DIVE, USGS, and EPA.
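
The brokering idea described above, where records are translated into a common model at query time instead of being copied into a central store, can be sketched as follows. This is an illustration of the general pattern only and does not use or reproduce the BASIN-3D API. The two in-memory sources, their field names, the adapter functions, and the common `location` / `water_temp_c` shape are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]


def source_a() -> Iterable[Record]:
    # Stand-in for a call to a remote source that reports Fahrenheit.
    yield {"site": "A-1", "temp_f": 50.0}


def source_b() -> Iterable[Record]:
    # Stand-in for a second remote source with different field names.
    yield {"station_id": "B-7", "temperature_c": 12.5}


def adapt_a(raw: Record) -> Record:
    # Map source A's vocabulary and units onto the common model.
    celsius = (float(raw["temp_f"]) - 32) * 5 / 9
    return {"location": raw["site"], "water_temp_c": round(celsius, 2)}


def adapt_b(raw: Record) -> Record:
    # Source B already uses Celsius, so only the field names change.
    return {"location": raw["station_id"], "water_temp_c": raw["temperature_c"]}


Plugin = Tuple[Callable[[], Iterable[Record]], Callable[[Record], Record]]
PLUGINS: List[Plugin] = [(source_a, adapt_a), (source_b, adapt_b)]


def query(plugins: List[Plugin]) -> Iterator[Record]:
    """Fetch from every source on demand and yield records in the common model."""
    for fetch, adapt in plugins:
        for raw in fetch():
            yield adapt(raw)


for record in query(PLUGINS):
    print(record)
```

Because `query` is a generator, nothing is fetched or stored until a caller iterates over it, which is the property that distinguishes brokering from replication in this sketch.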

Bridge Schemas

Bridge Schemas are LinkML schemas that map and translate between related data models, enabling interoperability across BER partner resources. They provide explicit crosswalks that connect overlapping concepts in different schemas so that data described in one model can be understood and used in the context of another.
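
As a rough illustration of what a crosswalk does, the sketch below renames fields from one model into the vocabulary of another and keeps track of anything it cannot map. It is a simplified stand-in, not an excerpt from any Bridge Schema: the field names in `CROSSWALK` and the `translate` helper are hypothetical.

```python
# Hypothetical crosswalk from a source model's field names to a target model's.
CROSSWALK = {
    "sample_name": "name",
    "collection_date": "collected_on",
    "lat": "latitude",
}


def translate(record, crosswalk):
    """Split a record into fields mapped to the target model and fields left unmapped."""
    mapped, unmapped = {}, {}
    for key, value in record.items():
        if key in crosswalk:
            mapped[crosswalk[key]] = value
        else:
            # Keep unmapped fields visible so that gaps in the crosswalk are not silent.
            unmapped[key] = value
    return mapped, unmapped


source_record = {"sample_name": "core-12", "lat": 38.9, "operator": "field team"}
mapped, unmapped = translate(source_record, CROSSWALK)
print(mapped)
print(unmapped)
```

Returning the unmapped fields separately is a design choice for the example: it makes missing crosswalk entries easy to spot instead of dropping data quietly.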

LAMBDA

LAMBDA is a LinkML schema developed within the BER community to capture dataset-level metadata in a shape that supports harmonization across partner lakehouses. It focuses on the common descriptors needed to connect experiments, samples, and analyses regardless of the originating resource.