About

The BER Data Registry

The BER Data Registry is a metadata catalog for scientific data sources across lakehouses operated by Lawrence Berkeley National Laboratory (LBNL) in support of the Department of Energy's Biological and Environmental Research (BER) program. It provides a standardized way to describe, discover, and track data sources spanning multiple platforms, including the KBase Spark lakehouse at NERSC and Dremio-based data lakehouse environments.

Project Goals

  • Standardize data cataloging across BER-funded lakehouses and data platforms using a common metadata schema
  • Enable data discovery by providing consistent descriptions of data sources, their owners, access levels, and update schedules
  • Track data lifecycle including versioning, deprecation chains, and provenance
  • Align with community standards by building on DCAT v3, DCAT-US, and schema.org vocabularies
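
As a rough illustration of what tracking a deprecation chain could involve, the sketch below follows a chain of superseded data sources to the current one. The record layout, the `replaced_by` field, and the `genomes_v*` identifiers are hypothetical and chosen only for this example; they are not taken from the registry schema.

```python
from typing import Dict, List, Optional

# Hypothetical records: the "version" and "replaced_by" field names are
# assumptions for illustration, not fields confirmed by the registry schema.
SOURCES: Dict[str, Dict[str, Optional[str]]] = {
    "genomes_v1": {"version": "1.0", "replaced_by": "genomes_v2"},
    "genomes_v2": {"version": "2.0", "replaced_by": "genomes_v3"},
    "genomes_v3": {"version": "3.0", "replaced_by": None},
}


def resolve_chain(sources: Dict[str, Dict[str, Optional[str]]], source_id: str) -> List[str]:
    """Return the deprecation chain starting at source_id.

    The last element is the record that has no replacement, i.e. the
    current version. Raises KeyError for an unknown identifier and
    ValueError if the chain loops back on itself.
    """
    chain: List[str] = []
    current: Optional[str] = source_id
    while current is not None:
        if current in chain:
            raise ValueError(f"deprecation cycle detected at {current!r}")
        if current not in sources:
            raise KeyError(f"unknown data source: {current!r}")
        chain.append(current)
        current = sources[current]["replaced_by"]
    return chain


if __name__ == "__main__":
    chain = resolve_chain(SOURCES, "genomes_v1")
    print(" -> ".join(chain))
```

The cycle check matters in practice: a catalog that lets records point at each other as replacements would otherwise send a naive lookup into an endless loop.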

The Data Model

Design Principles

  • Standards-aligned -- Classes map to DCAT, Dublin Core, vCard, and PROV vocabularies via explicit class_uri and slot_uri annotations
  • FAIR-compliant -- Supports findable, accessible, interoperable, and reusable metadata for scientific datasets
  • Extensible -- Built with LinkML so the schema can evolve with new fields and enums as requirements grow
  • Multi-platform -- Covers both Spark and Dremio lakehouses with source-type and engine-level detail

Technology Stack

  • LinkML -- Schema definition language for linked data modeling
  • MkDocs Material -- Documentation site
  • just -- Command runner for build automation
  • uv -- Python package management
  • GitHub Actions -- CI/CD for testing and deployment

Core Schema Elements

| Element      | DCAT Mapping     | Purpose                                                    |
|--------------|------------------|------------------------------------------------------------|
| Catalog      | dcat:Catalog     | Top-level container for lakehouses                         |
| Lakehouse    | dcat:DataService | A hosting platform (Spark, Dremio) containing data sources |
| DataSource   | dcat:Dataset     | A cataloged data source within a lakehouse                 |
| ContactPoint | vcard:Kind       | Contact information for a data source                      |
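
To make the mapping concrete, the following minimal sketch expands the compact identifiers (CURIEs) listed above into full IRIs. The element-to-CURIE pairs come from the table, and the prefix expansions are the published DCAT and vCard namespaces. The helper names `expand_curie` and `class_iri` are hypothetical and not part of this project's code.

```python
# Namespace IRIs for the prefixes used in the table above.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "vcard": "http://www.w3.org/2006/vcard/ns#",
}

# Element -> CURIE, copied from the Core Schema Elements table.
CLASS_URIS = {
    "Catalog": "dcat:Catalog",
    "Lakehouse": "dcat:DataService",
    "DataSource": "dcat:Dataset",
    "ContactPoint": "vcard:Kind",
}


def expand_curie(curie: str) -> str:
    """Expand a CURIE such as 'dcat:Dataset' into a full IRI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES or not local:
        raise ValueError(f"unknown or malformed CURIE: {curie!r}")
    return PREFIXES[prefix] + local


def class_iri(element: str) -> str:
    """Look up the full IRI for a schema element (hypothetical helper)."""
    return expand_curie(CLASS_URIS[element])


if __name__ == "__main__":
    for name in CLASS_URIS:
        print(f"{name}: {class_iri(name)}")
```

In the schema itself these mappings are declared through class_uri annotations, so a lookup table like this would normally be derived from the LinkML source rather than maintained by hand.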

Background

This project was informed by the HPDF Data Catalog & Lakehouse Demo report (Cohoon & Paine, LBNL-2001745, December 2025), which recommended a unified metadata catalog for BER data assets across LBNL facilities. The registry schema captures the core metadata fields identified in that report while maintaining compatibility with federal data cataloging standards (DCAT-US).

Contributing

Contributions are welcome. Please see CONTRIBUTING.md for guidelines.

Development

Prerequisites: Python 3.9+, uv, just

# Clone and set up
git clone https://github.com/sierra-moxon/ber-data-registry.git
cd ber-data-registry
just install

# Run tests
just test

# Build and preview docs locally
just testdoc

Project Structure

src/ber_data_registry/schema/   # LinkML schema (source of truth)
tests/data/valid/               # Example YAML files (used as test fixtures)
tests/data/invalid/             # Invalid examples for negative testing
docs/                           # Generated documentation site

License

This project is licensed under the MIT License.

Acknowledgments

Built using the linkml-project-copier template.

Contact

For questions or issues, please open an issue on the GitHub repository.