Johns Hopkins University, Est. 1876

America’s First Research University

A goal of the Data Science and AI Institute (DSAI) is to accelerate Johns Hopkins’ capacity to produce and steward trusted datasets—datasets whose provenance, quality, documentation, and governance support reliable and reproducible AI. Many high-value datasets are still created through slow, bespoke workflows for annotation, curation, validation, and documentation that do not scale to the breadth and rigor required for modern AI systems. To address this gap, DSAI seeks to catalyze innovation in the tools and infrastructure that make trusted datasets faster and more reliable to build, validate, maintain, and use responsibly. 

Examples include human-in-the-loop annotation systems, automated de-identification and privacy-protection pipelines, tools for measuring inter-rater agreement and resolving ambiguity, strategies for assessing and improving data quality, and frameworks for tracking data lineage and provenance. Other priorities include reusable toolkits for dataset creation, active-learning approaches that increase labeling efficiency, synthetic data generation methods that preserve privacy while retaining utility, scalable hosting and versioning systems, and platforms that support controlled data access. Projects that develop practical policies, standards, or guidelines that enable consistent dataset creation, stewardship, and use are also encouraged. The goal is to position Johns Hopkins as a leading source of high-quality, trustworthy data by investing in the methods and infrastructure that make trusted dataset production efficient, scalable, and broadly usable.

Rather than focusing solely on the creation of individual datasets, this call prioritizes approaches that accelerate the creation, curation, validation, and lifecycle management of trusted datasets at scale. Specifically, proposals should result in artifacts, not datasets. Such artifacts can include software toolkits, new policies or procedures, or similar outputs that enable the creation of datasets. Competitive proposals will demonstrate how new tools, pipelines, or frameworks can replace bespoke processes with repeatable and reusable systems while improving transparency, reproducibility, and accountability.

Scope and Priority Areas 

Projects should address one or more aspects of the data lifecycle relevant to trustworthy AI. Priority areas include: 

  1. Tools for dataset assurance, documentation, and validation 
    Software or frameworks that improve dataset quality and transparency across the lifecycle, including annotation support, inter-rater agreement measurement, automated quality checks, bias and coverage auditing, provenance tracking, and documentation generation. Contributions in this area should demonstrably increase the efficiency, consistency, and reliability of trusted dataset creation and validation. 
  2. Dataset-production workflows 
    Tools that enable the curation of datasets that are designed for reuse and accompanied by strong documentation, governance, and evaluation protocols. Particularly encouraged are projects that pair dataset creation with reusable pipelines, annotation frameworks, or validation workflows that can be applied across multiple domains. 
  3. Privacy-preserving and responsible data transformation and access 
    Methods or pipelines that enable the responsible use and sharing of sensitive data, including automated de-identification, synthetic data generation that preserves analytic utility, controlled-access mechanisms, and tools that support compliant and well-documented data release. 
  4. Infrastructure prototypes, hosting frameworks, and standards 
    Pilot systems that support scalable dataset stewardship, including data hosting platforms, versioning and release management tools, lineage tracking systems, and frameworks that facilitate integration with institutional or national data commons efforts. Projects that establish reusable standards, policies, or templates for dataset creation and stewardship are also encouraged. 
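
As an illustration of the kind of small, reusable component priority area 1 describes, the sketch below computes Cohen's kappa, one common inter-rater agreement statistic for two annotators. It is a minimal example only, not a required method or a DSAI tool; the function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two annotators on six items.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so values near 0 indicate chance-level labeling and values near 1 indicate strong agreement. A production tool would likely also need to handle more than two annotators and missing labels.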

Proposals outside these areas are welcome but should clearly articulate their relevance to trustworthy AI data practices. Proposals that are primarily intended to produce a specific dataset are outside the scope of this call. 

Eligibility 

  • The Principal Investigator (PI) must be a DSAI faculty member. 
  • Co-investigators may be from any Johns Hopkins division and may include non-faculty members with relevant expertise. 
  • MS/PhD students, undergraduates, and teaching faculty are not eligible to serve as a project’s PI, and proposals submitted solely by such individuals will be considered non-responsive.
  • A Principal Investigator may submit no more than one proposal as lead PI. 
  • Awardees will be asked to serve as reviewers in the subsequent funding cycle. 

Deadlines 

  • Submission deadline: April 13, 2026
  • Awards announced: June 1, 2026
  • Award start date: July 1, 2026
  • Funds must be spent by: December 31, 2026
  • Final project report due: January 15, 2027

Learn more and submit a proposal (JHU affiliate access only)

The deadline for proposal submissions is Monday, April 13, 2026.

Contact 

For any questions about the Creation of Trusted Datasets call for proposals, please email dsai-academics@jhu.edu.