Introduction

ODIN Catalog is an open-source data catalog built on W3C and OMG standards. It bridges the gap between raw technical metadata and business understanding — giving data teams a semantic layer, end-to-end lineage, and AI-powered discovery out of the box.

What makes ODIN different

Most data catalogs stop at documentation. ODIN goes further:

Semantic vocabulary mappings — every data element can be bound to a concept in FIBO, schema.org, or your own ontology using SKOS match types.
Live lineage graph — OpenLineage events and SQL DDL are parsed into an Apache AGE property graph queryable by Cypher.
Data product governance — the DPROD standard gives every dataset a business owner, lifecycle stage, and access policy.
AI-powered Q&A — a Spring AI RAG pipeline runs over your metadata corpus using Ollama (local) or OpenAI.
AI metadata enrichment — per-element classification, description, and vocabulary concept recommendations, all owner-reviewed before acceptance. PII / direct-identifier detection maps elements to W3C DPV-PD concept IRIs, automatically elevating terms-of-use access levels.
ODRL terms of use — access-level policy (OPEN → HIGHLY_RESTRICTED) derived automatically from element classifications and vocabulary mappings; displayed to consumers, governed by data owners.
Accountable data ownership — role-based dataset ownership with a proposal-and-approval transfer workflow and a full audit history. The governance dashboard surfaces pending tasks and an activity feed for every user.
Zero lock-in — all metadata is exportable as DCAT 3.0 JSON-LD.

Standards at the core

Standard	Body	Role in ODIN
DCAT 3.0	W3C	Catalog, Dataset, Distribution, DataService resources
DPROD	OMG	DataProduct, Port, lifecycle, access policy
CSV-W	W3C	Physical schema (table, column, datatype) harvested from source systems
OpenLineage	Linux Foundation	Job/Run/Dataset lineage events ingested via REST
FIBO	EDM Council	Pre-loaded financial ontology vocabulary (FND, FBC, SEC, MD)
SKOS	W3C	Mapping properties: exactMatch, closeMatch, relatedMatch
ODRL	W3C	Terms-of-use policies — permissions, prohibitions, obligations derived from element classifications and vocabulary concepts
DPV / DPV-PD	W3C	W3C Data Privacy Vocabulary — PII and direct-identifier classification on logical model elements; DPV-PD concept IRIs drive AI PII detection

Note: ODIN is currently in private alpha. APIs and database schemas may change between releases. Not recommended for production workloads yet.

Quick Start

Get a full ODIN stack running locally in under five minutes using Docker Compose.

1. Clone and configure

bash

git clone https://github.com/ODIN-Data-Intelligence/odin.git
cd odin
cp .env.example .env          # review and edit credentials

2. Start the stack

bash

make up
# or: docker compose up -d

# Watch services come healthy:
docker compose ps

Services start in dependency order. Allow ~60 seconds for Kafka, PostgreSQL, and OpenSearch to initialise before the Spring Boot services become healthy.

3. Create your first dataset

bash

curl -s -X POST http://localhost:8001/api/v1/datasets \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -H "X-Tenant-Id: 00000000-0000-0000-0000-000000000001" \
  -d '{
    "title": "Trade Blotter",
    "description": "Intraday trade records from the front-office OMS.",
    "keywords": ["trading", "blotter", "positions"],
    "accrualPeriodicity": "daily"
  }' | jq .

4. Open the frontend

App	URL	Purpose
Producer (management)	`http://localhost:3000`	Publish, govern, harvest
Consumer (discovery)	`http://localhost:3001`	Search, explore, ask AI

Tip: The dev API key X-API-Key: dev-* (any value starting with dev-) grants full catalog:admin scope and bypasses Keycloak. Use it for local smoke testing only.

5. Load sample data

bash

make seed        # loads financial services sample data
make reindex     # pushes all datasets into OpenSearch

The seed script creates 12 financial datasets, 5 data products, logical models with FIBO vocabulary mappings, and OpenLineage pipeline events for a BCBS 239 risk aggregation scenario.

Prerequisites

Runtime requirements

Requirement	Minimum version	Notes
Docker	25.0	Docker Desktop or Docker Engine on Linux
Docker Compose	v2.24	Bundled with Docker Desktop; `docker compose` (v2 plugin)
RAM	12 GB available	OpenSearch and Kafka are the largest consumers
Disk	8 GB free	Container images + volumes

Development requirements

Requirement	Version	Notes
Java	21 (LTS)	Required to build services, which use virtual threads (Project Loom)
Gradle	8.x	Wrapper included; run `./gradlew`
Node.js	20 LTS	Required for frontend builds
pnpm	9.x	`npm install -g pnpm`

Optional — AI features

bash

# Install Ollama for local LLM inference
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull nomic-embed-text   # embedding model (768 dimensions)
ollama pull llama3             # chat model

# Then start the AI profile:
docker compose --profile ai up -d

The ai-service needs an LLM backend: without Ollama (or an OpenAI key configured in .env instead), it will not start. All other services function normally without it.

Configuration

All runtime configuration is driven by environment variables. Copy .env.example to .env and edit before running make up.

Core variables

Variable	Default	Description
`POSTGRES_PASSWORD`	`odin`	Shared password for all Postgres instances (change in production)
`KEYCLOAK_ADMIN`	`admin`	Keycloak admin username
`KEYCLOAK_ADMIN_PASSWORD`	`admin`	Keycloak admin password
`MINIO_ROOT_USER`	`minio`	MinIO root access key
`MINIO_ROOT_PASSWORD`	`minio123`	MinIO root secret key
`JWT_SECRET`	—	HS256 secret for dev API key validation (32+ chars)

AI variables

Variable	Default	Description
`OLLAMA_BASE_URL`	`http://ollama:11434`	Ollama inference endpoint
`OPENAI_API_KEY`	(empty)	If set, OpenAI is used for embeddings and chat instead of Ollama
`AI_CHAT_MODEL`	`llama3`	Ollama model name for chat completions
`AI_EMBED_MODEL`	`nomic-embed-text`	Embedding model; must produce 768-dimension vectors

Warning: The default .env.example values are intentionally weak. Change all passwords and secrets before exposing any port to a network.
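The warning above can be turned into a pre-flight check. The following is a minimal Python sketch, not part of ODIN: the function name `find_weak_settings` is hypothetical, the default values are copied from the Core variables table, and treating an unset variable as "still on the default" is an assumption.

```python
# Hypothetical pre-flight check; not shipped with ODIN.
# Default values are copied from the "Core variables" table.
WEAK_DEFAULTS = {
    "POSTGRES_PASSWORD": "odin",
    "KEYCLOAK_ADMIN_PASSWORD": "admin",
    "MINIO_ROOT_PASSWORD": "minio123",
}
MIN_JWT_SECRET_LENGTH = 32  # "32+ chars" per the table


def find_weak_settings(env: dict[str, str]) -> list[str]:
    """Return the sorted names of variables that still need attention."""
    problems = [
        name
        for name, default in WEAK_DEFAULTS.items()
        # Assumption: an unset variable falls back to the weak default.
        if env.get(name, default) == default
    ]
    if len(env.get("JWT_SECRET", "")) < MIN_JWT_SECRET_LENGTH:
        problems.append("JWT_SECRET")
    return sorted(problems)
```

Calling `find_weak_settings(dict(os.environ))` before exposing a port would list any variable that is still on its default or, for JWT_SECRET, too short.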

Architecture Overview

ODIN follows Domain-Driven Design with a database-per-service pattern. Seven Spring Boot 3.3 microservices communicate via Kafka events. Traefik routes external HTTP traffic.

flowchart TD
  CLI["Browser / CLI"]
  TRF["Traefik :80/443"]
  CF["consumer-frontend\nnginx :3001"]
  PF["producer-frontend\nnginx :3000"]
  CLI --> TRF
  TRF -->|"catalog.local/"| CF
  TRF -->|"manage.catalog.local/"| PF
  TRF -->|"api.catalog.local/"| APIS
  subgraph APIS["API Services"]
    INV["inventory-service :8001\nPostgreSQL :5433"]
    HVT["harvest-service :8002\nPostgreSQL :5434 · MinIO :9000"]
    LIN["lineage-service :8003\nPostgreSQL+AGE :5435"]
    SRC["search-service :8004\nOpenSearch :9200"]
    AIS["ai-service :8005\nPostgreSQL+pgvector :5437"]
    IDS["identity-service :8006\nPostgreSQL :5436 · Keycloak :8180"]
    POL["policy-service :8007\nPostgreSQL :5438"]
  end
  APIS <-->|events| KFK["Apache Kafka :9092\nKRaft · no ZooKeeper"]

Design principles

API-first — every capability is a versioned REST endpoint before any UI is built on top.
Database-per-service — no service shares a database with another. Cross-service reads go through REST or Kafka events.
Event-driven — state changes publish Kafka events; the entity change topics are log-compacted. Downstream services maintain their own read models.
Standards-based exports — the catalog exports DCAT 3.0 JSON-LD via Apache Jena; the lineage service accepts OpenLineage JSON.

Services

Service	Port	Database	Responsibility
inventory-service	8001	PostgreSQL 16	DCAT/DPROD/CSV-W metadata, logical models, vocabulary mappings, Kafka event publisher
harvest-service	8002	PostgreSQL 16 + MinIO	Spring Batch crawlers for Snowflake, AWS Glue, Teradata, DCAT HTTP; Quartz scheduler
lineage-service	8003	PostgreSQL + Apache AGE	OpenLineage REST ingestion, DDL parsing via Calcite, Cypher graph queries
search-service	8004	OpenSearch 2.x	Full-text + semantic indexing, FIBO facets, autocomplete suggestions
ai-service	8005	PostgreSQL + pgvector	Spring AI RAG pipeline, embeddings, SSE chat streaming, Ollama / OpenAI
identity-service	8006	PostgreSQL 16	Keycloak OAuth2/OIDC, role-based access (Administrator, Data Owner, Steward, Governance), user provisioning with Keycloak sync, API keys, tenant management
policy-service	8007	PostgreSQL 16	ODRL policy registry and ODRE enforcement engine (PDP). Evaluates A-Level and B1-Level policies at request time; syncs policies from dataset change events via Kafka; persists evaluation log.

Event Topology

All event-driven inter-service communication uses Kafka, with an envelope schema that carries tenant, event type, and schema version on every message.

Topics

Topic	Producer	Consumers	Compacted
`inventory.datasets.changes`	inventory-service	search-service, ai-service, policy-service	Yes
`inventory.data-products.changes`	inventory-service	search-service, ai-service	Yes
`harvest.entities.discovered`	harvest-service	inventory-service	No
`harvest.ddl.discovered`	harvest-service	lineage-service	No
`lineage.graph.updated`	lineage-service	search-service	No
`policy.records.changes`	policy-service	— (reserved for future PEPs)	No
`policy.evaluations.completed`	policy-service	API gateway / consumer (planned)	No

Event envelope

json

{
  "eventId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "eventType": "DatasetCreated",
  "schemaVersion": "1.0",
  "producerService": "inventory-service",
  "tenantId": "00000000-0000-0000-0000-000000000001",
  "timestamp": "2026-05-18T10:23:00Z",
  "payload": { ... }
}
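A consumer typically validates the envelope before looking at the payload. The sketch below shows one way to do that in Python, assuming all seven fields in the example are mandatory (the source does not say which are optional); `EventEnvelope` and `parse_envelope` are illustrative names, not ODIN classes.

```python
import json
from dataclasses import dataclass
from typing import Any

# Assumption: every field shown in the example envelope is required.
REQUIRED_FIELDS = (
    "eventId", "eventType", "schemaVersion",
    "producerService", "tenantId", "timestamp", "payload",
)


@dataclass(frozen=True)
class EventEnvelope:
    event_id: str
    event_type: str
    schema_version: str
    producer_service: str
    tenant_id: str
    timestamp: str
    payload: dict[str, Any]


def parse_envelope(raw: str) -> EventEnvelope:
    """Parse a JSON message and fail fast if an envelope field is absent."""
    doc = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in doc]
    if missing:
        raise ValueError(f"envelope is missing fields: {missing}")
    return EventEnvelope(
        event_id=doc["eventId"],
        event_type=doc["eventType"],
        schema_version=doc["schemaVersion"],
        producer_service=doc["producerService"],
        tenant_id=doc["tenantId"],
        timestamp=doc["timestamp"],
        payload=doc["payload"],
    )
```

Keeping the envelope separate from the payload lets a consumer route on `event_type` and `schema_version` without deserialising payload types it does not understand.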

Security Model

Authentication methods

Method	Header	Use case
Bearer JWT (OIDC)	`Authorization: Bearer <token>`	User sessions via Keycloak
API Key	`X-API-Key: <key>`	Service-to-service, CI pipelines, curl
Dev key	`X-API-Key: dev-*`	Local development only — bypasses auth entirely

Tenant isolation

Every resource row carries a tenant_id UUID. When using Keycloak JWT tokens, the tenant is extracted from the tenant_id claim automatically. When using API keys, the key's associated tenant is used. Rows from other tenants are never returned.
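The resolution rule above can be sketched in a few lines of Python. This is an illustration only: the function names, the in-memory `api_key_tenants` lookup, and the choice to let a JWT take precedence when both credentials are present are assumptions, not ODIN's implementation.

```python
from typing import Mapping, Optional


def resolve_tenant(
    jwt_claims: Optional[Mapping[str, str]],
    api_key: Optional[str],
    api_key_tenants: Mapping[str, str],
) -> str:
    """Return the tenant for a request, or raise if none can be resolved."""
    if jwt_claims is not None:
        tenant = jwt_claims.get("tenant_id")
        if not tenant:
            raise PermissionError("JWT has no tenant_id claim")
        return tenant
    if api_key is not None and api_key in api_key_tenants:
        return api_key_tenants[api_key]
    raise PermissionError("no tenant could be resolved")


def visible_rows(rows: list[dict], tenant_id: str) -> list[dict]:
    """Keep only rows owned by the resolved tenant."""
    return [row for row in rows if row["tenant_id"] == tenant_id]
```

In a real service the filter would be applied in the database query rather than in memory; the list comprehension only makes the rule explicit.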

Warning: Never use X-API-Key: dev-* in production. It grants unrestricted admin access to all tenants.

See Roles & Login for the four defined roles, producer UI login flow, and how to add users in Keycloak.

Metamodel Overview

ODIN's metamodel has three tiers: conceptual (business), logical (semantic), and physical (technical).

flowchart TB
  subgraph CONC["Conceptual · DPROD"]
    DP["DataProduct"] --> IP["InputPort"]
    DP --> OP["OutputPort"] --> SVC["DataService"]
  end
  subgraph LOGIC["Logical · DCAT"]
    D["Dataset"] --> VP["VocabularyProfile"] --> VOC["Vocabulary\nFIBO · schema.org"]
    D --> LM["LogicalModel"] --> LDE["LogicalDataElement"] --> VM["VocabularyMapping · SKOS"]
  end
  subgraph PHYS["Physical · DCAT · CSV-W"]
    DIST["Distribution"] --> CT["CSVWTable"] --> CS["CSVWSchema"] --> CC["CSVWColumn"]
  end
  subgraph LINE["Lineage · OpenLineage · AGE"]
    JOB["OpenLineage Job"] -->|"READS_FROM · WRITES_TO"| LD["Dataset"]
  end
  SVC --> DIST
  CC -->|"logicalDataElementId FK"| LDE
  LD -. "catalogResourceId" .-> D

Layer responsibilities

Layer	Key entities	Purpose
Conceptual	DataProduct, Port, DataService	Business ownership, governance, lifecycle (Ideation → Consume)
Logical	LogicalModel, LogicalDataElement, VocabularyMapping	Business meaning, semantic annotations, vocabulary alignment
Physical	Distribution, CSVWTable, CSVWColumn	Technical structure as harvested from source systems

DCAT Datasets & Distributions

ODIN models datasets and distributions using the DCAT 3.0 vocabulary. The full catalog can be exported as DCAT JSON-LD.

Dataset fields

Field	DCAT property	Type	Notes
`title`	`dct:title`	string	Human-readable name
`description`	`dct:description`	string	Free-text description
`keywords`	`dcat:keyword`	string[]	Used for search facets
`themes`	`dcat:theme`	string[]	Domain classification IRIs
`accrualPeriodicity`	`dct:accrualPeriodicity`	string	e.g. `daily`, `hourly`
`license`	`dct:license`	URI	License IRI
`conformsTo`	`dct:conformsTo`	URI[]	Standards this dataset conforms to
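The field-to-property table above is essentially a renaming map. The Python sketch below applies it to produce a small JSON-LD node. It is a simplified illustration, not ODIN's exporter (which the architecture notes say is built on Apache Jena); `to_jsonld` is a hypothetical name, and wrapping only `themes`, `license`, and `conformsTo` as `@id` references is an assumption based on the Type column.

```python
# Field names and DCAT properties are taken from the table above.
FIELD_TO_PROPERTY = {
    "title": "dct:title",
    "description": "dct:description",
    "keywords": "dcat:keyword",
    "themes": "dcat:theme",
    "accrualPeriodicity": "dct:accrualPeriodicity",
    "license": "dct:license",
    "conformsTo": "dct:conformsTo",
}
# Assumption: fields typed as URI / IRI in the table become @id references.
IRI_FIELDS = {"themes", "license", "conformsTo"}
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def _as_reference(value):
    """Wrap one IRI, or each IRI in a list, as a JSON-LD node reference."""
    if isinstance(value, list):
        return [{"@id": item} for item in value]
    return {"@id": value}


def to_jsonld(dataset: dict) -> dict:
    """Rename the dataset fields that are present to their DCAT properties."""
    node = {"@context": CONTEXT, "@type": "dcat:Dataset"}
    for field, prop in FIELD_TO_PROPERTY.items():
        if field in dataset:
            value = dataset[field]
            node[prop] = _as_reference(value) if field in IRI_FIELDS else value
    return node
```

Fields absent from the input are simply omitted, so a dataset created with only a title and keywords yields a node with just those two properties.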

DCAT export

bash

# Export full catalog as DCAT 3.0 JSON-LD
curl http://localhost:8001/api/v1/catalogs/{id}/export \
  -H "Accept: application/ld+json" \
  -H "X-API-Key: dev-local"

DPROD Data Products

Data products are modelled using the OMG DPROD standard. A data product represents a business-owned, governed unit of data with a defined lifecycle.

Lifecycle stages

Stage	Description
Ideation	Concept identified; no data yet
Design	Schema and SLA being defined
Build	Pipeline under development
Deploy	Running in production, not yet published
Consume	Publicly available for consumers

Create a data product

bash

curl -X POST http://localhost:8001/api/v1/data-products \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{
    "title": "Trade Risk Data Product",
    "description": "Aggregated risk metrics for regulatory reporting.",
    "lifecycleStatus": "Consume",
    "keywords": ["risk", "trading", "BCBS239"],
    "informationSensitivity": "Internal"
  }'

CSV-W Physical Schema

The physical layer is modelled using CSV on the Web (CSV-W). Each distribution that is harvested from a source system produces a CSVWTable with a CSVWSchema containing one CSVWColumn per field.

Column fields

Field	Type	Description
`name`	string	Column name as it appears in the source system
`titles`	string[]	Alternate names / aliases
`datatype`	string	Source system type: `DECIMAL(18,4)`, `VARCHAR(50)`, etc.
`required`	boolean	Whether the column is NOT NULL
`description`	string	Column comment from the source DDL
`propertyUrl`	URI	Linked Data property IRI if available

Physical columns are created automatically during harvest. Each CSVWColumn carries an optional logicalDataElementId FK that, when set, creates the logical–physical binding. A single LogicalDataElement may be bound by multiple physical columns across different distributions or schema versions.

Logical Models

A LogicalModel belongs to a Dataset and provides the business-oriented view of its structure. It contains LogicalDataElements — each representing a named business concept with an optional binding to a physical column and zero or more vocabulary mappings.

LogicalDataElement fields

Field	Type	Description
`name`	string	Technical element name (from harvest or manual entry)
`label`	string	Human-readable business name shown in the Model tab: Trade Amount, Settlement Currency
`description`	string	Plain-English business description, curated by a steward or accepted from an AI recommendation
`logicalType`	string	Semantic type: `MonetaryAmount`, `Identifier`, `Date`, `Party`
`classification`	string	Accepted data sensitivity level: `PUBLIC`, `INTERNAL`, `CONFIDENTIAL`, `HIGH_CONFIDENTIAL`
`recommendedClassification`	string	AI-suggested classification pending data owner review; cleared on accept/reject
`classificationReasoning`	string	One-sentence rationale produced alongside the AI classification recommendation
`physicalColumnIds`	UUID[]	IDs of bound `csvw_columns` rows; the FK lives on the column side (`csvw_columns.logicalDataElementId`)
`isIdentifier`	boolean	True if this element forms part of the logical primary key
`isNullable`	boolean	Whether the business concept permits absence of a value

Bind a physical column

bash

curl -X POST \
  http://localhost:8001/api/v1/logical-data-elements/{elementId}/bind \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{ "physicalColumnId": "b3f1a2e4-..." }'

AI description recommendation

For each element in a model the AI service can generate a plain-English business description grounded in the element's name, label, logical type, and vocabulary mappings. The recommendation is stored in a recommendedDescription field and surfaced as an inline suggestion in the Model tab — a data steward reviews and accepts or dismisses it.

bash

# Request a description recommendation for one element
curl -s -X POST \
  http://localhost:8001/api/v1/logical-data-elements/{elementId}/recommend-description \
  -H "X-API-Key: dev-local"
# → { "elementId": "...", "recommendedDescription": "The gross notional value of the trade,
#      expressed in settlement currency before any netting adjustment." }

# Accept the recommendation (writes to description field)
curl -s -X POST \
  http://localhost:8001/api/v1/logical-data-elements/{elementId}/accept-description \
  -H "X-API-Key: dev-local"

# Bulk — request descriptions for all elements in a model
curl -s -X POST \
  http://localhost:8001/api/v1/logical-models/{modelId}/recommend-descriptions \
  -H "X-API-Key: dev-local"

Note: Description recommendations are generated by the ai-service and stored on the element pending review. They are never applied automatically; a steward or data owner must accept each one. Accepting a recommendation writes the text to the description field and clears recommendedDescription.
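The accept semantics described above amount to a small state change on the element. The Python sketch below models it with a hypothetical in-memory class; the field names mirror the API fields, while the behaviour of `dismiss_description` (clearing the pending text without touching the description) is an assumption, since the source only spells out the accept path.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalDataElement:
    name: str
    description: Optional[str] = None
    recommended_description: Optional[str] = None


def accept_description(element: LogicalDataElement) -> None:
    """Copy the pending recommendation into description, then clear it."""
    if element.recommended_description is None:
        raise ValueError("no pending recommendation to accept")
    element.description = element.recommended_description
    element.recommended_description = None


def dismiss_description(element: LogicalDataElement) -> None:
    """Assumed behaviour: drop the recommendation, keep the description."""
    element.recommended_description = None
```

Because the recommendation lives in its own field, the curated description is never overwritten until a reviewer explicitly accepts.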

Auto-scaffold from harvest

When a harvest run discovers columns for a dataset that has no published LogicalModel, ODIN automatically generates a draft LogicalModel with one LogicalDataElement per CSVWColumn. Each harvested column has its logicalDataElementId set to the newly created element, and its logicalType is inferred from the source datatype. You can then enrich the draft with business names and vocabulary mappings.
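The scaffolding step can be pictured as follows. This Python sketch is illustrative only: the source does not document the actual inference rules, so the `TYPE_RULES` table and the `Unknown` fallback are invented for the example, as are the class and function names.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules; ODIN's real datatype-to-logicalType inference is not documented here.
TYPE_RULES = {
    "DATE": "Date",
    "TIMESTAMP": "Date",
    "DECIMAL": "MonetaryAmount",
    "NUMERIC": "MonetaryAmount",
}


@dataclass
class Column:
    name: str
    datatype: str
    logical_data_element_id: Optional[str] = None


@dataclass
class Element:
    id: str
    name: str
    logical_type: str


def infer_logical_type(source_datatype: str) -> str:
    """Strip length/precision, e.g. DECIMAL(18,4) -> DECIMAL, then look it up."""
    base = source_datatype.split("(")[0].strip().upper()
    return TYPE_RULES.get(base, "Unknown")


def scaffold_draft_model(columns: list[Column]) -> list[Element]:
    """Create one element per column and point each column's FK at it."""
    elements = []
    for column in columns:
        element = Element(
            id=str(uuid.uuid4()),
            name=column.name,
            logical_type=infer_logical_type(column.datatype),
        )
        column.logical_data_element_id = element.id
        elements.append(element)
    return elements
```

The draft starts as a one-to-one mirror of the physical schema; stewards can later merge elements so that several columns share one business concept.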

Vocabulary & FIBO

ODIN ships with eight system vocabularies pre-loaded. You can register additional RDF vocabularies at any time.

Pre-loaded vocabularies

Vocabulary	Prefix	Type	Base IRI
schema.org	`schema`	general	`https://schema.org/`
FIBO FND	`fibo-fnd`	financial	`https://spec.edmcouncil.org/fibo/ontology/FND/`
FIBO FBC	`fibo-fbc`	financial	`https://spec.edmcouncil.org/fibo/ontology/FBC/`
FIBO SEC	`fibo-sec`	financial	`https://spec.edmcouncil.org/fibo/ontology/SEC/`
FIBO MD	`fibo-md`	financial	`https://spec.edmcouncil.org/fibo/ontology/MD/`
SKOS	`skos`	general	`http://www.w3.org/2004/02/skos/core#`
DPV	`dpv`	privacy	`https://w3id.org/dpv#`
DPV-PD	`dpv-pd`	privacy	`https://w3id.org/dpv/pd#`

SKOS match types

Match type	When to use
`exactMatch`	The element represents precisely the same concept
`closeMatch`	Very similar but not identical (e.g. trade date ↔ schema:startDate)
`relatedMatch`	Related but distinct concepts
`broadMatch`	The vocabulary concept is broader / more general
`narrowMatch`	The vocabulary concept is narrower / more specific

Add a vocabulary mapping

bash

curl -X POST \
  http://localhost:8001/api/v1/logical-data-elements/{elementId}/vocab-mappings \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{
    "vocabularyId": "...",
    "conceptIri": "https://spec.edmcouncil.org/fibo/ontology/FND/Accounting/CurrencyAmount/MonetaryAmount",
    "conceptLabel": "MonetaryAmount",
    "matchType": "exactMatch"
  }'

Vocabulary & AI

Semantic vocabularies are the missing layer between your data and AI. This page explains why standard concept IRIs — schema.org and FIBO — transform how language models reason over catalog metadata, and how ODIN makes that connection operational.

Ambiguity is the root cause of AI failure

RAG pipelines retrieve chunks of text. Without semantic grounding, a question about "settlement amount" returns every table that mentions the word "amount." A SKOS exactMatch binding to fibo-fnd-acc-cur:MonetaryAmount makes retrieval precise — the model finds the right element, not the most popular one.

Vocabulary mappings give the search layer a signal that survives synonym drift, column renaming, and schema evolution. The concept IRI is stable even when the column name is not.

Standard IRIs are native to foundation models

schema.org and FIBO are published openly on the web, so their IRIs and definitions are likely to be well represented in the training corpora of major LLMs. Annotating a data element with https://schema.org/price or fibo-md-temx-ex:MarketPrice places it close to what the model already knows about that concept, with little or no extra prompt engineering.

The model usually does not need to be told what a LegalEntityIdentifier is, because it has most likely seen the public FIBO material during training. Your vocabulary mapping surfaces that latent knowledge at query time.

Agents need contracts, not descriptions

As AI moves from answering questions to taking actions — writing pipelines, generating reports, triggering workflows — it needs to know exactly what data it is handling. A vocabulary mapping is a contract:

This column contains a LegalEntityIdentifier, not "some kind of ID."
Agents that operate on contracts are auditable. Agents that operate on descriptions are not.

ODIN exposes these contracts through the logical model API. AI agents can read the full vocabulary profile for any dataset before writing a pipeline that touches it.

Your metadata becomes a knowledge graph

ODIN's vocabulary mappings, logical models, and lineage edges form a traversable knowledge graph stored in Apache AGE. AI agents don't just search it — they walk it.

Cypher (Apache AGE)

// Walk upstream from a regulatory report through lineage (1 to 4 hops).
// Assumes DERIVED_FROM points from a derived dataset to its source.
// The returned fibo_concepts can then be matched against other datasets.
MATCH (report:Dataset {name: 'RISK_AGGREGATION'})
      -[:DERIVED_FROM*1..4]->(src:Dataset)
RETURN src.namespace, src.name, src.fibo_concepts

From a regulatory report, upstream through lineage to source systems, sideways through vocabulary to equivalent concepts in other datasets. That kind of reasoning is only possible when meaning is explicit.

FIBO: regulatory-grade semantics, pre-loaded

The Financial Industry Business Ontology is the only semantic vocabulary built specifically for financial data with regulatory intent. When an AI model encounters a FIBO-annotated dataset, it has access to the same ontological structure that regulators, risk managers, and auditors use.

FIBO concept IRI (abbreviated)	Meaning
`fibo-fnd-acc-cur:MonetaryAmount`	Monetary value with attached currency
`fibo-fnd-acc-cur:Currency`	ISO 4217 currency code
`fibo-fbc-fi-fi:FinancialInstrument`	General financial instrument
`fibo-fbc-fct-rga:LegalEntityIdentifier`	LEI — unique legal entity identifier
`fibo-md-temx-ex:MarketPrice`	Exchange-quoted market price
`fibo-sec-eq-eq:Share`	Equity share / stock

Cross-system equivalence without ETL

Different source systems use different column names for the same concept: trade_ccy, SETTL_CURR, SettlementCurrency. All three mapped to fibo-fnd-acc-cur:Currency with exactMatch become interchangeable to any AI agent — without moving a byte of data.

bash

# Find all datasets with a column mapped to fibo Currency concept
curl "http://localhost:8004/api/v1/search?fibo_concept=fibo-fnd-acc-cur%3ACurrency" \
  -H "X-API-Key: dev-local"

Semantic equivalence is free once vocabulary mappings exist. The search index stores fibo_concepts as a keyword facet — no text matching, no NLP, just exact IRI lookup across every dataset in the catalog.
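The equivalence idea reduces to grouping columns by concept IRI. The Python sketch below demonstrates it on in-memory records; the record shape (`dataset`, `column`, `conceptIri`, `matchType`) and the dataset names are assumptions made for the example, not the search-service's index schema.

```python
from collections import defaultdict


def group_by_concept(mappings: list[dict], match_type: str = "exactMatch") -> dict:
    """Map each concept IRI to the (dataset, column) pairs bound to it."""
    groups = defaultdict(list)
    for mapping in mappings:
        if mapping["matchType"] == match_type:
            groups[mapping["conceptIri"]].append(
                (mapping["dataset"], mapping["column"])
            )
    return dict(groups)


# Three source systems, three column names, one concept (names are illustrative).
EXAMPLE_MAPPINGS = [
    {"dataset": "oms", "column": "trade_ccy",
     "conceptIri": "fibo-fnd-acc-cur:Currency", "matchType": "exactMatch"},
    {"dataset": "settlement", "column": "SETTL_CURR",
     "conceptIri": "fibo-fnd-acc-cur:Currency", "matchType": "exactMatch"},
    {"dataset": "ledger", "column": "SettlementCurrency",
     "conceptIri": "fibo-fnd-acc-cur:Currency", "matchType": "exactMatch"},
    {"dataset": "ledger", "column": "amount",
     "conceptIri": "fibo-fnd-acc-cur:MonetaryAmount", "matchType": "closeMatch"},
]
```

`group_by_concept(EXAMPLE_MAPPINGS)` returns a single key whose value lists all three currency columns; the `closeMatch` row is excluded because only exact matches are treated as interchangeable here.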

IRI → label translation

Raw concept IRIs are never shown in the UI. The inventory-service resolves them to human-readable labels via GET /api/v1/vocabularies/translate?iri=<iri> (single) and POST /api/v1/vocabularies/translate (a JSON array of IRIs → {"translations":{"iri":"label",...}}). The frontend batches lookups and caches results in localStorage, so a mapping like fibo-fnd-acc-cur:Currency renders as "Currency" everywhere.

DPV & DPV-PD: privacy vocabulary for AI PII detection

ODIN pre-loads the W3C Data Privacy Vocabulary (DPV) and its personal-data extension (DPV-PD) alongside FIBO. DPV gives the AI service a precise, machine-readable vocabulary for classifying data elements as personal information or direct identifiers — the same vocabulary referenced in GDPR guidance and privacy impact assessments.

When you request PII recommendations for a logical data element, the AI service maps each element to a DPV-PD concept IRI (e.g. https://w3id.org/dpv/pd#Name, https://w3id.org/dpv/pd#EmailAddress) and sets isPersonalInformation / isDirectIdentifier flags. Accepted recommendations are stored as vocabulary mappings under the dpv prefix and propagate into the ODRL terms-of-use derivation — a column mapped to any DPV-PD direct-identifier concept automatically elevates the dataset's access level toward HIGHLY_RESTRICTED.

DPV-PD concept IRI	Meaning	isDirectIdentifier
`dpv-pd:Name`	Person's full or partial name	Yes
`dpv-pd:EmailAddress`	Email address	Yes
`dpv-pd:PhoneNumber`	Telephone number	Yes
`dpv-pd:NationalIdentificationNumber`	Government-issued ID (SSN, passport)	Yes
`dpv-pd:Location`	Geographic location data	No
`dpv-pd:BehavioralData`	Usage, click-stream, or activity data	No

Tip: Use POST /api/v1/logical-models/{modelId}/recommend-pii to run bulk DPV-PD classification across all elements in a logical model. Individual elements use POST /api/v1/logical-data-elements/{elementId}/recommend-pii.

OpenLineage Integration

ODIN's lineage-service exposes an OpenLineage-compatible HTTP endpoint. Any tool that emits OpenLineage events (Spark, dbt, Airflow, Flink) can send lineage directly to ODIN.

Send a lineage event

bash

curl -X POST http://localhost:8003/api/v1/lineage \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{
    "eventType": "COMPLETE",
    "eventTime": "2026-05-18T10:00:00Z",
    "producer": "https://github.com/OpenLineage/OpenLineage/tree/main/integration/spark",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
    "run": { "runId": "3fa85f64-5717-4562-b3fc-2c963f66afa6" },
    "job": { "namespace": "TRADING_DB", "name": "risk_aggregation_job" },
    "inputs":  [{ "namespace": "TRADING_DB.BLOTTER", "name": "TRADE_BLOTTER" }],
    "outputs": [{ "namespace": "REGULATORY_DB.BCBS239", "name": "RISK_AGGREGATION" }]
  }'
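For pipelines that are not covered by an existing OpenLineage integration, the same event can be assembled in code. The Python sketch below builds a dictionary with the fields used in the curl example; `build_run_event` is a hypothetical helper, and the schema URL is copied from that example rather than chosen independently.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

# Copied from the curl example above.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def build_run_event(
    producer: str,
    job: tuple[str, str],
    inputs: list[tuple[str, str]],
    outputs: list[tuple[str, str]],
    event_type: str = "COMPLETE",
    now: Optional[datetime] = None,
) -> dict:
    """Build a run event; job, inputs and outputs are (namespace, name) pairs."""
    # Pass a timezone-aware datetime; by default the current UTC time is used.
    moment = now if now is not None else datetime.now(timezone.utc)
    return {
        "eventType": event_type,
        "eventTime": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "producer": producer,
        "schemaURL": SCHEMA_URL,
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": job[0], "name": job[1]},
        "inputs": [{"namespace": ns, "name": name} for ns, name in inputs],
        "outputs": [{"namespace": ns, "name": name} for ns, name in outputs],
    }
```

Serialising the result with `json.dumps` gives a body that can be posted to the lineage endpoint with the same headers as the curl call.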

Query lineage graph

bash

# Upstream lineage, 4 hops
curl "http://localhost:8003/api/v1/datasets/REGULATORY_DB.BCBS239/RISK_AGGREGATION/lineage?direction=upstream&depth=4" \
  -H "X-API-Key: dev-local"

# Downstream impact analysis
curl "http://localhost:8003/api/v1/datasets/TRADING_DB.BLOTTER/TRADE_BLOTTER/impact" \
  -H "X-API-Key: dev-local"

Lineage is stored in an Apache AGE property graph on PostgreSQL. Cypher queries traverse DERIVED_FROM, READ_BY, and WRITES_TO edges. Column-level lineage uses COLUMN_LINEAGE edges.
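A depth-limited upstream query is a breadth-first walk over derivation edges. The Python sketch below shows the idea on an in-memory edge list; it is not how the lineage-service is implemented (that runs Cypher on Apache AGE), and the `(derived, source)` edge orientation and the `OMS_RAW` dataset are assumptions for the example.

```python
from collections import deque


def upstream(edges: list[tuple[str, str]], start: str, depth: int) -> list[tuple[str, int]]:
    """Return (dataset, hops) for every source reachable within `depth` hops.

    Each edge is a (derived, source) pair: `derived` was produced from `source`.
    """
    sources_of: dict[str, list[str]] = {}
    for derived, source in edges:
        sources_of.setdefault(derived, []).append(source)

    seen = {start}
    found: list[tuple[str, int]] = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # depth limit reached; do not expand further
        for source in sources_of.get(node, []):
            if source not in seen:  # guards against cycles and duplicates
                seen.add(source)
                found.append((source, hops + 1))
                queue.append((source, hops + 1))
    return found
```

With edges `[("RISK_AGGREGATION", "TRADE_BLOTTER"), ("TRADE_BLOTTER", "OMS_RAW")]`, a depth of 1 returns only `TRADE_BLOTTER`, while a depth of 4 also reaches `OMS_RAW`.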

Semantic Types

A semantic type is the business domain type a dataset contains — e.g. Customer, DebitCardAccount, LoanOrCredit. Types are derived automatically from the vocabulary mappings on a dataset's published logical models, so they need no curation beyond mapping elements to FIBO / schema.org / SKOS concepts.

How a type is derived

Only exactMatch and closeMatch mappings count — stronger signals than broad, related, or narrow matches.
The type is the terminal fragment of the concept IRI: everything after the last / or #. For example …/ClientsAndAccounts/Customer → Customer, and https://schema.org/LoanOrCredit → LoanOrCredit.
Types are de-duplicated across all elements of the dataset's published logical model(s).
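The three derivation rules above can be sketched in Python as follows. The function names and the mapping record shape are assumptions, the `example.org` IRI is a stand-in for a full FIBO IRI, and preserving first-seen order is a choice made for the example (the source only says the result is de-duplicated).

```python
import re

# Only these SKOS match types contribute a semantic type.
COUNTED_MATCH_TYPES = {"exactMatch", "closeMatch"}


def terminal_fragment(iri: str) -> str:
    """Return everything after the last '/' or '#'."""
    return re.split(r"[/#]", iri)[-1]


def derive_semantic_types(mappings: list[dict]) -> list[str]:
    """Collect de-duplicated terminal fragments from qualifying mappings."""
    types: list[str] = []
    for mapping in mappings:
        if mapping["matchType"] not in COUNTED_MATCH_TYPES:
            continue
        fragment = terminal_fragment(mapping["conceptIri"])
        if fragment and fragment not in types:
            types.append(fragment)
    return types
```

For example, mappings to `https://schema.org/LoanOrCredit` (exactMatch) and `https://example.org/ontology/ClientsAndAccounts/Customer` (closeMatch) yield `LoanOrCredit` and `Customer`, while a broadMatch mapping is ignored.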

Semantic context

The full picture is returned by GET /api/v1/datasets/{id}/semantic-context:

bash

curl -s http://localhost:8001/api/v1/datasets/$ID/semantic-context \
  -H "X-API-Key: dev-local"
# → { "semanticTypes": ["Customer", "DebitCardAccount"],
#     "vocabConceptLabels": [...], "fiboConcepts": [...],
#     "logicalElementNames": [...], "logicalTypes": [...],
#     "acceptedTags": [ { "id": "...", "type": "Customer", "vocabularyIri": "..." } ] }

Where types surface

Search — indexed as the OpenSearch semanticTypes keyword field and exposed as a facet (filter with ?semanticType=Customer).
Consumer UI — rendered as colour-coded badges on the dataset detail view (blue for FIBO, green for schema.org).
AI chat — types are embedded into the dataset's vector chunks, so questions like "which datasets contain Customer data?" surface the right datasets.
Manual tags — stewards can accept AI-recommended or hand-entered types via POST /api/v1/datasets/{id}/semantic-tags; see the AI Service for recommendations.

Governance & Audit

Every dataset has an optional data owner and an immutable audit history. Ownership only changes hands through a proposal workflow, so a transfer is never unilateral.

Assigning an owner

An unowned dataset can be claimed with PUT /api/v1/datasets/{id}/owner (body {"userId":"..."}). If the dataset already has an owner the call is rejected with 409 — use the transfer workflow instead.

Ownership transfer workflow

Propose — POST /api/v1/datasets/{id}/ownership-proposals with {"proposedOwnerId":"..."} creates a PENDING proposal.
Approve / reject — only the current owner (or a catalog admin) may call …/ownership-proposals/{proposalId}/approve or …/reject. Approval atomically updates the dataset's ownerId.
Pending queue — GET …/ownership-proposals/pending returns the open proposal, or 204 if none.

A proposal carries status (PENDING / APPROVED / REJECTED), proposedOwnerId, proposedById, and timestamps.
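The workflow above is a small state machine with one guarded transition. The Python sketch below models it in memory; the class and function names are illustrative, and rejecting a second resolution of an already resolved proposal is an assumption the source does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "PENDING"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"


@dataclass
class Dataset:
    id: str
    owner_id: Optional[str]


@dataclass
class Proposal:
    dataset: Dataset
    proposed_owner_id: str
    proposed_by_id: str
    status: Status = Status.PENDING


def resolve(proposal: Proposal, actor_id: str, approve: bool,
            actor_is_admin: bool = False) -> None:
    """Approve or reject a pending proposal on behalf of `actor_id`."""
    if proposal.status is not Status.PENDING:
        raise ValueError("proposal has already been resolved")
    if actor_id != proposal.dataset.owner_id and not actor_is_admin:
        raise PermissionError("only the current owner or an admin may resolve")
    if approve:
        # Ownership and status change together, mirroring the atomic update.
        proposal.dataset.owner_id = proposal.proposed_owner_id
        proposal.status = Status.APPROVED
    else:
        proposal.status = Status.REJECTED
```

The owner check happens before any mutation, so a rejected caller leaves both the proposal and the dataset untouched.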

Audit history

GET /api/v1/datasets/{id}/history returns a reverse-chronological, paginated log. Each entry records the eventType, who made the change (changedById / changedByEmail), and JSON snapshots payloadBefore / payloadAfter.

Tracked events: CREATED, UPDATED, DELETED, OWNER_ASSIGNED, OWNER_TRANSFER_PROPOSED, OWNER_TRANSFER_APPROVED, OWNER_TRANSFER_REJECTED.

AI action gating

Accepting or rejecting an AI-generated classification recommendation is restricted to the dataset's current data owner or an administrator. The producer UI enforces this in the Model tab: the Accept / Reject buttons are shown only when the logged-in user is the dataset owner or an administrator. The same restriction applies to description recommendations. This ensures that AI suggestions never modify metadata without an accountable human approving the change.

Action	Who can perform it
Accept / reject AI classification	Data Owner, Administrator
Accept / reject AI description	Data Owner, Administrator
Accept / reject AI vocabulary concept recommendations	Data Owner, Administrator
Accept / reset ODRL terms of use policy	Data Owner only
Accept physical schema AI mappings	Data Owner only
Approve / reject ownership transfer	Current Data Owner, Administrator
Direct owner assignment (unowned dataset)	Administrator
Propose ownership transfer	Data Governance Officer, Data Steward, current Data Owner
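The table above can be read as a lookup from action to permitted roles. The Python sketch below encodes it that way; the action keys are invented identifiers, and "Data Owner" here means the owner of the dataset being acted on, which the caller is assumed to have worked out before building the role set.

```python
# Rows transcribed from the table above; action keys are hypothetical.
PERMISSIONS = {
    "ai_classification_review": {"Data Owner", "Administrator"},
    "ai_description_review": {"Data Owner", "Administrator"},
    "ai_vocabulary_review": {"Data Owner", "Administrator"},
    "terms_of_use_accept_or_reset": {"Data Owner"},
    "physical_schema_mapping_accept": {"Data Owner"},
    "ownership_transfer_resolve": {"Data Owner", "Administrator"},
    "owner_direct_assign": {"Administrator"},
    "ownership_transfer_propose": {
        "Data Governance Officer", "Data Steward", "Data Owner",
    },
}


def can_perform(action: str, roles: set[str]) -> bool:
    """True if any of the caller's roles is allowed to perform the action."""
    if action not in PERMISSIONS:
        raise KeyError(f"unknown action: {action}")
    return bool(PERMISSIONS[action] & roles)
```

Keeping the rules in one table makes it straightforward to apply the same check in the UI (to hide buttons) and in the API (to reject requests).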

Governance dashboard

The producer UI dashboard (http://localhost:3000/{tenant}) provides a personal governance view for every logged-in user:

Stat cards — count of datasets and data products the user owns.
Outstanding Tasks — pending ownership transfer proposals directed at the user (nomination or transfer awaiting their acceptance).
My Proposals — ownership proposals the user has submitted or been nominated for, with current status (Pending / Approved / Rejected) and resolution notes.
My Changes — a chronological feed of dataset events the user performed: Created, Updated, Assigned Owner, Proposed Transfer, Approved/Rejected Transfer.

The dashboard data is served by GET /api/v1/dashboard/summary (stat counts) and GET /api/v1/dashboard/activity (proposals + changes feed) on the inventory-service.

Terms of Use (ODRL)

ODIN derives a machine-readable ODRL terms-of-use policy for every dataset automatically — no manual policy authoring required. The policy is inferred from two signals already present in the catalog: element classifications and vocabulary concept mappings.

How derivation works

Effective classification — the most restrictive accepted classification across all published logical model elements determines the access level. Order: HIGH_CONFIDENTIAL > CONFIDENTIAL > INTERNAL > PUBLIC.
Regulatory signals — FIBO vocabulary IRI prefixes (fibo-fbc, fibo-sec, fibo-md) and dataset keywords (mifid, emir, gdpr, basel, finrep) identify applicable regulatory frameworks and add corresponding obligations (e.g. "Comply with market data vendor licence terms" when fibo-md concepts are mapped).

Access levels

Effective classification	Access level	Default stance
`PUBLIC`	OPEN	Use and redistribute freely with attribution
`INTERNAL`	INTERNAL_ONLY	Internal use only; no external distribution
`CONFIDENTIAL`	RESTRICTED	Approved analytics only; notify data owner before AI/ML use
`HIGH_CONFIDENTIAL`	HIGHLY_RESTRICTED	Explicit written approval required; full audit trail
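
The derivation rules and the access-level table can be sketched as follows. This is an illustration under stated assumptions rather than the service's code: the function names are hypothetical, and concept identifiers are assumed to be prefixed names such as `fibo-md:Quote` (the actual IRI matching is not described here).

```python
from typing import Iterable, List, Optional

# Least to most restrictive, following the order given above.
SEVERITY = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "HIGH_CONFIDENTIAL"]

ACCESS_LEVEL = {
    "PUBLIC": "OPEN",
    "INTERNAL": "INTERNAL_ONLY",
    "CONFIDENTIAL": "RESTRICTED",
    "HIGH_CONFIDENTIAL": "HIGHLY_RESTRICTED",
}

FIBO_PREFIXES = ("fibo-fbc", "fibo-sec", "fibo-md")
REGULATORY_KEYWORDS = {"mifid", "emir", "gdpr", "basel", "finrep"}


def effective_classification(classifications: Iterable[str]) -> Optional[str]:
    """Most restrictive accepted classification, or None if there are none."""
    present = [c for c in classifications if c in SEVERITY]
    return max(present, key=SEVERITY.index) if present else None


def access_level(classifications: Iterable[str]) -> Optional[str]:
    effective = effective_classification(classifications)
    return ACCESS_LEVEL[effective] if effective else None


def regulatory_signals(concept_ids: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Matched FIBO prefixes first, then matched dataset keywords (sorted)."""
    concept_ids = list(concept_ids)
    prefixes = [p for p in FIBO_PREFIXES
                if any(c.lower().startswith(p + ":") for c in concept_ids)]
    matched = sorted({k.lower() for k in keywords} & REGULATORY_KEYWORDS)
    return prefixes + matched


print(access_level(["PUBLIC", "CONFIDENTIAL", "INTERNAL"]))
print(regulatory_signals(["fibo-md:Quote"], ["GDPR", "sales"]))
```

A `None` result from `access_level` corresponds to the case where no element classifications exist.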

API

Method	Path	Description
`GET`	`/api/v1/datasets/{id}/terms-of-use`	Derive (or return explicit) ODRL policy. Returns `effectiveClassification`, `accessLevel`, `permissions`, `prohibitions`, `obligations`, `applicableRegulations`, `odrlPolicy`, `policySource`, `derivationDetails`.
`POST`	`/api/v1/datasets/{id}/terms-of-use/accept`	Lock the derived policy as `hasPolicy` on the dataset (data owner only).
`DELETE`	`/api/v1/datasets/{id}/terms-of-use/policy`	Clear the explicit policy; revert to dynamic derivation.

Accept pre-condition

The Accept Policy action requires that every element in the published logical model has both an accepted classification and at least one vocabulary concept mapping. While unmet, the producer UI shows per-element readiness hints — e.g. "3 elements still need classification" — rather than disabling the button silently.
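
The pre-condition amounts to a per-element completeness check. The sketch below assumes a simplified, hypothetical `Element` shape; the wording of the vocabulary hint is also invented for illustration, since only the classification hint is quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    name: str
    accepted_classification: Optional[str] = None
    vocab_concept_iris: List[str] = field(default_factory=list)


def _hint(count: int, what: str) -> str:
    noun = "element" if count == 1 else "elements"
    verb = "needs" if count == 1 else "need"
    return f"{count} {noun} still {verb} {what}"


def readiness_hints(elements: List[Element]) -> List[str]:
    """One hint per unmet requirement; an empty list means the policy can be accepted."""
    unclassified = sum(1 for e in elements if not e.accepted_classification)
    unmapped = sum(1 for e in elements if not e.vocab_concept_iris)
    hints = []
    if unclassified:
        hints.append(_hint(unclassified, "classification"))
    if unmapped:
        hints.append(_hint(unmapped, "a vocabulary mapping"))
    return hints


model = [
    Element("trade_id", "INTERNAL", ["fibo-fbc:TradeIdentifier"]),
    Element("trade_amt", None, ["fibo-fnd:MonetaryAmount"]),
    Element("notes"),
]
print(readiness_hints(model))
```

How an empty logical model should be treated is not specified above; this sketch would report it as ready.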

Policy source

The policySource field in the response indicates provenance:

derived — computed live from current classifications and vocabulary mappings.
explicit — a data owner has accepted and locked the derived policy via POST .../accept. The stored ODRL JSON is returned as-is.
fallback — no element classifications found; terms fall back to the dataset's declared license / accessRights / rightsStatement fields.
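
The three provenance values suggest a simple precedence, sketched below. The ordering (a locked policy wins, then live derivation, then fallback) is an assumption inferred from the descriptions above, and the function names are hypothetical.

```python
from typing import Any, Dict, List, Optional

FALLBACK_FIELDS = ("license", "accessRights", "rightsStatement")


def policy_source(stored_policy: Optional[Dict[str, Any]],
                  accepted_classifications: List[str]) -> str:
    if stored_policy is not None:
        return "explicit"   # locked by a data owner, returned as stored
    if accepted_classifications:
        return "derived"    # computed live from current metadata
    return "fallback"       # nothing to derive from


def fallback_terms(dataset: Dict[str, Any]) -> Dict[str, Any]:
    """Declared rights fields that are present and non-empty on the dataset."""
    return {f: dataset[f] for f in FALLBACK_FIELDS if dataset.get(f)}


print(policy_source(None, ["INTERNAL"]))
print(fallback_terms({"title": "Trades", "license": "CC-BY-4.0", "accessRights": ""}))
```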

Consumer UI

Every dataset in the consumer discovery drawer has a Terms tab showing the access level badge (colour-coded), Permitted Uses, Restrictions, Obligations, Applicable Regulations (as pills), and a collapsible ODRL JSON block for technical consumers.

Producer UI

The producer Governance tab shows a Terms of Use Policy panel with derivation details (classified element count, vocabulary concept count, matched signals). The data owner can Accept Policy to lock the derived terms, or Reset to Derived to clear a locked policy.


Database ERDs

Six databases, one per database-backed service; the search-service keeps its index in OpenSearch instead. All primary keys are UUID. Foreign keys are intra-database only; cross-service references are soft UUID columns without a database-level FK constraint.


erDiagram
    resources {
        uuid id PK
        varchar resource_type
        text iri UK
        uuid tenant_id
        uuid domain_id
        text title
        text description
        timestamptz issued
        timestamptz modified
        text_arr language
        text_arr keywords
        text_arr themes
        text license
        text rights_statement
        text access_rights
        text_arr conforms_to
        uuid creator_id
        uuid publisher_id
        jsonb contact_points
        text source_uri
        jsonb extra
        boolean is_deleted
        timestamptz created_at
        timestamptz updated_at
    }
    catalogs {
        uuid resource_id PK
        text homepage
        uuid_arr has_part
    }
    datasets {
        uuid resource_id PK
        uuid catalog_id
        text accrual_periodicity
        timestamptz temporal_start
        timestamptz temporal_end
        float spatial_resolution_m
        text temporal_resolution
        text version
        text version_notes
        uuid is_version_of FK
    }
    distributions {
        uuid resource_id PK
        uuid dataset_id FK
        text access_url
        text download_url
        text media_type
        text format
        bigint byte_size
        varchar checksum_algorithm
        text checksum_value
        text compress_format
        text package_format
        text availability
        uuid csvw_table_id FK
    }
    data_services {
        uuid resource_id PK
        text endpoint_url
        text endpoint_description
        uuid_arr serves_dataset
        text protocol
        text security_schema_type
    }
    data_products {
        uuid resource_id PK
        varchar lifecycle_status
        uuid owner_id
        text purpose
        varchar information_sensitivity
        jsonb has_policy
    }
    data_product_ports {
        uuid id PK
        uuid data_product_id FK
        varchar port_type
        uuid data_service_id FK
        uuid dataset_id FK
        uuid distribution_id FK
        timestamptz created_at
    }
    catalog_records {
        uuid resource_id PK
        uuid catalog_id
        uuid primary_topic_id
        timestamptz listing_date
        timestamptz modification_date
        text harvest_source
    }
    csvw_tables {
        uuid id PK
        uuid distribution_id FK
        text url
        text title
        text description
        jsonb dialect
        boolean suppress_output
        text table_direction
        timestamptz created_at
        timestamptz updated_at
    }
    csvw_table_schemas {
        uuid id PK
        uuid table_id FK
        text_arr primary_key
        text about_url
        text property_url
        text value_url
    }
    csvw_columns {
        uuid id PK
        uuid schema_id FK
        int ordinal
        text name
        text_arr titles
        text datatype
        boolean required
        boolean virtual
        boolean suppress_output
        text lang
        text default_value
        text property_url
        text value_url
        text about_url
        text description
        uuid logical_data_element_id FK
    }
    vocabularies {
        uuid id PK
        text name
        text prefix UK
        text base_iri UK
        varchar vocabulary_type
        text description
        text version
        text homepage
        boolean is_system
        timestamptz created_at
    }
    dataset_vocabulary_profiles {
        uuid id PK
        uuid dataset_id FK
        uuid vocabulary_id FK
        boolean is_primary
        text_arr domain_tags
        timestamptz created_at
    }
    logical_models {
        uuid id PK
        uuid dataset_id FK
        text name
        text description
        text version
        varchar status
        timestamptz created_at
        timestamptz updated_at
    }
    logical_data_elements {
        uuid id PK
        uuid logical_model_id FK
        text name
        text label
        text description
        text logical_type
        boolean is_required
        boolean is_identifier
        boolean is_nullable
        int ordinal
        timestamptz created_at
        timestamptz updated_at
    }
    logical_element_vocab_mappings {
        uuid id PK
        uuid logical_element_id FK
        uuid vocabulary_id FK
        text concept_iri
        text concept_label
        text concept_definition
        varchar match_type
        timestamptz created_at
    }
    cross_model_mappings {
        uuid id PK
        varchar source_type
        uuid source_id
        varchar target_type
        uuid target_id
        text mapping_type
        timestamptz created_at
    }
    resources ||--o| catalogs : "extends"
    resources ||--o| datasets : "extends"
    resources ||--o| distributions : "extends"
    resources ||--o| data_services : "extends"
    resources ||--o| data_products : "extends"
    resources ||--o| catalog_records : "extends"
    datasets ||--o{ distributions : "has"
    datasets o|--o| datasets : "isVersionOf"
    datasets ||--o{ dataset_vocabulary_profiles : "profiles"
    datasets ||--o{ logical_models : "models"
    distributions o|--o| csvw_tables : "describes"
    csvw_tables ||--o{ csvw_table_schemas : "schema"
    csvw_table_schemas ||--|{ csvw_columns : "columns"
    data_products ||--o{ data_product_ports : "ports"
    data_product_ports o{--o| data_services : "via"
    data_product_ports o{--o| datasets : "via"
    data_product_ports o{--o| distributions : "via"
    vocabularies ||--o{ dataset_vocabulary_profiles : "used in"
    vocabularies ||--o{ logical_element_vocab_mappings : "mapped by"
    logical_models ||--|{ logical_data_elements : "elements"
    logical_data_elements ||--o{ csvw_columns : "bound by"
    logical_data_elements ||--o{ logical_element_vocab_mappings : "mappings"

resources is a polymorphic base table — every typed row shares its PK with a resources row. csvw_columns.logical_data_element_id is nullable; set when a physical column is bound to a logical element.

erDiagram
    harvest_sources {
        uuid id PK
        uuid tenant_id
        text name
        varchar source_type
        text base_url
        text region
        text database_name
        text_arr schema_filter
        text credential_ref
        jsonb extra_config
        timestamptz created_at
        timestamptz updated_at
    }
    harvest_credentials {
        uuid id PK
        uuid source_id FK
        varchar credential_type
        text encrypted_payload
        timestamptz created_at
    }
    harvest_jobs {
        uuid id PK
        uuid source_id FK
        text name
        text schedule_cron
        boolean full_refresh
        boolean enabled
        timestamptz created_at
        timestamptz updated_at
    }
    harvest_runs {
        uuid id PK
        uuid job_id FK
        uuid source_id FK
        varchar status
        varchar triggered_by
        timestamptz started_at
        timestamptz completed_at
        int entities_discovered
        int entities_created
        int entities_updated
        int entities_failed
        text snapshot_path
        text error_message
        boolean full_refresh
        timestamptz created_at
    }
    harvest_run_items {
        uuid id PK
        uuid run_id FK
        varchar entity_type
        text source_key
        uuid canonical_id
        varchar action
        jsonb raw_payload
        jsonb normalized_payload
        text error_detail
        timestamptz created_at
    }
    harvest_sources ||--o{ harvest_credentials : "credentials"
    harvest_sources ||--o{ harvest_jobs : "jobs"
    harvest_jobs ||--o{ harvest_runs : "runs"
    harvest_sources ||--o{ harvest_runs : "runs"
    harvest_runs ||--o{ harvest_run_items : "items"

harvest_credentials.encrypted_payload stores AES-256-GCM ciphertext; plaintext never reaches the database. harvest_sources.source_type: dcat_http | aws_glue | snowflake | teradata.

erDiagram
    lineage_jobs {
        uuid id PK
        text namespace
        text name
        jsonb facets
        bigint age_vertex_id
    }
    lineage_datasets {
        uuid id PK
        text namespace
        text name
        jsonb facets
        jsonb schema_facet
        uuid catalog_resource_id
        bigint age_vertex_id
    }
    lineage_runs {
        uuid id PK
        text run_id UK
        uuid job_id FK
        jsonb facets
        timestamptz nominal_start_time
        timestamptz nominal_end_time
    }
    lineage_run_events {
        uuid id PK
        uuid run_id FK
        varchar event_type
        timestamptz event_time
        text producer
        text schema_url
        jsonb inputs
        jsonb outputs
        jsonb raw_event
        timestamptz created_at
    }
    column_lineage {
        uuid id PK
        uuid run_event_id FK
        uuid output_dataset_id FK
        text output_column
        uuid input_dataset_id FK
        text input_column
        text transformation_type
    }
    lineage_jobs ||--o{ lineage_runs : "runs"
    lineage_runs ||--o{ lineage_run_events : "events"
    lineage_run_events ||--o{ column_lineage : "columns"
    column_lineage o{--o| lineage_datasets : "output dataset"
    column_lineage o{--o| lineage_datasets : "input dataset"

Apache AGE graph lineage_graph: vertices Job, Dataset, Column; edges READ_BY, WRITES_TO, DERIVED_FROM, COLUMN_LINEAGE. age_vertex_id links each relational row to its AGE vertex for Cypher traversal.

erDiagram
    conversations {
        uuid id PK
        uuid tenant_id
        uuid user_id
        text title
        timestamptz created_at
    }
    conversation_messages {
        uuid id PK
        uuid conversation_id FK
        varchar role
        text content
        int token_count
        text model_used
        timestamptz created_at
    }
    embedding_documents {
        uuid id PK
        uuid tenant_id
        varchar entity_type
        uuid entity_id
        int chunk_index
        text content
        vector768 embedding
        text model_name
        jsonb metadata
        timestamptz created_at
    }
    conversations ||--|{ conversation_messages : "messages"

embedding_documents.embedding is a VECTOR(768) column indexed with IVFFlat (cosine distance, 100 lists). The composite unique key (entity_id, chunk_index, model_name) ensures idempotent re-embedding.

erDiagram
    organizations {
        uuid id PK
        text name UK
        text display_name
        text description
        varchar plan
        boolean active
        timestamptz created_at
    }
    domains {
        uuid id PK
        uuid tenant_id
        text name
        text description
        uuid parent_domain_id FK
        uuid owner_id
        timestamptz created_at
        timestamptz updated_at
    }
    catalog_users {
        uuid id PK
        uuid tenant_id
        text email UK
        text first_name
        text last_name
        text keycloak_user_id UK
        boolean active
        text_arr roles
        text_arr permissions
        timestamptz created_at
        timestamptz updated_at
    }
    api_keys {
        uuid id PK
        uuid tenant_id
        uuid owner_id
        text key_hash UK
        text description
        boolean active
        timestamptz expires_at
        text_arr scopes
        timestamptz created_at
        timestamptz last_used_at
    }
    organizations ||--o{ domains : "tenancy"
    organizations ||--o{ catalog_users : "members"
    organizations ||--o{ api_keys : "keys"
    domains o|--o{ domains : "parent"
    catalog_users ||--o{ api_keys : "owns"

Keycloak is the authoritative identity provider. catalog_users.keycloak_user_id links to the Keycloak realm user. domains is self-referential for hierarchical domain trees. api_keys.key_hash stores SHA-256 of the bearer token.
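
Storing only a SHA-256 digest means the bearer token itself can be shown once at issuance and never persisted. A minimal sketch of that pattern follows; the hex encoding of the digest, the token length and the function names are assumptions, not details confirmed above.

```python
import hashlib
import hmac
import secrets


def hash_api_key(token: str) -> str:
    """SHA-256 of the bearer token, hex-encoded (the encoding is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented token's hash with the stored one."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)


# Issue a key: the plaintext goes to the caller once, only the hash is stored.
new_key = secrets.token_urlsafe(32)
stored = hash_api_key(new_key)

print(verify_api_key(new_key, stored))
print(verify_api_key("wrong-key", stored))
```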

erDiagram
    policy_records {
        uuid id PK
        uuid dataset_id
        uuid tenant_id
        varchar policy_level
        jsonb policy_json
        timestamptz created_at
        timestamptz updated_at
    }
    evaluation_log {
        uuid id PK
        uuid dataset_id
        uuid tenant_id
        varchar action
        boolean granted
        jsonb request_context
        timestamptz created_at
    }
    policy_pieces {
        uuid id PK
        uuid tenant_id
        varchar piece_type
        varchar dimension_key
        text label
        jsonb policy_json
        varchar policy_level
        timestamptz created_at
        timestamptz updated_at
    }
    dataset_policy_links {
        uuid id PK
        uuid dataset_id
        uuid tenant_id
        uuid piece_id FK
        timestamptz applied_at
    }
    policy_pieces ||--o{ dataset_policy_links : "applied to datasets"

policy_records holds the assembled, effective ODRL policy per dataset (unique on (dataset_id, tenant_id)). policy_pieces are reusable, typed fragments (CLASSIFICATION / REGULATION / CONTRACTUAL) that dataset_policy_links composes onto datasets (piece_id FK cascades on delete). Every POST /evaluate call appends a row to evaluation_log.


Inventory Service

The inventory-service is the primary metadata store. It owns all DCAT, DPROD, CSV-W, logical model, and vocabulary resources. All other services treat it as the source of truth.

Key responsibilities

Persist and version DCAT Datasets, Distributions, DataServices, Catalogs
Persist DPROD DataProducts, Ports, and lifecycle transitions
Store CSV-W tables and columns (populated by harvest events)
Manage LogicalModels and LogicalDataElements with physical column bindings
Maintain the vocabulary registry and per-dataset vocabulary profiles
Export the full catalog as DCAT 3.0 JSON-LD via Apache Jena
Publish catalog.*.changes Kafka events on all mutations

Database

PostgreSQL 16 on port 5433 (Docker). Migrations managed by Flyway. Key tables: resources, datasets, distributions, data_products, csvw_columns, logical_models, logical_data_elements, vocabularies.


Harvest Service

The harvest-service crawls external data sources, normalises their metadata, and publishes it to Kafka for the inventory-service to ingest. Jobs are scheduled with Quartz and executed as Spring Batch jobs.

Supported connectors

Connector	Source type	What it harvests
`dcat_http`	Any DCAT HTTP endpoint	Datasets, distributions via Apache Jena (JSON-LD, Turtle, RDF/XML)
`aws_glue`	AWS Glue Data Catalog	Databases, tables, columns, partitions via AWS SDK v2
`snowflake`	Snowflake	`SHOW TABLES`, `DESCRIBE TABLE`, `GET DDL`
`teradata`	Teradata	`DBC.TablesV`, `DBC.ColumnsV`, `DBC.ShowSQL`

Configure a source

bash

curl -X POST http://localhost:8002/api/v1/sources \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{
    "name": "Snowflake Production",
    "sourceType": "snowflake",
    "baseUrl": "orgname-accountname.snowflakecomputing.com",
    "databaseName": "TRADING_DB",
    "schemaFilter": ["BLOTTER", "RISK"],
    "credentialRef": "vault://snowflake/prod"
  }'


Lineage Service

The lineage-service ingests OpenLineage events and DDL, persists them to PostgreSQL, and builds a property graph in Apache AGE for multi-hop Cypher traversal.

DDL lineage

Submit raw DDL to extract lineage without running a pipeline:

bash

curl -X POST http://localhost:8003/api/v1/ddl/submit \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -d '{
    "dialect": "SNOWFLAKE",
    "ddl": "CREATE VIEW RISK_DB.MARKET_RISK.DAILY_POSITIONS AS SELECT t.*, p.close_price FROM TRADING_DB.BLOTTER.TRADE_BLOTTER t JOIN kafka://prices-realtime p ON t.instrument_id = p.instrument_id"
  }'

Apache Calcite parses the DDL across Snowflake, Teradata, and Hive dialects. A DERIVED_FROM edge is created in the AGE graph between each source table and the view.
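
To make the resulting edges concrete, the sketch below extracts them from the example DDL with two regular expressions. This is a deliberately naive stand-in for the Calcite parser: it ignores subqueries, CTEs and quoted identifiers, and the edge direction (view DERIVED_FROM source) is an assumption.

```python
import re
from typing import List, Tuple

VIEW_RE = re.compile(r"CREATE\s+(?:OR\s+REPLACE\s+)?VIEW\s+(\S+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([^\s,()]+)", re.IGNORECASE)


def derived_from_edges(ddl: str) -> List[Tuple[str, str, str]]:
    """Return (view, 'DERIVED_FROM', source) triples for a CREATE VIEW statement."""
    view = VIEW_RE.search(ddl)
    if not view:
        return []
    target = view.group(1)
    sources: List[str] = []
    for name in SOURCE_RE.findall(ddl):
        if name not in sources:  # keep first-seen order, drop duplicates
            sources.append(name)
    return [(target, "DERIVED_FROM", source) for source in sources]


ddl = (
    "CREATE VIEW RISK_DB.MARKET_RISK.DAILY_POSITIONS AS "
    "SELECT t.*, p.close_price FROM TRADING_DB.BLOTTER.TRADE_BLOTTER t "
    "JOIN kafka://prices-realtime p ON t.instrument_id = p.instrument_id"
)

for edge in derived_from_edges(ddl):
    print(edge)
```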


Search Service

The search-service maintains an OpenSearch index that is enriched with logical model data, vocabulary concept labels, and FIBO IRIs. It consumes Kafka events to stay in sync with the catalog.

Search query

bash

# Full-text search with filters
curl "http://localhost:8004/api/v1/search?q=trade&type=dataset&domain=Finance&hasLineage=true" \
  -H "X-API-Key: dev-local"

# FIBO concept facet search
curl "http://localhost:8004/api/v1/search?fibo_concept=MonetaryAmount" \
  -H "X-API-Key: dev-local"

# Semantic type facet search
curl "http://localhost:8004/api/v1/search?semanticType=Customer" \
  -H "X-API-Key: dev-local"

# Autocomplete suggestions
curl "http://localhost:8004/api/v1/search/suggest?q=trad" \
  -H "X-API-Key: dev-local"

Datasets are indexed with a semanticTypes keyword field derived from their vocabulary mappings, exposed as both a filter and an aggregation facet. See Semantic Types.

Reindex

bash

curl -X POST http://localhost:8004/api/v1/admin/reindex \
  -H "X-API-Key: dev-local"
# → {"datasetsIndexed":142,"dataProductsIndexed":31,"distributionsIndexed":287}


AI Service

The ai-service provides a RAG (Retrieval-Augmented Generation) pipeline over your metadata corpus using Spring AI. It can run fully on-premises with Ollama or use the OpenAI API.

Start a conversation

bash

# Create a conversation
CONV=$(curl -s -X POST http://localhost:8005/api/v1/conversations \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"title": "My session"}' | jq -r .id)

# Ask a question (streaming SSE response)
curl -N -X POST http://localhost:8005/api/v1/conversations/$CONV/messages \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -H "Accept: text/event-stream" \
  -d '{"content": "Which datasets contain monetary amounts mapped to FIBO?"}'

# Multi-dataset SQL query — schema for both datasets loaded; join hints derived from shared vocab IRIs;
# platform conflict detected if datasets are on different systems (Snowflake vs Delta Lake, etc.)
curl -N -X POST http://localhost:8005/api/v1/conversations/$CONV/messages \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -H "Accept: text/event-stream" \
  -d '{
    "content": "Write a Snowflake SQL join between trade positions and counterparty master",
    "focusDatasetIds": ["<trade-positions-id>", "<counterparty-master-id>"]
  }'

Embedding pipeline

The ai-service listens on inventory.datasets.changes. On each event it builds four pgvector chunks per dataset and upserts them:

Chunk 0 — title + description
Chunk 1 — keywords + themes
Chunk 2 — semantic types + vocabulary concept labels + logical element names
Chunk 3 — physical column names and SQL datatypes (used to ground multi-dataset queries)
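
The four-chunk layout can be sketched as below. The dataset dictionary shape, the model name and the in-memory store are all hypothetical; the store is keyed by (entity_id, chunk_index, model_name) so that processing the same event twice overwrites rows instead of duplicating them.

```python
from typing import Any, Dict, List, Tuple

ChunkKey = Tuple[str, int, str]  # (entity_id, chunk_index, model_name)


def build_chunks(ds: Dict[str, Any]) -> List[str]:
    """Text for chunks 0 to 3, in the order listed above."""
    return [
        f"{ds.get('title', '')}\n{ds.get('description', '')}".strip(),
        ", ".join(ds.get("keywords", []) + ds.get("themes", [])),
        ", ".join(ds.get("semanticTypes", []) + ds.get("conceptLabels", [])
                  + ds.get("elementNames", [])),
        ", ".join(f"{c['name']} {c['datatype']}" for c in ds.get("columns", [])),
    ]


def upsert_chunks(store: Dict[ChunkKey, str], entity_id: str,
                  model_name: str, chunks: List[str]) -> None:
    for index, content in enumerate(chunks):
        store[(entity_id, index, model_name)] = content


dataset = {
    "title": "Trade Blotter",
    "description": "Intraday trades",
    "keywords": ["trade"],
    "themes": ["markets"],
    "semanticTypes": ["Trade"],
    "conceptLabels": ["MonetaryAmount"],
    "elementNames": ["trade_amt"],
    "columns": [{"name": "TRADE_AMT", "datatype": "NUMBER(18,2)"}],
}

store: Dict[ChunkKey, str] = {}
upsert_chunks(store, "ds-1", "example-embedding-model", build_chunks(dataset))
upsert_chunks(store, "ds-1", "example-embedding-model", build_chunks(dataset))
print(len(store))
```

The sketch stores raw text only; computing and storing the embedding vector for each chunk is left out.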

Element classification & description

The ai-service provides two metadata enrichment operations for logical data elements:

Classification — infers a data sensitivity level (PUBLIC, INTERNAL, CONFIDENTIAL, or HIGH_CONFIDENTIAL) from the element's name, logical type, and FIBO / schema.org vocabulary mappings. Returns a one-sentence reasoning.
Description — generates a plain-English business description grounded in the element's context. Descriptions are surfaced in the Model tab's Description column as inline suggestions pending owner review.

Both operations are triggered via the inventory-service proxy (which owns persistence). Results are stored as recommendedClassification / recommendedDescription on the element and cleared when accepted or rejected. Only the dataset's data owner or an administrator may accept or reject recommendations.

bash

curl -s -X POST http://localhost:8005/api/v1/classify/elements \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{
    "elements": [
      { "elementId": "e1", "name": "ssn", "label": "Social Security Number",
        "logicalType": "string", "vocabConceptLabels": ["Person Identifier"] }
    ]
  }'
# → { "results": [ { "elementId": "e1",
#       "classification": "HIGH_CONFIDENTIAL", "reasoning": "..." } ] }

Element description endpoint

bash

curl -s -X POST http://localhost:8005/api/v1/describe/elements \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{
    "elements": [
      { "elementId": "e1", "name": "trade_amt", "label": "Trade Amount",
        "logicalType": "MonetaryAmount",
        "vocabConceptLabels": ["MonetaryAmount", "Currency"] }
    ]
  }'
# → { "results": [ { "elementId": "e1",
#       "description": "The gross notional value of the trade in settlement currency,
#                       recorded pre-netting." } ] }

Semantic recommendations

This operation analyses a dataset's metadata (title, description, keywords, element names, logical types, and current vocabulary mappings) and recommends additional business domain types and vocabulary concepts to improve semantic coverage. The producer UI calls it through the inventory-service proxy POST /api/v1/datasets/{id}/recommend-semantic-context.

bash

curl -s -X POST http://localhost:8005/api/v1/recommend-semantic-context \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{
    "datasetId": "...", "title": "Retail Customer Accounts",
    "keywords": ["customer", "account"],
    "elementNames": ["customer_id", "balance"],
    "currentVocabLabels": ["Customer"]
  }'
# → { "types": [ { "type": "DebitCardAccount",
#       "rationale": "...", "vocabularyHint": "FIBO" } ], "rationale": "..." }

Agentic review (proposer / reviewer)

Beyond one-shot recommendations, the ai-service runs a two-agent proposer/reviewer loop over a logical model to produce higher-quality, self-critiqued enrichment. A proposer drafts per-element descriptions, classifications, vocabulary concept mappings, and PII / direct-identifier flags; a reviewer then audits that draft against the dataset's full DCAT context and returns a verdict (APPROVE / REJECT) with per-issue comments. The proposer revises on each REJECT. The loop is capped at 10 iterations, and a long-term review memory carries lessons from past reviews into new runs to speed convergence. On convergence (or the cap) the final recommendation is persisted to the model's elements for a data owner to accept or reject.
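
The control flow of the loop can be sketched with stub agents. Everything here is illustrative: the callables stand in for the LLM-backed proposer and reviewer, the review memory is omitted, and the function and state names are hypothetical apart from the APPROVE / REJECT verdicts and the cap of 10.

```python
from typing import Callable, List, Tuple

MAX_ITERATIONS = 10


def agentic_review(propose: Callable[[List[str]], str],
                   review: Callable[[str], Tuple[str, List[str]]],
                   max_iterations: int = MAX_ITERATIONS) -> Tuple[str, str, int]:
    """Return (terminal_state, final_draft, iterations_used)."""
    comments: List[str] = []
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = propose(comments)          # revise using the last review's comments
        verdict, comments = review(draft)  # audit the draft
        if verdict == "APPROVE":
            return "DONE", draft, iteration
    return "MAX_REACHED", draft, max_iterations


def stub_propose(comments: List[str]) -> str:
    return f"draft addressing {len(comments)} comment(s)"


def make_reviewer(approve_on: int) -> Callable[[str], Tuple[str, List[str]]]:
    calls = {"n": 0}

    def review(draft: str) -> Tuple[str, List[str]]:
        calls["n"] += 1
        if calls["n"] >= approve_on:
            return "APPROVE", []
        return "REJECT", [f"issue found in round {calls['n']}"]

    return review


print(agentic_review(stub_propose, make_reviewer(3)))
print(agentic_review(stub_propose, make_reviewer(99))[0])
```

In both terminal states the last draft is returned, matching the behaviour described above where the final recommendation is persisted on convergence or at the cap.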

Progress streams over Server-Sent Events — each data: line is a JSON AgenticEvent: phase markers (CONTEXT, MEMORY, PROPOSING, REVIEWING, LOCKED), the proposer's PROPOSAL, the reviewer's REVIEW (verdict + comments + summary), and a terminal DONE / MAX_REACHED / ERROR.

bash

# Run the agentic review over one logical model and stream progress (SSE)
curl -N -X POST http://localhost:8005/api/v1/agentic-review \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-local" \
  -H "Accept: text/event-stream" \
  -d '{"datasetId": "<dataset-id>", "modelId": "<logical-model-id>"}'

ℹ

Swagger UI cannot render SSE — use curl -N or the producer UI to observe the stream.
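
A client consuming the stream outside the producer UI needs to pick out the `data:` lines and stop at a terminal event. The sketch below assumes each event carries its kind in a `type` field; that field name and the sample payloads are hypothetical, as the AgenticEvent schema is not shown here.

```python
import json
from typing import Dict, Iterable, Iterator, Optional

TERMINAL = {"DONE", "MAX_REACHED", "ERROR"}


def parse_sse(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one decoded JSON object per 'data:' line; other lines are skipped."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


def final_event(lines: Iterable[str]) -> Optional[Dict]:
    """Return the first terminal event, or None if the stream ended without one."""
    for event in parse_sse(lines):
        if event.get("type") in TERMINAL:
            return event
    return None


# Hypothetical stream content; real payloads will carry more fields.
sample = [
    'data: {"type": "CONTEXT"}', "",
    'data: {"type": "PROPOSAL", "iteration": 1}', "",
    'data: {"type": "REVIEW", "verdict": "APPROVE"}', "",
    'data: {"type": "DONE"}', "",
]

print([event["type"] for event in parse_sse(sample)])
print(final_event(sample))
```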

✓

The ai-service is optional. Start it with docker compose --profile ai up -d ai-service ollama. All other services run without it.


Identity Service

Port: 8006 | Database: PostgreSQL 16 (port 5436) + Keycloak (port 8180)

The identity-service manages organisations, users, roles, and access policies. It integrates with Keycloak 24 for OIDC token issuance and validation. All other backend services validate JWTs issued by Keycloak.

Responsibilities

User provisioning (invite, list, enable/disable) backed by the Keycloak Admin REST API — changes made in the producer Admin › Users UI are written directly to Keycloak and synced to the local catalog_users table
Bidirectional Keycloak sync — on startup the service imports any existing Keycloak users into the local catalog database so that user references (e.g. ownerId) resolve correctly
Organisation and domain management
Long-lived API key issuance (stored as SHA-256 hashes)
ABAC policy evaluation
JWT issuer — all services trust http://keycloak:8180/realms/datacatalog

Keycloak realm

The datacatalog realm is auto-imported from infra/keycloak/datacatalog-realm.json on first startup. Subsequent changes must be made via the Keycloak Admin Console or REST API — the file is only read against a fresh database.

Item	Value
Admin console	`http://localhost:8180`
Admin credentials	`admin` / `admin`
Realm	`datacatalog`
Frontend client	`catalog-frontend` (public, PKCE)

ℹ

Keycloak 24 reads the initial admin credentials from the KEYCLOAK_ADMIN and KEYCLOAK_ADMIN_PASSWORD environment variables. The KC_BOOTSTRAP_ADMIN_* variables were introduced in later Keycloak releases and are not recognised by version 24.

For roles, default users, and how to invite new users see Roles & Login →


Policy Service

Port: 8007 | Database: PostgreSQL 16 (port 5438)

The policy-service is the platform's Policy Decision Point (PDP). It holds the ODRL policy registry for all datasets and evaluates them on demand using an internal implementation of the ODRE enforcement algorithm (Cimmino et al., Computers & Security, 2025) — a concrete enforcement layer on top of ODRL that produces machine-readable UsageDecision tuples at request time.

Responsibilities

Maintain a policy_records registry keyed by (dataset_id, tenant_id)
Auto-sync policies from inventory.datasets.changes Kafka events when a data owner accepts terms
Evaluate ODRL policies at request time via POST /api/v1/policies/{datasetId}/evaluate
Persist an evaluation_log for every evaluation call
Publish PolicyEvaluationResultPayload to policy.evaluations.completed

Policy levels

Level	Description
A-Level	Pure ODRL JSON-LD. Static constraint evaluation — dateTime, numeric, string comparisons. All policies generated by `TermsOfUseService` are A-Level.
B1-Level	Variable injection: `[=varName]` placeholders in the stored policy are resolved from the `M` map passed at evaluation time. Use to inject caller role, caller ID, or any runtime value.
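
B1-Level resolution can be pictured as a textual substitution over the stored policy before evaluation. The sketch below is one possible reading: the identifier pattern, the error on a missing variable, and the sample policy fragment are assumptions, not the service's documented behaviour.

```python
import re
from typing import Any, Dict

# Matches placeholders such as [=callerRole]; the allowed name syntax is assumed.
PLACEHOLDER = re.compile(r"\[=([A-Za-z_][A-Za-z0-9_]*)\]")


def inject_variables(policy_text: str, M: Dict[str, Any]) -> str:
    """Replace every [=varName] with the matching value from the M map."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in M:
            raise KeyError(f"no value supplied for [={name}]")
        return str(M[name])

    return PLACEHOLDER.sub(substitute, policy_text)


# Hypothetical constraint fragment from a stored B1-Level policy.
stored = '{"leftOperand": "[=callerRole]", "operator": "eq", "rightOperand": "DATA_OWNER"}'
print(inject_variables(stored, {"callerRole": "DATA_OWNER"}))
```

Plain string substitution does not escape values, so a real implementation would need to guard against injected quotes or braces.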

Policy composition (component pieces)

A dataset's effective policy is assembled from reusable, typed fragments called policy pieces. Each piece has a type of CLASSIFICATION (data sensitivity), REGULATION (e.g. FCRA-aware rules), or CONTRACTUAL (terms-of-use obligations), keyed by a dimension value. dataset_policy_links records which pieces apply to which dataset, and the registry composes them into the effective policy_records document. Inspect the breakdown via GET /api/v1/policies/{datasetId}/components.

UsageDecision semantics

Tuple	Meaning
`(action, "true")`	Permission granted
`(action, action)`	Delegated — caller must handle (e.g. show attribution notice)
Absent	All constraints failed — access denied
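
A caller can map these tuples to an outcome with a small helper. This sketch follows the table literally; the function name and the outcome labels are hypothetical.

```python
from typing import List, Tuple


def interpret(decisions: List[Tuple[str, str]], action: str) -> str:
    """Classify the outcome for one action from a list of (action, result) tuples."""
    for decided_action, result in decisions:
        if decided_action != action:
            continue
        if result == "true":
            return "GRANTED"
        if result == action:
            return "DELEGATED"  # caller must handle, e.g. show an attribution notice
    return "DENIED"             # no tuple for this action


decisions = [("read", "true"), ("distribute", "distribute")]
print(interpret(decisions, "read"))
print(interpret(decisions, "distribute"))
print(interpret(decisions, "delete"))
```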

Quick examples

bash

# Register a policy (upsert)
curl -X PUT http://localhost:8007/api/v1/policies/{datasetId} \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"policyJson":"{...odrl...}","policyLevel":"A"}'

# Evaluate access (A-Level — no M variables needed)
curl -X POST http://localhost:8007/api/v1/policies/{datasetId}/evaluate \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"M":{},"F":{}}'
# → {"granted":true,"policyLevel":"A","decisions":[{"action":"read","result":"true","delegated":false}]}

# Evaluate access (B1-Level — inject callerRole at runtime)
curl -X POST http://localhost:8007/api/v1/policies/{datasetId}/evaluate \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"M":{"callerRole":"DATA_OWNER"},"F":{}}'

ℹ

Policies are also auto-registered when a data owner accepts terms in the producer UI — the hasPolicy field on the DatasetChangedPayload Kafka event triggers an upsert without any manual API call.

See Policy API → for the full endpoint reference.


Roles & Login

The producer UI (http://localhost:3000) requires login before any content is shown. The consumer UI (http://localhost:3001) is read-only and does not require authentication.

Login flow

User visits the producer UI → redirected to the Keycloak login page automatically
User logs in with email and password
Keycloak issues an OIDC access token (JWT) via Authorization Code + PKCE
The producer app stores the token in memory and sends it as Authorization: Bearer <token> on every API call
Backend services validate the JWT against the Keycloak JWKS endpoint
The token is refreshed automatically every 30 seconds before it expires
Click Sign out in the sidebar to terminate the Keycloak session
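
For readers unfamiliar with PKCE, the verifier and challenge used in the Authorization Code step can be generated as below. The sketch assumes the S256 challenge method defined in RFC 7636; which method the `catalog-frontend` client is configured with is not stated here, and the function names are illustrative.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """base64url(SHA-256(verifier)) without '=' padding, per RFC 7636."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
# The client sends the challenge with the authorization request and the
# verifier with the token request; the server recomputes and compares.
print(len(verifier), len(challenge))
```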

Roles

Display name	Keycloak name	Description	Admin nav
Administrator	`administrator`	Full platform access — users, harvest sources, all data assets, settings	All items
Data Governance	`data-governance`	Governs data quality and compliance across domains	Domains only
Data Owner	`data-owner`	Owns and manages data products and datasets in their domain	Hidden
Data Steward	`data-steward`	Curates metadata, semantic annotations, and logical models	Hidden

Role → backend permissions

Role	Permissions claim values	Effect
Administrator	`catalog:read`, `catalog:write`, `catalog:admin`	Full API access
Data Governance	`catalog:read`, `catalog:write`	Read + mutate; no admin-only endpoints
Data Owner	`catalog:read`, `catalog:write`	Read + mutate; no admin-only endpoints
Data Steward	`catalog:read`, `catalog:write`	Read + mutate; no admin-only endpoints

Default users

Email	Password	Role
`admin@datacatalog.local`	`admin`	Administrator
`governance@datacatalog.local`	`password`	Data Governance
`owner@datacatalog.local`	`password`	Data Owner
`steward@datacatalog.local`	`password`	Data Steward

⚠

Change all passwords before exposing the stack to a network.


Inviting Users

Users are managed through the Keycloak Admin Console. The identity-service user API (POST /api/v1/users/invite) is the programmatic path for the same workflow.

Via Keycloak Admin Console

Open the admin console and log in as admin / admin — Docker Compose: http://localhost:8180; MicroK8s: http://keycloak.catalog.local/admin (add the host to /etc/hosts as printed by deploy.sh)
Select the datacatalog realm from the top-left dropdown
Go to Users → Add user
Set Email, First name, Last name; enable Email verified; click Create
On the Credentials tab → set a temporary password and click Save password
On the Role mapping tab → click Assign role → select one of the four catalog roles
On the Attributes tab → add the following key/value pairs:

Key	Value(s)	Notes
`tenant_id`	`00000000-0000-0000-0000-000000000001`	Single value. Must match the tenant UUID in the database.
`permissions`	See table below	Add one value per row in Keycloak (multivalued attribute).

Permissions attribute by role

Role	Permission values (one per row)
Administrator	`catalog:read`, `catalog:write`, `catalog:admin`
Data Governance	`catalog:read`, `catalog:write`
Data Owner	`catalog:read`, `catalog:write`
Data Steward	`catalog:read`, `catalog:write`

ℹ

The tenant_id and permissions attributes are mapped into the JWT by Keycloak protocol mappers. Without them the backend services will reject the token with 403.

ℹ

The realm sets unmanagedAttributePolicy: ADMIN_EDIT, so the tenant_id and permissions attributes always save correctly from the Admin Console or the Admin REST API. This policy ships with the realm import, so it applies automatically to any fresh deployment.
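
When scripting user creation against the Keycloak Admin REST API instead of clicking through the console, the same attributes go into the user representation. The sketch below only builds the JSON body; using the email as the username is an assumption, the helper name is hypothetical, and realm role assignment is a separate Admin API call not shown.

```python
import json
from typing import Any, Dict, List

ROLE_PERMISSIONS: Dict[str, List[str]] = {
    "administrator": ["catalog:read", "catalog:write", "catalog:admin"],
    "data-governance": ["catalog:read", "catalog:write"],
    "data-owner": ["catalog:read", "catalog:write"],
    "data-steward": ["catalog:read", "catalog:write"],
}


def user_representation(email: str, first_name: str, last_name: str,
                        role: str, tenant_id: str) -> Dict[str, Any]:
    """JSON body for creating a realm user with the attributes described above."""
    return {
        "username": email,  # assumption: email doubles as the username
        "email": email,
        "firstName": first_name,
        "lastName": last_name,
        "enabled": True,
        "emailVerified": True,
        "attributes": {
            # Keycloak stores every attribute as a list of strings.
            "tenant_id": [tenant_id],
            "permissions": list(ROLE_PERMISSIONS[role]),
        },
    }


body = user_representation("jane@example.com", "Jane", "Doe", "data-steward",
                           "00000000-0000-0000-0000-000000000001")
print(json.dumps(body, indent=2))
```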

Force logout / session revocation

To expire all active sessions for a user (e.g. so that a role change takes effect immediately):

Keycloak Admin Console → Users → select the user
Sessions tab → Sign out all sessions

The user will be redirected to the login page on their next API call.


Keycloak Setup

Realm

Property	Value
Realm name	`datacatalog`
Admin console	`http://localhost:8180`
Admin credentials	`KEYCLOAK_ADMIN` / `KEYCLOAK_ADMIN_PASSWORD` (default: `admin` / `admin`)
Import file	`infra/keycloak/datacatalog-realm.json` (loaded on first startup only)
Access token lifetime	1 hour

Clients

Client ID	Type	Used by
`catalog-frontend`	Public (PKCE)	Producer and consumer browser apps — Authorization Code flow with PKCE
`identity-service`	Confidential (service account)	Backend M2M calls to Keycloak Admin API
`catalog-api`	Bearer-only (resource server)	Validates JWT tokens on behalf of backend services

Protocol mappers (catalog-scopes)

Mapper name	JWT claim	Type	Purpose
`tenant_id`	`tenant_id`	String	Tenant scoping for all backend database queries
`permissions`	`permissions`	JSON array (multivalued)	Spring Security authority grants (`SCOPE_catalog:read` etc.)
`realm_roles`	`realm_access.roles`	String array	Frontend role-based UI visibility (nav gating, feature flags)
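To check that the mappers actually place these claims in a token, the payload segment of the JWT can be decoded locally. The helper below is a hypothetical sketch: it only decodes the base64url payload for inspection and does not verify the signature.

```shell
# Print the decoded payload (claims) of a JWT. Does NOT verify the signature.
# Usage: jwt_payload TOKEN
jwt_payload() (
  # The payload is the second dot-separated segment, base64url-encoded.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the "=" padding; restore it before decoding.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
)
```

With a token in TOKEN, piping the output through jq '{tenant_id, permissions}' shows just the two claims the backend services rely on.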

Applying realm changes

The realm JSON is only imported on a fresh database. For an already-running stack, use the Keycloak Admin Console or the Admin REST API to apply changes:

bash

# Get an admin token
TOKEN=$(curl -s -X POST http://localhost:8180/realms/master/protocol/openid-connect/token \
  -d "client_id=admin-cli&username=admin&password=admin&grant_type=password" \
  | jq -r .access_token)

# List current roles
curl -s -H "Authorization: Bearer $TOKEN" \
  http://localhost:8180/admin/realms/datacatalog/roles | jq '.[].name'

# Create a new role
curl -s -X POST -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"my-role","description":"Custom role"}' \
  http://localhost:8180/admin/realms/datacatalog/roles

Tip: To re-import a fresh realm, delete the Keycloak Postgres volume and restart. This erases all existing users.

docker compose down keycloak postgres-identity && docker volume rm data-catalog_identity-pgdata && docker compose up -d keycloak postgres-identity


Authentication

Request headers

Header	Required	Description
`Authorization: Bearer <jwt>`	One of these two	Keycloak OIDC access token. Tenant ID is extracted from the `tenant_id` JWT claim automatically.
`X-API-Key: <key>`	One of these two	Long-lived API key issued by identity-service. Tenant is resolved from the key record.

Dev API key (local development only)

Any key starting with dev- bypasses Keycloak entirely and grants catalog:admin scope with tenant 00000000-0000-0000-0000-000000000001. Use it for curl smoke tests and CI pipelines that don't need a real login.

bash

curl -H "X-API-Key: dev-local" http://localhost:8001/api/v1/datasets

Warning: Never use X-API-Key: dev-* in production. It grants unrestricted admin access to all tenants with no authentication.

Producer UI authentication

The producer UI uses the Keycloak catalog-frontend public client (Authorization Code + PKCE). Visiting http://localhost:3000 redirects unauthenticated users to the Keycloak login page automatically. See Roles & Login for details.


Inventory API :8001

Catalogs

GET /api/v1/catalogs

List all catalogs for the tenant.

POST /api/v1/catalogs

Create a catalog.

GET /api/v1/catalogs/{id}

Get catalog by ID.

PUT /api/v1/catalogs/{id}

Update a catalog.

DELETE /api/v1/catalogs/{id}

Delete a catalog.

GET /api/v1/catalogs/{id}/export

Export catalog as DCAT 3.0 JSON-LD. Set Accept: application/ld+json.

Datasets

GET /api/v1/datasets

List datasets. Params: page, size, catalogId, domain.

POST /api/v1/datasets

Create a new dataset.

GET /api/v1/datasets/{id}

Get dataset by ID.

PUT /api/v1/datasets/{id}

Replace all fields of a dataset.

DELETE /api/v1/datasets/{id}

Soft-delete a dataset and its distributions.

Dataset semantic types

GET /api/v1/datasets/{id}/semantic-context

Semantic types and vocabulary context aggregated from the dataset's published logical models.

POST /api/v1/datasets/{id}/recommend-semantic-context

Request AI semantic type recommendations for the dataset (proxies the ai-service).

POST /api/v1/datasets/{id}/semantic-tags

Accept a semantic type tag. Body: {"type":"...","vocabularyIri":"..."}.

DELETE /api/v1/datasets/{id}/semantic-tags/{tagId}

Remove a previously accepted semantic tag.

Dataset ownership & audit

PUT /api/v1/datasets/{id}/owner

Assign a data owner to an unowned dataset. Body: {"userId":"..."}.

GET /api/v1/datasets/{id}/history

Reverse-chronological, paginated audit log. Params: page, size.

POST /api/v1/datasets/{id}/ownership-proposals

Propose ownership transfer. Body: {"proposedOwnerId":"..."}.

GET /api/v1/datasets/{id}/ownership-proposals/pending

Get the current PENDING proposal. Returns 204 if there is none.

POST /api/v1/datasets/{id}/ownership-proposals/{proposalId}/approve

Approve a pending proposal (current owner or admin).

POST /api/v1/datasets/{id}/ownership-proposals/{proposalId}/reject

Reject a pending proposal.
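The proposal-and-approval transfer described by these endpoints can be walked through with three small helpers. This is an illustrative sketch only: the function names and the INVENTORY_URL and API_KEY variables are hypothetical, and the dev API key is suitable for local stacks only.

```shell
INVENTORY_URL="${INVENTORY_URL:-http://localhost:8001}"   # assumed local base URL
API_KEY="${API_KEY:-dev-local}"                           # dev key, local use only

# JSON body for a transfer proposal.
proposal_body() {
  printf '{"proposedOwnerId":"%s"}\n' "$1"
}

# Step 1: propose a new owner. Usage: propose_owner DATASET_ID USER_ID
propose_owner() {
  curl -s -X POST -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
    -d "$(proposal_body "$2")" \
    "$INVENTORY_URL/api/v1/datasets/$1/ownership-proposals"
}

# Step 2: print only the HTTP status of the pending-proposal lookup
# (204 means nothing is pending). Usage: pending_status DATASET_ID
pending_status() {
  curl -s -o /dev/null -w '%{http_code}' -H "X-API-Key: $API_KEY" \
    "$INVENTORY_URL/api/v1/datasets/$1/ownership-proposals/pending"
}

# Step 3: approve or reject. Usage: decide_proposal DATASET_ID PROPOSAL_ID approve|reject
decide_proposal() {
  case "$3" in
    approve | reject) ;;
    *) echo "decision must be approve or reject" >&2; return 1 ;;
  esac
  curl -s -X POST -H "X-API-Key: $API_KEY" \
    "$INVENTORY_URL/api/v1/datasets/$1/ownership-proposals/$2/$3"
}
```

Validating the decision before calling curl keeps a typo from turning into a request against an unintended path.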

Terms of use

GET /api/v1/datasets/{id}/terms-of-use

Get the derived or explicitly accepted ODRL terms of use for the dataset.

POST /api/v1/datasets/{id}/terms-of-use/accept

Data owner accepts and locks the current derived terms as the explicit policy.

DELETE /api/v1/datasets/{id}/terms-of-use/policy

Reset: removes the explicit policy so the dataset reverts to dynamically derived terms.

Distributions

GET /api/v1/distributions

List all distributions for the tenant. Params: page, size.

GET /api/v1/datasets/{datasetId}/distributions

List distributions for a specific dataset.

POST /api/v1/datasets/{datasetId}/distributions

Create a distribution for a dataset.

GET /api/v1/distributions/{id}

Get distribution by ID.

PUT /api/v1/distributions/{id}

Update a distribution.

DELETE /api/v1/distributions/{id}

Delete a distribution.

Physical Schema (CSV-W)

GET /api/v1/datasets/{datasetId}/physical-schema

Get physical schema columns for a dataset.

POST /api/v1/datasets/{datasetId}/physical-schema

Set or replace the physical schema for a dataset.

GET /api/v1/distributions/{distributionId}/physical-schema

Get physical schema for a distribution.

POST /api/v1/distributions/{distributionId}/physical-schema

Set or replace the physical schema for a distribution.

GET /api/v1/distributions/{distributionId}/suggest-element-mappings

AI-suggested bindings from physical columns to logical elements. Param: modelId.

Data Products

GET /api/v1/data-products

List data products. Params: domain, lifecycleStatus, page, size.

POST /api/v1/data-products

Create a data product.

GET /api/v1/data-products/{id}

Get data product by ID.

PUT /api/v1/data-products/{id}

Update a data product.

PATCH /api/v1/data-products/{id}/lifecycle

Transition lifecycle status. Body: {"status":"Deploy"}.

DELETE /api/v1/data-products/{id}

Delete a data product.

GET /api/v1/data-products/{id}/datasets

List datasets linked to this data product.

POST /api/v1/data-products/{id}/datasets

Link a dataset. Body: {"datasetId":"..."}.

DELETE /api/v1/data-products/{id}/datasets/{datasetId}

Unlink a dataset from this data product.

Logical Models

GET /api/v1/datasets/{datasetId}/logical-models

List logical models for a dataset.

POST /api/v1/datasets/{datasetId}/logical-models

Create a new logical model (initially in draft status).

GET /api/v1/logical-models/{id}

Get a logical model with its elements.

PATCH /api/v1/logical-models/{id}/status

Transition status. Body: {"status":"published"}. Published models are immutable.

DELETE /api/v1/logical-models/{id}

Delete a logical model and all its elements.

Logical Elements

GET /api/v1/logical-models/{modelId}/elements

List elements in a logical model.

POST /api/v1/logical-models/{modelId}/elements

Add a logical data element to a model.

PUT /api/v1/logical-data-elements/{elementId}

Update a logical data element.

DELETE /api/v1/logical-data-elements/{elementId}

Delete a logical data element.

POST /api/v1/logical-data-elements/{elementId}/bind

Bind element to a physical column. Body: {"physicalColumnId":"..."}.

DELETE /api/v1/logical-data-elements/{elementId}/bind

Unbind the physical column from an element.

GET /api/v1/logical-data-elements/{elementId}/vocab-mappings

List SKOS vocabulary mappings for an element.

POST /api/v1/logical-data-elements/{elementId}/vocab-mappings

Add a vocabulary mapping. Body: {"vocabularyId":"...","conceptIri":"...","matchType":"exactMatch"}.

DELETE /api/v1/vocab-mappings/{mappingId}

Delete a vocabulary mapping.
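A request body for the vocabulary mapping endpoint might be assembled as follows. This is a sketch with hypothetical names: vocab_mapping_body is not part of ODIN, and the check accepts only the three SKOS match types listed in the introduction (exactMatch, closeMatch, relatedMatch); the API itself may accept others.

```shell
# Print the JSON body for POST .../logical-data-elements/{elementId}/vocab-mappings.
# Usage: vocab_mapping_body VOCABULARY_ID CONCEPT_IRI MATCH_TYPE
vocab_mapping_body() {
  case "$3" in
    exactMatch | closeMatch | relatedMatch) ;;
    *) echo "unsupported matchType: $3" >&2; return 1 ;;
  esac
  printf '{"vocabularyId":"%s","conceptIri":"%s","matchType":"%s"}\n' "$1" "$2" "$3"
}

# Send the mapping for one element (dev API key, local stack only).
# Usage: add_vocab_mapping ELEMENT_ID VOCABULARY_ID CONCEPT_IRI MATCH_TYPE
add_vocab_mapping() {
  body=$(vocab_mapping_body "$2" "$3" "$4") || return 1
  curl -s -X POST -H "X-API-Key: ${API_KEY:-dev-local}" \
    -H "Content-Type: application/json" -d "$body" \
    "${INVENTORY_URL:-http://localhost:8001}/api/v1/logical-data-elements/$1/vocab-mappings"
}
```

The IRI is inserted verbatim, which is safe for well-formed IRIs because they cannot contain double quotes or backslashes.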

AI enrichment — single element

POST /api/v1/logical-data-elements/{elementId}/recommend-classification

Request an AI classification recommendation (PUBLIC…HIGH_CONFIDENTIAL).

POST /api/v1/logical-data-elements/{elementId}/accept-classification

Accept the pending AI classification.

POST /api/v1/logical-data-elements/{elementId}/reject-classification

Reject the pending AI classification.

POST /api/v1/logical-data-elements/{elementId}/recommend-description

Request an AI description recommendation.

POST /api/v1/logical-data-elements/{elementId}/accept-description

Accept the pending AI description.

POST /api/v1/logical-data-elements/{elementId}/reject-description

Reject the pending AI description.

POST /api/v1/logical-data-elements/{elementId}/recommend-vocab-concepts

Request AI vocabulary concept recommendations (up to 5 SKOS concepts from available vocabularies).

POST /api/v1/logical-data-elements/{elementId}/accept-vocab-concepts

Accept vocabulary concept recommendations. Body: {"iris":["iri1"]} for a subset, or {} to accept all.

POST /api/v1/logical-data-elements/{elementId}/reject-vocab-concepts

Reject all pending vocabulary concept recommendations.

POST /api/v1/logical-data-elements/{elementId}/recommend-pii

Request AI PII indicator recommendations (isPersonalInformation, isDirectIdentifier).

POST /api/v1/logical-data-elements/{elementId}/accept-pii

Accept the pending PII indicator recommendation.

POST /api/v1/logical-data-elements/{elementId}/reject-pii

Reject the pending PII indicator recommendation.

AI enrichment — bulk (model-wide async)

POST /api/v1/logical-models/{modelId}/recommend-classifications

Start async bulk classification job for all elements. Returns a job ID.

GET /api/v1/logical-models/recommend-classifications/jobs/{jobId}

Poll job status: PENDING, RUNNING, COMPLETED, FAILED.

POST /api/v1/logical-models/{modelId}/recommend-descriptions

Start async bulk description recommendation job.

GET /api/v1/logical-models/recommend-descriptions/jobs/{jobId}

Poll bulk description job status.

POST /api/v1/logical-models/{modelId}/recommend-vocab-concepts

Start async bulk vocabulary concept recommendation job.

GET /api/v1/logical-models/recommend-vocab-concepts/jobs/{jobId}

Poll bulk vocab concept job status.

POST /api/v1/logical-models/{modelId}/recommend-pii

Start async bulk PII indicator recommendation job.

GET /api/v1/logical-models/recommend-pii/jobs/{jobId}

Poll bulk PII job status.
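Since the bulk jobs are asynchronous, a client has to poll until the job reaches a terminal state. The loop below is one possible sketch for the classification job. It rests on an assumption the reference does not confirm: that the job resource is JSON with a top-level "status" field. All helper and variable names are hypothetical.

```shell
INVENTORY_URL="${INVENTORY_URL:-http://localhost:8001}"   # assumed local base URL
API_KEY="${API_KEY:-dev-local}"                           # dev key, local use only

# COMPLETED and FAILED end the polling loop; PENDING and RUNNING do not.
is_terminal_status() {
  case "$1" in
    COMPLETED | FAILED) return 0 ;;
    *) return 1 ;;
  esac
}

# Read a JSON document on stdin and print its "status" value.
# ASSUMPTION: the job response carries a "status" string field.
job_status() {
  sed -n 's/.*"status" *: *"\([A-Z_]*\)".*/\1/p'
}

# Poll until the job finishes; succeeds only when it ends as COMPLETED.
# Usage: poll_classification_job JOB_ID
poll_classification_job() (
  attempt=0
  while [ "$attempt" -lt "${MAX_ATTEMPTS:-60}" ]; do
    status=$(curl -s -H "X-API-Key: $API_KEY" \
      "$INVENTORY_URL/api/v1/logical-models/recommend-classifications/jobs/$1" | job_status)
    echo "job $1: ${status:-unknown}" >&2
    if is_terminal_status "$status"; then
      [ "$status" = COMPLETED ] && exit 0
      exit 1
    fi
    attempt=$((attempt + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "timed out waiting for job $1" >&2
  exit 1
)
```

The attempt cap keeps the loop from spinning forever when the service is unreachable. The same pattern applies to the description, vocabulary concept, and PII job endpoints by swapping the path segment.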

Vocabularies

GET /api/v1/vocabularies

List all registered vocabularies (system-loaded and custom).

POST /api/v1/vocabularies

Register a custom vocabulary.

GET /api/v1/vocabularies/{id}

Get a vocabulary by ID.

PUT /api/v1/vocabularies/{id}

Update a custom vocabulary.

DELETE /api/v1/vocabularies/{id}

Delete a custom vocabulary.

GET /api/v1/vocabularies/{id}/concepts/search

Search concepts. Params: q, limit (default 20).

GET /api/v1/vocabularies/translate

Translate one IRI to a human-readable label. Param: iri=<concept-iri>.

POST /api/v1/vocabularies/translate

Batch-translate up to 200 IRIs. Body: JSON array of IRIs. Returns {"translations":{"<iri>":"<label>",...}}.
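The batch body is a plain JSON array, so it can be built from a list of arguments while enforcing the 200-IRI limit on the client side. A sketch with hypothetical helper names and example.org placeholder IRIs:

```shell
# Print a JSON array of the given IRIs; refuse more than 200 per request.
# Usage: translate_body IRI...
translate_body() (
  if [ "$#" -gt 200 ]; then
    echo "at most 200 IRIs per request, got $#" >&2
    exit 1
  fi
  list=""
  for iri in "$@"; do
    # Append each IRI as a quoted JSON string, comma-separated.
    list="${list:+$list,}\"$iri\""
  done
  printf '[%s]\n' "$list"
)

# Post the batch to the translate endpoint (dev API key, local stack only).
# Usage: translate_iris IRI...
translate_iris() {
  body=$(translate_body "$@") || return 1
  curl -s -X POST -H "X-API-Key: ${API_KEY:-dev-local}" \
    -H "Content-Type: application/json" -d "$body" \
    "${INVENTORY_URL:-http://localhost:8001}/api/v1/vocabularies/translate"
}
```

Larger sets would need to be split into several requests of at most 200 IRIs each.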

Vocabulary Profiles

GET /api/v1/datasets/{datasetId}/vocabulary-profiles

List vocabulary profiles assigned to a dataset.

POST /api/v1/datasets/{datasetId}/vocabulary-profiles

Assign a vocabulary. Body: {"vocabularyId":"...","isPrimary":true,"domainTags":[]}.

DELETE /api/v1/datasets/{datasetId}/vocabulary-profiles/{vocabId}

Remove a vocabulary profile from a dataset.

Dashboard

GET /api/v1/dashboard/summary

Counts of datasets, data products, and pending governance tasks for the current user.

GET /api/v1/dashboard/activity

Recent dataset activity (creates, updates, AI accepts) for the current user.

Terms Policies (ODRL)

GET /api/v1/terms-policies

List all policy sets for the tenant.

POST /api/v1/terms-policies

Create a DRAFT policy set. Body: {"name":"...","description":"..."}.

GET /api/v1/terms-policies/{id}

Get a policy set with all rules and obligations.

PUT /api/v1/terms-policies/{id}

Update a policy set's name and description.

DELETE /api/v1/terms-policies/{id}

Delete a DRAFT policy set.

POST /api/v1/terms-policies/{id}/activate

Activate a policy set so it is used for terms derivation.

POST /api/v1/terms-policies/{id}/clone

Clone a policy set to a new DRAFT. Body: {"name":"..."}.

PUT /api/v1/terms-policies/{id}/classification-rules/{classification}

Upsert terms rule for a classification level (PUBLIC | INTERNAL | CONFIDENTIAL | HIGH_CONFIDENTIAL).

DELETE /api/v1/terms-policies/{id}/classification-rules/{classification}

Delete terms rule for a classification level.

GET /api/v1/terms-policies/{id}/regulation-rules

List regulation detection rules for a policy set.

POST /api/v1/terms-policies/{id}/regulation-rules

Add a regulation detection rule (e.g. GDPR, MiFID II keyword signal).

PUT /api/v1/terms-policies/{id}/regulation-rules/{ruleId}

Update a regulation detection rule.

DELETE /api/v1/terms-policies/{id}/regulation-rules/{ruleId}

Delete a regulation detection rule.

GET /api/v1/terms-policies/{id}/regulation-obligations

List regulation obligations (e.g. audit obligation triggered by Basel III signal).

POST /api/v1/terms-policies/{id}/regulation-obligations

Add a regulation obligation.

DELETE /api/v1/terms-policies/{id}/regulation-obligations/{oblId}

Delete a regulation obligation.


Harvest API :8002

Sources

GET /api/v1/sources

List all harvest sources. Params: type, page, size.

POST /api/v1/sources

Register a new data source.

GET /api/v1/sources/{id}

Get a source by ID.

PUT /api/v1/sources/{id}

Update a source's configuration.

DELETE /api/v1/sources/{id}

Delete a source.

POST /api/v1/sources/{id}/test

Test connectivity to the source. Returns {"success":bool,"message":"..."}.

Jobs

GET /api/v1/jobs

List harvest jobs. Params: sourceId, page, size.

POST /api/v1/jobs

Create a scheduled or on-demand harvest job.

GET /api/v1/jobs/{id}

Get a job and its latest run summary.

PUT /api/v1/jobs/{id}

Update a job's schedule or configuration.

DELETE /api/v1/jobs/{id}

Delete a harvest job.

POST /api/v1/jobs/{id}/trigger

Trigger an immediate harvest run. Returns the run ID.


Lineage API :8003

OpenLineage ingest

POST /api/v1/lineage

Ingest an OpenLineage RunEvent. Accepted states: START, RUNNING, COMPLETE, FAIL, ABORT.
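A minimal RunEvent for this endpoint might look like the output of the helper below. Treat it as a sketch: the field layout follows the public OpenLineage RunEvent schema, but the schemaURL version, the example.org producer URI, and the helper names are assumptions, and the lineage service may expect additional facets.

```shell
# Print a minimal OpenLineage RunEvent with empty inputs and outputs.
# Usage: run_event EVENT_TYPE RUN_UUID JOB_NAMESPACE JOB_NAME
run_event() {
  case "$1" in
    START | RUNNING | COMPLETE | FAIL | ABORT) ;;
    *) echo "unsupported eventType: $1" >&2; return 1 ;;
  esac
  # eventTime is the current UTC time in ISO 8601 form.
  printf '{"eventType":"%s","eventTime":"%s","run":{"runId":"%s"},"job":{"namespace":"%s","name":"%s"},"inputs":[],"outputs":[],"producer":"%s","schemaURL":"%s"}\n' \
    "$1" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$2" "$3" "$4" \
    "https://example.org/odin-docs-sketch" \
    'https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent'
}

# Send the event to the lineage service (dev API key, local stack only).
# Usage: send_run_event EVENT_TYPE RUN_UUID JOB_NAMESPACE JOB_NAME
send_run_event() {
  body=$(run_event "$@") || return 1
  curl -s -X POST -H "X-API-Key: ${API_KEY:-dev-local}" \
    -H "Content-Type: application/json" -d "$body" \
    "${LINEAGE_URL:-http://localhost:8003}/api/v1/lineage"
}
```

A START event followed by a COMPLETE event with the same run UUID describes one finished run; real events would also list the input and output datasets.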

Dataset lineage

GET /api/v1/datasets/lookup

Resolve a dataset by OpenLineage namespace+name to its inventory UUID. Params: namespace, name.

GET /api/v1/datasets/{id}/lineage

Graph traversal by inventory UUID. Params: direction=upstream|downstream, depth=1..10.

GET /api/v1/datasets/{id}/impact

Downstream impact analysis: all datasets that depend on this one.

GET /api/v1/catalog-datasets/{catalogId}/lineage-identity

List lineage-known datasets within a catalog with their OpenLineage namespace/name identity.

PUT /api/v1/datasets/{namespace}/{name}/catalog-link

Associate an OpenLineage dataset identity with an inventory catalog. Body: {"catalogId":"..."}.
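For the traversal endpoint, the documented parameter ranges can be checked before the request is sent. The helper below is a hypothetical sketch; LINEAGE_URL is an assumed variable that defaults to the local service port.

```shell
# Print the lineage traversal URL after validating direction and depth.
# Usage: lineage_url DATASET_UUID upstream|downstream DEPTH
lineage_url() {
  case "$2" in
    upstream | downstream) ;;
    *) echo "direction must be upstream or downstream" >&2; return 1 ;;
  esac
  case "$3" in
    [1-9] | 10) ;;
    *) echo "depth must be between 1 and 10" >&2; return 1 ;;
  esac
  printf '%s/api/v1/datasets/%s/lineage?direction=%s&depth=%s\n' \
    "${LINEAGE_URL:-http://localhost:8003}" "$1" "$2" "$3"
}

# Fetch the lineage graph (dev API key, local stack only).
# Usage: get_lineage DATASET_UUID upstream|downstream DEPTH
get_lineage() {
  url=$(lineage_url "$@") || return 1
  curl -s -H "X-API-Key: ${API_KEY:-dev-local}" "$url"
}
```

The dataset UUID can be obtained from the lookup endpoint when only the OpenLineage namespace and name are known.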

DDL parser

POST /api/v1/ddl/submit

Parse DDL with Apache Calcite and emit column-level lineage events. Body: {"dialect":"SNOWFLAKE","ddl":"CREATE VIEW ..."}.


Search API :8004

GET /api/v1/search

Full-text search. Params: q, type, domain, lifecycleStatus, format, hasLineage, fibo_concept, page, size.

GET /api/v1/search/suggest

Autocomplete. Param: q. Returns up to 10 suggestions.

POST /api/v1/admin/reindex

Trigger a full reindex from inventory-service. Admin only. Returns {"datasetsIndexed":N,"dataProductsIndexed":N,"distributionsIndexed":N}.


AI API :8005

Conversations

GET /api/v1/conversations

List conversations for the current user. Each item: {"id","title","tenantId","createdAt"}.

POST /api/v1/conversations

Create a new conversation. Body: {"title":"..."}. Returns {"id","title","tenantId","createdAt"}.

GET /api/v1/conversations/{id}

Get a conversation by ID. Returns {"id","title","tenantId","createdAt"}.

POST /api/v1/conversations/{id}/messages

Send a message. Set Accept: text/event-stream for SSE streaming. Response tokens arrive as data: <token> events.

Request body fields:

content (required) — the user's message text
focusDatasetId (optional) — single dataset UUID; pre-loads schema as context
focusDatasetIds (optional) — array of dataset UUIDs for multi-table queries; the AI loads physical schema for all listed datasets, derives join hints from shared vocabulary concept IRIs, and enforces platform isolation (Snowflake, Delta Lake, etc.) — no mixed-platform SQL is generated
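On the command line, the streamed reply can be consumed by keeping only the data: lines of the event stream. A sketch with hypothetical helper names; it sends only the required content field and assumes the message text contains no characters that need JSON escaping.

```shell
# Keep only the payload of "data:" lines from an SSE stream on stdin.
# One optional space after the colon is dropped, as in the SSE format.
sse_data() {
  sed -n 's/^data: \{0,1\}//p'
}

# Send a message and print the streamed tokens, one per line.
# Usage: send_message CONVERSATION_ID "message text"
send_message() {
  # -N disables curl's output buffering so tokens appear as they arrive.
  curl -s -N -X POST \
    -H "X-API-Key: ${API_KEY:-dev-local}" \
    -H "Content-Type: application/json" \
    -H "Accept: text/event-stream" \
    -d "$(printf '{"content":"%s"}' "$2")" \
    "${AI_URL:-http://localhost:8005}/api/v1/conversations/$1/messages" | sse_data
}
```

Passing the output through tr -d '\n' joins the tokens back into continuous text when one-token-per-line output is not wanted.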

AI recommendations (element-level)

POST /api/v1/classify/elements

Classify data elements by sensitivity (PUBLIC…HIGH_CONFIDENTIAL). Body: {"elements":[{"elementId":"...","name":"...","logicalType":"...","vocabConceptLabels":[]}]}.

POST /api/v1/describe/elements

Generate natural-language descriptions for data elements. Body: {"elements":[{"elementId":"...","name":"...","logicalType":"..."}]}.

POST /api/v1/recommend-vocab-concepts

Suggest SKOS vocabulary concepts for elements. Body: {"elements":[...],"vocabularyIds":["..."]}.

POST /api/v1/recommend-pii

Detect PII indicators (isPersonalInformation, isDirectIdentifier) for elements. Body: {"elements":[...]}.

POST /api/v1/recommend-semantic-context

Recommend semantic types for a dataset. Body includes title, keywords, elementNames, currentVocabLabels.

Agentic review

POST /api/v1/agentic-review

Run the proposer/reviewer agent loop over a logical model and stream progress as Server-Sent Events (Accept: text/event-stream). Body: {"datasetId":"...","modelId":"..."}. Each data: line is a JSON AgenticEvent (phases CONTEXT/MEMORY/PROPOSING/REVIEWING/PROPOSAL/REVIEW/LOCKED/DONE/MAX_REACHED/ERROR); the reviewer verdict is APPROVE or REJECT. Loop capped at 10 iterations; the converged result is persisted to the model's elements for the data owner to accept or reject.


Policy API :8007

GET /api/v1/policies/{datasetId}

Retrieve the stored ODRL policy record for a dataset (scoped to the caller's tenant).

PUT /api/v1/policies/{datasetId}

Upsert a policy. Body: {"policyJson": "...", "policyLevel": "A"}. policyLevel is A or B1.

DELETE /api/v1/policies/{datasetId}

Remove the policy record for a dataset.

POST /api/v1/policies/{datasetId}/evaluate

Evaluate the policy using ODRE Algorithm 1. Body: {"M": {}, "F": {}}. M is a string→any variable map injected into B1-Level [=varName] placeholders. Returns {"granted": bool, "policyLevel": "A"|"B1", "decisions": [{"action":"read","result":"true","delegated":false}]}.

GET /api/v1/policies/{datasetId}/components

Component breakdown of the assembled policy, keyed by piece type (CLASSIFICATION, REGULATION, CONTRACTUAL) and dimension value, alongside the assembled ODRL document. The effective policy is composed from reusable policy_pieces linked to the dataset via dataset_policy_links.

GET /api/v1/policies/{datasetId}/evaluations

Paginated evaluation log. Params: page, size.

B1-Level example

bash

# Set this to the UUID of a real dataset before running the calls below.
# (A literal {datasetId} in the URL would be expanded by curl's URL globbing.)
DATASET_ID="REPLACE-WITH-DATASET-UUID"

# Store a B1-Level policy (role check via variable injection)
curl -X PUT "http://localhost:8007/api/v1/policies/$DATASET_ID" \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{
    "policyLevel": "B1",
    "policyJson": "{\"@context\":\"http://www.w3.org/ns/odrl.jsonld\",\"@type\":\"Set\",\"uid\":\"urn:b1\",\"permission\":[{\"target\":\"dataset:x\",\"action\":\"read\",\"constraint\":[{\"leftOperand\":{\"@value\":\"[=callerRole]\",\"@type\":\"xsd:string\"},\"operator\":\"eq\",\"rightOperand\":{\"@value\":\"DATA_OWNER\",\"@type\":\"xsd:string\"}}]}]}"
  }'

# Evaluate — DATA_STEWARD → denied
curl -X POST "http://localhost:8007/api/v1/policies/$DATASET_ID/evaluate" \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"M":{"callerRole":"DATA_STEWARD"},"F":{}}'
# → {"granted":false,"decisions":[]}

# Evaluate — DATA_OWNER → granted
curl -X POST "http://localhost:8007/api/v1/policies/$DATASET_ID/evaluate" \
  -H "X-API-Key: dev-local" -H "Content-Type: application/json" \
  -d '{"M":{"callerRole":"DATA_OWNER"},"F":{}}'
# → {"granted":true,"decisions":[{"action":"read","result":"true","delegated":false}]}


Identity API :8006

Users

GET /api/v1/users

List all users in the tenant. Params: page, size.

GET /api/v1/users/{id}

Get a user by internal UUID.

GET /api/v1/users/by-keycloak/{keycloakId}

Resolve a Keycloak subject ID to the internal user record.

POST /api/v1/users/invite

Invite a new user by email. Creates a Keycloak account and a pending identity record. Body: {"email":"...","role":"DATA_STEWARD"}.

PUT /api/v1/users/{id}

Update a user's display name or role.

POST /api/v1/users/{id}/activate

Activate a pending user account.

DELETE /api/v1/users/{id}

Deactivate and remove a user from the tenant.
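The invite and activate calls could be scripted as shown below. This is a sketch only: the helper names and the IDENTITY_URL variable are hypothetical, DATA_STEWARD is the one role value the reference shows, and the shape of the invite response is not documented here, so the user ID for activation has to be looked up separately (for example via GET /api/v1/users).

```shell
# JSON body for the invite endpoint.
# Usage: invite_body EMAIL ROLE
invite_body() {
  printf '{"email":"%s","role":"%s"}\n' "$1" "$2"
}

# Invite a user by email (dev API key, local stack only).
# Usage: invite_user EMAIL ROLE
invite_user() {
  curl -s -X POST -H "X-API-Key: ${API_KEY:-dev-local}" \
    -H "Content-Type: application/json" \
    -d "$(invite_body "$1" "$2")" \
    "${IDENTITY_URL:-http://localhost:8006}/api/v1/users/invite"
}

# Activate a pending account by its internal user UUID.
# Usage: activate_user USER_ID
activate_user() {
  curl -s -X POST -H "X-API-Key: ${API_KEY:-dev-local}" \
    "${IDENTITY_URL:-http://localhost:8006}/api/v1/users/$1/activate"
}
```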

Bookmarks

GET /api/v1/bookmarks

List all bookmarks for the current user.

POST /api/v1/bookmarks

Create a bookmark. Body: {"datasetId":"...","collectionId":"...","note":"..."}.

DELETE /api/v1/bookmarks/{id}

Delete a bookmark.

GET /api/v1/bookmarks/dataset/{datasetId}

Check whether the current user has bookmarked a specific dataset. Returns the bookmark record or 204.

Bookmark Collections

GET /api/v1/bookmark-collections

List bookmark collections for the current user.

POST /api/v1/bookmark-collections

Create a collection. Body: {"name":"...","description":"..."}.

PUT /api/v1/bookmark-collections/{id}

Rename or update a collection's description.

DELETE /api/v1/bookmark-collections/{id}

Delete a collection (bookmarks are unlinked, not deleted).


Docker Compose

The repository ships a docker-compose.yml that starts the full stack and a docker-compose.override.yml for development hot-reload overrides.

Makefile targets

Target	Description
`make up`	Start all services in detached mode
`make down`	Stop and remove containers (preserves volumes)
`make destroy`	Stop containers and delete all volumes
`make migrate`	Run Flyway migrations manually
`make seed`	Load financial services sample data
`make reindex`	Full OpenSearch reindex
`make build`	Build all Docker images from source
`make test`	Run all service tests in Docker
`make logs svc=inventory-service`	Tail logs for a specific service

Profiles

Profile	Additional services
(default)	All 7 services + Kafka + PostgreSQL × 6 + OpenSearch + MinIO + Keycloak
`ai`	Adds ai-service and Ollama

bash

# Start including AI features
docker compose --profile ai up -d


Environment Variables

Full reference

Variable	Service	Default	Description
`POSTGRES_PASSWORD`	All	`odin`	PostgreSQL password (shared)
`KEYCLOAK_ADMIN`	identity	`admin`	Keycloak admin username
`KEYCLOAK_ADMIN_PASSWORD`	identity	`admin`	Keycloak admin password
`JWT_SECRET`	All	—	HS256 signing secret for dev API keys
`MINIO_ROOT_USER`	harvest	`minio`	MinIO access key
`MINIO_ROOT_PASSWORD`	harvest	`minio123`	MinIO secret key
`OPENSEARCH_PASSWORD`	search	`admin`	OpenSearch admin password
`OLLAMA_BASE_URL`	ai	`http://ollama:11434`	Ollama base URL
`OPENAI_API_KEY`	ai	(empty)	OpenAI key; takes precedence over Ollama if set
`AI_CHAT_MODEL`	ai	`llama3`	Chat model name
`AI_EMBED_MODEL`	ai	`nomic-embed-text`	Embedding model (must be 768-dim)
`SNOWFLAKE_ACCOUNT`	harvest	—	Snowflake account identifier
`AWS_ACCESS_KEY_ID`	harvest	—	AWS credentials for Glue connector
`AWS_SECRET_ACCESS_KEY`	harvest	—	AWS credentials for Glue connector
`AWS_REGION`	harvest	`us-east-1`	AWS region for Glue
`JAVA_TOOL_OPTIONS`	All Java services	(empty)	JVM options injected at startup. Set to `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005` to enable remote debugging on all services. See Local Development for port mappings.


Kubernetes (Raw Manifests)

Raw Kubernetes manifests live under infra/kubernetes/. Fourteen numbered YAML files cover everything from namespace creation to ingress rules. deploy.sh applies them in dependency order via kubectl and envsubst.

Prerequisites

Tool	Purpose
`kubectl`	Cluster access (configured kubeconfig)
`envsubst`	Image registry injection — `apt install gettext-base`
`curl`, `jq`	Health checks and seed script
NGINX Ingress	`microk8s enable ingress` or install via upstream manifest

1. Build and push images

bash

export IMAGE_REGISTRY=localhost:32000/   # MicroK8s built-in registry

for svc in inventory-service harvest-service lineage-service \
           search-service ai-service identity-service; do
  docker build -t "${IMAGE_REGISTRY}odin/${svc}:latest" "services/${svc}/"
  docker push "${IMAGE_REGISTRY}odin/${svc}:latest"
done

2. Deploy

bash

IMAGE_REGISTRY=localhost:32000/ ./infra/kubernetes/deploy.sh

The script applies all manifests in order, then waits for StatefulSets, Jobs, and Deployments to reach ready state before exiting.

Flag	Description
`--dry-run`	Render manifests via `envsubst` without applying — review before committing
`--delete`	Tear down all resources; PVCs are preserved to protect data

3. Update /etc/hosts

bash

NODE_IP=$(kubectl get nodes \
  -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
echo "$NODE_IP  catalog.local manage.catalog.local api.catalog.local" \
  | sudo tee -a /etc/hosts

4. Load sample data

bash

./infra/kubernetes/seed.sh --namespace odin-catalog

The seed script establishes kubectl port-forward tunnels to each service, waits for readiness, then delegates to infra/seed/seed.sh — the same Meridian Capital financial dataset scenario used with Docker Compose.

Flag	Default	Description
`--namespace`	`odin-catalog`	Kubernetes namespace
`--api-key`	`dev-local`	`X-API-Key` header value
`--context`	(current)	`kubectl` context to use
`--timeout`	`120`	Seconds to wait per service health check

Port collision: Stop any local Docker Compose stack before running seed.sh, which uses ports 8001–8004 for port-forwarding.

5. Access

App	URL
Consumer (discovery)	`http://catalog.local`
Producer (management)	`http://manage.catalog.local`
API health	`http://api.catalog.local/inventory/actuator/health`

Manifest inventory

File	Contents
`00-namespace.yaml`	Namespace with `pod-security.kubernetes.io/enforce: privileged`
`01-serviceaccount.yaml`	Default service account
`02-secrets.yaml`	PostgreSQL passwords, MinIO credentials, Keycloak admin password
`03-configmaps.yaml`	Common Spring config, Kafka topic script, OpenSearch index mapping, Keycloak realm
`10-postgres.yaml`	Five StatefulSets: inventory, harvest, lineage (Apache AGE), identity, ai (pgvector)
`11-kafka.yaml`	KRaft-mode Kafka StatefulSet (no ZooKeeper)
`12-opensearch.yaml`	OpenSearch with privileged sysctl init container (`vm.max_map_count=262144`)
`13-minio.yaml`	MinIO Deployment for harvest snapshots
`14-redis.yaml`	Redis for Quartz scheduler clustering
`15-keycloak.yaml`	Keycloak 24 with realm auto-import
`20-backend-services.yaml`	Six Spring Boot Deployments + ClusterIP Services
`21-frontends.yaml`	Producer and Consumer frontend Deployments + Services
`22-ingress.yaml`	NGINX Ingress for all three virtual hosts
`30-jobs.yaml`	One-shot Jobs: Kafka topic creation and OpenSearch index initialisation


Kubernetes (Helm / MicroK8s)

A Helm umbrella chart is provided under infra/helm/ for clusters managed with Helm 3. A MicroK8s-specific deploy.sh under infra/microk8s/ wraps the Helm install with sensible single-node defaults.

MicroK8s quick deploy

bash

# Full resources
./infra/microk8s/deploy.sh

# Reduced resources for machines with < 16 GB RAM
./infra/microk8s/deploy.sh --reduced-resources

Standard Helm install / upgrade

bash

# Install
helm install odin infra/helm \
  --namespace odin-catalog --create-namespace \
  -f infra/microk8s/values.yaml

# Upgrade after image rebuild
helm upgrade odin infra/helm \
  --namespace odin-catalog \
  -f infra/microk8s/values.yaml

Note: See docs/microk8s-deployment.md for the full MicroK8s setup guide, which covers snap installation, addon enablement (dns, storage, ingress, registry), and image registry configuration.


Local Development

Build all services

bash

./gradlew build               # compile + test all services
./gradlew :services:inventory-service:bootRun  # run one service

Build the frontends

bash

cd frontend/shared && pnpm install && pnpm build
cd ../producer  && pnpm install && pnpm dev   # http://localhost:3000
cd ../consumer  && pnpm install && pnpm dev   # http://localhost:3001

Rebuild a single Docker image

bash

# Always build then up — `restart` reuses the old image
docker compose build inventory-service
docker compose up -d inventory-service

Remote debugging

All Java services support JDWP remote debugging via the standard JAVA_TOOL_OPTIONS env var — no rebuild required. Enable it in .env:

.env

JAVA_TOOL_OPTIONS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005

Then restart the services you want to debug (docker compose up -d). Each service listens on port 5005 inside its container, exposed on a unique host port:

Service	HTTP port	Debug host port
`inventory-service`	8001	5001
`harvest-service`	8002	5002
`lineage-service`	8003	5003
`search-service`	8004	5004
`ai-service`	8005	5005
`identity-service`	8006	5006

In IntelliJ IDEA: Run → Edit Configurations → + → Remote JVM Debug, set Host to localhost and Port to the service's debug host port above. In VS Code, add a launch.json entry with type java, request: attach, and the corresponding port.

Tip: The JVM prints Listening for transport dt_socket at address: 5005 on startup when debugging is active. Leave JAVA_TOOL_OPTIONS= (empty) in .env to disable it with zero overhead.


Contribution Guide

Before you start

Open an issue describing the bug or feature before submitting a PR.
One logical change per PR — keep diffs reviewable.
All new API endpoints need an integration test that runs against a real database (no mocks).

Code conventions

Java: hexagonal architecture — domain classes have no Spring annotations; all infrastructure concerns live in the infrastructure/ package.
TypeScript: no any; shared API types live in frontend/shared/src/types/.
SQL: all schema changes via numbered Flyway migrations (V{n}__description.sql).
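The migration naming convention is easy to check mechanically, for instance in a pre-commit hook. The function below is a hypothetical sketch that accepts only an all-digit version, matching the V{n}__description.sql form stated above. Flyway itself also allows dotted versions, which this check deliberately rejects.

```shell
# Succeed when a file name follows the V{n}__description.sql convention.
# Usage: is_migration_name FILE_NAME
is_migration_name() (
  # Overall shape: "V", a digit, a double underscore, a description, ".sql".
  case "$1" in
    V[0-9]*__?*.sql) ;;
    *) exit 1 ;;
  esac
  # The version (between "V" and the first "__") must consist of digits only.
  version=${1#V}
  version=${version%%__*}
  case "$version" in
    "" | *[!0-9]*) exit 1 ;;
  esac
)
```

A hook could loop over the basenames of staged migration files and fail on the first one for which is_migration_name returns non-zero.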

License

ODIN Catalog is released under the Apache 2.0 License. By contributing you agree your changes will be licensed under the same terms.
