semanticdesk.top

Digital sovereignty for a fragmented life

Personal knowledge management has been a goal of mine since my university days, when the amount of digital information kept growing and the tools to handle and interconnect it often felt unsatisfying.

As a violin and music teacher, orchestral musician, software developer, and member of several community groups, I generate data on many devices and clouds across all of those roles.

For over a decade I have slowly been building the infrastructure that lets me actually keep track of it. semanticdesk.top is that work, made public: a personal semantic data fabric, open source and self-hosted, so the answer to “where is my data?” is mine to give.

Context

Fragmented tools, one life

  • Is this photo folder already backed up, or only on this SD card?
  • Which orchestra score is the canonical version, and who has access to it?
  • What was I working on last Tuesday at the music school?
  • Show me the most relevant resources for the construction of the summer kitchen in the community garden

These questions go unanswered today because each tool only sees the data it owns.

Architecture

A personal semantic fabric

Layer 0 — Data sources

  • Files

    NAS, PC, SD, servers

    Exists

    Filesystems across workstation, laptop, SD cards, NAS, and long-lived servers. The fast scan pass records path, content hash, and MIME; deep plugins add richer meaning later.

    Prior art

    • Strigi (Nepomuk era)
    • Baloo (post-semantic KDE)
  • PIM

    Mail, cal, tasks

    Planned

    Thunderbird mail, calendar, and tasks across many mailboxes and calendar sources, translated into a unified PIM view in the graph. Adapter work connects Gloda and those sources to the RDF layer.

    Prior art

    • Akonadi (KDE PIM unification, non-RDF)
    • NCO/NCAL contact and calendar terms
  • Browser

    Bookmarks, history, downloads

    Planned

    Browser plugin path for bookmarks, history, and downloads so web-origin artefacts can be correlated with file and PIM data in the same index.

    Prior art

    • Solid (WebID, user-controlled data)
    • W3C Web Annotation
  • Messengers

    Bridge → unified

    Planned

    Chat channels bridged (for example via Matrix) into a single normalized representation, so conversation artefacts participate in the same fabric as files and PIM.

    Prior art

    • XMPP/IRC bridging patterns
  • Context

    eBPF, GPS, time

    Experimental

    Raw file-access events, coarse location, and timestamps: the stream that makes AccessEvent and context reconstruction possible, without pretending every event belongs in the triple store verbatim.

    Prior art

    • eBPF observability
    • Plasma Activities

Layer 1 — Ingestion pipeline

  • Stage 1: fast scan

    Path, hash, MIME

    Exists

    First-stage indexing, optimised for speed: it records paths, content hashes, and MIME types so the system always knows what exists and where, before any expensive analyzer work runs.

    Prior art

    • Strigi two-stage split
    • This project’s 2014-era pipeline
  • Stage 2: deep plugins

    EXIF, NLP, git, …

    Exists

    Second stage: pluggable extractors for EXIF, perceptual fingerprinting, NLP, PDF content identity, git detection, and other depth-first signals, composed without hard-wiring a single stack.

    Prior art

    • Strigi’s plugin model
  • eBPF adapter

    Events → AccessEvent

    Experimental

    Monitors file activity and process calls in selected directories using eBPF, producing the raw stream from which AccessEvent triples are derived. The prototype exists; throttling, aggregation, and RDF emission are the current development focus.

    Prior art

    • Linux tracepoints
    • eBPF BCC/CO-RE community
  • PIM / cloud adapter

    Gloda, NC API → RDF

    Experimental

    Maps Thunderbird Gloda, contacts, and calendar, plus where applicable Nextcloud-style APIs, into the vocabulary layer so PIM and files share a joined-up query surface.

    Prior art

    • NMO/NCO/NCAL usage in desktop stacks
    • NEPOMUK-era mappings
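
The fast scan stage described above can be illustrated with a minimal Python sketch. It is not the project's indexer (nubixtract is C++); the names `ScanRecord`, `sha256_of`, and `fast_scan` are hypothetical, and the MIME type is guessed from the file extension only, which is a simplification.

```python
import hashlib
import mimetypes
import os
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ScanRecord:
    path: str    # absolute path of the file
    sha256: str  # hex digest of the file content
    mime: str    # MIME type guessed from the extension, or a generic fallback


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fast_scan(root: str) -> Iterator[ScanRecord]:
    """Yield one record per regular file below root; no deep analysis here."""
    for directory, _subdirs, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.abspath(os.path.join(directory, name))
            if not os.path.isfile(path):
                continue  # skip broken symlinks, sockets, and similar
            mime, _encoding = mimetypes.guess_type(path)
            yield ScanRecord(path, sha256_of(path), mime or "application/octet-stream")
```

Calling `list(fast_scan("/some/folder"))` would give the records a second stage could hand to deeper extractors; keeping this pass free of content analysis is what lets it stay cheap.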

Layer 2 — Semantic core

  • Triple store

    SPARQL 1.1/1.2 · federated

    Exists

    The semantic spine of the system: any SPARQL 1.1-compliant endpoint will do. In production, read-optimised and federated setups are a deployment choice; the file indexer outputs Turtle or SPARQL Update. The specific triple store product is not an architectural lock-in.

    Prior art

    • Nepomuk / Soprano + legacy triple stores (2007–2013)
    • GNOME Tracker / TinySPARQL (2008–present)
    • Solid / Comunica (federation on the web)
  • Full-text index

    Unified facade

    Exists

    Pluggable full-text over file bodies, email, and long-form metadata, driven from ontology flags so the index stays in sync with what matters semantically, without hand-maintained per-field config.

    Prior art

    • Xapian in Baloo
    • ES/Solr as commodity FTS
  • Vector index

    IRI-aligned

    Experimental

    Vector similarity keyed to the same entity IRIs as the graph so hybrid recall can combine structure from SPARQL with neighbourhood similarity, without a second ID scheme.

    Prior art

    • GraphRAG patterns
    • Vector stores as pluggable retrievers
  • GraphRAG

    SPARQL + vectors → model

    Experimental

    Hybrid retrieval: a bounded subgraph from SPARQL with vector similarity to ground answers in the graph you own, for assistants that are allowed to be wrong only in prose, not in identity.

    Prior art

    • Retrieval augmented generation (community patterns)
  • MCP (SPARQL tools)

    For LLM clients

    Planned

    Exposes a constrained SPARQL-shaped tool surface to MCP-compatible clients so local models can query your fabric with the same contract as your UI, not a bespoke API per app.

    Prior art

    • Model context protocol (MCP)
    • Comunica as prior art in Linked Data clients
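
The IRI-aligned hybrid retrieval idea above can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's implementation: `candidate_iris` stands in for the result of a bounded SPARQL query, `embeddings` for a vector index keyed by the same entity IRIs, and the function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(
    candidate_iris: Iterable[str],
    embeddings: Dict[str, Vector],
    query_vector: Vector,
    limit: int = 3,
) -> List[Tuple[str, float]]:
    """Rank only the IRIs the graph query returned, by vector similarity."""
    scored = [
        (iri, cosine(embeddings[iri], query_vector))
        for iri in candidate_iris
        if iri in embeddings  # entities without an embedding are skipped
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Because both sides share one identifier, the vector step can only reorder entities the graph already vouches for; it cannot introduce an entity the SPARQL query did not return.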

Layer 3 — Integration and presentation

  • KDE Plasma

    KRunner, input

    Planned

    Desktop integration: KRunner to query the index from the system search bar, and input-method-adjacent hooks where voice or special input can feed the same graph-backed actions.

    Prior art

    • KRunner runners
    • Plasma Activities (context)
  • PKM plugins

    Obsidian, Logseq, …

    Planned

    Note tools query the same semantic index so personal knowledge and indexed artefacts stay aligned instead of forking into yet another silo.

    Prior art

    • Local-first PKM
    • RDF/JSON-LD round trips in the wild
  • Graviola UI

    SemanticTable, forms

    Exists

    JSON Schema–driven forms and tables over SPARQL: the CRUD and exploration surface that makes the store legible, including fit testing between schema and data at deploy time, not only compile time.

    Prior art

    • CRUD on RDF in research prototypes
    • LinkML for schema-first data
  • Voice assistant

    Vox, MCP, local LLM

    Experimental

    Speech stack and MCP client so voice-driven queries can use the same tools as the desktop UI, against a local or self-hosted model boundary you control.

    Prior art

    • Whisper-family STT
    • MCP for tool use
  • Mobile / remote

    Android + API

    Experimental

    Android-side indexing where the platform only allows polling, plus a small REST API so other devices can query the fabric without mounting every disk on every machine.

    Prior art

    • Nextcloud client patterns
    • REST facades on SPARQL in enterprise search

Lineage

Semantic history

This problem has been attempted before. Here is what happened and what was learned — including the TU Dresden / NubiSave / NubiVis / nubixtract thread that still runs in production code.

  1. 1999

    Semantic Web vision W3C / Tim Berners-Lee

    RDF, OWL, SPARQL proposed as the foundation for machine-readable linked data

    Outcome: survived

    The foundation everything else is built on. Still the correct abstraction.

  2. 2004

    NEPOMUK project EU FP6 / multiple universities

    Networked Environment for Personalized, Ontology-based Management of Unified Knowledge — a full semantic desktop layer for enterprise and personal use

    Outcome: transformed

    Produced the NIE/NFO/NCO/NMM ontology suite still used by TinySPARQL today. The vision was right; the implementation weight was too high for volunteer maintenance.

  3. 2007

    Nepomuk-KDE KDE

    Full integration of Nepomuk semantic layer into KDE 4 — RDF-backed file metadata, semantic tags, relationship graphs, Virtuoso triple store

    Outcome: retired

    Retired in KDE SC 4.13 (2014). Resource consumption was too high for typical hardware of the era. Replaced by Baloo.

  4. 2007

    Strigi indexer KDE / Jos van den Oever

    The file crawler and metadata extractor for Nepomuk-KDE. Plugin architecture for extracting metadata from files of different types.

    Outcome: retired

    Directly ancestral to the two-stage plugin indexer architecture in this project. The right shape; replaced by Baloo's simpler approach.

  5. 2008

    GNOME Tracker GNOME

    SPARQL-backed file indexer for GNOME. Used the Nepomuk ontologies (NIE/NFO). D-Bus exposed SPARQL endpoint.

    Outcome: survived

    Still active as TinySPARQL (renamed 2024). The most direct living ancestor of the semantic desktop vision.

  6. 2012

    Akonadi KDE

    PIM data storage framework for KDE — unified storage for email, contacts, calendar, notes via pluggable resource agents

    Outcome: survived

    Still the KDE PIM backend. Not RDF internally, but architecturally the closest thing to a unified PIM datastore on Linux.

  7. 2013 · Spillner et al., FGCS 29 (2013) 1062–1072

    NubiSave — optimal cloud storage controller TU Dresden / Spillner, Müller, Schill

    RAID-like dispersion across cloud providers with a cloud storage ontology (WSML). First semantic modelling of storage provider properties in a personal storage context.

    Outcome: published

    Published in Future Generation Computer Systems. The cloud storage ontology is a direct conceptual ancestor of the Realm concept. Sebastian Tilsch worked in this lab.

  8. 2014 · Spillner, Tilsch, Schill, MobiQuitous 2014

    NubiVis — personal cloud file explorer TU Dresden / Spillner, Tilsch, Schill

    Web-based file manager integrating NubiSave (cloud distribution) and Strigi (semantic metadata) to answer 'Where is my data?' Map, timeline, tree, and distribution views.

    Outcome: published

    Published at MobiQuitous 2014. Co-authored by the project owner. The inadequacy of Strigi as a metadata backend motivated building nubixtract as a replacement.

  9. 2014

    Baloo KDE

    Replaced Nepomuk-KDE as KDE's file indexer. SQLite + Xapian full-text, deliberately NOT semantic — learned from Nepomuk's complexity.

    Outcome: survived

    The pragmatic retreat from semantics. Fast, reliable, but cannot answer 'which files did I work on during the orchestra rehearsal (Orchesterprobe) last month'.

  10. 2014

    nubixtract — first commit this project

    RDF-native replacement for Strigi. Plugin architecture, pluggable triple-store connector, SPARQL query interface, WGS84 geo, PDF via Grobid, image classification, Android cross-compilation. First commit 4 December 2014 (TU Dresden).

    Outcome: active

    It started out of curiosity and as an educational exercise in building a robust, highly extensible file indexer and metadata extractor, as an alternative to Strigi, which often crashed or blocked resources when running on my system.

  11. 2016

    Solid project MIT / Tim Berners-Lee

    Personal Online Datastore — user-controlled RDF pods, linked data, decentralised identity (WebID)

    Outcome: active

    The web-oriented answer to personal data sovereignty. Strong on federation and access control; weaker on local/offline and desktop integration.

  12. 2022

    graviola-crud-framework this project

    JSON Schema → SPARQL/RDF CRUD framework with auto-generated forms and tables. Built for a university library semantic data project.

    Outcome: active

    The UI layer that makes the semantic index navigable. SemanticTable, GenericForm, multiple store backends.

  13. 2024

    Samsung Personal Data Engine Samsung / Oxford Semantic Technologies

    On-device RDFox-powered personal knowledge graph on Galaxy S25. Commercial validation of the personal RDF concept.

    Outcome: active

    The first major consumer deployment of personal RDF. Closed source. Confirms the thesis; does not address the open, cross-device, self-hosted case.

  14. 2024

    graviola-semantic-file-analyzer this project

    Second-generation file indexer with RDF output, eBPF context tracking, location correlation, plugin pipeline for deep metadata extraction

    Outcome: experimental

    The current active development frontier. Exists, runs, needs stabilisation and documentation.

  15. 2025

    semanticdesk.top (this project, named) this project

    The full vision named and made public: personal semantic data fabric for digital serenity across all devices, domains, and data sources

    Outcome: active

    You are here.

Vocabulary

Core concepts

Plain language first, with the formal definition following each concept. The terms resemble the NFO/NIE terms you may already know, and will move into this project’s own namespace over time.

Artifact

A piece of content with a stable identity regardless of where it lives.

Example: The PDF of “Träumerei” by Schumann is one Artifact. It might exist on three Nextclouds, an SD card, and a NAS — but it is one piece of content, with one identity.

Formal definition

A named individual in the semanticdesk.top ontology with one or more content hashes (SHA-256, perceptual hash, content-only hash), a MIME type, and semantic type annotations.

Manifestation

A specific occurrence of an Artifact at a particular location on a particular device.

Example: The copy of Träumerei.pdf on the Musikschule Nextcloud in the folder “/Scores/Piano/Schumann/” is one Manifestation. The copy on the personal NAS is another.

Formal definition

Links an Artifact to a filesystem path, a device IRI, a mount point at index time, and a Realm (sharing/access context).
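
The split between Artifact and Manifestation can be illustrated with a small Python sketch. It assumes, as a simplification, that a single SHA-256 digest stands in for Artifact identity (the formal definition allows several hashes); the class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Manifestation:
    sha256: str  # content hash linking this copy to its Artifact
    device: str  # hypothetical device label, e.g. "nas"
    path: str    # filesystem path on that device


def group_by_artifact(copies: Iterable[Manifestation]) -> Dict[str, List[Manifestation]]:
    """Map each content hash (standing in for an Artifact) to its copies."""
    artifacts: Dict[str, List[Manifestation]] = defaultdict(list)
    for copy in copies:
        artifacts[copy.sha256].append(copy)
    return dict(artifacts)
```

Two copies of Träumerei.pdf with the same digest, one on the NAS and one on a Nextcloud, would end up under one key with two Manifestations, which is the "one identity, many locations" reading of the definitions above.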

Realm

A sharing and access domain — who can reach which files through which system.

Example: The school Nextcloud is a Realm. It has team folders; only members of the “music teachers” team folder can see certain scores. A colleague is either in that Realm or not. Your home folder on your laptop is also a Realm, and so is your home folder on your NAS.

Formal definition

A named individual representing a Nextcloud instance, a team folder, a local device, or any other access boundary. Manifestations in a Realm inherit its access rules.
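
Once every Manifestation carries a Realm, a question such as "is this only on the SD card?" becomes a set comparison. The following Python sketch assumes the input is a plain list of (artifact id, realm name) pairs; both identifiers and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def realms_per_artifact(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect, for each artifact id, the set of realms that hold a copy."""
    realms: Dict[str, Set[str]] = defaultdict(set)
    for artifact_id, realm in pairs:
        realms[artifact_id].add(realm)
    return dict(realms)


def only_in(realm: str, pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Artifacts whose every known copy sits in the given realm."""
    return sorted(
        artifact_id
        for artifact_id, found in realms_per_artifact(pairs).items()
        if found == {realm}
    )
```

Calling `only_in("sd-card", pairs)` would list the artifacts that have no copy in any other realm, which is one plausible reading of "not yet backed up".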

AccessEvent

A recorded moment when a process touched a file, on a specific device, at a specific time.

Example: MuseScore opened Träumerei.pdf at 19:47 on a Tuesday, on the workstation, while the calendar showed “Orchesterprobe”. That is an AccessEvent.

Formal definition

Captured via eBPF kernel tracing. Triples: process IRI, file Manifestation IRI, timestamp, device IRI. Correlated with location and calendar data to derive Context.
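
The four-part shape of an AccessEvent can be written out as N-Triples with a short Python sketch. The vocabulary IRI is the one published for the core ontology; the property names `process`, `manifestation`, `device`, and `timestamp` are assumptions made for illustration and may not match the actual schema.

```python
from datetime import datetime, timezone
from typing import List

ONT = "http://semanticdesk.top/ontology#"  # vocabulary IRI from the project page
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def access_event_triples(
    event_iri: str,
    process_iri: str,
    manifestation_iri: str,
    device_iri: str,
    when: datetime,
) -> List[str]:
    """Serialise one AccessEvent as N-Triples lines.

    The property names (process, manifestation, device, timestamp) are
    hypothetical; only the AccessEvent class name comes from the text.
    """
    stamp = when.astimezone(timezone.utc).isoformat()
    return [
        f"<{event_iri}> <{RDF_TYPE}> <{ONT}AccessEvent> .",
        f"<{event_iri}> <{ONT}process> <{process_iri}> .",
        f"<{event_iri}> <{ONT}manifestation> <{manifestation_iri}> .",
        f"<{event_iri}> <{ONT}device> <{device_iri}> .",
        f'<{event_iri}> <{ONT}timestamp> "{stamp}"^^<{XSD_DATETIME}> .',
    ]
```

The resulting lines could be sent to a store through SPARQL Update or written to a file; pass a timezone-aware `datetime`, since a naive one is interpreted as local time before conversion to UTC.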

Context / Domain

A named period of activity inferred from co-occurring signals.

Example: “Tuesday evening + location: Proberaum + calendar: Orchesterprobe + processes: MuseScore, Thunderbird” coheres into a Context. Files touched in that Context are likely orchestra-related.

Formal definition

Derived by statistical analysis of AccessEvent clusters correlated with calendar events, GPS location clusters, Plasma Activities, and time patterns. Assigned to a Domain (one of the owner’s life roles: violin teacher, school teacher, programmer, community gardener).
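
As a deliberately reduced illustration of that correlation, the Python sketch below uses only one signal: it attaches each file access to the calendar entry whose time interval contains it. Real Context derivation, as described, would also weigh location clusters, Plasma Activities, and time patterns; the class and function names here are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CalendarEntry:
    title: str
    start: datetime
    end: datetime


@dataclass(frozen=True)
class AccessEvent:
    path: str
    when: datetime


def files_by_context(
    events: Iterable[AccessEvent], calendar: List[CalendarEntry]
) -> Dict[str, List[str]]:
    """Attach each access to the first calendar entry whose interval contains it."""
    contexts: Dict[str, List[str]] = defaultdict(list)
    for event in events:
        for entry in calendar:
            if entry.start <= event.when < entry.end:
                contexts[entry.title].append(event.path)
                break  # one context per access in this simplified sketch
    return dict(contexts)
```

An access to Träumerei.pdf at 19:47 during an entry titled “Orchesterprobe” would land under that title, while accesses outside every entry are left unassigned rather than guessed.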

Reality

What already exists

“Exists” is production on the author’s own machines; “experimental” is honest about rough edges; “planned” is designed but not shipped.

  • nubixtract (file indexer, C++)

    Since 2014

    C++ file indexer with RDF output, plugin architecture, multi-backend triple store support. Started in 2014 as a Strigi replacement.

  • Semantic file analyzer

    RDF output, eBPF context, location correlation

    Experimental
  • Graviola Framework

    JSON Schema → SPARQL/RDF, SemanticTable, GenericForm, multiple store backends. Concept book: gravio-la.github.io/graviola-concept-documentation

    Exists
  • nix-vox (Vox fork)

    Rust STT/TTS/VAD pipeline, NixOS flake with CUDA support

    Experimental
  • wnix/packages

    Personal NixOS package collection integrating the above

    Experimental
  • Core ontology (semanticdesk.top)

    Published LinkML schema for Artifact, Manifestation, Realm, AccessEvent, Context, Hash. Vocabulary IRI: http://semanticdesk.top/ontology#

    Exists
  • eBPF → RDF adapter

    Monitors file activity and process calls in chosen directories via eBPF. Prototype exists; AccessEvent triple emission is the current development focus.

    Experimental
  • NixOS SPARQL service module

    Declarative NixOS module for the personal SPARQL endpoint (triple store choice is configurable in deployment)

    Experimental
    (no public repo yet)
  • Restricted MCP server for personal graph

    Exposes parts of the personal graph as MCP tools for LLM assistants after user consent and with limited scope

    Planned
    (no public repo yet)
  • KRunner plugin

    Query the semantic index from KDE’s universal search bar

    Planned
    (no public repo yet)
  • Thunderbird RDF adapter

    Translates Gloda + contacts + calendar into NMO/NCO/NCAL triples

    Experimental
    (no public repo yet)
  • Android indexer

    Polls for new files on Android (polling is required by platform restrictions)

    Experimental

Sustain

Support the work

This is built by one person in the gaps between teaching, rehearsals, and client work. The code is real and has been running in production for over a decade — making it properly visible, documented, and installable by others is what sponsorship buys back.

If the vision resonates with you, even a small recurring amount makes a concrete difference.