Abstract
We're proposing something exciting: an append-only concept embedding log that captures how our understanding of concepts evolves over time. Think of it as a time machine for semantic meaning. Instead of just storing the latest vector representation of a concept, we're keeping its entire history - every twist and turn in its meaning.
This isn't your typical mutable database. We're using TimescaleDB with pgvector and pgvectorscale (specifically the StreamingDiskANN index) to create a system that can:
- Track how concepts evolve semantically over time
- Analyze semantic drift with precision
- Maintain complete historical fidelity
- Enable sophisticated latent space reasoning
Motivation
Here's the big picture: we're building a system that learns continuously, climbing the DIKW pyramid by turning raw data into actionable knowledge. Our existing observation_log is great at preventing catastrophic forgetting at the data layer, but we need more. We need to understand how concepts themselves evolve in the latent space.
Traditional approaches have a blind spot: they only keep the latest vector representation of a concept. It's like having a photo album with only the most recent picture of someone - you miss their entire life story. We're fixing this by creating a system that captures every semantic snapshot, just like our observation_log captures every observation.
Technical principles
Our design rests on four key pillars:
- Immutability: Once we record an embedding, it's set in stone. No updates, no deletions - just like a historical record.
- Temporal fidelity: Each embedding is a precise snapshot of how we understood a concept at that moment.
- Traceability: Every embedding links back to the specific observations that shaped it.
- Separation of concerns: We're using specialized structures optimized for vector operations, distinct from our main observation log.
Proposed solution
We're creating a new TimescaleDB hypertable called concept_embedding_log. This isn't just another table - it's a temporal semantic derivative of our observation_log, designed specifically for tracking concept evolution.
Architecture overview
Here's how everything fits together:
graph TD
A[Data Source] --> B(LLM Processing Component);
C(observation_log Hypertable) -- Read/Append --> B;
C -- Updates --> D(Continuous Aggregates);
D -- Read Patterns --> B;
B -- Generate Embedding --> E{Concept Embedding Generation Logic};
E -- INSERT --> F(concept_embedding_log Hypertable);
C -- observation_id --> F;
style C fill:#f9f,stroke:#333,stroke-width:2px,color:black
style F fill:#ccf,stroke:#333,stroke-width:2px,color:black
style B fill:#ff9,stroke:#333,stroke-width:2px,color:black
style D fill:#9cf,stroke:#333,stroke-width:2px,color:black
Diagram 1: How data flows from source through LLM processing to our embedding log
Schema definition
Let's look at the schema. We're using pgvector's VECTOR type for efficient storage of our high-dimensional embeddings:
-- SQL Definition for concept_embedding_log
CREATE TABLE concept_embedding_log (
    embedding_id BIGINT GENERATED ALWAYS AS IDENTITY, -- Unique identifier for this embedding event (note: any unique or primary key on a hypertable must also include the timestamp partition column)
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(), -- Precise time of embedding generation/logging
    concept_name TEXT NOT NULL, -- The unique name identifying the concept
    concept_type TEXT NOT NULL, -- Categorization, e.g., 'entity', 'coined_term'
    embedding VECTOR(<embedding_dimension>) NOT NULL, -- The semantic vector representation (specify dimension)
    source_observation_id BIGINT NOT NULL REFERENCES observation_log(id), -- Links to the trigger event; TimescaleDB restricts foreign keys that reference a hypertable, so this constraint may need to be enforced at the application level instead
    confidence REAL, -- Optional: LLM's confidence in this semantic representation
    generation_reason TEXT -- Optional: Metadata, e.g., 'initial_discovery', 'refinement'
);
-- Convert the table into a hypertable partitioned by time
SELECT create_hypertable('concept_embedding_log', 'timestamp', chunk_time_interval => INTERVAL '1 week');
-- Create an index for efficiently retrieving the embedding history for a specific concept
CREATE INDEX idx_concept_embedding_log_name_time ON concept_embedding_log (concept_name, timestamp DESC);
-- Create an ANN index using pgvectorscale's StreamingDiskANN for cosine distance
CREATE INDEX idx_concept_embedding_log_embedding_cos_diskann
ON concept_embedding_log
USING diskann (embedding vector_cosine_ops);
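To back the immutability principle with more than convention, mutations can be rejected in the database itself. A minimal sketch - app_writer is a hypothetical application role, not part of this design:
-- Hedged sketch: enforce the append-only contract at the database level.
-- 'app_writer' is a hypothetical role name used for illustration.
REVOKE UPDATE, DELETE ON concept_embedding_log FROM app_writer;

-- Defense in depth: reject any UPDATE or DELETE that slips through.
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'concept_embedding_log is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER concept_embedding_log_append_only
BEFORE UPDATE OR DELETE ON concept_embedding_log
FOR EACH ROW EXECUTE FUNCTION reject_mutation();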
Here's how it relates to our observation_log:
classDiagram
class observation_log {
+BIGINT id PK
+TIMESTAMPTZ timestamp
+JSONB payload
+TEXT operation
+REAL confidence
+TIMESTAMPTZ processed_timestamp
}
class concept_embedding_log {
+BIGINT embedding_id PK
+TIMESTAMPTZ timestamp
+TEXT concept_name
+TEXT concept_type
+VECTOR embedding
+BIGINT source_observation_id FK
+REAL confidence
+TEXT generation_reason
}
observation_log "1" -- "0..*" concept_embedding_log : contains source for
Diagram 2: How our concept embedding log relates to the observation log
Data flow and operational logic
Here's how it works in practice:
- Our LLM continuously monitors the observation_log and its aggregates
- When it spots something significant - like a new concept or a shift in meaning - it generates an embedding
- This embedding captures the concept's meaning based on everything we know up to that point
- We insert a new row into concept_embedding_log, never updating existing ones
- For similarity searches, we use the cosine distance operator (<=>), as sketched below
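To make this concrete, here's a minimal sketch of both operations. The literal values (a 3-dimensional vector, observation id 123, confidence 0.87) are illustrative only; real embeddings would match the dimension declared in the schema:
-- Append a new semantic snapshot; existing rows are never touched.
INSERT INTO concept_embedding_log
    (concept_name, concept_type, embedding, source_observation_id, confidence, generation_reason)
VALUES
    ('Vibe Coding', 'coined_term', '[0.1, 0.5, 0.2]', 123, 0.87, 'initial_discovery');

-- Similarity search with pgvector's cosine distance operator (<=>).
SELECT concept_name, timestamp, embedding <=> '[0.1, 0.5, 0.2]' AS cosine_distance
FROM concept_embedding_log
ORDER BY embedding <=> '[0.1, 0.5, 0.2]'
LIMIT 5;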
Here's the sequence in detail:
sequenceDiagram
participant DS as Data Source
participant LLM as LLM Processor
participant OLog as observation_log
participant Aggs as Continuous Aggregates
participant CELog as concept_embedding_log
DS ->> LLM: Raw Data Input
LLM ->> OLog: INSERT Observation (Data, Info)
OLog -->> Aggs: Trigger Aggregate Update
Aggs -->> LLM: Provide Updated Patterns (Info)
LLM ->> OLog: Query Observations/Aggregates
alt Significant Semantic Event Detected
LLM ->> LLM: Synthesize Concept / Detect Refinement (Knowledge)
LLM ->> LLM: Generate Embedding Vector
LLM ->> CELog: INSERT Embedding Record (Timestamp, Name, Type, Vector, SourceObsID)
Note right of LLM: New row logged in concept_embedding_log
end
Diagram 3: The sequence of events when generating and logging a new concept embedding
Reasoning capabilities
This is where it gets interesting. Our append-only design lets us do things that were impossible before:
- Track semantic evolution: We can see how a concept's meaning has changed over time
- Analyze semantic drift: By calculating vector distances between consecutive embeddings (see the query sketch after this list)
- Perform time-contextual searches: Find concepts similar to "Vibe Coding" as it was understood during its early days
- Monitor concept emergence: Track when new concepts first appear
- Observe semantic stabilization: See when a concept's meaning becomes more stable
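Here's the semantic-drift query referenced above: the cosine distance between each embedding of a concept and its immediate predecessor. A sketch against the schema defined earlier:
-- Semantic drift: distance between consecutive snapshots of one concept.
-- The first row has no predecessor, so its drift is NULL.
SELECT
    timestamp,
    embedding <=> LAG(embedding) OVER (ORDER BY timestamp) AS drift_from_previous
FROM concept_embedding_log
WHERE concept_name = 'Vibe Coding'
ORDER BY timestamp;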
Here's an example of querying the semantic history of 'Vibe Coding':
+---------------------+---------------+--------------------------+------------------------+
| timestamp | concept_name | embedding | source_observation_id |
+---------------------+---------------+--------------------------+------------------------+
| 2025-01-15 10:00:00 | Vibe Coding | [0.1, 0.5, ..., 0.2] | 123 | <-- Initial Discovery
| 2025-02-20 14:30:00 | Vibe Coding | [0.12, 0.51, ..., 0.25] | 456 | <-- Refinement after new context
| 2025-04-10 09:15:00 | Vibe Coding | [0.11, 0.49, ..., 0.28] | 789 | <-- Slight drift
| ... | ... | ... | ... |
+---------------------+---------------+--------------------------+------------------------+
Figure 1: How we track the evolution of a concept's semantic meaning over time
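A time-contextual search over this same history might look like the following sketch: take the earliest snapshot of 'Vibe Coding' as the reference vector and restrict candidates to the same early period. The cutoff date is illustrative:
-- Reference vector: the earliest recorded understanding of 'Vibe Coding'.
WITH early_vibe AS (
    SELECT embedding
    FROM concept_embedding_log
    WHERE concept_name = 'Vibe Coding'
    ORDER BY timestamp ASC
    LIMIT 1
)
-- Candidates: other concepts as they were understood before the cutoff.
SELECT c.concept_name, c.timestamp, c.embedding <=> e.embedding AS cosine_distance
FROM concept_embedding_log c, early_vibe e
WHERE c.timestamp < '2025-02-01'
  AND c.concept_name <> 'Vibe Coding'
ORDER BY c.embedding <=> e.embedding
LIMIT 10;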
Considerations and tradeoffs
Every design choice comes with tradeoffs. Here are the key ones to consider:
- Storage growth: We're keeping every version of every embedding. This means:
- More storage needed
- Need for effective compression strategies (see the sketch after this list)
- Possible need for tiered storage for older embeddings
- Query patterns: Getting the "current" state requires explicitly selecting the latest timestamp (also sketched after this list). This is different from mutable stores but gives us more flexibility.
- Search performance: While StreamingDiskANN is efficient, searching the entire history without time bounds might be slower. We'll need to optimize our queries.
- LLM decision making: The system's effectiveness depends on the LLM's ability to detect significant semantic shifts. We'll need to tune this carefully.
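Two of these tradeoffs map directly onto existing TimescaleDB and Postgres machinery. A sketch, where the segment-by column and the 30-day compression horizon are illustrative choices rather than tuned values:
-- Storage growth: compress older chunks, grouped by concept for locality.
ALTER TABLE concept_embedding_log SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'concept_name'
);
SELECT add_compression_policy('concept_embedding_log', INTERVAL '30 days');

-- Query patterns: the "current" state is the latest snapshot per concept.
SELECT DISTINCT ON (concept_name)
    concept_name, timestamp, embedding
FROM concept_embedding_log
ORDER BY concept_name, timestamp DESC;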
Alternatives considered
We looked at the traditional approach: a standard relational table or key-value store with UPSERT operations. It would be simpler to implement and use less storage, but it would lose something crucial - the history of how concepts evolve.
Given our goal of building a system that truly learns and understands, we chose the append-only approach. It aligns with our core principles and gives us capabilities that simpler solutions can't match.