The Memo platform uses a sophisticated pipeline to transform raw Markdown content into the final, published website. This process involves automated builds, data processing, and a suite of tools dedicated to ensuring content quality and consistency. This document outlines these key aspects.
Automated build workflow
At its heart, the build system is triggered by pushes to the dev branch (as defined in .github/workflows/dispatch.yml) and orchestrates a sequence of scripts and commands to get content ready for the web application, onchain integrations, and other platform features.
Key scripts and their functions
- generate-summary.ts: This script is run to generate summaries, likely as part of the SPR content compression and AI processing steps.
- duckdb-export (Makefile command): This command initiates the core data processing. It uses the Elixir compiler (lib/obsidian-compiler/export_duckdb.ex) to compile Obsidian Markdown, handle image processing (like compression and WebP conversion), detect and replace image paths, and save all processed data to DuckDB. A rough sketch of the image-path step follows this list.
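The compilation and image handling themselves live in the Elixir exporter; as a purely illustrative sketch (not the actual implementation), the image-path replacement step is conceptually similar to the following, where the .webp naming convention is an assumption:

// Illustrative only: the real logic is in lib/obsidian-compiler/export_duckdb.ex (Elixir).
// Rewrites Markdown image references to point at compressed WebP outputs.
function rewriteImagePaths(markdown) {
  return markdown.replace(
    /!\[([^\]]*)\]\(([^)]+)\.(png|jpe?g)\)/gi,
    (match, alt, base) => `![${alt}](${base}.webp)`
  );
}

// Example: "![diagram](img/build.png)" becomes "![diagram](img/build.webp)"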
Core data processing
The core data processing pipeline handles the transformation of raw Markdown content into structured data for the platform.
File selection and initial handling
The system identifies which files to process, either based on recent Git history for incremental updates or a specified pattern. Files within the vault directory are filtered using patterns defined in .export-ignore, and submodule changes are also handled.
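As a rough sketch of this selection step (the Git range, the vault path, and the simplified pattern matching are assumptions; the real pipeline implements this in the Elixir exporter):

import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

// Sketch only: pick Markdown files changed in the last commit and drop anything
// matched by .export-ignore. Real matching is gitignore-style; naive substring
// matching is used here purely for illustration.
function selectChangedFiles(vaultDir = "vault") {
  const ignored = readFileSync(".export-ignore", "utf8")
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith("#"));

  return execSync("git diff --name-only HEAD~1 HEAD", { encoding: "utf8" })
    .split("\n")
    .filter((file) => file.startsWith(`${vaultDir}/`) && file.endsWith(".md"))
    .filter((file) => !ignored.some((pattern) => file.includes(pattern)));
}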
AI-powered content transformation
This step involves compressing content using SPR (Summary-Preserve-Refactor) and generating embeddings (both OpenAI and custom). Processing is optimized by only regenerating embeddings when content actually changes, leveraging Git history to track file relationships through renames (the previous_paths field).
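A minimal sketch of the change-detection idea, assuming a content hash is stored per file; the helper and its arguments are hypothetical, though the previous_paths rename tracking is real:

import { createHash } from "node:crypto";

// Sketch: only trigger SPR compression and embedding generation when the content
// hash differs from the hash stored for the file (or, after a rename, for any of
// its previous_paths). storedHashes is assumed to be a Map of path -> hash.
function needsReprocessing(file, content, storedHashes, previousPaths = []) {
  const hash = createHash("sha256").update(content).digest("hex");
  const knownHash =
    storedHashes.get(file) ??
    previousPaths.map((p) => storedHashes.get(p)).find(Boolean);
  return knownHash !== hash;
}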
Database operations
The vault table in DuckDB is set up, with the schema updated as needed and obsolete columns removed. Each selected file is read and its frontmatter extracted and normalized; if the file is new or changed, AI processing is triggered. All extracted and processed data is stored in the DuckDB database. Finally, the data is exported to a Parquet file (vault.parquet), which is crucial for efficient data access by the web application and other parts of the system.
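To make the shape of this step concrete, here is a condensed, illustrative sketch of the per-file frontmatter split together with the final Parquet export statement; the actual implementation is the Elixir exporter and performs far more normalization:

// Sketch only: split YAML frontmatter from the Markdown body. The real exporter
// also normalizes fields and stores SPR output and embeddings alongside them.
function splitFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!match) return { frontmatter: "", body: markdown };
  return { frontmatter: match[1], body: match[2] };
}

// Once every row has been written to the vault table, the Parquet export is a
// single DuckDB statement along these lines:
//   COPY vault TO 'vault.parquet' (FORMAT PARQUET);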
Content quality assurance
Beyond the core data processing, a comprehensive set of tools ensures content consistency, quality, and adherence to our standards. These tools cover linting (validating Markdown and metadata) and formatting (standardizing text styling).
Relevant source files include scripts/formatter/format-sentence-case.js, scripts/formatter/note-lint.js, and various rule files within scripts/formatter/rules/.
Linting
Linting architecture and process
Memo features a modular note linting system that verifies Markdown files against a configurable set of rules. The process generally involves recursively scanning target directories, dynamically loading rule modules (e.g., from scripts/formatter/rules/), applying each loaded rule, and reporting violations. The linter exits with an error code if issues are detected, making it suitable for CI integration.
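A simplified sketch of that flow, assuming each rule module exports a check(file, content) function (as described below); the real note-lint.js may differ in details such as argument parsing and report formatting:

import { readdirSync, readFileSync } from "node:fs";
import path from "node:path";

// Sketch of the lint driver: walk the target directory, load each rule module,
// run its check() against every Markdown file, and exit non-zero on violations.
async function lint(targetDir, rulesDir = "scripts/formatter/rules") {
  const rules = await Promise.all(
    readdirSync(rulesDir)
      .filter((f) => f.endsWith(".js"))
      .map((f) => import(path.resolve(rulesDir, f)))
  );

  const mdFiles = readdirSync(targetDir, { recursive: true })
    .filter((f) => f.endsWith(".md"))
    .map((f) => path.join(targetDir, f));

  let violations = 0;
  for (const file of mdFiles) {
    const content = readFileSync(file, "utf8");
    for (const rule of rules) {
      for (const message of rule.check(file, content)) {
        console.error(`${file}: ${message}`);
        violations += 1;
      }
    }
  }
  if (violations > 0) process.exit(1);
}

// Usage: await lint("vault/");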
Key linting rules
- Frontmatter validation (frontmatter.js): Validates the YAML frontmatter section, checking for proper delimiters, formatting, balanced quotes/brackets, absence of duplicate keys, and no forbidden characters.
- Heading structure rules (no-heading1.js): Enforces proper document structure by flagging any level 1 headings within the content; the document title should come from the frontmatter. An illustrative approximation of this rule follows this list.
- Link validation (relative-link-exists.js): Verifies that all relative links point to existing resources within the repository, handling standard and image links, ignoring code blocks, and processing URI-encoded paths and anchors.
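As an illustrative approximation of the heading rule (the shipped no-heading1.js may differ, for example in how it handles code blocks and setext headings):

// Rough approximation of a level 1 heading check; not the actual rule implementation.
function check(file, content) {
  const violations = [];
  content.split("\n").forEach((line, index) => {
    if (/^#\s/.test(line)) {
      violations.push(`Line ${index + 1}: level 1 heading found; the title should come from the frontmatter`);
    }
  });
  return violations;
}

export { check };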
Extending with custom rules
The modular design allows for easy extension. To add a new rule, create a JavaScript file in scripts/formatter/rules/ and implement a check(file, content) function that returns an array of violation messages:
/**
* Rule: Description of what the rule checks.
* Returns an array of violation messages if any.
*/
function check(file, content) {
  const violations = [];
  // Rule implementation: analyze content and push violation messages as needed
  return violations;
}
export { check };
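For instance, a hypothetical rule that flags trailing whitespace could follow the same shape:

/**
 * Rule: Flags lines that end with trailing whitespace (hypothetical example).
 * Returns an array of violation messages if any.
 */
function check(file, content) {
  const violations = [];
  content.split("\n").forEach((line, index) => {
    if (/[ \t]+$/.test(line)) {
      violations.push(`Line ${index + 1} has trailing whitespace`);
    }
  });
  return violations;
}

export { check };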
Formatting
AI-powered sentence case formatting
An AI-powered tool (format-sentence-case.js) standardizes text formatting, particularly for titles and headings. It converts text to sentence case while preserving proper nouns and acronyms: relevant text elements (frontmatter titles, headings, bold text, link text) are extracted, converted with an OpenAI model, and written back in place of the original text.
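As a rough sketch of the conversion call (the model name and prompt here are assumptions, not necessarily what format-sentence-case.js uses):

import OpenAI from "openai";

// Sketch only: ask an OpenAI model to convert a title to sentence case while
// keeping proper nouns and acronyms intact. Model choice and prompt are assumptions.
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function toSentenceCase(text) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "Convert the given title to sentence case. Preserve proper nouns and acronyms. Reply with the converted text only.",
      },
      { role: "user", content: text },
    ],
  });
  return completion.choices[0].message.content.trim();
}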
Workflow integration
The linting and formatting tools are designed for flexibility, usable both manually and in automated workflows. They can be run from the command line on files or directories:
- Running the note linter:

  # Check all rules in the vault directory
  node scripts/formatter/note-lint.js vault/

  # Check specific rules
  node scripts/formatter/note-lint.js vault frontmatter,no-heading1

- Running the sentence case formatter:

  node scripts/formatter/format-sentence-case.js [optional-path-to-file-or-directory]
These tools are integral to the Memo content workflow: content is created/edited, linted, formatted, and then proceeds to the core processing pipeline before DuckDB export. These steps can be run manually, integrated into Git pre-commit hooks, or incorporated into CI/CD pipelines.