Project reports system: a case study

At Dwarves, we’ve developed a Monthly Project Reports system that transforms communication data into actionable intelligence. This lean system orchestrates multiple data streams into comprehensive project insights while maintaining enterprise-grade security and cost efficiency.

The need for orchestrated intelligence

Our engineering teams exchange thousands of Discord messages daily across projects, capturing critical technical discussions, architectural decisions, and implementation details. However, while Discord excels at real-time communication, valuable insights often remain buried in chat histories, making it difficult to:

  1. Track project progress against client requirements.
  2. Align ongoing discussions with formal documentation.
  3. Extract actionable insights from technical conversations.

This challenge led us to develop the Project Reports system - an intelligent orchestration layer that transforms scattered communication data into structured project intelligence. Our system processes multiple data streams, extracting key insights and patterns to generate comprehensive project visibility.

The foundation: Data architecture

Our architecture follows a simple yet powerful approach to data management, emphasizing efficiency and practicality over complexity. We’ve built our system on three core principles:

  1. Lean storage: S3 serves as our primary data lake and warehouse, using Parquet and CSV files to optimize for both cost and performance
  2. Efficient processing: DuckDB and Polars provide high-performance querying without the overhead of traditional data warehouses
  3. Secure access: Modal orchestrates our serverless functions, ensuring secure and efficient data processing
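
To make the lean-storage principle more concrete, here is a minimal sketch of how object keys in the two S3 zones could be laid out. The zone names come from the diagrams in this post; the `project=/year=/month=` layout, the `Partition` class, and the default file name are assumptions for illustration, not a description of our actual bucket.

```python
from dataclasses import dataclass

VALID_ZONES = {"landing", "gold"}


@dataclass(frozen=True)
class Partition:
    """Hypothetical description of one monthly partition in the data lake."""

    zone: str      # "landing" (raw) or "gold" (quality-checked)
    source: str    # e.g. "discord"
    project: str   # project identifier
    year: int
    month: int

    def __post_init__(self) -> None:
        # Reject values that would produce a malformed key.
        if self.zone not in VALID_ZONES:
            raise ValueError(f"unknown zone: {self.zone!r}")
        if not 1 <= self.month <= 12:
            raise ValueError(f"month out of range: {self.month}")

    def key(self, filename: str = "messages.parquet") -> str:
        """Build a Hive-style partitioned object key (an assumed layout)."""
        return (
            f"{self.zone}/{self.source}/project={self.project}/"
            f"year={self.year}/month={self.month:02d}/{filename}"
        )


if __name__ == "__main__":
    print(Partition("landing", "discord", "alpha", 2024, 5).key())
```

A `key=value` directory layout like this is one that engines such as DuckDB can read as partition columns, which is why it is a common choice for file-based lakes.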

Data flow overview

graph TB
    subgraph Data Sources ["Data Sources (Raw)"]
        D1[Discord Messages]
        D2[Git Activity]
        D3[JIRA Tickets]
        D4[Google Docs]
        D5[Notion Pages]
    end

    subgraph Data Engineering
        L1[Landing Zone - S3]
        G1[Gold Zone - S3]
        DQ[Data Quality Checks]

        D1 & D2 & D3 & D4 & D5 & D6 --> L1
        L1 --> DQ
        DQ --> G1
    end

    subgraph Platform Engineering
        API[REST API]
        SEC[Security Layer]
        ORCH[Modal Orchestration]

        G1 --> API
        API --> SEC
        SEC --> MON
        MON --> ORCH
    end

    subgraph AI Engineering
        LLM[LLM Processing]

        ORCH --> LLM
        LLM --> AGG
        AGG --> SUM
    end

    subgraph Operations Usage
        R1[Monthly Reports]
        R2[Progress Tracking]
        R3[Resource Planning]

        SUM --> R1 & R2 & R3
    end

    classDef data fill:#d4ebf2,stroke:#1b70a6,color:#000
    classDef platform fill:#fdf1d5,stroke:#d4a017,color:#000
    classDef ai fill:#e8f5e8,stroke:#2d862d,color:#000
    classDef ops fill:#ffe6e6,stroke:#cc0000,color:#000

    class D1,D2,D3,D4,D5,D6 data
    class L1,G1,DQ platform
    class API,SEC,MON,ORCH platform
    class LLM,AGG,SUM,VEC ai
    class R1,R2,R3 ops

The system begins with raw data collection from various sources, primarily Discord at present, with planned expansion to Git, JIRA, Google Docs, and Notion. This data moves through our S3-based landing and gold zones, where it undergoes quality checks and transformations before feeding into our platform and AI engineering layers.
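
As an illustration of the quality gate between the landing and gold zones, the sketch below checks a batch for missing fields and duplicate IDs and only promotes it when the reject ratio is low enough. The field names, the `QualityReport` class, and the 20% threshold are all assumptions made for this example; the actual checks are not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class QualityReport:
    """Batch-level summary produced by the (hypothetical) quality check."""

    total: int = 0
    duplicates: int = 0
    missing_fields: int = 0
    passed: list = field(default_factory=list)

    @property
    def reject_ratio(self) -> float:
        if self.total == 0:
            return 0.0
        return (self.duplicates + self.missing_fields) / self.total


def check_batch(records, required=("message_id", "content")) -> QualityReport:
    """Flag records with empty required fields or repeated message IDs."""
    report = QualityReport()
    seen_ids = set()
    for record in records:
        report.total += 1
        if any(record.get(name) in (None, "") for name in required):
            report.missing_fields += 1
        elif record["message_id"] in seen_ids:
            report.duplicates += 1
        else:
            seen_ids.add(record["message_id"])
            report.passed.append(record)
    return report


def promote_to_gold(report: QualityReport, max_reject_ratio: float = 0.2) -> bool:
    """Promote only non-empty batches whose reject ratio is within the threshold."""
    return report.total > 0 and report.reject_ratio <= max_reject_ratio


if __name__ == "__main__":
    batch = [
        {"message_id": "1", "content": "deploy is green"},
        {"message_id": "1", "content": "deploy is green"},  # duplicate
        {"message_id": "2", "content": ""},                 # missing content
    ]
    report = check_batch(batch)
    print(report.reject_ratio, promote_to_gold(report))
```

Keeping the decision in a separate `promote_to_gold` function means the threshold can be tuned per source without touching the checks themselves.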

Detailed processing pipeline

graph LR
    subgraph Data Collection
        DC1[Discord Collector]
        DC2[Git Collector]
        DC3[JIRA Collector]
        SCHEDULE[Weekly Schedule]

        SCHEDULE --> DC1 & DC2 & DC3
    end

    subgraph Processing Pipeline
        B1[Message Buffer]
        B2[Git Buffer]
        B3[Ticket Buffer]

        P1[PII Scrubber]
        P2[Data Validator]
        P3[Schema Enforcer]

        DC1 --> B1
        DC2 --> B2
        DC3 --> B3

        B1 & B2 & B3 --> P1
        P1 --> P2
        P2 --> P3
    end

    subgraph Storage Layer
        S1[S3 - Parquet Files]
        S2[S3 - CSV Files]

        P3 --> S1
        P3 --> S2
    end

    subgraph Query Layer
        Q1[DuckDB Engine]
        Q2[Polars Engine]
        Q3[Report Generator]

        S1 --> Q1
        S2 --> Q2
        Q1 & Q2 --> Q3
    end

    style DC1 fill:#d4ebf2,stroke:#1b70a6,color:#000
    style DC2 fill:#d4ebf2,stroke:#1b70a6,color:#000
    style DC3 fill:#d4ebf2,stroke:#1b70a6,color:#000

    style P1 fill:#fdf1d5,stroke:#d4a017,color:#000
    style P2 fill:#fdf1d5,stroke:#d4a017,color:#000
    style P3 fill:#fdf1d5,stroke:#d4a017,color:#000

    style Q1 fill:#e8f5e8,stroke:#2d862d,color:#000
    style Q2 fill:#e8f5e8,stroke:#2d862d,color:#000
    style Q3 fill:#e8f5e8,stroke:#2d862d,color:#000

Our processing pipeline emphasizes efficiency and security:

  1. Collection layer: Weekly scheduled collectors gather data from various sources
  2. Processing pipeline: Data undergoes PII scrubbing, validation, and schema enforcement
  3. Storage layer: Processed data is stored in S3 using Parquet and CSV formats
  4. Query layer: DuckDB and Polars engines provide fast, efficient data analysis
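
The processing stage above (PII scrubbing, then validation, then schema enforcement) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the regular expressions, the `SCHEMA` fields, and the function names are hypothetical, and a production scrubber would need far broader PII coverage than emails and phone numbers.

```python
import re
from datetime import datetime
from typing import Any, Optional

# Deliberately simple patterns; real PII detection needs much more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Assumed target schema for a stored message.
SCHEMA = {
    "message_id": str,
    "channel_id": str,
    "author": str,
    "content": str,
    "timestamp": datetime,
}


def scrub_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def validate(record: dict) -> bool:
    """A record is valid when every schema field is present and non-empty."""
    return all(record.get(name) not in (None, "") for name in SCHEMA)


def enforce_schema(record: dict) -> dict:
    """Coerce each field to its schema type and drop unknown fields."""
    out: dict[str, Any] = {}
    for name, expected in SCHEMA.items():
        value = record[name]
        if isinstance(value, expected):
            pass
        elif expected is datetime:
            value = datetime.fromisoformat(str(value))
        else:
            value = expected(value)
        out[name] = value
    return out


def process(record: dict) -> Optional[dict]:
    """Run scrub -> validate -> enforce; return None for rejected records."""
    scrubbed = {**record, "content": scrub_pii(str(record.get("content", "")))}
    if not validate(scrubbed):
        return None
    return enforce_schema(scrubbed)
```

Scrubbing runs first so that no later stage, including any logging done by the validator, ever sees the raw personal data.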

Dify - Operational intelligence through low-code workflows

We use Dify to transform our raw data streams into intelligent insights through low-code workflows. This process bridges the gap between our data collection pipeline and the operational insights needed by our team.

graph LR
    subgraph "Input Collection"
        START[Start] --> |channel_id/dates| PE1[Parameter Extractor 1]
        START --> |git_token| PE2[Parameter Extractor 2]
        START --> |condition check| IE{IF/ELSE}
    end

    subgraph "Data Extraction"
        PE1 --> |Map| LE[Links Extraction]
        PE2 --> |Map| GE[Git Extraction]
        IE --> |dialogue_count ≤ 1| DM[Discord Messages]
    end

    subgraph "Parallel Processing"
        LE --> |Iterate| IT[Link Iterator]
        IT --> |Map| FSP[Fetch Single Page]
        GE --> |Map| GT[Git Traverser]
        DM --> |Map| VA[Variable Aggregator]
    end

    subgraph "Reduction & Output"
        FSP --> |Reduce| RED[Template Transform]
        GT --> |Reduce| RED
        VA --> |Reduce| RED
        RED --> LLM[Monthly Reporter LLM]
        LLM --> ANS[Answer]
    end

    style START fill:#f9f,stroke:#333,color:#000
    style IT fill:#bbf,stroke:#333,color:#000
    style RED fill:#bfb,stroke:#333,color:#000
    style ANS fill:#fbf,stroke:#333,color:#000
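
The workflow follows a map-reduce shape: extract links, fetch each page in parallel, then fold everything into one prompt for the reporting model. The sketch below mirrors that shape in plain Python. It is an analogy rather than Dify's API: the function names, the URL pattern, and the prompt template are assumptions, and `fetch_page` is a caller-supplied stand-in for the "Fetch Single Page" step.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

URL_RE = re.compile(r"https?://[^\s<>()]+")


def extract_links(messages: list[str]) -> list[str]:
    """Collect unique URLs in first-seen order (the 'Links Extraction' step)."""
    seen: dict[str, None] = {}
    for message in messages:
        for url in URL_RE.findall(message):
            seen.setdefault(url.rstrip(".,;"), None)
    return list(seen)


def build_report_prompt(messages: list[str], fetch_page: Callable[[str], str]) -> str:
    """Map: fetch every linked page in parallel. Reduce: fill one template."""
    links = extract_links(messages)
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so pages line up with links.
        pages = list(pool.map(fetch_page, links))
    sections = [f"## {url}\n{body}" for url, body in zip(links, pages)]
    return (
        "# Discussion\n" + "\n".join(messages)
        + "\n\n# Linked pages\n" + "\n\n".join(sections)
    )


if __name__ == "__main__":
    demo = ["Spec is at https://example.com/spec.", "Ship it on Friday"]
    print(build_report_prompt(demo, lambda url: f"(contents of {url})"))
```

The resulting string is what would be handed to the reporting LLM; passing `fetch_page` in as a parameter keeps the sketch testable without network access.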

Our Dify implementation provides a few key advantages:

Operational impact

The Project Reports system serves as the foundation for our Operations team’s project oversight. It provides:

Technical implementation

Secure data collection

The cornerstone of our system is a robust collection pipeline built on Modal. Our collection process runs weekly, automatically processing Discord messages through a sophisticated filtering system that preserves critical technical discussions while ensuring security and privacy.

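@app.function(  # decorator opening restored; `app` is an assumed name for the Modal app object
    schedule=modal.Cron("0 1 * * 1"),  # Weekly Monday collection (01:00)
)
def weekly_discord_collection():
    # `year` and `month` (the reporting period) are defined outside this excerpt.
    category_id = get_category_id.local()
    channels = get_category_channels.remote(category_id)
    # Fan out one (channel, year, month) task per channel.
    channel_args = [(channel, year, month) for channel in channels]
    # starmap returns a generator; list() consumes it so every task runs.
    saved_files = list(process_channel_monthly_data.starmap(channel_args))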

Through Modal’s serverless architecture, we’ve implemented separate landing zones for different project data, ensuring granular access control and comprehensive audit trails. Each message undergoes content filtering and PII scrubbing before being transformed into optimized Parquet format, providing both storage efficiency and query performance.
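
Because the collector runs weekly but writes monthly files, each run has to decide which `(year, month)` partitions to refresh. One possible way to derive them is sketched below, assuming a seven-day lookback window; the function names and the lookback default are hypothetical and are not taken from our codebase.

```python
from datetime import date, timedelta


def reporting_periods(run_date: date, lookback_days: int = 7) -> list[tuple[int, int]]:
    """Return the distinct (year, month) pairs touched by the lookback window."""
    periods: list[tuple[int, int]] = []
    day = run_date - timedelta(days=lookback_days)
    while day <= run_date:
        period = (day.year, day.month)
        if period not in periods:
            periods.append(period)
        day += timedelta(days=1)
    return periods


def build_channel_args(channels: list[str], run_date: date) -> list[tuple[str, int, int]]:
    """Build one (channel, year, month) tuple per channel and affected month."""
    return [
        (channel, year, month)
        for channel in channels
        for (year, month) in reporting_periods(run_date)
    ]


if __name__ == "__main__":
    # A Monday run early in June also refreshes May's file.
    print(build_channel_args(["backend"], date(2024, 6, 3)))
```

A run that straddles a month boundary produces two tuples per channel, so the previous month's file is completed rather than left a few days short.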

Query interface

The system provides a flexible API for accessing processed data:

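@app.function(  # decorator opening restored; `app` is an assumed name for the Modal app object
    volumes={MOUNT_PATH: modal.CloudBucketMount("dwarvesf-discord", secret=secrets)},
)
def query_messages(item: QueryRequest, token: str = Depends(verify_token)) -> Dict:
    parquet_files = get_relevant_files.remote(
        ...  # arguments and the rest of the body are elided in the original excerpt
    )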
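
To illustrate the two ideas in this endpoint, narrowing a query to the relevant monthly files and checking a bearer token, here is a standalone sketch. The path layout, `months_between`, `relevant_files`, and `token_is_valid` are hypothetical stand-ins; the real `get_relevant_files` and `verify_token` are not shown in this post.

```python
import hmac
from datetime import date


def months_between(start: date, end: date) -> list[tuple[int, int]]:
    """List every (year, month) from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def relevant_files(mount_path: str, channel_id: str, start: date, end: date) -> list[str]:
    """Map a date range onto monthly Parquet paths (assumed layout)."""
    return [
        f"{mount_path}/{channel_id}/{year}/{month:02d}.parquet"
        for year, month in months_between(start, end)
    ]


def token_is_valid(presented: str, expected: str) -> bool:
    """Compare tokens in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented.encode(), expected.encode())


if __name__ == "__main__":
    print(relevant_files("/data", "c1", date(2023, 12, 15), date(2024, 1, 10)))
```

Selecting files by month before querying keeps each request from scanning the whole bucket; a list of paths like this can be passed directly to a Parquet reader such as DuckDB's `read_parquet`.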

Measured impact

The implementation of Project Reports has changed how we manage projects. Our operations team now has greater visibility into project progress, and progress tracking and early issue identification have become the norm rather than the exception. The automated documentation of key decisions has reduced meeting overhead, while correlating discussions with deliverables helps ensure that nothing falls through the cracks.

Future development

We’re expanding the system’s capabilities in several key areas:

We also don’t plan to be locked into Modal as our only vendor. The foundations we’ve laid for our landing zones and data lake make it easy to swap query and API architectures in and out.


At Dwarves, our Project Reports system demonstrates the power of thoughtful data engineering in transforming raw communication into strategic project intelligence. By combining secure data collection, efficient processing, and AI-powered analysis, we’ve created a system that doesn’t just track progress – it actively contributes to project success.

The system continues to coordinate our project data streams with precision and purpose, ensuring that every piece of information contributes to a clear picture of project health. Through this systematic approach, we’re setting new standards for data-driven project management in software development, one report at a time.