Jörg Lohrer fbd6630f6d chore: initial project setup

- Git repository with .gitignore and .editorconfig
- NPM project with package.json and dependencies
- Project structure (src/, docs/, examples/, test/)
- Comprehensive README.md with features and roadmap
- Architecture documentation with Mermaid diagrams
- Design decisions documented
- .env.example for Forgejo API configuration
- MIT license and contributing guidelines

Status: Phase 1 - Core Parser (setup complete)

2025-10-01 15:28:30 +02:00

🏛️ Architecture Documentation

Overview

MDParser is a modular, extensible parser for Markdown files with YAML front matter, optimized for processing AMB-compliant educational resources.

📐 Architecture Diagram

flowchart TB
    subgraph "Data Sources"
        File["📄 Local File"]
        URL["🌐 HTTP/HTTPS URL"]
        API["🔌 Forgejo/Gitea API"]
    end

    subgraph "Core Parser"
        Fetch["Fetch Module<br/>Fetch data"]
        Unified["unified Pipeline<br/>remark-parse<br/>remark-frontmatter<br/>remark-gfm"]
        YAMLParser["YAML Parser<br/>yaml library"]
    end

    subgraph "Extraction Layer"
        FrontMatter["Front Matter<br/>Extractor"]
        AMBExtract["AMB Metadata<br/>Extractor<br/>(schema.org)"]
        ContentExtract["Content<br/>Extractor<br/>(AST)"]
    end

    subgraph "Output Formats"
        JSON["📦 JSON<br/>Structured Data"]
        AST["🌲 MDAST<br/>Abstract Syntax Tree"]
        HTML["📝 HTML<br/>(optional)"]
    end

    subgraph "Transformers (Phase 2)"
        WP["WordPress<br/>REST API v2"]
        Nostr["Nostr<br/>NIP-23"]
    end

    File --> Fetch
    URL --> Fetch
    API --> Fetch

    Fetch --> Unified
    Unified --> YAMLParser
    Unified --> FrontMatter

    FrontMatter --> AMBExtract
    FrontMatter --> ContentExtract

    AMBExtract --> JSON
    ContentExtract --> AST
    ContentExtract --> HTML

    JSON --> WP
    JSON --> Nostr
    AST --> WP
    AST --> Nostr

    style Unified fill:#e1f5ff,stroke:#01579b
    style AMBExtract fill:#f3e5f5,stroke:#4a148c
    style JSON fill:#e8f5e9,stroke:#1b5e20

🎯 Design Principles

1. Modularity

Each component has a single, clear responsibility
Loose coupling between modules
Easily extensible through a plugin system

2. Isomorphism

Code runs in both Node.js and the browser
No Node.js-specific APIs in the core
Native fetch for HTTP requests

3. Standards Compliance

AMB metadata standard (schema.org)
MDAST (Markdown Abstract Syntax Tree)
CommonMark + GFM (GitHub Flavored Markdown)

4. Fault Tolerance

Graceful degradation when metadata is missing
Validation with meaningful error messages
Optional fields are handled cleanly
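
The fault-tolerance principle can be illustrated with a minimal sketch. The function name `validateMetadata` and both field lists are assumptions made for this example; the actual AMB requirements may differ. Missing required fields produce descriptive errors, missing optional fields only produce warnings, and nothing throws.

```javascript
// Hypothetical field lists; the real AMB requirements may differ.
const REQUIRED_FIELDS = ['name', 'id']
const OPTIONAL_FIELDS = ['description', 'license', 'inLanguage', 'image']

function validateMetadata(metadata) {
  const data = metadata ?? {}
  const errors = []
  const warnings = []

  for (const field of REQUIRED_FIELDS) {
    if (data[field] === undefined || data[field] === '') {
      errors.push(`Missing required field "${field}"`)
    }
  }
  for (const field of OPTIONAL_FIELDS) {
    if (data[field] === undefined) {
      warnings.push(`Optional field "${field}" is not set`)
    }
  }
  // The caller decides whether errors are fatal; this function never throws.
  return { valid: errors.length === 0, errors, warnings }
}
```

Returning a result object instead of throwing lets a batch run report every problematic file rather than stopping at the first one.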

📦 Module Structure

Core Modules

1. Parser (`src/parser.js`)

export async function parseMarkdownFile(filePath, options) {
  // Main entry point for Markdown parsing
  // Orchestrates the unified pipeline
  return {
    yaml: {},      // Raw YAML front matter
    metadata: {},  // Extracted AMB metadata
    ast: {},       // Markdown AST
    content: "",   // Plain content
    html: ""       // Optional: HTML output
  }
}

Technology: unified + remark ecosystem

Plugins:

remark-parse - Markdown → AST
remark-frontmatter - YAML Front Matter Support
remark-gfm - GitHub Flavored Markdown
remark-stringify - AST → Markdown (optional)
remark-html - AST → HTML (optional)

2. Forgejo Client (`src/forgejo-client.js`)

export class ForgejoClient {
  constructor(config) { /* ... */ }
  
  async getFileContent(path) { /* ... */ }
  async listDirectory(path) { /* ... */ }
  async listPosts(postsDir) { /* ... */ }
  async getRepository() { /* ... */ }
}

API-Endpoints:

/repos/{owner}/{repo}/contents/{path} - file content
/repos/{owner}/{repo}/git/trees/{sha} - directory listing
File content is returned Base64-encoded and decoded by the client
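
As a rough sketch of the contents endpoint and the decoding step, the two helpers below build the request URL and turn the Base64 `content` field into a UTF-8 string. The helper names and the `baseUrl`, `owner`, and `repo` parameters are assumptions for illustration; only `atob` and `TextDecoder` are used, so the code should run unchanged in browsers and Node.js 18+.

```javascript
// Builds the contents endpoint URL below the Forgejo/Gitea API root (/api/v1).
// baseUrl, owner and repo are placeholder parameters for this sketch.
function buildContentsUrl(baseUrl, owner, repo, path) {
  const encodedPath = path.split('/').map(encodeURIComponent).join('/')
  const root = baseUrl.replace(/\/+$/, '')
  return `${root}/api/v1/repos/${owner}/${repo}/contents/${encodedPath}`
}

// Decodes the Base64 `content` field of a contents response into a UTF-8 string.
function decodeBase64Content(response) {
  if (response.encoding && response.encoding !== 'base64') {
    throw new Error(`Unexpected encoding: ${response.encoding}`)
  }
  // Strip line breaks some servers insert into long Base64 payloads.
  const binary = atob(response.content.replace(/\s/g, ''))
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0))
  return new TextDecoder('utf-8').decode(bytes)
}
```

Decoding through `TextDecoder` rather than using the `atob` result directly keeps umlauts and other multi-byte characters intact.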

3. YAML Extractor (`src/extractors/yaml-extractor.js`)

export function extractYAML(markdownContent) {
  // Extracts the YAML front matter
  // Parses it with the yaml library
  return yamlObject
}

Technology: yaml library (v2.x)

Features:

Complex YAML structures
Arrays, nested objects
Multi-line strings
Date parsing
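
The extraction step that precedes YAML parsing can be sketched without any dependency. The helper `splitFrontMatter` and its return shape are assumptions for this example; it only separates the raw front matter text from the body and leaves the actual parsing to the yaml library.

```javascript
// Matches a leading front matter block delimited by lines containing only ---.
const FRONT_MATTER = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)/

// Splits a Markdown document into raw YAML text and body.
function splitFrontMatter(markdownContent) {
  const match = FRONT_MATTER.exec(markdownContent)
  if (!match) {
    // No front matter: degrade gracefully and treat everything as body.
    return { rawYaml: null, body: markdownContent }
  }
  return { rawYaml: match[1], body: markdownContent.slice(match[0].length) }
}
```

The `rawYaml` string would then be handed to the YAML parser, while a `null` value signals that the document carries no metadata.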

4. AMB Metadata Extractor (`src/extractors/amb-extractor.js`)

export function extractAMBMetadata(yamlObject) {
  // Transforms YAML → schema.org
  // Validates AMB conformance
  return ambMetadata
}

Mapping:

{
  "@context": "https://schema.org/",
  "type": "LearningResource",
  "name": yaml.commonMetadata.name,
  "description": yaml.commonMetadata.description,
  "creator": mapCreators(yaml.commonMetadata.creator),
  "license": yaml.commonMetadata.license,
  "inLanguage": yaml.commonMetadata.inLanguage,
  "datePublished": yaml.commonMetadata.datePublished,
  "about": yaml.commonMetadata.about,
  "image": yaml.commonMetadata.image,
  "id": yaml.commonMetadata.id,
  "learningResourceType": yaml.commonMetadata.learningResourceType,
  "educationalLevel": yaml.commonMetadata.educationalLevel
}
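
The mapping refers to a `mapCreators` helper that is not defined in this document. The following is one possible sketch, assuming creators may be given as a plain string, a single object, or an array; the default type `Person` and the helper `dropUndefined` are assumptions as well.

```javascript
// Hypothetical helper: normalizes creator entries to schema.org-style objects.
// Accepts a single entry or an array; plain strings become Person objects.
function mapCreators(creators) {
  if (creators == null) return undefined
  const list = Array.isArray(creators) ? creators : [creators]
  return list.map((creator) =>
    typeof creator === 'string'
      ? { type: 'Person', name: creator }
      : { ...creator, type: creator.type ?? 'Person' }
  )
}

// Removes keys whose value is undefined, so missing metadata does not
// appear as empty fields in the JSON output.
function dropUndefined(object) {
  return Object.fromEntries(
    Object.entries(object).filter(([, value]) => value !== undefined)
  )
}
```

Wrapping the mapped object in `dropUndefined` would keep the output compact when optional front matter fields are absent.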

Transformation Layer (Phase 2)

5. WordPress Transformer (`src/transformers/wordpress.js`)

export function transformToWordPress(parsedData) {
  return {
    title: "",
    content: "",
    excerpt: "",
    featured_media: 0,
    tags: [],
    categories: [],
    meta: {},
    author: 0
  }
}

WordPress REST API v2 Format
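
One way the skeleton above might be filled in is sketched below. It assumes that `parsedData` has the shape returned by `parseMarkdownFile` and that `metadata.name` and `metadata.description` are suitable as title and excerpt; neither assumption is confirmed by the project.

```javascript
// Sketch: fills only the fields that can be derived directly from parsed data.
// Tag, category, author and media IDs require lookups against the target
// site, so they stay at their defaults here.
function transformToWordPress(parsedData) {
  const metadata = parsedData.metadata ?? {}
  return {
    title: metadata.name ?? '',
    content: parsedData.html ?? '',
    excerpt: metadata.description ?? '',
    featured_media: 0,
    tags: [],
    categories: [],
    meta: {},
    author: 0
  }
}
```

Because WordPress expects numeric IDs for tags and categories, resolving names to IDs would be a separate step before the request is sent.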

6. Nostr Transformer (`src/transformers/nostr.js`)

export function transformToNostr(parsedData) {
  return {
    kind: 30023,  // NIP-23 Long-form
    tags: [
      ["d", ""],          // unique identifier
      ["title", ""],
      ["summary", ""],
      ["published_at", ""],
      ["image", ""],
      ["t", ""],          // hashtags
      ["e", ""],          // event refs
      ["a", ""],          // article refs
      ["p", ""]           // pubkey refs
    ],
    content: ""  // Markdown content
  }
}
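
The skeleton lists every tag with an empty value. A sketch that emits only the tags for which data exists could look like this; the extra `identifier` parameter and the `metadata.keywords` field are assumptions for illustration and do not appear in the mapping above.

```javascript
// Sketch: builds a NIP-23 long-form event body from parsed data.
// `identifier` becomes the "d" tag and must be stable across updates.
function transformToNostr(parsedData, identifier) {
  const metadata = parsedData.metadata ?? {}
  const tags = [['d', identifier]]

  if (metadata.name) tags.push(['title', metadata.name])
  if (metadata.description) tags.push(['summary', metadata.description])
  if (metadata.image) tags.push(['image', metadata.image])

  const publishedMs = Date.parse(metadata.datePublished)
  if (!Number.isNaN(publishedMs)) {
    // NIP-23 expects a stringified Unix timestamp in seconds.
    tags.push(['published_at', String(Math.floor(publishedMs / 1000))])
  }
  // Hypothetical field: one "t" tag per keyword.
  for (const keyword of metadata.keywords ?? []) {
    tags.push(['t', keyword])
  }
  return { kind: 30023, tags, content: parsedData.content ?? '' }
}
```

Signing the event and adding `pubkey`, `created_at`, and `id` are left to whichever Nostr client library publishes it.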

🔄 Data Flow

1. Parsing Pipeline

sequenceDiagram
    participant Client
    participant Parser
    participant Unified
    participant YAML
    participant AMB
    
    Client->>Parser: parseMarkdownFile(path)
    Parser->>Unified: process(markdown)
    Unified->>YAML: extract front matter
    YAML-->>Parser: yamlObject
    Parser->>AMB: extractAMBMetadata(yaml)
    AMB-->>Parser: ambMetadata
    Unified-->>Parser: ast
    Parser-->>Client: { yaml, metadata, ast, content }

2. Forgejo API Integration

sequenceDiagram
    participant Client
    participant ForgejoClient
    participant API as Forgejo API
    participant Parser
    
    Client->>ForgejoClient: getFileContent(path)
    ForgejoClient->>API: GET /repos/.../contents/...
    API-->>ForgejoClient: { content: base64, ... }
    ForgejoClient->>ForgejoClient: decode base64
    ForgejoClient-->>Client: markdown string
    Client->>Parser: parseMarkdownFile(markdown)
    Parser-->>Client: parsed data

3. Transformation (Phase 2)

flowchart LR
    Parse["Parsed Data<br/>{yaml, metadata, ast}"]
    WPT["WordPress<br/>Transformer"]
    NostrT["Nostr<br/>Transformer"]
    WPAPI["WordPress<br/>REST API"]
    NostrRelay["Nostr<br/>Relay"]
    
    Parse --> WPT
    Parse --> NostrT
    
    WPT --> WPAPI
    NostrT --> NostrRelay
    
    style Parse fill:#e8f5e9
    style WPT fill:#fff3e0
    style NostrT fill:#f3e5f5

🛠️ Technology Decisions

Why unified/remark?

Alternative	Pros	Cons	Decision
marked	✅ Very popular ✅ Simple	❌ HTML-focused ❌ No AST	❌ Rejected
markdown-it	✅ Extensible ✅ Performance	❌ Complex API ❌ HTML-focused	❌ Rejected
unified/remark	✅ AST-based ✅ Isomorphic ✅ Plugin system ✅ Standard	⚠️ Learning curve	✅ CHOSEN
gray-matter + marked	✅ Simple	❌ Less structured	⚠️ Fallback

Why the `yaml` library?

Alternative	Pros	Cons	Decision
js-yaml	✅ Popular	❌ Larger bundle size	❌ Rejected
yaml	✅ Modern ✅ Spec-compliant ✅ Small	-	✅ CHOSEN
JSON.parse	✅ Native	❌ No YAML support	❌ Not suitable

Why native `fetch`?

✅ Available globally in Node.js 18+
✅ Identical API in the browser
✅ No dependencies
✅ Async/await support
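
A small wrapper shows how native `fetch` might be used identically in both environments. The name `fetchMarkdown`, the `timeoutMs` default, and the injectable `fetchImpl` option are assumptions for this sketch; the injection point exists only so the function can be exercised without network access.

```javascript
// Fetches a document as text. fetchImpl defaults to the global fetch, which
// exists in browsers and in Node.js 18+.
async function fetchMarkdown(url, { timeoutMs = 10000, fetchImpl = globalThis.fetch } = {}) {
  const controller = new AbortController()
  // Abort the request if it takes longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs)
  try {
    const response = await fetchImpl(url, { signal: controller.signal })
    if (!response.ok) {
      throw new Error(`Request failed: ${response.status} ${response.statusText}`)
    }
    return await response.text()
  } finally {
    clearTimeout(timer)
  }
}
```

Checking `response.ok` matters because `fetch` resolves successfully for HTTP error statuses such as 404.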

📊 Performance Considerations

Caching Strategy

// Optional: cache for frequently requested files
const cache = new Map()

async function parseWithCache(path, options) {
  const cacheKey = `${path}-${JSON.stringify(options)}`
  
  if (cache.has(cacheKey)) {
    return cache.get(cacheKey)
  }
  
  const result = await parseMarkdownFile(path, options)
  cache.set(cacheKey, result)
  
  return result
}

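Rate Limiting for APIs

// Assumption: RateLimiter is the class exported by the `limiter` npm package
import { RateLimiter } from 'limiter'

// Forgejo API: max. 10 requests/second
const rateLimiter = new RateLimiter({
  tokensPerInterval: 10,
  interval: 1000
})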
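
The idea behind such a limiter can also be sketched without a dependency as a token bucket. The class `TokenBucket` and its option names are hypothetical and not part of the project; the injectable `now` function exists only to make the behaviour testable without waiting.

```javascript
// Minimal token bucket: permits `capacity` requests per `intervalMs` on average.
class TokenBucket {
  constructor({ capacity = 10, intervalMs = 1000, now = Date.now } = {}) {
    this.capacity = capacity
    this.intervalMs = intervalMs
    this.now = now
    this.tokens = capacity
    this.lastRefill = now()
  }

  // Adds tokens in proportion to the time elapsed since the last refill.
  refill() {
    const current = this.now()
    const gained = ((current - this.lastRefill) / this.intervalMs) * this.capacity
    this.tokens = Math.min(this.capacity, this.tokens + gained)
    this.lastRefill = current
  }

  // Returns true and consumes one token if a request may be sent now.
  tryRemoveToken() {
    this.refill()
    if (this.tokens < 1) return false
    this.tokens -= 1
    return true
  }
}
```

A client would call `tryRemoveToken()` before each API request and delay or queue the request when it returns `false`.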

🔒 Security

Input Validation

Protection against YAML bombs (max. depth/size)
Path traversal protection for file access
Content-Type validation for API requests
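
Two of these checks can be sketched in a few lines. The function names and the 1 MiB limit are assumptions for illustration. The path check rejects absolute paths and any relative path whose `..` segments would leave the base directory; the size guard runs before the YAML parser sees the input.

```javascript
const MAX_INPUT_BYTES = 1024 * 1024 // hypothetical 1 MiB limit

// Returns false for absolute paths and for relative paths that climb
// above their starting directory. Handles / and \ separators.
function isSafeRelativePath(path) {
  if (/^([a-zA-Z]:)?[\\\/]/.test(path)) return false
  let depth = 0
  for (const segment of path.split(/[\\\/]+/)) {
    if (segment === '..') depth -= 1
    else if (segment !== '.' && segment !== '') depth += 1
    if (depth < 0) return false
  }
  return true
}

// Throws if the UTF-8 encoded input exceeds the size limit.
function assertWithinSizeLimit(text, maxBytes = MAX_INPUT_BYTES) {
  const size = new TextEncoder().encode(text).length
  if (size > maxBytes) {
    throw new Error(`Input is ${size} bytes, limit is ${maxBytes}`)
  }
}
```

A size limit alone does not bound alias expansion inside a small document; the yaml library's `maxAliasCount` option addresses that case separately.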

Sanitization

XSS protection for HTML output (optionally with DOMPurify)
SQL injection protection for DB integration (Phase 2)

🧪 Testing Strategy

Unit Tests

test/
├── parser.test.js
├── yaml-extractor.test.js
├── amb-extractor.test.js
├── forgejo-client.test.js
└── transformers/
    ├── wordpress.test.js
    └── nostr.test.js

Integration Tests

End-to-end against a real Forgejo repository
Mocked API responses

Test-Fixtures

test/fixtures/
├── valid-amb.md
├── missing-metadata.md
├── complex-yaml.md
└── github-flavored.md

🚀 Deployment Scenarios

1. Node.js CLI

npm install -g mdparser
mdparser parse ./content/post.md

2. Node.js Library

import { parseMarkdownFile } from 'mdparser'
const result = await parseMarkdownFile('./post.md')

3. Browser (ESM)

<script type="module">
  import { parseMarkdownFile } from './mdparser.js'
  // ...
</script>

4. Serverless Function

// Vercel/Netlify Function
export default async function handler(req, res) {
  const result = await parseMarkdownFile(req.body.url)
  res.json(result)
}

📈 Roadmap & Extensions

Phase 1: Core Parser ✅ (current)

Project setup
Parser implementation
Forgejo client
AMB extractor
Tests & documentation

Phase 2: Transformers 🚧

WordPress integration
Nostr integration
Batch processing

Phase 3: Advanced Features 🔮

Browser build
CLI tool
Webhook support
Real-time sync
GraphQL API

🤝 Contribution Guidelines

See CONTRIBUTING.md for details on:

Code style (ESLint + Prettier)
Commit conventions
Pull request process
Testing requirements
