MOP - Managed Observability Platform

A reference implementation of a modern observability stack built on OpenTelemetry eBPF Instrumentation (OBI), Grafana, and cloud-native components.

🎯 Project Overview

MOP provides a production-ready observability platform featuring:

  • OpenTelemetry eBPF Instrumentation (OBI): Zero-code, eBPF-based instrumentation with <1% CPU overhead
  • Grafana Stack: Unified visualization and alerting
  • Grafana Alloy: Advanced telemetry pipeline with sampling and routing
  • Tempo: Distributed tracing backend with cost-efficient object storage
  • Mimir: Long-term metrics storage (Prometheus-compatible API, no Prometheus server required)
  • Loki: Log aggregation with trace correlation
  • Tanka: Infrastructure as code with Jsonnet + Helm

🏗️ Architecture

┌─────────────────┐
│   Application   │
│   (Any Lang)    │
└────────┬────────┘
         │
    ╔════▼═══════════════════════════╗
    ║  OBI (eBPF Instrumentation)    ║
    ║  - HTTP/gRPC/SQL/Redis/Kafka   ║
    ║  - <1% CPU overhead            ║
    ╚════╤═══════════════════════════╝
         │ OTLP
    ╔════▼═══════════════════════════╗
    ║  Grafana Alloy                 ║
    ║  - Sampling & Routing          ║
    ║  - Cost Optimization           ║
    ╚════╤═══════════════╤═══════════╝
         │               │
    ╔════▼═════╗    ╔════▼═════╗
    ║  Tempo   ║    ║  Mimir   ║
    ║ (Traces) ║    ║ (Metrics)║
    ╚══════════╝    ╚══════════╝
         │               │
    ╔════▼═══════════════▼═══════════╗
    ║        Loki (Logs)             ║
    ╚════╤═══════════════════════════╝
         │
    ╔════▼═══════════════════════════╗
    ║  Grafana (Visualization)       ║
    ║  - Stateless, Auth Disabled    ║
    ╚════════════════════════════════╝
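
The OTLP hop between OBI and Alloy in the diagram can be illustrated with the standard OpenTelemetry exporter variable. This is a minimal sketch, not this repository's configuration: the Service name `alloy`, the namespace `mop`, and the default cluster domain `cluster.local` are assumptions, and 4317 is only the conventional OTLP/gRPC port.

```shell
# Build the in-cluster OTLP endpoint for a Kubernetes Service.
# Arguments: service name, namespace, optional port (defaults to 4317,
# the conventional OTLP/gRPC port). Assumes the default cluster domain.
otlp_endpoint() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "${3:-4317}"
}

# Hypothetical names: an "alloy" Service in a "mop" namespace.
OTEL_EXPORTER_OTLP_ENDPOINT=$(otlp_endpoint alloy mop)
export OTEL_EXPORTER_OTLP_ENDPOINT
```

Passing a third argument, for example `otlp_endpoint alloy mop 4318`, targets the conventional OTLP/HTTP port instead.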

🚀 Quick Start

# Install dependencies
just install
 
# Initialize Tanka
just init
 
# Deploy to dev environment
just deploy dev
 
# View logs
just logs alloy
 
# Access Grafana
just grafana-port-forward
open http://localhost:3000

📁 Repository Structure

mop/
├── docs/                      # Documentation
│   ├── architecture/          # Architecture Decision Records (ADRs)
│   ├── workstreams/           # Parallel workstream issues
│   ├── agents/                # Agent coordination configs
│   └── research/              # Research findings
├── environments/              # Tanka environments
│   ├── dev/                   # Development environment
│   ├── staging/               # Staging environment
│   └── production/            # Production environment
├── lib/                       # Jsonnet libraries
│   ├── config.libsonnet       # Centralized configuration
│   ├── alloy.libsonnet        # Alloy configuration
│   ├── obi.libsonnet          # OBI DaemonSet configuration
│   ├── tempo.libsonnet        # Tempo distributed tracing
│   ├── mimir.libsonnet        # Mimir metrics storage
│   ├── loki.libsonnet         # Loki log aggregation
│   └── grafana.libsonnet      # Grafana dashboards
├── charts/                    # Vendored Helm charts
├── vendor/                    # Jsonnet dependencies
├── scripts/                   # Automation scripts
│   └── nu/                    # Nushell scripts
├── tests/                     # Integration tests
├── Tiltfile                   # Local development with Tilt
├── justfile                   # Common commands
└── tanka.yaml                 # Tanka configuration

🛠️ Technology Stack

| Component     | Purpose                | Why                                         |
|---------------|------------------------|---------------------------------------------|
| OBI           | eBPF instrumentation   | Zero-code, universal coverage               |
| Grafana Alloy | Telemetry pipeline     | Advanced sampling & routing                 |
| Tempo         | Distributed tracing    | Cost-efficient, object storage              |
| Mimir         | Metrics storage        | Prometheus-compatible API, better for scale |
| Loki          | Log aggregation        | Trace-log correlation                       |
| Grafana       | Visualization          | Unified observability UX                    |
| Tanka         | Infrastructure as Code | Jsonnet + Helm flexibility                  |

Why Mimir instead of Prometheus?

  • Horizontally scalable (Prometheus is single-instance)
  • Object storage backend (cheaper than local disks)
  • Multi-tenancy built-in
  • Better retention policies
  • Still exposes Prometheus-compatible API for querying
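
To illustrate the last point, the sketch below queries Mimir the same way one would query a Prometheus server. It assumes Mimir's default `/prometheus` HTTP prefix and the `X-Scope-OrgID` tenant header; the base URL and tenant ID in the example are hypothetical and not taken from this repository.

```shell
# Build the instant-query endpoint for a Mimir base URL.
# Assumes Mimir's default Prometheus HTTP prefix, /prometheus.
# A trailing slash on the base URL is stripped.
mimir_query_endpoint() {
  printf '%s/prometheus/api/v1/query\n' "${1%/}"
}

# Run an instant PromQL query.
# Arguments: base URL, tenant ID, PromQL expression.
# X-Scope-OrgID selects the tenant when multi-tenancy is enabled.
mimir_query() {
  curl -sG "$(mimir_query_endpoint "$1")" \
    -H "X-Scope-OrgID: $2" \
    --data-urlencode "query=$3"
}

# Example (hypothetical local address and tenant):
# mimir_query http://localhost:8080 dev 'up'
```

Because the path and response format follow the Prometheus HTTP API, existing Prometheus clients and Grafana data sources should only need the base URL changed.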

🧪 OBI Experiments

See docs/architecture/obi-experiments.md for detailed experiment proposals:

  1. Adaptive Tail-Based Sampling: Dynamic sampling based on SLO breaches (90% cost reduction)
  2. Network Service Discovery: Auto-generate dependency graphs from traffic
  3. Database Query Profiling: Identify slow SQL without instrumentation
  4. Multi-Region Cost Optimization: Regional traces, global metrics (79% cost reduction)
  5. Canary Automated Rollback: OBI metrics drive Argo Rollouts quality gates
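
As a rough illustration of the idea behind experiment 1, the sketch below makes a tail-based decision for one finished trace: failed or SLO-breaching traces are always kept, and the rest are kept at a baseline percentage. This is not the design in docs/architecture/obi-experiments.md; the function name, arguments, and CRC-based bucketing are all hypothetical.

```shell
# Decide whether to keep a finished trace (tail-based: the decision uses
# the trace's final status and duration, not its first span).
# Arguments: trace ID, status ("ok" or "error"), duration in ms,
#            latency SLO in ms, baseline keep percentage (0-100).
# Prints "keep" or "drop".
keep_trace() {
  trace_id=$1 status=$2 duration_ms=$3 slo_ms=$4 baseline_pct=$5

  # Always keep traces that failed or breached the latency SLO.
  if [ "$status" = "error" ] || [ "$duration_ms" -gt "$slo_ms" ]; then
    echo keep
    return 0
  fi

  # Otherwise keep a deterministic share: bucket the trace by a CRC of
  # its ID so the same trace always gets the same decision.
  crc=$(printf '%s' "$trace_id" | cksum | cut -d ' ' -f 1)
  bucket=$(( crc % 100 ))
  if [ "$bucket" -lt "$baseline_pct" ]; then
    echo keep
  else
    echo drop
  fi
}
```

For example, `keep_trace 4bf92f3577b34da6 ok 900 500 10` prints `keep` because 900 ms exceeds the 500 ms SLO, regardless of the baseline percentage.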

📋 Parallel Workstreams

This project is organized into workstreams that can proceed in parallel. Each workstream is tracked as an issue in docs/workstreams/.

🤖 Agent Coordination

See docs/agents/coordination.md for agent roles and collaboration patterns.

🔧 Development

Prerequisites

  • Kubernetes cluster (kind, minikube, or cloud)
  • Tanka (brew install tanka)
  • jsonnet-bundler (brew install jsonnet-bundler)
  • Tilt (brew install tilt)
  • just (brew install just)
  • nushell (brew install nushell)
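
Before running the workflow below, it can help to confirm that the prerequisites are actually on the PATH. This is a small sketch, not a script shipped in this repository; the binary names in the comment (`tk` for Tanka, `jb` for jsonnet-bundler, `nu` for Nushell) are the usual ones for these tools, and `kubectl` is assumed for cluster access.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Prints nothing when all commands are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Binaries assumed for the prerequisites listed above:
#   kubectl, tk (Tanka), jb (jsonnet-bundler), tilt, just, nu (Nushell)
# Example: missing_tools kubectl tk jb tilt just nu
```

An empty result means every listed tool was found; any name printed still needs to be installed.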

Local Development Workflow

# 1. Start local Kubernetes cluster
just cluster-up
 
# 2. Start Tilt (hot reload)
tilt up
 
# 3. Make changes to Jsonnet files
# Tilt automatically reloads
 
# 4. Run tests
just test
 
# 5. Apply to dev environment
just deploy dev

📖 Documentation

🎓 Learning Resources

📊 Monitoring & Alerting

Default dashboards are provisioned automatically:

  • OBI Overview: eBPF instrumentation health
  • Alloy Pipeline: Sampling rates, throughput, errors
  • Tempo: Trace ingestion, query latency
  • Mimir: Metrics cardinality, ingestion rate
  • Loki: Log volume, query performance
  • SLO Dashboard: Service-level objectives tracking

🔐 Security

  • Grafana: Stateless deployment, auth disabled (for internal use)
  • OBI: Read-only eBPF probes, no data modification
  • Secrets: Managed via Kubernetes Secrets (not in git)
  • Network policies: Least-privilege access

🤝 Contributing

  1. Create a workstream issue in docs/workstreams/
  2. Use agent coordination patterns from docs/agents/
  3. Follow Tanka best practices
  4. Ensure tests pass
  5. Update documentation

📝 License

MIT License - see LICENSE file

🙋 Support

  • Issues: File in GitHub Issues with workstream label
  • Docs: See docs/ directory
  • Examples: See docs/research/ for detailed guides

Status: 🏗️ Initial Setup Phase

Next Steps: See Workstream 1: Infrastructure Foundation