# MOP - Managed Observability Platform
A reference implementation for a modern observability stack using OpenTelemetry eBPF Instrumentation (OBI), Grafana, and cloud-native components.
## Project Overview
MOP provides a production-ready observability platform featuring:
- OpenTelemetry eBPF Instrumentation (OBI): Zero-code, eBPF-based instrumentation with <1% CPU overhead (see the DaemonSet sketch after this list)
- Grafana Stack: Unified visualization and alerting
- Grafana Alloy: Advanced telemetry pipeline with sampling and routing
- Tempo: Distributed tracing backend with cost-efficient object storage
- Mimir: Long-term metrics storage (Prometheus-compatible, no Prometheus)
- Loki: Log aggregation with trace correlation
- Tanka: Infrastructure as code with Jsonnet + Helm
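Because OBI attaches eBPF probes from outside the application, it is deployed as a per-node agent rather than a library. A minimal sketch of what `lib/obi.libsonnet` could emit, assuming a placeholder image, namespace, and OTLP endpoint (none of these are the repository's actual values):

```jsonnet
// Hypothetical sketch of lib/obi.libsonnet; image, namespace, and the OTLP
// endpoint are placeholders. OBI needs elevated privileges to load eBPF
// programs and observe other processes on the node.
{
  daemonSet: {
    apiVersion: 'apps/v1',
    kind: 'DaemonSet',
    metadata: { name: 'obi', namespace: 'observability' },
    spec: {
      selector: { matchLabels: { app: 'obi' } },
      template: {
        metadata: { labels: { app: 'obi' } },
        spec: {
          hostPID: true,  // see processes in other namespaces on the node
          containers: [{
            name: 'obi',
            image: 'otel/ebpf-instrument:main',  // placeholder image reference
            securityContext: { privileged: true },  // or a narrower capability set
            env: [
              // Standard OTLP variable pointing at the Alloy gateway service.
              { name: 'OTEL_EXPORTER_OTLP_ENDPOINT', value: 'http://alloy.observability:4317' },
            ],
          }],
        },
      },
    },
  },
}
```

The elevated security context is what lets OBI load its eBPF programs; telemetry leaves the node over standard OTLP to Alloy.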
## Architecture

```
        ┌─────────────────┐
        │   Application   │
        │   (Any Lang)    │
        └────────┬────────┘
                 │
┌────────────────▼─────────────────┐
│   OBI (eBPF Instrumentation)     │
│   - HTTP/gRPC/SQL/Redis/Kafka    │
│   - <1% CPU overhead             │
└────────────────┬─────────────────┘
                 │ OTLP
┌────────────────▼─────────────────┐
│          Grafana Alloy           │
│   - Sampling & Routing           │
│   - Cost Optimization            │
└───────┬──────────────────┬───────┘
        │                  │
  ┌─────▼─────┐      ┌─────▼─────┐
  │   Tempo   │      │   Mimir   │
  │ (Traces)  │      │ (Metrics) │
  └─────┬─────┘      └─────┬─────┘
        │                  │
┌───────▼──────────────────▼───────┐
│           Loki (Logs)            │
└────────────────┬─────────────────┘
                 │
┌────────────────▼─────────────────┐
│     Grafana (Visualization)      │
│   - Stateless, Auth Disabled     │
└──────────────────────────────────┘
```
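The hand-off from OBI to the backends happens inside Alloy. As a hedged sketch of that routing stage, assuming in-cluster service names like `tempo-distributor`, `mimir-nginx`, and `loki-gateway` (the real `lib/alloy.libsonnet` may differ), the pipeline fans OTLP data out by signal type:

```jsonnet
// Hypothetical excerpt of lib/alloy.libsonnet: a ConfigMap carrying an Alloy
// pipeline that routes OTLP data to Tempo, Mimir, and Loki.
// Service names and endpoints are assumptions, not this repo's actual values.
{
  configMap: {
    apiVersion: 'v1',
    kind: 'ConfigMap',
    metadata: { name: 'alloy-config', namespace: 'observability' },
    data: {
      'config.alloy': |||
        // Receive OTLP from OBI and route by signal type.
        otelcol.receiver.otlp "default" {
          grpc { endpoint = "0.0.0.0:4317" }
          output {
            traces  = [otelcol.exporter.otlp.tempo.input]
            metrics = [otelcol.exporter.prometheus.mimir.input]
            logs    = [otelcol.exporter.loki.logs.input]
          }
        }

        // Traces go straight to Tempo over OTLP gRPC.
        otelcol.exporter.otlp "tempo" {
          client { endpoint = "tempo-distributor:4317" }
        }

        // Metrics are converted and remote-written to Mimir.
        otelcol.exporter.prometheus "mimir" {
          forward_to = [prometheus.remote_write.mimir.receiver]
        }
        prometheus.remote_write "mimir" {
          endpoint { url = "http://mimir-nginx/api/v1/push" }
        }

        // Logs are converted and pushed to Loki.
        otelcol.exporter.loki "logs" {
          forward_to = [loki.write.default.receiver]
        }
        loki.write "default" {
          endpoint { url = "http://loki-gateway/loki/api/v1/push" }
        }
      |||,
    },
  },
}
```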
## Quick Start

```bash
# Install dependencies
just install

# Initialize Tanka
just init

# Deploy to dev environment
just deploy dev

# View logs
just logs alloy

# Access Grafana
just grafana-port-forward
open http://localhost:3000
```

## Repository Structure
```
mop/
├── docs/                     # Documentation
│   ├── architecture/         # Architecture Decision Records (ADRs)
│   ├── workstreams/          # Parallel workstream issues
│   ├── agents/               # Agent coordination configs
│   └── research/             # Research findings
├── environments/             # Tanka environments
│   ├── dev/                  # Development environment
│   ├── staging/              # Staging environment
│   └── production/           # Production environment
├── lib/                      # Jsonnet libraries
│   ├── config.libsonnet      # Centralized configuration
│   ├── alloy.libsonnet       # Alloy configuration
│   ├── obi.libsonnet         # OBI DaemonSet configuration
│   ├── tempo.libsonnet       # Tempo distributed tracing
│   ├── mimir.libsonnet       # Mimir metrics storage
│   ├── loki.libsonnet        # Loki log aggregation
│   └── grafana.libsonnet     # Grafana dashboards
├── charts/                   # Vendored Helm charts
├── vendor/                   # Jsonnet dependencies
├── scripts/                  # Automation scripts
│   └── nu/                   # Nushell scripts
├── tests/                    # Integration tests
├── Tiltfile                  # Local development with Tilt
├── justfile                  # Common commands
└── tanka.yaml                # Tanka configuration
```
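Each environment composes the same libraries with environment-specific overrides. A hypothetical `environments/dev/main.jsonnet`, where the `new()` constructors, the override fields, and the retention values are all assumptions rather than the actual library API:

```jsonnet
// Hypothetical environments/dev/main.jsonnet: compose the component libraries
// with dev-specific overrides from the shared config.
local config = import '../../lib/config.libsonnet';
local alloy = import '../../lib/alloy.libsonnet';
local grafana = import '../../lib/grafana.libsonnet';
local loki = import '../../lib/loki.libsonnet';
local mimir = import '../../lib/mimir.libsonnet';
local obi = import '../../lib/obi.libsonnet';
local tempo = import '../../lib/tempo.libsonnet';

// Dev keeps retention short and replicas low to stay cheap.
local devConfig = config {
  namespace: 'observability-dev',
  retention: { traces: '48h', metrics: '7d', logs: '72h' },
  replicas: 1,
};

{
  obi: obi.new(devConfig),
  alloy: alloy.new(devConfig),
  tempo: tempo.new(devConfig),
  mimir: mimir.new(devConfig),
  loki: loki.new(devConfig),
  grafana: grafana.new(devConfig),
}
```

Staging and production would import the same libraries and change only the overrides.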
## Technology Stack
| Component | Purpose | Key Advantage |
|---|---|---|
| OBI | eBPF instrumentation | Zero-code, universal coverage |
| Grafana Alloy | Telemetry pipeline | Advanced sampling & routing |
| Tempo | Distributed tracing | Cost-efficient, object storage |
| Mimir | Metrics storage | Prometheus-compatible API, better for scale |
| Loki | Log aggregation | Trace-log correlation |
| Grafana | Visualization | Unified observability UX |
| Tanka | Infrastructure as Code | Jsonnet + Helm flexibility |
### Why Mimir instead of Prometheus?
- Horizontally scalable (a single Prometheus server cannot scale out)
- Object storage backend (cheaper than local disks)
- Multi-tenancy built-in
- Configurable long-term retention, set per tenant
- Still exposes Prometheus-compatible API for querying
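Concretely, "Prometheus-compatible" means Grafana talks to Mimir through the stock Prometheus datasource type, so existing PromQL dashboards keep working. A hedged provisioning sketch (the gateway URL is an assumed in-cluster service):

```jsonnet
// Hypothetical datasource provisioning excerpt (lib/grafana.libsonnet):
// Grafana runs PromQL against Mimir via the ordinary Prometheus datasource
// type; the URL below is an assumed in-cluster gateway address.
{
  datasources: [
    {
      name: 'Mimir',
      type: 'prometheus',  // unchanged PromQL queries and dashboards
      access: 'proxy',
      url: 'http://mimir-nginx.observability/prometheus',
      isDefault: true,
    },
  ],
}
```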
## OBI Experiments
See docs/architecture/obi-experiments.md for detailed experiment proposals:
- Adaptive Tail-Based Sampling: Dynamic sampling based on SLO breaches (90% cost reduction); see the sketch after this list
- Network Service Discovery: Auto-generate dependency graphs from traffic
- Database Query Profiling: Identify slow SQL without instrumentation
- Multi-Region Cost Optimization: Regional traces, global metrics (79% cost reduction)
- Canary Automated Rollback: OBI metrics drive Argo Rollouts quality gates
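For a feel of the first experiment, a static (non-adaptive) starting point would lean on Alloy's `otelcol.processor.tail_sampling` component: keep every errored or slow trace, and only a small share of healthy baseline traffic. The thresholds and percentages below are illustrative assumptions, and the SLO-driven adaptation is not shown:

```jsonnet
// Hypothetical production overlay for lib/alloy.libsonnet: keep every errored
// or slow trace, sample only a fraction of the healthy baseline.
{
  tailSamplingConfig: |||
    otelcol.processor.tail_sampling "default" {
      policy {
        name = "keep-errors"
        type = "status_code"
        status_code { status_codes = ["ERROR"] }
      }
      policy {
        name = "keep-slow"
        type = "latency"
        latency { threshold_ms = 500 }
      }
      policy {
        name = "baseline"
        type = "probabilistic"
        probabilistic { sampling_percentage = 10 }
      }
      output {
        traces = [otelcol.exporter.otlp.tempo.input]
      }
    }
  |||,
}
```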
## Parallel Workstreams
This project is organized into parallel workstreams that can be worked on concurrently:
- Workstream 1: Infrastructure Foundation
- Workstream 2: OBI Integration
- Workstream 3: Grafana Stack
- Workstream 4: Tanka Configuration
- Workstream 5: Development Tools
- Workstream 6: OBI Experiments
## Agent Coordination
See docs/agents/coordination.md for agent roles and collaboration patterns.
## Development
### Prerequisites
- Kubernetes cluster (kind, minikube, or cloud)
- Tanka (`brew install tanka`)
- jsonnet-bundler (`brew install jsonnet-bundler`)
- Tilt (`brew install tilt`)
- just (`brew install just`)
- nushell (`brew install nushell`)
### Local Development Workflow
```bash
# 1. Start local Kubernetes cluster
just cluster-up

# 2. Start Tilt (hot reload)
tilt up

# 3. Make changes to Jsonnet files
#    Tilt automatically reloads

# 4. Run tests
just test

# 5. Apply to dev environment
just deploy dev
```

## Documentation
- Architecture Overview
- Alloy Operator Decision
- OBI Integration Patterns
- Tanka Best Practices
- Cost Optimization Guide
## Learning Resources
## Monitoring & Alerting
Default dashboards are provisioned automatically:
- OBI Overview: eBPF instrumentation health
- Alloy Pipeline: Sampling rates, throughput, errors
- Tempo: Trace ingestion, query latency
- Mimir: Metrics cardinality, ingestion rate
- Loki: Log volume, query performance
- SLO Dashboard: Service-level objectives tracking
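The SLO dashboard presumes recording and alerting rules loaded into the Mimir ruler. A hedged sketch of such a rule group, where the span-metrics series name and thresholds are assumptions rather than what this repository ships:

```jsonnet
// Hypothetical rule group for the Mimir ruler: derive a per-service error
// ratio and alert when it burns the error budget. The series name
// traces_spanmetrics_calls_total is an assumed span-metrics output.
{
  groups: [
    {
      name: 'slo-availability',
      rules: [
        {
          record: 'service:error_ratio:rate5m',
          expr: 'sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) / sum by (service) (rate(traces_spanmetrics_calls_total[5m]))',
        },
        {
          alert: 'ErrorBudgetBurn',
          expr: 'service:error_ratio:rate5m > 0.02',
          'for': '10m',  // quoted because "for" is a Jsonnet keyword
          labels: { severity: 'page' },
        },
      ],
    },
  ],
}
```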
## Security
- Grafana: Stateless deployment, auth disabled (for internal use)
- OBI: Read-only eBPF probes, no data modification
- Secrets: Managed via Kubernetes Secrets (not in git)
- Network policies: Least-privilege access
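To illustrate the least-privilege intent, a NetworkPolicy can restrict Tempo's OTLP ingest port to traffic from Alloy pods; the namespace, labels, and port below are assumptions:

```jsonnet
// Hypothetical NetworkPolicy: only Alloy pods may reach Tempo's OTLP gRPC port.
{
  apiVersion: 'networking.k8s.io/v1',
  kind: 'NetworkPolicy',
  metadata: { name: 'tempo-ingest-from-alloy', namespace: 'observability' },
  spec: {
    podSelector: { matchLabels: { 'app.kubernetes.io/name': 'tempo' } },
    policyTypes: ['Ingress'],
    ingress: [
      {
        from: [{ podSelector: { matchLabels: { 'app.kubernetes.io/name': 'alloy' } } }],
        ports: [{ protocol: 'TCP', port: 4317 }],  // OTLP gRPC
      },
    ],
  },
}
```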
## Contributing
- Create a workstream issue in `docs/workstreams/`
- Use agent coordination patterns from `docs/agents/`
- Follow Tanka best practices
- Ensure tests pass
- Update documentation
## License
MIT License - see LICENSE file
## Support
- Issues: File in GitHub Issues with workstream label
- Docs: See the `docs/` directory
- Examples: See `docs/research/` for detailed guides
Status: Initial Setup Phase
Next Steps: See Workstream 1: Infrastructure Foundation