Technical Design Documentation

Overview

This directory contains comprehensive technical design documentation for the 24/7 AI Teams platform - a monorepo-based system for building, deploying, and managing automated AI-powered workflows.

What is 24/7 AI Teams?

The platform enables users to create sophisticated automated workflows that combine AI capabilities, external service integrations, and human-in-the-loop decision points. Think of it as a visual programming environment where AI agents can collaborate with tools, memory systems, and human oversight to accomplish complex tasks.

System Architecture

The platform follows a microservices architecture with four core backend services communicating via HTTP/REST:

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   API Gateway    │────▶│  Workflow Agent  │────▶│ Workflow Engine  │────▶│Workflow Scheduler│
│    (FastAPI)     │     │  (LangGraph/AI)  │     │   (Execution)    │     │    (Triggers)    │
│    Port: 8000    │     │    Port: 8001    │     │    Port: 8002    │     │    Port: 8003    │
└──────────────────┘     └──────────────────┘     └──────────────────┘     └──────────────────┘
         │                        │                        │                        │
         └────────────────────────┴─────── Supabase ───────┴────────────────────────┘
                       (Auth, State, Vector Store, Row Level Security)

Core Services

  1. API Gateway (Port 8000)

    • Three-layer API architecture: Public, App (OAuth), MCP (API Key)
    • Client-facing HTTP/REST endpoints with proper authentication
    • Real-time SSE (Server-Sent Events) for workflow execution updates
    • Row-Level Security (RLS) integration with Supabase
  2. Workflow Agent (Port 8001)

    • LangGraph-based AI workflow generation
    • Conversational interface for workflow creation
    • Gap analysis and capability negotiation
    • Template-based workflow modification
    • Automatic debugging and refinement
  3. Workflow Engine (Port 8002)

    • Node-based workflow execution engine
    • 8 core node types with flexible subtypes
    • Human-in-the-Loop (HIL) support with pause/resume
    • Real-time execution tracking and logging
    • Comprehensive error handling and retry mechanisms
  4. Workflow Scheduler (Port 8003)

    • Trigger management (Cron, Manual, Webhook, GitHub, Slack, Email)
    • Deployment lifecycle management
    • Distributed locking for concurrent execution prevention
    • Real-time trigger monitoring
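The scheduler's distributed locking can be pictured as a short-lived Redis key. The sketch below is illustrative only: the key format, TTL, and acquire_trigger_lock helper are assumptions rather than the service's actual API. It shows how a single SET ... NX EX call lets exactly one scheduler instance fire a given trigger.

import uuid

import redis

r = redis.Redis(host="localhost", port=6379)

def acquire_trigger_lock(workflow_id: str, ttl_seconds: int = 60) -> str | None:
    """Hypothetical helper: take a short-lived lock so only one instance fires a trigger."""
    token = uuid.uuid4().hex
    # SET key value NX EX ttl succeeds only if the key does not already exist
    if r.set(f"trigger-lock:{workflow_id}", token, nx=True, ex=ttl_seconds):
        return token
    return None  # another instance holds the lock; skip this run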

Frontend Applications

  • Agent Team Web (Next.js): Main web interface for workflow creation and management
  • Internal Tools: Docusaurus-based documentation site

Key Architectural Patterns

Node-Based Workflow System

The workflow engine uses a sophisticated node system with 8 core node types:

  1. TRIGGER: Workflow initiation (Manual, Cron, Webhook, GitHub, Slack, Email)
  2. AI_AGENT: Provider-based AI nodes (Gemini, OpenAI, Claude) with custom prompts
  3. ACTION: System operations (HTTP requests, code execution, data transformation)
  4. EXTERNAL_ACTION: External service integrations (Slack, GitHub, Notion, etc.)
  5. FLOW: Control flow (If, Loop, Filter, Merge, Wait)
  6. HUMAN_IN_THE_LOOP: Human interaction points with AI-powered response classification
  7. TOOL: MCP (Model Context Protocol) tool integrations
  8. MEMORY: Conversation and knowledge storage

Node Structure

Each node contains:

  • Configurations: Node-specific parameters defining behavior
  • Input/Output Params: Runtime data flow parameters
  • Attached Nodes: (AI_AGENT only) Tool and Memory nodes executed in the same context
  • Position: Canvas coordinates for UI visualization
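As a rough illustration of the structure above, a node serialized as a Python dict might look like the following. The field names are assumptions derived from this list, not the engine's exact schema.

node = {
    "id": "node_1",
    "type": "AI_AGENT",
    "subtype": "GEMINI_NODE",
    "configurations": {"system_prompt": "Summarize the incoming ticket."},
    "input_params": {"ticket_text": ""},             # runtime data flows in here
    "output_params": {"summary": ""},                # and out here
    "attached_nodes": ["memory_1", "tool_search"],   # AI_AGENT only: TOOL / MEMORY nodes
    "position": {"x": 120, "y": 240},                # canvas coordinates for the UI
}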

AI Integration Revolution

The system moved from hardcoded AI roles to a provider-based architecture:

Old Approach ❌:

AI_ROUTER_AGENT
AI_TASK_ANALYZER
AI_DATA_INTEGRATOR

New Approach ✅:

GEMINI_NODE      # Google Gemini with custom system prompt
OPENAI_NODE      # OpenAI GPT with custom system prompt
CLAUDE_NODE      # Anthropic Claude with custom system prompt

Functionality is now defined entirely through system prompts, so new AI behaviors can be added without code changes.
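For example, the old AI_ROUTER_AGENT and AI_TASK_ANALYZER roles become two instances of the same provider node that differ only in their prompt (the prompt text below is invented for illustration):

router_node = {"type": "AI_AGENT", "subtype": "CLAUDE_NODE",
               "configurations": {"system_prompt": "Route each request to the correct team."}}

analyzer_node = {"type": "AI_AGENT", "subtype": "CLAUDE_NODE",
                 "configurations": {"system_prompt": "Break the task into ordered subtasks."}}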

Authentication & Security

Three-Layer API Architecture:

  1. Public API (/api/v1/public/*): No auth, rate-limited, health checks
  2. App API (/api/v1/app/*): Supabase OAuth + JWT + Row Level Security
  3. MCP API (/api/v1/mcp/*): API Key authentication for LLM clients
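A minimal FastAPI sketch of this layering, using hypothetical placeholder dependencies for the auth checks (the real gateway's middleware is more involved):

from fastapi import APIRouter, Depends, FastAPI, Header, HTTPException

app = FastAPI()

async def verify_supabase_jwt(authorization: str = Header(...)):
    # placeholder: validate the Supabase JWT and attach the user so RLS applies
    if not authorization.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing or malformed JWT")

async def verify_api_key(x_api_key: str = Header(...)):
    # placeholder: look up the API key and its scopes
    if not x_api_key:
        raise HTTPException(status_code=401, detail="Missing API key")

public = APIRouter(prefix="/api/v1/public")  # no auth, rate-limited
app_api = APIRouter(prefix="/api/v1/app", dependencies=[Depends(verify_supabase_jwt)])
mcp_api = APIRouter(prefix="/api/v1/mcp", dependencies=[Depends(verify_api_key)])

@public.get("/health")
async def health():
    return {"status": "ok"}

for router in (public, app_api, mcp_api):
    app.include_router(router)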

Security Features:

  • Row-Level Security (RLS) for multi-tenant data isolation
  • JWT token validation with Supabase
  • API key scopes for fine-grained permissions
  • Redis-based rate limiting
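The Redis-based rate limiting can be thought of as a fixed-window counter per caller. This is only a sketch of the idea, with made-up key names and limits, not the gateway's actual limiter:

import redis

r = redis.Redis()

def allow_request(caller_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window limiter: at most `limit` calls per caller per window."""
    key = f"rate:{caller_id}"
    count = r.incr(key)                # atomic increment of the window counter
    if count == 1:
        r.expire(key, window_seconds)  # first hit in the window starts the clock
    return count <= limit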

Data Management

  • Primary Database: Supabase PostgreSQL with RLS
  • Vector Store: pgvector for RAG and semantic search
  • Cache Layer: Redis for sessions, rate limiting, temporary state
  • File Storage: Supabase Storage for artifacts
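For the vector store, a semantic-search query against pgvector reduces to an ORDER BY on a distance operator. The table and column names below are hypothetical and the snippet only sketches the query shape, assuming psycopg2 as the client:

import os
import psycopg2

conn = psycopg2.connect(os.environ["SUPABASE_DATABASE_URL"])

def nearest_documents(query_embedding: list[float], k: int = 5):
    """Return the k rows whose embedding is closest (cosine distance) to the query."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, content FROM documents "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(query_embedding), k),
        )
        return cur.fetchall()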

Technical Documents by Category

Core Service Architecture

API Gateway

  • API Gateway Architecture - Three-layer API design (Public/App/MCP), authentication middleware, rate limiting, SSE streaming, RLS integration

Workflow Agent

Workflow Engine

Workflow Scheduler

Data & Specifications

Workflow Specifications

Database Design

Feature Systems

Human-in-the-Loop (HIL)

  • HIL Node System - Complete HIL architecture, AI response classification, multi-channel support
  • HIL Data Formats - Request/response schemas for HIL interactions

Integrations

Supporting Systems

Migration & Development

Legacy/Reference

Development Workflow

Setting Up Development Environment

# Backend services (Python with uv)
cd apps/backend
uv sync

# Individual services
cd api-gateway && uv sync
cd workflow_agent && uv sync
cd workflow_engine && pip install -e .
cd workflow_scheduler && uv sync

# Frontend
cd apps/frontend/agent_team_web
npm install
npm run dev

Running Services

# All services with Docker Compose (recommended)
cd apps/backend
docker-compose up --build

# Individual services for development
cd api-gateway && uv run uvicorn app.main:app --reload --port 8000
cd workflow_agent && python main.py
cd workflow_engine && python -m workflow_engine.main
cd workflow_scheduler && python main.py

Testing

# Backend testing
cd apps/backend
pytest                        # All tests
pytest api-gateway/tests/     # Service-specific
uv run pytest --cov=app       # With coverage

# Frontend testing
cd apps/frontend/agent_team_web
npm test

Key Concepts & Terminology

Workflow Execution Model

Execution States:

  • NEW: Initial state
  • RUNNING: Active execution
  • PAUSED: Halted (Human-in-the-Loop)
  • SUCCESS: Completed successfully
  • ERROR: Failed execution
  • WAITING_FOR_HUMAN: Awaiting HIL response

Node-Level States:

  • pending: Waiting to execute
  • running: Currently executing
  • waiting_input: Awaiting user input (HIL)
  • completed: Successfully finished
  • failed: Execution error
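Expressed as Python enums for reference (the members mirror the lists above; ExecutionStatus also appears in the error-handling example further down, but the exact enum definitions here are an assumption):

from enum import Enum

class ExecutionStatus(str, Enum):
    NEW = "NEW"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUCCESS = "SUCCESS"
    ERROR = "ERROR"
    WAITING_FOR_HUMAN = "WAITING_FOR_HUMAN"

class NodeStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    WAITING_INPUT = "waiting_input"
    COMPLETED = "completed"
    FAILED = "failed"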

Attached Nodes Pattern (AI_AGENT)

AI_AGENT nodes can attach TOOL and MEMORY nodes for enhanced capabilities:

  1. Memory Context Loading (pre-execution): Load conversation history
  2. Tool Discovery (pre-execution): Register MCP tools with AI provider
  3. AI Response Generation: Execute with enhanced context and tools
  4. Conversation Storage (post-execution): Persist interaction to memory
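A schematic of that four-phase flow; the function and attribute names are invented for illustration and the engine's real interfaces will differ:

async def run_ai_agent(node, provider, context):
    # 1. Memory context loading (pre-execution)
    history = [await memory.load(context) for memory in node.attached_memory]
    # 2. Tool discovery (pre-execution): register MCP tools with the AI provider
    tools = [tool.as_mcp_tool() for tool in node.attached_tools]
    # 3. AI response generation with the enhanced context and tools
    reply = await provider.generate(node.system_prompt, history, tools, context.inputs)
    # 4. Conversation storage (post-execution)
    for memory in node.attached_memory:
        await memory.store(context, reply)
    return reply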

Human-in-the-Loop (HIL)

HIL nodes enable workflows to pause and await human decisions:

  1. Pause & Wait: Workflow pauses, state persisted to database
  2. Multi-channel Interaction: Slack, Email, Webhook, In-App notifications
  3. AI Response Classification: Gemini-powered 8-factor analysis determines response relevance
  4. Timeout Management: Configurable timeouts (60s-24h) with customizable actions
  5. Resume Execution: Workflow resumes with human response as node output
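A HIL node's configuration might look roughly like this; the keys and values are illustrative, not the exact schema:

hil_node = {
    "type": "HUMAN_IN_THE_LOOP",
    "configurations": {
        "channel": "slack",                # slack | email | webhook | in_app
        "message": "Approve this deployment to production?",
        "timeout_seconds": 3600,           # anywhere in the documented 60s-24h range
        "timeout_action": "mark_failed",   # hypothetical name for the timeout behavior
    },
}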

Development Philosophy

"Fail Fast with Clear Feedback"

CRITICAL: Never use mock responses or silent failures in production code. Always provide real errors with actionable feedback when functionality is not implemented or misconfigured.

DO ✅: Structured error responses with clear solutions

return NodeExecutionResult(
    status=ExecutionStatus.ERROR,
    error_message="Slack OAuth token not configured",
    error_details={
        "reason": "missing_oauth_token",
        "solution": "Connect Slack account in integrations settings",
        "oauth_flow_url": "/integrations/connect/slack",
    },
)

DON'T ❌: Mock responses that hide issues

# NEVER do this - creates false positives
if not api_key:
    return f"[Mock Response] Success: {message}"

Deployment Architecture

AWS ECS Deployment

  • Platform: AWS ECS Fargate with service discovery
  • Networking: VPC with private subnets, NAT gateways
  • Load Balancing: Application Load Balancer with health checks
  • Security: Security Groups, IAM roles, encrypted secrets (AWS SSM)

Health Check Configuration

# API Gateway
curl http://localhost:8000/api/v1/public/health # 120s start period

# Workflow Agent
curl http://localhost:8001/health # 120s start period

# Workflow Engine
curl http://localhost:8002/health # 90s start period

# Workflow Scheduler
curl http://localhost:8003/health # 60s start period

Critical Deployment Requirements

  • Platform: Build with --platform linux/amd64 for ECS
  • Dependencies: All dependencies in requirements.txt/pyproject.toml
  • Import Structure: Preserve Python package hierarchy in Docker images
  • Environment: All secrets via AWS SSM Parameters

Common Development Tasks

Adding OAuth Integrations

When adding new OAuth providers (Slack, GitHub, Notion, etc.):

  1. GitHub Secrets: Add {PROVIDER}_CLIENT_ID, {PROVIDER}_CLIENT_SECRET, etc.
  2. Terraform: Add variables to infra/variables.tf and SSM parameters to infra/secrets.tf
  3. GitHub Actions: Add environment variables to .github/workflows/deploy.yml
  4. Service Config: Update all relevant ECS task definitions in infra/ecs.tf
  5. Testing: Test complete OAuth flow end-to-end after deployment

Checklist available in: CLAUDE.md OAuth Integration section

Writing Documentation (Docusaurus/MDX)

When writing technical documentation in MDX format, escape comparison operators:

✅ Correct:
- score \>= 0.7
- score \<= 0.3

โŒ Incorrect (causes build failures):
- score >= 0.7
- score <= 0.3

Build locally before committing:

cd apps/internal-tools/docusaurus-doc
npm run build

Troubleshooting

Service Health Checks

# Check all services
curl http://localhost:8000/api/v1/public/health # API Gateway
curl http://localhost:8001/health # Workflow Agent
curl http://localhost:8002/health # Workflow Engine
curl http://localhost:8003/health # Workflow Scheduler

# Check Redis
redis-cli ping

# Check database
psql $SUPABASE_DATABASE_URL -c "SELECT version();"

Common Issues

  1. Import Errors: Ensure Docker preserves package structure with proper COPY commands
  2. Port Conflicts: Services must use designated ports (8000-8003)
  3. Database Connection: Check Supabase connection strings and SSL requirements
  4. Authentication: Verify JWT tokens and RLS policies in Supabase

Migration History

Major Architectural Changes

  • gRPC → FastAPI Migration: All services now use HTTP/REST for consistency
  • Three-Layer API Architecture: Public/App/MCP authentication layers
  • Node Specification System: Centralized validation with automatic type conversion
  • Provider-Based AI Agents: From hardcoded roles to flexible prompt-driven nodes
  • Workflow Scheduler Addition: Dedicated trigger management service

Documentation Navigation

Start Here

Service-Specific Deep Dives

  • Each service has detailed CLAUDE.md files with service-specific patterns
  • Check apps/backend/{service}/CLAUDE.md for implementation details

Database & Data Models

Integration Guides

Contributing

When adding new features or making architectural changes:

  1. Update Technical Design Docs: Document new systems, data structures, and APIs
  2. Follow Existing Patterns: Maintain consistency with current architecture
  3. Add Tests: Comprehensive unit and integration tests
  4. Update CLAUDE.md: Service-specific development guidance
  5. Fail Fast: Never use mock responses - always return clear, actionable errors

Documentation Version: 2.0
Last Updated: 2025-01-28
Maintained By: Engineering Team
Related: See individual service CLAUDE.md files for service-specific guidance