Technical Design Documentation
Overview
This directory contains comprehensive technical design documentation for the 24/7 AI Teams platform - a monorepo-based system for building, deploying, and managing automated AI-powered workflows.
What is 24/7 AI Teams?
The platform enables users to create sophisticated automated workflows that combine AI capabilities, external service integrations, and human-in-the-loop decision points. Think of it as a visual programming environment where AI agents can collaborate with tools, memory systems, and human oversight to accomplish complex tasks.
System Architecture
The platform follows a microservices architecture with four core backend services communicating via HTTP/REST:
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   API Gateway   │────▶│  Workflow Agent  │────▶│ Workflow Engine  │────▶│Workflow Scheduler│
│    (FastAPI)    │     │  (LangGraph/AI)  │     │   (Execution)    │     │    (Triggers)    │
│   Port: 8000    │     │    Port: 8001    │     │    Port: 8002    │     │    Port: 8003    │
└────────┬────────┘     └────────┬─────────┘     └────────┬─────────┘     └────────┬─────────┘
         │                       │                        │                        │
         └───────────────────────┴─────── Supabase ───────┴────────────────────────┘
                       (Auth, State, Vector Store, Row Level Security)
Core Services
- API Gateway (Port 8000)
  - Three-layer API architecture: Public, App (OAuth), MCP (API Key)
  - Client-facing HTTP/REST endpoints with proper authentication
  - Real-time SSE (Server-Sent Events) for workflow execution updates
  - Row-Level Security (RLS) integration with Supabase
- Workflow Agent (Port 8001)
  - LangGraph-based AI workflow generation
  - Conversational interface for workflow creation
  - Gap analysis and capability negotiation
  - Template-based workflow modification
  - Automatic debugging and refinement
- Workflow Engine (Port 8002)
  - Node-based workflow execution engine
  - 8 core node types with flexible subtypes
  - Human-in-the-Loop (HIL) support with pause/resume
  - Real-time execution tracking and logging
  - Comprehensive error handling and retry mechanisms
- Workflow Scheduler (Port 8003)
  - Trigger management (Cron, Manual, Webhook, GitHub, Slack, Email)
  - Deployment lifecycle management
  - Distributed locking for concurrent execution prevention
  - Real-time trigger monitoring
Frontend Applications
- Agent Team Web (Next.js): Main web interface for workflow creation and management
- Internal Tools: Docusaurus-based documentation site
Key Architectural Patterns
Node-Based Workflow System
The workflow engine uses a sophisticated node system with 8 core node types:
- TRIGGER: Workflow initiation (Manual, Cron, Webhook, GitHub, Slack, Email)
- AI_AGENT: Provider-based AI nodes (Gemini, OpenAI, Claude) with custom prompts
- ACTION: System operations (HTTP requests, code execution, data transformation)
- EXTERNAL_ACTION: External service integrations (Slack, GitHub, Notion, etc.)
- FLOW: Control flow (If, Loop, Filter, Merge, Wait)
- HUMAN_IN_THE_LOOP: Human interaction points with AI-powered response classification
- TOOL: MCP (Model Context Protocol) tool integrations
- MEMORY: Conversation and knowledge storage
Node Structure
Each node contains:
- Configurations: Node-specific parameters defining behavior
- Input/Output Params: Runtime data flow parameters
- Attached Nodes: (AI_AGENT only) Tool and Memory nodes executed in the same context
- Position: Canvas coordinates for UI visualization
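The node anatomy above can be sketched as a small data model. This is a minimal illustration rather than the platform's actual schema: the class name `Node` and the field names (`configurations`, `input_params`, `output_params`, `attached_node_ids`, `position`) are assumptions chosen to mirror the bullets, and the subtype strings are placeholders.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class NodeType(str, Enum):
    """The 8 core node types listed above."""
    TRIGGER = "TRIGGER"
    AI_AGENT = "AI_AGENT"
    ACTION = "ACTION"
    EXTERNAL_ACTION = "EXTERNAL_ACTION"
    FLOW = "FLOW"
    HUMAN_IN_THE_LOOP = "HUMAN_IN_THE_LOOP"
    TOOL = "TOOL"
    MEMORY = "MEMORY"


@dataclass
class Node:
    # Field names are hypothetical; they mirror the bullets above.
    id: str
    type: NodeType
    subtype: str
    configurations: dict[str, Any] = field(default_factory=dict)
    input_params: dict[str, Any] = field(default_factory=dict)
    output_params: dict[str, Any] = field(default_factory=dict)
    attached_node_ids: list[str] = field(default_factory=list)
    position: tuple[float, float] = (0.0, 0.0)

    def __post_init__(self) -> None:
        # Only AI_AGENT nodes may carry attached TOOL/MEMORY nodes.
        if self.attached_node_ids and self.type is not NodeType.AI_AGENT:
            raise ValueError("attached nodes are only valid on AI_AGENT nodes")


agent = Node(
    id="agent-1",
    type=NodeType.AI_AGENT,
    subtype="OPENAI_NODE",
    configurations={"system_prompt": "Summarize the input."},
    attached_node_ids=["memory-1", "tool-1"],
)
```

Validating the attached-node rule at construction time keeps an invalid workflow from reaching the execution engine, in line with the fail-fast philosophy described later in this document.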
AI Integration Revolution
The system moved from hardcoded AI roles to provider-based architecture:
Old Approach ❌:
AI_ROUTER_AGENT
AI_TASK_ANALYZER
AI_DATA_INTEGRATOR
New Approach ✅:
GEMINI_NODE # Google Gemini with custom system prompt
OPENAI_NODE # OpenAI GPT with custom system prompt
CLAUDE_NODE # Anthropic Claude with custom system prompt
A node's role is now defined entirely by its system prompt, so new AI behaviors can be added without code changes.
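As a rough sketch of the provider-based approach, the helper below builds two nodes that share a provider and differ only in their prompt. The function `make_ai_agent` and the dictionary layout are hypothetical, not the platform's real schema; only the three provider subtype names come from the text above.

```python
from typing import Any

# Provider subtypes named in the text above.
PROVIDERS = {"GEMINI_NODE", "OPENAI_NODE", "CLAUDE_NODE"}


def make_ai_agent(provider: str, system_prompt: str) -> dict[str, Any]:
    """Build a node definition whose role is set only by its prompt.

    The dictionary layout is an assumption made for illustration.
    """
    if provider not in PROVIDERS:
        raise ValueError(f"unknown AI provider subtype: {provider}")
    if not system_prompt.strip():
        raise ValueError("system_prompt must not be empty")
    return {
        "type": "AI_AGENT",
        "subtype": provider,
        "configurations": {"system_prompt": system_prompt},
    }


# Two formerly hardcoded roles expressed as prompts on the same provider.
router = make_ai_agent("GEMINI_NODE", "Route each request to the right team.")
analyzer = make_ai_agent("GEMINI_NODE", "Break the task into ordered subtasks.")
```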
Authentication & Security
Three-Layer API Architecture:
- Public API (/api/v1/public/*): No auth, rate-limited, health checks
- App API (/api/v1/app/*): Supabase OAuth + JWT + Row Level Security
- MCP API (/api/v1/mcp/*): API Key authentication for LLM clients
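The mapping from path prefix to authentication layer can be sketched as a lookup. The prefixes are taken from the list above; the `AuthLayer` enum and the `auth_layer_for` function are hypothetical names used only for illustration, and the real gateway presumably does this in its middleware.

```python
from enum import Enum


class AuthLayer(Enum):
    PUBLIC = "no auth, rate-limited"
    APP = "Supabase OAuth + JWT + RLS"
    MCP = "API key"


# Path prefixes come from the list above.
_PREFIXES = {
    "/api/v1/public/": AuthLayer.PUBLIC,
    "/api/v1/app/": AuthLayer.APP,
    "/api/v1/mcp/": AuthLayer.MCP,
}


def auth_layer_for(path: str) -> AuthLayer:
    """Return the authentication layer that guards a request path."""
    for prefix, layer in _PREFIXES.items():
        if path.startswith(prefix):
            return layer
    # Fail loudly instead of silently treating unknown paths as public.
    raise ValueError(f"path is outside the three API layers: {path}")
```

Raising on an unknown prefix, rather than defaulting to the public layer, avoids exposing a new route without authentication by accident.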
Security Features:
- Row-Level Security (RLS) for multi-tenant data isolation
- JWT token validation with Supabase
- API key scopes for fine-grained permissions
- Redis-based rate limiting
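The list above names Redis-based rate limiting but not the algorithm. One common choice is a fixed-window counter; the sketch below shows that idea with an in-memory dictionary standing in for Redis. The class name, the window strategy, and the parameters are all assumptions, not a description of the gateway's implementation.

```python
import time
from typing import Callable


class FixedWindowRateLimiter:
    """In-memory stand-in for a Redis counter with a per-window expiry."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._counts: dict[tuple[str, int], int] = {}

    def allow(self, client_id: str) -> bool:
        """Count one request and report whether it is within the limit."""
        window = int(self._clock() // self.window_seconds)
        # Drop counters from earlier windows (Redis would expire them).
        self._counts = {k: v for k, v in self._counts.items() if k[1] == window}
        key = (client_id, window)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.limit
```

The injectable `clock` argument exists only to make the window boundary testable without sleeping.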
Data Management
- Primary Database: Supabase PostgreSQL with RLS
- Vector Store: pgvector for RAG and semantic search
- Cache Layer: Redis for sessions, rate limiting, temporary state
- File Storage: Supabase Storage for artifacts
Technical Documents by Category
Core Service Architecture
API Gateway
- API Gateway Architecture - Three-layer API design (Public/App/MCP), authentication middleware, rate limiting, SSE streaming, RLS integration
Workflow Agent
- Workflow Agent Architecture - LangGraph state machine, conversational workflow generation, gap analysis, negotiation, debugging
- Workflow Agent API - API specification and integration guide
Workflow Engine
- Workflow Engine Architecture - Node execution engine, pause/resume system, provider-based AI agents, HIL integration, execution tracking
- Integration Tests - Comprehensive test strategy and scenarios
Workflow Scheduler
- Workflow Scheduler Architecture - Trigger management, deployment lifecycle, distributed locking, GitHub/Slack integration
Data & Specifications
Workflow Specifications
- Workflow Data Structure - Complete workflow data models, execution states, node definitions, API interfaces
- Node Specification System - Centralized node specs, parameter validation, input/output ports, data formats
- Node Structure - Detailed node anatomy and configuration patterns
- Node Communication Protocol - Standardized inter-node data exchange format
Database Design
- Database Design - Complete schema, tables, relationships, RLS policies
- Unified Log Table - Centralized logging architecture
- Execution Log API - API for querying execution logs
Feature Systems
Human-in-the-Loop (HIL)
- HIL Node System - Complete HIL architecture, AI response classification, multi-channel support
- HIL Data Formats - Request/response schemas for HIL interactions
Integrations
- Slack App Integration - Slack OAuth flow, event handling, messaging
- GitHub App Integration - GitHub App setup, webhook processing, code access
- Manual Trigger System - User-initiated workflow execution
Supporting Systems
- Data Mapping System - Node-to-node data transformation
- Distributed Tracing - OpenTelemetry integration for observability
- Monitoring Guide - System health monitoring and alerting
Migration & Development
- gRPC to FastAPI Migration - Service communication architecture evolution
- Frontend Integration Examples - UI integration patterns
- MCP Node Knowledge Server - Model Context Protocol integration
Legacy/Reference
- MVP Workflow Data Structure - Original workflow specification (superseded by new_workflow_spec.md)
Development Workflow
Setting Up Development Environment
# Backend services (Python with uv)
cd apps/backend
uv sync
# Individual services (each command runs from apps/backend)
(cd api-gateway && uv sync)
(cd workflow_agent && uv sync)
(cd workflow_engine && pip install -e .)
(cd workflow_scheduler && uv sync)
# Frontend (from the repository root)
cd apps/frontend/agent_team_web
npm install
npm run dev
Running Services
# All services with Docker Compose (recommended)
cd apps/backend
docker-compose up --build
# Individual services for development (run each from apps/backend, in its own terminal)
(cd api-gateway && uv run uvicorn app.main:app --reload --port 8000)
(cd workflow_agent && python main.py)
(cd workflow_engine && python -m workflow_engine.main)
(cd workflow_scheduler && python main.py)
Testing
# Backend testing
cd apps/backend
pytest # All tests
pytest api-gateway/tests/ # Service-specific
uv run pytest --cov=app # With coverage
# Frontend testing (from the repository root)
cd apps/frontend/agent_team_web
npm test
Key Concepts & Terminology
Workflow Execution Model
Execution States:
- NEW: Initial state
- RUNNING: Active execution
- PAUSED: Halted (Human-in-the-Loop)
- SUCCESS: Completed successfully
- ERROR: Failed execution
- WAITING_FOR_HUMAN: Awaiting HIL response
Node-Level States:
- pending: Waiting to execute
- running: Currently executing
- waiting_input: Awaiting user input (HIL)
- completed: Successfully finished
- failed: Execution error
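The two sets of states can be written down as enums. The member names and values follow the lists above, and `ExecutionStatus` is the name used in this document's own error-handling example. `NodeStatus`, `is_terminal`, and the choice of SUCCESS and ERROR as the only terminal states are assumptions made for this sketch.

```python
from enum import Enum


class ExecutionStatus(str, Enum):
    """Workflow-level execution states."""
    NEW = "NEW"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    SUCCESS = "SUCCESS"
    ERROR = "ERROR"
    WAITING_FOR_HUMAN = "WAITING_FOR_HUMAN"


class NodeStatus(str, Enum):
    """Node-level states (hypothetical class name)."""
    PENDING = "pending"
    RUNNING = "running"
    WAITING_INPUT = "waiting_input"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumption: only SUCCESS and ERROR end an execution for good.
TERMINAL_STATES = frozenset({ExecutionStatus.SUCCESS, ExecutionStatus.ERROR})


def is_terminal(status: ExecutionStatus) -> bool:
    """Return True when no further transitions are expected."""
    return status in TERMINAL_STATES
```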
Attached Nodes Pattern (AI_AGENT)
AI_AGENT nodes can attach TOOL and MEMORY nodes for enhanced capabilities:
- Memory Context Loading (pre-execution): Load conversation history
- Tool Discovery (pre-execution): Register MCP tools with AI provider
- AI Response Generation: Execute with enhanced context and tools
- Conversation Storage (post-execution): Persist interaction to memory
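The four steps above can be sketched as one function. This is an illustration of the ordering only: `run_ai_agent`, the plain list used as memory, and the `generate` callable that stands in for the AI provider are all hypothetical, and real tool registration would pass schemas rather than bare names.

```python
from typing import Callable


def run_ai_agent(
    user_input: str,
    memory: list[str],
    tools: dict[str, Callable[[str], str]],
    generate: Callable[[str, list[str], list[str]], str],
) -> str:
    """Run one AI_AGENT turn in the four-step order described above."""
    # 1. Memory context loading (pre-execution).
    history = list(memory)
    # 2. Tool discovery (pre-execution): expose tool names to the provider.
    tool_names = sorted(tools)
    # 3. AI response generation with the enriched context.
    reply = generate(user_input, history, tool_names)
    # 4. Conversation storage (post-execution).
    memory.append(f"user: {user_input}")
    memory.append(f"assistant: {reply}")
    return reply


def fake_generate(prompt: str, history: list[str], tool_names: list[str]) -> str:
    # Stand-in for a provider call, used only to exercise the flow.
    return f"{len(history)} prior messages, tools={tool_names}"


memory = ["user: hi", "assistant: hello"]
reply = run_ai_agent("status?", memory, {"search": lambda q: q}, fake_generate)
```

The history is copied before generation so the provider sees the conversation as it was before this turn; the new exchange is appended only after a reply exists.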
Human-in-the-Loop (HIL)
HIL nodes enable workflows to pause and await human decisions:
- Pause & Wait: Workflow pauses, state persisted to database
- Multi-channel Interaction: Slack, Email, Webhook, In-App notifications
- AI Response Classification: Gemini-powered 8-factor analysis determines response relevance
- Timeout Management: Configurable timeouts (60s-24h) with customizable actions
- Resume Execution: Workflow resumes with human response as node output
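The pause, timeout, and resume behavior above can be sketched as a small decision function. The 60s and 24h bounds come from the list; the `HilRequest` class, the `timeout_action` default of "fail", and the `next_action` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TIMEOUT_SECONDS = 60             # lower bound stated above
MAX_TIMEOUT_SECONDS = 24 * 60 * 60   # 24h upper bound stated above


@dataclass
class HilRequest:
    node_id: str
    timeout_seconds: int
    timeout_action: str = "fail"     # hypothetical action taken on timeout
    response: Optional[str] = None

    def __post_init__(self) -> None:
        if not MIN_TIMEOUT_SECONDS <= self.timeout_seconds <= MAX_TIMEOUT_SECONDS:
            raise ValueError("timeout must be between 60s and 24h")


def next_action(request: HilRequest, elapsed_seconds: float,
                response: Optional[str]) -> str:
    """Return 'resume', 'wait', or the configured timeout action."""
    if response is not None:
        request.response = response  # becomes the node output on resume
        return "resume"
    if elapsed_seconds >= request.timeout_seconds:
        return request.timeout_action
    return "wait"
```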
Development Philosophy
"Fail Fast with Clear Feedback"
CRITICAL: Never use mock responses or silent failures in production code. Always provide real errors with actionable feedback when functionality is not implemented or misconfigured.
DO ✅: Structured error responses with clear solutions
return NodeExecutionResult(
    status=ExecutionStatus.ERROR,
    error_message="Slack OAuth token not configured",
    error_details={
        "reason": "missing_oauth_token",
        "solution": "Connect Slack account in integrations settings",
        "oauth_flow_url": "/integrations/connect/slack"
    }
)
DON'T ❌: Mock responses that hide issues
# NEVER do this - creates false positives
if not api_key:
    return f"[Mock Response] Success: {message}"
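For context, here is a self-contained sketch of the DO pattern inside a complete function. The `NodeExecutionResult` and `ExecutionStatus` definitions are simplified stand-ins for the engine's real types, and `send_slack_message` with its injected `post` callable is a hypothetical helper, not an existing API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable, Optional


class ExecutionStatus(str, Enum):
    SUCCESS = "SUCCESS"
    ERROR = "ERROR"


@dataclass
class NodeExecutionResult:
    # Simplified stand-in for the engine's result type; fields are assumptions.
    status: ExecutionStatus
    output: Optional[dict[str, Any]] = None
    error_message: Optional[str] = None
    error_details: dict[str, Any] = field(default_factory=dict)


def send_slack_message(
    message: str,
    oauth_token: Optional[str],
    post: Callable[[str, str], dict[str, Any]],
) -> NodeExecutionResult:
    """Fail fast when the token is missing instead of faking success."""
    if not oauth_token:
        return NodeExecutionResult(
            status=ExecutionStatus.ERROR,
            error_message="Slack OAuth token not configured",
            error_details={
                "reason": "missing_oauth_token",
                "solution": "Connect Slack account in integrations settings",
                "oauth_flow_url": "/integrations/connect/slack",
            },
        )
    # `post` performs the real API call; any error it raises propagates.
    return NodeExecutionResult(
        status=ExecutionStatus.SUCCESS,
        output=post(oauth_token, message),
    )
```

The success branch reports only what the injected call actually returned, so a misconfigured integration can never look like a delivered message.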
Deployment Architecture
AWS ECS Deployment
- Platform: AWS ECS Fargate with service discovery
- Networking: VPC with private subnets, NAT gateways
- Load Balancing: Application Load Balancer with health checks
- Security: Security Groups, IAM roles, encrypted secrets (AWS SSM)
Health Check Configuration
# API Gateway
curl http://localhost:8000/api/v1/public/health # 120s start period
# Workflow Agent
curl http://localhost:8001/health # 120s start period
# Workflow Engine
curl http://localhost:8002/health # 90s start period
# Workflow Scheduler
curl http://localhost:8003/health # 60s start period
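The same checks can be scripted. The sketch below polls the four endpoints listed above using only the standard library; the function names `http_status` and `check_all` and the service keys are assumptions, and it treats anything other than HTTP 200 as unhealthy.

```python
from typing import Callable
from urllib.request import urlopen

# Ports and paths are taken from the curl commands above.
HEALTH_URLS = {
    "api-gateway": "http://localhost:8000/api/v1/public/health",
    "workflow_agent": "http://localhost:8001/health",
    "workflow_engine": "http://localhost:8002/health",
    "workflow_scheduler": "http://localhost:8003/health",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def check_all(fetch: Callable[[str], int] = http_status) -> dict[str, bool]:
    """Map each service name to True when its health endpoint answers 200."""
    results: dict[str, bool] = {}
    for name, url in HEALTH_URLS.items():
        try:
            results[name] = fetch(url) == 200
        except OSError:  # connection refused, timeout, or HTTP error status
            results[name] = False
    return results
```

Calling `check_all()` with no arguments performs real requests against localhost; passing a custom `fetch` makes the function usable in tests without running services.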
Critical Deployment Requirements
- Platform: Build with --platform linux/amd64 for ECS
- Dependencies: All dependencies in requirements.txt/pyproject.toml
- Import Structure: Preserve Python package hierarchy in Docker images
- Environment: All secrets via AWS SSM Parameters
Common Development Tasks
Adding OAuth Integrations
When adding new OAuth providers (Slack, GitHub, Notion, etc.):
- GitHub Secrets: Add {PROVIDER}_CLIENT_ID, {PROVIDER}_CLIENT_SECRET, etc.
- Terraform: Add variables to infra/variables.tf and SSM parameters to infra/secrets.tf
- GitHub Actions: Add environment variables to .github/workflows/deploy.yml
- Service Config: Update all relevant ECS task definitions in infra/ecs.tf
- Testing: Test complete OAuth flow end-to-end after deployment
Checklist available in: CLAUDE.md OAuth Integration section
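A service can verify at startup that the secrets from the first step actually reached its environment. The sketch below covers only the two variable names spelled out above (the "etc." is left unexpanded), and the function names are hypothetical.

```python
import os
from typing import Mapping


def required_oauth_vars(provider: str) -> list[str]:
    """Return the secret names spelled out above for one provider."""
    prefix = provider.upper()
    return [f"{prefix}_CLIENT_ID", f"{prefix}_CLIENT_SECRET"]


def missing_oauth_vars(provider: str,
                       env: Mapping[str, str] = os.environ) -> list[str]:
    """List required variables that are unset or empty in env."""
    return [name for name in required_oauth_vars(provider) if not env.get(name)]
```

Reporting the exact missing names at startup gives a clearer error than a failed OAuth flow later on.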
Writing Documentation (Docusaurus/MDX)
When writing technical documentation in MDX format, escape comparison operators:
✅ Correct:
- score \>= 0.7
- score \<= 0.3
❌ Incorrect (causes build failures):
- score >= 0.7
- score <= 0.3
Build locally before committing:
cd apps/internal-tools/docusaurus-doc
npm run build
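The escaping rule can also be applied mechanically. The helper below is a hypothetical pre-commit aid, not part of the repository: it escapes `>=` and `<=` that are not already escaped. It makes no attempt to skip code spans or fenced blocks, where escaping is not wanted, so treat it as a starting point only.

```python
import re

# Match ">=" or "<=" that is not already preceded by a backslash.
_UNESCAPED = re.compile(r"(?<!\\)([<>])=")


def escape_comparisons(text: str) -> str:
    """Backslash-escape >= and <= in MDX prose."""
    return _UNESCAPED.sub(r"\\\1=", text)
```

Because already-escaped operators are skipped, running the function twice gives the same result as running it once.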
Troubleshooting
Service Health Checks
# Check all services
curl http://localhost:8000/api/v1/public/health # API Gateway
curl http://localhost:8001/health # Workflow Agent
curl http://localhost:8002/health # Workflow Engine
curl http://localhost:8003/health # Workflow Scheduler
# Check Redis
redis-cli ping
# Check database
psql $SUPABASE_DATABASE_URL -c "SELECT version();"
Common Issues
- Import Errors: Ensure Docker preserves package structure with proper COPY commands
- Port Conflicts: Services must use designated ports (8000-8003)
- Database Connection: Check Supabase connection strings and SSL requirements
- Authentication: Verify JWT tokens and RLS policies in Supabase
Migration History
Major Architectural Changes
- gRPC → FastAPI Migration: All services now use HTTP/REST for consistency
- Three-Layer API Architecture: Public/App/MCP authentication layers
- Node Specification System: Centralized validation with automatic type conversion
- Provider-Based AI Agents: From hardcoded roles to flexible prompt-driven nodes
- Workflow Scheduler Addition: Dedicated trigger management service
Documentation Navigation
Start Here
- New to the project? Read this overview, then Workflow Data Structure
- Setting up API integration? See API Gateway Architecture
- Building workflows? Check Node Specification System
- Adding OAuth providers? Follow OAuth Integration Checklist
Service-Specific Deep Dives
- Each service has a detailed CLAUDE.md file with service-specific patterns
- Check apps/backend/{service}/CLAUDE.md for implementation details
Database & Data Models
- Start with Database Design for schema overview
- See Workflow Data Structure for complete data models
- Reference Node Specification System for node validation rules
Integration Guides
- Slack Integration - Slack OAuth and event handling
- GitHub Integration - GitHub App setup and webhooks
- HIL System - Human-in-the-Loop architecture
Contributing
When adding new features or making architectural changes:
- Update Technical Design Docs: Document new systems, data structures, and APIs
- Follow Existing Patterns: Maintain consistency with current architecture
- Add Tests: Comprehensive unit and integration tests
- Update CLAUDE.md: Service-specific development guidance
- Fail Fast: Never use mock responses - always return clear, actionable errors
Documentation Version: 2.0
Last Updated: 2025-01-28
Maintained By: Engineering Team
Related: See individual service CLAUDE.md files for service-specific guidance