AI Video Call SaaS Platform

Founder's Project · 2024 · 2 min read

Developed a white-label AI-powered video conferencing platform with real-time transcription, smart summaries, and action item extraction for a SaaS client.

Overview

Built a complete video calling SaaS product from the ground up for a client entering the remote collaboration market. The platform features HD video calls, AI-powered real-time transcription, automatic meeting summaries, and intelligent action item extraction.

Problem

The client wanted to enter the video conferencing market with a differentiated product. Existing solutions lacked meaningful AI integration — transcription was an afterthought, and meeting follow-ups still required manual effort.

Constraints

  • Sub-200ms latency for video/audio streams
  • Real-time transcription accuracy above 95%
  • Support for 50+ concurrent participants per room
  • White-label capability for B2B reselling
  • SOC 2 compliance requirements

Approach

We built the platform on WebRTC, using peer-to-peer connections for small calls and a selective forwarding unit (SFU) for larger rooms. The frontend is a React application with a custom video grid layout. AI features are powered by Whisper for transcription, running on dedicated GPU instances, and GPT-4 for summarization, accessed through its API.

Key Decisions

WebRTC with custom SFU

Reasoning:

Peer-to-peer for small calls keeps latency minimal. The SFU architecture scales for larger rooms without the CPU cost of an MCU (multipoint control unit).

Alternatives considered:
  • Third-party SDK (Twilio/Agora) — faster but higher per-minute costs and less customization
  • MCU architecture — better video quality control but prohibitive server costs
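
The topology decision above can be sketched as a small routing rule. This is an illustrative sketch, not the production logic: the names chooseTopology and streamsPerClient and the cutoff of 4 participants are assumptions made for the example.

```typescript
// Hypothetical names; the P2P cutoff is an assumed value, not taken from the project.
type Topology = "p2p" | "sfu";

interface RoomState {
  participantCount: number;
}

const P2P_MAX_PARTICIPANTS = 4; // assumption for illustration

// Small rooms connect peers directly; larger rooms go through the SFU.
function chooseTopology(
  room: RoomState,
  p2pMax: number = P2P_MAX_PARTICIPANTS
): Topology {
  return room.participantCount <= p2pMax ? "p2p" : "sfu";
}

// Number of media streams each client uploads and downloads per topology.
function streamsPerClient(
  topology: Topology,
  participantCount: number
): { up: number; down: number } {
  const others = Math.max(participantCount - 1, 0);
  if (topology === "p2p") {
    // Full mesh: one upload and one download per remote peer.
    return { up: others, down: others };
  }
  // SFU: a single upload; the server forwards it to everyone else.
  return { up: 1, down: others };
}
```

Under these assumptions, a 50-person room in a full mesh would need 49 uploads per client, while the SFU needs one. The SFU only forwards packets rather than decoding and re-mixing them, which is where the CPU saving over an MCU comes from.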

Whisper API for real-time transcription

Reasoning:

Best-in-class accuracy for multi-speaker environments. Streaming audio chunks keeps transcription near real-time without waiting for full sentences.

Alternatives considered:
  • Google Speech-to-Text — good accuracy but higher latency for streaming
  • AssemblyAI — strong but Whisper's multilingual support was a client requirement
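
One way to picture the chunked streaming described above is a windowing function with a small overlap between chunks, so words at a boundary appear in full in at least one chunk. This is a minimal sketch under assumed parameters; chunkAudio and ChunkOptions are hypothetical names, and the project's actual chunk sizes are not stated in this write-up.

```typescript
// Hypothetical helper; chunk and overlap lengths are tuning knobs, not project values.
interface ChunkOptions {
  sampleRate: number; // samples per second (Whisper models operate on 16 kHz audio)
  chunkSeconds: number; // length of each chunk
  overlapSeconds: number; // audio repeated at the start of the next chunk
}

// Split mono PCM samples into fixed-length chunks that overlap slightly.
function chunkAudio(samples: Float32Array, opts: ChunkOptions): Float32Array[] {
  const chunkSize = Math.round(opts.sampleRate * opts.chunkSeconds);
  const overlap = Math.round(opts.sampleRate * opts.overlapSeconds);
  const step = chunkSize - overlap;
  if (chunkSize <= 0 || step <= 0) {
    throw new RangeError("chunk must be longer than the overlap");
  }

  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += step) {
    const end = Math.min(start + chunkSize, samples.length);
    chunks.push(samples.subarray(start, end)); // a view, no copy
    if (end === samples.length) break; // last chunk reached the end of the audio
  }
  return chunks;
}
```

Shorter chunks reduce the delay before text appears but give the model less context per request; longer chunks do the opposite.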

Tech Stack

  • React
  • TypeScript
  • Node.js
  • WebRTC
  • Redis
  • PostgreSQL
  • OpenAI Whisper
  • GPT-4 API
  • AWS
  • Docker
  • Kubernetes

Result & Impact

  • Average call latency: <150ms
  • Transcription accuracy: 97%
  • Meeting follow-up time: reduced by 90%

The client successfully launched the product and onboarded their first enterprise customers within 3 months. The AI meeting summary feature became the primary differentiator in sales conversations.

Learnings

  • WebRTC complexity is often underestimated: TURN server fallbacks are essential for enterprise networks, where firewalls often block direct peer-to-peer traffic
  • Streaming transcription requires careful audio chunking to balance latency vs. accuracy
  • White-label architecture decisions made early save enormous refactoring later
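
The TURN fallback from the first learning can be sketched as an ICE configuration that lists STUN first and TURN over UDP, TCP, and TLS as relays. The host turn.example.com, the ports, and the buildIceConfig helper are assumptions for illustration; the interfaces mirror the browser's RTCIceServer and RTCConfiguration shapes so the snippet also runs outside a browser.

```typescript
// Local mirrors of the browser's RTCIceServer / RTCConfiguration shapes.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServer[];
  iceTransportPolicy?: "all" | "relay";
}

// Hypothetical credential shape; real deployments often issue short-lived credentials.
interface TurnCredentials {
  host: string;
  username: string;
  credential: string;
}

function buildIceConfig(turn: TurnCredentials, forceRelay = false): IceConfig {
  return {
    iceServers: [
      // STUN lets peers discover their public address for direct connections.
      { urls: `stun:${turn.host}:3478` },
      // TURN relays media when direct paths fail; TLS on 443 is the last resort
      // for networks that only allow HTTPS-like traffic.
      {
        urls: [
          `turn:${turn.host}:3478?transport=udp`,
          `turn:${turn.host}:3478?transport=tcp`,
          `turns:${turn.host}:443?transport=tcp`,
        ],
        username: turn.username,
        credential: turn.credential,
      },
    ],
    // "relay" skips direct candidates entirely, which is useful for testing the fallback.
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}
```

In a browser, the result would be passed to new RTCPeerConnection(...). Forcing relay mode in a staging environment is one way to confirm the TURN path works before an enterprise customer hits it.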

Technical Deep Dive

The video infrastructure uses a custom SFU built on mediasoup and deployed on AWS. Auto-scaling groups spin up GPU instances for AI processing. Audio streams are chunked and sent to Whisper in parallel with the video rendering pipeline.

Meeting summaries are generated using GPT-4 with custom prompts tuned for extracting action items, decisions, and key discussion points. Results are stored in PostgreSQL and accessible via a REST API and webhook integrations.
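
The summary step can be illustrated with two pure functions: one that builds chat messages asking for JSON output, and one that validates the model's reply before it is stored. This is a sketch only. The prompt wording, the MeetingSummary shape, and the function names are assumptions; the actual prompts and schema are not reproduced in this write-up, and the network call to the model is left out.

```typescript
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface ActionItem {
  owner: string | null;
  task: string;
}

// Hypothetical result shape for illustration.
interface MeetingSummary {
  summary: string;
  decisions: string[];
  actionItems: ActionItem[];
}

// Build the messages for a chat-style completion request.
function buildSummaryMessages(transcript: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "Summarize the meeting transcript. Reply with JSON only, using the keys " +
        '"summary" (string), "decisions" (string[]), and "actionItems" ' +
        "(array of { owner: string | null, task: string }).",
    },
    { role: "user", content: transcript },
  ];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((x) => typeof x === "string");
}

// Validate the raw model reply; throws if the JSON is malformed or misshapen.
function parseSummary(raw: string): MeetingSummary {
  const data = JSON.parse(raw) as Record<string, unknown> | null;
  if (data === null || typeof data !== "object") {
    throw new Error("summary must be a JSON object");
  }
  const { summary, decisions, actionItems } = data;
  if (typeof summary !== "string") throw new Error("missing summary");
  if (!isStringArray(decisions)) throw new Error("decisions must be string[]");
  if (!Array.isArray(actionItems)) throw new Error("actionItems must be an array");

  const items: ActionItem[] = actionItems.map((item: unknown) => {
    const rec = item as { owner?: unknown; task?: unknown } | null;
    if (rec === null || typeof rec !== "object" || typeof rec.task !== "string") {
      throw new Error("each action item needs a task");
    }
    return { owner: typeof rec.owner === "string" ? rec.owner : null, task: rec.task };
  });
  return { summary, decisions, actionItems: items };
}
```

Validating before writing to PostgreSQL keeps malformed model output from reaching REST or webhook consumers, who would otherwise each need their own defensive parsing.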