Building a Voice AI for Restaurants with Nova Sonic + Pipecat



“…the phones never stop ringing,” says every restaurant owner ever. It’s a good problem to have, until your host is stuck answering the same questions about hours and menu items while the hungry diners in front of them grow restless. That’s the pain point we set out to solve by combining two very new tools, Amazon Nova Sonic and the Pipecat voice framework, to create a near‑instant, conversational AI that can handle calls, answer questions and even take reservations. Below is what we built, what went right, and what still feels like a half‑baked crème brûlée. Consider this a field report: informal, candid and sprinkled with some gentle complaints.
I built it using Pipecat and AWS’s new Nova Sonic model for Slang.ai, a company that’s already doing big things in the restaurant voice AI space. If this kind of thing interests you, check them out.
Why Nova Sonic and Pipecat?
Amazon’s new Nova Sonic model is the shiny toy in the LLM sandbox. Unlike the traditional three‑model pipeline (speech‑to‑text → LLM → text‑to‑speech), Nova Sonic rolls speech understanding and generation into a single model. This unification preserves the tone, style and pacing of the speaker and allows the model to adapt its response to the acoustic context. Developers interact with it through a bidirectional streaming API that lets audio flow in both directions; input is processed while output is generated, so the agent can respond before you finish your sentence. In other words, it sounds less like a robot and more like a real maître d’, albeit one with infinite patience.

Pipecat, on the other hand, is an open‑source framework for orchestrating audio, video and LLM services. It manages transports (WebRTC or telephony), performs voice activity detection, invokes your chosen LLM and synthesizes speech. The framework is designed for ultra‑low‑latency conversations, with a typical round‑trip interaction clocking in between 500 and 800 ms. Pipecat pipelines consist of processors that handle audio frames in real time: receiving audio, transcribing it, feeding it into an LLM, converting the reply back to speech and streaming it to the user. This modular approach allowed us to plug in Nova Sonic as the speech‑to‑speech service while using Pipecat’s telephony transports, logging and tracing utilities.
What We Built (So Far)
The initial goal was simple: create a voice agent for a restaurant that can answer basic questions (hours, menu, location) and take reservations without transferring the call to a human. Here’s the high‑level architecture:
Transport & Voice Pipeline – The caller connects through a browser (WebRTC) today, but Pipecat offers telephony integrations via WebSocket media streams, Daily’s PSTN and SIP. Audio is streamed into Pipecat where Silero VAD detects when the customer stops talking.
LLM Service – We extended Pipecat’s `AWSNovaSonicLLMService` and wrapped it to log every user and assistant message into DynamoDB. Nova Sonic provides built‑in STT, TTS and context management. The LLM is instructed to answer questions about the restaurant or call a Python function (`get_restaurant_info`) when the user asks about hours, menu or location.
Data Persistence – A DynamoDB table stores conversation events. Each conversation’s start and end times are logged, along with a TTL that automatically purges data after 30 days.
Monitoring – A Streamlit dashboard visualizes conversations, message timelines and sentiment. We use Amazon Comprehend for basic sentiment analysis – more on its limitations later.
OpenTelemetry Tracing – Pipecat’s built‑in OpenTelemetry instrumentation lets us track latency, monitor bottlenecks and export traces to a collector like Jaeger or Langfuse. We enabled tracing on our `PipelineTask` by setting `enable_metrics` and `enable_tracing`, and configured an OTLP exporter to send spans to our collector.
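To make the 30‑day purge concrete, here is a minimal sketch of how a conversation event might be shaped before writing it to DynamoDB. The attribute names (`conversation_id`, `ttl`, and so on) are illustrative, not our exact deployed schema:

```python
import time

TTL_DAYS = 30  # purge conversation data after 30 days


def build_event_item(conversation_id: str, role: str, text: str) -> dict:
    """Shape a conversation event for DynamoDB (attribute names are illustrative)."""
    now = int(time.time())
    return {
        "conversation_id": conversation_id,    # partition key
        "timestamp": now,                      # sort key
        "role": role,                          # "user" or "assistant"
        "text": text,
        "ttl": now + TTL_DAYS * 24 * 60 * 60,  # DynamoDB TTL attribute (epoch seconds)
    }
```

With boto3 you would pass this dict to `table.put_item(Item=...)` and enable TTL on the `ttl` attribute in the table settings; DynamoDB then deletes expired items on its own schedule, typically within about 48 hours of expiry.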
The Good
⚡️ Low Latency, Natural Conversation
Nova Sonic delivered on its promise. In production tests, the agent responded within half a second, often interrupting politely to confirm a reservation or clarify a request. The unification of STT→LLM→TTS into a single model meant there was no audible “thinking” pause, and the voice kept context, tone and even hesitations. Pipecat’s pipeline kept overhead minimal and maintained the 500–800 ms round‑trip latency.
🎯 Function Calling and Context
Nova Sonic supports function/tool calling (similar to OpenAI’s function calling) via Pipecat. We defined a `get_restaurant_info` tool that returns hours, menu items and address based on the user’s question. Because Nova Sonic produces text transcripts of the caller’s speech and streams them as events, our code could easily detect when to invoke the function. This pattern could later be extended to call a reservation API like OpenTable – more on that below.
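As a rough illustration of the pattern (not Nova Sonic’s exact tool‑schema wire format), the tool definition can be a JSON‑schema‑style dict plus a plain Python handler; the restaurant data here is made up:

```python
# Illustrative restaurant data; a real deployment would load this per business.
RESTAURANT = {
    "hours": "Tue-Sun, 5-10 pm",
    "menu": ["margherita pizza", "rigatoni alla vodka", "tiramisu"],
    "address": "123 Mulberry St, New York, NY",
}

# JSON-schema-style tool definition, in the spirit of common function-calling APIs.
GET_RESTAURANT_INFO_TOOL = {
    "name": "get_restaurant_info",
    "description": "Look up the restaurant's hours, menu, or location.",
    "parameters": {
        "type": "object",
        "properties": {
            "topic": {"type": "string", "enum": ["hours", "menu", "location"]},
        },
        "required": ["topic"],
    },
}


def get_restaurant_info(topic: str) -> str:
    """Handler invoked when the LLM emits a tool call for this tool."""
    if topic == "hours":
        return f"We're open {RESTAURANT['hours']}."
    if topic == "menu":
        return "Tonight we have " + ", ".join(RESTAURANT["menu"]) + "."
    if topic == "location":
        return f"You'll find us at {RESTAURANT['address']}."
    return "I can help with hours, menu, or location."
```

The LLM picks the `topic` argument from the caller’s question; the handler’s string reply is fed straight back into the conversation for Nova Sonic to speak.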
🔍 Observability (to a point)
Pipecat’s OpenTelemetry integration was a lifesaver. With just a few lines of code we spun up a Jaeger collector and could visualize each turn of the conversation: STT, LLM and TTS spans nested under each turn. These traces helped us identify which part of the pipeline caused a hiccup (hint: network jitter is still a thing). For local testing, enabling the console exporter printed spans to stdout, which was incredibly handy for debugging.
🗂️ Streamlit + DynamoDB: Good Enough for Now
I built a Streamlit app that queries DynamoDB (where all the assistant/user messages get logged). It lets me:
View conversations
Filter for conversations Comprehend labeled “Negative”
Get a rough view of user flows
Not bad for basic QA. But there’s still a lot of UX and performance nuance that gets lost without deeper observability tools.
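For flavor, the “Negative” filter amounts to a one‑liner once the Comprehend label is stored with each event; this assumes a `sentiment` attribute that we added to our logged items:

```python
def negative_conversations(events: list[dict]) -> list[str]:
    """Return conversation IDs whose stored sentiment label is NEGATIVE.

    Comprehend's detect_sentiment returns POSITIVE/NEGATIVE/NEUTRAL/MIXED;
    we assume that label was persisted alongside each conversation event.
    """
    return sorted(
        {e["conversation_id"] for e in events if e.get("sentiment") == "NEGATIVE"}
    )
```

In the dashboard this runs over the DynamoDB query results before rendering, so Streamlit only shows the flagged conversations.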
The Bad & The Wishlist
Despite the impressive voice quality and latency, we hit some friction that’s worth sharing:
📞 PSTN Integration Is Still DIY
Pipecat offers multiple telephony integration options – WebSocket media streams with providers like Twilio and Telnyx, Daily’s PSTN calling, and SIP for advanced call flows. However, there’s no turnkey “just give me a phone number and point it at my bot” solution yet. We’re currently using a browser‑based front‑end, which is fine for demos but not for restaurants whose customers call by phone. Our next milestone is to map a real PSTN phone number to each restaurant and route calls into Pipecat, complete with business hours, holiday closures and a fallback to human staff.
🕵️ Observability Gaps with Nova Sonic
Here’s where I got a bit frustrated. Pipecat’s OpenTelemetry spans gave us timing metrics but didn’t include the actual user and assistant messages when using Nova Sonic. Langfuse can ingest OpenTelemetry spans, but without the transcript or assistant responses it can’t perform sentiment analysis or conversation quality scoring. Using a traditional cascaded pipeline (separate STT + LLM + TTS) solved this – we were able to log the text at each stage – but it blew up our latency and destroyed the conversational feel. A polite plea to the Nova Sonic team: please expose hooks or events that include the full message content along with the existing audio stream. Even just surfacing the text transcript in the OpenTelemetry spans would enable deeper observability through tools like Langfuse. The Pipecat docs show how to attach span attributes and export spans via OTLP; all that’s missing is the actual content.
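To make the ask concrete, this is roughly the shape of attributes we wish each Nova Sonic turn span carried. The attribute keys are our own invention, nothing here is Nova Sonic’s actual API; in real OpenTelemetry code you would attach them with `span.set_attribute(...)`:

```python
def transcript_attributes(turn: int, role: str, text: str) -> dict:
    """Span attributes we'd like each conversation turn to carry.

    Attribute keys are hypothetical; this is a wishlist, not an existing hook.
    """
    return {
        "conversation.turn": turn,
        "conversation.role": role,  # "user" or "assistant"
        "conversation.text": text,  # the missing transcript content
    }
```

With content like this on the spans, a tool such as Langfuse could score conversation quality straight from the trace stream.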
💬 Sentiment Analysis Needs a Brain
Amazon Comprehend gives you four labels: Positive, Negative, Neutral, or Mixed. That’s it. A confused caller asking “wait, are you a robot?” gets marked Neutral. Not helpful.
What I want is a full “LLM as a judge” setup that can score things like:
Confusion or misunderstanding
Failed intent matches
Signs of customer impatience
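A minimal sketch of what that judge could look like: build a rubric prompt, send it with the transcript to any capable LLM, and parse a strict JSON verdict. The dimension names and the 1‑to‑5 scale are our own choices, and the model call itself is omitted:

```python
import json


def build_judge_prompt(transcript: str) -> str:
    """Rubric prompt for an LLM-as-judge pass over a call transcript."""
    return (
        "Rate this restaurant call from 1 (bad) to 5 (good) on each dimension "
        "and reply with JSON only, e.g. "
        '{"confusion": 1, "intent_match": 5, "impatience": 2}.\n\n'
        "Transcript:\n" + transcript
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply; raise if any dimension is missing."""
    verdict = json.loads(raw)
    missing = {"confusion", "intent_match", "impatience"} - verdict.keys()
    if missing:
        raise ValueError(f"judge reply missing {missing}")
    return verdict
```

Run nightly over the day’s DynamoDB transcripts, this would surface the confused and impatient calls that Comprehend’s four labels flatten into “Neutral.”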
☁️ Scaling Takes Work
Nova Sonic itself scales fine. The headaches are around container start times, autoscaling groups, and maintaining a healthy warm pool of Pipecat agents that can take a call in <1 second. It's very doable, but not plug-and-play.
📞 Reservations API & Business Config
Our agent can answer basic questions, but to be genuinely useful it must take reservations. We plan to integrate an external API like OpenTable or Resy. This will require mapping each PSTN phone number to a business configuration (hours, menu data, reservation API credentials) and handling concurrency limits. The LLM will need to call the reservation API with the correct time and party size, confirm availability and update the caller accordingly. Thankfully, Pipecat already supports tool calls, so this integration should slot right in once the API endpoints are wired up.
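Here is a sketch of the number‑to‑config mapping and the validation the reservation tool would run before ever touching OpenTable or Resy; every name and value is illustrative:

```python
# Illustrative mapping from inbound PSTN number to per-business config.
BUSINESS_CONFIGS = {
    "+15551230001": {
        "name": "Trattoria Demo",
        "hours": {"open": 17, "close": 22},  # 24h clock
        "max_party_size": 8,
    },
}


def validate_reservation(called_number: str, hour: int, party_size: int) -> str:
    """Validate a reservation request against the business config.

    A real integration would follow a successful validation with a call to
    the reservation API (OpenTable, Resy, ...) using stored credentials.
    """
    cfg = BUSINESS_CONFIGS.get(called_number)
    if cfg is None:
        return "Sorry, I don't recognize this restaurant."
    if not (cfg["hours"]["open"] <= hour < cfg["hours"]["close"]):
        return (
            f"{cfg['name']} takes reservations between "
            f"{cfg['hours']['open']}:00 and {cfg['hours']['close']}:00."
        )
    if party_size > cfg["max_party_size"]:
        return f"We can seat parties of up to {cfg['max_party_size']}."
    return f"Booked: party of {party_size} at {hour}:00 at {cfg['name']}."
```

Keyed on the dialed number, the same agent code can serve many restaurants, each with its own hours, menu data and API credentials.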
Final Thoughts
Before this project, I admit I harbored some skepticism about the Nova family of models. I assumed they were the economy class of LLMs compared to Anthropic or OpenAI. After hearing Nova Sonic respond to a frustrated diner with warmth and empathy at sub‑second latency, I stand corrected. The voice quality is excellent, the streaming API is robust, and Pipecat makes orchestrating the pipeline almost fun.
That said, the missing transcript data in OpenTelemetry spans is a significant gap. Without it, we can’t effectively evaluate or improve the agent’s behavior. If Nova Sonic wants to be the default choice for conversational agents, improving its observability hooks and embracing standards like OpenTelemetry will be key.

We’re far from done. There’s a PSTN phone number to connect, an LLM judge to implement, reservation APIs to call and a Pipecat cloud deployment to configure. But with Nova Sonic and Pipecat at the core, we’re confident we can get there. The dream? A future where diners call and an AI host answers instantly, handles the reservation and even cracks a joke about the daily special. Stay tuned – and Nova Sonic team, we’re waiting for those observability updates!