
call-i

My 2024 hackathon participation - we won a MacBook


We won a hackathon in front of 3000 people and the CTO of SAP gave us our prizes!!

[Image: winner picture]

This post is a more technical write-up of what we built.

Challenge

Our challenge title was "Customer Chatbots with LLMs" in the context of the SAP ecosystem.

Our Approach

In 2024, people are already used to excellent chatbots like ChatGPT. Therefore, we wanted to present something most people have never experienced before: a voice chatbot - an automated call center.

Why?

  • Wow-Factor: A voice chatbot can be presented in a short demo and really impress the audience.
  • Cost Reduction: The jury for the first stage of the hackathon was a group of 30 CIOs, who are mainly concerned with keeping costs low. Human call centers are an immense cost factor for companies, so replacing them with a voice chatbot could save a ton of money.
  • Customer Experience: Voice chatbots are available 24/7 with no waiting time.

How I built it

The following steps need to happen for a real human-AI voice interaction:

  1. Voice Detection: The user speaks, speech is detected and transcribed to text
  2. Answer Generation: The text is sent to an AI model, which generates an answer
  3. Speech Synthesis: The answer is converted to speech and played to the user

It is super important to keep the latency between the user speaking and the AI answering as low as possible. That means the speech has to be transcribed while the user is still talking and instantly streamed to the AI model; the model's response has to be streamed to the speech synthesis; and the resulting audio has to be streamed back to the user.

Streaming means sending data in small chunks instead of waiting for all of it to be available. The best example is ChatGPT, where you can watch the text appear while the model generates it.
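To make that concrete, here is a minimal TypeScript sketch of consuming a streamed chat completion with the OpenAI Node SDK. The model name and prompt are just placeholders, and in our setup the client pointed at Azure instead:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// With stream: true the API yields tokens as they are generated,
// so downstream steps can start before the full answer exists.
const stream = await client.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: "Where is my order?" }],
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? "";
  process.stdout.write(delta); // forward each small chunk immediately
}
```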

Voice Detection

For voice detection, we used the browser's built-in speech recognition API (the Web Speech API). The user speaks into the browser, the speech is transcribed to text, and the text is streamed to the AI model. Ideally, you would use a more capable speech-to-text model like Whisper.
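A minimal sketch of what that browser side can look like. `sendToModel` is a hypothetical helper standing in for however you forward text to the backend:

```typescript
// Hypothetical helper that forwards transcribed text to the backend.
declare function sendToModel(text: string): void;

// Chrome exposes the Web Speech API under a webkit prefix.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";
recognition.continuous = true;     // keep listening across pauses
recognition.interimResults = true; // emit partial transcripts while the user speaks

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      sendToModel(result[0].transcript); // stream each finished utterance onward
    }
  }
};

recognition.start();
```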

Answer Generation

Here, GPT-4 hosted on an Azure instance and connected to an SAP system generates the answer and triggers actions like sending an email or calling an API.
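A rough sketch of how that can look with the openai Node SDK's Azure client and the function-calling (tools) interface. The endpoint and the `send_order_confirmation` tool are made up for illustration; the real SAP call would happen where the comment is:

```typescript
import { AzureOpenAI } from "openai";

const client = new AzureOpenAI({
  endpoint: "https://my-instance.openai.azure.com", // placeholder endpoint
  apiKey: process.env.AZURE_OPENAI_API_KEY,
  apiVersion: "2024-06-01",
});

const response = await client.chat.completions.create({
  model: "gpt-4", // name of the Azure deployment
  messages: [{ role: "user", content: "Please resend my order confirmation." }],
  // Expose a backend action as a tool the model may decide to call.
  tools: [
    {
      type: "function",
      function: {
        name: "send_order_confirmation", // hypothetical action
        description: "Resend the confirmation email for a given order",
        parameters: {
          type: "object",
          properties: { orderId: { type: "string" } },
          required: ["orderId"],
        },
      },
    },
  ],
});

const toolCall = response.choices[0].message.tool_calls?.[0];
if (toolCall) {
  const args = JSON.parse(toolCall.function.arguments);
  // ...call the SAP API / send the email using args.orderId here...
}
```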

Speech Synthesis

The GPT-4 answer is streamed to ElevenLabs, which converts the text to speech and streams the audio back to the user.
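A rough sketch of that last hop against ElevenLabs' HTTP streaming endpoint. The voice ID and the `playOrForward` sink are placeholders, and in the real pipeline the text itself also arrives incrementally from GPT-4 rather than as one finished sentence:

```typescript
// Hypothetical sink: pushes audio chunks to the user's player.
declare function playOrForward(chunk: Uint8Array): void;

const VOICE_ID = "your-voice-id"; // placeholder

const res = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/stream`,
  {
    method: "POST",
    headers: {
      "xi-api-key": process.env.ELEVENLABS_API_KEY!,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text: "Your order will arrive tomorrow." }),
  }
);

// Forward the audio as it arrives instead of buffering the whole file.
for await (const chunk of res.body as any) {
  playOrForward(chunk);
}
```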

Tech stack

  • ElevenLabs: Speech Synthesis
  • Azure OpenAI: Answer Generation, Function Calling
  • Python
  • Docker
  • React

For the demo, voice detection was performed in the browser (Whisper was not streamable and led to long latencies).