
How I Built a Bilingual Voice Assistant with Two Brains (and What I Learned Along the Way)


Introduction

This project is the practical part of my Master’s thesis, developed during my Erasmus+ exchange at the University of Patras, within the ESDA Lab (Embedded Systems and Digital Applications).

What started as a standard “upgrade the assistant” task turned into a full-on multilingual, multi-component NLP system. The goal was to make a voice assistant smarter — not just technically smarter, but able to handle real, everyday language, in both English and Greek.

I took Kalliope, an open-source modular voice assistant, and upgraded it with two powerful LLMs:

  • One to recognize what the user wants (intent classification)

  • One to generate a helpful response if the first one gets confused (generative fallback)

On top of that, I added typo correction, semantic search, and multilingual support. I learned a ton — from dealing with transformer models and vector search, to debugging legacy Python dependencies at midnight on a university server.



Why Kalliope, and Why Two Models?

Kalliope is lightweight, modular, and easy to extend — but it expects fixed commands. If you say “Turn on the kitchen lights”, it works. But if you say:

  • “I want to cook, turn on the lights”

  • “Lighten up the kitchen”

  • “It’s getting dark in the kitchen — lights, please”

...you’ll get silence. No action. No understanding.

This is a huge limitation for real users who don’t think like bots. So I decided to give Kalliope a brain transplant — actually, two. I added:

  1. A multilingual intent classifier (based on XLM-RoBERTa), to understand flexible user input.

  2. A generative fallback (mT5-small) that can respond naturally when the classifier isn’t confident.


The System Design

The final architecture includes:

  • A fine-tuned XLM-RoBERTa intent classifier (supports both English and Greek)

  • A FAISS vector search engine with SentenceTransformer embeddings

  • A fallback mT5-small model for generative responses

  • A typo correction layer for noisy user input

  • A multilingual FAQ database per domain (tourism, e-commerce, education, etc.)

  • Full integration with Kalliope’s voice command loop
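
To make the moving parts concrete, here is a minimal sketch of how these components could be wired together at startup. The classifier checkpoint path, the FAQ file names, and the specific SentenceTransformer model are placeholders, since the post doesn't pin them down:

```python
import json

import faiss
from sentence_transformers import SentenceTransformer
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Fine-tuned intent classifier (placeholder checkpoint path)
clf_tokenizer = AutoTokenizer.from_pretrained("checkpoints/xlmr-intent")
clf_model = AutoModelForSequenceClassification.from_pretrained("checkpoints/xlmr-intent")

# Embedding model for semantic search over the FAQ database
# (a multilingual SentenceTransformer; the exact model is an assumption)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# FAISS index plus the aligned question/answer texts (see the FAISS
# challenge below for how these two files stay in sync)
faq_index = faiss.read_index("faq_index.bin")
with open("faq_texts.json", encoding="utf-8") as f:
    faq_texts = json.load(f)

# Generative fallback
gen_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
```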


Challenges Faced

1. Switching Between Classifier and Generator

Problem: How do we know when to fall back?
Solution: We added a confidence threshold. If the classifier score was too low or the predicted label was “unknown,” we used FAISS to find similar questions and passed context to mT5 for a smooth reply.
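
In code, that routing decision could look roughly like this. The threshold value, the "unknown" label handling, and the prompt format for mT5 are illustrative, and the sketch reuses the components loaded in the startup sketch above:

```python
import torch

CONFIDENCE_THRESHOLD = 0.7  # illustrative; the post doesn't state the real value

def handle_utterance(text: str) -> str:
    # 1. Classify the intent and read off a confidence score
    inputs = clf_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf_model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    score, label_id = probs.max(dim=-1)
    label = clf_model.config.id2label[label_id.item()]

    # 2. Confident and known -> hand the intent to the matching synapse
    if score.item() >= CONFIDENCE_THRESHOLD and label != "unknown":
        return label

    # 3. Low confidence or "unknown" -> retrieve similar FAQ entries
    #    and let mT5 generate a reply from that context
    query_vec = embedder.encode([text]).astype("float32")
    _, idx = faq_index.search(query_vec, 3)
    context = " ".join(
        faq_texts[i]["question"] + " " + faq_texts[i]["answer"] for i in idx[0]
    )
    prompt = f"context: {context} question: {text}"
    gen_inputs = gen_tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = gen_model.generate(**gen_inputs, max_new_tokens=64)
    return gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)
```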


2. Typo Robustness

Problem: People make spelling mistakes, especially in Greek and Greeklish.
Solution: We created a typos.csv file with corrections, and applied it as a preprocessing step before classification and search.
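
A minimal version of that preprocessing step, assuming typos.csv holds one typo,correction pair per row (the actual file layout isn't shown in the post):

```python
import csv

# Load corrections once at startup. Assumed layout: "typo,correction" per row.
with open("typos.csv", encoding="utf-8") as f:
    corrections = {typo: fix for typo, fix in csv.reader(f)}

def correct_typos(text: str) -> str:
    # Word-level substitution, applied before classification and search
    return " ".join(corrections.get(word, word) for word in text.lower().split())

# correct_typos("trun on the kitchn lights") -> "turn on the kitchen lights"
# (assuming "trun" and "kitchn" appear in typos.csv)
```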


3. FAISS Format Issues

Problem: FAISS stores vectors, not the original questions or answers.
Solution: We stored the vectors in .bin and aligned them with a .json file containing the text. This dual setup let us retrieve semantic matches and pass them to the generator.
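
Here is a sketch of that dual setup. The key invariant is ordering: position i in the FAISS index lines up with entry i in the JSON file, so the row positions returned by a search double as keys into the text list. The faq_entries structure and the embedder from the startup sketch are assumptions:

```python
import json

import faiss

# Assumed structure: one dict per FAQ entry, per domain
faq_entries = [
    {"question": "How do I rent a car?", "answer": "You can book one at ..."},
    # ...
]

# Build step: embed every FAQ question, keeping texts in index order
questions = [entry["question"] for entry in faq_entries]
vectors = embedder.encode(questions).astype("float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
index.add(vectors)

faiss.write_index(index, "faq_index.bin")
with open("faq_texts.json", "w", encoding="utf-8") as f:
    json.dump(faq_entries, f, ensure_ascii=False)

# Query step: FAISS returns row positions, which index straight into the JSON list
_, idx = index.search(embedder.encode(["how do I rent a car?"]).astype("float32"), 1)
best_match = faq_entries[idx[0][0]]
```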


4. Dataset Creation

Problem: We needed multilingual data for each domain.
Solution: We manually wrote, translated, and paraphrased questions across four domains: tourism, rent-a-car, e-commerce, and education — with typo variants to simulate real usage.
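
Purely for illustration (these are not actual rows from the dataset), a labeled example could look like this. The kitchen-on intent appears in the live example later in the post; the rent-a-car intent name is made up:

```python
# Each utterance carries an intent label and a language tag; typo variants
# reuse the label of their clean counterpart.
dataset = [
    {"text": "turn on the kitchen lights", "intent": "kitchen-on", "lang": "en"},
    {"text": "trun on the kitchn lights", "intent": "kitchen-on", "lang": "en"},
    {"text": "άναψε τα φώτα στην κουζίνα", "intent": "kitchen-on", "lang": "el"},
    {"text": "πόσο κοστίζει η ενοικίαση αυτοκινήτου;", "intent": "rent-a-car-price", "lang": "el"},
]
```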


5. Overfitting & Underfitting

Problem: The classifier overfitted on simple keywords, but failed with longer paraphrases.
Solution: We applied dataset balancing, dropout, noise injection, and early stopping during training to prevent overfitting and improve generalization.
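
With Hugging Face's Trainer, those countermeasures might look like the sketch below. Every hyperparameter here is illustrative, and the balancing and noise injection are assumed to happen while building the training set:

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

num_intents = 12  # illustrative

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=num_intents,
    hidden_dropout_prob=0.2,             # extra dropout against keyword overfitting
    attention_probs_dropout_prob=0.2,
)

args = TrainingArguments(
    output_dir="checkpoints/xlmr-intent",
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required by the early-stopping callback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # tokenized, balanced, noise-injected (prepared elsewhere)
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```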


6. LLM Integration in Kalliope

Problem: Kalliope wasn’t built to work with PyTorch or Hugging Face models.
Solution: We wrapped the models in Python functions and triggered them from a custom Kalliope signal. This kept the assistant modular while enabling advanced NLP capabilities.
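
Kalliope's documented extension point for custom Python code is the neuron module, so the wrapper could look roughly like this. The class name and the query parameter are mine, and handle_utterance is the routing function sketched earlier:

```python
from kalliope.core.NeuronModule import NeuronModule

class IntentRouter(NeuronModule):
    """Sketch of a custom neuron bridging Kalliope and the NLP pipeline."""

    def __init__(self, **kwargs):
        super(IntentRouter, self).__init__(**kwargs)

        # The transcribed utterance, passed in from the synapse definition
        query = kwargs.get("query")

        # Classifier first, FAISS + mT5 fallback second (sketched earlier)
        answer = handle_utterance(query)

        # Hand the text back to Kalliope's TTS through the neuron template
        self.say({"answer": answer})
```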


7. Snowboy and Python 3.8

Problem: Kalliope uses Snowboy for voice wake-up, which only works with old Python versions.
Solution: After hours of dependency errors, we patched Snowboy to work with Python 3.8, and eventually containerized everything with Docker to ensure stability across systems.


8. Model Size and Deployment

Problem: mT5 is large, and local devices have limited memory.
Solution: We used the smaller variant (google/mt5-small) and hosted the model on Hugging Face Hub and Google Drive. First-time downloads are cached to improve load time.
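
The caching comes for free when loading through the transformers library: the first call downloads the checkpoint into the local Hugging Face cache, and later startups read from disk. The cache_dir below is a placeholder; by default the cache lives under ~/.cache/huggingface:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# First call downloads the checkpoint (~1.2 GB for mt5-small); later calls
# load straight from the local cache.
gen_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", cache_dir="/srv/models")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small", cache_dir="/srv/models")
```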




Live Example: From Speech to Action

Let’s look at what actually happens when the assistant is running and a user gives a voice command. Below is a real interaction, showing how speech is recognized, processed, and executed in real time — all in a bilingual setup with Greek responses and English input.

This example shows how the system detects the user's intent ("I'm going to cook, turn on the lights please") and maps it to the internal command kitchen-on, which triggers a response and action.


This interaction demonstrates the complete loop:

  1. Wake word is detected.

  2. Assistant prompts the user in Greek.

  3. The speech recognizer transcribes the input.

  4. The classifier understands the user's intent.

  5. A corresponding synapse is executed.

  6. The assistant gives a spoken confirmation — again, in Greek.

And the best part? The input didn’t have to be a rigid command. It was a natural sentence — and the assistant still understood it.


What’s Next?

Now that the system works reliably on my laptop, I’m preparing to test it on:

  • A university server, for low-latency responses

  • A Raspberry Pi, to explore edge-device deployment

  • Other environments, to benchmark real-time performance

We’re also working on a scientific paper to present this architecture — especially the hybrid fallback system — in academic NLP and voice assistant contexts. Beyond that, my goal is to push this assistant closer to the fluidity of ChatGPT, where fallback responses feel seamless, not like a plan B.

This project gave me real experience with:

  • Multilingual NLP

  • Retrieval-Augmented Generation (RAG)

  • FAISS search

  • Integration of transformers into live systems

  • And fighting with old Python libraries late at night


Final Thoughts

What started as a practical thesis project turned into a full-stack NLP challenge — and honestly, one of the most rewarding things I’ve worked on.

The assistant now understands flexible commands in two languages, switches intelligently between classification and generation, and handles real-world input with typos, paraphrases, and ambiguity.

It’s fast, lightweight, and extendable. And it’s just the beginning.


Special thanks to ESDA Lab, University of Patras, for the mentorship and resources during my Erasmus semester. This project wouldn’t exist without their support.

If you’re interested in building something similar or want to collaborate, feel free to get in touch — I’m always happy to exchange ideas.

GitHub  LinkedIn


