
How I Built a Bilingual Voice Assistant with Two Brains (and What I Learned Along the Way)


Introduction

This project is the practical part of my Master’s thesis, developed during my Erasmus+ exchange at the University of Patras, within the ESDA Lab (Embedded Systems and Digital Applications).

What started as a standard “upgrade the assistant” task turned into a full-on multilingual, multi-component NLP system. The goal was to make a voice assistant smarter — not just technically smarter, but able to handle real, everyday language, in both English and Greek.

I took Kalliope, an open-source modular voice assistant, and upgraded it with two powerful LLMs:

  • One to recognize what the user wants (intent classification)

  • One to generate a helpful response if the first one gets confused (generative fallback)

On top of that, I added typo correction, semantic search, and multilingual support. I learned a ton — from dealing with transformer models and vector search, to debugging legacy Python dependencies at midnight on a university server.



Why Kalliope, and Why Two Models?

Kalliope is lightweight, modular, and easy to extend — but it expects fixed commands. If you say “Turn on the kitchen lights”, it works. But if you say:

  • “I want to cook, turn on the lights”

  • “Lighten up the kitchen”

  • “It’s getting dark in the kitchen — lights, please”

...you’ll get silence. No action. No understanding.

This is a huge limitation for real users who don’t think like bots. So I decided to give Kalliope a brain transplant — actually, two. I added:

  1. A multilingual intent classifier (based on XLM-RoBERTa), to understand flexible user input.

  2. A generative fallback (mT5-small) that can respond naturally when the classifier isn’t confident.


The System Design

The final architecture includes:

  • A fine-tuned XLM-RoBERTa intent classifier (supports both English and Greek)

  • A FAISS vector search engine with SentenceTransformer embeddings

  • A fallback mT5-small model for generative responses

  • A typo correction layer for noisy user input

  • A multilingual FAQ database per domain (tourism, e-commerce, education, etc.)

  • Full integration with Kalliope’s voice command loop
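
To make the moving parts concrete, here is a minimal sketch of how these components could be wired together at startup. The classifier checkpoint path, the FAQ file names, and the specific SentenceTransformer model are placeholders, since the post doesn't pin them down:

```python
import json

import faiss
from sentence_transformers import SentenceTransformer
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

# Fine-tuned intent classifier (placeholder checkpoint path)
clf_tokenizer = AutoTokenizer.from_pretrained("checkpoints/xlmr-intent")
clf_model = AutoModelForSequenceClassification.from_pretrained("checkpoints/xlmr-intent")

# Embedding model for semantic search over the FAQ database
# (a multilingual SentenceTransformer; the exact model is an assumption)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# FAISS index plus the aligned question/answer texts (see the FAISS
# challenge below for how these two files stay in sync)
faq_index = faiss.read_index("faq_index.bin")
with open("faq_texts.json", encoding="utf-8") as f:
    faq_texts = json.load(f)

# Generative fallback
gen_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")
```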


Challenges Faced

1. Switching Between Classifier and Generator

Problem: How do we know when to fall back?
Solution: We added a confidence threshold. If the classifier score was too low or the predicted label was “unknown,” we used FAISS to find similar questions and passed context to mT5 for a smooth reply.
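
In code, that routing decision could look roughly like this. The threshold value, the "unknown" label handling, and the prompt format for mT5 are illustrative, and the sketch reuses the components loaded in the startup sketch above:

```python
import torch

CONFIDENCE_THRESHOLD = 0.7  # illustrative; the post doesn't state the real value

def handle_utterance(text: str) -> str:
    # 1. Classify the intent and read off a confidence score
    inputs = clf_tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = clf_model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    score, label_id = probs.max(dim=-1)
    label = clf_model.config.id2label[label_id.item()]

    # 2. Confident and known -> hand the intent to the matching synapse
    if score.item() >= CONFIDENCE_THRESHOLD and label != "unknown":
        return label

    # 3. Low confidence or "unknown" -> retrieve similar FAQ entries
    #    and let mT5 generate a reply from that context
    query_vec = embedder.encode([text]).astype("float32")
    _, idx = faq_index.search(query_vec, 3)
    context = " ".join(
        faq_texts[i]["question"] + " " + faq_texts[i]["answer"] for i in idx[0]
    )
    prompt = f"context: {context} question: {text}"
    gen_inputs = gen_tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = gen_model.generate(**gen_inputs, max_new_tokens=64)
    return gen_tokenizer.decode(output_ids[0], skip_special_tokens=True)
```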


2. Typo Robustness

Problem: People make spelling mistakes, especially in Greek and Greeklish.
Solution: We created a typos.csv file with corrections, and applied it as a preprocessing step before classification and search.
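
A minimal version of that preprocessing step, assuming typos.csv holds one typo,correction pair per row (the actual file layout isn't shown in the post):

```python
import csv

# Load corrections once at startup. Assumed layout: "typo,correction" per row.
with open("typos.csv", encoding="utf-8") as f:
    corrections = {typo: fix for typo, fix in csv.reader(f)}

def correct_typos(text: str) -> str:
    # Word-level substitution, applied before classification and search
    return " ".join(corrections.get(word, word) for word in text.lower().split())

# correct_typos("trun on the kitchn lights") -> "turn on the kitchen lights"
# (assuming "trun" and "kitchn" appear in typos.csv)
```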


3. FAISS Format Issues

Problem: FAISS stores vectors, not the original questions or answers.
Solution: We stored the vectors in .bin and aligned them with a .json file containing the text. This dual setup let us retrieve semantic matches and pass them to the generator.
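
Here is a sketch of that dual setup. The key invariant is ordering: position i in the FAISS index lines up with entry i in the JSON file, so the row positions returned by a search double as keys into the text list. The faq_entries structure and the embedder from the startup sketch are assumptions:

```python
import json

import faiss

# Assumed structure: one dict per FAQ entry, per domain
faq_entries = [
    {"question": "How do I rent a car?", "answer": "You can book one at ..."},
    # ...
]

# Build step: embed every FAQ question, keeping texts in index order
questions = [entry["question"] for entry in faq_entries]
vectors = embedder.encode(questions).astype("float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 search over the embeddings
index.add(vectors)

faiss.write_index(index, "faq_index.bin")
with open("faq_texts.json", "w", encoding="utf-8") as f:
    json.dump(faq_entries, f, ensure_ascii=False)

# Query step: FAISS returns row positions, which index straight into the JSON list
_, idx = index.search(embedder.encode(["how do I rent a car?"]).astype("float32"), 1)
best_match = faq_entries[idx[0][0]]
```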


4. Dataset Creation

Problem: We needed multilingual data for each domain.
Solution: We manually wrote, translated, and paraphrased questions across four domains: tourism, rent-a-car, e-commerce, and education — with typo variants to simulate real usage.
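
Purely for illustration (these are not actual rows from the dataset), a labeled example could look like this. The kitchen-on intent appears in the live example later in the post; the rent-a-car intent name is made up:

```python
# Each utterance carries an intent label and a language tag; typo variants
# reuse the label of their clean counterpart.
dataset = [
    {"text": "turn on the kitchen lights", "intent": "kitchen-on", "lang": "en"},
    {"text": "trun on the kitchn lights", "intent": "kitchen-on", "lang": "en"},
    {"text": "άναψε τα φώτα στην κουζίνα", "intent": "kitchen-on", "lang": "el"},
    {"text": "πόσο κοστίζει η ενοικίαση αυτοκινήτου;", "intent": "rent-a-car-price", "lang": "el"},
]
```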


5. Overfitting & Underfitting

Problem: The classifier overfitted on simple keywords, but failed with longer paraphrases.
Solution: We applied dataset balancing, dropout, noise injection, and early stopping during training to prevent overfitting and improve generalization.
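
With Hugging Face's Trainer, those countermeasures might look like the sketch below. Every hyperparameter here is illustrative, and the balancing and noise injection are assumed to happen while building the training set:

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

num_intents = 12  # illustrative

model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=num_intents,
    hidden_dropout_prob=0.2,             # extra dropout against keyword overfitting
    attention_probs_dropout_prob=0.2,
)

args = TrainingArguments(
    output_dir="checkpoints/xlmr-intent",
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,         # required by the early-stopping callback
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # tokenized, balanced, noise-injected (prepared elsewhere)
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```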


6. LLM Integration in Kalliope

Problem: Kalliope wasn’t built to work with PyTorch or Hugging Face models.
Solution: We wrapped the models in Python functions and triggered them from a custom Kalliope signal. This kept the assistant modular while enabling advanced NLP capabilities.
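
Kalliope's documented extension point for custom Python code is the neuron module, so the wrapper could look roughly like this. The class name and the query parameter are mine, and handle_utterance is the routing function sketched earlier:

```python
from kalliope.core.NeuronModule import NeuronModule

class IntentRouter(NeuronModule):
    """Sketch of a custom neuron bridging Kalliope and the NLP pipeline."""

    def __init__(self, **kwargs):
        super(IntentRouter, self).__init__(**kwargs)

        # The transcribed utterance, passed in from the synapse definition
        query = kwargs.get("query")

        # Classifier first, FAISS + mT5 fallback second (sketched earlier)
        answer = handle_utterance(query)

        # Hand the text back to Kalliope's TTS through the neuron template
        self.say({"answer": answer})
```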


7. Snowboy and Python 3.8

Problem: Kalliope uses Snowboy for voice wake-up, which only works with old Python versions.
Solution: After hours of dependency errors, we patched Snowboy to work with Python 3.8, and eventually containerized everything with Docker to ensure stability across systems.


8. Model Size and Deployment

Problem: mT5 is large, and local devices have limited memory.
Solution: We used the smaller variant (google/mt5-small) and hosted the model on Hugging Face Hub and Google Drive. First-time downloads are cached to improve load time.
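
The caching comes for free when loading through the transformers library: the first call downloads the checkpoint into the local Hugging Face cache, and later startups read from disk. The cache_dir below is a placeholder; by default the cache lives under ~/.cache/huggingface:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# First call downloads the checkpoint (~1.2 GB for mt5-small); later calls
# load straight from the local cache.
gen_tokenizer = AutoTokenizer.from_pretrained("google/mt5-small", cache_dir="/srv/models")
gen_model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small", cache_dir="/srv/models")
```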




Live Example: From Speech to Action

Let’s look at what actually happens when the assistant is running and a user gives a voice command. Below is a real interaction, showing how speech is recognized, processed, and executed in real time — all in a bilingual setup with Greek responses and English input.

This example shows how the system detects the user's intent ("I'm going to cook, turn on the lights please") and maps it to the internal command kitchen-on, which triggers a response and action.


This interaction demonstrates the complete loop:

  1. Wake word is detected.

  2. Assistant prompts the user in Greek.

  3. The speech recognizer transcribes the input.

  4. The classifier understands the user's intent.

  5. A corresponding synapse is executed.

  6. The assistant gives a spoken confirmation — again, in Greek.

And the best part? The input didn’t have to be a rigid command. It was a natural sentence — and the assistant still understood it.


What’s Next?

Now that the system works reliably on my laptop, I’m preparing to test it on:

  • A university server, for low-latency responses

  • A Raspberry Pi, to explore edge-device deployment

  • Other environments, to benchmark real-time performance

We’re also working on a scientific paper to present this architecture — especially the hybrid fallback system — in academic NLP and voice assistant contexts. Beyond that, my goal is to push this assistant closer to the fluidity of ChatGPT, where fallback responses feel seamless, not like a plan B.

This project gave me real experience with:

  • Multilingual NLP

  • Retrieval-Augmented Generation (RAG)

  • FAISS search

  • Integration of transformers into live systems

  • And fighting with old Python libraries late at night


Final Thoughts

What started as a practical thesis project turned into a full-stack NLP challenge — and honestly, one of the most rewarding things I’ve worked on.

The assistant now understands flexible commands in two languages, switches intelligently between classification and generation, and handles real-world input with typos, paraphrases, and ambiguity.

It’s fast, lightweight, and extendable. And it’s just the beginning.


Special thanks to ESDA Lab, University of Patras, for the mentorship and resources during my Erasmus semester. This project wouldn’t exist without their support.

If you’re interested in building something similar or want to collaborate, feel free to get in touch — I’m always happy to exchange ideas.

GitHub  LinkedIn


