
Posts

Showing posts from June, 2025

Analyzing 34 GB of Wikipedia Text Using Information Theory and Machine Learning

Author: Dragomir Božoki
GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp

Introduction

Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works, such as which words tend to appear together or which words are most relevant to a topic, we would need to analyze a huge amount of text. This project aimed to do exactly that, but automatically, using tools from Big Data processing, information theory, and machine learning.

Step 1: Data Preparation

Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis. The first step was to clean the text, which involved:

- Removing all HTML, XML, and Wiki-specific tags
- Converting all text to lowercase
- Removing punctuation
- Splitting text into single words (unigrams) and consecutive word pairs (bigrams)

I used WikiExtractor ...
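The cleaning steps listed under Step 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the project's actual pipeline: the function names `clean_text` and `bigrams`, the regular expressions, and the sample string are all assumptions made for the example, and wiki-specific markup is assumed to have been handled upstream by a dedicated extractor.

```python
import re
from collections import Counter

# Assumed patterns for this sketch (not taken from the original project).
TAG_RE = re.compile(r"<[^>]+>")        # HTML/XML-style tags such as <doc ...>
PUNCT_RE = re.compile(r"[^\w\s]")      # anything that is not a word character or whitespace


def clean_text(raw):
    """Strip tags, lowercase, remove punctuation, and split into words."""
    text = TAG_RE.sub(" ", raw)
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)
    return text.split()


def bigrams(tokens):
    """Return consecutive word pairs from a list of tokens."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical sample input, shaped like one extracted article.
sample = '<doc id="1">Information theory studies information. Theory matters!</doc>'

tokens = clean_text(sample)
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

print(tokens)
print(unigram_counts.most_common(2))
print(bigram_counts.most_common(1))
```

On a corpus of this size, the same per-document functions would typically be applied in a distributed framework rather than in a single Python process; the sketch only shows the logic of each step.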

Deep Learning Algorithm for Lip Reading without audio - Just based on Lip movement!

Author: Dragomir Božoki

1. Introduction

During the final year of my Bachelor's studies, once I realized I would pass all my exams on time, I knew it was time to choose a topic for my thesis. Throughout my studies, I gradually became interested in signal processing, a field where you can work with various types of signals to generate new images, analyze text, interpret brain activity, and more.

This interest grew even stronger when AI started booming in late 2022. Topics related to machine learning and data analysis suddenly became the center of attention across the tech industry. That was the moment I knew: this is the field I want to specialize in.

When I spoke with my professors about potential thesis topics, they proposed an idea that instantly caught my attention: teaching a machine to understand language purely through visual input, without any audio, by analyzing only lip movements. The concept sounded abso...

How I Built a Bilingual Voice Assistant with Two Brains (and What I Learned Along the Way)

Introduction

This project is the practical part of my Master’s thesis, developed during my Erasmus+ exchange at the University of Patras, within the ESDA Lab (Embedded Systems and Digital Applications). What started as a standard “upgrade the assistant” task turned into a full-on multilingual, multi-component NLP system. The goal was to make a voice assistant smarter: not just technically smarter, but able to handle real, everyday language, in both English and Greek.

I took Kalliope, an open-source modular voice assistant, and upgraded it with two powerful LLMs:

- One to recognize what the user wants (intent classification)
- One to generate a helpful response if the first one gets confused (generative fallback)

On top of that, I added typo correction, semantic search, and multilingual support. I learned a ton, from dealing with transformer models and vector search, to debugging legacy Python dependencies at midnight on a university server.

Why Ka...
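The two-model design described above (an intent classifier with a generative fallback) can be illustrated with a small routing sketch. Everything here is hypothetical: the excerpt does not say how "confused" is detected, so this sketch assumes a confidence score compared against a threshold, and `route`, `toy_classifier`, `toy_generator`, and the value `0.7` are stand-ins invented for the example rather than parts of Kalliope or of the thesis code.

```python
def route(utterance, classify, generate, threshold=0.7):
    """Send the utterance to the intent model first; fall back to generation.

    classify: callable returning (intent_name, confidence in [0, 1])
    generate: callable returning a free-text reply
    threshold: assumed cut-off below which the classifier counts as "confused"
    """
    intent, confidence = classify(utterance)
    if confidence >= threshold:
        return ("intent", intent)
    return ("generated", generate(utterance))


# Toy stand-ins for the two models, for illustration only.
def toy_classifier(text):
    if "light" in text.lower():
        return ("turn_on_light", 0.95)
    return ("unknown", 0.2)


def toy_generator(text):
    return f"Sorry, I am not sure how to handle: {text}"


print(route("Turn on the light", toy_classifier, toy_generator))
print(route("Tell me a story", toy_classifier, toy_generator))
```

Passing the two models in as callables keeps the routing logic independent of any particular model library, which makes the threshold easy to tune or test in isolation.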