Author: Dragomir Božoki
GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp

Introduction

Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works, such as which words tend to appear together or which words are most relevant to a topic, we would need to analyze a huge amount of text. This project aimed to do exactly that, but automatically, using tools from Big Data processing, information theory, and machine learning.

Step 1: Data Preparation

Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis. The first step was to clean the text, which involved:

- Removing all HTML, XML, and Wiki-specific tags
- Converting all text to lowercase
- Removing punctuation
- Splitting text into single words (unigrams) and consecutive word pairs (bigrams)

I used WikiExtractor ...
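
As a rough illustration of the cleaning steps listed under Step 1, a minimal plain-Python sketch might look like the following. The regular expressions and the function names (`clean_text`, `unigrams`, `bigrams`) are assumptions made for this example, not the project's actual code, and the patterns only handle simple markup; real Wikipedia dumps need a dedicated tool such as WikiExtractor.

```python
import re

# Assumed patterns for illustration only; real Wiki markup is far messier.
TAG_RE = re.compile(r"<[^>]+>")                     # HTML/XML tags
WIKI_RE = re.compile(r"\{\{[^{}]*\}\}|\[\[|\]\]")   # simple templates and link brackets
PUNCT_RE = re.compile(r"[^\w\s]")                   # anything that is not a word char or whitespace


def clean_text(raw):
    """Strip tags and simple Wiki markup, lowercase, and remove punctuation."""
    text = TAG_RE.sub(" ", raw)
    text = WIKI_RE.sub(" ", text)
    text = text.lower()
    return PUNCT_RE.sub(" ", text)


def unigrams(text):
    """Split cleaned text into single words."""
    return text.split()


def bigrams(tokens):
    """Pair each word with the word that follows it."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    tokens = unigrams(clean_text("<b>Hello</b>, [[World]]! {{cite}} Big Data."))
    print(tokens)
    print(bigrams(tokens))
```

Replacing markup and punctuation with a space rather than an empty string keeps neighbouring words from being glued together, which matters for the bigram counts.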
Author: Dragomir Božoki

1. Introduction

During the final year of my Bachelor's studies, once I realized I would pass all my exams on time, I knew it was time to choose a topic for my thesis. Throughout my studies, I gradually became interested in signal processing, a field where you can work with various types of signals to generate new images, analyze text, interpret brain activity, and more.

This interest grew even stronger when AI started booming in late 2022. Topics related to machine learning and data analysis suddenly became the center of attention across the tech industry. That was the moment I knew that this was the field I wanted to specialize in.

When I spoke with my professors about potential thesis topics, they proposed an idea that instantly caught my attention: teaching a machine to understand language purely through visual input, without any audio, by analyzing only lip movements. The concept sounded abso...