Author: Dragomir Božoki
GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp

Introduction

Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works (for example, which words tend to appear together or which words are most relevant to a topic), we would need to analyze a huge amount of text. This project aimed to do exactly that, but automatically, using tools from Big Data processing, information theory, and machine learning.

Step 1: Data Preparation

Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis. The first step was to clean the text, which involved:

- Removing all HTML, XML, and Wiki-specific tags
- Converting all text to lowercase
- Removing punctuation
- Splitting text into single words (unigrams) and consecutive word pairs (bigrams)

I used WikiExtractor ...
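
The cleaning steps listed under Step 1 can be sketched in plain Python as follows. This is a minimal illustration and not the project's actual code: the function names (`clean_text`, `unigrams`, `bigrams`) and the regular expressions are assumptions made for this example, and the WikiExtractor stage is not reproduced here.

```python
import re
import string

# Illustrative patterns (assumptions, not taken from the project):
TAG_RE = re.compile(r"<[^>]+>")                                # HTML/XML tags, e.g. <doc id="1">
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")                    # non-nested {{templates}}
WIKI_LINK_RE = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]")   # [[target|label]] -> label

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(raw):
    """Strip markup, lowercase, and remove punctuation from one article."""
    text = TAG_RE.sub(" ", raw)
    text = TEMPLATE_RE.sub(" ", text)
    text = WIKI_LINK_RE.sub(r"\1", text)
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def unigrams(text):
    """Split cleaned text into single words."""
    return text.split()


def bigrams(tokens):
    """Pair each token with the one that follows it."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    sample = '<doc id="1">The [[Cat|cat]] sat, {{cite}} on the Mat.</doc>'
    tokens = unigrams(clean_text(sample))
    print(tokens)
    print(bigrams(tokens))
```

Replacing punctuation with spaces, rather than deleting it, keeps words separated when they are joined only by a punctuation mark, at the cost of splitting contractions such as "don't" into two tokens. Only ASCII punctuation is handled in this sketch.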