GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp
Introduction
Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works — such as which words tend to appear together or which words are most relevant to a topic — we would need to analyze a huge amount of text.
This project aimed to do exactly that, but automatically, using tools from Big Data processing, information theory, and machine learning.
Step 1: Data Preparation
Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis.
The first step was to clean the text, which involved:
- Removing all HTML, XML, and Wiki-specific tags
- Converting all text to lowercase
- Removing punctuation
- Splitting text into single words (unigrams) and consecutive word pairs (bigrams)
I used WikiExtractor to extract clean text from the XML file, and PySpark to efficiently process the massive dataset in parallel.
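Below is a minimal sketch of what this cleaning and tokenization step can look like in PySpark. The input path, function names, and bigram format are illustrative choices of mine, not taken from the repository.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wiki-cleaning").getOrCreate()

# WikiExtractor output: plain-text files, one article fragment per line.
# The path below is illustrative, not the project's actual layout.
lines = spark.sparkContext.textFile("extracted/*/wiki_*")

def tokenize(line):
    # Lowercase, replace anything that is not a letter with a space,
    # then split on whitespace to obtain unigrams.
    return re.sub(r"[^a-z]+", " ", line.lower()).split()

# Unigrams: every token in the corpus.
unigrams = lines.flatMap(tokenize)

# Bigrams: consecutive token pairs within a line, joined with "_".
bigrams = lines.map(tokenize).flatMap(
    lambda tokens: [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
)
```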
Step 2: Frequency Counts and Entropy Calculation
After cleaning, the next step was to count how often each word (unigram) and each word pair (bigram) appeared.
With those counts, I calculated entropy, which is a measure of how diverse or unpredictable a language is.
If a text repeats the same words over and over, entropy is low. If it uses a wide variety of words, entropy is high.
The formula used was Shannon’s entropy:

H = − Σ p(w) · log₂ p(w)

where p(w) is the probability of a specific word occurring and the sum runs over all words in the vocabulary.
This tells us how “surprising” or information-rich the language is.
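As a rough illustration, the counts and the entropy value can be computed directly on the `unigrams` RDD from the sketch above (the variable names are mine, not the project’s):

```python
import math

# Word frequencies from the `unigrams` RDD built in the previous sketch.
counts = unigrams.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
total = counts.map(lambda kv: kv[1]).sum()

# Shannon entropy: H = -sum over words of p(w) * log2 p(w)
entropy = -counts.map(
    lambda kv: (kv[1] / total) * math.log2(kv[1] / total)
).sum()

print(f"Corpus entropy: {entropy:.3f} bits per word")
```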
Step 3: Calculating Mutual Information
Next, I calculated Mutual Information (MI) for bigrams.
MI tells us how strongly two words are connected — it measures how often they appear together compared to what would be expected if they were independent.
If two words co-occur more often than random chance would suggest, they have a high MI score.
For example:
- “pineapple family”
- “cyrill cyrille”
- “chimaira greek”
These are often meaningful pairs of words that represent specific concepts or domains (like mythology, botany, etc.).
The formula is:

MI(x, y) = log₂ ( p(x, y) / (p(x) · p(y)) )

where p(x, y) is the probability of the two words appearing together, and p(x), p(y) are their individual probabilities.
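Continuing the earlier sketch, a per-bigram MI score can be computed from the bigram counts and the unigram probabilities. The names reuse the illustrative RDDs above; at full Wikipedia scale the probability lookup would be done with a join rather than a collected dictionary.

```python
import math

# Bigram frequencies from the `bigrams` RDD of the earlier sketch.
bigram_counts = bigrams.map(lambda b: (b, 1)).reduceByKey(lambda a, b: a + b)
n_bigrams = bigram_counts.map(lambda kv: kv[1]).sum()

# Unigram probabilities; collected to the driver only to keep the sketch short.
p_word = dict(counts.map(lambda kv: (kv[0], kv[1] / total)).collect())

def mi(pair):
    bigram, count = pair
    x, y = bigram.split("_", 1)
    p_xy = count / n_bigrams
    # log2 of observed co-occurrence probability over the independence baseline.
    return bigram, math.log2(p_xy / (p_word[x] * p_word[y]))

# Highest-scoring word pairs.
top_pairs = bigram_counts.map(mi).top(20, key=lambda kv: kv[1])
```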
Step 4: Using TF-IDF for Classification
While MI helps us find meaningful word pairs, it’s not ideal for classifying documents by topic.
For that, I used TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF gives higher weight to words that appear frequently in a document but rarely across all documents.
It works well for identifying which words are most specific to a document, which is important for classification.
I labeled articles into topics such as:
- Science
- Sports
- Technology
- History
- Business
- and others
Using TF-IDF features, I trained machine learning models to predict an article’s topic. This method proved more accurate than using MI or frequency alone.
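The post does not name the exact classifier, so the sketch below pairs scikit-learn’s TfidfVectorizer with a logistic regression model purely as an example; the tiny `texts` and `labels` lists stand in for the labelled articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the labelled articles; the real dataset is much larger.
texts = [
    "the team won the league championship final",
    "the cell divides during mitosis and meiosis",
    "the company reported quarterly revenue growth",
]
labels = ["Sports", "Science", "Business"]

# TF-IDF features feeding a simple classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict the topic of a new, unseen snippet.
print(model.predict(["the cell divides and the cycle continues"]))
```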
Step 5: Comparing the Methods
Here’s how the three methods compare:

Method | Captures Word Patterns | Good for Classification | Easy to Interpret |
---|---|---|---|
Entropy | Yes (overall language diversity) | No | Yes |
Mutual Info | Yes (word associations) | No | Moderate |
TF-IDF | No (individual terms only) | Yes | Yes |

- Entropy tells us how unpredictable or rich the vocabulary is.
- Mutual Information identifies rare but important word combinations.
- TF-IDF performs best when the goal is to categorize or classify text.
Technologies Used
The project was built using the following tools:
- Python 3.8
- PySpark for handling large-scale data
- NLTK for natural language tokenization and lemmatization
- scikit-learn for machine learning and model training
- WikiExtractor to convert Wikipedia XML dumps into plain text
- Apache Airflow (optional) to automate each step in a pipeline (a minimal example DAG is sketched below)
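For the optional Airflow piece, a minimal DAG along these lines could chain the steps; the task names, script names, and shell commands here are hypothetical, shown only to illustrate the ordering.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wiki_feature_selection",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Extract plain text from the Wikipedia XML dump.
    extract = BashOperator(
        task_id="extract_text",
        bash_command="python -m wikiextractor.WikiExtractor dump.xml -o extracted",
    )
    # Clean and tokenize the extracted text with PySpark.
    tokenize = BashOperator(
        task_id="clean_and_tokenize",
        bash_command="spark-submit clean_tokenize.py",
    )
    # Compute entropy, MI, and TF-IDF features.
    features = BashOperator(
        task_id="entropy_mi_tfidf",
        bash_command="spark-submit features.py",
    )

    extract >> tokenize >> features
```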
Sample Results
Here are some examples of the most informative bigrams found using Mutual Information:
- pineapple_family
- gynoecia_in
- chimaira_greek
- cyrill_cyrille
These word pairs occur rarely but are tightly linked and meaningful within specific domains.
Conclusion
This project showed how mathematical tools like entropy and mutual information can be used to analyze natural language on a very large scale. While mutual information is good at finding hidden structure and meaningful word pairs, TF-IDF was more effective for classifying documents.
Each method has its strengths:
- Entropy for analyzing language diversity
- Mutual Information for understanding word relationships
- TF-IDF for practical feature selection in machine learning tasks
Future improvements could include:
- Incorporating sentence boundaries and grammatical structure
- Improving classification using deep learning
- Extending the system to work in multiple languages or on other platforms (e.g., Raspberry Pi, cloud servers)