
Analyzing 34 GB of Wikipedia Text Using Information Theory and Machine Learning


Author: Dragomir Božoki
GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp

Introduction

Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works — such as which words tend to appear together or which words are most relevant to a topic — we would need to analyze a huge amount of text.

This project set out to do exactly that, automatically, using tools from big data processing, information theory, and machine learning.


Step 1: Data Preparation

Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis.

The first step was to clean the text, which involved:

  • Removing all HTML, XML, and Wiki-specific tags

  • Converting all text to lowercase

  • Removing punctuation

  • Splitting text into single words (unigrams) and consecutive word pairs (bigrams)

I used WikiExtractor to extract clean text from the XML file, and PySpark to efficiently process the massive dataset in parallel.
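
As a rough illustration of this stage, a PySpark sketch could look like the following. The input path and the exact cleaning regex are assumptions for the example, not the project's actual code:

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wiki-cleaning").getOrCreate()

# Hypothetical input path: plain-text articles produced by WikiExtractor
lines = spark.sparkContext.textFile("wiki_text/*")

def tokenize(line):
    # Lowercase, replace everything except letters with spaces, split on whitespace
    return re.sub(r"[^a-z]+", " ", line.lower()).split()

tokens = lines.map(tokenize).filter(lambda words: len(words) > 0)

# Unigrams: one record per word
unigrams = tokens.flatMap(lambda words: words)

# Bigrams: consecutive word pairs within each line
bigrams = tokens.flatMap(lambda words: zip(words, words[1:]))
```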


Step 2: Frequency Counts and Entropy Calculation

After cleaning, the next step was to count how often each word (unigram) and each word pair (bigram) appeared.

With those counts, I calculated entropy, which is a measure of how diverse or unpredictable a language is.

If a text repeats the same words over and over, entropy is low. If it uses a wide variety of words, entropy is high.

The formula used was Shannon’s entropy:

H(X) = -\sum_{x} p(x) \log_2 p(x)

where p(x) is the probability of a specific word occurring.

This tells us how “surprising” or information-rich the language is.
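
Continuing the sketch above (reusing the hypothetical `unigrams` RDD), the entropy of the word distribution could be computed like this:

```python
import math

# Count occurrences of each word, then the total number of tokens
counts = unigrams.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
total = counts.values().sum()

# Shannon entropy: H(X) = -sum over x of p(x) * log2 p(x)
entropy = -counts.values() \
    .map(lambda n: (n / total) * math.log2(n / total)) \
    .sum()
```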


Step 3: Calculating Mutual Information

Next, I calculated Mutual Information (MI) for bigrams.

MI tells us how strongly two words are connected — it measures how often they appear together compared to what would be expected if they were independent.

If two words co-occur more often than random chance would suggest, they have a high MI score.

For example:

  • “pineapple family”

  • “cyrill cyrille”

  • “chimaira greek”

These are often meaningful pairs of words that represent specific concepts or domains (like mythology, botany, etc.).

The formula is:

MI(x, y) = \log_2 \left( \frac{p(x, y)}{p(x) \cdot p(y)} \right)

where p(x, y) is the probability of the two words appearing together, and p(x) and p(y) are their individual probabilities.
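
As a sketch (reusing the hypothetical `bigrams` and `counts` RDDs from the earlier snippets), MI can be estimated from the joint and marginal counts. Collecting the vocabulary to the driver and the minimum-count filter are simplifications of my own for readability; a join-based version would scale better on the full 34 GB:

```python
import math

# Joint counts for bigrams; drop one-off pairs, which produce noisy MI scores
bigram_counts = (bigrams.map(lambda pair: (pair, 1))
                        .reduceByKey(lambda a, b: a + b)
                        .filter(lambda kv: kv[1] >= 5))

# Marginal word counts, collected to a plain dict for illustration
word_counts = dict(counts.collect())
n_words = sum(word_counts.values())
n_pairs = bigram_counts.values().sum()

def mi(kv):
    (x, y), nxy = kv
    p_xy = nxy / n_pairs
    p_x = word_counts[x] / n_words
    p_y = word_counts[y] / n_words
    return ((x, y), math.log2(p_xy / (p_x * p_y)))

# Bigrams with the highest mutual information
top_pairs = bigram_counts.map(mi).takeOrdered(20, key=lambda kv: -kv[1])
```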

Step 4: Using TF-IDF for Classification

While MI helps us find meaningful word pairs, it’s not ideal for classifying documents by topic.

For that, I used TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF gives higher weight to words that appear frequently in a document but rarely across all documents.

It works well for identifying which words are most specific to a document, which is important for classification.


I labeled articles into topics such as:

  • Science

  • Sports

  • Technology

  • History

  • Business

  • And others

Using TF-IDF features, I trained machine learning models to predict an article’s topic. This method proved more accurate than using MI or frequency alone.
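
To illustrate this step, here is a minimal scikit-learn pipeline. The tiny inline corpus is a placeholder for the real labeled articles, and logistic regression is just one reasonable choice, since the post does not name the exact models:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpus: replace with the cleaned article texts and their topic labels
texts = [
    "quantum mechanics describes the behaviour of particles",
    "the championship final was decided in extra time",
    "the new processor doubles cache throughput",
]
labels = ["Science", "Sports", "Technology"]

# TF-IDF features feeding a linear classifier
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["a faster chip for training neural networks"]))
```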


Step 5: Comparing the Methods

 
Method       | Captures Word Patterns           | Good for Classification | Easy to Interpret
Entropy      | Yes (overall language diversity) | No                      | Yes
Mutual Info  | Yes (word associations)          | No                      | Moderate
TF-IDF       | No (individual terms only)       | Yes                     | Yes
 

Here’s how the three methods compare:

  • Entropy tells us how unpredictable or rich the vocabulary is.

  • Mutual Information identifies rare but important word combinations.

  • TF-IDF performs best when the goal is to categorize or classify text.


Technologies Used

The project was built using the following tools:

  • Python 3.8

  • PySpark for handling large-scale data

  • NLTK for natural language tokenization and lemmatization

  • scikit-learn for machine learning and model training

  • WikiExtractor to convert Wikipedia XML dumps into plain text

  • Apache Airflow (optional) to automate each step in a pipeline; a minimal DAG sketch follows below
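
For the optional Airflow piece, one possible way to wire the stages together looks like this (the DAG id, task names, and script names are hypothetical, using the Airflow 2.x BashOperator):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical scripts for each stage of the pipeline
with DAG(
    dag_id="wiki_feature_selection",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually
    catchup=False,
) as dag:
    clean = BashOperator(task_id="clean_text", bash_command="python clean_text.py")
    count = BashOperator(task_id="count_ngrams", bash_command="python count_ngrams.py")
    mi = BashOperator(task_id="mutual_information", bash_command="python compute_mi.py")
    train = BashOperator(task_id="train_tfidf_model", bash_command="python train_tfidf.py")

    # Run the stages in order: clean -> count -> MI -> classifier
    clean >> count >> mi >> train
```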


Sample Results

Here are some examples of the most informative bigrams found using Mutual Information:

  • pineapple_family

  • gynoecia_in

  • chimaira_greek

  • cyrill_cyrille

These word pairs occur rarely but are tightly linked and meaningful within specific domains.


Conclusion

This project showed how mathematical tools like entropy and mutual information can be used to analyze natural language on a very large scale. While mutual information is good at finding hidden structure and meaningful word pairs, TF-IDF was more effective for classifying documents.

Each method has its strengths:

  • Entropy for analyzing language diversity

  • Mutual Information for understanding word relationships

  • TF-IDF for practical feature selection in machine learning tasks

Future improvements could include:

  • Incorporating sentence boundaries and grammatical structure

  • Improving classification using deep learning

  • Extending the system to work in multiple languages or on other platforms (e.g., Raspberry Pi, cloud servers)

