GitHub: https://github.com/DragomirBozoki/wiki-feature-selection-pysp
Introduction
Wikipedia contains millions of English-language articles on a wide range of topics. If we wanted to understand how language works — such as which words tend to appear together or which words are most relevant to a topic — we would need to analyze a huge amount of text.
This project aimed to do exactly that, but automatically, using tools from Big Data processing, information theory, and machine learning.
Step 1: Data Preparation
Wikipedia dumps come in raw XML format, which includes not only article content but also technical details, formatting tags, and metadata that are not useful for language analysis.
The first step was to clean the text, which involved:
- Removing all HTML, XML, and Wiki-specific tags
- Converting all text to lowercase
- Removing punctuation
- Splitting text into single words (unigrams) and consecutive word pairs (bigrams)
I used WikiExtractor to extract clean text from the XML file, and PySpark to efficiently process the massive dataset in parallel.
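Below is a minimal sketch of what this cleaning and tokenization step can look like in PySpark. The input path, function names, and bigram format are illustrative choices of mine, not taken from the repository.

```python
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wiki-cleaning").getOrCreate()

# WikiExtractor output: plain-text files, one article fragment per line.
# The path below is illustrative, not the project's actual layout.
lines = spark.sparkContext.textFile("extracted/*/wiki_*")

def tokenize(line):
    # Lowercase, replace anything that is not a letter with a space,
    # then split on whitespace to obtain unigrams.
    return re.sub(r"[^a-z]+", " ", line.lower()).split()

# Unigrams: every token in the corpus.
unigrams = lines.flatMap(tokenize)

# Bigrams: consecutive token pairs within a line, joined with "_".
bigrams = lines.map(tokenize).flatMap(
    lambda tokens: [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
)
```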
Step 2: Frequency Counts and Entropy Calculation
After cleaning, the next step was to count how often each word (unigram) and each word pair (bigram) appeared.
With those counts, I calculated entropy, which is a measure of how diverse or unpredictable a language is.
If a text repeats the same words over and over, entropy is low. If it uses a wide variety of words, entropy is high.
The formula used was Shannon’s entropy:

H = − Σ p(w) · log₂ p(w)

where p(w) is the probability of a specific word occurring and the sum runs over all words in the vocabulary.
This tells us how “surprising” or information-rich the language is.
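As a rough illustration, the counts and the entropy value can be computed directly on the `unigrams` RDD from the sketch above (the variable names are mine, not the project’s):

```python
import math

# Word frequencies from the `unigrams` RDD built in the previous sketch.
counts = unigrams.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
total = counts.map(lambda kv: kv[1]).sum()

# Shannon entropy: H = -sum over words of p(w) * log2 p(w)
entropy = -counts.map(
    lambda kv: (kv[1] / total) * math.log2(kv[1] / total)
).sum()

print(f"Corpus entropy: {entropy:.3f} bits per word")
```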
Step 3: Calculating Mutual Information
Next, I calculated Mutual Information (MI) for bigrams.
MI tells us how strongly two words are connected — it measures how often they appear together compared to what would be expected if they were independent.
If two words co-occur more often than random chance would suggest, they have a high MI score.
For example:
- “pineapple family”
- “cyrill cyrille”
- “chimaira greek”
These are often meaningful pairs of words that represent specific concepts or domains (like mythology, botany, etc.).
The formula is:

MI(x, y) = log₂ ( p(x, y) / (p(x) · p(y)) )

where p(x, y) is the probability of the two words appearing together, and p(x), p(y) are their individual probabilities.
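Continuing the earlier sketch, a per-bigram MI score can be computed from the bigram counts and the unigram probabilities. The names reuse the illustrative RDDs above; at full Wikipedia scale the probability lookup would be done with a join rather than a collected dictionary.

```python
import math

# Bigram frequencies from the `bigrams` RDD of the earlier sketch.
bigram_counts = bigrams.map(lambda b: (b, 1)).reduceByKey(lambda a, b: a + b)
n_bigrams = bigram_counts.map(lambda kv: kv[1]).sum()

# Unigram probabilities; collected to the driver only to keep the sketch short.
p_word = dict(counts.map(lambda kv: (kv[0], kv[1] / total)).collect())

def mi(pair):
    bigram, count = pair
    x, y = bigram.split("_", 1)
    p_xy = count / n_bigrams
    # log2 of observed co-occurrence probability over the independence baseline.
    return bigram, math.log2(p_xy / (p_word[x] * p_word[y]))

# Highest-scoring word pairs.
top_pairs = bigram_counts.map(mi).top(20, key=lambda kv: kv[1])
```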
Step 4: Using TF-IDF for Classification
While MI helps us find meaningful word pairs, it’s not ideal for classifying documents by topic.
For that, I used TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF gives higher weight to words that appear frequently in a document but rarely across all documents.
It works well for identifying which words are most specific to a document, which is important for classification.
I labeled articles into topics such as:
- Science
- Sports
- Technology
- History
- Business
- and others
Using TF-IDF features, I trained machine learning models to predict an article’s topic. This method proved more accurate than using MI or frequency alone.
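The post does not name the exact classifier, so the sketch below pairs scikit-learn’s TfidfVectorizer with a logistic regression model purely as an example; the tiny `texts` and `labels` lists stand in for the labelled articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the labelled articles; the real dataset is much larger.
texts = [
    "the team won the league championship final",
    "the cell divides during mitosis and meiosis",
    "the company reported quarterly revenue growth",
]
labels = ["Sports", "Science", "Business"]

# TF-IDF features feeding a simple classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Predict the topic of a new, unseen snippet.
print(model.predict(["the cell divides and the cycle continues"]))
```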
Step 5: Comparing the Methods
Here’s how the three methods compare:

Method | Captures Word Patterns | Good for Classification | Easy to Interpret |
---|---|---|---|
Entropy | Yes (overall language diversity) | No | Yes |
Mutual Info | Yes (word associations) | No | Moderate |
TF-IDF | No (individual terms only) | Yes | Yes |

- Entropy tells us how unpredictable or rich the vocabulary is.
- Mutual Information identifies rare but important word combinations.
- TF-IDF performs best when the goal is to categorize or classify text.
Technologies Used
The project was built using the following tools:
- Python 3.8
- PySpark for handling large-scale data
- NLTK for natural language tokenization and lemmatization
- scikit-learn for machine learning and model training
- WikiExtractor to convert Wikipedia XML dumps into plain text
- Apache Airflow (optional) to automate each step in a pipeline (a minimal example DAG is sketched below)
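For the optional Airflow piece, a minimal DAG along these lines could chain the steps; the task names, script names, and shell commands here are hypothetical, shown only to illustrate the ordering.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wiki_feature_selection",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Extract plain text from the Wikipedia XML dump.
    extract = BashOperator(
        task_id="extract_text",
        bash_command="python -m wikiextractor.WikiExtractor dump.xml -o extracted",
    )
    # Clean and tokenize the extracted text with PySpark.
    tokenize = BashOperator(
        task_id="clean_and_tokenize",
        bash_command="spark-submit clean_tokenize.py",
    )
    # Compute entropy, MI, and TF-IDF features.
    features = BashOperator(
        task_id="entropy_mi_tfidf",
        bash_command="spark-submit features.py",
    )

    extract >> tokenize >> features
```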
Sample Results
Here are some examples of the most informative bigrams found using Mutual Information:
- pineapple_family
- gynoecia_in
- chimaira_greek
- cyrill_cyrille
These word pairs occur rarely but are tightly linked and meaningful within specific domains.
Conclusion
This project showed how mathematical tools like entropy and mutual information can be used to analyze natural language on a very large scale. While mutual information is good at finding hidden structure and meaningful word pairs, TF-IDF was more effective for classifying documents.
Each method has its strengths:
- Entropy for analyzing language diversity
- Mutual Information for understanding word relationships
- TF-IDF for practical feature selection in machine learning tasks
Future improvements could include:
- Incorporating sentence boundaries and grammatical structure
- Improving classification using deep learning
- Extending the system to work in multiple languages or on other platforms (e.g., Raspberry Pi, cloud servers)