BBC News NLP Pipeline: Text Classification & Topic Modelling¶
This project uses Natural Language Processing (NLP) techniques to classify news articles from the BBC News dataset [1] into five categories (business, entertainment, politics, sport, tech).
The project follows an end-to-end NLP pipeline workflow in four key stages:
- Data Loading & Preprocessing
- Exploratory Data Analysis
- Text Classification with Logistic Regression
- Topic Modelling with LDA
# Import libraries
import os
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
# Download stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
Data Loading¶
We will first read all text files from the BBC News dataset, assigning each article a label based on the category folder it comes from. The texts will be stored in a Pandas dataframe to create a structured dataset where each row contains the article's file name, category, and raw text.
categories = ["business","entertainment","politics","sport","tech"]
# Collect all articles
articles = []
for category in categories:
    folder_path = "bbc/" + category
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                text = f.read()
            articles.append({"filename": filename, "category": category, "text": text})
# Create dataframe
df = pd.DataFrame(articles)
# Show the first few rows
df.head()
| | filename | category | text |
|---|---|---|---|
| 0 | 289.txt | business | UK economy facing 'major risks'\n\nThe UK manu... |
| 1 | 504.txt | business | Aids and climate top Davos agenda\n\nClimate c... |
| 2 | 262.txt | business | Asian quake hits European shares\n\nShares in ... |
| 3 | 276.txt | business | India power shares jump on debut\n\nShares in ... |
| 4 | 510.txt | business | Lacroix label bought by US firm\n\nLuxury good... |
Preprocessing¶
We will preprocess the text data to create a new column (clean_text) by lowercasing the text and removing unnecessary characters such as punctuation, digits, and extra whitespace, as well as stopwords. This means the model will focus on meaningful words that help distinguish texts from different categories.
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation and digits
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stopwords
    tokens = [word for word in text.split() if word not in stop_words]
    return " ".join(tokens)
# Add cleaned data to the dataframe
df["clean_text"] = df["text"].apply(preprocess)
df.head()
| | filename | category | text | clean_text |
|---|---|---|---|---|
| 0 | 289.txt | business | UK economy facing 'major risks'\n\nThe UK manu... | uk economy facing major risks uk manufacturing... |
| 1 | 504.txt | business | Aids and climate top Davos agenda\n\nClimate c... | aids climate top davos agenda climate change f... |
| 2 | 262.txt | business | Asian quake hits European shares\n\nShares in ... | asian quake hits european shares shares europe... |
| 3 | 276.txt | business | India power shares jump on debut\n\nShares in ... | india power shares jump debut shares india lar... |
| 4 | 510.txt | business | Lacroix label bought by US firm\n\nLuxury good... | lacroix label bought us firm luxury goods grou... |
Exploratory Data Analysis¶
We now use EDA to understand trends within the dataset. We will check the distribution of articles per category, analyse the distribution of article lengths, and visualise the most frequent words in each category using word clouds.
# Summary statistics
print("Dataset size:", len(df))
print("\nClass distribution:")
print(df["category"].value_counts())
# Category distribution
counts = df["category"].value_counts()
plt.figure()
plt.bar(counts.index, counts.values)
plt.title("Documents per Category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.xticks(rotation=20)
plt.show()
# Document length (characters)
df["char_len"] = df["text"].str.len()
plt.figure()
plt.hist(df["char_len"], bins=40)
plt.title("Distribution of Document Lengths (characters)")
plt.xlabel("Characters")
plt.ylabel("Frequency")
plt.show()
# Create a grid for plots
fig, axes = plt.subplots(3, 2, figsize=(16, 15))
axes = axes.flatten()
# Plot wordclouds for each category
for i, cat in enumerate(categories):
    # Combine all cleaned articles in this category into one string
    text = " ".join(df[df.category == cat].clean_text)
    wc = WordCloud(stopwords=stop_words, background_color="white", width=800, height=400).generate(text)
    axes[i].imshow(wc, interpolation="bilinear")
    axes[i].axis("off")
    axes[i].set_title(f"Word Cloud for {cat}")
axes[5].axis("off")
plt.tight_layout()
plt.show()
Dataset size: 2225

Class distribution:
category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64
From this EDA, we conclude that the dataset is well balanced across categories and large enough to support classification with TF-IDF features.
Note that after preprocessing some generic words remain, such as said, would, and also. These are handled naturally by TF-IDF vectorisation, which down-weights terms that appear across many documents.
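As a quick optional check (not part of the pipeline above), we can count the most frequent tokens left after preprocessing to see which of these generic words remain.
# Optional check: most frequent tokens remaining after preprocessing
from collections import Counter

token_counts = Counter(" ".join(df["clean_text"]).split())
print(token_counts.most_common(15))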
Text Classification¶
We will split the dataset into training (80%) and test (20%) sets, using stratified sampling to preserve the class balance in both splits.
# Define features and labels
X = df["clean_text"]
y = df["category"]
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
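As an optional sanity check, we can confirm that the stratified split preserves the class proportions in both subsets.
# Optional check: class proportions in the train and test sets
print("Train distribution:\n", y_train.value_counts(normalize=True).round(3))
print("\nTest distribution:\n", y_test.value_counts(normalize=True).round(3))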
We then transform the text into numerical vectors using TF-IDF, which captures word importance and reduces the effect of frequent but less informative words.
The vectoriser will use up to 5,000 features and include both unigrams and bigrams.
# Transform text into numerical features with TF-IDF
vectoriser = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X_train_tfidf = vectoriser.fit_transform(X_train)
X_test_tfidf = vectoriser.transform(X_test)
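To see what the vectoriser produces, an optional check of the matrix shapes and a sample of the learned unigram/bigram features:
# Optional check: TF-IDF matrix shapes and a few learned features
print("Train matrix shape:", X_train_tfidf.shape)
print("Test matrix shape:", X_test_tfidf.shape)
print("Sample features:", vectoriser.get_feature_names_out()[:10])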
Finally, we train a Logistic Regression classifier on the TF-IDF vectors and display the evaluation metrics.
# Train logistic regression classifier
classifier = LogisticRegression(max_iter=800)
classifier.fit(X_train_tfidf, y_train)
# Predict on the test set using the classifier
predictions = classifier.predict(X_test_tfidf)
# Print model accuracy
print("Accuracy:", f"{accuracy_score(y_test, predictions):.4f}")
# Print Precision, recall and f1 score for all categories
print(classification_report(y_test, predictions))
# Compute and display confusion matrix
cm = confusion_matrix(y_test, predictions)
print("\nConfusion Matrix:\n")
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classifier.classes_)
display.plot(cmap=plt.cm.Blues)
plt.show()
Accuracy: 0.9865

               precision    recall  f1-score   support

     business       0.96      0.98      0.97       102
entertainment       0.99      0.99      0.99        77
     politics       1.00      0.99      0.99        84
        sport       1.00      1.00      1.00       102
         tech       0.99      0.97      0.98        80

     accuracy                           0.99       445
    macro avg       0.99      0.99      0.99       445
 weighted avg       0.99      0.99      0.99       445

Confusion Matrix:
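As an optional, illustrative inspection (not part of the evaluation above), we can look at the features the classifier weights most heavily for each category; each row of classifier.coef_ corresponds to one of classifier.classes_.
# Optional inspection: highest-weighted TF-IDF features per category
import numpy as np

feature_names = vectoriser.get_feature_names_out()
for idx, label in enumerate(classifier.classes_):
    top = np.argsort(classifier.coef_[idx])[::-1][:10]  # ten largest positive weights
    print(f"{label}: ", [feature_names[j] for j in top])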
Topic Modelling¶
We can also use an unsupervised learning approach, Latent Dirichlet Allocation (LDA), to uncover latent topics in the dataset. This approach groups articles into topics without using their category labels.
# Fit LDA model using TF-IDF features
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X_train_tfidf)
# Display top 10 words for each topic
terms = vectoriser.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i+1}: ", [terms[j] for j in topic.argsort()[-10:]])
Topic 1: ['said', 'users', 'video', 'technology', 'people', 'tv', 'digital', 'games', 'mobile', 'music']
Topic 2: ['first', 'wales', 'team', 'match', 'cup', 'club', 'said', 'win', 'game', 'england']
Topic 3: ['actress', 'oscar', 'films', 'star', 'actor', 'festival', 'award', 'awards', 'best', 'film']
Topic 4: ['clothes', 'technologies', 'could used', 'midlands', 'shops', 'chip', 'broadcast', 'wearing', 'tags', 'rfid']
Topic 5: ['uk', 'new', 'year', 'people', 'us', 'government', 'would', 'bn', 'mr', 'said']
We find that the discovered topics largely align with the categories from the BBC News dataset.
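To check this alignment more directly (an optional addition to the steps above), we can assign each training article to its highest-probability topic and cross-tabulate against the true labels.
# Rough alignment check: dominant LDA topic per training article vs. true category
doc_topics = lda.transform(X_train_tfidf)        # topic distribution for each document
dominant_topic = doc_topics.argmax(axis=1) + 1   # 1-based topic number
print(pd.crosstab(y_train.values, dominant_topic, rownames=["category"], colnames=["topic"]))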
Conclusion¶
These results confirm that the NLP pipeline successfully produces a highly accurate text classifier for BBC news articles. The model achieved 98.6% accuracy on the test set, with strong performance consistent across all five categories. Sport achieved perfect scores in the classification report, while the other classes maintained precision and recall at or above 0.96. The confusion matrix also shows minimal misclassifications, highlighting that the model is both accurate and robust.
Although this model is highly effective, future work could build on this baseline by exploring deep learning and transformer-based approaches, as well as considering deployment in real-world applications.
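As a minimal sketch of such a deployment, the fitted preprocessing function, vectoriser, and classifier can be reused to label previously unseen text (the article below is an invented example for illustration).
# Minimal sketch: label new text with the fitted pipeline
# (the article string is an invented example, not from the dataset)
new_article = "The chancellor announced new spending plans ahead of the election."
new_features = vectoriser.transform([preprocess(new_article)])
print("Predicted category:", classifier.predict(new_features)[0])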
References¶
[1] D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. Available at: http://mlg.ucd.ie/files/publications/greene06icml.pdf