
Commit 57e7d02

Merge pull request #102 from x-tabdeveloping/logreg_clustering
Added feature importance methods based on cluster differences
2 parents 2f2af9e + 059c656 commit 57e7d02

5 files changed

Lines changed: 313 additions & 109 deletions


docs/clustering.md

Lines changed: 79 additions & 22 deletions
@@ -107,6 +107,25 @@ Turftopic is entirely clustering-model agnostic, and as such, any type of model
 
 Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated based on already discovered clusters.
 Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
+You can manipulate how these scores are calculated by changing the `feature_importance` parameter of your topic models.
+By and large, there are two types of methods that can be used for importance estimation:
+
+1. **Lexical methods**, which estimate term importance solely based on word counts in each cluster:
+    - Generally faster, since the vocabulary does not need to be encoded.
+    - Can capture more particular word use.
+    - Usually cover the topics' content better.
+2. **Semantic methods**, which estimate term importance using the semantic space of the model:
+    - Typically produce cleaner and more specific topics.
+    - Can be used in a multilingual context.
+    - Generally less sensitive to stop and junk words.
+
+| Importance method | Type | Description | Advantages |
+| - | - | - | - |
+| `soft-c-tf-idf` *(default)* | Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignments in models like Gaussian Mixtures; less sensitive to stop words than vanilla c-tf-idf. |
+| `fighting-words` **(NEW)** | Lexical | Computes word importance from cluster differences using the Fightin' Words algorithm of Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences between groups of texts. See the [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
+| `c-tf-idf` | Lexical | Computes how unique terms are to a cluster with a tf-idf-style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand, and not affected by cluster shape. |
+| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean, easily interpretable topics. |
+| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Projects words onto the parameter vectors of a linear classifier (Linear Discriminant Analysis). | Topic differences are measured in embedding space and determined by predictive power, and are therefore accurate and clean. |
 
 
 !!! quote "Choose a term importance estimation method"
@@ -120,20 +139,8 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         # or
         model = ClusteringTopicModel(feature_importance="c-tf-idf")
         ```
-    !!! failure inline end "Weaknesses"
-        - Topics can be contaminated with stop words
-        - Lower topic quality
 
-    !!! success inline end "Strengths"
-        - Theoretically more correct
-        - More within-topic coverage
-    c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster.
-    Terms which frequently occur in other clusters are inversely weighted so that words which are specific to a topic gain larger importance.
-    By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop words.
-
-    <br>
-
-    ??? info "Click to see formulas"
+    ??? info "Click to see formulas"
         #### Soft-c-TF-IDF
         - Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
         - Estimate weight of term $j$ for topic $z$: <br>
@@ -157,7 +164,6 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         - Calculate importance of term $j$ for topic $z$:
           $\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
 
-
     === "Centroid Proximity (Top2Vec)"
 
         ```python
@@ -166,18 +172,21 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         model = ClusteringTopicModel(feature_importance="centroid")
         ```
 
-    !!! failure inline end "Weaknesses"
-        - Low within-topic coverage
-        - Assumes spherical clusters
+    === "Fightin' Words"
 
-    !!! success inline end "Strengths"
-        - Clean topics
-        - Highly specific topics
+        ```python
+        from turftopic import ClusteringTopicModel
 
-    In Top2Vec (Angelov, 2020) term importance scores are estimated from word embeddings' similarity to the centroid vectors of clusters.
-    This approach typically produces cleaner and more specific topic descriptions, but might not be the optimal choice, since it makes assumptions about cluster shapes and only describes the centers of clusters accurately.
+        model = ClusteringTopicModel(feature_importance="fighting-words")
+        ```
+
+    === "Linear Probing"
 
+        ```python
+        from turftopic import ClusteringTopicModel
 
+        model = ClusteringTopicModel(feature_importance="linear")
+        ```
 
 
 
@@ -305,6 +314,50 @@ model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
 model.print_topics_over_time()
 ```
 
+## Semi-supervised Topic Modeling
+
+Some dimensionality reduction methods are capable of designing features that are effective at predicting class labels.
+This way, you can provide a supervisory signal, but also let the model discover new topics that you have not specified.
+
+!!! warning
+    TSNE, the default dimensionality reduction method in Turftopic, is not capable of semi-supervised modeling.
+    You will have to use a different algorithm.
+
+
+!!! note "Use a dimensionality reduction method for semi-supervised modeling."
+
+    === "with UMAP"
+
+        ```bash
+        pip install turftopic[umap-learn]
+        ```
+
+        ```python
+        from umap import UMAP
+        from turftopic import ClusteringTopicModel
+
+        corpus: list[str] = [...]
+
+        # UMAP can also understand missing class labels if you only have them on some examples
+        # Specify these with -1 or NaN labels
+        labels: list[int] = [0, 2, -1, -1, 0, 0, ...]
+
+        model = ClusteringTopicModel(dimensionality_reduction=UMAP())
+        model.fit(corpus, y=labels)
+        ```
+
+    === "with Linear Discriminant Analysis"
+
+        ```python
+        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+        from turftopic import ClusteringTopicModel
+
+        corpus: list[str] = [...]
+        labels: list[int] = [...]
+
+        model = ClusteringTopicModel(dimensionality_reduction=LinearDiscriminantAnalysis(n_components=5))
+        model.fit(corpus, y=labels)
+        ```
 
 ## Visualization
 
@@ -339,3 +392,7 @@ _See Figure 1_
 ## API Reference
 
 ::: turftopic.models.cluster.ClusteringTopicModel
+
+::: turftopic.models.cluster.BERTopic
+
+::: turftopic.models.cluster.Top2Vec

turftopic/feature_importance.py

Lines changed: 90 additions & 3 deletions
@@ -1,9 +1,11 @@
+from __future__ import annotations
+
+from typing import Literal
+
 import numpy as np
 import scipy.sparse as spr
-from sklearn.feature_extraction.text import TfidfTransformer
+from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
 from sklearn.metrics.pairwise import cosine_similarity
-from sklearn.preprocessing import normalize
-from sklearn.utils import check_array
 
 
 def cluster_centroid_distance(
@@ -36,6 +38,91 @@ def cluster_centroid_distance(
     return components
 
 
+def linear_classifier(
+    doc_topic_matrix: np.ndarray,
+    embeddings: np.ndarray,
+    vocab_embeddings: np.ndarray,
+) -> np.ndarray:
+    """Computes feature importances based on embedding directions
+    obtained with a linear classifier.
+
+    Parameters
+    ----------
+    doc_topic_matrix: np.ndarray
+        Document-topic matrix.
+    embeddings: np.ndarray
+        Document embeddings.
+    vocab_embeddings: np.ndarray
+        Term embeddings of shape (vocab_size, embedding_size)
+
+    Returns
+    -------
+    ndarray of shape (n_topics, vocab_size)
+        Term importance matrix.
+    """
+    labels = np.argmax(doc_topic_matrix, axis=1)
+    model = LinearDiscriminantAnalysis().fit(embeddings, labels)
+    components = cosine_similarity(model.coef_, vocab_embeddings)
+    if len(set(labels)) == 2:
+        # Binary is a special case: LDA fits a single discriminant direction
+        components = np.concatenate([-components, components], axis=0)
+    return components
+
+
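For reference, a minimal sketch of how the new `linear_classifier` could be exercised on its own, with made-up toy arrays (the shapes and values below are illustrative only, not part of this PR):

```python
import numpy as np

from turftopic.feature_importance import linear_classifier

# Toy data: 6 documents, 2 topics, 8-dimensional embeddings, 5 vocabulary terms
doc_topic_matrix = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
)
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(6, 8))        # document embeddings
vocab_embeddings = rng.normal(size=(5, 8))  # term embeddings

# Each row is a topic, each column a term: the score is the cosine similarity
# between the term embedding and the topic's LDA discriminant direction
importances = linear_classifier(doc_topic_matrix, embeddings, vocab_embeddings)
print(importances.shape)  # (2, 5)
```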
+def fighting_words(
+    doc_topic_matrix: np.ndarray,
+    doc_term_matrix: spr.csr_matrix,
+    prior: float | Literal["corpus"] = "corpus",
+) -> np.ndarray:
+    """Computes feature importance using the *Fightin' Words* algorithm.
+
+    Parameters
+    ----------
+    doc_topic_matrix: np.ndarray
+        Document-topic matrix of shape (n_documents, n_topics)
+    doc_term_matrix: spr.csr_matrix
+        Document-term matrix of shape (n_documents, vocab_size)
+    prior: float or "corpus", default "corpus"
+        Dirichlet prior to use. When a float, it indicates the alpha
+        parameter of a symmetric Dirichlet; if "corpus",
+        word frequencies from the background corpus are used.
+
+    Returns
+    -------
+    ndarray of shape (n_topics, vocab_size)
+        Term importance matrix.
+    """
+    labels = np.argmax(doc_topic_matrix, axis=1)
+    n_topics = doc_topic_matrix.shape[1]
+    n_vocab = doc_term_matrix.shape[1]
+    components = []
+    if prior == "corpus":
+        priors = np.ravel(np.asarray(doc_term_matrix.sum(axis=0)))
+    else:
+        priors = np.full(n_vocab, prior)
+    a0 = np.sum(priors)  # equals prior * n_vocab for a symmetric prior
+    for i_topic in range(n_topics):
+        topic_freq = np.ravel(
+            np.asarray(doc_term_matrix[labels == i_topic].sum(axis=0))
+        )
+        rest_freq = np.ravel(
+            np.asarray(doc_term_matrix[labels != i_topic].sum(axis=0))
+        )
+        n1 = np.sum(topic_freq)
+        n2 = np.sum(rest_freq)
+        topic_logodds = np.log(
+            (topic_freq + priors) / (n1 + a0 - topic_freq - priors)
+        )
+        rest_logodds = np.log(
+            (rest_freq + priors) / (n2 + a0 - rest_freq - priors)
+        )
+        delta = topic_logodds - rest_logodds
+        delta_var = 1 / (topic_freq + priors) + 1 / (rest_freq + priors)
+        zscore = delta / np.sqrt(delta_var)
+        components.append(zscore)
+    return np.stack(components)
+
+
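The z-scores computed above follow Monroe et al.'s prior-smoothed log-odds: $\delta_{zj} = \log\frac{y_{zj}+\alpha_j}{n_z+\alpha_0-y_{zj}-\alpha_j} - \log\frac{y_{\neg zj}+\alpha_j}{n_{\neg z}+\alpha_0-y_{\neg zj}-\alpha_j}$, with variance $\sigma^2(\delta_{zj}) \approx \frac{1}{y_{zj}+\alpha_j} + \frac{1}{y_{\neg zj}+\alpha_j}$. A minimal sketch of calling the function directly, with a made-up document-term matrix (toy counts, not from this PR):

```python
import numpy as np
import scipy.sparse as spr

from turftopic.feature_importance import fighting_words

# Toy data: 4 documents, 2 topics, 5 vocabulary terms (all counts made up)
doc_topic_matrix = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
doc_term_matrix = spr.csr_matrix(
    [
        [3, 0, 1, 0, 2],
        [2, 1, 0, 0, 1],
        [0, 4, 0, 2, 0],
        [0, 3, 1, 3, 0],
    ]
)

# Rows are topics, columns are terms; entries are log-odds z-scores
zscores = fighting_words(doc_topic_matrix, doc_term_matrix)
print(zscores.shape)  # (2, 5)

# A symmetric Dirichlet prior can be used instead of corpus frequencies
zscores_sym = fighting_words(doc_topic_matrix, doc_term_matrix, prior=0.5)
```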
 def soft_ctf_idf(
     doc_topic_matrix: np.ndarray,
     doc_term_matrix: spr.csr_matrix,

turftopic/models/_hierarchical_clusters.py

Lines changed: 24 additions & 8 deletions
@@ -5,13 +5,16 @@
 
 import numpy as np
 from scipy.cluster.hierarchy import linkage
+from scipy.spatial.distance import pdist
 from sklearn.metrics.pairwise import pairwise_distances
 
 from turftopic.base import ContextualModel
 from turftopic.feature_importance import (
     bayes_rule,
     cluster_centroid_distance,
     ctf_idf,
+    fighting_words,
+    linear_classifier,
     soft_ctf_idf,
 )
 from turftopic.hierarchical import TopicNode
@@ -188,7 +191,11 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
             components = soft_ctf_idf(
                 document_topic_matrix, self.model.doc_term_matrix
             )  # type: ignore
-        elif self.model.feature_importance == "centroid":
+        if self.model.feature_importance == "fighting-words":
+            components = fighting_words(
+                document_topic_matrix, self.model.doc_term_matrix
+            )  # type: ignore
+        elif self.model.feature_importance in ["centroid", "linear"]:
             if not hasattr(self.model, "vocab_embeddings"):
                 self.model.vocab_embeddings = self.model.encode_documents(
                     self.model.vectorizer.get_feature_names_out()
@@ -203,10 +210,17 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
                     n_word_dims=self.model.vocab_embeddings.shape[1],
                 )
             )
-            components = cluster_centroid_distance(
-                topic_vectors,
-                self.model.vocab_embeddings,
-            )
+            if self.model.feature_importance == "centroid":
+                components = cluster_centroid_distance(
+                    topic_vectors,
+                    self.model.vocab_embeddings,
+                )
+            else:
+                components = linear_classifier(
+                    document_topic_matrix,
+                    self.model.embeddings,
+                    self.model.vocab_embeddings,
+                )
         elif self.model.feature_importance == "bayes":
             components = bayes_rule(
                 document_topic_matrix, self.model.doc_term_matrix
@@ -248,9 +262,11 @@
         n_classes = len(classes[classes != -1])
         topic_vectors = topic_representations[classes != -1]
         n_reductions = n_classes - n_reduce_to
-        return linkage(topic_vectors, method=method, metric=metric)[
-            :n_reductions
-        ]
+        cond_dist = pdist(topic_vectors, metric=metric)
+        # Replace non-finite cosine distances to keep linkage numerically stable
+        if metric == "cosine":
+            cond_dist[~np.isfinite(cond_dist)] = -1
+        return linkage(cond_dist, method=method)[:n_reductions]
 
     def reduce_topics(
         self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
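The change above precomputes the condensed distance matrix instead of letting `linkage` derive it internally, so non-finite cosine distances (e.g. from all-zero topic vectors) can be sanitized before clustering. A self-contained sketch of the same pattern, with made-up vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy topic vectors; the all-zero row makes cosine distance undefined (NaN)
topic_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 1.0]])

cond_dist = pdist(topic_vectors, metric="cosine")  # condensed form, length n*(n-1)/2
cond_dist[~np.isfinite(cond_dist)] = -1  # sanitize, as in the patch above
merges = linkage(cond_dist, method="average")  # would raise on NaN without the fix
print(merges)
```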
