Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated based on already discovered clusters.
Multiple methods are available in Turftopic for estimating importance scores of words and phrases for topics.
You can control how these scores are calculated by changing the `feature_importance` parameter of your topic models.
By and large, there are two types of methods that can be used for importance estimation:
1. **Lexical methods**, which estimate term importance solely based on word counts in each cluster:
    - Generally faster, since the vocabulary does not need to be encoded.
    - Can capture more particular word use.
    - Usually cover the topics' content better.
2. **Semantic methods**, which estimate term importance using the semantic space of the model:
    - Typically produce cleaner and more specific topics.
    - Can be used in a multilingual context.
    - Generally less sensitive to stop words and junk words.
| Importance method | Type | Description | Advantages |
| - | - | - | - |
| `soft-c-tf-idf` *(default)* | Lexical | A c-TF-IDF method that can interpret soft cluster assignments. | Can interpret soft cluster assignments in models like Gaussian Mixtures; less sensitive to stop words than vanilla c-TF-IDF. |
| `fighting-words` **(NEW)** | Lexical | Computes word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences between groups of texts. See the [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
| `c-tf-idf` | Lexical | Computes how unique terms are to a cluster with a tf-idf-style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand, and not affected by cluster shape. |
| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean, easily interpretable topics. |
| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Projects words onto the parameter vectors of a linear classifier (Linear Discriminant Analysis). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
!!! quote "Choose a term importance estimation method"
    === "Soft-c-TF-IDF (default)"

        ```python
        from turftopic import ClusteringTopicModel

        model = ClusteringTopicModel(feature_importance="soft-c-tf-idf")
        # or
        model = ClusteringTopicModel(feature_importance="c-tf-idf")
        ```
        !!! failure inline end "Weaknesses"
            - Topics can be contaminated with stop words
            - Lower topic quality

        !!! success inline end "Strengths"
            - Theoretically more correct
            - More within-topic coverage

        c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster.
        Terms which frequently occur in other clusters are inversely weighted, so that words that are specific to a topic gain larger importance.
        By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop words.
        ??? info "Click to see formulas"

            #### Soft-c-TF-IDF
            - Let $X$ be the document-term matrix, where each element $X_{ij}$ corresponds to the number of times word $j$ occurs in document $i$.
            - Estimate the weight of term $j$ for topic $z$: <br>
            - Calculate the importance of term $j$ for topic $z$: <br>
              $\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
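
        As a rough illustration of the computation, here is a minimal numpy sketch of a vanilla c-TF-IDF-style weighting (our own variable names and a simplified idf, not Turftopic's exact implementation; the soft variant additionally weights counts by cluster membership probabilities):

        ```python
        import numpy as np

        # X: document-term matrix (n_documents x n_terms)
        X = np.array([[2, 0, 1], [1, 1, 0], [0, 3, 1], [0, 2, 2]])
        # Hard cluster assignment of each document
        labels = np.array([0, 0, 1, 1])

        # Term counts aggregated per cluster, normalized by total words in the cluster
        tf = np.stack([X[labels == z].sum(axis=0) for z in np.unique(labels)])
        tf = tf / tf.sum(axis=1, keepdims=True)
        # Terms that are frequent across all clusters get inversely weighted
        idf = np.log(X.shape[0] / tf.sum(axis=0))
        ctfidf = tf * idf  # importance of term j for topic z
        ```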
=== "Centroid Proximity (Top2Vec)"
162
168
163
169
```python
@@ -166,18 +172,21 @@ Multiple methods are available in Turftopic for estimating words'/phrases' impor
166
172
model = ClusteringTopicModel(feature_importance="centroid")
167
173
```
        !!! failure inline end "Weaknesses"
            - Low within-topic coverage
            - Assumes spherical clusters

        !!! success inline end "Strengths"
            - Clean topics
            - Highly specific topics

        In Top2Vec (Angelov, 2020), term importance scores are estimated from word embeddings' similarity to the centroid vectors of clusters.
        This approach typically produces cleaner and more specific topic descriptions, but it might not be the optimal choice, since it makes assumptions about cluster shapes and only describes the centers of clusters accurately.
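
        A minimal sketch of the idea (our illustration with made-up inputs, not Turftopic's exact implementation), assuming word and document embeddings from the same encoder:

        ```python
        import numpy as np

        rng = np.random.default_rng(42)
        doc_embeddings = rng.normal(size=(100, 64))   # one vector per document
        word_embeddings = rng.normal(size=(500, 64))  # one vector per vocabulary item
        labels = rng.integers(0, 5, size=100)         # cluster assignment per document

        # One centroid per cluster
        centroids = np.stack(
            [doc_embeddings[labels == z].mean(axis=0) for z in np.unique(labels)]
        )

        def normalize(a):
            return a / np.linalg.norm(a, axis=1, keepdims=True)

        # Term importance = cosine similarity of each word to each cluster centroid
        importance = normalize(centroids) @ normalize(word_embeddings).T
        ```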
    === "Fightin' Words"

        ```python
        from turftopic import ClusteringTopicModel

        model = ClusteringTopicModel(feature_importance="fighting-words")
        ```
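
        To give a sense of the underlying computation, here is a minimal sketch of Monroe et al.'s weighted log-odds with an informative Dirichlet prior, which is the core of the Fightin' Words method (the prior scaling and variable names are our simplifying assumptions, not necessarily Turftopic's):

        ```python
        import numpy as np

        # counts[z, j]: occurrences of term j in cluster z (two clusters for simplicity)
        counts = np.array([[10.0, 2.0, 30.0], [3.0, 8.0, 25.0]])
        # Informative prior proportional to overall term frequencies
        alpha = counts.sum(axis=0) * 0.01

        y1, y2 = counts[0], counts[1]
        n1, n2, a0 = y1.sum(), y2.sum(), alpha.sum()

        # Smoothed log-odds ratio of each term between the two clusters
        delta = np.log((y1 + alpha) / (n1 + a0 - y1 - alpha)) - np.log(
            (y2 + alpha) / (n2 + a0 - y2 - alpha)
        )
        # Approximate variance; the z-score serves as the importance score
        variance = 1.0 / (y1 + alpha) + 1.0 / (y2 + alpha)
        z_score = delta / np.sqrt(variance)
        ```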
    === "Linear Probing"

        ```python
        from turftopic import ClusteringTopicModel

        model = ClusteringTopicModel(feature_importance="linear")
        ```
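
        As a rough sketch of what projecting words onto the parameter vectors of a linear classifier can look like (our illustration with made-up inputs, not necessarily Turftopic's exact procedure):

        ```python
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        rng = np.random.default_rng(0)
        doc_embeddings = rng.normal(size=(200, 64))
        word_embeddings = rng.normal(size=(500, 64))
        labels = rng.integers(0, 5, size=200)  # cluster assignments

        # Fit a linear classifier that separates the clusters in embedding space
        clf = LinearDiscriminantAnalysis().fit(doc_embeddings, labels)

        # Project word embeddings onto each cluster's parameter vector;
        # high scores mean a word is predictive of that cluster
        importance = word_embeddings @ clf.coef_.T  # shape: (n_terms, n_clusters)
        ```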
```python
model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
model.print_topics_over_time()
```
## Semi-supervised Topic Modeling
Some dimensionality reduction methods are capable of learning features that are effective at predicting class labels.
This way, you can provide a supervisory signal, but also let the model discover new topics that you have not specified.
!!! warning
    TSNE, the default dimensionality reduction method in Turftopic, is not capable of semi-supervised modeling.
    You will have to use a different algorithm.
!!! note "Use a dimensionality reduction method for semi-supervised modeling."
=== "with UMAP"
330
+
331
+
```bash
332
+
pip install turftopic[umap-learn]
333
+
```
334
+
335
+
```python
336
+
from umap import UMAP
337
+
from turftopic import ClusteringTopicModel
338
+
339
+
corpus: list[str] = [...]
340
+
341
+
# UMAP can also understand missing class labels if you only have them on some examples
342
+
# Specify these with -1 or NaN labels
343
+
labels: list[int] = [0, 2, -1, -1, 0, 0...]
344
+
345
+
model = ClusteringTopicModel(dimensionality_reduction=UMAP())
346
+
model.fit(corpus, y=labels)
347
+
```
=== "with Linear Discriminant Analysis"
350
+
351
+
```python
352
+
from sklearn.discriminant_analysis import LinearDisciminantAnalysis
353
+
from turftopic import ClusteringTopicModel
354
+
355
+
corpus: list[str] = [...]
356
+
labels: list[int] = [...]
357
+
358
+
model = ClusteringTopicModel(dimensionality_reduction=LinearDisciminantAnalysis(n_components=5))
0 commit comments