
Commit 57e7d02

Merge pull request #102 from x-tabdeveloping/logreg_clustering
Added feature importance methods based on cluster differences
2 parents 2f2af9e + 059c656 commit 57e7d02

5 files changed

Lines changed: 313 additions & 109 deletions


docs/clustering.md

Lines changed: 79 additions & 22 deletions
@@ -107,6 +107,25 @@ Turftopic is entirely clustering-model agnostic, and as such, any type of model
 
 Clustering topic models rely on post-hoc term importance estimation, meaning that topic descriptions are calculated based on already discovered clusters.
 Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
+You can manipulate how these scores are calculated by changing the `feature_importance` parameter of your topic models.
+By and large, there are two types of methods that can be used for importance estimation:
+
+1. **Lexical methods**, which estimate term importance solely based on word counts in each cluster:
+    - Generally faster, since the vocabulary does not need to be encoded.
+    - Can capture more particular word use.
+    - Usually cover the topics' content better.
+2. **Semantic methods**, which estimate term importance using the semantic space of the model:
+    - Typically produce cleaner and more specific topics.
+    - Can be used in a multilingual context.
+    - Generally less sensitive to stop and junk words.
+
+| Importance method | Type | Description | Advantages |
+| - | - | - | - |
+| `soft-c-tf-idf` *(default)* | Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignments in models like Gaussian Mixtures; less sensitive to stop words than vanilla c-tf-idf. |
+| `fighting-words` **(NEW)** | Lexical | Computes word importance from cluster differences using the Fightin' Words algorithm of Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences between groups of texts. See the [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
+| `c-tf-idf` | Lexical | Computes how unique terms are to a cluster with a tf-idf-style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand, and not affected by cluster shape. |
+| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean, easily interpretable topics. |
+| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Projects words onto the parameter vectors of a linear classifier (Linear Discriminant Analysis). | Topic differences are measured in embedding space and determined by predictive power, and are therefore accurate and clean. |
 
 
 !!! quote "Choose a term importance estimation method"
@@ -120,20 +139,8 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         # or
         model = ClusteringTopicModel(feature_importance="c-tf-idf")
         ```
-    !!! failure inline end "Weaknesses"
-        - Topics can be contaminated with stop words
-        - Lower topic quality
 
-    !!! success inline end "Strengths"
-        - Theoretically more correct
-        - More within-topic coverage
-    c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster.
-    Terms which frequently occur in other clusters are inversely weighted so that words which are specific to a topic gain larger importance.
-    By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is more robust to stop words.
-
-    <br>
-
-    ??? info "Click to see formulas"
+    ??? info "Click to see formulas"
         #### Soft-c-TF-IDF
         - Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
         - Estimate weight of term $j$ for topic $z$: <br>
@@ -157,7 +164,6 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         - Calculate importance of term $j$ for topic $z$:
           $\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
 
-
     === "Centroid Proximity (Top2Vec)"
 
         ```python
@@ -166,18 +172,21 @@ Multiple methods are available in Turftopic for estimating words'/phrases' importance scores for topics.
         model = ClusteringTopicModel(feature_importance="centroid")
         ```
 
-    !!! failure inline end "Weaknesses"
-        - Low within-topic coverage
-        - Assumes spherical clusters
+    === "Fightin' Words"
 
-    !!! success inline end "Strengths"
-        - Clean topics
-        - Highly specific topics
+        ```python
+        from turftopic import ClusteringTopicModel
 
-    In Top2Vec (Angelov, 2020) term importance scores are estimated from word embeddings' similarity to the centroid vectors of clusters.
-    This approach typically produces cleaner and more specific topic descriptions, but might not be the optimal choice, since it makes assumptions about cluster shapes and only describes the centers of clusters accurately.
+        model = ClusteringTopicModel(feature_importance="fighting-words")
+        ```
+
+    === "Linear Probing"
 
+        ```python
+        from turftopic import ClusteringTopicModel
 
+        model = ClusteringTopicModel(feature_importance="linear")
+        ```
 
 
 
@@ -305,6 +314,50 @@ model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10)
 model.print_topics_over_time()
 ```
 
+## Semi-supervised Topic Modeling
+
+Some dimensionality reduction methods are capable of designing features that are effective at predicting class labels.
+This way, you can provide a supervisory signal, but also let the model discover new topics that you have not specified.
+
+!!! warning
+    TSNE, the default dimensionality reduction method in Turftopic, is not capable of semi-supervised modeling.
+    You will have to use a different algorithm.
+
+
+!!! note "Use a dimensionality reduction method for semi-supervised modeling."
+
+    === "with UMAP"
+
+        ```bash
+        pip install turftopic[umap-learn]
+        ```
+
+        ```python
+        from umap import UMAP
+        from turftopic import ClusteringTopicModel
+
+        corpus: list[str] = [...]
+
+        # UMAP can also understand missing class labels if you only have them on some examples
+        # Specify these with -1 or NaN labels
+        labels: list[int] = [0, 2, -1, -1, 0, 0, ...]
+
+        model = ClusteringTopicModel(dimensionality_reduction=UMAP())
+        model.fit(corpus, y=labels)
+        ```
+
+    === "with Linear Discriminant Analysis"
+
+        ```python
+        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+        from turftopic import ClusteringTopicModel
+
+        corpus: list[str] = [...]
+        labels: list[int] = [...]
+
+        model = ClusteringTopicModel(dimensionality_reduction=LinearDiscriminantAnalysis(n_components=5))
+        model.fit(corpus, y=labels)
+        ```
 
 ## Visualization
 
@@ -339,3 +392,7 @@ _See Figure 1_
 ## API Reference
 
 ::: turftopic.models.cluster.ClusteringTopicModel
+
+::: turftopic.models.cluster.BERTopic
+
+::: turftopic.models.cluster.Top2Vec

turftopic/feature_importance.py

Lines changed: 90 additions & 3 deletions
@@ -1,9 +1,11 @@
+from __future__ import annotations
+
+from typing import Literal
+
 import numpy as np
 import scipy.sparse as spr
-from sklearn.feature_extraction.text import TfidfTransformer
+from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
 from sklearn.metrics.pairwise import cosine_similarity
-from sklearn.preprocessing import normalize
-from sklearn.utils import check_array
 
 
 def cluster_centroid_distance(
@@ -36,6 +38,91 @@ def cluster_centroid_distance(
     return components
 
 
+def linear_classifier(
+    doc_topic_matrix: np.ndarray,
+    embeddings: np.ndarray,
+    vocab_embeddings: np.ndarray,
+) -> np.ndarray:
+    """Computes feature importances based on embedding directions
+    obtained with a linear classifier.
+
+    Parameters
+    ----------
+    doc_topic_matrix: np.ndarray
+        Document-topic matrix.
+    embeddings: np.ndarray
+        Document embeddings.
+    vocab_embeddings: np.ndarray
+        Term embeddings of shape (vocab_size, embedding_size)
+
+    Returns
+    -------
+    ndarray of shape (n_topics, vocab_size)
+        Term importance matrix.
+    """
+    labels = np.argmax(doc_topic_matrix, axis=1)
+    model = LinearDiscriminantAnalysis().fit(embeddings, labels)
+    components = cosine_similarity(model.coef_, vocab_embeddings)
+    if len(set(labels)) == 2:
+        # Binary is a special case: LDA fits a single discriminant direction
+        components = np.concatenate([-components, components], axis=0)
+    return components
+
+
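For reference, a minimal sketch of how the new `linear_classifier` could be exercised on its own, with made-up toy arrays (the shapes and values below are illustrative only, not part of this PR):

```python
import numpy as np

from turftopic.feature_importance import linear_classifier

# Toy data: 6 documents, 2 topics, 8-dimensional embeddings, 5 vocabulary terms
doc_topic_matrix = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
)
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(6, 8))        # document embeddings
vocab_embeddings = rng.normal(size=(5, 8))  # term embeddings

# Each row is a topic, each column a term: the score is the cosine similarity
# between the term embedding and the topic's LDA discriminant direction
importances = linear_classifier(doc_topic_matrix, embeddings, vocab_embeddings)
print(importances.shape)  # (2, 5)
```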
+def fighting_words(
+    doc_topic_matrix: np.ndarray,
+    doc_term_matrix: spr.csr_matrix,
+    prior: float | Literal["corpus"] = "corpus",
+) -> np.ndarray:
+    """Computes feature importance using the *Fightin' Words* algorithm.
+
+    Parameters
+    ----------
+    doc_topic_matrix: np.ndarray
+        Document-topic matrix of shape (n_documents, n_topics)
+    doc_term_matrix: spr.csr_matrix
+        Document-term matrix of shape (n_documents, vocab_size)
+    prior: float or "corpus", default "corpus"
+        Dirichlet prior to use. When a float, it indicates the alpha
+        parameter of a symmetric Dirichlet; if "corpus",
+        word frequencies from the background corpus are used.
+
+    Returns
+    -------
+    ndarray of shape (n_topics, vocab_size)
+        Term importance matrix.
+    """
+    labels = np.argmax(doc_topic_matrix, axis=1)
+    n_topics = doc_topic_matrix.shape[1]
+    n_vocab = doc_term_matrix.shape[1]
+    components = []
+    if prior == "corpus":
+        priors = np.ravel(np.asarray(doc_term_matrix.sum(axis=0)))
+    else:
+        priors = np.full(n_vocab, prior)
+    a0 = np.sum(priors)  # equals prior * n_vocab for a symmetric prior
+    for i_topic in range(n_topics):
+        topic_freq = np.ravel(
+            np.asarray(doc_term_matrix[labels == i_topic].sum(axis=0))
+        )
+        rest_freq = np.ravel(
+            np.asarray(doc_term_matrix[labels != i_topic].sum(axis=0))
+        )
+        n1 = np.sum(topic_freq)
+        n2 = np.sum(rest_freq)
+        topic_logodds = np.log(
+            (topic_freq + priors) / (n1 + a0 - topic_freq - priors)
+        )
+        rest_logodds = np.log(
+            (rest_freq + priors) / (n2 + a0 - rest_freq - priors)
+        )
+        delta = topic_logodds - rest_logodds
+        delta_var = 1 / (topic_freq + priors) + 1 / (rest_freq + priors)
+        zscore = delta / np.sqrt(delta_var)
+        components.append(zscore)
+    return np.stack(components)
+
+
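The z-scores computed above follow Monroe et al.'s prior-smoothed log-odds: $\delta_{zj} = \log\frac{y_{zj}+\alpha_j}{n_z+\alpha_0-y_{zj}-\alpha_j} - \log\frac{y_{\neg zj}+\alpha_j}{n_{\neg z}+\alpha_0-y_{\neg zj}-\alpha_j}$, with variance $\sigma^2(\delta_{zj}) \approx \frac{1}{y_{zj}+\alpha_j} + \frac{1}{y_{\neg zj}+\alpha_j}$. A minimal sketch of calling the function directly, with a made-up document-term matrix (toy counts, not from this PR):

```python
import numpy as np
import scipy.sparse as spr

from turftopic.feature_importance import fighting_words

# Toy data: 4 documents, 2 topics, 5 vocabulary terms (all counts made up)
doc_topic_matrix = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
doc_term_matrix = spr.csr_matrix(
    [
        [3, 0, 1, 0, 2],
        [2, 1, 0, 0, 1],
        [0, 4, 0, 2, 0],
        [0, 3, 1, 3, 0],
    ]
)

# Rows are topics, columns are terms; entries are log-odds z-scores
zscores = fighting_words(doc_topic_matrix, doc_term_matrix)
print(zscores.shape)  # (2, 5)

# A symmetric Dirichlet prior can be used instead of corpus frequencies
zscores_sym = fighting_words(doc_topic_matrix, doc_term_matrix, prior=0.5)
```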
 def soft_ctf_idf(
     doc_topic_matrix: np.ndarray,
     doc_term_matrix: spr.csr_matrix,

turftopic/models/_hierarchical_clusters.py

Lines changed: 24 additions & 8 deletions
@@ -5,13 +5,16 @@
 
 import numpy as np
 from scipy.cluster.hierarchy import linkage
+from scipy.spatial.distance import pdist
 from sklearn.metrics.pairwise import pairwise_distances
 
 from turftopic.base import ContextualModel
 from turftopic.feature_importance import (
     bayes_rule,
     cluster_centroid_distance,
     ctf_idf,
+    fighting_words,
+    linear_classifier,
     soft_ctf_idf,
 )
 from turftopic.hierarchical import TopicNode
@@ -188,7 +191,11 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
             components = soft_ctf_idf(
                 document_topic_matrix, self.model.doc_term_matrix
             )  # type: ignore
-        elif self.model.feature_importance == "centroid":
+        if self.model.feature_importance == "fighting-words":
+            components = fighting_words(
+                document_topic_matrix, self.model.doc_term_matrix
+            )  # type: ignore
+        elif self.model.feature_importance in ["centroid", "linear"]:
             if not hasattr(self.model, "vocab_embeddings"):
                 self.model.vocab_embeddings = self.model.encode_documents(
                     self.model.vectorizer.get_feature_names_out()
@@ -203,10 +210,17 @@ def _estimate_children_components(self) -> dict[int, np.ndarray]:
                     n_word_dims=self.model.vocab_embeddings.shape[1],
                 )
             )
-            components = cluster_centroid_distance(
-                topic_vectors,
-                self.model.vocab_embeddings,
-            )
+            if self.model.feature_importance == "centroid":
+                components = cluster_centroid_distance(
+                    topic_vectors,
+                    self.model.vocab_embeddings,
+                )
+            else:
+                components = linear_classifier(
+                    document_topic_matrix,
+                    self.model.embeddings,
+                    self.model.vocab_embeddings,
+                )
         elif self.model.feature_importance == "bayes":
             components = bayes_rule(
                 document_topic_matrix, self.model.doc_term_matrix
@@ -248,9 +262,11 @@
         n_classes = len(classes[classes != -1])
         topic_vectors = topic_representations[classes != -1]
         n_reductions = n_classes - n_reduce_to
-        return linkage(topic_vectors, method=method, metric=metric)[
-            :n_reductions
-        ]
+        cond_dist = pdist(topic_vectors, metric=metric)
+        # Replace non-finite cosine distances to keep linkage numerically stable
+        if metric == "cosine":
+            cond_dist[~np.isfinite(cond_dist)] = -1
+        return linkage(cond_dist, method=method)[:n_reductions]
 
     def reduce_topics(
         self, n_reduce_to: int, method: str = "average", metric: str = "cosine"
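The change above precomputes the condensed distance matrix instead of letting `linkage` derive it internally, so non-finite cosine distances (e.g. from all-zero topic vectors) can be sanitized before clustering. A self-contained sketch of the same pattern, with made-up vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Toy topic vectors; the all-zero row makes cosine distance undefined (NaN)
topic_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 1.0]])

cond_dist = pdist(topic_vectors, metric="cosine")  # condensed form, length n*(n-1)/2
cond_dist[~np.isfinite(cond_dist)] = -1  # sanitize, as in the patch above
merges = linkage(cond_dist, method="average")  # would raise on NaN without the fix
print(merges)
```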
