5 Latent Semantic Scaling

Semi-supervised Scaling with Seed Words

Author

Paride Carrara

Published

June 8, 2026

5.1 Introduction

Latent Semantic Scaling (LSX) (Watanabe 2021) sits between fully supervised methods (Wordscores) and fully unsupervised ones (Wordfish). You supply a small set of seed words that anchor the dimension of interest (left vs. right, pro- vs. anti-EU), and the model does the rest using Latent Semantic Analysis, i.e. a singular value decomposition (SVD) of the document-feature matrix.

The key insight: once words are projected into a dense semantic space, words that mean similar things end up near each other. So if “lavoro” is a left seed, words like “occupazione” or “lavoratori” will automatically be scored towards the left pole even if they are not in the seed list (as long as they appear in similar contexts). This allows LSX to capture a richer and more nuanced ideological dimension than a Wordscores application.

The main issue is that in order to build the embedding space, we need a corpus that is large enough to capture the relevant semantic relationships. This usually requires several thousand documents (> 5 000) to work well.

5.2 Import and prepare the data

Load the required packages and the data. LSX is the R package that implements Latent Semantic Scaling; it works with quanteda objects. In this section we work with parliamentary speeches from the 18th legislature of the “Camera dei Deputati”. The data come from the ItaParlCorpus dataset (Cova 2025).

library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(LSX)
library(tidyverse)
library(ggplot2)
library(kableExtra)
library(dplyr)

#load the data
load("data/camera_legislature18.RData")

#remove if chair is TRUE
ita_df_04 <- ita_df_04 |> filter(chair == FALSE)

#remove if text has less than 50 words
ita_df_04 <- ita_df_04 |> filter(str_count(text, "\\w+") >= 50)

#from text remove the characters until the first "."
ita_df_04 <- ita_df_04 |> mutate(text = str_replace(text, "^[^.]*\\.\\s*", ""))

# create doc_id2 by concatenating legislature and doc_id
ita_df_04 <- ita_df_04 |> mutate(doc_id2 = paste0(legislature, "_", doc_id))

# keep only if legislature is 18
ita_df_04 <- ita_df_04 |> filter(legislature == 18)

We remove speeches delivered by the groups “USEI”, “Others”, “+Europa”, “MAIE” and “SVP-PATT” because they have very few speeches and are not relevant for our analysis. The code below also drops speeches labelled “Centrosinistra” and “Centrodestra”.

#remove PPGs
ita_df_04 <- ita_df_04 |> filter(!party_name %in% c("USEI", "Others", "+Europa", "MAIE", "SVP-PATT", "Centrosinistra", "Centrodestra"))

#correct name PPG by replacing "Italia dei Valori" with "Italia Viva"
ita_df_04 <- ita_df_04 |> mutate(party_name = str_replace(party_name, "Italia dei Valori", "Italia Viva"))

5.3 Build the Corpus and the DFM

In order to properly estimate the embedding space we need to reshape the corpus at the sentence level. This is because the semantic relationships between words are better captured at a finer granularity than the whole speech. If we keep the speeches as documents, we might miss important co-occurrence patterns that only emerge at the sentence level, or introduce noise from long speeches that cover multiple topics. By reshaping to sentences, we allow the model to learn more accurate word embeddings based on the local context in which words appear.
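
The effect of the unit of analysis can be seen in a minimal base-R sketch. The speech is invented for illustration and `cooccur()` is a hypothetical helper, not part of quanteda or LSX; it only checks whether two words appear together in at least one unit.

```r
# Toy speech (invented) that covers two unrelated topics
speech <- "Vogliamo ridurre le tasse per le imprese. I rifugiati hanno diritto alla protezione."

# Naive sentence split on full stops (quanteda's corpus_reshape() is far more careful)
sentences <- unlist(strsplit(speech, "\\.\\s*"))

# Hypothetical helper: TRUE if both words occur together in at least one unit
cooccur <- function(units, w1, w2) {
  tokens <- strsplit(tolower(units), "[^[:alpha:]]+")
  any(vapply(tokens, function(tk) all(c(w1, w2) %in% tk), logical(1)))
}

speech_level   <- cooccur(speech, "tasse", "rifugiati")     # whole speech as one document
sentence_level <- cooccur(sentences, "tasse", "rifugiati")  # one document per sentence
same_sentence  <- cooccur(sentences, "tasse", "imprese")    # words from the same sentence

c(speech_level = speech_level, sentence_level = sentence_level, same_sentence = same_sentence)
```

At the speech level “tasse” and “rifugiati” look related simply because the speaker changed topic; at the sentence level only the genuinely local pair “tasse” and “imprese” co-occurs.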

#We replace the characters [‘’‚‛'`] with a space in the text column of the data frame
ita_df_04$text <- gsub("[‘’‚‛'`]", " ", ita_df_04$text)

#create a corpus from the data frame, using the text column as the text and row_id as the document id
corp <- corpus(ita_df_04, text_field = "text", docid_field = "row_id")

#reshape the corpus at the sentence level
corp <- corpus_reshape(corp, to = "sentences")

toks <- tokens(corp,
               remove_punct   = TRUE,   # drop . , ; : ! ? etc.
               remove_numbers = TRUE,   # drop 1, 2, 42 etc.
               remove_symbols = TRUE) |>
  tokens_tolower() |>                   # "Governo" -> "governo"
  tokens_remove(stopwords("it")) 

#create a DFM from the tokens
dfm_speeches <- dfm(toks)

# Print a summary of the DFM
print(dfm_speeches)

Document-feature matrix of: 410,116 documents, 115,293 features (99.98% sparse) and 13 docvars.
                 features
docs              elenco componenti gruppi formato base dichiarazioni rese
  20180329.0003.1      1          2      1       1    1             1    1
  20180329.0005.1      0          0      0       0    0             0    0
  20180329.0005.2      0          0      0       0    0             0    0
  20180329.0005.3      0          0      0       0    0             0    0
  20180410.0007.1      0          0      0       0    0             0    0
  20180410.0007.2      0          0      2       0    0             0    0
                 features
docs              deputati sensi articolo
  20180329.0003.1        1     1        1
  20180329.0005.1        1     1        1
  20180329.0005.2        1     1        1
  20180329.0005.3        0     0        0
  20180410.0007.1        0     1        1
  20180410.0007.2        0     0        0
[ reached max_ndoc ... 410,110 more documents, reached max_nfeat ... 115,283 more features ]

5.4 Define seed words

Seed words are the only supervision LSX requires. They define the poles of the dimension we want to measure. Words assigned +1 anchor the positive pole, which here is the right; words assigned -1 anchor the negative pole, which here is the left (see the seed weights printed by summary() in Section 5.5).

Tip

Seed word choice is a modelling decision. Here we are interested in the left-right dimension but LSX can be applied to any dimension you can define with words.

In his original application, Watanabe (2021) employed LSX to perform sentiment analysis.

We now specify the seed words to perform the left-right scaling. We will use the following seed words:

my_dictionary <- dictionary(list(
  right = c("assistenzialismo", "patria", "clandestini", "detassazione", "sicurezza", "pmi", "imprese", "flat"),
  left  = c("precario", "diseguaglianze", "disuguaglianze", "salari", "lavoratori", "occupazione", "rifugiati")
))

# as.seedwords() gives +1 to the first key (right) and -1 to the second key (left)
seeds <- as.seedwords(my_dictionary)

Important

Seed word choice is a modelling decision. There is no algorithmic way to pick the “right” seeds. You should:

Justify your seeds theoretically (e.g., from the literature on left-right ideology).
Check that each seed is present in your vocabulary.
Test sensitivity: refit the model with an alternative seed set and compare the party rankings. If the ordering is robust, you can be more confident in your results.
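
The presence check in the list above can be sketched in base R. The vocabulary vector here is invented for illustration; in this tutorial's pipeline it would presumably come from `featnames(dfm_speeches)`.

```r
# A few candidate seeds taken from the dictionary above
seed_words <- c("patria", "clandestini", "flat", "salari", "rifugiati")

# Stand-in vocabulary (invented); with a quanteda DFM this would be
# featnames(dfm_speeches)
vocabulary <- c("patria", "salari", "rifugiati", "imprese", "governo")

# Seeds that never occur in the corpus cannot anchor the scale
missing_seeds <- setdiff(seed_words, vocabulary)
missing_seeds
```

Any seed returned here should be replaced or dropped before fitting, since a word absent from the DFM has no position in the semantic space.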

5.5 Fit the LSX model

textmodel_lss() performs three steps internally:

Runs Singular Value Decomposition (SVD) on the DFM, retaining k dimensions
Projects every word into the resulting k-dimensional semantic space
Scores each word by its average cosine similarity to the positive seeds minus its average cosine similarity to the negative seeds
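
The three steps can be illustrated with a toy example in base R. This is a simplified sketch under stated assumptions, not the internal code of LSX: the counts are invented, `k = 2` is chosen only because the matrix is tiny, and there is a single seed per pole, so the "average" similarity is just one value.

```r
# Toy document-feature matrix: rows = documents, columns = words (invented counts)
m <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 2, 1,
    0, 0, 1, 2,
    1, 1, 0, 0,
    0, 0, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("imprese", "pmi", "lavoratori", "salari"))
)

# Step 1: SVD, keeping k latent dimensions
k <- 2
s <- svd(m, nu = k, nv = k)

# Step 2: each word becomes a k-dimensional vector
word_vecs <- s$v %*% diag(s$d[1:k])
rownames(word_vecs) <- colnames(m)

# Step 3: cosine similarity to the positive seed minus similarity to the negative seed
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
seeds_toy <- c(imprese = 1, lavoratori = -1)
score <- sapply(rownames(word_vecs), function(w) {
  sum(sapply(names(seeds_toy), function(sw) {
    seeds_toy[[sw]] * cosine(word_vecs[w, ], word_vecs[sw, ])
  }))
})
round(score, 2)
```

In this toy corpus “pmi” is not a seed, yet it ends up at the same pole as “imprese” because the two words occur in the same documents; “salari” follows “lavoratori” for the same reason.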

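lss_model <- textmodel_lss(
  dfm_speeches,
  seeds = seeds,
  k     = 300,         # number of LSA dimensions to retain
  cache = TRUE,        # cache the SVD result to disk; saves time on re-runs
  include_data = TRUE, # store the DFM in the model so predict() can score it later
  group_data = TRUE    # regroup sentences into the original speeches before storing
)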

summary(lss_model)


Call:
textmodel_lss(x = dfm_speeches, seeds = seeds, k = 300, cache = TRUE, 
    include_data = TRUE, group_data = TRUE)

Seeds:
assistenzialismo           patria      clandestini     detassazione 
               1                1                1                1 
       sicurezza              pmi          imprese             flat 
               1                1                1                1 
        precario   diseguaglianze   disuguaglianze           salari 
              -1               -1               -1               -1 
      lavoratori      occupazione        rifugiati 
              -1               -1               -1 

Beta:
(showing first 30 elements)
                 medie                piccole                   sace 
                0.2408                 0.2221                 0.2204 
              prestiti internazionalizzazione              liquidità 
                0.2113                 0.2111                 0.2106 
               imprese           medio-grandi                crediti 
                0.2019                 0.2011                 0.1980 
                   pmi              agevolato                   rate 
                0.1972                 0.1870                 0.1862 
          esportatrici           intermediari                bancari 
                0.1856                 0.1847                 0.1842 
                   tax               rotativo             rilavorano 
                0.1841                 0.1821                 0.1814 
            annaffiate                credito            erogheranno 
                0.1814                 0.1798                 0.1783 
             differite                  micro             creditrici 
                0.1782                 0.1766                 0.1755 
             pagamenti                 strong               garanzia 
                0.1751                 0.1746                 0.1739 
        controgaranzia                  split        200.000-300.000 
                0.1737                 0.1719                 0.1711 

Data Dimension:
[1]  35405 115293

On choosing k

k controls how many latent dimensions the SVD keeps. Think of it as the resolution of the semantic space:

Too small (e.g., k = 50): loses too much information, words cluster too coarsely
Too large (e.g., k = 1000): captures noise, slows computation
k = 300 is the default in textmodel_lss() and a common choice for corpora of this size

Tip

If your corpus is small (< 5 000 documents), consider k = 100 or k = 150 to avoid over-fitting the semantic space.

5.6 Explore word scores

After fitting, every word in the vocabulary has an LSX score. Positive scores indicate semantic proximity to the right pole; negative scores indicate proximity to the left pole.

Here are the 20 words closest to the right pole, i.e. the most right-wing words according to the LSX model. The table is sorted in ascending order, so the strongest right-wing terms appear at the bottom:

head(coef(lss_model), 20) |>
  sort() |>
  (\(x) tibble::tibble(word = names(x), score = x))() |>
  mutate(
    rank = row_number(),
    bar = score / min(score)
  ) |>
  select(rank, word, score) |>
  kbl(
    digits = 4,
    col.names = c("#", "Word", "Score"),
    align = c("r", "l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, monospace = TRUE) |>
  column_spec(3, color = "#2126b2", bold = TRUE) |>
  row_spec(0, color = "black", font_size = 12)

#	Word	Score
1	credito	0.1798
2	annaffiate	0.1814
3	rilavorano	0.1814
4	rotativo	0.1821
5	tax	0.1841
6	bancari	0.1842
7	intermediari	0.1847
8	esportatrici	0.1856
9	rate	0.1862
10	agevolato	0.1870
11	pmi	0.1972
12	crediti	0.1980
13	medio-grandi	0.2011
14	imprese	0.2019
15	liquidità	0.2106
16	internazionalizzazione	0.2111
17	prestiti	0.2113
18	sace	0.2204
19	piccole	0.2221
20	medie	0.2408

Here are the top 20 words closest to the left pole, the most left-wing words according to the LSX model:

library(kableExtra)
library(dplyr)

tail(coef(lss_model), 20) |>
  sort() |>
  (\(x) tibble::tibble(word = names(x), score = x))() |>
  mutate(
    rank = row_number(),
    bar = score / min(score)
  ) |>
  select(rank, word, score) |>
  kbl(
    digits = 4,
    col.names = c("#", "Word", "Score"),
    align = c("r", "l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, monospace = TRUE) |>
  column_spec(3, color = "#A32D2D", bold = TRUE) |>
  row_spec(0, color = "gray", font_size = 12)

#	Word	Score
1	disuguaglianze	-0.3200
2	diseguaglianze	-0.2978
3	lavoratrici	-0.2576
4	impugnano	-0.2406
5	tutele	-0.2385
6	disosso	-0.2384
7	steer	-0.2384
8	seghezzi	-0.2329
9	lavorative	-0.2309
10	socialmente	-0.2294
11	precoci	-0.2213
12	lavoratori	-0.2207
13	disparità	-0.2179
14	diseguaglianza	-0.2167
15	separandoli	-0.2141
16	ammortizzatori	-0.2141
17	marostica	-0.2139
18	salari	-0.2128
19	intermittenti	-0.2120
20	fragilità	-0.2109

Inspect this output carefully. Ask yourself:

Do the top left-leaning words make theoretical sense?
Are there surprising words near either pole? If so, why?
Are there seed words among the top-scored words, or have other words “caught up” to them?

Visual exploration — plot the word scores with the seeds highlighted:

textplot_terms(lss_model, highlighted = names(seeds))

This plot shows word scores on the horizontal axis. Seeds are labelled; other words are shown as points. Words near the seeds should be thematically related — this is your main validity check.

5.7 Score documents

Once word scores are established, each document receives a score equal to the frequency-weighted mean of the scores of the words it contains:

\[\hat{\theta}_d = \frac{\sum_w \text{freq}(w, d) \cdot s_w}{\sum_w \text{freq}(w, d)}\]
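
A worked instance of the formula in base R, with invented word scores and counts chosen so the arithmetic is easy to follow. The sketch shows only the raw weighted mean; predict() in LSX can additionally standardise the result (its `rescale` argument), which is left out here.

```r
# Invented word scores s_w (illustration only)
word_scores <- c(imprese = 0.2, lavoratori = -0.2, salari = -0.1)

# Invented word counts freq(w, d) for a single document d
freq <- c(imprese = 1, lavoratori = 3, salari = 1)

# Frequency-weighted mean: sum(freq * s_w) / sum(freq)
theta_d <- sum(freq * word_scores[names(freq)]) / sum(freq)
theta_d

# The same quantity via the built-in helper
theta_check <- weighted.mean(word_scores[names(freq)], w = freq)
```

The document mentions “lavoratori” three times, so the negative scores outweigh the single positive one: (0.2 - 0.6 - 0.1) / 5 = -0.1.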

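We first extract the document-level data and use predict() to get the LSS score for each document. Because the model was fitted with group_data = TRUE, the sentences are regrouped into the original speeches, so we obtain one score per speech (35,405 in total) rather than one per sentence. We can then visualise the distribution of these scores across all speeches.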

dat <- docvars(lss_model$data)
dat$lss_score <- predict(lss_model)
print(nrow(dat))

[1] 35405

5.8 Party-level aggregation

We can estimate the left-right position of each party by aggregating the document scores. We can also plot how these positions evolve over time. But before we do that, let’s check how well the LSS scores correlate with the Wordscores and Wordfish scores we obtained in the previous modules. This will give us a sense of whether LSS is capturing a similar ideological dimension, and how much it agrees with the other methods.

dat |> 
  group_by(party_name) |>
  summarise(mean_lss = mean(lss_score, na.rm = TRUE)) |>
  arrange(mean_lss)

# A tibble: 6 × 2
  party_name                             mean_lss
  <chr>                                     <dbl>
1 Movimento 5 Stelle                      -0.362 
2 Partito Democratico                     -0.132 
3 Italia Viva                              0.0198
4 Forza Italia – Il Popolo della Libertà   0.101 
5 Fratelli d'Italia                        0.191 
6 Lega                                     0.253
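
A party mean on its own hides how much speeches vary within a party. The base-R sketch below adds a standard deviation and an approximate 95% interval to each mean. The scores and party labels are invented for illustration; with the real data the same computation would run on `dat$lss_score` grouped by `dat$party_name`.

```r
# Invented document scores for two hypothetical parties (illustration only)
toy <- data.frame(
  party_name = c("A", "A", "A", "A", "B", "B", "B", "B"),
  lss_score  = c(-0.4, -0.2, -0.3, -0.1, 0.1, 0.5, 0.3, 0.3)
)

# Mean, standard deviation and standard error of the mean per party
party_stats <- do.call(rbind, lapply(split(toy, toy$party_name), function(g) {
  data.frame(
    party_name = g$party_name[1],
    mean_lss   = mean(g$lss_score),
    sd_lss     = sd(g$lss_score),
    se_lss     = sd(g$lss_score) / sqrt(nrow(g))
  )
}))

# Approximate 95% interval around each party mean (normal approximation)
party_stats$lower <- party_stats$mean_lss - 1.96 * party_stats$se_lss
party_stats$upper <- party_stats$mean_lss + 1.96 * party_stats$se_lss
party_stats
```

If two parties' intervals overlap heavily, the difference between their means should be read with caution.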

We can also build a nicer table with the party names, the mean LSS score and the direction (left or right) implied by the sign of the mean.

# compute the party means once and reuse them for the table and the colours
party_means <- dat |>
  group_by(party_name) |>
  summarise(mean_lss = mean(lss_score, na.rm = TRUE)) |>
  arrange(mean_lss)

# red for negative (left) means, blue for positive (right) means
pole_colors <- ifelse(party_means$mean_lss < 0, "#A32D2D", "#185FA5")

party_means |>
  mutate(
    rank = row_number(),
    direction = ifelse(mean_lss < 0, "LEFT", "RIGHT")
  ) |>
  select(rank, party_name, mean_lss, direction) |>
  kbl(
    digits = 4,
    col.names = c("#", "Party", "Mean LSS", "Pole"),
    align = c("r", "l", "r", "c")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, bold = TRUE) |>
  column_spec(3, color = pole_colors) |>
  column_spec(4, color = pole_colors) |>
  row_spec(0, color = "gray", font_size = 12)

#	Party	Mean LSS	Pole
1	Movimento 5 Stelle	-0.3620	LEFT
2	Partito Democratico	-0.1320	LEFT
3	Italia Viva	0.0198	RIGHT
4	Forza Italia – Il Popolo della Libertà	0.1011	RIGHT
5	Fratelli d'Italia	0.1913	RIGHT
6	Lega	0.2526	RIGHT

5.9 Reflection

Take a few minutes to think about what the LSX results tell us about Italian parliamentary politics — and about the method itself.

Substantive questions

Are PD and FDI at the expected opposite poles of the left-right axis?
Where does M5S fall? The Five Star Movement explicitly rejects the left-right label; does LSX place it in the centre, or does it lean in one direction in practice?
Are the Lega scores consistent with what you know about its ideological evolution (from northern regionalism toward national right-wing populism)?
Are there any parties whose position surprises you? What might explain the discrepancy?

Methodological questions

How sensitive are the results to the seed word choice? Try replacing one or two seeds and refit — does the party ranking change substantially?
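LSX word scores are built from cosine similarities, so they stay in a narrow band around zero, while the document scores returned by predict() are standardised by default (rescale = TRUE). In both cases the exact values depend on the seed words and the corpus. The scale is relative, not absolute: you can compare parties to each other but not directly to Wordfish or Wordscores scores.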
What does within-party variance (the error bars in Section 5) tell us? Is it noise, or genuine heterogeneity in how different politicians within the same party talk?
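
One way to make the seed-sensitivity check concrete is to compare the party orderings from two fits with a rank correlation. In the sketch below the baseline values are the party means reported in Section 5.8 (with party names abbreviated), while the `alternative` values are invented placeholders standing in for a refit with different seeds; they are not results.

```r
# Party means from the table in Section 5.8 (baseline seed set)
baseline <- c(M5S = -0.362, PD = -0.132, IV = 0.0198, FI = 0.101, FdI = 0.191, Lega = 0.253)

# Hypothetical means from a refit with alternative seeds (invented numbers, not results)
alternative <- c(M5S = -0.300, PD = -0.150, IV = 0.050, FI = 0.030, FdI = 0.210, Lega = 0.240)

# Spearman's rho compares the two orderings; 1 means identical rankings
rho <- cor(baseline, alternative[names(baseline)], method = "spearman")
rho

# Parties whose rank changes between the two fits
changed <- names(baseline)[rank(baseline) != rank(alternative[names(baseline)])]
changed
```

A rho close to 1 with only adjacent parties swapping places suggests the ordering is robust; large reshuffles would indicate that the scale depends heavily on the particular seeds chosen.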

Important

Key takeaway: LSX gives us a principled, replicable, theoretically-grounded way to measure a specific ideological dimension from text, with minimal manual annotation. Its main assumption is that your seed words are valid operationalisations of the dimension you care about. That assumption should always be examined, not taken for granted, as the scores of the non-seed words are derived from the seed words.

6 Bibliography

Cova, Joshua. 2025. “A New Database for Italian Parliamentary Speeches: Introducing the ItaParlCorpus Dataset.” Italian Political Science Review / Rivista Italiana Di Scienza Politica 55 (1): 77–86. https://doi.org/10.1017/ipo.2025.6.

Watanabe, Kohei. 2021. “Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages.” Communication Methods and Measures 15 (2): 81–102. https://doi.org/10.1080/19312458.2020.1832976.