library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(LSX)
library(tidyverse)
library(ggplot2)
library(kableExtra)
library(dplyr)
#load the data
load("data/camera_legislature18.RData")
#remove if chair is TRUE
ita_df_04 <- ita_df_04 |> filter(chair == FALSE)
#remove if text has less than 50 words
ita_df_04 <- ita_df_04 |> filter(str_count(text, "\\w+") >= 50)
#from text remove the characters until the first "."
ita_df_04 <- ita_df_04 |> mutate(text = str_replace(text, "^[^.]*\\.\\s*", ""))
# create doc_id2 by concatenating legislature and doc_id
ita_df_04 <- ita_df_04 |> mutate(doc_id2 = paste0(legislature, "_", doc_id))
# keep only if legislature is 18
ita_df_04 <- ita_df_04 |> filter(legislature == 18)
5 Latent Semantic Scaling
Semi-supervised Scaling with Seed Words
5.1 Introduction
Latent Semantic Scaling (LSX) (Watanabe 2021) sits between fully supervised methods (Wordscores) and fully unsupervised ones (Wordfish). You supply a small set of seed words that anchor the dimension of interest (left vs. right, pro- vs. anti-EU), and the model does the rest using Latent Semantic Analysis, i.e. a Singular Value Decomposition (SVD) of the document-feature matrix.
The key insight: once words are projected into a dense semantic space, words that mean similar things end up near each other. So if “lavoro” is a left seed, words like “occupazione” or “lavoratori” will automatically receive scores close to the left pole even if they are not in the seed list, as long as they appear in similar contexts. This allows LSX to capture a much richer and more nuanced ideological dimension than a Wordscores application.
The main limitation is that building the semantic space requires a corpus large enough to capture the relevant semantic relationships. In practice this usually means several thousand documents (> 5 000).
5.2 Import and prepare the data
Load the required packages and the data. LSX is the R package that implements Latent Semantic Scaling; it works with quanteda objects. For this section we will work with parliamentary speeches from the 18th legislature of the “Camera dei Deputati”. The data come from the ItalParlCorpus dataset (Cova 2025).
We remove speeches delivered by the groups “USEI”, “Others”, “+Europa”, “MAIE”, and “SVP-PATT” because they have very few speeches and are not relevant for our analysis. The code below also drops the “Centrosinistra” and “Centrodestra” labels.
#remove PPGs
ita_df_04 <- ita_df_04 |> filter(!party_name %in% c("USEI", "Others", "+Europa", "MAIE", "SVP-PATT", "Centrosinistra", "Centrodestra"))
#correct name PPG by replacing "Italia dei Valori" with "Italia Viva"
ita_df_04 <- ita_df_04 |> mutate(party_name = str_replace(party_name, "Italia dei Valori", "Italia Viva"))
5.3 Build the Corpus and the DFM
In order to properly estimate the embedding space we need to reshape the corpus at the sentence level. This is because the semantic relationships between words are better captured at a finer granularity than the whole speech. If we keep the speeches as documents, we might miss important co-occurrence patterns that only emerge at the sentence level, or introduce noise from long speeches that cover multiple topics. By reshaping to sentences, we allow the model to learn more accurate word embeddings based on the local context in which words appear.
#We replace the characters [‘’‚‛'`] with a space in the text column of the data frame
ita_df_04$text <- gsub("[‘’‚‛'`]", " ", ita_df_04$text)
#create a corpus from the data frame, using the text column as the text and row_id as the document id
corp <- corpus(ita_df_04, text_field = "text", docid_field = "row_id")
#reshape the corpus at the sentence level
corp <- corpus_reshape(corp, to = "sentences")
#tokenise, lowercase, and remove Italian stopwords
toks <- tokens(corp,
               remove_punct = TRUE,   # drop . , ; : ! ? etc.
               remove_numbers = TRUE, # drop 1, 2, 42 etc.
               remove_symbols = TRUE) |>
  tokens_tolower() |>                 # "Governo" -> "governo"
  tokens_remove(stopwords("it"))
#create a DFM from the tokens
dfm_speeches <- dfm(toks)
# Print a summary of the DFM
print(dfm_speeches)
Document-feature matrix of: 410,116 documents, 115,293 features (99.98% sparse) and 13 docvars.
                 features
docs              elenco componenti gruppi formato base dichiarazioni rese
  20180329.0003.1      1          2      1       1    1             1    1
  20180329.0005.1      0          0      0       0    0             0    0
  20180329.0005.2      0          0      0       0    0             0    0
  20180329.0005.3      0          0      0       0    0             0    0
  20180410.0007.1      0          0      0       0    0             0    0
  20180410.0007.2      0          0      2       0    0             0    0
                 features
docs              deputati sensi articolo
  20180329.0003.1        1     1        1
  20180329.0005.1        1     1        1
  20180329.0005.2        1     1        1
  20180329.0005.3        0     0        0
  20180410.0007.1        0     1        1
  20180410.0007.2        0     0        0
[ reached max_ndoc ... 410,110 more documents, reached max_nfeat ... 115,283 more features ]
5.4 Define seed words
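Seed words are the only supervision LSX requires. They define the poles of the dimension we want to measure. Words assigned +1 anchor the positive pole; words assigned -1 anchor the negative pole. In the dictionary used here, the first key (right) receives +1 and the second key (left) receives -1, so positive scores point to the right and negative scores to the left.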
Seed word choice is a modelling decision. Here we are interested in the left-right dimension but LSX can be applied to any dimension you can define with words.
- In his original application Watanabe (2021) employed LSX to perform sentiment analysis
We now specify the seed words to perform the left-right scaling. We will use the following seed words:
my_dictionary <- dictionary(list(
  right = c("assistenzialismo", "patria", "clandestini", "detassazione", "sicurezza", "pmi", "imprese", "flat"),
  left = c("precario", "diseguaglianze", "disuguaglianze", "salari", "lavoratori", "occupazione", "rifugiati")
))
# as.seedwords() assigns +1 to the first key (right) and -1 to the second key (left)
seeds <- as.seedwords(my_dictionary)
Seed word choice is a modelling decision. There is no algorithmic way to pick the “right” seeds. You should:
- Justify your seeds theoretically (e.g., from the literature on left-right ideology).
- Check that each seed is present in your vocabulary.
- Test sensitivity: refit the model with an alternative seed set and compare the party rankings. If the ordering is robust, you can be more confident in your results.
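The vocabulary check in the list above can be sketched in a few lines of base R. The two vectors below are hypothetical stand-ins so that the snippet runs on its own; with the objects from this section you would presumably use `featnames(dfm_speeches)` for the vocabulary and `names(seeds)` for the seed words.

```r
# Hypothetical stand-ins (assumptions for illustration only).
# With the tutorial objects: vocab <- featnames(dfm_speeches); seed_words <- names(seeds)
vocab <- c("imprese", "pmi", "lavoratori", "salari", "governo")
seed_words <- c("imprese", "pmi", "flat", "lavoratori", "precario")

# Seeds that never occur in the vocabulary cannot anchor the scale
missing_seeds <- setdiff(seed_words, vocab)
missing_seeds

# Share of seeds that are actually usable
mean(seed_words %in% vocab)
```

A seed that is missing from the vocabulary (for example because it was removed as a stopword or never appears in the corpus) contributes nothing to the model, so it is worth replacing before fitting.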
5.5 Fit the LSX model
textmodel_lss() performs three steps internally:
- Runs Singular Value Decomposition (SVD) on the DFM, retaining k dimensions
- Projects every word into the resulting k-dimensional semantic space
- Scores each word by its average cosine similarity to the positive seeds minus its average cosine similarity to the negative seeds
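The three steps can be illustrated with a minimal base R sketch on a toy matrix. This is not the LSX implementation, which may differ in weighting and in how the SVD is computed; the counts, the choice of `k = 2`, and the single seed per pole are assumptions made purely for illustration.

```r
# Toy document-feature matrix: 6 sentences x 7 words (made-up counts)
x <- matrix(
  c(2, 1, 1, 0, 0, 0, 1,
    1, 2, 0, 0, 0, 0, 1,
    1, 0, 2, 0, 0, 0, 1,
    0, 0, 0, 2, 1, 1, 1,
    0, 0, 0, 1, 2, 0, 1,
    0, 0, 0, 1, 0, 2, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("s", 1:6),
                  c("lavoratori", "salari", "occupazione",
                    "imprese", "pmi", "detassazione", "governo"))
)

# Step 1: truncated SVD, keeping k latent dimensions
k <- 2
dec <- svd(x, nu = k, nv = k)

# Step 2: each word becomes a k-dimensional vector
word_vec <- dec$v %*% diag(dec$d[1:k])
rownames(word_vec) <- colnames(x)

# Cosine similarity between every pair of words
unit <- word_vec / sqrt(rowSums(word_vec^2))
sim <- unit %*% t(unit)

# Step 3: mean similarity to the +1 seeds minus mean similarity to the -1 seeds
pos <- "imprese"     # +1 seed (right)
neg <- "lavoratori"  # -1 seed (left)
score <- rowMeans(sim[, pos, drop = FALSE]) - rowMeans(sim[, neg, drop = FALSE])
round(sort(score, decreasing = TRUE), 3)
```

In this toy example, words that share sentences with the right seed end up with positive scores, words that share sentences with the left seed end up with negative scores, and "governo", which appears equally often with both, sits at zero.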
lss_model <- textmodel_lss(
  dfm_speeches,
  seeds = seeds,
  k = 300,             # number of LSA dimensions to retain
  cache = TRUE,        # cache the SVD result to disk; saves time on re-runs
  include_data = TRUE,
  group_data = TRUE
)
summary(lss_model)
Call:
textmodel_lss(x = dfm_speeches, seeds = seeds, k = 300, cache = TRUE,
include_data = TRUE, group_data = TRUE)
Seeds:
assistenzialismo patria clandestini detassazione
1 1 1 1
sicurezza pmi imprese flat
1 1 1 1
precario diseguaglianze disuguaglianze salari
-1 -1 -1 -1
lavoratori occupazione rifugiati
-1 -1 -1
Beta:
(showing first 30 elements)
medie piccole sace
0.2408 0.2221 0.2204
prestiti internazionalizzazione liquidità
0.2113 0.2111 0.2106
imprese medio-grandi crediti
0.2019 0.2011 0.1980
pmi agevolato rate
0.1972 0.1870 0.1862
esportatrici intermediari bancari
0.1856 0.1847 0.1842
tax rotativo rilavorano
0.1841 0.1821 0.1814
annaffiate credito erogheranno
0.1814 0.1798 0.1783
differite micro creditrici
0.1782 0.1766 0.1755
pagamenti strong garanzia
0.1751 0.1746 0.1739
controgaranzia split 200.000-300.000
0.1737 0.1719 0.1711
Data Dimension:
[1] 35405 115293
On choosing k
k controls how many latent dimensions the SVD keeps. Think of it as the resolution of the semantic space:
- Too small (e.g., k = 50): loses too much information, words cluster too coarsely
- Too large (e.g., k = 1000): captures noise, slows computation
- k = 300 is a widely used default for parliamentary corpora of this size
If your corpus is small (< 5 000 documents), consider k = 100 or k = 150 to avoid over-fitting the semantic space.
5.6 Explore word scores
After fitting, every word in the vocabulary has an LSX score. With the seeds defined above, positive scores indicate semantic proximity to the right pole; negative scores indicate proximity to the left pole.
Here are the top 20 words closest to the right pole, the most right-wing words according to the LSX model:
# word scores are stored from most positive to most negative,
# so head() returns the words closest to the positive (right) pole
head(coef(lss_model), 20) |>
  sort(decreasing = TRUE) |>   # rank 1 = word closest to the right pole
  (\(x) tibble::tibble(word = names(x), score = x))() |>
  mutate(rank = row_number()) |>
  select(rank, word, score) |>
  kbl(
    digits = 4,
    col.names = c("#", "Word", "Score"),
    align = c("r", "l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, monospace = TRUE) |>
  column_spec(3, color = "#2126b2", bold = TRUE) |>
  row_spec(0, color = "black", font_size = 12)
| # | Word | Score |
|---|---|---|
| 1 | medie | 0.2408 |
| 2 | piccole | 0.2221 |
| 3 | sace | 0.2204 |
| 4 | prestiti | 0.2113 |
| 5 | internazionalizzazione | 0.2111 |
| 6 | liquidità | 0.2106 |
| 7 | imprese | 0.2019 |
| 8 | medio-grandi | 0.2011 |
| 9 | crediti | 0.1980 |
| 10 | pmi | 0.1972 |
| 11 | agevolato | 0.1870 |
| 12 | rate | 0.1862 |
| 13 | esportatrici | 0.1856 |
| 14 | intermediari | 0.1847 |
| 15 | bancari | 0.1842 |
| 16 | tax | 0.1841 |
| 17 | rotativo | 0.1821 |
| 18 | rilavorano | 0.1814 |
| 19 | annaffiate | 0.1814 |
| 20 | credito | 0.1798 |
Here are the top 20 words closest to the left pole, the most left-wing words according to the LSX model:
# tail() returns the words closest to the negative (left) pole;
# sorting in ascending order puts the most left-leaning word first
tail(coef(lss_model), 20) |>
  sort() |>
  (\(x) tibble::tibble(word = names(x), score = x))() |>
  mutate(rank = row_number()) |>
  select(rank, word, score) |>
  kbl(
    digits = 4,
    col.names = c("#", "Word", "Score"),
    align = c("r", "l", "r")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, monospace = TRUE) |>
  column_spec(3, color = "#A32D2D", bold = TRUE) |>
  row_spec(0, color = "gray", font_size = 12)
| # | Word | Score |
|---|---|---|
| 1 | disuguaglianze | -0.3200 |
| 2 | diseguaglianze | -0.2978 |
| 3 | lavoratrici | -0.2576 |
| 4 | impugnano | -0.2406 |
| 5 | tutele | -0.2385 |
| 6 | disosso | -0.2384 |
| 7 | steer | -0.2384 |
| 8 | seghezzi | -0.2329 |
| 9 | lavorative | -0.2309 |
| 10 | socialmente | -0.2294 |
| 11 | precoci | -0.2213 |
| 12 | lavoratori | -0.2207 |
| 13 | disparità | -0.2179 |
| 14 | diseguaglianza | -0.2167 |
| 15 | separandoli | -0.2141 |
| 16 | ammortizzatori | -0.2141 |
| 17 | marostica | -0.2139 |
| 18 | salari | -0.2128 |
| 19 | intermittenti | -0.2120 |
| 20 | fragilità | -0.2109 |
Inspect this output carefully. Ask yourself:
- Do the top left-leaning words make theoretical sense?
- Are there surprising words near either pole? If so, why?
- Are there seed words among the top-scored words, or have other words “caught up” to them?
Visual exploration — plot the word scores with the seeds highlighted:
textplot_terms(lss_model, highlighted = names(seeds))
This plot shows word scores on the horizontal axis. Seeds are labelled; other words are shown as points. Words near the seeds should be thematically related — this is your main validity check.
5.7 Score documents
Once word scores are established, each document receives a score equal to the frequency-weighted mean of the scores of the words it contains:
\[\hat{\theta}_d = \frac{\sum_w \text{freq}(w, d) \cdot s_w}{\sum_w \text{freq}(w, d)}\]
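As a quick arithmetic check of the formula, the sketch below computes the score of a single document by hand. The word scores and counts are illustrative values chosen for the example, not output from the fitted model.

```r
# Illustrative word scores s_w and counts freq(w, d) for one document d
s_w  <- c(lavoratori = -0.22, salari = -0.21, imprese = 0.20, governo = 0.01)
freq <- c(lavoratori = 3,     salari = 1,     imprese = 1,    governo = 5)

# Frequency-weighted mean of the word scores
theta_d <- sum(freq * s_w) / sum(freq)
theta_d

# Same quantity using the built-in helper
weighted.mean(s_w, freq)
```

The document leans slightly to the negative (left) side: the three mentions of "lavoratori" outweigh the single mention of "imprese", while the frequent but near-neutral "governo" pulls the score towards zero.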
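We first extract the document-level variables stored in the model and use predict() to get the LSS score for each document. Because the model was fitted with group_data = TRUE, the sentences are grouped back into the original speeches, which is why we obtain 35,405 documents rather than 410,116 sentences. Note that predict() standardises the document scores by default (rescale = TRUE). We can then visualise the distribution of these scores across all speeches.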
dat <- docvars(lss_model$data)
dat$lss_score <- predict(lss_model)
print(nrow(dat))
[1] 35405
5.8 Party-level aggregation
We can estimate the left-right position of each party by averaging the scores of its speeches. These positions can also be tracked over time, and compared with the Wordscores and Wordfish estimates from the previous modules to see whether LSS captures a similar ideological dimension and how much it agrees with the other methods. We start with the party means:
dat |>
  group_by(party_name) |>
  summarise(mean_lss = mean(lss_score, na.rm = TRUE)) |>
  arrange(mean_lss)
# A tibble: 6 × 2
  party_name                             mean_lss
  <chr>                                     <dbl>
1 Movimento 5 Stelle                      -0.362
2 Partito Democratico                     -0.132
3 Italia Viva                              0.0198
4 Forza Italia – Il Popolo della Libertà   0.101
5 Fratelli d'Italia                        0.191
6 Lega                                     0.253
We can also produce a nicer table showing the party names, the mean LSS score, and the direction (left or right) implied by the sign of the mean.
# compute the party means once and reuse them for the table and the colours
party_means <- dat |>
  group_by(party_name) |>
  summarise(mean_lss = mean(lss_score, na.rm = TRUE)) |>
  arrange(mean_lss) |>
  mutate(
    rank = row_number(),
    direction = ifelse(mean_lss < 0, "LEFT", "RIGHT")
  )
# red for a negative (left) mean, blue for a positive (right) mean
pole_colours <- ifelse(party_means$mean_lss < 0, "#A32D2D", "#185FA5")
party_means |>
  select(rank, party_name, mean_lss, direction) |>
  kbl(
    digits = 4,
    col.names = c("#", "Party", "Mean LSS", "Pole"),
    align = c("r", "l", "r", "c")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) |>
  column_spec(1, color = "gray", width = "2em") |>
  column_spec(2, bold = TRUE) |>
  column_spec(3, color = pole_colours) |>
  column_spec(4, color = pole_colours) |>
  row_spec(0, color = "gray", font_size = 12)
| # | Party | Mean LSS | Pole |
|---|---|---|---|
| 1 | Movimento 5 Stelle | -0.3620 | LEFT |
| 2 | Partito Democratico | -0.1320 | LEFT |
| 3 | Italia Viva | 0.0198 | RIGHT |
| 4 | Forza Italia – Il Popolo della Libertà | 0.1011 | RIGHT |
| 5 | Fratelli d'Italia | 0.1913 | RIGHT |
| 6 | Lega | 0.2526 | RIGHT |
5.9 Reflection
Take a few minutes to think about what the LSX results tell us about Italian parliamentary politics — and about the method itself.
Substantive questions
- Are PD and FDI at the expected opposite poles of the left-right axis?
- Where does M5S fall? The Five Star Movement explicitly rejects the left-right label; does LSX place it in the centre, or does it lean in one direction in practice?
- Are the Lega scores consistent with what you know about its ideological evolution (from northern regionalism toward national right-wing populism)?
- Are there any parties whose position surprises you? What might explain the discrepancy?
Methodological questions
- How sensitive are the results to the seed word choice? Try replacing one or two seeds and refit — does the party ranking change substantially?
- The party-level LSX scores here fall roughly between −1 and +1, but the exact values depend on the seed words, the corpus, and the rescaling applied by predict(). The scale is relative, not absolute: you can compare parties to each other but not directly to Wordfish or Wordscores scores.
- What does within-party variance (the error bars in Section 5) tell us? Is it noise, or genuine heterogeneity in how different politicians within the same party talk?
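One simple way to quantify the seed-sensitivity question above is a rank correlation between the party positions from two fits. In the sketch below, `baseline` holds the party means reported in this section, while `alternative` contains HYPOTHETICAL numbers that merely stand in for a refit with a different seed set; they are not real results.

```r
# Party means reported in Section 5.8
baseline <- c(M5S = -0.362, PD = -0.132, IV = 0.0198,
              FI = 0.101, FdI = 0.191, Lega = 0.253)

# HYPOTHETICAL values standing in for a refit with alternative seeds
alternative <- c(M5S = -0.30, PD = -0.15, IV = 0.05,
                 FI = 0.02, FdI = 0.22, Lega = 0.20)

# Spearman's rho compares the orderings, not the raw values
rho <- cor(baseline, alternative[names(baseline)], method = "spearman")
rho

# Which parties change rank between the two fits?
rank_shift <- rank(alternative[names(baseline)]) - rank(baseline)
rank_shift[rank_shift != 0]
```

A rho close to 1 suggests the ordering is robust to the seed choice; the rank shifts show which parties are responsible when it is not. Because the scale is relative, comparing ranks is safer than comparing the raw scores across fits.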
Key takeaway: LSX gives us a principled, replicable, theoretically-grounded way to measure a specific ideological dimension from text, with minimal manual annotation. Its main assumption is that your seed words are valid operationalisations of the dimension you care about. That assumption should always be examined, not taken for granted, as the scores of the non-seed words are derived from the seed words.