‘Layer su Layer’ - CMCL 2026 paper · PCA visualization explorer

PCA visualizations of BERT's contextual embeddings

This website accompanies the CMCL 2026 paper “Layer su Layer: Identifying and Disambiguating the Italian NPN Construction in BERT’s family” and provides an interactive visualization of the PCA projections of contextual embeddings extracted from BERT. The aim is to qualitatively explore whether construction-relevant distinctions emerge in the representation space.

The study focuses on the Italian NPN (noun–preposition–noun) constructional family and investigates whether contextual embeddings encode information relevant to both construction identification and semantic disambiguation. Each point in the plots corresponds to an instance of an NPN construction or a distractor, depending on the experimental condition, while colors reflect constructional or semantic labels.

These visualizations are intended as a qualitative complement to probing-based evaluation: rather than providing direct evidence of linguistic knowledge, they offer an exploratory perspective on how embeddings are geometrically organized across layers and embedding types, and whether such organization aligns with linguistically motivated distinctions.

Italian NPN Constructions in Contextual Space


How to read the visualizations:

  • Points: individual dataset instances.
  • Colors: class labels or semantic labels.
  • Frames / layers: successive hidden layers of the BERT model.
  • Interpretation: visible clustering can suggest distinctions in representational space, but PCA remains a partial projection and should be read together with probing results.
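As a concrete illustration of what each frame shows, here is a minimal sketch of the 2-D PCA projection for a single layer, implemented with NumPy's SVD. The array `layer_embeddings` is a hypothetical stand-in for one layer's hidden states (the actual embeddings in the paper come from BERT); only the projection logic is shown.

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project (n_instances, hidden_size) embeddings onto their
    first two principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape: (n_instances, 2)

# Hypothetical example: 240 instances with 768-dim hidden states,
# matching the dimensionality of a BERT-base layer.
rng = np.random.default_rng(0)
layer_embeddings = rng.normal(size=(240, 768))
points = pca_2d(layer_embeddings)  # one (x, y) point per instance
```

Each row of `points` becomes one plotted dot; repeating this per layer yields the successive frames, with colors assigned from the class or semantic labels.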

[UNK] · Identification across NPN Cxns and Distractors

This visualization shows PCA projections for the construction identification task based on [UNK] representations. By replacing the prepositional slot with [UNK], the setup reduces direct lexical information and highlights whether constructional distinctions can still emerge in the embedding space.

Dataset composition: 240 instances of NPN Cxns and distractors. Embedding: [UNK]

Why [UNK] in the PREP slot?

The [UNK] strategy masks the lexical identity of the preposition, making it possible to test whether the model captures constructional information beyond specific lexical cues.
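One way to sketch this masking step is a small helper that replaces the prepositional slot at the token level before the sentence is fed to the model. The `mask_preposition` function and the example instance below are illustrative (the paper's actual preprocessing may differ):

```python
def mask_preposition(tokens: list[str], prep_index: int) -> list[str]:
    """Return a copy of `tokens` with the preposition replaced by [UNK],
    so the model sees the construction frame without the lexical cue."""
    masked = tokens.copy()
    masked[prep_index] = "[UNK]"
    return masked

# Hypothetical NPN instance: "passo dopo passo" ('step after step').
tokens = ["passo", "dopo", "passo"]
print(mask_preposition(tokens, 1))  # → ['passo', '[UNK]', 'passo']
```

The masked sequence is then encoded by BERT, and the contextual embedding of the [UNK] token is what gets projected with PCA in the plots above.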

What to inspect

Check whether positive and negative instances occupy increasingly distinct regions as layer depth increases, and whether the within-class organization becomes more coherent in the later layers.

How it relates to probing

These plots provide a qualitative complement to classifier-based probing: they help assess whether distinctions that are recoverable quantitatively also have visible geometric correlates in the representation space.
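For readers unfamiliar with the quantitative side, a probing classifier in this sense is simply a lightweight classifier trained on frozen embeddings. A minimal sketch with a scikit-learn linear probe, using synthetic stand-in data rather than the paper's actual dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for frozen layer embeddings: two classes whose
# means differ along a few dimensions, loosely mimicking a layer in
# which the NPN/distractor distinction is linearly recoverable.
rng = np.random.default_rng(0)
n, dim = 240, 768
labels = rng.integers(0, 2, size=n)
embeddings = rng.normal(size=(n, dim))
embeddings[labels == 1, :8] += 2.5  # separable signal on 8 dims

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.25, random_state=0)

# A linear probe: if it classifies held-out instances well, the
# distinction is linearly decodable from this layer's representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
```

High probe accuracy says a distinction is *decodable*; the PCA plots ask the complementary question of whether it is also *visible* in the leading directions of variance, which a linear probe does not require.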

Loaded file: unk_identification.html