Corpus linguistics toolkit for Soviet census methodology texts (1897-1989) — VLM extraction, pre-reform orthography, diachronic register analysis.
The Soviet census, together with the 1897 imperial census that preceded it, shaped how ethnic identities were categorized across the USSR, but the source material is locked in scanned volumes. I built a vision-language model (VLM) extraction pipeline and am using corpus linguistics methods to run a diachronic analysis of how census language, methodology, and ethnic classifications shifted from 1897 to 1989. The output is both a structured dataset and a tool for studying how the state encoded identity.
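To make the orthography and diachronic steps concrete, here is a minimal sketch of what they might look like. It is not the repository's actual code. The function names (`normalize_orthography`, `tokenize`, `relative_frequency`) and the two sample strings are hypothetical, and the normalization covers only letter-level changes from the 1918 reform (ѣ, і, ѳ, ѵ, word-final ъ), not changes to endings such as -аго.

```python
import re
from collections import Counter

# Assumption: a letter-level mapping for characters dropped by the 1918 reform.
_LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# Hard sign at the end of a word (kept inside words, e.g. "объявление").
_FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def normalize_orthography(text: str) -> str:
    """Map pre-reform letters to modern ones and drop word-final hard signs."""
    return _FINAL_HARD_SIGN.sub("", text.translate(_LETTER_MAP))


def tokenize(text: str) -> list[str]:
    """Normalize, lowercase, and split into runs of Cyrillic letters."""
    return re.findall(r"[а-яё]+", normalize_orthography(text).lower())


def relative_frequency(docs_by_year: dict[int, str], term: str) -> dict[int, float]:
    """Share of tokens equal to `term` in each year's text (0.0 if empty)."""
    result = {}
    for year, text in docs_by_year.items():
        tokens = tokenize(text)
        counts = Counter(tokens)
        result[year] = counts[term] / len(tokens) if tokens else 0.0
    return result


# Hypothetical sample strings, invented for illustration only.
sample = {
    1897: "Родной языкъ и вѣроисповѣданіе",
    1926: "Народность и родной язык",
}
freq_by_year = relative_frequency(sample, "народность")
```

Normalizing before counting means a term spelled with pre-reform letters in 1897 and with modern letters in later censuses is tallied as the same token, which is what makes the per-year frequencies comparable.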