Corpus of digitized Soviet census volumes (1897–1989) built for diachronic analysis of ethnic classification and census methodology. Uses a VLM pipeline to extract text from ~100GB of scanned archival material. Tracks shifts in taxonomic categories (such as the move from народность to национальность) and changes in the словари национальностей (dictionaries of nationalities) across census years.
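One way to picture the comparison of category lists between census years is a plain set difference. The sketch below is only illustrative: the function name `diff_categories`, the `lists_by_year` structure, and the `category_*` entries are hypothetical placeholders, not the project's actual code or real census categories.

```python
def diff_categories(earlier, later):
    """Compare two category lists and report what was added, dropped, or kept."""
    earlier_set, later_set = set(earlier), set(later)
    return {
        "added": sorted(later_set - earlier_set),
        "dropped": sorted(earlier_set - later_set),
        "kept": sorted(earlier_set & later_set),
    }


# Placeholder lists keyed by census year; the entries are invented labels,
# standing in for names that would come from the extracted dictionaries.
lists_by_year = {
    1926: ["category_a", "category_b", "category_c"],
    1939: ["category_a", "category_c", "category_d"],
}

changes = diff_categories(lists_by_year[1926], lists_by_year[1939])
print(changes)
```

A real comparison would also need to handle renamed or merged categories, which exact string matching cannot detect.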
The Soviet census shaped how ethnic identities were categorized across the USSR, but the source material is trapped in scanned volumes. I built a VLM extraction pipeline and am using corpus linguistics methods to perform a diachronic analysis of how census language, methodology, and ethnic classifications shifted from 1897 to 1989. The output is both a structured dataset and a tool for studying how the state encoded identity.
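As a rough illustration of the diachronic side, the sketch below measures what share of tokens in each year's text begins with a given stem. Everything here is an assumption made for the example: the `stem_share` helper, the stem list, and the two toy strings are invented, and prefix matching is a crude stand-in for proper lemmatization of inflected Russian forms.

```python
import re

# Stems rather than full words, so inflected forms (e.g. "народности") also match.
STEMS = ["народност", "национальност"]


def stem_share(text, stems):
    """Return, for each stem, the fraction of tokens in text that start with it."""
    tokens = re.findall(r"\w+", text.lower())
    total = len(tokens)
    shares = {}
    for stem in stems:
        hits = sum(1 for tok in tokens if tok.startswith(stem))
        shares[stem] = hits / total if total else 0.0
    return shares


# Invented toy strings, NOT quotations from any census volume.
toy_corpus = {
    1926: "народность населения по народности",
    1939: "национальность населения по национальности",
}

for year in sorted(toy_corpus):
    print(year, stem_share(toy_corpus[year], STEMS))
```

On real extracted text, normalizing by total token count keeps years with very different volume sizes comparable.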
VLM pipeline for digitizing historical Soviet census tables. Extracts demographic data using Qwen2.5-VL and Gemini.