Technical · ★ Best Work · Active Development

Soviet-GISMA Thesis

Corpus linguistics toolkit for Soviet census methodology texts (1897-1989): vision-language model (VLM) extraction, pre-reform orthography, diachronic register analysis.

Why This Matters

The Soviet census shaped how ethnic identities were categorized across the USSR, but the source material is trapped in scanned volumes. I built a VLM extraction pipeline and am using corpus linguistics methods to perform a diachronic analysis of how census language, methodology, and ethnic classifications shifted from 1897 to 1989. The output is both a structured dataset and a tool for studying how the state encoded identity.
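To make the pre-reform orthography problem concrete: text from the 1897 material uses letters that the 1918 spelling reform abolished, so the early and late ends of the corpus cannot be compared token for token without some normalization. The sketch below is illustrative only and is not taken from the private repository. The function name `normalize_prereform` is hypothetical, and it covers just the letter substitutions and the word-final hard sign, not the reform's changes to grammatical endings.

```python
import re

# Letters abolished by the 1918 reform, mapped to their modern equivalents.
# (Illustrative subset: grammatical endings such as -аго/-яго are not handled.)
_LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",   # yat
    "і": "и", "І": "И",   # decimal i
    "ѳ": "ф", "Ѳ": "Ф",   # fita
    "ѵ": "и", "Ѵ": "И",   # izhitsa
})

# A hard sign at the very end of a word, directly after a Cyrillic letter.
# A hard sign inside a word (as in "объявление") is left untouched.
_FINAL_HARD_SIGN = re.compile(r"(?<=[а-яё])ъ\b", re.IGNORECASE)


def normalize_prereform(text: str) -> str:
    """Map pre-1918 spellings to modern ones (hypothetical helper)."""
    text = text.translate(_LETTER_MAP)
    return _FINAL_HARD_SIGN.sub("", text)


if __name__ == "__main__":
    for word in ["языкъ", "вѣроисповѣданіе", "объявленіе"]:
        print(word, "->", normalize_prereform(word))
```

A real pipeline would likely keep the original spelling alongside the normalized form, since the orthography itself is evidence for dating and register analysis.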

GitHub Repository

This repository is private, and access is restricted to authorized collaborators.

Technology Stack

Python 3.11, Qwen2.5-VL, Gemini, FastAPI, PostGIS, Docker, NVIDIA CUDA

Details

Institution

Georgetown University
