extract.py

usage: extract.py [-h] [-v] [-i] [-u] [-s]

extract words from ALTO

optional arguments:

-h, --help show this help message and exit

-v, --verbose increase output verbosity

-i, --initial use inital revision (0) instead of latest

-u, --unlocked include unlocked documents

-s, --sanitise strip non-unicode characters from words

The `extract` tool extracts words from the ALTO XML files in the Revizor collection and stores them in an SQLite3 database. This enables one to use SQL queries to list and research the data in the collection.

Words are stored in tables based on their language ISO code. There is also a reference table with the document UUIDs and revisions from which the words have been extracted.

This document shows some basic SQL commands that can be used to extract some interesting information about the words contained in Revizor's collection.

$ sqlite3 words.db .tables

documents kca mhr myv udm

ers krl mns ru vep

izh mdf mrj sms yrk

$ sqlite3 words.db 'select id, uuid from documents'

1|04f6871364fd8e62948c9d5cc0955877

2|050c71e6dad2771ed929f4008ef947db

3|050c71dbd1bc8b662f59ecb649f6e5fb

4|050c847b2c83f8073c7a77877e0f7f60

5|050c8562a5ff5fa6d8ad424309f1c903

6|050c85a9b904cab694c7361c97a85a83

7|05153fa2559a45ca85115ee9d45cc07d

8|050c70aa8fd8d260566d47603b767fa9

9|05158b8478590725a226b4c16667217f

10|050c993afa20952471e3e318953f9f8d

11|04f6aaf6cd24634531fbebb9e1ffad2b

12|050e538c520abcc0ffab8e4a2579a036

13|04fed7e6920d812a37f599d081c4625e

14|04f68744f0db5701992bf631bdcc8dd6

15|050e3f89d3c25e174c0f42b881a4b309

16|050c850eb85f5637f21c0ad0aa7a858f

17|050c852d0eeee43bca5057f9db3f71bc

18|0514da047b036030ae39696bfb3b9add

19|050c98a97b7dc3cf9a3acbf6a6341b78

20|050d110633adba925debdd376c1f4527

21|050c9d9a1660080afa237df1c0c6f93d

22|050d2a9da1e416cbe19165bbc9269a45

23|0514eab63a76fca4bd1d90bf006a1e56

24|0519141c49d2eb829ac014eaf92b2acd

25|0517ab1da51fbc02e24bd85560265fa9

26|050e1c27194536d5bd21f1b4e18b7657

27|050d271063b71a2670f2644947b9248a

28|050c988621f10721bd4e7fb55c9f541d

29|0514da819e4bd4749084f9ba3fd96764

30|050e5187c8b30e9a6e8f14ae9dbf05a8

31|0514edf6396350fda8ff988716b5720e

32|0514da9a9ab45e3419f9b85d47a8be3f

33|05154ebc839d991fb086871e5c81421a

34|050e3c96fd4eeeecb8789275680f3b0d

35|050e2edd50fbc84b52e1651f3bf90b1a

36|04feae030dc1d6e37be4faf7a9601254

37|0514da37b3b21d7426d446be7107f27a

38|05153f9d7acea519f8a24603040cf435

39|050c8498cf6a37fed9da2bc66e491f9d

40|050c8409af10e887db8251e768812b5c

41|050c85d2a8fd497e72a30664582e20d0

42|04f737949a50c1c56dc48290aaa505f7

43|04f6aabe7afd1da6bb9d06227b714fbc

44|05153f8c38f231e333b9a807c3153ab8

45|04fed7d20951465770a628df4bd8e103

46|04feda835582ed231e793b6ee80ab897

47|04f699d5226f1878bbcd109cc1dce6e5

48|050e3f7da85e7a86c179901642c37c6c

49|05158bab9455e576eb7a494ce09d8d68

50|05153b55f1f96eb5de595e11a9927f18

51|050c88a39e9c7c3d0ad2089c159975aa

52|05153face44dd9b9774743e28eaff8ef

53|05153fb1357e09cfe8ea2432e2b06baf

54|05153fa8085ee69c4ec421a9ec62e329

55|0514dab6026dd5d14cd818a37d48dcae

56|050e65df183b0d35bdd069b8c6a9053c

57|04f7d781e216c66e4eccf5642455555d

58|04f79c146ce053c636e1e6e52a3f4666

59|05158e6a9a1b8342f95bbae0d4b99bfc

Show which revision of a document has been used to generate its word list:

$ sqlite3 words.db "select * from documents where uuid = '04f6871364fd8e62948c9d5cc0955877'"

1|04f6871364fd8e62948c9d5cc0955877|61

List of 20 first `izh` words (remove the limit):

$ sqlite3 words.db 'select word from izh limit 20'

iwfcc

korotajev

mitälee

mörni

pirnvozast

laittaa

miä

kaik

i

graafikast

hänt

en

jaksa

löytää

vot

vet

kons

miun

selän

takkaan

If you want alphabetical order and de-duplicate the words (every occurrence is inserted with reference, so lots of words have duplicate):

sqlite3 words.db 'select distinct(word) from izh order by word limit 20'

You can count the number of words for each dictionary:

$ sqlite3 words.db 'select count(word) from izh'

99946

Without duplicates:

$ sqlite3 words.db 'select count(distinct(word)) from izh'

23103

The most common words for works in `izh`:

$ sqlite3 words.db 'select word, count(word) from izh group by word order by count(word) desc limit 30'

i|2482

ja|2318

ei|1335

a|1151

hää|1133

ono|805

oli|708

jot|706

miä|611

ku|590

vaa|497

se|463

mitä|456

hänen|455

kaik|446

kuin|437

jo|412

kera|380

niku|378

tuli|377

siä|374

on|370

päälle|358

niin|355

no|348

siis|346

noisi|338

hänt|302

mäni|297

pois|288

The 30 most common words longer than 6 in `izh`:

$ sqlite3 words.db 'select word from izh where length(word) > 6 group by word order by count(word) desc limit 30'

hänelle

hyppäis

äkkiistää

kaikkinee

peremees

vastais

kysymykset

peremiis

talonpoika

poikain

mörähti

volodja

näyttiis

näyttää

enemmän

inmihiin

kastanka

kiljahti

inmihiset

maksima

kaştanka

takkaan

jooksoo

katsomaa

läkkäis

dispetcera

duumais

mitälee

männööt

alkoivat

Give all words for the `izh` language that start with "help" and show document UUID and page reference:

$ sqlite3 words.db 'select word, uuid, page from izh inner join documents on (doc = id) where word like "help%" order by word'

helpomp|04f6871364fd8e62948c9d5cc0955877|17

helpost|04f6aaf6cd24634531fbebb9e1ffad2b|44

helppo|04f699d5226f1878bbcd109cc1dce6e5|125

helppo|04f699d5226f1878bbcd109cc1dce6e5|78

Search for all occurrences of `seläst`, ordered by document and page:

$ sqlite3 words.db 'select word, uuid, page from izh inner join documents on (doc = id) where word = "seläst" order by uuid, page'

seläst|04f699d5226f1878bbcd109cc1dce6e5|34

seläst|04f699d5226f1878bbcd109cc1dce6e5|69

seläst|04f699d5226f1878bbcd109cc1dce6e5|87

seläst|04f699d5226f1878bbcd109cc1dce6e5|100

seläst|04f6aaf6cd24634531fbebb9e1ffad2b|14

seläst|04f6aaf6cd24634531fbebb9e1ffad2b|80

seläst|04f6aaf6cd24634531fbebb9e1ffad2b|100

Show the total number of words per document:

$ sqlite3 words.db 'select uuid, count(word) as total from documents inner join izh on (id = doc) group by doc order by total desc'

04f699d5226f1878bbcd109cc1dce6e5|28285

04f68744f0db5701992bf631bdcc8dd6|26876

04f6aaf6cd24634531fbebb9e1ffad2b|19772

04f6aabe7afd1da6bb9d06227b714fbc|17863

04f6871364fd8e62948c9d5cc0955877|6430

04feae030dc1d6e37be4faf7a9601254|427

04fed7d20951465770a628df4bd8e103|291

04feda835582ed231e793b6ee80ab897|2

Average number of words per document for a given language table:

$ sqlite3 words.db 'select avg(total) from (select count(word) as total, doc from izh group by doc)'

12493.25

Specifically exclude documents:

$ sqlite3 words.db 'select distinct(word) from izh inner join documents on (doc = id) where uuid not in ("04f699d5226f1878bbcd109cc1dce6e5", "04f6aaf6cd24634531fbebb9e1ffad2b") order by word'

a

aa

aaaaa

aaaaaaa

aadrestia

aallolle

aallon

aallot

aalto

aaltois

[...]

Include only certain documents:

$ sqlite3 words.db 'select distinct(word) from izh inner join documents on (doc = id) where uuid in ("04f699d5226f1878bbcd109cc1dce6e5", "04f6aaf6cd24634531fbebb9e1ffad2b") order by word'

a

aaa

aadressin

aadresti

aagentta

aallot

aalto

aaltojaa

aastaika

aastaikaa

[...]

Etc...

Content

Space Tools