usage: extract.py [-h] [-v] [-i] [-u] [-s]
extract words from ALTO
optional arguments:
-h, --help show this help message and exit
-v, --verbose increase output verbosity
-i, --initial use inital revision (0) instead of latest
-u, --unlocked include unlocked documents
-s, --sanitise strip non-unicode characters from words
The `extract` tool extracts words from the ALTO XML files in the Revizor collection and stores them in an SQLite3 database. This enables one to use SQL queries to list and research the data in the collection.
Words are stored in tables based on their language ISO code. There is also a reference table with the document UUIDs and revisions from which the words have been extracted.
This document shows some basic SQL commands that can be used to extract some interesting information about the words contained in Revizor's collection.
$ sqlite3 words.db .tables
documents kca mhr myv udm
ers krl mns ru vep
izh mdf mrj sms yrk
$ sqlite3 words.db 'select id, uuid from documents'
1|04f6871364fd8e62948c9d5cc0955877
2|050c71e6dad2771ed929f4008ef947db
3|050c71dbd1bc8b662f59ecb649f6e5fb
4|050c847b2c83f8073c7a77877e0f7f60
5|050c8562a5ff5fa6d8ad424309f1c903
6|050c85a9b904cab694c7361c97a85a83
7|05153fa2559a45ca85115ee9d45cc07d
8|050c70aa8fd8d260566d47603b767fa9
9|05158b8478590725a226b4c16667217f
10|050c993afa20952471e3e318953f9f8d
11|04f6aaf6cd24634531fbebb9e1ffad2b
12|050e538c520abcc0ffab8e4a2579a036
13|04fed7e6920d812a37f599d081c4625e
14|04f68744f0db5701992bf631bdcc8dd6
15|050e3f89d3c25e174c0f42b881a4b309
16|050c850eb85f5637f21c0ad0aa7a858f
17|050c852d0eeee43bca5057f9db3f71bc
18|0514da047b036030ae39696bfb3b9add
19|050c98a97b7dc3cf9a3acbf6a6341b78
20|050d110633adba925debdd376c1f4527
21|050c9d9a1660080afa237df1c0c6f93d
22|050d2a9da1e416cbe19165bbc9269a45
23|0514eab63a76fca4bd1d90bf006a1e56
24|0519141c49d2eb829ac014eaf92b2acd
25|0517ab1da51fbc02e24bd85560265fa9
26|050e1c27194536d5bd21f1b4e18b7657
27|050d271063b71a2670f2644947b9248a
28|050c988621f10721bd4e7fb55c9f541d
29|0514da819e4bd4749084f9ba3fd96764
30|050e5187c8b30e9a6e8f14ae9dbf05a8
31|0514edf6396350fda8ff988716b5720e
32|0514da9a9ab45e3419f9b85d47a8be3f
33|05154ebc839d991fb086871e5c81421a
34|050e3c96fd4eeeecb8789275680f3b0d
35|050e2edd50fbc84b52e1651f3bf90b1a
36|04feae030dc1d6e37be4faf7a9601254
37|0514da37b3b21d7426d446be7107f27a
38|05153f9d7acea519f8a24603040cf435
39|050c8498cf6a37fed9da2bc66e491f9d
40|050c8409af10e887db8251e768812b5c
41|050c85d2a8fd497e72a30664582e20d0
42|04f737949a50c1c56dc48290aaa505f7
43|04f6aabe7afd1da6bb9d06227b714fbc
44|05153f8c38f231e333b9a807c3153ab8
45|04fed7d20951465770a628df4bd8e103
46|04feda835582ed231e793b6ee80ab897
47|04f699d5226f1878bbcd109cc1dce6e5
48|050e3f7da85e7a86c179901642c37c6c
49|05158bab9455e576eb7a494ce09d8d68
50|05153b55f1f96eb5de595e11a9927f18
51|050c88a39e9c7c3d0ad2089c159975aa
52|05153face44dd9b9774743e28eaff8ef
53|05153fb1357e09cfe8ea2432e2b06baf
54|05153fa8085ee69c4ec421a9ec62e329
55|0514dab6026dd5d14cd818a37d48dcae
56|050e65df183b0d35bdd069b8c6a9053c
57|04f7d781e216c66e4eccf5642455555d
58|04f79c146ce053c636e1e6e52a3f4666
59|05158e6a9a1b8342f95bbae0d4b99bfc
Show which revision of a document has been used to generate its word list:
$ sqlite3 words.db "select * from documents where uuid = '04f6871364fd8e62948c9d5cc0955877'"
1|04f6871364fd8e62948c9d5cc0955877|61
List of 20 first `izh` words (remove the limit):
$ sqlite3 words.db 'select word from izh limit 20'
iwfcc
korotajev
mitälee
mörni
pirnvozast
laittaa
miä
kaik
i
graafikast
hänt
en
jaksa
löytää
vot
vet
kons
miun
selän
takkaan
If you want alphabetical order and de-duplicate the words (every occurrence is inserted with reference, so lots of words have duplicate):
sqlite3 words.db 'select distinct(word) from izh order by word limit 20'
You can count the number of words for each dictionary:
$ sqlite3 words.db 'select count(word) from izh'
99946
Without duplicates:
$ sqlite3 words.db 'select count(distinct(word)) from izh'
23103
The most common words for works in `izh`:
$ sqlite3 words.db 'select word, count(word) from izh group by word order by count(word) desc limit 30'
i|2482
ja|2318
ei|1335
a|1151
hää|1133
ono|805
oli|708
jot|706
miä|611
ku|590
vaa|497
se|463
mitä|456
hänen|455
kaik|446
kuin|437
jo|412
kera|380
niku|378
tuli|377
siä|374
on|370
päälle|358
niin|355
no|348
siis|346
noisi|338
hänt|302
mäni|297
pois|288
The 30 most common words longer than 6 in `izh`:
$ sqlite3 words.db 'select word from izh where length(word) > 6 group by word order by count(word) desc limit 30'
hänelle
hyppäis
äkkiistää
kaikkinee
peremees
vastais
kysymykset
peremiis
talonpoika
poikain
mörähti
volodja
näyttiis
näyttää
enemmän
inmihiin
kastanka
kiljahti
inmihiset
maksima
kaştanka
takkaan
jooksoo
katsomaa
läkkäis
dispetcera
duumais
mitälee
männööt
alkoivat
Give all words for the `izh` language that start with "help" and show document UUID and page reference:
$ sqlite3 words.db 'select word, uuid, page from izh inner join documents on (doc = id) where word like "help%" order by word'
helpomp|04f6871364fd8e62948c9d5cc0955877|17
helpost|04f6aaf6cd24634531fbebb9e1ffad2b|44
helppo|04f699d5226f1878bbcd109cc1dce6e5|125
helppo|04f699d5226f1878bbcd109cc1dce6e5|78
Search for all occurrences of `seläst`, ordered by document and page:
$ sqlite3 words.db 'select word, uuid, page from izh inner join documents on (doc = id) where word = "seläst" order by uuid, page'
seläst|04f699d5226f1878bbcd109cc1dce6e5|34
seläst|04f699d5226f1878bbcd109cc1dce6e5|69
seläst|04f699d5226f1878bbcd109cc1dce6e5|87
seläst|04f699d5226f1878bbcd109cc1dce6e5|100
seläst|04f6aaf6cd24634531fbebb9e1ffad2b|14
seläst|04f6aaf6cd24634531fbebb9e1ffad2b|80
seläst|04f6aaf6cd24634531fbebb9e1ffad2b|100
Show the total number of words per document:
$ sqlite3 words.db 'select uuid, count(word) as total from documents inner join izh on (id = doc) group by doc order by total desc'
04f699d5226f1878bbcd109cc1dce6e5|28285
04f68744f0db5701992bf631bdcc8dd6|26876
04f6aaf6cd24634531fbebb9e1ffad2b|19772
04f6aabe7afd1da6bb9d06227b714fbc|17863
04f6871364fd8e62948c9d5cc0955877|6430
04feae030dc1d6e37be4faf7a9601254|427
04fed7d20951465770a628df4bd8e103|291
04feda835582ed231e793b6ee80ab897|2
Average number of words per document for a given language table:
$ sqlite3 words.db 'select avg(total) from (select count(word) as total, doc from izh group by doc)'
12493.25
Specifically exclude documents:
$ sqlite3 words.db 'select distinct(word) from izh inner join documents on (doc = id) where uuid not in ("04f699d5226f1878bbcd109cc1dce6e5", "04f6aaf6cd24634531fbebb9e1ffad2b") order by word'
a
aa
aaaaa
aaaaaaa
aadrestia
aallolle
aallon
aallot
aalto
aaltois
[...]
Include only certain documents:
$ sqlite3 words.db 'select distinct(word) from izh inner join documents on (doc = id) where uuid in ("04f699d5226f1878bbcd109cc1dce6e5", "04f6aaf6cd24634531fbebb9e1ffad2b") order by word'
a
aaa
aadressin
aadresti
aagentta
aallot
aalto
aaltojaa
aastaika
aastaikaa
[...]
Etc...