20 changes: 20 additions & 0 deletions Chinese/generated_pos_tagset.tsv
@@ -0,0 +1,20 @@
POS Count
noun 47450
num 900
pron 356
pnoun 8904
adv 5971
verb 31412
det 411
loc 371
conj 428
msr 763
part 3778
adj 9494
prep 1614
intj 272
mark 192
idiom 10
fw 15
ono 1
punc 144
12 changes: 12 additions & 0 deletions Czech/generated_pos_tagset.tsv
@@ -0,0 +1,12 @@
POS Count
j 71
v 6050
r 66
p 54
d 1774
c 58
n 13057
a 6670
t 67
i 61
x 233
12 changes: 12 additions & 0 deletions Dutch/generated_pos_tagset.tsv
@@ -0,0 +1,12 @@
POS Count
adj 769
noun 2061
verb 1000
conj 33
interj 21
prep 33
adv 208
num 39
pron 53
intj 1
art 2
1 change: 0 additions & 1 deletion Dutch/simplified-pos-tagset-dut.txt
@@ -10,6 +10,5 @@ conj conjuction
art article
pron pronoun
punc puctuation
sent sentence marker


16 changes: 16 additions & 0 deletions Finnish/generated_pos_tagset.tsv
@@ -0,0 +1,16 @@
POS Count
code 15
abbrev 409
proper 7931
noun 26666
comppart 608
adjective 3395
verb 3365
interjection 86
adverb 3243
preposition 224
conjunction 116
numeral 91
pronoun 73
noun 3
adverb 1
11 changes: 11 additions & 0 deletions French/generated_pos_tagset.tsv
@@ -0,0 +1,11 @@
POS Count
prep 60
noun 1633
adv 147
verb 448
adj 264
det 86
pron 56
conj 20
null 2
intj 8
15 changes: 15 additions & 0 deletions Italian/generated_pos_tagset.tsv
@@ -0,0 +1,15 @@
POS Count
abr 114
noun 22272
verb 11056
null 12
pnoun 417
adj 8153
num 59
prep 3130
conj 181
adv 1405
pron 366
intj 73
art 659
punc 2
13 changes: 13 additions & 0 deletions Portuguese/generated_pos_tagset.tsv
@@ -0,0 +1,13 @@
POS Count
num 57
pnoun 320
adj 3428
noun 9291
verb 3096
prep 1131
adv 681
intj 38
pron 132
conj 83
det 289
punc 28
27 changes: 27 additions & 0 deletions README.md
@@ -340,6 +340,33 @@ Number of unique values in lexicon file 2 7637
Number of unique values in common between the two files:3169
```

### Create POS Tagsets

The `create_pos_tagsets.py` script generates a POS tagset for each language listed in the [./language_resources.json](./language_resources.json) meta data file, which is explained in the [USAS Lexicon Meta Data section](#usas-lexicon-meta-data). Each tagset is built from the POS tags used within that language's single word and MWE semantic lexicon files, and is saved in the language's folder as `generated_pos_tagset.tsv`. Each generated tagset has two fields: `POS`, the POS tag, and `Count`, the number of times that tag is used within the language's lexicon file(s). An example of a generated POS tagset, taken from the Welsh language folder, is shown below:

``` tsv
POS Count
verb 130197
adv 123
art 7
conj 87
pron 67
prep 293
noun 4358
pnoun 6572
adj 1542
fw 40
num 36
intj 6
* 2
```
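
As an illustration of how a generated tagset might be consumed, the sketch below reads a `POS`/`Count` table and lists the tags from most to least frequent. It assumes the file is tab-separated with the header row shown above; the function name `read_pos_tagset` is hypothetical and is not part of this repository. Tags are stripped of surrounding whitespace and their counts summed, so a tag that appears on more than one row is reported once.

``` python
import csv
import io
from typing import Dict, TextIO


def read_pos_tagset(handle: TextIO) -> Dict[str, int]:
    """Read a POS/Count table (assumed tab-separated) into a dict.

    Hypothetical helper for illustration only. Tags are stripped of
    surrounding whitespace and counts for repeated tags are summed.
    """
    counts: Dict[str, int] = {}
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Skip malformed rows that are missing either field.
        if row.get("POS") is None or row.get("Count") is None:
            continue
        tag = row["POS"].strip()
        counts[tag] = counts.get(tag, 0) + int(row["Count"])
    return counts


# A few rows copied from the Welsh example above, held in memory so the
# sketch runs without needing the repository on disk.
sample = "POS\tCount\nverb\t130197\nadv\t123\n*\t2\n"
counts = read_pos_tagset(io.StringIO(sample))

# List the tags from most to least frequent.
for tag, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}\t{count}")
```

To read a real file, pass an open file handle instead, for example `open("Welsh/generated_pos_tagset.tsv", encoding="utf-8", newline="")`.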

To run this script:

``` bash
python create_pos_tagsets.py
```
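
The counting step described above can be sketched roughly as follows. This is not the code in `create_pos_tagsets.py`; it is a minimal illustration that assumes each lexicon file is tab-separated with a header row containing a `pos` column, and it does not attempt to extract POS tags from MWE templates. The function names and the sample entries are hypothetical.

``` python
import csv
import io
from collections import Counter
from typing import Iterable, TextIO


def count_pos_tags(lexicons: Iterable[TextIO], pos_field: str = "pos") -> Counter:
    """Count how often each POS tag occurs across lexicon files.

    Assumes tab-separated files with a header row; `pos_field` is the
    assumed name of the POS column. Tags are counted exactly as written.
    """
    counts: Counter = Counter()
    for handle in lexicons:
        for row in csv.DictReader(handle, delimiter="\t"):
            tag = row.get(pos_field)
            if tag is None:
                # File has no POS column, or the row is too short.
                continue
            counts[tag] += 1
    return counts


def write_pos_tagset(counts: Counter, handle: TextIO) -> None:
    """Write the counts as a two-column POS/Count table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["POS", "Count"])
    for tag, count in counts.items():
        writer.writerow([tag, count])


# Made-up lexicon entries, used only to exercise the sketch.
lexicon = io.StringIO(
    "lemma\tpos\tsemantic_tags\n"
    "dog\tnoun\tL2\n"
    "run\tverb\tM1\n"
    "cat\tnoun\tL2\n"
)
tag_counts = count_pos_tags([lexicon])

output = io.StringIO()
write_pos_tagset(tag_counts, output)
print(output.getvalue())
```

Counting tags as written, without normalisation, keeps the output a faithful record of what the lexicon files contain, which makes spelling variants and stray tags easy to spot.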

### Python Requirements

This has been tested with Python >= `3.7`. To install the relevant Python requirements:
19 changes: 19 additions & 0 deletions Russian/generated_pos_tagset.tsv
@@ -0,0 +1,19 @@
POS Count
fw 6
intj 13
s 21003
a 1931
adv 171
conj 17
v 1701
part 26
pr 39
a-pro 26
com 4
parenth 14
a-num 37
num 41
praedic 17
adv-pro 30
s-pro 17
* 1772
48 changes: 48 additions & 0 deletions Spanish/generated_pos_tagset.tsv
@@ -0,0 +1,48 @@
POS Count
noun 2217
prep 95
num/noun 52
pnoun 19993
adv 238
adj 947
verb 717
conj 37
intj 3
prep+art 2
adj/noun 36
pron 72
verb/adj 2
num 22
adj/num/noun 13
adj/nun 1
num/adj 16
adj/num 10
adv/conj 2
verb/adj/noun 1
art 67
verbo 1
abbr 1
abbr/noun 1
prefix 1
noun/adj 15
verb + pron 1
det 1
noun/adv 1
adv/adj 1
adj/adv 1
adj/pron 7
verb/noun 6
con/adj/pron 1
noun/verb 2
adj/nm 1
adv/pron 1
fw 7
noun 1
noun/num 1
num/art 1
art/art/pron 1
num/art/pron 2
art/pron 2
num/adj/pron 1
noun/pnoun 1
port 10
19 changes: 19 additions & 0 deletions Swedish/generated_pos_tagset.tsv
@@ -0,0 +1,19 @@
POS Count
nn 9120
pp 77
in 25
vb 2959
av 3294
ab 424
pm 60
1860
prefix 63
pn 87
sn 16
pres part 7
kn 19
nl 41
ie 1
al 3
nna 2
pma 24
99 changes: 99 additions & 0 deletions Urdu/generated_pos_tagset.tsv
@@ -0,0 +1,99 @@
POS Count
at 1
io 1
cc 4
at1 3
rp 21
to 2
pph1 1
vbz 1
vbdz 1
ppis1 1
rg 12
ppy 1
pphs1 2
iw 1
ii 13
vbr 2
xx 1
vhz 3
rr 141
vhn 1
pphs2 1
pp$ 1
ddq 2
ppis2 1
vbdr 1
vd0 1
vbn 1
appge 5
vm 8
rl 28
vv0 557
csw 2
vvn 99
pnqs 1
pn1 13
ppho2 1
rrq 6
ppho1 1
rt 9
mc 9
vdd 1
uh 16
ppio1 1
nn1 423
rrr 12
dd2 2
jj 147
csn 1
da2 3
nnt2 6
vh0 1
vbg 1
da 3
nnt1 13
vdz 1
vbm 2
ppio2 1
nn 16
rrqv 1
dd1 1
rrt 6
nnb 2
ra 5
vvd 39
ge 1
ja 1
nn2 93
cs 4
md 3
vvz 40
vdn 1
vhg 1
np1 60
jb 30
nnl1 17
ppx1 5
vdg 1
vvg 21
nno 4
ppx2 1
vvi 9
ddqge 1
nnu 1
nnl2 1
nd1 4
npm1 12
jjr 5
nnu2 3
pnqo 1
np2 2
nn11 1
npd1 4
pn 1
jj% 3
vvn@ 1
vmk 1
jjt 2
nno2 1
14 changes: 14 additions & 0 deletions Welsh/generated_pos_tagset.tsv
@@ -0,0 +1,14 @@
POS Count
verb 130197
adv 123
art 7
conj 87
pron 67
prep 293
noun 4358
pnoun 6572
adj 1542
fw 40
num 36
intj 6
* 2