---
title: "lingcorpora: get data from different corpora"
author: "A. Koshevoy, G. Moroz"
output:
  html_document:
    theme: lumen
    highlight: tango
    toc: yes
    toc_position: right
    toc_float: yes
    smooth_scroll: false
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=F, warning = F)
```
# About lingcorpora
The `lingcorpora` package gives R access to the APIs of several linguistic corpora. A tutorial for this package is available on the GitHub wiki. The package includes APIs for:
* [Abkhaz Text Corpus](http://baltoslav.eu/apsua/index.php)
* [Avar Text Corpus](http://baltoslav.eu/avar/index.php)
* [National Corpus of Polish](http://nkjp.pl)
* [National Corpus of Russian Language](http://www.ruscorpora.ru/)
```{r, echo=FALSE}
library(lingtypology)
map.feature(c("Abkhaz", "Avar", "Polish", "Russian"))
```
# Installation {.tabset .tabset-fade .tabset-pills}
## R version
Get the latest version from GitHub:
```{r, eval = F}
install.packages("devtools")
devtools::install_github("agricolamz/lingcorpora.R", dependencies = TRUE)
```
Load the library:
```{r}
library(lingcorpora)
```
## Python version
To install the package, run the following command in a terminal:
```{bash, eval = F}
pip3 install git+https://github.com/alexeykosh/lingcorpora.py
```
To import it in your project, use:
```{python, engine.path = '/usr/bin/python3'}
import lingcorpora
```
The Python examples in this document are installed with `pip3` and run with Python 3.
# Usage
Most functions in `lingcorpora` follow the same naming pattern: the first part is the language's ISO code and the second part is `_corpus` (for example, `abk_corpus` for Abkhaz).
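The naming pattern can be sketched in a few lines of base R. This is only an illustration: it builds the names of the three functions documented below from their ISO codes and does not call the package itself.

```{r}
# ISO codes of the corpora whose functions are documented in this README
iso_codes <- c("abk", "ava", "pol")

# Function names follow the "<iso code>_corpus" pattern
function_names <- paste0(iso_codes, "_corpus")
function_names
```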
## Abkhaz Text Corpus {.tabset .tabset-fade .tabset-pills}
### R version
The basic function for searching the Abkhaz Text Corpus is `abk_corpus`. It returns a data frame with results from the corpus. `abk_corpus` takes several arguments (as with any R function, naming the arguments is optional):
* **`query`** --- the only required argument, containing your query. The `DT` library is used here to display data frames, but it is not required.
```{r, eval=FALSE}
df <- abk_corpus(query = "бызшәа")
head(df)
```
```{r, echo=FALSE}
df <- abk_corpus(query = "бызшәа")
library(DT)
datatable(head(df), options = list(dom = 'tip'))
```
* **`kwic`** (key word in context) sets the format of the result lines. If `TRUE`, the function returns a data frame with the query in the middle and its left and right contexts. If `FALSE`, it returns each result as a single string. The default is `TRUE`.
```{r}
df <- abk_corpus(query = "бызшәа", kwic = FALSE)
head(df)
```
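As a rough illustration of the difference between the two formats, the sketch below builds both layouts by hand from a toy English sentence. It does not query the corpus, and the column names `left`, `center` and `right` are hypothetical; the columns actually returned by the package may be named differently.

```{r}
# Toy sentence standing in for one corpus hit (not real corpus output)
tokens <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
hit <- which(tokens == "fox")

# KWIC-style layout: left context, keyword, right context
# (column names are hypothetical)
kwic_row <- data.frame(
  left = paste(tokens[seq_len(hit - 1)], collapse = " "),
  center = tokens[hit],
  right = paste(tokens[(hit + 1):length(tokens)], collapse = " "),
  stringsAsFactors = FALSE
)
kwic_row

# Non-KWIC layout: the whole result as a single string
plain_row <- paste(tokens, collapse = " ")
plain_row
```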
* **`write`** controls whether the results are written to a file in the working directory. If `FALSE`, the function creates a data frame in the Global Environment. Otherwise, it writes a .tsv file named after the argument value. The default is `FALSE`.
```{r, eval=FALSE}
abk_corpus(query = "бызшәа", write = "myquery")
```
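For readers unfamiliar with the .tsv format, here is a minimal base R sketch of writing a data frame to a tab-separated file and reading it back. It is not the package's own implementation: the toy data frame, its column names and the use of a temporary directory are assumptions made for the example.

```{r}
# Toy data frame standing in for a search result (hypothetical columns)
toy <- data.frame(
  left = c("a", "b"),
  center = c("x", "x"),
  right = c("c", "d"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file to a temporary directory
path <- file.path(tempdir(), "myquery.tsv")
write.table(toy, file = path, sep = "\t", quote = FALSE,
            row.names = FALSE, fileEncoding = "UTF-8")

# Read it back to check the round trip
back <- read.delim(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
back
```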
The **`query`** argument accepts regular expressions or CQL (Corpus Query Language); read more on the [help page](http://baltoslav.eu/apsua/index.php):
```{r, eval=FALSE}
df <- abk_corpus(query = "бызшәа*")
head(df)
```
```{r, echo=FALSE}
df <- abk_corpus(query = "бызшәа*")
datatable(head(df), options = list(dom = 'tip'))
```
### Python version
## Avar Text Corpus {.tabset .tabset-fade .tabset-pills}
### R version
The basic function for searching the Avar Text Corpus is `ava_corpus`. It returns a data frame with results from the corpus. `ava_corpus` takes several arguments (as with any R function, naming the arguments is optional):
* **`query`** --- the only required argument, containing your query
```{r, eval=FALSE}
df <- ava_corpus(query = "шагьар")
head(df)
```
```{r, echo=FALSE}
df <- ava_corpus(query = "шагьар")
datatable(head(df), options = list(dom = 'tip'))
```
* **`kwic`** (key word in context) sets the format of the result lines. If `TRUE`, the function returns a data frame with the query in the middle and its left and right contexts. If `FALSE`, it returns each result as a single string. The default is `TRUE`.
```{r}
df <- ava_corpus(query = "вацазе", kwic = FALSE)
head(df)
```
* **`write`** controls whether the results are written to a file in the working directory. If `FALSE`, the function creates a data frame in the Global Environment. Otherwise, it writes a .tsv file named after the argument value. The default is `FALSE`.
```{r, eval=FALSE}
ava_corpus(query = "васазе", write = "myquery")
```
The **`query`** argument accepts regular expressions or CQL (Corpus Query Language); read more on the [help page](http://baltoslav.eu/avar/index.php):
```{r, eval=FALSE}
df <- ava_corpus(query = "магIарул*")
head(df)
```
```{r, echo=FALSE}
df <- ava_corpus(query = "магIарул*")
datatable(head(df), options = list(dom = 'tip'))
```
### Python version
## National Corpus of Polish {.tabset .tabset-fade .tabset-pills}
### R version
The basic function for searching the National Corpus of Polish is `pol_corpus`. It returns a data frame with results from the corpus. `pol_corpus` takes several arguments (as with any R function, naming the arguments is optional):
* **`query`** --- the only required argument, containing your query
```{r, eval=FALSE}
df <- pol_corpus(query = "tata")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata")
datatable(head(df), options = list(dom = 'tip'))
```
* **`tag`** --- if `TRUE`, every word in a result is given morphological tags
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", tag = TRUE)
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", tag = TRUE)
datatable(head(df), options = list(dom = 'tip'))
```
* **`n_results`** sets the number of examples returned from the corpus. The default is 10.
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", n_results = 6)
df
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", n_results = 6)
datatable(df, options = list(dom = 'tip'))
```
* **`corpus`** --- a vector with the corpus type: "nkjp300", "nkjp1800", "nkjp1M", "ipi250", "ipi030", "frequency-dictionary"
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", corpus = "nkjp1M")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", corpus = "nkjp1M")
datatable(head(df), options = list(dom = 'tip'))
```
* **`kwic`** (key word in context) sets the format of the result lines. If `TRUE`, the function returns a data frame with the query in the middle and its left and right contexts. If `FALSE`, it returns each result as a single string. The default is `TRUE`.
```{r}
df <- pol_corpus(query = "tata", kwic = FALSE)
head(df)
```
* **`write`** controls whether the results are written to a file in the working directory. If `FALSE`, the function creates a data frame in the Global Environment. Otherwise, it writes a .tsv file named after the argument value. The default is `FALSE`.
```{r, eval=FALSE}
pol_corpus(query = "tata", write = "myquery")
```
The **`query`** argument accepts regular expressions or CQL (Corpus Query Language); read more on the [help page](http://nkjp.pl/poliqarp/help/plse3.html#x4-50003):
```{r, eval=FALSE}
df <- pol_corpus("An*a")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus("An*a")
datatable(head(df), options = list(dom = 'tip'))
```
```{r, eval=FALSE}
df <- pol_corpus("[base = 'strzyc']")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus("[base = 'strzyc']")
datatable(head(df), options = list(dom = 'tip'))
```
### Python version
```{python, engine.path = '/usr/bin/python3'}
import lingcorpora
print(lingcorpora.pol_search("tata"))
```
## National Corpus of Russian Language {.tabset .tabset-fade .tabset-pills}
### R version
### Python version