README.Rmd

---
title: "lingcorpora: get data from different corpora"
author: "A. Koshevoy, G. Moroz"
output:
  html_document:
    theme: lumen
    highlight: tango
    toc: yes
    toc_position: right
    toc_float: yes
    smooth_scroll: false
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=F, warning = F)
```


# About lingcorpora
`lingcorpora` package provides R with API from different linguistic corpora. A tutorial for this package is avaliable on GitHub wiki. This package includes APIs for:

* [Abkhaz Text Corpus](http://baltoslav.eu/apsua/index.php)
* [Avar Text Corpus](http://baltoslav.eu/avar/index.php)
* [National Corpus of Polish](nkjp.pl)
* [National Corpus of Russian Language](http://www.ruscorpora.ru/)

```{r, echo=FALSE}
library(lingtypology)
map.feature(c("Abkhaz", "Avar", "Polish", "Russian"))
```

# Instalation {.tabset .tabset-fade .tabset-pills}
## R version
Get the last version from GitHub:
```{r, eval = F}
install.packages("devtools")
devtools::install_github("agricolamz/lingcorpora.R", dependencies = TRUE)
```

Load a library:
```{r}
library(lingcorpora)
```

## Python version
If you want to install our package, please tap the following command in Terminal:
```{bash, eval = F}
pip3 install git+https://github.com/alexeykosh/lingcorpora.py
```

For import it in your project, tap:
```{python, engine.path = '/usr/bin/python3'}
import lingcorpora
```

ADD about python3!!!

# Usage
Most of the functions in `lingcorpora` have the same syntax: first part is a language iso code, the second part is `_corpus`.

## Abkhaz Text Corpus {.tabset .tabset-fade .tabset-pills}

### R version
The basic function for searching in Abkhaz Text Corpus is `abk_corpus`. This function creates a dataframe with a results from the corpus. The function `abk_corpus` have a lot of arguments (as in all R function, it is not obligatory to write names of the arguments):

* **`query`** --- the sole obligatory argument with your query. I will use library `DT` for data frame visualization, but it is not necessary
```{r, eval=FALSE}
df <- abk_corpus(query = "бызшәа")
head(df)
```
```{r, echo=FALSE}
df <- abk_corpus(query = "бызшәа")
library(DT)
datatable(head(df), options = list(dom = 'tip'))
```

* **`kwic`** (key word in context) is the format for resulted lines. If TRUE, then it returns a dataframe with query in the middle and left and right contexts. If FALSE, then it returns each result in one string. By default is TRUE.
```{r}
df <- abk_corpus(query = "бызшәа", kwic = FALSE)
head(df)
```

* **`write`** argument writes a file in the working derictory. If FALSE, then it creates a dataframe in Global Environment. Otherwise function writes a .tsv file with the name frome the argument value. By default is FALSE.
```{r, eval=FALSE}
abk_corpus(query = "бызшәа", write = "myquiry")
```

The **`query`** argument can be filled with regular expressions or CQL (corpus query language), read more at the [help page](http://baltoslav.eu/apsua/index.php)
```{r, eval=FALSE}
df <- abk_corpus(query = "бызшәа*")
head(df)
```
```{r, echo=FALSE}
df <- abk_corpus(query = "бызшәа*")
datatable(head(df), options = list(dom = 'tip'))
```

### Python version

## Avar Text Corpus {.tabset .tabset-fade .tabset-pills}

### R version
The basic function for searching in Avar Text Corpus is `ava_corpus`. This function creates a dataframe with a results from the corpus. The function `ava_corpus` have a lot of arguments (as in all R function, it is not obligatory to write names of the arguments):

* **`query`** --- the sole obligatory argument with your query
```{r, eval=FALSE}
df <- ava_corpus(query = "шагьар")
head(df)
```
```{r, echo=FALSE}
df <- ava_corpus(query = "шагьар")
datatable(head(df), options = list(dom = 'tip'))
```

* **`kwic`** (key word in context) is the format for resulted lines. If TRUE, then it returns a dataframe with query in the middle and left and right contexts. If FALSE, then it returns each result in one string. By default is TRUE.
```{r}
df <- ava_corpus(query = "вацазе", kwic = FALSE)
head(df)
```
* **`write`** argument writes a file in the working derictory. If FALSE, then it creates a dataframe in Global Environment. Otherwise function writes a .tsv file with the name frome the argument value. By default is FALSE.
```{r, eval=FALSE}
ava_corpus(query = "васазе", write = "myquiry")
```

The **`query`** argument can be filled with regular expressions or CQL (corpus query language), read more at the [help page](http://baltoslav.eu/avar/index.php)
```{r, eval=FALSE}
df <- ava_corpus(query = "магIарул*")
head(df)
```
```{r, echo=FALSE}
df <- ava_corpus(query = "магIарул*")
datatable(head(df), options = list(dom = 'tip'))
```

### Python version

## National Corpus of Polish {.tabset .tabset-fade .tabset-pills}

### R version
The basic function for searching in National Corpus of Polish is `pol_corpus`. This function creates a dataframe with a results from the corpus. The function `pol_corpus` have a lot of arguments (as in all R function, it is not obligatory to write names of the arguments):

* **`query`** --- the sole obligatory argument with your query
```{r, eval=FALSE}
df <- pol_corpus(query = "tata")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata")
datatable(head(df), options = list(dom = 'tip'))
```

* **`tag`** --- if TRUE all the words in a result will have morphological tags
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", tag = TRUE)
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", tag = TRUE)
datatable(head(df), options = list(dom = 'tip'))
```

* **`n_results`** defines number of examples from the corpus. By default is 10.
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", n_results = 6)
df
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", n_results = 6)
datatable(df, options = list(dom = 'tip'))
```

* **`corpus`** --- vector with a type of the corpus: "nkjp300", "nkjp1800", "nkjp1M", "ipi250", "ipi030", "frequency-dictionary"
```{r, eval=FALSE}
df <- pol_corpus(query = "tata", corpus = "nkjp1M")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus(query = "tata", corpus = "nkjp1M")
datatable(head(df), options = list(dom = 'tip'))
```

* **`kwic`** (key word in context) is the format for resulted lines. If TRUE, then it returns a dataframe with query in the middle and left and right contexts. If FALSE, then it returns each result in one string. By default is TRUE.
```{r}
df <- pol_corpus(query = "tata", kwic = FALSE)
head(df)
```
* **`write`** argument writes a file in the working derictory. If FALSE, then it creates a dataframe in Global Environment. Otherwise function writes a .tsv file with the name frome the argument value. By default is FALSE.
```{r, eval=FALSE}
pol_corpus(query = "tata", write = "myquiry")
```

The **`query`** argument can be filled with regular expressions or CQL (corpus query language), read more at the [help page](http://nkjp.pl/poliqarp/help/plse3.html#x4-50003):

```{r, eval=FALSE}
df <- pol_corpus("An*a")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus("An*a")
datatable(head(df), options = list(dom = 'tip'))
```

```{r, eval=FALSE}
df <- pol_corpus("[base = 'strzyc']")
head(df)
```
```{r, echo=FALSE}
df <- pol_corpus("[base = 'strzyc']")
datatable(head(df), options = list(dom = 'tip'))
```

### Python version
```{python, engine.path = '/usr/bin/python3'}
import lingcorpora
print(lingcorpora.pol_search("tata"))
```
## National Corpus of Russian Language {.tabset .tabset-fade .tabset-pills}

### R version

### Python version