Plain-text ccKres

Plain-text version of ccKres 1.0, a corpus of written Slovenian. Includes a few examples of tools for extracting useful information from TEI files.

The plain-text files can be found in the plain-text-corpus directory, sorted as follows:

SSJ
├── I             - internet
└── T             - tisk (print)
    ├── D         - drugo (other)
    ├── K         - knjižno (literary)
    │   ├── L     - leposlovje (fiction)
    │   └── S     - strokovno (non-fiction)
    └── P         - periodično (periodicals)
        ├── C     - časopis (newspapers)
        └── R     - revija (magazines)
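
As a rough illustration of how the layout above could be used, the following sketch maps a file path to the category labels of the directories it sits in. The labels come from the tree; the helper name `category_labels`, the `LABELS` table, and the assumption that files sit directly inside these directories are hypothetical and not part of this repository.

```ruby
# Labels copied from the directory tree above. Keying them by the list of
# directory names below SSJ is an illustrative choice, not the repo's code.
LABELS = {
  %w[I]     => "internet",
  %w[T]     => "tisk (print)",
  %w[T D]   => "drugo (other)",
  %w[T K]   => "knjižno (literary)",
  %w[T K L] => "leposlovje (fiction)",
  %w[T K S] => "strokovno (non-fiction)",
  %w[T P]   => "periodično (periodicals)",
  %w[T P C] => "časopis (newspapers)",
  %w[T P R] => "revija (magazines)"
}.freeze

# Hypothetical helper: returns one label per directory level below SSJ.
# Paths that do not contain an SSJ segment yield an empty list.
def category_labels(path)
  segments = path.split("/")
  start = segments.index("SSJ")
  return [] if start.nil?

  dirs = segments[(start + 1)...-1] # drop everything up to SSJ and the file name
  (1..dirs.length).map { |n| LABELS[dirs.first(n)] }.compact
end
```

Calling `category_labels` on a path such as `plain-text-corpus/SSJ/T/K/L/example.txt` (a made-up file name) would return the labels for print, literary, and fiction, in that order.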

Although the text files were produced using teitomarkdown (part of TEIC/Stylesheets), the result is barely formatted into paragraphs. The text will therefore need substantial preprocessing for most uses.
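
What preprocessing is needed depends on the task. As one possible starting point, the sketch below normalizes line endings, splits on blank lines, and collapses runs of whitespace. It assumes that blank lines are the only paragraph signal worth keeping, which may not hold for every file; the helper name `paragraphs` is hypothetical.

```ruby
# Hypothetical helper: splits raw text into whitespace-normalized paragraphs.
def paragraphs(text)
  text
    .gsub(/\r\n?/, "\n")                   # normalize Windows/old Mac line endings
    .split(/\n\s*\n/)                      # treat blank lines as paragraph breaks
    .map { |p| p.gsub(/\s+/, " ").strip }  # collapse whitespace inside a paragraph
    .reject(&:empty?)                      # drop paragraphs that were only whitespace
end
```

For real use you would likely add steps specific to your task, such as sentence splitting or tokenization, which this sketch does not attempt.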

Additionally, we have extracted a list of all words found in the corpus, together with their morphosyntactic annotations (see morphosyntax_dict.txt).

Examples

Generating plain-text files

This repository already contains the extracted plain-text files. If, for whatever reason, you want to regenerate them, this is how they were originally generated:

$ rake kres:download[cckres]
$ rake kres:extract[cckres]
$ rake kres:sort[cckres,plain-text-corpus]

To generate the morphosyntactic dictionary:

$ rake kres:msd[~/Downloads/cckresV1_0/xml] \
  | sort \
  | uniq \
  > morphosyntax_dict.txt
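
The `sort | uniq` stage of the pipeline above only deduplicates and orders the lines emitted by the rake task. If you would rather do that step in Ruby, a minimal equivalent might look like the sketch below. The helper name `sorted_unique` is hypothetical. Ruby compares strings byte by byte, so the ordering can differ from a locale-aware `sort` unless the shell command is run with `LC_ALL=C`.

```ruby
# Hypothetical helper: mirrors `sort | uniq` for an array of lines.
# Trailing newlines are removed so that "a\n" and "a" count as the same entry.
def sorted_unique(lines)
  lines.map(&:chomp).uniq.sort
end
```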

The above tasks require the following programs to be available in your PATH:

curl
unzip
find
ruby
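
To check for these programs before running the tasks, something like the following sketch could be used. It is not part of the Rakefile; the names `REQUIRED_PROGRAMS`, `on_path?`, and `missing_programs` are hypothetical, and the check only looks for executable files with the given names in the directories listed in PATH.

```ruby
# Program names taken from the list above.
REQUIRED_PROGRAMS = %w[curl unzip find ruby].freeze

# Hypothetical helper: true if an executable file named `program`
# exists in one of the directories of `path`.
def on_path?(program, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, program)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Hypothetical helper: returns the programs that could not be found.
def missing_programs(programs = REQUIRED_PROGRAMS, path = ENV.fetch("PATH", ""))
  programs.reject { |program| on_path?(program, path) }
end
```

An empty result from `missing_programs` suggests that all four programs are available; anything it returns should be installed first.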

Ruby dependencies must be installed as well. Install Bundler first if you do not already have it:

$ gem install bundler
$ bundle install

License

The code is licensed under the MIT license. The ccKres corpus is licensed under CC BY-NC-SA 4.0.

mfilej/cckres-plain

Folders and files

Latest commit

History

Repository files navigation

Plain-text ccKres

Examples

Generating plain-text files

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages