This is the backing repo for ⲣⲉⲙⲛ̀Ⲭⲏⲙⲓ, a project that aims to make the Coptic language more learnable.
We use:
- GitHub for our code base.
- GitHub Pages for our website.
- Google Drive to share large files.
- Squarespace for DNS registration.
- Google Analytics and Google Search Console to analyze traffic.
NOTE: You can update the diagram by uploading it to draw.io.
- Running `make install` should take care of most of the Python installations. If there are missing binaries that you need to download, `make install` will let you know.
- You might also want to alias `python` to the latest version.
- Our pipelines are defined in `Makefile`, and they correspond to the blue circles in the diagram. Other pipelines in `Makefile` are only used during development and testing, and are not relevant for output (re)generation.
- Keep in mind that parameters are written with the assumption that scripts are invoked from the repo's root directory, rather than from the directory where the script lives. You should do most of your development from within the root directory.
- This file is the only `README.md` in the repo (and this is enforced by a pre-commit hook). Technical documentation is intentionally centralized. Besides this file, docs can be found in:
  - In-code comments
  - Planning framework
  - Commit messages (albeit less significantly)

  User-facing documentation shouldn't live on the repo, but should go on the website instead.
- With the exception of `archive/`, `test/`, `data/`, and `pre-commit/`, each subdirectory of the root directory represents a major pipeline, or category of pipelines, along with its associated data. You will also notice that shared code is (intentionally) minimized, and restricted to the pre-commits and some helpers and utility functions.
- We use pre-commit hooks extensively. They have helped us discover a lot of bugs and issues with our code, and they keep our repo organized. They are not optional, and many of our pipelines assume that the pre-commits have done their job. Their installation should be covered by `make install`. They are defined in `.pre-commit-config.yaml`. They run automatically before a commit, but you can also trigger them with Make recipes by typing `make add`, `make index`, or `make test` (the three are synonymous). Until #120 is resolved, you will need to pay some attention to when to trigger them manually. As a rule of thumb, run them once after each pipeline, and before starting another downstream pipeline.
Most of our projects have a data
subdirectory. We have somewhat strict rules
regarding its content. It usually (which, in our repo, means almost always)
contains three subdirectories:
- `raw/`: Data that is copied from elsewhere. This would, for example, include the Marcion SQL tables copied as is, unmodified. The contents of this directory remain true to the original source.
- `input/`: Data that we either modified or created. If we want to fix typos in data that we copied, we don't touch the data under `raw/`, but we take the liberty to modify the copies that live under `input/`.

  This directory also includes the data that we created ourselves.

  You can show the delta between raw and input data using `git diff --no-index`. It's also good to be aware of the `--word-diff` flag.
- `output/`: This contains the data written by our pipelines, one subdirectory per format. If your pipeline writes both TSV and HTML, they should go respectively to `output/tsv/` and `output/html/`.
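The layout above can be sketched in Python. This is only an illustration of the convention: the helper names (`output_dir`, `raw_vs_input_diff_cmd`) and the project path `dictionary/example` are hypothetical and not part of the repo, and the diff command assumes that `raw/` and `input/` hold files under the same names.

```python
from pathlib import Path


def output_dir(data_dir: Path, fmt: str) -> Path:
    """Return the per-format output directory, e.g. data/output/tsv."""
    return data_dir / "output" / fmt


def raw_vs_input_diff_cmd(data_dir: Path, word_diff: bool = False) -> list[str]:
    """Build a `git diff --no-index` command comparing raw/ against input/."""
    cmd = ["git", "diff", "--no-index"]
    if word_diff:
        # Word-level diffs are easier to read for long lines of data.
        cmd.append("--word-diff")
    cmd += [str(data_dir / "raw"), str(data_dir / "input")]
    return cmd


# Hypothetical project directory, used for illustration only.
data = Path("dictionary/example/data")
print(output_dir(data, "tsv"))
print(" ".join(raw_vs_input_diff_cmd(data, word_diff=True)))
```

The command is only built here, not executed. Pass the list to `subprocess.run` from the repo's root directory if you want to run it.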
For now, run this once at the beginning of your coding session to export environment variables, which are necessary for some pipelines:
source .env_INFO
Equivalently:
. ./.env_INFO
Later on, you might need to create your own `.env` file. It is ignored by a rule in `.gitignore`, so there is no shared version. It is documented in `.env_INFO`, so this section is intentionally brief.
We use GitHub to track our plans and TODOs.
Issues need to be as specific and isolated as possible. Most of the time, they span a single component, although they can often work mainly in one component and spill to others, and sometimes they're generic and span one aspect of multiple components (such as the conventions set for the whole repo). Issues mostly have exactly one How, and usually one Why (see labels below). Issues should involve a local change or set of local changes.
High-priority issues are defined in two ways:
- Assignment to a developer
- Belonging to a component version that we are working to release.
The project page offers alternative views of the issues, which can come in handy for planning purposes.
- Milestones represent more complex pieces of work. Their size is undetermined. They could take weeks or years, but they are not simple enough to span just a few days. This is their main use case.
- There is a second, somewhat unorthodox, use case for milestones: as component backlogs, for miscellaneous issues related to some component that don't belong to a goal that we've already defined and crystallized into a milestone.
- Every issue must belong to a milestone.
- Milestone priorities are assigned using due dates. Milestones help make long-term plans.
- The number of milestones should remain "under control".
- The platform component milestone refers to the development platform and tooling. Issues under this milestone are mainly developer-facing rather than user-facing, and their purpose is to improve the framework that developers use to drive the project forward. This component is about sharpening our saw so we can cut wood faster.
- When work on a milestone is good enough, it's closed, the achievement is celebrated, and its remaining issues move to the corresponding component backlog milestone.
- Component-specific milestones are often named as component versions. (For example, Site v1.0 is a milestone referring to the first release of the Site.)
- Backlog milestones are often named after the component, but without a version, and often with the prefix Pipeline:.
- All issues should be labeled.
- We assign the following categories of labels to issues:
  - How: How can the task be achieved?
    - `architect`: Architecture and design.
    - `diplomacy`: Diplomacy, connections, and outreach.
    - `documentation`: Writing documentation.
    - `labor`: Manual data collection.
    - `freelance`: Hiring a freelancer.
    - We don't assign a `coding` label, because that includes most tasks. A task that doesn't have a nature label should be a coding task.
  - Who: Is the issue user-facing or developer-oriented?
    - `user`: A user-oriented improvement.
    - `dev`: A developer-oriented, not user-visible, improvement.
  - Why: What is the purpose of this issue?
    - `data collection`: Expand the data that we own.
    - `maintenance`: Maintain existing territories, rather than expand into new ones.
    - `rigor`: Improve the rigor (particularly parsing, or inflection generation).
    - `UI`: Improve the user interface.
    - `bug`: Fix a bug.
- Minimize dependence on HTML, and implement behaviours in TypeScript when possible.
- Add in-code assertions and checks. This is our first line of defense, and has been the champion when it comes to ensuring correctness and catching bugs.
- We rely heavily on manual inspection of the output to verify correctness. The `git diff --word-diff` command is helpful when our line-oriented `diff` is not readable. Keep this in mind when structuring your output data.
- We force the existence of unit tests, at least one for each Python file. While these have so far been mere placeholders, the mere import of a package sometimes catches syntax errors, and the placeholders will make it convenient to write tests whenever desired. A big benefit of unit tests is that they make us confident that a change is correct, so we can speed up the development process.
- Do not let Python tempt you to use its built-in types instead of classes and objects. Don't forget about OOP!
- Document the code.
- We use `mypy` for static type checks. While not required by `mypy` (which can often infer the types without hints, and would throw an error whenever an explicit type annotation is needed), it's still encouraged to use type hints extensively.
- Collect and print stats.
- Color the outputs whenever you can. It keeps your programmers entertained!
- Keep your code `grep`-able, especially when it comes to the constants used across directories.
- Privatize methods whenever possible. Use the name mangling feature in Python.
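Several of the guidelines above (classes over bare built-in types, in-code assertions, type hints, name mangling, and colored stats) can be sketched together. The `Entry` and `Stats` classes below are hypothetical and exist only for illustration; they are not taken from the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """A hypothetical dictionary entry, modeled as a class rather than a bare tuple."""

    key: int
    word: str

    def __post_init__(self) -> None:
        # In-code assertions: fail early on malformed data.
        assert self.key > 0, f"key must be positive, got {self.key}"
        assert self.word, "word must not be empty"


class Stats:
    """Collects simple counters and formats them for printing."""

    def __init__(self) -> None:
        # The double underscore mangles this name to _Stats__counts.
        self.__counts: dict[str, int] = {}

    def __bump(self, name: str) -> None:
        # Private helper, also name-mangled (to _Stats__bump).
        self.__counts[name] = self.__counts.get(name, 0) + 1

    def record(self, entry: Entry) -> None:
        self.__bump("entries")
        if " " in entry.word:
            self.__bump("multi-word entries")

    def report(self) -> str:
        # ANSI green for the counter names, so stats stand out in a terminal.
        return "\n".join(
            f"\033[32m{name}\033[0m: {count}"
            for name, count in sorted(self.__counts.items())
        )


stats = Stats()
stats.record(Entry(key=1, word="example"))
print(stats.report())
```

Name mangling does not make the attribute inaccessible; it only renames it, which discourages accidental use from outside the class.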
- Our pipelines are primarily written in Python. There is minimal logic in Bash.
- We have a strong bias for Python over Bash. Use Bash if you expect the equivalent Python to take significantly more lines of code.
- We use TypeScript for static site logic. It then gets transpiled to JavaScript by running `make transpile`. We don't write JavaScript directly.
- We expect to make a similar platform-specific expansion into another territory for the app.
- In the past, we voluntarily used Java (for an archived project). Won't happen again! We also used VBA and JS for Microsoft Excel and Google Sheets macros (also archived at the moment) because they were required by the platform.
- It is desirable to strike a balance between the benefits of focusing on a small number of languages, and the different powers that different languages can uniquely exhibit. We won't compromise the latter for the former. Use the right language for a task. When two languages can do a job equally well, uncompromisingly choose the one that is more familiar.
- We collect extensive stats, and we remind you of them using a pre-commit. The
primary targets of our statistics are:
- The size of our code (represented by the number of lines of code). We also collect this stat for each subproject or pipeline step independently.
- The number of data items we've collected for data collection tasks.
- We also record the number of commits, and the number of contributors.
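As a rough illustration of the per-subproject code size stat, the sketch below counts lines per top-level subdirectory. The function name `lines_of_code` and the default suffix filter are assumptions made for this example; how the repo's pre-commit actually computes the stat is not shown here.

```python
from collections import Counter
from pathlib import Path


def lines_of_code(root: Path, suffixes: tuple[str, ...] = (".py",)) -> Counter[str]:
    """Count lines per top-level subdirectory of `root` for files with the given suffixes."""
    counts: Counter[str] = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            rel = path.relative_to(root)
            # Files directly under the root are grouped under ".".
            project = rel.parts[0] if len(rel.parts) > 1 else "."
            with path.open(encoding="utf-8", errors="replace") as f:
                counts[project] += sum(1 for _ in f)
    return counts
```

Calling `lines_of_code(Path("."))` from the repo's root directory would give one count per subdirectory, which matches the idea of one stat per subproject.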
This directory contains the data and logic for processing our dictionaries.
There are many reasons we have decided to add pictures to our dictionary, and heavily invested in the image pipeline. They have become one of the integral pieces of our dictionary framework.
-
The meaning of a word is much more strongly and concretely conveyed by an image than by a word. Learning is not about knowing vocabulary or grammar. Learning is ultimately about creating the neural pathways that enable language to flow out of you naturally. A given word needs to settle and connect with nodes in your associative memory in order for you to be able to use it. If our goal is to create or strengthen the neural pathways between a Coptic word and related nodes in your brain, then it aids the learning process to achieve as much neural activation as possible during learning. This is much better achieved by an image than by a mere translation, given the way human brains work. After all, the visual processing areas of our brains are bigger, faster, and far more ancient and primordial (even reptiles can see) compared to the language processing areas. You will often find that, when you learn a new word, the associated images pop up in your brain more readily than the translation. Thus the use of images essentially revolutionizes the language learning process.
-
Oftentimes, the words describe an entity or concept that is unfamiliar to many users. Things like ancient crafts, plant or fish species, farmer's tools, and the like, are unfamiliar. Showing a user the English translation of a word doesn't suffice for the user to understand what it is, and they would often look up images themselves in order to find out what the word actually means. By embedding the pictures in the dictionary, we save users some time so they don't have to look it up themselves.
-
Translations are often taken lightly by users. Pictures are not. When a dictionary author translates a given Coptic word into different English words, for example, the extra translations are often seen by users as auxiliary - tokens added there to convey a meaning that the dictionary author couldn't convey using fewer words.
That's not the case for pictures. Pictures are taken seriously by users, and are more readily accepted as bearing a true, authentic, independent meaning of the word. Listing images (especially after we have started ascribing each image to a sense that the word conveys) is a way to recognize and legitimize those different senses and meanings that a word possesses.
It's for this reason that images must be deeply contemplated, and a word must be digested well, before we add explanatory images for it. Collecting images is tantamount to authoring a dictionary.
Our experience collecting images has taught us a few lessons. We tend to follow the following guidelines when we search for pictures:
- Each image ends up being resized to a width of 300 pixels and a height proportional to the original. We prefer images with a minimum width of 300 pixels, though down to 200 is acceptable.
- As for image height, short images are rarely ugly, but long images usually are. So we set a generously low lower bound of 100 pixels on the resized height, and a stricter upper bound of 500 pixels, although we tend to prefer the height to fall within a range of 200 to 400 pixels.
- Collecting sources is mandatory. We always record the URL that an image is retrieved from. Our `img_helper` script, which we use to process images, can be supplied with a URL, and it will download the image and store the source (and also resize the image to the final version). This simplifies the process.
- We make extensive use of icons. They can capture the meaning of a word in situations when it's otherwise hard to describe a word using an image (example).
- This hasn't been contemplated, but when given a choice, prefer an ancient Egyptian explanatory image, followed by an old (not necessarily Egyptian) image, followed by a modern image (example). We prefer to keep the images as close as possible to their reflections in the mind of a native speaker. We also want to stress the fact that those Coptic words can be equally used to refer to entities from other cultures, or modern entities.

  This could be revisited later.
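The size guidelines above reduce to a little arithmetic, sketched below. This is not the logic of `img_helper`; the function names, the rounding, and the choice to treat out-of-bounds images as rejected are assumptions made for illustration.

```python
TARGET_WIDTH = 300  # Final width in pixels.
MIN_SOURCE_WIDTH = 200  # Narrowest acceptable source image.
MIN_HEIGHT, MAX_HEIGHT = 100, 500  # Bounds on the resized height.
PREFERRED_HEIGHT = range(200, 401)  # Preferred resized height, inclusive.


def resized_height(width: int, height: int) -> int:
    """Height after scaling to TARGET_WIDTH while keeping the aspect ratio."""
    return round(height * TARGET_WIDTH / width)


def assess(width: int, height: int) -> str:
    """Classify a source image as 'preferred', 'acceptable', or 'rejected'."""
    new_height = resized_height(width, height)
    if width < MIN_SOURCE_WIDTH or not MIN_HEIGHT <= new_height <= MAX_HEIGHT:
        return "rejected"
    if width >= TARGET_WIDTH and new_height in PREFERRED_HEIGHT:
        return "preferred"
    return "acceptable"
```

For example, a 600x600 source resizes to 300x300 and is classified as preferred, while a 300x900 source resizes to a height of 900 pixels and exceeds the upper bound.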
The following entries have no dialect specified in Crum, so they are treated as part of all dialects.
- https://remnqymi.com/crum/1274.html
- https://remnqymi.com/crum/1292.html
- https://remnqymi.com/crum/1367.html
- https://remnqymi.com/crum/1462.html
- https://remnqymi.com/crum/1553.html
- https://remnqymi.com/crum/1555.html
- https://remnqymi.com/crum/1557.html
- https://remnqymi.com/crum/1558.html
- https://remnqymi.com/crum/1657.html
- https://remnqymi.com/crum/1659.html
- https://remnqymi.com/crum/1712.html
- https://remnqymi.com/crum/1957.html
- https://remnqymi.com/crum/2074.html
- https://remnqymi.com/crum/2075.html
- https://remnqymi.com/crum/2076.html
- https://remnqymi.com/crum/2077.html
- https://remnqymi.com/crum/2078.html
- https://remnqymi.com/crum/2079.html
- https://remnqymi.com/crum/2081.html
- https://remnqymi.com/crum/2082.html
- https://remnqymi.com/crum/2084.html
- https://remnqymi.com/crum/2085.html
- https://remnqymi.com/crum/2086.html
- https://remnqymi.com/crum/2087.html
- https://remnqymi.com/crum/2088.html
- https://remnqymi.com/crum/2090.html
- https://remnqymi.com/crum/2091.html
- https://remnqymi.com/crum/2092.html
- https://remnqymi.com/crum/2093.html
- https://remnqymi.com/crum/2195.html
- https://remnqymi.com/crum/2205.html
- https://remnqymi.com/crum/2832.html
- https://remnqymi.com/crum/3117.html
- https://remnqymi.com/crum/3230.html
- https://remnqymi.com/crum/3231.html
- https://remnqymi.com/crum/3257.html
- https://remnqymi.com/crum/3302.html
NOTE: Some undialected entries in this list have been removed because their dialect was inferred, e.g. all the entries under Ⳉ have been labeled as Akhmimic.
We are rethinking the current handling of undialected entries. See #237.
The following entries are absent from Crum's dictionary. They were added to our database from other sources:
- https://remnqymi.com/crum/3379.html
- https://remnqymi.com/crum/3380.html
- https://remnqymi.com/crum/3381.html
- https://remnqymi.com/crum/3382.html
- https://remnqymi.com/crum/3385.html
`dawoud-D100/` contains scans of Moawad Dawoud's dictionary. They are obtained from the PDF using the `imagemagick` command. (The density used is 100, hence the suffix `-D100`.)
The PDF / image processing scripts can be found under `archive/dictionary/copticocc.org`.
We had some plans to combine the strength of KELLIA and Crum (#53, #6), but they have been abandoned.
This directory contains the data and logic for processing the Bible corpus.
There are several published versions of the Coptic Bible. The most recent, and most complete, is that of St. Shenouda the Archimandrite Coptic Society. It is the Coptic Bible project that is most worthy of investment at the moment.
This directory contains the data and logic for processing dictionaries into flashcards. It is named as such because our first use case was a flashcard app, although our use of the dictionaries has since become more versatile.
When you import a package into your (personal) Anki database, Anki uses the IDs to eliminate duplicates.
Uniqueness is therefore important. But what is trickier, and perhaps more important, is persistence. If we export new versions of a certain deck regularly, we should maintain persistent IDs to ensure correct synchronization. Otherwise, identical pieces of data that have distinct IDs will result in duplicates.
There are three types of IDs in the generated package:
- Note ID
`genanki` suggests defining the GUID as a hash of a subset of fields that uniquely identify a note.
The GUID must be unique across decks. Therefore, this subset of field values must be unique, including across decks. You can solve this by prefixing the keys with the name of the deck.
In our script, we ask the user to provide a list of keys as part of their input, along with the lists of fronts, backs, deck names, etc. The users of the package must assign the keys properly, ensuring uniqueness, and refraining from changing or reassigning them afterwards.
This is somewhat straightforward for Marcion's words. Use of Marcion's IDs for synchronization should suffice.
For the Bible, we could use the verse reference as a note ID, and ensure that the book names, chapter numbers, and verse numbers don't change in a following version.
For other data creators without programming expertise, a sequence number works as long as nobody inserts a new row in the middle of the CSV, which would mess up the keys. Discuss keying with those creators. As of today, only copticsite.com's data has this problem.
- Deck ID
Whenever possible, we use a hardcoded deck ID. This is not possible for decks that are autogenerated, such as the Bible decks which are separated for nesting (as opposed to being grouped in a single deck). In such cases, we use a hash of the deck name, and the deck name becomes a protected field.
- Model ID
Model IDs are hardcoded.
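The note and deck ID schemes described above can be sketched with the standard library alone. This is not genanki's own helper and not necessarily the hashing used in our script; the function names, the separator, the digest truncation, and the deck name in the usage line are assumptions made for illustration. A cryptographic digest is used because Python's built-in `hash()` is randomized per process for strings, which would break persistence.

```python
import hashlib


def note_guid(deck_name: str, key: str) -> str:
    """Stable note GUID derived from the deck name and a per-note key."""
    # Prefixing the key with the deck name keeps GUIDs unique across decks.
    payload = f"{deck_name}\x1f{key}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def deck_id(deck_name: str) -> int:
    """Stable integer deck ID derived from the deck name."""
    digest = hashlib.sha256(deck_name.encode("utf-8")).digest()
    # Keep 31 bits so the result is a non-negative integer.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


# Hypothetical deck name and key, used for illustration only.
print(note_guid("Bible::Genesis", "1:1"), deck_id("Bible::Genesis"))
```

Because both values depend only on their inputs, re-exporting a deck yields the same IDs as long as the deck name and the keys stay unchanged, which is why the deck name becomes a protected field.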
This directory contains the data and logic for generating the morphological dictionaries (to support inflections).
This directory contains the data and logic for creating and publishing our website.
Code is released under GPL-3.0. Lexicon data is released under CC BY-SA 4.0.
Ⲉ̀ϣⲱⲡ ⲁⲓϣⲁⲛⲉⲣⲡⲉⲱⲃϣ Ⲓⲗ̅ⲏ̅ⲙ̅, ⲉⲓⲉ̀ⲉⲣⲡⲱⲃϣ ⲛ̀ⲧⲁⲟⲩⲓⲛⲁⲙ: Ⲡⲁⲗⲁⲥ ⲉϥⲉ̀ϫⲱⲗϫ ⲉ̀ⲧⲁϣ̀ⲃⲱⲃⲓ ⲉ̀ϣⲱⲡ ⲁⲓϣ̀ⲧⲉⲙⲉⲣⲡⲉⲙⲉⲩⲓ.