Here you find a description of the transcriptions of the Dead Sea Scrolls (DSS), the Text-Fabric model in general, and the node types, features of the DSS corpus in particular.
The corpus consists of two files, one for the non-biblical scrolls and one for the
biblical scrolls.
In both files, the material is subdivided into `scroll`, `fragment`, `line`.
In the biblical file, references to `book`, `chapter` and `verse` are marked at the word level.
Some scrolls contain biblical as well as non-biblical materials. In the source data files those scrolls are split between the files. During conversion, we have reunited the scrolls. There are 14 lines that occur in both source files. Here we have given precedence to the biblical versions, because they are either identical, or contain a reconstruction (marked as reconstruction!) that is absent in the non-biblical file.
The feature `biblical` below contains all the information to see whether the material originates from the biblical or non-biblical source file or both.
Every line in both files has fields for
- transcription
- lexeme
- morphological tags
and some bits of extra information.
The Text-Fabric model views the text as a series of atomic units, called slots. In this corpus signs are the slots.
On top of that, more complex textual objects can be represented as nodes. In this corpus we have node types for: `sign`, `word`, `lex`, `cluster`, `line`, `fragment`, `scroll`.
The type of every node is given by the feature `otype`.
Every node is linked to a subset of slots by `oslots`.
Nodes can be annotated with features. Relations between nodes can be annotated with edge features. See the table below.
Text-Fabric supports up to three customisable section levels.
In this corpus we use: `scroll`, `fragment` and `line`.
We map the transcriptions and lexemes to Hebrew UNICODE. The transcriptions are consonant only, the lexemes are pointed. The vowels we encounter in those lexemes have been transcribed by one or more special characters, probably in order to fine-tune the position of those points with respect to their consonants. We reduce them to single Hebrew UNICODEs per vowel.
There are bracketing constructs in the transcription, such as `<< >>`, `« »`, `[ ]`.
It turns out that in the files as we see them, they are consistently written as if in the right-to-left writing direction. So they appear as `>> <<`, `» «`, `] [`.
When we reproduce the original transcription, we put them all back into the left-to-right orientation, because this is the intended direction.
The cause for encountering them in the opposite orientation might be that we have stripped all UNICODE orientation characters (202A-202E) in our sanitizing pre-processing step.
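The two steps described above, stripping the orientation characters and mirroring the brackets, can be sketched as follows. This is only an illustration, not the conversion code of this corpus: the function names are made up, and the sketch assumes that `< > [ ] { } ( ) ≤ ≥` occur only as brackets in the source transcription. The pair `« »` is deliberately left alone, because those characters also serve as uncertainty flags.

```python
import re

# Unicode directional formatting characters U+202A..U+202E (the range named in the text).
ORIENTATION_CHARS = re.compile("[\u202a-\u202e]")

# Swap table for characters that are assumed to be used purely as brackets.
# « and » are not included: they also occur as flags on single signs.
MIRROR = str.maketrans("<>[]{}()≤≥", "><][}{)(≥≤")


def strip_orientation(text: str) -> str:
    """Remove directional formatting characters from a transcription string."""
    return ORIENTATION_CHARS.sub("", text)


def mirror_brackets(text: str) -> str:
    """Replace every opening bracket by its closing counterpart and vice versa."""
    return text.translate(MIRROR)


print(mirror_brackets("]p[n»y"))     # [p]n»y
print(mirror_brackets(">>zwnh«<<"))  # <<zwnh«>>
print(mirror_brackets("twlo}}t{{"))  # twlo{{t}}
```

The character order inside the brackets is untouched; only the bracket glyphs themselves are mirrored.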
We also supply the ETCBC transcription for Hebrew material. For the full details see the extensive Hebrew transcription table.
(Keep this under your pillow)
Some features come in three variants: a main variant and two variants with the letter `e` or `o` after the feature name.
- main variant: the UNICODE value
- `e`: the ETCBC transliteration, or something that extends it
- `o`: the original transcription (as in the source files)
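The shorthand `name[eo]` that appears in the feature tables stands for these three variants. A minimal sketch of the expansion, where the helper name `variant_names` is hypothetical and only the naming rule is taken from the text:

```python
def variant_names(base: str) -> dict:
    """Expand a base feature name into its three variants.

    The keys describe the encoding, the values are the feature names.
    """
    return {
        "unicode": base,        # main variant
        "etcbc": base + "e",    # ETCBC transliteration
        "source": base + "o",   # original source transcription
    }


print(variant_names("glyph"))
# {'unicode': 'glyph', 'etcbc': 'glyphe', 'source': 'glypho'}
```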
When we say that a feature is absent for a node, we mean that the node has no value for the feature. For example, if the feature `biblical` is absent for node `n`, then `F.biblical.v(n)` results in the Python value `None`, not the string `'None'`.
In queries, you can test for absence by means of `#`: `line biblical#` gives all lines where the feature `biblical` is absent.
See also search templates under Value specifications.
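In Python code, absence is best tested with `is None`. The sketch below is written against a plain lookup function so that it runs without the corpus; the node numbers and values are invented for the demonstration. With a loaded corpus you would pass `F.biblical.v` as the lookup, as in the sentence above.

```python
def nodes_without_value(nodes, lookup):
    """Return the nodes for which the feature lookup yields None (feature absent)."""
    return [n for n in nodes if lookup(n) is None]


# Stand-in for a feature: a dict from (hypothetical) node numbers to values.
# dict.get returns None for missing keys, which mimics an absent feature.
biblical = {101: 1, 103: 2}

print(nodes_without_value([101, 102, 103], biblical.get))  # [102]
```

Testing with `== 'None'` or with plain truthiness would be wrong here: the former never matches, and the latter would also treat values such as `0` or the empty string as absent.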
Node type sign
Basic unit containing a single symbol, mostly a consonant, but it can also be punctuation, or a text-critical sign.
The type of sign is stored in the feature `type`.
type | source | ETCBC | UNICODE | description |
---|---|---|---|---|
cons | m M | M m | מ ם | normal consonantal letter |
vwl | I | I | ִ | vowel point |
sep |  | _ |  | space |
sep | - | & | ־ | maqaf |
sep | / | ' | ׳ | morpheme break |
punct | . | 00 | ׃ | sof pasuq |
punct | ± | 0000 | ׃׃ | paleo divider |
numr | A D | >' k' | א֜ ך֜ | a numeral |
missing | -- | 0 | ε | representation of a missing sign |
unc | ? | ? | ? | representation of an uncertain sign (degree 1) |
unc | \ | # | # | representation of an uncertain sign (degree 2) |
add | + | + | + | representation of an addition between numerals |
term | / | ╱ | ╱ | representation of an end of line |
feature | values | source | ETCBC | UNICODE | description |
---|---|---|---|---|---|
after |  |  |  |  | whether there is a space after the last sign of a word and before the next word |
alt | 1 | lwz/)h( | LWZ61(H) |  | indicates alternative material, marked by being within brackets ( ) |
cor | 1 | yqw>mw<N | JQW(< MW >)n |  | material is corrected by a modern editor, marked by being within single angle brackets < > |
cor | 2 | >>zwnh«<< | (<< ZWNH# >>) |  | material is corrected by an ancient editor, marked by being within double angle brackets << >> |
cor | 3 | ^dbr/y^ | (^ DBR ? J ^) |  | material is corrected by an ancient editor, supralinear, marked by being within carets ^ ^ |
glyph[eo] |  | m | M | מ | transliteration of an individual sign |
lang | a g |  |  |  | language, a is Aramaic, g is Greek, absent means Hebrew |
rec | 1 | ]p[n»y | [P]N#?Y |  | material is reconstructed by a modern editor, marked by being within square brackets [ ] |
rem | 1 | }m«x«r«yØM«{ | {M#Y#R#J?m#} |  | material is removed by a modern editor, marked by being within single braces { } |
rem | 2 | twlo}}t{{ | TWL<{{t}} |  | material is removed by an ancient editor, marked by being within double braces {{ }} |
type |  |  |  |  | type of sign, see table above |
unc | 1 | b«NØ | B#n? |  | indicates uncertainty of degree=1 by flag Ø |
unc | 2 | at« aj«y»/K | >T# >X#J#?) ? k |  | indicates uncertainty of degree=2 by flag « or brackets « »; in this example the « » are not brackets but individual tokens |
unc | 3 | ]p[n»y | [P]N#?Y |  | indicates uncertainty of degree=3 by flag » |
unc | 4 | a\|hrwN | >#?HRWn |  | indicates uncertainty of degree=4 by flag \| |
vac | 1 | ≥ ≤ | (- -) |  | indicates an empty, unwritten space by brackets ≤ ≥ |
The feature `biblical` is defined for scrolls, fragments, lines, clusters, and words.
value | node type | description |
---|---|---|
absent | scroll fragment line word cluster | material is completely non-biblical |
1 | scroll fragment line word cluster | material is completely biblical |
2 | scroll fragment | material is partly biblical, partly non-biblical |
2 | line | material is biblical, but the line also occurs in the non-biblical file, see remark below |
2 | cluster word | material occurs in a line with biblical=2 |
Remark
For lines with `biblical=2` we have included the material according to the biblical source file and we have discarded the material according to the non-biblical source file.
There are only 14 such lines: 6 of them are identical in both source files, and the rest have a reconstruction in the biblical source file (marked as such by `[ ]` brackets) and hardly any definite material in the non-biblical source file.
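The meaning of a `biblical` value thus depends on the node type. The following sketch spells out the table as a function. The function name is made up, and it is an assumption that the values arrive as the integers 1 and 2 (or their string forms), which is why the comparison goes through `str`.

```python
def describe_biblical(value, otype):
    """Translate a value of the feature biblical into the wording of the table."""
    if value is None:  # feature absent
        return "completely non-biblical"
    if str(value) == "1":
        return "completely biblical"
    if str(value) == "2":
        if otype in ("scroll", "fragment"):
            return "partly biblical, partly non-biblical"
        if otype == "line":
            return "biblical, but the line also occurs in the non-biblical file"
        if otype in ("cluster", "word"):
            return "occurs in a line with biblical=2"
    raise ValueError(f"unexpected combination: biblical={value!r} on {otype!r}")


print(describe_biblical(None, "word"))   # completely non-biblical
print(describe_biblical(2, "scroll"))    # partly biblical, partly non-biblical
print(describe_biblical(2, "line"))      # biblical, but the line also occurs in the non-biblical file
```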
Node type word
Sequence of signs corresponding to a single line in the source files.
Whether a word is adjacent to a next word can be gleaned from the numbering of the word in the source file.
If so, we leave the `after` feature without value.
There are several types of things that can occupy a word: a string of consonants, a numeral, punctuation, nothing, ...
The type of word is stored in the feature `type`.
type | description |
---|---|
empty | nothing |
glyph | a sequence of consonants or uncertain tokens |
numr | a numeral |
punc | punctuation |
other | nothing of the above |
If a transcription field is empty, but there is lexeme information, we insert a word node with type `glyph` and all of its textual features (`full[eo]`, `glyph[eo]`, `punc[eo]`) absent.
We add a slot of type `empty` to this word.
feature | source | ETCBC | UNICODE | description |
---|---|---|---|---|
after |  |  |  | whether there is a space after a word and before the next word |
full[eo] | mm/nw[ | MM61NW] | ממ׳נו] | full transcription of a word, including flags and clustering characters |
g_cons[eo] | mmnw | MMNW] | ממנו | consonantal letters of a word in ETCBC encoding excluding flags and brackets |
glex[eo] | mIN | MIn | מִן | lexeme of a word, without non-textual characters |
glyph[eo] | mmnw | MMNW] | ממנו | letters of a word excluding flags and brackets |
intl | 1 2 |  |  | if the physical word is on an interlinear line, this is 1; if there are two interlinear lines at that point, the words on the first line get 1 and the words on the second line get 2 |
lang | a g |  |  | language, a is Aramaic, g is Greek, absent means Hebrew |
lex_etcbc | mIN | MIn | מִן | consonantal lexeme of a word in ETCBC encoding |
lex[eo] | mIN | MIn | מִן | lexeme of a word |
punc[eo] | . | 00 | ׃ | punctuation at the end of a word |
morpho | vHi1cpX3mp |  |  | original morphological tag for this word; all information in this has been decomposed into the morphological features below |
script | paleohebrew greekcapital |  |  | indicates the script in which the word is written |
srcLn | 424242 |  |  | line number of this word in its source data file; use biblical to find out whether it is the biblical or the non-biblical file |
type |  |  |  | type of word, see table above |
Words coming from the biblical source file have references to a passage in the Bible.
feature | examples | description |
---|---|---|
biblical | 1 2 | 1 or 2 if this word is biblical material, otherwise absent, see section on biblical |
book | Gen 1Q1 | the book of the corresponding passage |
chapter | 3 f6 | the chapter of the corresponding passage |
verse | 1 2 | the verse of the corresponding passage |
halfverse | a b (the only values) | the half-verse of the corresponding passage |
N.B. Often chapters are not really chapter numbers of books, but fragments of scrolls. Likewise, verses are not always verse numbers in chapters; often they are line numbers in fragments.
A word has several morphology features. If a word is divided into morphemes, each of the morphemes can carry morphology. If we have gender masculine on the main word, and gender feminine on the suffix, and gender common on the second suffix, it will be represented by
gn=m
gn2=f
gn3=c
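To collect the values of a feature across the morphemes of one word, you can walk over the numbered feature names. The sketch below is an illustration under assumptions: the helper name is hypothetical, the upper bound of three parts is taken from the example above and is not a documented limit, and `lookup` stands for whatever function you use to fetch a feature value by name and node.

```python
def morpheme_values(word, feature, lookup, max_parts=3):
    """Collect feature, feature2, feature3, ... for a word, skipping absent ones.

    lookup(name, node) must return the value of feature `name` for `node`,
    or None when the feature is absent.
    """
    names = [feature] + [f"{feature}{i}" for i in range(2, max_parts + 1)]
    values = {}
    for name in names:
        value = lookup(name, word)
        if value is not None:
            values[name] = value
    return values


# Invented data for one word node, mirroring the example gn=m gn2=f gn3=c.
data = {("gn", 7): "m", ("gn2", 7): "f", ("gn3", 7): "c"}


def lookup(name, node):
    return data.get((name, node))


print(morpheme_values(7, "gn", lookup))  # {'gn': 'm', 'gn2': 'f', 'gn3': 'c'}
print(morpheme_values(7, "nu", lookup))  # {}
```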
Below is a summary table.
For all values, look at the morphology configuration file.
There you also see the connection with the original Abegg encoding of morphological tags.
We have switched to slightly more verbose feature values, and to feature names that are in line with those of the BHSA.
The original tag as a whole is also available in the feature `morpho`.
We only describe the plain features here, but keep in mind that they may be accompanied by their numbered counterparts (such as `gn2` and `gn3`).
All these features may contain the value `unknown`.
The `xxx_etcbc` features below are part of the extra features by Martijn Naaijer, which have been produced in a different way, not based on the Abegg sources.
They are the product of a model trained on BHSA data which has been subsequently applied to the DSS.
We mark them as derived from BHSA in the table below.
See ETCBC/DSS2ETCBC.
feature | examples | description |
---|---|---|
sp | subs verb numr ptcl | part-of-speech |
sp_etcbc | subs verb numr ptcl | idem, but derived from BHSA |
cl | card prp prep | class, i.e. a sub category within its part-of-speech |
ps | 1 2 3 | person |
ps_etcbc | p1 p2 p3 NA | idem, but derived from BHSA |
gn | m f c b | gender, also with common and both |
gn_etcbc | m f NA unknown | idem, but derived from BHSA |
nu | s p d | number, also with dual |
nu_etcbc | sg pl du NA | idem, but derived from BHSA |
st | a c d | state, also with determined |
cs | nom acc gen | case |
vs | qal passive piel hifil hithpolel | verbal stem, also with passive; some are Hebrew, some are Aramaic |
vs_etcbc | qal passive piel hif htpo | idem, but derived from BHSA |
vt | perf impf wayy impv infc infa ptca ptcp | verbal tense or aspect, also with wayyiqtol |
vt_etcbc | perf impf wayq impv infc infa ptca ptcp NA | idem, but derived from BHSA |
md | juss coho cons | mood |
If the parsing of the morphology tag has been inconclusive, there will be an error feature present on that word:
feature | examples | description |
---|---|---|
merr | vnPfpa @0 | the characters are those that are not recognized by the parser at that point |
Node type lex
The type of lexemes, as found in the lexeme field of the source data files.
feature | source | ETCBC | UNICODE | description |
---|---|---|---|---|
lex[eo] | mIN | MIn | מִן | lexeme of a word |
complete | 1 |  |  | 1 if the lexeme is complete, i.e. without uncertain characters |
N.B. Lexemes may contain characters with an uncertainty level, such as `#` and `?`.
See the table under sign above.
Lexemes are connected to their occurrence words by means of an edge feature:
feature | description |
---|---|
occ | edges from lexeme nodes to each of their word occurrences |
N.B. You can use this feature in both directions:
words = E.occ.f(lex)
lex = E.occ.t(word)[0]
Node type cluster
Grouped sequence of signs. There are different types of these bracketings. Clusters of the same type are not nested.
Clusters of different types need not be nested properly with respect to each other.
The type of a cluster is stored in the feature `type`.
This is a summary of the source encoding, see also the features at the sign level with the same names above.
type | value | examples | description |
---|---|---|---|
cor | 1 | < > | correction made by a modern editor |
cor | 2 | << >> | correction made by an ancient editor |
cor | 3 | ^ ^ | supralinear (ancient) correction |
rem | 1 | { } | removed by a modern editor |
rem | 2 | {{ }} | removed by an ancient editor |
rec | 1 | [ ] | reconstructed by a modern editor |
vac | 1 | ≤ ≥ | empty space |
alt | 1 | ( ) | alternative |
unc | 2 | « » | uncertain, with level of uncertainty 2 |
Each cluster induces a sign feature with the same name as the type of the cluster, which gets a numeric value, as indicated in the table.
Note the `vac` cluster: by definition, it contains no signs.
In order to anchor it into the text sequence, we have generated an empty slot in each vacat cluster.
We have done the same for other clusters that happened to be without other slots.
N.B. Such clusters do not have words inside them, only an empty sign.
These are cases of signs that do not belong to words!
Other features:
feature | examples | description |
---|---|---|
biblical | 1 2 | 1 or 2 if this cluster is biblical material, otherwise absent, see section on biblical |
Node type line
Section level 3.
Subdivision of a containing fragment.
Corresponds to a set of source data lines with the same value in the `line` column.
feature | values | description |
---|---|---|
biblical | 1 2 | 1 or 2 if this line is biblical material, otherwise absent, see section on biblical |
line | 3 | number of a line of a fragment (not necessarily integer valued) |
fragment | f3 | label of a fragment or column of a scroll |
scroll | 1Q1 | short name of a scroll |
There are lines in the source data with number `0` and with a subdivision by means of another number. We have converted this situation to a sequence of lines numbered as `0.1`, `0.2`, etc. Hence the number of a line is not always an integer.
So we store the number in a feature named label, instead of number.
Node type fragment
Section level 2.
Subdivision of a containing scroll.
Corresponds to a set of source data lines with the same value in the `fragment` column.
For non-biblical scrolls, the fragment is usually called column.
feature | values | description |
---|---|---|
biblical | 1 2 | 1 or 2 if this fragment contains biblical material, otherwise absent, see section on biblical |
fragment | f3 | label of a fragment or column of a scroll |
scroll | 1Q1 | short name of a scroll |
Node type scroll
Section level 1.
Corresponds to a set of source data lines with the same value in the `scroll` column.
feature | values | description |
---|---|---|
biblical | 1 2 | 1 or 2 if this scroll contains biblical material, otherwise absent, see section on biblical |
scroll | 1Q1 | short name of a scroll |
We discuss the node types we are going to construct. A node type corresponds to a textual object. Some node types will be marked as a section level.
This is the basic unit of writing.
The node type `sign` is our slot type in the Text-Fabric representation of this corpus.
Slots are the textual positions. They are occupied by individual glyphs (consonants, "digits", punctuation, miscellaneous glyphs).
All signs have the features `type` and `glyph[eo]`.
The `type` stores the kind of glyph, such as `cons`.
The `glyph`, `glyphe`, `glypho` features store the transcription of the glyph, without any flags and brackets. They store it in UNICODE, ETCBC transcription, and source transcription, respectively.
These features do not suffice to reconstruct the original source transcription, because the flags and brackets are not part of them.
Punctuation is either a mark or a white space, or a boundary. All punctuation characters have UNICODE representations. For some we have borrowed a Hebrew character that has a different meaning in the Masoretic text, but that does not occur otherwise in the Dead Sea Scrolls. The reason is that we can represent Hebrew consonants plus punctuation in a smooth, right-to-left way.
| source | ETCBC | UNICODE | description |
|---|---|---|---|
|  | _ |  | non-breaking intra-word space |
| - | & | ־ | maqaf |
| . | 00 | ׃ | sof pasuq |
| ± | 0000 | ׃׃ | double sof pasuq, questionably used as paleo divider |
| / | 61 | ׳ | geresh (punctuation, not accent), questionably used as morpheme break |
Numerals are ancient signs for denoting quantities.
source | ETCBC | UNICODE | value |
---|---|---|---|
A | >' | א֜ | 1 |
å | >52 | אׄ | 1 |
B | >53 | אׅ | 1 |
∫ | >35 | אֽ | 1 |
C | J' | י֜ | 10 |
D | k' | ך֜ | 20 |
F | Q' | ק֜ | 100 |
Several characters have to do with uncertainty and illegibility. They have improvised UNICODE representations. We propose a transcription that works inside the ETCBC transcription. Note that these have spaces around them.
| source | ETCBC | UNICODE | description |
|---|---|---|---|
| -- | 0 | ε | missing sign |
| ? | ? | ? | uncertain sign, degree 1 |
| \ | # | # | uncertain sign, degree 2 |
| + | + | + | addition symbol between numerals |
| / | ╱ | ╱ | end of line token |
Signs also have features corresponding to flags and brackets, that store under which flag or inside which brackets the sign occurs: `unc`, `cor`, `rem`, `vac`, `alt`, `rec`.
Signs may have flags.
In transcription they show up as a special trailing character.
Flags code for signs that are damaged, questionable (in their reading), in short: uncertain.
They apply to the preceding character.
We propose a transcription that works inside the ETCBC transcription. Note that these have no spaces around them.
We use this for the UNICODE representation as well.
| source | ETCBC / UNICODE | description |
|---|---|---|
| Ø | ? | uncertain, degree 1 |
| « | # | uncertain, degree 2 |
| » | #? | uncertain, degree 3 |
| \| | ## | uncertain, degree 4 |
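Because a flag applies to the character before it, a source string can be split into signs with their uncertainty degree in a single pass. The sketch below is illustrative only: the function name is made up, it assumes a string that contains flags but no brackets, and it uses `None` for an unflagged sign to match the notion of an absent feature.

```python
# Source flag characters and the uncertainty degree they encode (from the table).
FLAG_DEGREE = {"Ø": 1, "«": 2, "»": 3, "|": 4}


def signs_with_uncertainty(source: str):
    """Split a bracket-free source string into (glyph, degree) pairs.

    A flag attaches its degree to the preceding glyph.
    A flag without a preceding glyph is kept as a glyph of its own.
    """
    result = []
    for ch in source:
        if ch in FLAG_DEGREE and result:
            glyph, _ = result[-1]
            result[-1] = (glyph, FLAG_DEGREE[ch])
        else:
            result.append((ch, None))
    return result


print(signs_with_uncertainty("b«NØ"))  # [('b', 2), ('N', 1)]
```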
Note that there is also a bracket pair for uncertainty level 2.
We discuss the brackets under the node type cluster.
Each type of bracket corresponds to a feature of the same name at the sign level.
With some difficulty, you can reconstruct the source data from this, modulo the order of flags and brackets.
The recommended way to reconstruct the original transcriptions is to go to the word level.
One or more signs may be bracketed by certain delimiters.
Together they form a cluster.
Each pair of boundary signs marks a cluster of a certain type.
This type is stored in the feature `type`.
Clusters are not nested in clusters of the same type.
Clusters of one type in general do not respect the boundaries of clusters of other types.
Clusters may contain just one sign.
Cluster boundaries are usually within words.
In Text-Fabric, cluster nodes are linked to the signs they contain.
So, if `c` is a cluster, you can get its signs by
L.d(c, otype='sign')
Moreover, every type of cluster corresponds to a numerical feature on signs with the same name as that type.
We propose a transcription that works inside the ETCBC transcription. Note that these sometimes have a space at the inner side.
We use the original brackets for the UNICODE representation as well. But note that in the original the direction of the brackets is inverted, due to the conversion process that has stripped RTL and LTR triggering characters. In the UNICODE representation we restore the proper direction.
In the table below, the value is the value that the associated feature has for signs within that type of brackets under the given description.
source / UNICODE | ETCBC | value | type | description |
---|---|---|---|---|
^ ^ | (^ ^) | 3 | cor3 | correction by ancient editor, supralinear |
<< >> | (<< >>) | 2 | cor2 | correction by ancient editor |
< > | (< >) | 1 | cor | correction by modern editor |
{{ }} | ({{ }}) | 2 | rem2 | removed by ancient editor |
{ } | ({ }) | 1 | rem | removed by modern editor |
≤ ≥ | (- -) | 1 | vac | vacat: an empty, unwritten space in the manuscript |
( ) | ( ) | 1 | alt | alternative reading |
[ ] | [ ] | 1 | rec | modern reconstruction |
« » | (# #) | 2 | unc2 | uncertainty of degree 2 |
Words are the contents of the transcription fields of the source data lines.
Words will be separated by spaces or by nothing, in case the connection field in the same source data line has a `B`.
They have features `glyph[eo]`, `full[eo]`, `punc[eo]`, `after`.
- `full[eo]`: full value of the word: letters, symbols, punctuation, flags, brackets; `fullo` is the original content of the `trans` field in the source data file;
- `glyph[eo]`: letter value of the word: consonants, vowels, digits, numerals; no punctuation, flags, or brackets;
- `punc[eo]`: the punctuation of a word, if any;
- `after`: a space when the word should be followed by a space, i.e. when the `connection` field does not have a `B`.
The source transcription can be reconstructed by walking over all words and printing
fullo + after
for each word.
A non-text-critical transcription can be generated by printing
glypho + punco + after
for each word.
Or, in ETCBC transcription / UNICODE:
glyphe + punce + after
glyph + punc + after
These features will be used in the text-formats below.
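The recipes above amount to concatenating a few feature values per word, where an absent value contributes nothing. A minimal sketch, written against plain lookup functions so that it runs without the corpus: the helper name `join_words` is hypothetical and the word numbers and feature values are invented. With a loaded corpus you would presumably pass lookups such as `F.fullo.v` and `F.after.v` instead of the dictionaries.

```python
def join_words(words, *lookups):
    """Concatenate, word by word, the values returned by each lookup.

    Each lookup maps a word node to a string or None; None counts as ''.
    """
    return "".join((lookup(w) or "") for w in words for lookup in lookups)


# Invented stand-in data for two word nodes.
fullo = {1: "b«NØ", 2: "mmnw."}
glypho = {1: "bN", 2: "mmnw"}
punco = {2: "."}
after = {1: " "}  # word 2 has no `after` value

words = [1, 2]
print(join_words(words, fullo.get, after.get))               # b«NØ mmnw.
print(join_words(words, glypho.get, punco.get, after.get))   # bN mmnw.
```

The `or ""` guard matters: concatenating `None` directly would raise a `TypeError`, and absent `after` or `punco` values are the normal case.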
The following text formats are defined (you can also list them with `T.formats`).
format | kind | description |
---|---|---|
text-orig-full | plain | the source text, glyphs only, no flags / brackets, in UNICODE |
text-trans-full | plain | the source text, glyphs only, no flags / brackets, in ETCBC transcription |
text-source-full | plain | the source text, glyphs only, no flags / brackets, in source transcription |
text-orig-extra | plain | the source text with flags and brackets, in UNICODE |
text-trans-extra | plain | the source text with flags and brackets, in ETCBC transcription |
text-source-extra | plain | the source text with flags and brackets, in source transcription |
lex-orig-full | plain | lexeme of a word in UNICODE |
lex-trans-full | plain | lexeme of a word in ETCBC transcription |
lex-source-full | plain | lexeme of a word in source transcription |
layout-orig-full | layout | as text-orig-full but the flag and cluster information is visible in layout |
layout-trans-full | layout | as text-trans-full but the flag and cluster information is visible in layout |
layout-source-full | layout | as text-source-full but the flag and cluster information is visible in layout |
The formats with `text` result in strings that are plain text, without additional formatting.
The formats with `layout` result in pieces of HTML with CSS styles; the richness of layout enables us to encode more information than in the plain representation, e.g. blurry characters when signs are damaged or uncertain.
See also the showcases.