- manifest.json: type "manifest"
- For any given project name in /projects.json, the file /[PROJECT]/manifest.json
provides a list of the JSON files available for the project. For the project rimanum,
the URL http://oracc.org/rimanum/manifest.json yields the following:
{
"type": "manifest",
"project": "rimanum",
"files": [
"corpus.json",
"index-akk-x-oldbab.json",
"index-cat.json",
"index-lem.json",
...
],
"everything": "json.zip"
}
Each entry in the files array names a file that you can access at its project-relative
path, e.g., rimanum/corpus.json, or retrieve by URL, e.g., http://oracc.org/rimanum/corpus.json.
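The relationship between the manifest and the file URLs can be sketched in Python as follows. This is a minimal illustration, not an official client: the function names manifest_urls and fetch_json are hypothetical, and the URL pattern is assumed from the rimanum example above.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://oracc.org"  # host taken from the example URLs above


def manifest_urls(manifest: dict, base_url: str = BASE_URL) -> list[str]:
    """Return the full URL of every file listed in a manifest object."""
    project = manifest["project"]
    return [f"{base_url}/{project}/{name}" for name in manifest["files"]]


def fetch_json(url: str) -> dict:
    """Download and parse one JSON file (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# A cut-down copy of the rimanum manifest shown above.
sample = {
    "type": "manifest",
    "project": "rimanum",
    "files": ["corpus.json", "index-cat.json"],
    "everything": "json.zip",
}

for url in manifest_urls(sample):
    print(url)
```

With network access, fetch_json("http://oracc.org/rimanum/manifest.json") would return the full manifest, which can then be passed to manifest_urls in place of the sample.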
- metadata.json: type "metadata"
- This provides several objects: "config", the configuration info for
the project; "witnesses", present only if the project uses composite texts,
which records which manuscripts are witnesses of the composites in the
project; and "formats", a collection of lists indicating which texts have
transliterations, translations, and lemmatized data in the project.
{
"type": "metadata",
"config": {
"pathname": "rimanum",
"name": "The House of Prisoners",
"abbrev": "Rīm-Anum",
...
},
"formats": {
"atf": [ "P295625","P296047","P296277","P296278", ... ],
"lem": [ "P295625","P296047","P296277","P296278","P296414", ... ],
"tr-en": [ "P295625","P296047","P296277","P296278", ... ],
"xtf": [ "P295625","P296047","P296277", ... ]
}
}
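As a sketch of how the "formats" lists might be used, the following Python finds the texts that appear in several lists at once, for example texts that are both lemmatized and translated into English. The function name texts_with_formats is hypothetical and the sample data is a shortened, partly invented version of the object above.

```python
def texts_with_formats(metadata: dict, *formats: str) -> list[str]:
    """Return the sorted text IDs that appear in every requested format list."""
    id_sets = [set(metadata["formats"].get(fmt, [])) for fmt in formats]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Shortened sample; the membership of each list is illustrative only.
sample = {
    "type": "metadata",
    "formats": {
        "atf": ["P295625", "P296047"],
        "lem": ["P295625", "P296047", "P296414"],
        "tr-en": ["P295625"],
    },
}

print(texts_with_formats(sample, "lem", "tr-en"))
```

A format name that is missing from the project is treated as an empty list, so the intersection is simply empty rather than an error.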
- catalogue.json: type "catalogue"
- Provides the project's catalogue:
{
"type": "catalogue",
"project": "rimanum",
"members": {
"P295625": {
"author": "Simmons, Stephen D.",
"collection": "J. Pierpont Morgan Library Collection, Yale Babylonian Collection, New Haven, Connecticut, USA",
"date_of_origin": "Rim-Anum.01.10.01",
"dates_referenced": "Rim-Anum.01.10.01",
"designation": "YOS 14, 341",
"genre": "Administrative",
"height": "40",
"language": "Sumerian",
...
Although projects may have their own catalogue fields, all projects
provide at least one of id_text or id_composite (some use a mix);
designation; period; and provenience.
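The minimum set of fields described above can be checked, and the catalogue filtered, with a few lines of Python. This is only a sketch: the function names are hypothetical, the second sample entry is invented, and the "period" and "provenience" values are placeholders because the example above elides them.

```python
REQUIRED_FIELDS = ("designation", "period", "provenience")
ID_FIELDS = ("id_text", "id_composite")


def missing_fields(entry: dict) -> list[str]:
    """List the minimum catalogue fields that an entry lacks."""
    missing = [name for name in REQUIRED_FIELDS if name not in entry]
    if not any(name in entry for name in ID_FIELDS):
        missing.append("id_text or id_composite")
    return missing


def select(catalogue: dict, field: str, value: str) -> list[str]:
    """Return the sorted IDs of catalogue members whose field equals value."""
    return sorted(
        text_id
        for text_id, entry in catalogue["members"].items()
        if entry.get(field) == value
    )


sample = {
    "type": "catalogue",
    "project": "rimanum",
    "members": {
        "P295625": {
            "id_text": "P295625",
            "designation": "YOS 14, 341",
            "genre": "Administrative",
            "period": "(placeholder)",       # value not shown in the example above
            "provenience": "(placeholder)",  # value not shown in the example above
        },
        "X000001": {"designation": "(invented entry)"},
    },
}

print(select(sample, "genre", "Administrative"))
print(missing_fields(sample["members"]["X000001"]))
```

Because projects may define their own fields, select uses entry.get so that a field absent from some entries does not raise an error.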
- corpus.json: type "corpus"
- The JSON file corpus.json is another manifest file: it lists the individual text editions, which are located in the folder corpusjson/:
{
"type": "corpus",
"project": "rimanum",
"members": {
"P295625": "corpusjson/P295625.json",
"P296047": "corpusjson/P296047.json",
"P296277": "corpusjson/P296277.json",
"P296278": "corpusjson/P296278.json",
"P296414": "corpusjson/P296414.json",
"P297038": "corpusjson/P297038.json",
"P311964": "corpusjson/P311964.json",
"P368396": "corpusjson/P368396.json",
"P368398": "corpusjson/P368398.json",
"P372766": "corpusjson/P372766.json",
...
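One way to use this mapping is to resolve a text ID to its edition file. The Python below is a sketch that assumes the project's JSON files sit in a local directory (for instance after unpacking json.zip) with corpus.json at its top level; that layout, and the names edition_path and load_edition, are assumptions rather than something documented here.

```python
import json
from pathlib import Path


def edition_path(corpus: dict, text_id: str, project_dir: Path) -> Path:
    """Resolve a text ID to the location of its edition file."""
    return project_dir / corpus["members"][text_id]


def load_edition(corpus: dict, text_id: str, project_dir: Path) -> dict:
    """Read and parse one text edition from the assumed local directory."""
    with edition_path(corpus, text_id, project_dir).open(encoding="utf-8") as handle:
        return json.load(handle)


sample = {
    "type": "corpus",
    "project": "rimanum",
    "members": {
        "P295625": "corpusjson/P295625.json",
        "P296047": "corpusjson/P296047.json",
    },
}

print(edition_path(sample, "P295625", Path("rimanum")))
```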
- JSON for individual text editions: type "cdl"
- Oracc text editions consist of two structures. One is the XML
version of the user's transliteration. The other is entirely generated
by Oracc and provides access to the divisions, content, and
lemmatization of the text in a relatively simple nested tree format. In
the Oracc world this format is called "XCL", or XML Chunks and Lemmas.
The XCL tree has only three primary node types:
c, a chunk of text, which may be the whole text, a sentence or unit, a clause, a phrase, or possibly others;
d, a discontinuity, e.g., a line break, a surface transition, or damage to the content of the text;
l, a lemma, the lemmatization of the text.
The array of children of any chunk node is named "cdl" after these three node types.
Discontinuities and lemmata have a "text" property which can be concatenated to create text fragments.
{
{
"type": "cdl",
"project": "rimanum",
"source": "http://oracc.org/rimanum",
"license": "This data is released under the CC0 license",
"license-url": "https://creativecommons.org/publicdomain/zero/1.0/",
"more-info": "http://oracc.org/doc/opendata/",
"UTC-timestamp": "2017-06-21T22:02:40",
"textid": "P295625",
"cdl": [
{
"node": "c",
"type": "text",
"id": "P295625.U0",
"cdl": [
{
"node": "d",
"subtype": "tablet",
"type": "tablet",
"ref": "P295625.x374.1",
"label": "x374"
},
{
"node": "d",
"subtype": "obverse",
"type": "obverse",
"ref": "P295625.o.2",
"label": "o"
},
{
"node": "c",
"type": "discourse",
"subtype": "body",
"id": "P295625.U1",
"cdl": [
{
"node": "c",
"type": "sentence",
"id": "P295625.U2",
"label": "o 1 - r 2",
"cdl": [
{
"node": "d",
"type": "line-start",
"ref": "P295625.3",
"n": "1",
"label": "o 1"
},
{
"node": "l",
"frag": "5(BAN₂)",
"id": "P295625.l02b23",
"ref": "P295625.3.1",
"inst": "n",
"f": {
"lang": "akk-x-oldbab",
"form": "5(BAN₂)",
"delim": " ",
"gdl": [
{
"n": "n",
"form": "5(BAN₂)",
"id": "P295625.3.1.0",
"seq": [
{
"r": "5"
},
{
"s": "BAN₂"
}
]
}
],
"pos": "n"
}
},
...
{
"node": "l",
"frag": "ZI₃.DA",
"id": "P295625.l02b25",
"ref": "P295625.3.3",
"inst": "qēmu[flour]N",
"sig": "@rimanum%akk-x-oldbab:ZI₃.DA=qēmu[flour//flour]N'N$qēmu",
"f": {
"lang": "akk-x-oldbab",
"form": "ZI₃.DA",
"gdl": [
{
"gg": "logo",
"gdl_type": "logo",
"group": [
{
"s": "ZI₃",
"id": "P295625.3.3.0",
"role": "logo",
"logolang": "sux",
"delim": "."
},
{
"s": "DA",
"id": "P295625.3.3.1",
"role": "logo",
"logolang": "sux"
}
]
}
],
"cf": "qēmu",
"gw": "flour",
"sense": "flour",
"norm": "qēmu",
"pos": "N",
"epos": "N"
}
},
...
In the l nodes of the example above, the "sig" property
is the string version of the lemmatization associated with a word. To
reduce the size of the JSON files, the parsed versions of the sig
properties are collected together in a separate object in the corpus,
called "sigs":
"sigs": {
"@rimanum%akk-x-oldbab:30-be-el-i₃-li₂=Sîn-bēl-ilī[00//00]PN'PN$Sîn-bēl-ilī": {
"form": "30-be-el-i₃-li₂",
"cf": "Sîn-bēl-ilī",
"gw": "00",
"sense": "00",
"pos": "PN",
"epos": "PN",
"norm": "Sîn-bēl-ilī"
} ,
You can use the value of a node's "sig" property as a key to look up the parsed form in "sigs".
Each form is also presented in parsed form in the gdl object, making it easy to work with
the text as a series of graphemes as well as a series of lemmata. GDL is described in the GDL schema documentation.
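The nested cdl arrays lend themselves to a short recursive walk. The Python sketch below collects the l nodes of an edition in document order and, where a node has a "sig", looks it up in a sigs table. The function names are hypothetical, the sigs table is passed in as a separate argument because its exact location is not specified here, and the sample is a heavily trimmed version of the example above with a sigs entry assembled from the values shown in that node's "f" object.

```python
from typing import Iterator, Optional


def iter_nodes(cdl: list) -> Iterator[dict]:
    """Yield every node in a cdl array, depth first, in document order."""
    for node in cdl:
        yield node
        # Only chunk ("c") nodes carry a nested "cdl" array of children.
        yield from iter_nodes(node.get("cdl", []))


def lemmata(edition: dict, sigs: Optional[dict] = None) -> list[dict]:
    """Collect the l nodes, adding the citation form when the sig is known."""
    table = sigs or {}
    rows = []
    for node in iter_nodes(edition["cdl"]):
        if node.get("node") != "l":
            continue
        parsed = table.get(node.get("sig"), {})
        rows.append({"ref": node["ref"], "frag": node["frag"], "cf": parsed.get("cf")})
    return rows


QEMU_SIG = "@rimanum%akk-x-oldbab:ZI₃.DA=qēmu[flour//flour]N'N$qēmu"

sample = {
    "type": "cdl",
    "textid": "P295625",
    "cdl": [{
        "node": "c", "type": "text", "id": "P295625.U0",
        "cdl": [
            {"node": "d", "type": "line-start", "ref": "P295625.3", "label": "o 1"},
            {"node": "l", "frag": "5(BAN₂)", "ref": "P295625.3.1", "inst": "n"},
            {"node": "l", "frag": "ZI₃.DA", "ref": "P295625.3.3", "sig": QEMU_SIG},
        ],
    }],
}
sample_sigs = {QEMU_SIG: {"form": "ZI₃.DA", "cf": "qēmu", "gw": "flour", "pos": "N"}}

for row in lemmata(sample, sample_sigs):
    print(row["ref"], row["frag"], row["cf"])
```

Nodes without a "sig", such as the unlemmatized number in the sample, simply get None for the citation form.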
- glossary-XXX.json: type "glossary"
- The glossary files are named on a template in which the placeholder XXX
in the heading of this section stands for the language code. For the
language akk there is a glossary file glossary-akk.json, and so on;
you can see which ones are provided by checking the project's manifest.
Glossaries are a list of entries which give the distributional data
on all of the facets of Oracc lemmatization, gathered under a series of
headings for spellings (forms), normalizations (norms), and meanings (senses).
The same data is also given for the full signatures which reference the entry (sigs).
Several instance-related properties are common to many of these data:
- icount
- The instance count for the datum.
- ipct
- The percentage of instances of the datum that this count represents.
- xis
- A reference to the compilation of instances that make up the count for the datum.
The second element of a glossary object is the set of xis data for
the glossary, held in the "instances" object. You can use the reference
given in a xis property to look up the list of word IDs that make up the
instance set, then use those IDs to access the lemmatizations given in
the corpus and traverse the context of any instance.
{
"type": "glossary",
"project": "rimanum",
...
"lang": "akk-x-oldbab",
"entries": [
{
"headword": "DUMU.EDUBA[(military) scribe]N",
"id": "akk-x-oldbab.x000021",
"icount": "7",
"ipct": "100",
"xis": "akk.r00000",
"cf": "DUMU.EDUBA",
"gw": "(military) scribe",
"pos": "N",
"forms": [
{
"type": "form",
"id": "akk-x-oldbab.x000229",
"n": "DUMU.E₂.DUB.BA",
"icount": "4",
"ipct": "57",
"xis": "akk.r00001"
},
...
],
"norms": [
{
"id": "akk-x-oldbab.x000231",
"icount": "7",
"ipct": "100",
"xis": "akk.r00000",
"n": "DUMU.EDUBA",
"forms": [
{
"type": "normform",
"id": "akk-x-oldbab.x000232",
"ref": "akk-x-oldbab.x000229",
"icount": "4",
"ipct": "57",
"xis": "akk.r00001"
},
...
]
}
],
"senses": [
{
"type": "sense",
"id": "akk-x-oldbab.x000234",
"n": "DUMU.EDUBA[(military) scribe//(military) scribe]N'N",
"icount": "7",
"ipct": "100",
"xis": "akk.r00000",
"pos": "N",
"mng": "(military) scribe",
"forms": [
{
"type": "form",
"id": "akk-x-oldbab.x000235",
"n": "%akk-x-oldbab:DUMU.E₂.DUB.BA",
"icount": "4",
"ipct": "57",
"xis": "akk.r00001"
},
...
],
"norms": [
{
"id": "akk-x-oldbab.x000237",
"n": "DUMU.EDUBA",
"icount": "7",
"ipct": "100",
"xis": "akk.r00000"
}
],
"sigs": [
{
"type": "sig",
"id": "akk-x-oldbab.x000238",
"sig": "@rimanum%akk-x-oldbab:DUMU.E₂.DUB.BA.A=DUMU.EDUBA[(military) scribe//(military) scribe]N'N$DUMU.EDUBA",
"icount": "3",
"ipct": "43",
"xis": "akk.r00002"
},
...
]
}
]
},
...
},
"instances": {
"akk.r0019f": [
"rimanum:P405412.8.4"
],
"akk.r0019b": [
"rimanum:P405162.3.2",
"rimanum:P405163.3.2",
"rimanum:P405164.3.3",
"rimanum:P405165.4.3",
"rimanum:P405166.3.2",
"rimanum:P405167.3.3"
],
"akk.r001a1": [
"rimanum:P372792.5.4",
"rimanum:P405339.4.3",
"rimanum:P405373.4.6",
"rimanum:P405379.5.3"
],
...
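The link between a xis reference and its instance list can be sketched in Python as below. The function names are hypothetical, and the sample combines one form from the entry above with one instance list from the "instances" object above; the two are not claimed to belong together. Note that the counts are stored as strings, so the sketch converts them with int.

```python
def instances_for(glossary: dict, datum: dict) -> list[str]:
    """Return the instance list that a datum's xis reference points to."""
    return glossary["instances"].get(datum["xis"], [])


def split_instance(instance: str) -> tuple[str, str, str]:
    """Split 'project:TEXT.LINE.WORD' into (project, text ID, word ID)."""
    project, word_id = instance.split(":", 1)
    text_id = word_id.split(".", 1)[0]
    return project, text_id, word_id


def form_counts(entry: dict) -> dict[str, int]:
    """Map each spelling of an entry to its instance count as an integer."""
    return {form["n"]: int(form["icount"]) for form in entry.get("forms", [])}


sample = {
    "type": "glossary",
    "entries": [{
        "headword": "DUMU.EDUBA[(military) scribe]N",
        "icount": "7",
        "forms": [{"n": "DUMU.E₂.DUB.BA", "icount": "4", "xis": "akk.r00001"}],
    }],
    "instances": {"akk.r0019f": ["rimanum:P405412.8.4"]},
}

for instance in instances_for(sample, {"xis": "akk.r0019f"}):
    print(split_instance(instance))
print(form_counts(sample["entries"][0]))
```

The text ID returned by split_instance is the key to use in corpus.json, and the word ID matches the "ref" of an l node in that text's edition.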
- index-xxx.json: type "index"
- The index-xxx.json files are exports of a subset of the index
data created and used by the Oracc search engine. They give the keys the
indexer has generated from the input words and the locations in which
they occur in the corpus. These keys may have been normalized using a
variety of processes: accents are rendered as numeric indices; case may
be folded; and for English translation indexes a stemmer is used so that,
e.g., "received" and "receives" are gathered together under
"receive" in the index.
{
"type": "index",
"project":"rimanum",
"name": "cat",
"keys": [{
"key": "1",
"count": "16",
"instances": [
"rimanum:P405220","rimanum:P405220","rimanum:P405246","rimanum:P405246","rimanum:P405313","rimanum:P405313","
]},{
"key": "3",
"count": "16",
"instances": [
"rimanum:P405225","rimanum:P405225","rimanum:P405229","rimanum:P405229","rimanum:P405248","rimanum:P405248","
]},{
"key": "5",
"count": "11",
"instances": [
"rimanum:P405212","rimanum:P405212","rimanum:P405250","rimanum:P405250","rimanum:P405317","rimanum:P405317","
]},{
"key": "6",
"count": "11",
"instances": [
"rimanum:P405219","rimanum:P405219","rimanum:P405251","rimanum:P405251","rimanum:P405284","rimanum:P405318","
]},{
The text IDs are always qualified with a project name because any
project can use texts from other projects. For "txt", "lem" and "tra"
index types, the instances are given as word IDs, so they can be used to
locate the instance in the text edition. A simple way of displaying
instances is to use the URL http://oracc.org/PROJECT/INSTANCE_ID/html,
e.g., http://oracc.org/rimanum/P405219.4.1/html. If you omit the "/html",
the text is loaded into the Oracc pager instead of retrieving the simple HTML version.
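Looking up a key and turning its instances into display URLs might look like the Python sketch below. The function names lookup and instance_url are hypothetical; the URL pattern is the one given above, and the sample reuses a few IDs from the "cat" index example. Since that example lists some IDs twice, the sketch drops repeats while keeping the original order.

```python
def lookup(index: dict, key: str) -> list[str]:
    """Return the distinct instances recorded for a key, in original order."""
    for item in index["keys"]:
        if item["key"] == key:
            # dict.fromkeys removes repeated IDs but preserves first-seen order.
            return list(dict.fromkeys(item["instances"]))
    return []


def instance_url(instance: str, html: bool = True,
                 base_url: str = "http://oracc.org") -> str:
    """Build the display URL for a project-qualified text or word ID."""
    project, instance_id = instance.split(":", 1)
    url = f"{base_url}/{project}/{instance_id}"
    return url + "/html" if html else url


sample = {
    "type": "index",
    "project": "rimanum",
    "name": "cat",
    "keys": [{
        "key": "6",
        "count": "11",
        "instances": ["rimanum:P405219", "rimanum:P405219",
                      "rimanum:P405251", "rimanum:P405251"],
    }],
}

for instance in lookup(sample, "6"):
    print(instance_url(instance))
print(instance_url("rimanum:P405219.4.1", html=False))
```

Passing html=False gives the address that opens the text in the Oracc pager rather than the simple HTML version.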