GEDCOM

From TNG_Wiki

What is GEDCOM?

GEDCOM, which stands for GEnealogical Data COMmunications, is a text file format that is widely used to share genealogical data. Some Genealogy software packages actually store their data in GEDCOM format. But it is more commonly used as a way to transfer data between genealogical packages. GEDCOM files are text files that can be opened in any text editor or word processor.

The GEDCOM standard was developed and is owned by FamilySearch.org, a service of the Church of Jesus Christ of Latter-Day Saints (the Mormons). The version in widest use (5.5, with its 5.5.1 revision) dates to the 1990s. It is far from perfect, but it is very widely used, and it is essentially "all we have."

GEDCOM is

  • a data and file format,
  • a description of genealogical data elements and structures, and
  • a specification that maps those genealogical elements and structures to the file format.

As a consequence, GEDCOM also provides (or serves as) a genealogy data model.

GEDCOM File Format

The GEDCOM data/file format is strictly text-based:

  • One GEDCOM file contains all of the data records that represent what we generally think of as a "family tree".
  • Records are organized hierarchically. For instance, a "Person" record contains multiple "Event" records (Birth, Death, etc.), which contain "Date", "Place", and "Source" records, and so on.
  • GEDCOM records consist of
    • A line of text that consists of three space-separated elements:
      1. An integer "level number" that identifies the record's place in the record hierarchy.
      2. A short "Tag" (essentially a record type), and
      3. A possible tag value, whose nature depends on the tag,
    • and all subordinate records, defined on subsequent lines with larger "level numbers"
  • GEDCOM tags are generally no more than 4 characters long
    • Some tags essentially serve as attribute names (CITY, CTRY, PHON, AGE, NAME, SEX)
    • Some tags are analogous to relational record types (INDI - individual person, FAM - family, OBJE - media object, etc.)
  • Long data values and textual data values that may contain end-of-line characters are broken into records that extend the value that was started in their parent record or in the previous record.
  • The top level of the hierarchy is level zero.
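The line structure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from any genealogy package: the function name `parse_line` and the regular expression are assumptions made for this example, and the pattern allows for the @-wrapped record ID that appears between the level number and the tag on lines that define a record.

```python
import re

# level, optional @ID@ (only on lines that define a record), tag, optional value
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@\s]+@)\s+)?(\S+)(?: (.*))?$")

def parse_line(line):
    """Split one GEDCOM line into (level, record_id, tag, value)."""
    match = LINE_RE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, record_id, tag, value = match.groups()
    return int(level), record_id, tag, value or ""

print(parse_line("0 @I4526@ INDI"))
print(parse_line("1 NAME Ida Marie /HAZLET/"))
print(parse_line("2 SOUR @S77@"))
```

Note that in the last line the @S77@ is a tag value (a reference to another record), not the ID of the line itself, which is why the pattern only looks for an ID directly after the level number.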

Overall, a GEDCOM file consists of

  1. A zero-level HEAD record that describes the file itself; filename, date, copyright, gedcom version, language, etc.
  2. A zero-level SUBM (submitter) record that indicates who or what created the file, and that can describe the number of records or generations in the file.
  3. The data itself - in a series of zero-level data records (each of which contains the appropriate subordinate records)
  4. A zero-level TRLR record that simply marks the end of the file.
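As a rough illustration of this overall layout, the following Python sketch groups the lines of a file into zero-level records and lists their tags, so that you can check that a file starts with HEAD and ends with TRLR. The function names and the tiny sample file are made up for this example.

```python
def split_records(lines):
    """Group GEDCOM lines into records, each starting at a level-0 line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.split(" ", 1)[0] == "0":
            records.append([line])
        elif records:
            records[-1].append(line)
        else:
            raise ValueError("file does not start with a level-0 record")
    return records

def record_tag(record):
    """Return the tag of a record's level-0 line, skipping any @ID@."""
    parts = record[0].split()
    return parts[2] if parts[1].startswith("@") else parts[1]

# A made-up, minimal file used only to exercise the functions.
sample = """0 HEAD
1 GEDC
2 VERS 5.5
0 @SUB1@ SUBM
1 NAME Example Submitter
0 @I1@ INDI
1 NAME Ida Marie /HAZLET/
0 TRLR""".splitlines()

records = split_records(sample)
tags = [record_tag(record) for record in records]
print(tags)
```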

GEDCOM Person Record

Here's a GEDCOM excerpt describing part of my grandmother's genealogy record. I've indented the lines to illustrate the hierarchical structure, and added an annotation after a "<--" marker on each line as documentation. GEDCOM files never contain this indentation or these annotations.

0 @I4526@ INDI                          <-- Level 0 INDIvidual record - person #I4526
  1 NAME Ida Marie /HAZLET/             <-- NAME fact. Last names are typically marked with slashes
    2 SOUR @S41@                        <-- A citation: Source #S41 supports the Name fact (but there is no citation data)
  1 BIRT                                <-- Birth event
    2 SOUR @S77@                        <-- A citation: Source #S77 supports the Birth event
      3 PAGE 112                        <-- Info about her birth is on page 112 of the source
      3 DATA                            <-- Relevant quote from the source
        4 TEXT Birth date: abt 1899     <-- The beginning of the text
          5 CONT Birth place: Iowa      <-- The quotation continues
          5 CONT Residence date: 1915   <-- The quotation continues
          5 CONT Residence place: Eden  <-- The last line of the quotation
      3 OBJE @M3349@                    <-- A media object associated with the Source (and the birth, and the person)
  1 SEX F                               <-- Sex fact
  1 EDUC Harper College                 <-- EDUC event
    2 DATE FROM 1920 TO 1923            <-- The duration of the education event
    2 PLAC Harper, Harper, KS           <-- The place of the education event
  1 OBJE @M502@                         <-- Media object #M502 (probably a photo) is tied to this INDI record
  1 FAMS @F1513@                        <-- She is one of the spouses in Family #F1513
  1 FAMC @F2324@                        <-- She is a child of Family #F2324
0 @I43@ INDI                            <-- The INDI record ends when the next Level 0 record starts

The actual complete GEDCOM record for my grandmother in my last extract was 266 lines long, and included birth, death, and residence events, plus additional source and media object references. But really, that additional length doesn't add to the complexity of the GEDCOM file; just to its volume.
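Because real GEDCOM files are not indented, a program has to rebuild the hierarchy from the level numbers alone. Here is a minimal Python sketch of one way to do that with a stack; the function name `build_tree` and the dictionary layout are assumptions made for this example, not part of GEDCOM or of any particular application.

```python
def build_tree(lines):
    """Nest GEDCOM lines into a tree, using only their level numbers."""
    roots = []
    stack = []  # stack[i] holds the most recent node seen at level i
    for line in lines:
        level_text, _, rest = line.strip().partition(" ")
        level = int(level_text)
        if level > len(stack):
            raise ValueError(f"level jumps too far: {line!r}")
        node = {"text": rest, "children": []}
        del stack[level:]  # forget nodes at this level or deeper
        if stack:
            stack[-1]["children"].append(node)  # attach to the parent
        else:
            roots.append(node)  # level 0 starts a new record
        stack.append(node)
    return roots

sample = [
    "0 @I4526@ INDI",
    "1 NAME Ida Marie /HAZLET/",
    "2 SOUR @S41@",
    "1 BIRT",
    "2 SOUR @S77@",
    "3 PAGE 112",
    "1 SEX F",
]

tree = build_tree(sample)
person = tree[0]
print([child["text"] for child in person["children"]])
```

Each node's children are simply the following lines with a larger level number, which is all that the indentation in the excerpt above is meant to show.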

See also

You can also review any GEDCOM file (you'll find plenty on the Internet) by opening it with a text editor or word processor. GEDCOM files are certainly not primarily written for human eyes, but they are structured text files, and it's pretty easy to understand them in small pieces.

GEDCOM Media Record

A media item in the GEDCOM file is represented by a set of lines that might look similar to the examples just below. Here is a hypothetical media record from an Apple Macintosh environment. The record starts at level 0 with an OBJE (for media object) line, and contains 7 additional subordinate lines (all at level 1).

0 @M232@ OBJE 
1 FORM jpg
1 FILE ~/Documents/Documents/Genealogy/Roger/ReunionPictures/photos/people/RogerOval.JPG
1 TITL Roger Moffat
1 NOTE Taken at the time of Kurt and Ann Christensen's wedding - 2 March 1996.
1 _TYPE PHOTO
1 _PRIM Y
1 _SIZE 147.000000 193.000000

Note, in particular, the FILE line, which contains a fully specified filename, with its complete path on the Macintosh.

Here's the beginning of a comparable media record from a GEDCOM generated on a Windows PC. The only significant difference is that the PC GEDCOM (not surprisingly) uses Windows syntax to specify the filepath and filename.

0 @M232@ OBJE 
1 FORM jpg
1 FILE C:\Users\me\documents\gene\photos\people\RogerOval.JPG
...
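Since the FILE value carries whatever path syntax the originating computer used, a program that only needs the filename has to cope with both styles. The Python sketch below shows one possible approach using the standard pathlib module; the function name and the "backslash or drive letter means Windows" rule are assumptions made for this example, not a rule from the GEDCOM standard.

```python
from pathlib import PurePosixPath, PureWindowsPath

def media_filename(file_value):
    """Return the bare filename from a GEDCOM FILE value."""
    # Assumed heuristic: a backslash or a drive letter ("C:") means Windows syntax.
    looks_windows = "\\" in file_value or file_value[1:2] == ":"
    path_type = PureWindowsPath if looks_windows else PurePosixPath
    return path_type(file_value).name

mac_value = "~/Documents/Documents/Genealogy/Roger/ReunionPictures/photos/people/RogerOval.JPG"
win_value = r"C:\Users\me\documents\gene\photos\people\RogerOval.JPG"

print(media_filename(mac_value))
print(media_filename(win_value))
```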

GEDCOM Variability

GEDCOM files written by different genealogy software may have some confusing inconsistencies, especially when you consider that different people entering genealogical data can have different styles. For instance, in the person record shown above:

  1. Names can be broken down into parts - prefix, given name, surname, suffix, etc. - by some genealogy programs; others just lump everything into Full Name and enclose the surname within slashes (as in the Person record example above).
  2. There is a NICK tag for nickname, but many people just include nicknames within the name in quotes or parentheses.
  3. The SOUR record within the NAME record above should have citation records as well as text from the source that specifically supports this event. But many genealogy data entry programs don't force users to enter such information, many users don't bother, and some do not even provide a SOUR for the NAME.
  4. GEDCOM dates are supposed to be - and usually are - in dd MMM yyyy format, but there are numerous date formats that may be interpreted differently by different applications. For instance,
    • Some applications may accept the month names in different languages; others may not.
    • Some applications may accept dates in mm/dd/yyyy or dd/mm/yyyy format (but probably not both!).
    • There are provisions for incomplete dates, approximate dates, relative dates, and date ranges. (See below.) Some applications may handle approximate relative dates (e.g. BEF ABT 1831); others may not.
  5. GEDCOM place names are often expressed inconsistently (e.g. Houston, Harris, Texas, USA; Houston, Harris County, Texas; Houston, Texas; or Houston,,Texas) across GEDCOMs and even within a GEDCOM. And the town, county, state, country convention that works for the USA doesn't work for most other countries.
  6. Most genealogists express place names with full state names and the country. On the other hand, some databases are so USA-centric that they use state abbreviations and omit USA as a country name. Such abbreviations, however, lend themselves to confusion when accessed on the web by users in other countries.
  7. An educational facility (e.g. Harper College, from the example above) could be listed as a value on the EDUC line (as shown above), or it could be listed as a value for a subordinate AGNC (Agency) tag. It could also (but rarely would be) listed as part of the place name. Hospital names (associated with Birth or Death events) are generally treated the same way.
  8. Cemeteries, on the other hand, are typically included in the place name for a burial event, not as a value on the Burial event, nor as a value on a subordinate Agency tag.
  9. Different PC genealogy applications also assemble Media Object records differently than the record shown above. For example, not all of the sub-records (particularly _SIZE, _PRIM, and _TYPE) shown in the Media Record example above are present in all GEDCOM files.
  10. Source citations are very inconsistent.
  11. The GEDCOM standard allows vendors to add their own proprietary tags, which start with an underscore.

Because of these (and other) inconsistencies, many GEDCOM files need some kind of clean-up before they can be imported to another genealogy application.
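As a small example of the kind of handling that item 1 implies, here is a hedged Python sketch that pulls the surname out of a lumped-together NAME value by looking for the slashes. The function name and the returned dictionary keys are assumptions made for this example; real applications apply many more rules (prefixes, nicknames, multiple surnames).

```python
import re

# text before the slashes, the surname between them, and anything after them
NAME_RE = re.compile(r"^(?P<given>[^/]*)/(?P<surname>[^/]*)/(?P<suffix>.*)$")

def split_name(value):
    """Split a GEDCOM NAME value into given name, surname, and suffix."""
    match = NAME_RE.match(value)
    if match is None:
        # No slashes at all: the surname cannot be identified reliably.
        return {"given": value.strip(), "surname": "", "suffix": ""}
    return {key: text.strip() for key, text in match.groupdict().items()}

print(split_name("Ida Marie /HAZLET/"))
print(split_name("Ida Marie Hazlet"))
```

The second call shows why unmarked surnames are a clean-up problem: without slashes, the sketch has no safe way to tell which word is the surname.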

GEDCOM data model

GEDCOM isn't supposed to be a data model, and many people will tell you that it isn't one. But the GEDCOM standard defines data structures for essential genealogical concepts such as "events", "source citations", "people", "families", "sources", "repositories" (and more). For all practical purposes, those definitions ARE a data model.

GEDCOM has six fundamental genealogical object types, which are identified by their record tags, and by a letter that prefixes their record ID:

  • People, that is, "Individuals": INDI (I)
  • Families: FAM (F)
  • Data sources: SOUR (S)
  • Repositories: REPO (R); places where data sources are held, such as libraries.
  • Notes: NOTE (N)
  • Media objects: OBJE (M). Generally, OBJE records simply describe media files.

Records of these six types are called "level zero" records, because they all start at level zero in the record hierarchy. In level-zero records,

  • The tag comes after the tag value, not before, and
  • The tag value is a record id, which is numeric except for an initial letter that identifies the object type.
    (Actually, in GEDCOM files, record IDs are always wrapped with @-signs.)

INDI and FAM records are quite similar. They consist primarily of

  • Level 1 "Event" records, in which the tag (NAME, BIRT, DEAT, OCCU...) represents a particular "event" or "attribute" (Name, birth, death, occupation...).

Event records typically contain subordinate level 2 records that specify

    • A date,
    • A place, and
    • Source citations, which can occupy several levels.

Event records and their subordinate records can also contain NOTE and OBJE records, which are frequently just references to zero-level NOTE and OBJE records.

Note that

  • SOUR, REPO, and OBJE records are, for the most part, not really hierarchical. That is, they consist mostly of level 1 records that provide specific attributes such as source name, source title, source author, repository name, filename, file type, etc.
  • NOTE records just contain a value that is continued on subordinate level 1 CONC and CONT records.
    • CONC means "concatenate", and its value is just appended to the note that has been generated so far.
    • CONT means "continue on a new line", so some kind of new-line character is added to the note before the CONT tag's argument.
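The CONC and CONT rules just described can be sketched in Python as follows. The function name `join_text` and the sample note are made up for this example; the sketch assumes it is given the NOTE line followed by its continuation lines only.

```python
def join_text(lines):
    """Rebuild a long value from its first line plus CONC/CONT lines."""
    text = ""
    for index, line in enumerate(lines):
        parts = line.split(" ", 2)  # level, tag, value
        if parts[1].startswith("@"):
            # Level-0 form: level, @ID@, then "TAG value"
            parts = [parts[0]] + parts[2].split(" ", 1)
        tag = parts[1]
        value = parts[2] if len(parts) > 2 else ""
        if index == 0 or tag == "CONC":
            text += value          # CONC: append with nothing in between
        elif tag == "CONT":
            text += "\n" + value   # CONT: start a new line first
        else:
            raise ValueError(f"unexpected tag in a continued value: {tag}")
    return text

note_lines = [
    "0 @N12@ NOTE John was an aff",
    "1 CONC able gentleman",
    "1 CONT who remained a bachelor until age 50.",
]
print(join_text(note_lines))
```

Because CONC adds nothing between the pieces, a word that was split across two lines ("aff" + "able") is rejoined exactly as it was written.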

Primary GEDCOM Objects

The hierarchical structure of GEDCOM files, where virtually everything is a "record", stands in sharp contrast to the typical relational database with tables, records, and record attributes. The six zero-level Gedcom record types and the tags that identify them are

  1. INDI - Individuals (that is, People)
  2. FAM - Families
  3. OBJE - Multimedia Objects (photos, document images, recordings, external HTML files, videos, and the like)
  4. SOUR - Sources (publications, legal documents, emails, personal contacts that provide information records in the genealogy database)
  5. REPO - Repositories (places where sources are stored), and
  6. NOTE - Notes (multiple lines of text)

In the Gedcom file, the data line that defines zero-level records contains

  1. The level number (0)
  2. A unique record ID (wrapped in @ signs) consisting of a single letter followed by a number. The ID of the Individual object illustrated above (for my grandmother) is I4526, and the ID of the Media Object illustrated above is M232. Note that the letter that begins the record ID is the first letter of the tag, except that:
  • Media Object (OBJE) IDs start with "M",
  • In some GEDCOM files, the IDs of Individuals (INDI) start with "P" (for Person), and,
  • Likewise, in some GEDCOM files, Family IDs start with a letter other than "F" (usually "G", I think).

The meanings of the first 5 of these object types are pretty straightforward; it's easy to imagine what attributes they may have, e.g.

  1. Individuals - Name, Gender, Birth, Death, Parents (a Family), Marriage (a Family), Occupation, Residence(s), Citizenship, Race, etc.
  2. Families - Husband, Wife, Marriage Date and Place, Divorce, etc.
  3. Media Objects - Filename, File type, Image dimensions, Video length, etc.
  4. Sources - Name, Author, Publication Date, Medium, Publisher, Repository, etc.
  5. Repository - Name, Address, etc.
  6. Note - Some text.

Links between GEDCOM Objects

When used as the tag for a zero-level record, "INDI", "FAM", "OBJE", "SOUR", "REPO", and "NOTE" define primary GEDCOM objects. When used in subordinate records, those same tags associate the current record with a primary GEDCOM object.

For example, in the Individual record listed near the beginning of this article, you'll see

0 @I4526@ INDI
1 BIRT
2 DATE 23 SEP 1898
2 PLAC Leon, Decatur, IA
2 SOUR @S77@
3 PAGE 112
3 DATA
...
3 OBJE @M3349@
...
1 OBJE @M502@
1 FAMS @F1513@
1 FAMC @F2324@
  • The level 2 SOUR tag here indicates that Source S77 is associated with the level 1 Birth event AND the level 0 Person object.
  • The level 3 OBJE tag indicates that Multimedia object M3349 is associated with the Source object identified at level 2, the Birth event defined at level 1, and the Individual defined at level 0. (Multimedia object M3349 would likely be an image of a book page that describes the Individual's birth, or perhaps an image of a birth certificate, where the Source would be the state or county Birth records.)
  • The level 1 OBJE tag indicates that Multimedia object M502 is associated with the Individual defined at level 0. This object might be a photo of the person, or perhaps a story about the person in PDF, DOC, or HTML format.
  • The level 1 FAMS tag indicates that this person is a spouse in Family F1513.
  • The level 1 FAMC tag indicates that this person is a child in Family F2324.
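To make the idea of links concrete, the following Python sketch scans one record and lists every subordinate line whose value is a reference to another record. The function name `find_links` and the regular expression are assumptions made for this example; the sample lines are the ones listed above, without the elided parts.

```python
import re

# level, tag, and a value that consists of nothing but an @ID@ reference
POINTER_RE = re.compile(r"^(\d+) (\w+) @([^@]+)@$")

def find_links(record_lines):
    """List (level, tag, target_id) for each line that points at another record."""
    links = []
    for line in record_lines[1:]:  # skip the level-0 line that defines the record
        match = POINTER_RE.match(line.strip())
        if match:
            links.append((int(match.group(1)), match.group(2), match.group(3)))
    return links

record = [
    "0 @I4526@ INDI",
    "1 BIRT",
    "2 DATE 23 SEP 1898",
    "2 PLAC Leon, Decatur, IA",
    "2 SOUR @S77@",
    "3 PAGE 112",
    "3 OBJE @M3349@",
    "1 OBJE @M502@",
    "1 FAMS @F1513@",
    "1 FAMC @F2324@",
]

links = find_links(record)
for link in links:
    print(link)
```

The level number in each result tells you what the link is attached to: a level 1 link belongs to the Individual, while deeper links belong to the event or citation above them.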

Note that some of these links carry data with them. In particular, the relationship among a Source, an Event, and an Individual identified by the SOUR tag carries the data from the subordinate lines, including the Page attribute and the quote (transcription). That relationship is described as a "Source Citation".

Notes are a little funny, in that NOTE tags within another record do not necessarily refer to a Note ID, but can define the text of a note, whose relationship to the parent record is defined solely by the fact that the NOTE tag occurred within that record. Thus

  • 1 NOTE @N12@ in a Person record says that Note N12 describes the person,
  • 1 NOTE John was an affable gentleman who remained a bachelor until age 50... in a Person record defines a new note which, to GEDCOM, has no ID.

As with zero-level NOTE records, long notes occupy more than one line of the Gedcom file, where each line is a CONT or CONC record. (The difference between CONT and CONC is not relevant here.)

Events

In Person and Family records, what we might generally think of as "Attributes" are, for the most part, "Events", which

  • Are identified by the tag on the Level 1 line. That "identification" has two senses:
    • The fact that a level 1 line is an event is determined by the tag. There is a specific list of non-event tags that can occur at level 1 of a Person or Family record, and any other tag indicates that the line represents an event.
    • The event tag indicates what the event means, through tag descriptions that are part of the Gedcom standard (or a Genealogy application), and are not contained in a Gedcom file. For instance "BIRT" means Birth, "OCCU" means Occupation, "RESI" means Residence, etc.
  • Can occur multiple times in the Person or Family record.
    • It is easy to see that a person can have more than one Occupation or Residence event. Events such as Birth and Death can also occur multiple times, in which case they represent different opinions about the Birth and Death data.
  • Can contain subordinate Level 2 records that are links to other Gedcom objects (as with the examples just before the "Events" heading above).
    • The concept of a "Citation" is represented by a Level 2 reference to a Source record. That is, a Citation indicates that an Event's attributes are documented in a Source. Citations, in turn, have subordinate Level 3 lines with tags such as
      • "PAGE", which is not necessarily a page number within the source, but more generically, a description of where the information can be found in a source,
      • "TEXT", which is a representation of the relevant text within the source,
      • "OBJE", which would refer to a media item associated with the citation, such as an image of the relevant page within the source, and
      • "QUAY", which is a term for the quality of the citation - essentially a representation of how confident the genealogist is that the source is accurate.
  • Can also contain Level 2 records that represent single occurrences of event-specific attributes, which are identified by tags such as DATE, PLAC (place), AGE (person's age at the time of the event), or AGNC ("Agency" - the facility where the event occurred, or organization under whose auspices the event occurred)

Attributes

The attributes of zero-level records other than Families and Individuals, and in records that are subordinate to Families and Individuals, are defined by subordinate records, most of which are single-value, single-line records that define what we can call "Attributes" of the parent records.

Even at level 1 of Individual records, NAME and SEX are not considered to be Events, but are just attributes of the Individual record.

Dates

Dates, which are the argument of a DATE tag within an event, can be expressed as

  • Exact dates - 15 Jan 2015
  • Partial Dates - Jan 2015; 2015
  • Relative Dates - BEF Jan 2015; AFT 1852; BEF 17 JUL 1687
  • Approximate Dates - ABT Jan 1862; ABT 1910
  • Estimated Dates - EST Jan 1862
  • Date Ranges - (Representing the range in which a specific event might have occurred) BET 1862 AND 1865; BET 5 JAN 1761 AND 6 MAY 1761
  • Durations - (Representing the duration of a fact) FROM 1762 TO 1789

In theory, only Facts have Durations, and Facts can only have Durations, but this restriction is not routinely applied.
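The date forms listed above can be told apart by their leading keyword. The Python sketch below classifies a DATE value into the categories used in this section. It is a simplified illustration, not a full GEDCOM date parser: the function name, the category labels, and the assumption of three-letter English month abbreviations are all choices made for this example.

```python
import re

def classify_date(value):
    """Return the category of a GEDCOM DATE value, as named in this section."""
    text = value.strip().upper()
    if text.startswith("BET ") and " AND " in text:
        return "range"
    if text.startswith(("FROM ", "TO ")):
        return "duration"
    if text.startswith(("BEF ", "AFT ")):
        return "relative"
    if text.startswith("ABT "):
        return "approximate"
    if text.startswith("EST "):
        return "estimated"
    if re.fullmatch(r"\d{1,2} [A-Z]{3} \d{3,4}", text):
        return "exact"
    if re.fullmatch(r"(?:[A-Z]{3} )?\d{3,4}", text):
        return "partial"
    return "unknown"

for example in ["15 Jan 2015", "Jan 2015", "BEF 17 JUL 1687", "ABT 1910",
                "BET 1862 AND 1865", "FROM 1762 TO 1789", "12/25/1900"]:
    print(example, "->", classify_date(example))
```

Anything the sketch does not recognize, such as a mm/dd/yyyy date, is reported as "unknown" rather than guessed at.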

Secondary GEDCOM Objects

Several of the characteristics defined in GEDCOM files purely through the hierarchical nature of the GEDCOM file can be mapped to tables (record types, objects) in a relational representation of the GEDCOM data.

Events

Events are described in detail above. They all have a parent Family or Individual record.

Places

In GEDCOM, there is no central list of places; the place property (sub-record) of each fact and event must state a placename, not a place ID. Genealogy applications typically store the set of placenames used in facts and events in a central list. (TNG does so.) However, almost all genealogy applications are subject to placename duplication, in that one event could refer to "Dallas, Dallas County, Texas", another could refer to "Dallas, Dallas, Texas, USA", and a third could refer to "Dallas, Texas".
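To show why placename duplication is hard to avoid, here is a deliberately naive Python sketch that builds a comparison key from a placename by trimming spaces, dropping a trailing "USA", and dropping the word "County". These rules and the function name are assumptions made for this example only; they are USA-centric and are not how TNG or any other application matches places.

```python
def place_key(placename):
    """Build a rough comparison key from a comma-separated placename."""
    parts = [part.strip() for part in placename.split(",")]
    if parts and parts[-1].upper() == "USA":
        parts.pop()  # assumed rule: ignore a trailing country name of USA
    suffix = " County"
    parts = [part[:-len(suffix)] if part.endswith(suffix) else part for part in parts]
    return tuple(part.lower() for part in parts)

for name in ["Dallas, Dallas County, Texas",
             "Dallas, Dallas, Texas, USA",
             "Dallas, Texas"]:
    print(name, "->", place_key(name))
```

The first two names end up with the same key, but "Dallas, Texas" does not, because nothing in the text says which level (the county) was left out. Resolving that kind of duplicate generally takes a human decision or a gazetteer.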

GEDCOM placenames can be more specific than a town. They can incorporate neighborhood names, facility names, cemetery names, specific addresses, and so on.

Cemetery names are typically part of the place name associated with the burial event; e.g. "Fort Hill Cemetery, Cleveland, Bradley County, Tennessee". But other facilities such as hospitals and schools are typically either listed in an event note, or identified as the "Agency" associated with an event.

Source Citations

Sources of information (that is, the GEDCOM Source objects) are associated with Individual and Family events through a "Source Citation". The "SOUR" tag, when used at level 0 in a GEDCOM file, defines the Source, but a "SOUR" tag that is subordinate to an event is a "Source Citation" that is supposed to define

  • The page or section within the source
  • The date when the source was used to supply information to the database.
  • A transcription of the information in the source that supports the fact or event.

But as it turns out, Source Citations are very inconsistently and often incorrectly used.

Media Links

Media item references within a record (at any level) are media links, which tie a media item to the reference's immediate parent.

Processing by TNG

Here are some specific considerations that apply to TNG's handling of GEDCOM files.

Proprietary Tags

Any proprietary event tag (e.g. _MILT for military events) can be treated as an event in TNG.

Other proprietary tags that are typically treated as "Pure Attributes" include:

  • _MREL and _FREL, which characterize the relationship (Natural, Adoptive, Foster, etc.) between a child and a parent.
  • _PRIM means "Primary Multimedia Object", and is expected to be a level 2 tag that is subordinate to a level 1 Multimedia Object reference in an Individual record. It is used to designate one Multimedia Object as the primary photo for an Individual.
  • _PHOTO is an alternate way to identify the primary photo for an Individual, and is not understood by TNG. _PHOTO tags need to be converted to _PRIM tags.
  • _LIVING designates an individual as living.
  • _PRIVATE designates an individual or family as private.

See Desktop gotchas to read about GEDCOM quirks in some PC software packages, and the need to convert some GEDCOM files before they can be imported into TNG.

Custom Event Types

By default, when importing a GEDCOM file, TNG processes only the GEDCOM tags that are marked as Accept in the Custom Event Types table. On the Administration >> Custom Event Types screen, you can list the "custom" event types that you do (or don't) want TNG to import.

Or, on the Administration >> Import/Export >> Import screen, two checkboxes allow you to

  • Accept data for all new Custom Event Types (i.e. those that are not already listed at Administration >> Custom Event Types), and
  • Add tags to the Custom Event Types table by reading the GEDCOM files but not importing any data.
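If you want to see which proprietary tags a GEDCOM file contains before you import it, a short stand-alone script can survey them. The Python sketch below is only a rough analogue of reading a file for its tags without importing any data; it is not TNG code, and the function name is an assumption made for this example. The sample lines come from the Media Record example earlier in this article.

```python
from collections import Counter

def count_proprietary_tags(lines):
    """Count the tags that start with an underscore (vendor-defined tags)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        # On lines that define a record, the tag follows the @ID@.
        has_id = parts[1].startswith("@") and len(parts) > 2
        tag = parts[2] if has_id else parts[1]
        if tag.startswith("_"):
            counts[tag] += 1
    return counts

sample = [
    "0 @M232@ OBJE",
    "1 FORM jpg",
    "1 TITL Roger Moffat",
    "1 _TYPE PHOTO",
    "1 _PRIM Y",
    "1 _SIZE 147.000000 193.000000",
]

tag_counts = count_proprietary_tags(sample)
print(sorted(tag_counts.items()))
```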

TNG-Generated GEDCOM

TNG supports several media collections that are typically not supported in other genealogy software programs, which often only support PHOTO as a valid media _TYPE. For example, if you created a Certificate User Media Collection, TNG will generate the following in a GEDCOM for a Certificate that is linked to the Birth event.

1 BIRT
2 DATE 18 Sep 1917
2 PLAC Rumford, Oxford County, Maine, USA
2 OBJE
3 FORM JPG
3 FILE F:\wamp\www\tng\certificates\Nadeau_Theresa_Birth.jpg
3 TITL Theresa Nadeau's original Birth Certificate 
3 _TYPE CERTIFICATES
3 NOTE Theresa Nadeau's original Birth Certificate from Rumford, Maine

While the above will import in RootsMagic, for example, it will not generate the thumbnails for the media.

TNG V9 added the capability to change the Export As setting that generates the _TYPE for the media records. In order to import the above media into RootsMagic and generate the thumbnails, the _TYPE must be PHOTO, so you need to edit the User Media Collection and change the Export As to PHOTO. See Export TNG Media As.

Since you cannot edit the TNG Media Collections (Photos, Documents, Headstones, and Histories), you will need to manually change the Export As in mediatypes.php or, better, install the mod in Export Media For Desktop that pertains to the desktop genealogy software into which you want to import the TNG GEDCOM file.
