GEDCOM

From TNG Wiki

What is GEDCOM?

GEDCOM, which stands for GEnealogical Data COMmunications, is a text file format that is widely used to share genealogical data. Some Genealogy software packages actually store their data in GEDCOM format. But it is more commonly used as a way to transfer data between genealogical packages. GEDCOM files are text files that can be opened in any text editor or word processor.

The GEDCOM standard was developed and is owned by FamilySearch.org, a service of the Church of Jesus Christ of Latter-Day Saints (the Mormons). Its latest version dates to the 1990s and is far from perfect, but it is very widely used, and it is essentially "all we have."

GEDCOM stores data in a hierarchical format; for instance, a "Person" record contains multiple "Event" records (Birth, Death, etc.), which contain "Date", "Place", and "Source" records, and so on. Each line in a GEDCOM file starts with an integer number (typically less than 10) that represents that line's level in the hierarchy. Each line also has a record type keyword (INDI for Individual, EVEN for Event, DATE for date, etc.) known as a "Tag". A GEDCOM "record" consists of a line plus any subordinate lines - i.e. subsequent, contiguous lines that start with larger level numbers.
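To make the line layout concrete, here is a minimal Python sketch that splits one GEDCOM line into its level, optional record ID, tag, and value. The names `GedcomLine` and `parse_line` are made up for this illustration, and the regular expression is a simplification that assumes well-formed lines; it is not how any particular genealogy program reads GEDCOM.

```python
import re
from typing import NamedTuple, Optional

# Level number, optional @ID@, tag, and optional value (simplified pattern).
LINE_RE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\S+)(?: (.*))?$")


class GedcomLine(NamedTuple):
    level: int
    xref: Optional[str]  # e.g. "@I4526@" on the first line of a zero-level record
    tag: str
    value: str


def parse_line(raw: str) -> GedcomLine:
    """Split one GEDCOM line into level, optional ID, tag, and value."""
    match = LINE_RE.match(raw.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {raw!r}")
    level, xref, tag, value = match.groups()
    return GedcomLine(int(level), xref, tag, value or "")


print(parse_line("0 @I4526@ INDI"))
print(parse_line("1 NAME Ida Marie /HAZLET/"))
```

Note that a pointer such as `2 SOUR @S41@` comes back with the tag `SOUR` and the value `@S41@`, because the ID follows the tag rather than preceding it.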

GEDCOM Person Record

Here's a GEDCOM excerpt describing part of my grandmother's genealogy record. I've indented the lines to illustrate the hierarchical structure, and added colored, italicized text as documentation.

0 @I4526@ INDI Level 0 INDIvidual record - person #I4526
  1 NAME Ida Marie /HAZLET/ NAME fact. Last names are typically marked with slashes
    2 SOUR @S41@ Source #S41 supports the Name fact
  1 BIRT Birth event
    2 SOUR @S77@ Source #S77 supports the Birth event
      3 PAGE 112 Info about her birth is on page 112 of the source
      3 DATA Relevant quote from the source
        4 TEXT Birth date: abt 1899 The beginning of the text
          5 CONT Birth place: Iowa The quotation continues
          5 CONT Residence date: 1915 The quotation continues
          5 CONT Residence place: Eden The last line of the quotation
      3 OBJE @M3349@ A media object associated with the Source (and the birth, and the person)
  1 SEX F Sex fact
  1 EDUC Harper College EDUC event
    2 DATE FROM 1920 TO 1923 The duration of the education event
    2 PLAC Harper, Harper, KS The place of the education event
  1 OBJE @M502@ Media object #M502 (probably a photo) is tied to this INDI record
  1 FAMS @F1513@ She is one of the spouses in Family #F1513
  1 FAMC @F2324@ She is a child of Family #F2324
0 @I43@ INDI The INDI record ends when the next Level 0 record starts

The actual complete GEDCOM record for my grandmother in my last extract was 266 lines long, and included birth, death, and residence events, plus additional source and media object references. But really, that additional length doesn't add to the complexity of the GEDCOM file; just to its volume.
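The rule that a record is a line plus all of its subordinate lines can be sketched in a few lines of Python. This is only an illustration under simple assumptions (each line starts with a valid level number, and levels never skip); the function name `build_records` and the dictionary layout are my own choices, not part of GEDCOM.

```python
def build_records(lines):
    """Group GEDCOM lines into nested dicts using their level numbers."""
    roots = []
    stack = []  # stack[i] holds the most recent open record at level i
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        level_str, _, rest = text.partition(" ")
        level = int(level_str)
        node = {"level": level, "text": rest, "children": []}
        del stack[level:]  # a new line closes every record at its level or deeper
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


records = build_records([
    "0 @I4526@ INDI",
    "1 NAME Ida Marie /HAZLET/",
    "2 SOUR @S41@",
    "1 SEX F",
    "0 @I43@ INDI",
])
print(len(records))                 # two zero-level records
print(len(records[0]["children"]))  # NAME and SEX belong to the first one
```

Because the grouping depends only on level numbers, a 266-line record is handled exactly like a 5-line one, which is the point made above about volume versus complexity.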

See also

You can also review any GEDCOM file (you'll find plenty on the Internet) by opening it with a text editor or word processor. GEDCOM files are certainly not primarily written for human eyes, but they are structured text files, and it's pretty easy to understand them in small pieces.

GEDCOM Media Record

A media item in the GEDCOM file is represented by a set of lines that might look similar to the examples just below. Here is a hypothetical media record from an Apple Macintosh environment. The record starts at level 0 with an OBJE (for media object) line, and contains 7 additional subordinate lines (all at level 1).

0 @M232@ OBJE 
1 FORM jpg
1 FILE ~/Documents/Documents/Genealogy/Roger/ReunionPictures/photos/people/RogerOval.JPG
1 TITL Roger Moffat
1 NOTE Taken at the time of Kurt and Ann Christensen's wedding - 2 March 1996.
1 _TYPE PHOTO
1 _PRIM Y
1 _SIZE 147.000000 193.000000

Note, in particular, the FILE line, which contains a fully specified filename, with its complete path on the Macintosh.

Here's the beginning of a comparable media record from a GEDCOM generated on a Windows PC. The only significant difference is that the PC GEDCOM (not surprisingly) uses Windows syntax to specify the filepath and filename.

0 @M232@ OBJE 
1 FORM jpg
1 FILE C:\Users\me\documents\gene\photos\people\RogerOval.JPG
...
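Since the FILE value may use either Macintosh/Unix or Windows path syntax, a program that only needs the file name has to cope with both. Here is a minimal Python sketch using the standard `pathlib` module; the function name `media_basename` and the rule for guessing the path style (a backslash or a drive letter means Windows) are assumptions made for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath


def media_basename(file_value: str) -> str:
    """Return the file name portion of a GEDCOM FILE value."""
    # Assumption: a backslash or a drive letter ("C:") marks a Windows path.
    if "\\" in file_value or file_value[1:2] == ":":
        return PureWindowsPath(file_value).name
    return PurePosixPath(file_value).name


mac = "~/Documents/Documents/Genealogy/Roger/ReunionPictures/photos/people/RogerOval.JPG"
win = r"C:\Users\me\documents\gene\photos\people\RogerOval.JPG"
print(media_basename(mac))  # RogerOval.JPG
print(media_basename(win))  # RogerOval.JPG
```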

GEDCOM Variability

GEDCOM files written by different genealogy software may have some confusing inconsistencies, especially when you consider that different people entering genealogical data can have different styles. For instance, in the person record shown above:

  1. Names can be broken down into parts - prefix, given name, surname, suffix, etc. - by some genealogy programs; others just lump everything into Full Name and enclose the surname within slashes (as in the Person record example above).
  2. There is a NICK tag for nickname, but many people just include nicknames within the name in quotes or parentheses.
  3. The SOUR record within the NAME record above should have citation records as well as text from the source that specifically supports this fact. But many genealogy data entry programs don't force users to enter such information, many users don't bother, and some do not even provide a SOUR for the NAME.
  4. GEDCOM dates are supposed to be - and usually are - in dd MMM yyyy format, but there are numerous date formats that may be interpreted differently by different applications. For instance,
    • Some applications may accept the month names in different languages; others may not.
    • Some applications may accept dates in mm/dd/yyyy or dd/mm/yyyy format (but probably not both!).
    • There are provisions for incomplete dates, approximate dates, relative dates, and date ranges. (See below.) Some applications may handle approximate relative dates (e.g. BEF ABT 1831); others may not.
  5. GEDCOM place names are often expressed inconsistently (e.g. Houston, Harris, Texas, USA; Houston, Harris County, Texas; Houston, Texas; or Houston,,Texas) across GEDCOMs and even within a GEDCOM. And the town, county, state, country convention that works for the USA doesn't work for most other countries.
  6. Most genealogists express place names with full state names and the country. On the other hand, some databases are so USA-centric that they use state abbreviations and omit USA as a country name. Such abbreviations, however, lend themselves to confusion when accessed on the web by users in other countries.
  7. An educational facility (e.g. Harper College, from the example above) could be listed as a value on the EDUC line (as shown above), or it could be listed as a value for a subordinate AGNC (Agency) tag. It could also (but rarely would be) listed as part of the place name. Hospital names (associated with Birth or Death events) are generally treated the same way.
  8. Cemeteries, on the other hand, are typically included in the place name for a burial event, not as a value on the Burial event, nor as a value on a subordinate Agency tag.
  9. Different PC genealogy applications also assemble Media Object records differently than the record shown above. For example, not all of the sub-records (particularly _SIZE, _PRIM, and _TYPE) shown in the Media Record example above are present in all GEDCOM files.
  10. Source citations are very inconsistent.
  11. The GEDCOM standard allows vendors to add their own proprietary tags, which start with an underscore.

Because of these (and other) inconsistencies, many GEDCOM files need some kind of clean-up before they can be imported to another genealogy application.
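As one small example of this kind of clean-up, here is a Python sketch that pulls the surname out of a "lumped" NAME value by looking for the slashes. The function name `split_name` is hypothetical, and the sketch assumes at most one pair of slashes; names without slashes are returned whole as the given name.

```python
import re


def split_name(name_value: str):
    """Split a GEDCOM NAME value into (given, surname, suffix)."""
    match = re.match(r"^(.*?)/(.*?)/(.*)$", name_value)
    if match is None:
        # No slashes: we cannot tell which word is the surname.
        return name_value.strip(), "", ""
    given, surname, suffix = (part.strip() for part in match.groups())
    return given, surname, suffix


print(split_name("Ida Marie /HAZLET/"))  # ('Ida Marie', 'HAZLET', '')
print(split_name("Ida Marie Hazlet"))    # ('Ida Marie Hazlet', '', '')
```

Nicknames embedded in quotes or parentheses would need their own handling, and I have not attempted that here.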

Primary GEDCOM Objects

The hierarchical structure of GEDCOM files, where virtually everything is a "record", stands in sharp contrast to the typical relational database with tables, records, and record attributes. It is helpful to think in terms of GEDCOM "zero-level" records such as the person and media records shown above as GEDCOM objects with attributes. The six zero-level GEDCOM object types and their tags are

  1. INDI - Individuals (that is, People)
  2. FAM - Families
  3. OBJE - Multimedia Objects (photos, document images, recordings, external HTML files, videos, and the like)
  4. SOUR - Sources (publications, legal documents, emails, personal contacts that provide information records in the genealogy database)
  5. REPO - Repositories (places where sources are stored), and
  6. NOTE - Notes (multiple lines of text)

In the GEDCOM file, each object has a unique identifier that consists of a single letter (the first letter of the object type), followed by a number. The ID of the Individual object illustrated above (for my grandmother) is I4526, and the ID of the Media Object illustrated above is M232.
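Here is a short Python sketch that takes such an identifier apart. It assumes the letter-plus-number convention described above; the function name `split_xref` is made up for this example, and identifiers that follow some other scheme are rejected rather than guessed at.

```python
import re

# Optional surrounding @ signs, a letter prefix, then a number.
XREF_RE = re.compile(r"^@?([A-Za-z]+)(\d+)@?$")


def split_xref(xref: str):
    """Split an ID such as '@I4526@' into its prefix and number."""
    match = XREF_RE.match(xref.strip())
    if match is None:
        raise ValueError(f"unexpected identifier format: {xref!r}")
    return match.group(1), int(match.group(2))


print(split_xref("@I4526@"))  # ('I', 4526)
print(split_xref("M232"))     # ('M', 232)
```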

The first 5 of these object types are pretty natural; it's easy to imagine what attributes they may have, e.g.

  1. Individuals - Name, Gender, Birth, Death, Parents (a Family), Marriage (a Family), Occupation, Residence(s), Citizenship, Race, etc.
  2. Families - Husband, Wife, Marriage Date and Place, Divorce, etc.
  3. Media Objects - Filename, File type, Image dimensions, Video length, etc.
  4. Sources - Name, Author, Publication Date, Medium, Publisher, Repository, etc.
  5. Repository - Name, Address, etc.

Notes are a little funny, in that notes can occur in many places in the GEDCOM structure, and most of those notes are just attributes of an object (such as a note that says "The family lived in Queens at the time" attached to a Birth event where the place is "Brooklyn, New York"). But some notes (particularly notes about a person rather than about a person's event) are stored as separate zero-level objects, and can be referenced from other objects.

Events and Facts

There are three kinds of object attributes:

  1. Events - These are, by definition, associated with a specific date (which might be unknown or missing), and can occur multiple times in an object. An obvious example is Residence events. A person who moves to another place can have a new Residence event at that new place. And Census records supply information about a person's residence at regular intervals. Birth and Death can also occur multiple times for a person, in the sense that more than one Birth date and place and more than one Death date and place can be recorded for a person. GEDCOM provides no distinction at all between events that naturally occur more than once (like residence), and events that truly occur only once, but can have alternate values.
  2. Facts - These are, by definition, either not associated with a date, or associated with a date range. Facts can also occur more than once (if only in the sense that there can be more than one alternate value). Name, Sex, Race, and SSN are examples of GEDCOM facts. (Yes, there is a standard GEDCOM tag for Social Security Number, since Social Security Numbers become public information when a person dies.)
  3. Pure Attributes - Pieces of information that cannot be repeated, such as the filename and FORM attributes of the Media object shown above.

Note that Events, Facts, and Pure Attributes do not have unique record IDs. They don't have to be linked to their parent objects; they are associated with their parent objects just because of their subordinate physical relationship.

Events and Facts are so similar to each other that they are often treated exactly the same way by Genealogy applications, and are, for the most part, only associated with Individual and Family objects. (TNG treats the e-mail attribute of a Repository object as an event, though it's not clear to me that that's strictly correct from a GEDCOM perspective.)

"Pure Attributes", on the other hand, are more typically associated with objects other than Families and Individuals. Actually, the only pure attributes for Families or Individuals that I can think of are non-standard GEDCOM tags such as "_LIVING" (is the person alive?) and "_PRIVATE" (keep this person's data private).

Events and Facts can have the following subordinate attributes:

  • Type (a free-text term that characterizes the fact or event)
  • Date (see more details below)
  • Place (see more details below)
  • Agency (An organization or facility associated with the event, such as a hospital or school)
  • Religion (I'm not sure how this came to be an attribute of an event rather than a fact about a person)
  • Cause (Typically, Cause of Death)
  • Note (free text)
  • Source Citation (see more details below)
  • Multimedia Link (e.g. a Multimedia object)

Dates

Dates can be expressed as

  • Exact dates - 15 Jan 2015
  • Partial Dates - Jan 2015; 2015
  • Relative Dates - BEF Jan 2015; AFT 1852; BEF 17 JUL 1687
  • Approximate Dates - ABT Jan 1862; ABT 1910
  • Estimated Dates - EST Jan 1862
  • Date Ranges - (Representing the range in which a specific event might have occurred) BET 1862 AND 1865; BET 5 JAN 1761 AND 6 MAY 1761
  • Durations - (Representing the duration of a fact) FROM 1762 TO 1789

In theory, only Facts can have Durations, and Facts can only have Durations, but this restriction is not routinely applied.
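The date forms listed above can be told apart by their leading keyword. The following Python sketch does just that; the function name `classify_date` and the category labels are my own, and it deliberately reports anything else (such as mm/dd/yyyy) as unrecognized rather than guessing.

```python
import re


def classify_date(value: str) -> str:
    """Label a GEDCOM DATE value with one of the forms listed above."""
    text = value.strip().upper()
    if text.startswith("FROM "):
        return "duration"
    if text.startswith("BET "):
        return "range"
    if text.startswith(("BEF ", "AFT ")):
        return "relative"
    if text.startswith("ABT "):
        return "approximate"
    if text.startswith("EST "):
        return "estimated"
    if re.fullmatch(r"\d{1,2} [A-Z]{3} \d{3,4}", text):
        return "exact"
    if re.fullmatch(r"(?:[A-Z]{3} )?\d{3,4}", text):
        return "partial"
    return "unrecognized"


for sample in ["15 Jan 2015", "Jan 2015", "BEF 17 JUL 1687", "ABT 1910",
               "EST Jan 1862", "BET 1862 AND 1865", "FROM 1762 TO 1789"]:
    print(sample, "->", classify_date(sample))
```

The sketch only checks the shape of the value. It does not verify that the three-letter month is a real month, nor does it handle combined forms such as BEF ABT 1831 beyond the first keyword.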

Places

In GEDCOM, there is no central list of places; the place property (sub-record) of each fact and event must re-state the placename. Genealogy applications typically store the set of placenames used in facts and events in a central list, but almost all genealogy applications are subject to place duplication, since one event could refer to "Dallas, Dallas County, Texas", another could refer to "Dallas, Dallas, Texas, USA", and a third could refer to "Dallas, Texas".

GEDCOM placenames can be more specific than a town. They can incorporate neighborhood names, facility names, cemetery names, specific addresses, and so on.

Cemetery names are typically part of the place name associated with the burial event; e.g. "Fort Hill Cemetery, Cleveland, Bradley County, Tennessee". But other facilities such as hospitals and schools are typically either listed in an event note, or identified as the "Agency" associated with an event.
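To illustrate the place duplication problem, here is a rough Python sketch that flags place names that might refer to the same town. It is a crude heuristic of my own, not something GEDCOM or TNG defines: it compares only the first part and the last part of each name after dropping a trailing country, and the set of country names to ignore (`COUNTRY_NAMES`) is an assumption that holds only for the USA-style examples in this article. The output is meant for human review, not for automatic merging.

```python
from collections import defaultdict

COUNTRY_NAMES = {"usa"}  # assumption: the only country suffix ignored in this sketch


def place_key(place_value: str):
    """Reduce a PLAC value to (first part, last part), ignoring the country."""
    parts = [p.strip().lower() for p in place_value.split(",") if p.strip()]
    if parts and parts[-1] in COUNTRY_NAMES:
        parts = parts[:-1]
    if not parts:
        return ("", "")
    return (parts[0], parts[-1])


def possible_duplicates(places):
    """Return groups of place names that share the same key."""
    groups = defaultdict(list)
    for place in places:
        groups[place_key(place)].append(place)
    return [names for names in groups.values() if len(names) > 1]


print(possible_duplicates([
    "Dallas, Dallas County, Texas",
    "Dallas, Dallas, Texas, USA",
    "Dallas, Texas",
    "Houston,,Texas",
]))
```

A key this loose will also group different places that happen to share a first and last part, such as a cemetery and a hospital with the same name in the same state, which is why a person should look at each group.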

Links between GEDCOM Objects

When used as the tag for a zero-level record, "INDI", "FAM", "OBJE", "SOUR", and "REPO" define primary GEDCOM objects. When used in a subordinate record (one with a level number greater than 0), those same tags associate the current record with a primary GEDCOM object.

For example, in the Individual record listed near the beginning of this article, you'll see

0 @I4526@ INDI
1 BIRT
2 DATE 23 SEP 1898
2 PLAC Leon, Decatur, IA
2 SOUR @S77@
3 PAGE 112
3 DATA
...
3 OBJE @M3349@
...
1 OBJE @M502@
1 FAMS @F1513@
1 FAMC @F2324@

  • The level 2 SOUR tag here indicates that Source S77 is associated with the level 1 Birth event AND the level 0 Person object.
  • The level 3 OBJE tag indicates that Multimedia object M3349 is associated with the Source object identified at level 2, the Birth event defined at level 1, and the Individual defined at level 0. (Multimedia object M3349 would likely be an image of a book page that describes the Individual's birth, or perhaps an image of a birth certificate, where the Source would be the state or county Birth records.)
  • The level 1 OBJE tag indicates that Multimedia object M502 is associated with the Individual defined at level 0. This object might be a photo of the person, or perhaps a story about the person in PDF, DOC, or HTML format.
  • The level 1 FAMS tag indicates that this person is a spouse in Family F1513.
  • The level 1 FAMC tag indicates that this person is a child in Family F2324.

Note that some of these links carry data with them. In particular, the relationship among a Source, an Event, and an Individual identified by the SOUR tag carries the data from the subordinate lines, including the Page attribute and the quote (transcription). That relationship is described by a "Source Citation".
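The links in a record can be listed mechanically, because each one is a subordinate line whose value is an ID wrapped in @ signs. Here is a minimal Python sketch; the function name `find_links` and the tuple layout are choices made for this example, and the pattern assumes the pointer is the only value on the line.

```python
import re

# Level, tag, and a value that is exactly one @ID@ pointer.
POINTER_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(@[^@]+@)\s*$")


def find_links(record_lines):
    """Return (level, tag, target ID) for each pointer line in a record."""
    links = []
    for raw in record_lines:
        match = POINTER_RE.match(raw)
        if match and int(match.group(1)) > 0:
            links.append((int(match.group(1)), match.group(2), match.group(3).strip("@")))
    return links


record = [
    "0 @I4526@ INDI",
    "1 BIRT",
    "2 DATE 23 SEP 1898",
    "2 SOUR @S77@",
    "3 OBJE @M3349@",
    "1 OBJE @M502@",
    "1 FAMS @F1513@",
    "1 FAMC @F2324@",
]
for link in find_links(record):
    print(link)
```

The level 0 line is skipped because it defines the record's own ID rather than pointing at another object.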

Source Citations

Sources of information (that is, the GEDCOM Source objects) are associated with Individuals and Family events through a "Source Citation". The "SOUR" tag, when used at level 0 in a GEDCOM file, defines the Source, but a "SOUR" tag that is subordinate to an event is a "Source Citation" that is supposed to define

  • The page or section within the source
  • The date when the source was used to supply information to the database.
  • A transcription of the information in the source that supports the fact or event.

But as it turns out, Source Citations are very inconsistently and often incorrectly used.
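When a citation is filled in, its pieces can be gathered from the subordinate lines. The Python sketch below reads the source ID, the page, and the quoted text, joining CONT lines with line breaks. The function name `read_citation` and the dictionary keys are hypothetical, and the sketch assumes it is handed only the lines of a single citation.

```python
def read_citation(lines):
    """Collect the source ID, page, and quoted text from one citation."""
    citation = {"source": None, "page": None, "text": ""}
    for raw in lines:
        fields = raw.strip().split(" ", 2)
        if len(fields) < 2:
            continue
        tag = fields[1].upper()
        value = fields[2] if len(fields) > 2 else ""
        if tag == "SOUR":
            citation["source"] = value.strip("@")
        elif tag == "PAGE":
            citation["page"] = value
        elif tag == "TEXT":
            citation["text"] = value
        elif tag == "CONT":
            # CONT continues the text on a new line.
            citation["text"] += "\n" + value
    return citation


print(read_citation([
    "2 SOUR @S77@",
    "3 PAGE 112",
    "3 DATA",
    "4 TEXT Birth date: abt 1899",
    "5 CONT Birth place: Iowa",
]))
```

Given how inconsistently citations are entered, any of the three values may well come back empty for a real file.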

Derived GEDCOM Objects

Some of the concepts described above are typically treated as objects (or relational tables) in genealogy applications.

  • Events - Events have a Tag name (BIRT, DEAT, RESI, etc.), are associated with an Individual or Family, and have attributes as listed above (Date, Place, Agency, Cause, Note, Source Citation, etc.).
  • Places - Aside from the full place name, a place may have geographic coordinates (i.e. latitude and longitude). In a GEDCOM file, the coordinates essentially have to be expressed in every fact or event where the place name is expressed.
  • Source Citations, which form a link between Events and Sources

Processing by TNG

Here are some specific considerations that apply to TNG's handling of GEDCOM files.

Proprietary Tags

Any proprietary event tag (e.g. _MILT for military events) can be treated as an event in TNG.

Other proprietary tags that are typically treated as "Pure Attributes" include:

  • _MREL and _FREL, which characterize the relationship (Natural, Adoptive, Foster, etc.) between a child and a parent.
  • _PRIM means "Primary Multimedia Object", and is expected to be a level 2 tag that is subordinate to a level 1 Multimedia Object reference in an Individual record. It is used to designate one Multimedia Object as the primary photo for an Individual.
  • _PHOTO is an alternate way to identify the primary photo for an Individual, and is not understood by TNG. _PHOTO tags need to be converted to _PRIM tags.
  • _LIVING designates an individual as living.
  • _PRIVATE designates an individual or family as private.

See Desktop gotchas to read about GEDCOM quirks in some PC software packages, and the need to convert some GEDCOM files before they can be imported into TNG.
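Before importing a file, it can help to see which proprietary tags it actually contains. Here is a small Python sketch that counts the tags that start with an underscore; the function name `count_proprietary_tags` is made up, and this is a standalone check run on the file's lines, not a TNG feature.

```python
from collections import Counter


def count_proprietary_tags(lines):
    """Count tags that start with an underscore in a list of GEDCOM lines."""
    counts = Counter()
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue
        # The tag is the second field, or the third when the line carries an @ID@.
        if fields[1].startswith("@") and len(fields) > 2:
            tag = fields[2]
        else:
            tag = fields[1]
        if tag.startswith("_"):
            counts[tag] += 1
    return counts


print(count_proprietary_tags([
    "0 @M232@ OBJE",
    "1 FORM jpg",
    "1 _TYPE PHOTO",
    "1 _PRIM Y",
    "1 _SIZE 147.000000 193.000000",
]))
```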

Custom Event Types

By default, when importing a GEDCOM file, TNG processes only the GEDCOM tags that are marked as Accept in the Custom Event Types table. On the Administration >> Custom Event Types screen, you can list the "custom" event types that you do (or don't) want TNG to import.

Or, on the Administration >> Import/Export >> Import screen, two checkboxes allow you to

  • Accept data for all new Custom Event Types (i.e. those that are not already listed at Administration >> Custom Event Types), and
  • Add tags to the Custom Event Types table by reading the GEDCOM files but not importing any data.

TNG-Generated GEDCOM

TNG supports several media collections that are typically not supported in other genealogy software programs, which often only support PHOTO as a valid media _TYPE. For example, if you created a Certificate User Media Collection, TNG will generate the following in a GEDCOM for a Certificate that is linked to the Birth event.

1 BIRT
2 DATE 18 Sep 1917
2 PLAC Rumford, Oxford County, Maine, USA
2 OBJE
3 FORM JPG
3 FILE F:\wamp\www\tng\certificates\Nadeau_Theresa_Birth.jpg
3 TITL Theresa Nadeau's original Birth Certificate
3 _TYPE CERTIFICATES
3 NOTE Theresa Nadeau's original Birth Certificate from Rumford, Maine

While the above will import in RootsMagic, for example, it will not generate the thumbnails for the media.

TNG V9 added the capability to change the Export As setting that generates the _TYPE for the media records. In order to import the above media into RootsMagic and generate the thumbnails, the _TYPE must be PHOTO, so you need to edit the User Media Collection and change the Export As to PHOTO. See Export TNG Media As.

Since you cannot edit the TNG Media Collections (Photos, Documents, Headstones, and Histories), you will need to manually change the Export As in mediatypes.php or, better, install the mod described in Export Media For Desktop that pertains to the desktop genealogy software into which you want to import the TNG GEDCOM file.
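If changing the Export As setting or installing a mod is not an option, the exported file itself could in principle be edited afterwards. The Python sketch below rewrites every _TYPE line to PHOTO in the text of a GEDCOM file. This is my own workaround, not something TNG provides; the function name `force_photo_type` is hypothetical, and you should run anything like it on a copy of the export.

```python
import re

# A level number followed by the _TYPE tag, then whatever value is on the line.
TYPE_LINE_RE = re.compile(r"^([ \t]*\d+ _TYPE )[^\r\n]*", re.MULTILINE)


def force_photo_type(gedcom_text: str) -> str:
    """Rewrite the value of every _TYPE line to PHOTO."""
    return TYPE_LINE_RE.sub(r"\g<1>PHOTO", gedcom_text)


sample = "2 OBJE\n3 FORM JPG\n3 _TYPE CERTIFICATES\n3 NOTE example\n"
print(force_photo_type(sample))
```

Note that this discards the original media collection name, so the file will no longer round-trip back into TNG with its collections intact.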

Related Links

Import Data

Media - Import

Export TNG Media As

Export Media For Desktop

Export V10

Desktop gotchas

External references