SQL-The Least You Need To Know

From TNG_Wiki
Jump to navigation Jump to search
Construction This page is under construction. Anyone is welcome to contribute. Note that MediaWiki (the product behind this wiki) can't effectively handle simultaneous edits. Thus, for now, it would be a good idea to put your name and a note in this construction template when you open the page for editing so other possible editors can see that it is being edited.)

- Robin Richmond 16 Aug 2021

Construction


SQL (Structured Query Language) is the language through which TNG reads from and writes to the database. Just as Javascript and CSS can be embedded in HTML pages, SQL is embedded in PHP programs.

Notably, most computer programs that interact with database use SQL for that purpose, no matter what programming language they use.

Note that Javascript and PHP are "programming languages" and HTML and CSS are "descriptive languages". SQL falls somewhere in between, as a "query language". It's a lot closer to a programming language than HTML and CSS are, and it really is not uncommon for SQL statement to look or act like scripts. But still, SQL queries essentially say "here's what I want" instead of "here, follow these instructions to accomplish what I want.

SQL doesn't have to be embedded in computer programs. Programmers and non-programmers alike routinely write SQL queries and run them directly against the database. At some point, almost all TNG administrators will need to run an ad-hoc query. Even if someone else writes the query for you, you should at least have a general understand of SQL.

SELECT Queries

Most SQL queries are written as a single statement with several "clauses". Queries often include conditional functions (much like those in Microsoft Excel formulas). Much less often they contain conditional clauses, which are really out of the scope of this article.

All queries start with a keyword that is a verb that indications what kind of operation the query will perform. By far, the most common query operation is SELECT, which reads and displays data. In natural languages, we probably wouldn't think of UPDATE, INSERT, DROP, or ALTER operations as queries, but in SQL, they are. For readability, SQL keywords are typically written in uppercase letters, but they don't have to be.

SELECT queries essentially always produce a grid (well, a table, but not a database table) where rows represent result records and columns represent result values. In the simplest queries,

  • The rows represent records from a database table, and
  • The columns are just the values of database fields.

As queries get more complicated

  • The rows represent what we might call logical records that are assembled from more than one database table, and
  • The columns may contain computed values rather than direct field values.

SELECT queries

  • must start with a SELECT clause that lists the values to display (in columns),
  • And then specify one FROM clause that specifies the (first) table to use.

SELECT queries (really, of all types) may also have

  • JOIN clauses that pull other tables into the query,
  • A WHERE clause that uses boolean expressions to control which records to display (in rows),
  • An ORDER BY clause to specify the sort order
  • And several others, only a couple of which will be touched on here.

Example 1 - People & Events

This example includes queries that examine the People table and specific events for selected people. It introduces and describes SELECT, FROM, ORDER BY, INNER JOIN...ON, field and table aliases, composite keys, concatenation functions, TNG date fields, and TNG foreign key fields that can link to different table, depending on the field value.

[Show Details]
[Close Example 1]
We'll start with a simple one-table query that displays the personID, name, and birthdate of everyone in the Person table, and sorts them by name.
Query 1-1
SELECT gedcom, personID, firstname, lastname, birthdate
FROM tng_people
ORDER BY lastname, firstname

Note that different TNG sites use different person name components, so your queries may need to use additional fields.

Let's use a SQL function to combine the name parts into a single fullname column, and use a simple WHERE clause to select specific records.
Query 1-2

SELECT gedcom, personID, CONCAT(firstname, ' ', lastname, ' ', suffix) AS fullname, birthdate
FROM tng_people
WHERE birthdatetr > "1900"
  • The first and last names are concatenated into a single value. A single fullname value will essentially always occupy less horizontal space than separate columns for each field. It may also make the results more readable.
  • "fullname" is an "alias" for the expression that computes the fullname, and thus serves as the column heading for that results column. Formally, the keyword AS is supposed to precede the alias, but it is never necessary.
  • Notably, the WHERE clause uses the field 'birthdatetr' rather than 'birthdate'.
    • Since event date fields can have prefixes such as "abt', 'bef', etc, all event date fields have a companion field whose name ends with "tr". Every time a record with an event date is saved to the database, the "tr" field is given a value in the form "yyyy-mm-dd", which is the format of standard SQL dates. The "tr" dates are not actual SQL dates, they just look like them, and can be compared to other "tr" dates. They are different from actual dates because their month and date components can take on the value "00".
      relative and partial dates affect the "tr" field values as follows.
      • Partial dates: the 0th day of the month or year, e.g. "MAY 1930"=>"1930-05-00", "1930"=>"1930-00-00"
      • ABT- a full date->the full date, otherwise the same as partial dates!
      • BEF - a full date->the previous day, otherwise the last day of the preceding month, e.g. "BEF MAY 1930"=>"1930-04-30', "BEF 1930"=>"1929-12-31"
      • Everything else, including "EST", "FROM date1 TO date2", or really any text string will be treated exactly like ABT.

Note that birthdatetr > "1900" doesn't mean "after the year 1900", but rather, since datetr fields are always in the form 'yyyy-mm-dd', birthdatetr > "1900" means "starting with the year 1900". That is, it will include the partial date "1900" (which is 1900-00-00) but it will not include "BEF 1900" (1899-12-31).

Let's add a couple of levels of complexity. If we want to list all 20th century residence events, we need to pull in a couple of other tables:

  • We need the Events table, using the field 'persfamid' that links an event to a person or family,
  • For TNG "custom events", the Events table does not contain the event tag. Instead, it contains a foreign key called 'eventtypeID' that links to an Eventtypes record, which stores the tags in a field called 'tag'.

SQL uses "JOIN" clauses to pull in another table. There are several types of joins, but we'll focus for the moment on the most commonly-used one, INNER JOIN which looks for matching records in the second table and creates a result row for each matching record, and that does not create a result row at all if there is no match.

Because TNG's database does not explicitly define foreign keys, the foreign key and matching primary keys fields must (in almost all case) be expressed explicitly in ON expression that match foreign and primary keys.

Query 1-3

SELECT tng_people.gedcom, personID, CONCAT_WS(' ', firstname, lastname, suffix) AS fullname,
      birthdate, eventdate, eventplace
FROM tng_people
INNER JOIN tng_events ON persfamID=personID AND tng_events.gedcom=tng_people.gedcom
INNER JOIN tng_eventtypes ON tng_eventtypes.eventtypeID = tng_events.eventtypeID
WHERE eventdate > "1900" AND tag = 'RESI'

Note that:

  • This query uses an alternate concatenation function, CONCAT_WS, in which the first argument is placed between all of the other arguments. The concatenation in this query produces exactly the same result as the concatenation in Query 1-2.
  • The relationship between Events and People
    • TNG can handle multiple trees, and records in two trees that share a personID are not expected to be associated with each other. As a result, TNG's People table (along with Families, Sources, Media, and other several others) have composite primary keys combine combine a the treeID ('gedcom') and a recordID field such as personID, familyID, sourceID, etc.
    • Events.persfamID can contain either a personID or a familyID. Consequently, in the Events table, the composite foreign key (gedcom, persfamID) can link to a person or to a family.
    • We know that RESI is a person event, not a family event, so this query doesn't have to JOIN to both People and Families. It can just JOIN toPeople and not worry about Families.
    • Because of the composite key in the relationship between Events and People, the JOIN ON clause compares Events.persfamID to People.personID and compares Events.gedcom to People.gedcom.
      • If your tree has more than one table, you don't have to define a SELECT column for 'gedcom'.
      • And, ostensibly, you should not have to compare the two 'gedcom' fields in the ON expressions.
      • But, notably, because of the table indexes in the TNG database, including the gedcom field comparison in ON expressions will speed up the query.
  • The foreign key from Events to Eventtypes is Events.eventtypeID, and the primary key in the Eventtypes table is Eventtypes.eventtypeID. So, we have to specify the table name whenever we refer to an eventtypeID. But Eventtypes is not specific to a TNG tree, so it has no 'gedcom' field, and no reference to 'gedcom' is needed (nor would it work) in the JOIN between Events and Eventtypes.
    • Note that we didn't have to specify the table name for the fields that occur in only one table in this query'. tables. However, if the query used the Children or Mediatypes tables, which each have a field named 'personID', well, we'd have to be explicit about those fields, too.

Query 1-4
Typing "tng_people", "tng_events", and "tng_eventtypes" in the SELECT list, ON expressions, and WHERE clause (etc.) is a hassle that we can avoid by using more cryptic table aliases:

SELECT p.gedcom, personID, CONCAT(firstname, ' ', lastname, ' ', suffix) AS fullname,
       birthdate, eventdate, eventplace
FROM tng_people AS p
INNER JOIN tng_events AS e ON persfamID=personID AND e.gedcom=p.gedcom
INNER JOIN tng_eventtypes AS ee ON ee.eventtypeID = e.eventtypeID
WHERE eventdate > "1900" AND tag = 'RESI'</code>

I think that it is fair to say that the aliases not only simplified the typing of this query, but also made it more readable. In complex queries, it can be more difficult to remember exactly which alias refers to which table. Still, a query with aliases is almost always more readable than one with explicit table names.

[Close Example 1]

Example 2 - Multiple Parents

The goal: Find people who have more than one set of parents. Typically, this would happen when birth and adoptive parents are both in the database, or when a parent remarries and the researcher wants to mark the new spouse as a step-parent. This example includes queries that examine the People table and specific events for selected people. It introduces and describes SELECT, FROM, ORDER BY, INNER JOIN...ON, field and table aliases, composite keys, the COUNT function, subqueries, and TNG foreign key fields that can link to different table, depending on the field value.

[Show Details]
[Hide Example 2]

Query 2-1: Just a list of personIDs in one tree,

SELECT personID, COUNT(*) as count
FROM tng_children
WHERE gedcom='rr'
GROUP BY personID
HAVING count>1
ORDER BY personID
  • If you have only one tree in your database, you could omit the WHERE clause.
  • To get results from all trees in your database, just add "gedcom, " in front of each instance of personID, as in the queries below.

Query 2-2: A list of the personIDs and familyIDs in all trees, with each personID,familyID pair on a separate line.

In query 2-1, we were looking for people, and using COUNT(*) to figure out how many results there would be for each person. But we can't count people and then list more than one result for each person. We could write "SELECT personID, familyID, count(*), but that would only tell us how many results there would be for each family, and that count is always 1.

So we have to change the structure of the query significantly, both omitting the count of families per person, and using a "subquery" to make sure that we select only people who have more than one set of parents. (The subquery will be explained below).

SELECT gedcom, personID, familyID FROM tng_children
WHERE personID IN (
  SELECT personID FROM tng_children
  GROUP BY personID  HAVING count(*)>1
)
ORDER BY gedcom, personID

A subquery is just one query embedded in another. In this case, the subquery

  SELECT personID FROM tng_children
  GROUP BY personID  HAVING count(*)>1

generates a set of results very similar to the first query's results. That is, it is a one-column table of the personIDs of people with more than one set of parents.

  • A subquery's results table can be treated just like any database table in the outer query. And, if we need to, we can specify an alias for the subquery's results table. But that's not what we're doing here.
  • Because this subquery's results table has only one column, it can be treated as though it were a comma-delimited list of values. And we want to use that set of values in the WHERE clause's IN expression.
    • The IN operator asks whether the expression to the left of the keyword "IN" can be found in the list to the right. For example, the expression "personID IN ('I10', 'I27', 'I1440')" is true only when the personID has one of three specific values. In this query, the IN expression filters out all people who have fewer than two sets of parents. In other words,
      WHERE c.personID IN ({query that returns one column})
      is equivalent to
      WHERE personID is in the set of personID's returned by the subquery.

Parenthetically, when it is being used as a list, the subquery's column name doesn't have to be "personID". We could generate a list of personIDs from a field such as Events.persfamid. When a subquery's resutls table is being used like a database table, then, you bet, the column names are critical, just as they are in a database table. Query 2-3: Add the name to Query 2-1

SELECT c.gedcom, c.personID, COUNT(*) AS count, CONCAT(firstname, ' ',lastname, ' ',suffix) AS fullname
FROM tng_children AS c
INNER JOIN tng_people AS p ON p.personID=c.personID AND p.gedcom=c.gedcom
GROUP BY c.gedcom, c.personID
HAVING count>1
ORDER BY c.gedcom, p.lastname, p.firstname
  1. The select list includes 'c.gedcom' because In the query above, the fullname is presented in one results column rather than in three. As a result, it takes up less horizontal space. Horizontal space doesn't matter with this simple query, but in some of the queries below, especially the one with the child's and parents' name, it makes a big difference.
  2. This query, like Query 2-2, displays the gedcom field and as I understand it, excluding the gedcom field comparisons will slow the query down even if you have only one tree. I haven't tested that assertion.

Query 2-4: Add the name to Query 2-2
Note that, once I include the familyID in the results, I can't use the COUNT function in the same way, and I have to use a subquery (which will be explained shortly).

SELECT c.gedcom, c.personID, CONCAT_WS(' ', firstname, lastname, suffix) AS fullname, c.familyID
FROM tng_children c
INNER JOIN tng_people AS p ON p.personID=c.personID AND p.gedcom=c.gedcom
WHERE c.personID IN (
  SELECT personID FROM tng_children
  GROUP BY personID  HAVING count(*)>1
)
ORDER BY c.gedcom, c.personID

Query 2-5: Add the parents' personIDs

SELECT c.gedcom, c.personID, CONCAT_WS(' ', firstname, lastname, suffix) AS fullname, 
      f.familyID, husband, wife
FROM tng_children c
INNER JOIN tng_people AS p ON p.personID=c.personID AND p.gedcom=c.gedcom
INNER JOIN tng_families AS F ON f.familyID=c.familyID AND p.gedcom=c.gedcom
WHERE c.personID IN (
  SELECT personID FROM tng_children
  GROUP BY personID  HAVING count(*)>1
)
ORDER BY c.gedcom, c.personID

Query 2-6: Add the parent's names.
This query has to look up three different people in the People table:

  • The child whose parents we are looking for (just as in the queries above),
  • The husband, and
  • The wife.

Thus, we need to specify the People table three times, each with a different alias: p for person, h for husband, and w for wife. (Note that, here, the alias is not just for readability; it is essential.)

Also, because we have three instances of the People table, we have to add the table alias 'p' to the name fields that didn't require aliases in the query above.

SELECT c.gedcom, c.personID, CONCAT_WS(' ', p.firstname, p.lastname, p.suffix) AS fullname, f.familyID,
   husband, CONCAT_WS(' ', h.firstname, h.lastname, h.suffix) AS husbname,
   wife, CONCAT_WS(' ', w.firstname, w.lastname, w.suffix) AS wifename
FROM tng_children c
INNER JOIN tng_people AS p ON p.personID=c.personID AND p.gedcom=c.gedcom
INNER JOIN tng_families AS f ON f.familyID=c.familyID AND p.gedcom=c.gedcom
LEFT JOIN tng_people AS h ON h.personID=f.husband AND h.gedcom=f.gedcom
LEFT JOIN tng_people AS w ON w.personID=f.wife AND w.gedcom=f.gedcom
WHERE c.personID IN (
  SELECT personID FROM tng_children
  GROUP BY personID  HAVING count(*)>1
)
ORDER BY c.gedcom, p.lastname, p.firstname

Note the use of LEFT JOIN in this query. Joins are explained in more detail in the INNER JOIN vs LEFT JOIN section of this article. But, in summary, INNER JOIN is used if the result row would not be useful without the joined table, and LEFT JOIN is used when we still need the result, but we don't need the values supplied by the joined table.

  1. We need INNER JOIN tng_people because there isn't much we can to with a value of Children.personID if there is no corresponding People record.
  2. We need INNER JOIN tng_families for the same reason.
  3. We use LEFT JOIN tng_people as h because a value of Families.husband can be empty. But even when it it is empty, we want to display the child's personID and the parents' familyID; we'll just have to be content with a blank 'husbname' column.
  4. We use LEFT JOIN tng_people as w for the reason, just as applied to the Families.wife.
we still want to The
[Hide Example 2]

INSERT Queries

gotta be simple and quick

UPDATE Queries

simple and quick

DROP, TRUNCATE, and ALTER Queries

VERY simple and quick

Letter Case

In MySQL, the letter case of tablenames and fieldnames does not matter. (That is if you define a field named "personID", then "personid" would refer to the same field.) In other database products, letter case may be significant, in which case "personid" and "personID" would be different fields. Or a database product might require all table names to be uppercase. Or one product might allow underscores in tables, and another might even allow spaces.

Since SQL doesn't care about the letter case of tablenames, SQL statements that express a tablename with the "wrong" letter case will be perfectly legal and work just fine.

Aspects OF TNG's database

TNG's database does not have formally defined foreign keys, which means that TNG SQL queries (whether embedded in PHP programs, or executed independently) must be more explicit about which fields in one table match which fields in another. For instance

  • The fields Families.husband and Families.wife are expected to contain personID values, which in combination with Families.gedcom, serve as foreign keys to People.
    • But a rogue query could story any other value (that matches the fields' datatype and length) in those fields. Bad values would generally not cause an error; they would simply cause a JOIN from Families.wife or Families.husband not to match a People record.
    • It is entirely up to the PHP programmer and SQL query-writer to make sure that the right fields are used in JOIN clauses and that they have the expected values.

But TNG can also take advantage of the fact that those relationships are not formalized. For instance,

  • Events.persfamID field can have have personID or familyID values; that is, one field (well, in combination with Events.gedcom) can support links to two different tables.
  • In Citations.eventID can have values other than Event table keys. Some values in the Citations.eventID field have other purposes, and don't point to table's primary key.


Aliases

Well,  where I say "people p" in my queries, the p is an alias for "people".  As result I can use "p" instead of "people" everywhere else in the query. I find aliases to be extremely useful. They frequently make it easier to read the query. Beside, having to write the full tablename everywhere is a drag.

I admit that sometimes using explicit table names does make it easier to understand what the query is doing. So, I try to remember to think about other humans and use the full tablename.

Note that not all programmers use alias. For example, most of Darrin's SQL queries do not.  (That may be the case for historical reasons; I don't know.  But it's definitely true that his queries do not generally need to be readable.)

Here's a simple example, using the first few lines of Query 2-3 above:

My version, with aliases:

SELECT f.familyID, h.lastname Husband, w.lastname Wife, ee.tag, e.eventdate, e.eventplace
FROM tng_families f
INNER JOIN tng_events e ON e.persfamID=f.familyID AND e.gedcom=f.gedcom
INNER JOIN tng_eventtypes ee ON ee.eventtypeID=e.eventtypeID
LEFT JOIN tng_people h ON h.personID=f.husband AND h.gedcom=f.gedcom
LEFT JOIN tng_people w ON w.personID=f.wife AND w.gedcom=f.gedcom

Here, "f" is an alias for families, "e" for events, "ee" for eventtypes, and both "h" and"w" for people. Both "h" and "w" are used for people because the query uses the people table twice - once for the husband, and once for the wife. When there are two instances of a table in a query, you have to use alias to eliminate ambiguity. But the others are not required; they are just handy. 

Using explicit table names when aliases are not required, the query might look like this:

SELECT tng_families.familyID, h.lastname Husband, w.lastname Wife, tng_eventtypes.tag,
       tng_events.eventdate, tng_events.eventplace
FROM tng_families
INNER JOIN tng_events ON tng_events.persfamID=tng_families.familyID AND
                         tng_events.gedcom=tng_families.gedcom
INNER JOIN tng_eventtypes ON tng_eventtypes.eventtypeID = tng_events.eventtypeID
LEFT JOIN people h ON h.personID = tng_families.husband AND h.gedcom = tng_families.gedcom
LEFT JOIN people w ON w.personID = tng_families.wife AND w.gedcom = tng_families.gedcom

I'll use aliases every time.

Even More Details

Now, I'm going go dig even deeper in to SQL to make a few additional points that affect the verbosity and/or clarity of SQL.  I can't say that it is essential for you to know them, but they might affect your ability to understand other SQL queries.

  1. Formally, you use the AS keyword to define aliases: "FROM families AS f", "INNER JOIN events AS e", etc.
  2. In English, tablenames are usually expressed with an initial capital letter. I'll use that notion in all English text below.
  3. In SQL, fieldnames,  tablenames, and keywords do not have to use the correct letter case. For example
    INNER JOIN events ON events.persfamID=families.familyID AND events.gedcom=families.gedcom
    could be expressed as
    inner join Events ON Events.persfamID=Familes.familyID and Events.gedcom=Families.gedcom.persfamID
  4. Tablenames don't have to be attached to fieldnames when a fieldname exists in only one of the tables in the query. For instance , in the SELECT clause, the fieldname "familyID" is used only in the Families table, "tag" occurs only in Eventtypes, and eventdate and eventplace occur only in the Events table. Thus, the first four lines of the query could be
SELECT familyID, h.lastname Husband, w.lastname Wife, tag, eventdate, eventplace
FROM families
INNER JOIN events ON events.persfamID=familyID AND events.gedcom=families.gedcom
INNER JOIN eventtypes ON eventtypes.eventtypeID=events.eventtypeID

because "persfamID" occurs in Events and Citations, "eventtypeID" occurs only in Eventtypes, and "gedcom" occurs in almost every TNG table.

Of course, to avoid explict use of tablenames (or aliases), the programmer has to know the names of every field in every table that is used in the query.  Otherwise, how could programmers know that they can say "familyID" instead of "families.familyID"?  In truth "familyID is used in other tables in the database. But those other tables are not relevant to this query.

INNER JOIN vs LEFT JOIN

INNER JOIN is used to connect Families to Events because, if there are no Event records for a given Family, well, we don't need to show that Family at all.

INNER JOIN is used to connect Events to Eventtypes because we know that there has to be a match. The Events table does not contain the event tag.  Instead, the event type is represented by the "eventtypeID" field, which is a number that points to the Eventtypes table.  Why?  Well, this may not be the ideal example, but if we want to change census events (CENS) to residence events (RESI), we only have to change CENS to RESI in one record in the Eventtypes table.  We don't have to touch the Events table at all.

LEFT JOIN is used to connect the husband and wife in a Families record because we want to see event records even if we don't know who the husband wife it.  That is, when the husband or wife in a family is unknown, the husband or wife field will be empty and will not link to a family record.  If we used an INNER JOIN, the absence of a Person record for a husband or wife would remove the Family from the results.

More specifically, if you back to Query 2, you'll see

LEFT JOIN citations c ON c.eventID=e.eventID<br>
WHERE c.citationID IS NULL...

Here, the LEFT JOIN is looking up all citations for a given event. If there are no such citations, the LEFT JOIN will produce a table row in which, in the absence of any Citation records, the citationID value would be null. And since those are the very events we want, the WHERE clause has the condition "c.citationID IS NULL". Hence we want to see events where the ci would (and, as would all if we cared about other fields in the citation table, theall fields in the citation recordwe're still going to get a result row for that event, but the value of the citationID in the results will be null (which is different from blank or empty)  we're matching a citation record with an event the If you go back to the Query 2, you'

The personID is sufficient to identify a person record in the query results; the person's name is not essential.  It is there only for the possible convenience of a human who is using the results.  That is, a human who wants to resolve citationless events can use the personID to look up the relevant person records.  But the name of a person the database could allow the human to focus on people who are more important for some reason or another.  people that llows the human to recognize certain people  humans can use the familyIDwants to open a record in the People table in order to resolve citationless events, needed at all. That is, the personID is suffient to look up a person in the database to try to resolve events without citations, The person's name is there only for the convenience of humans who read the results. In this query, the name may not be relewants to know the names of each person who has events without citations. (Well, humans who reads the results may not want to know the name of each person in the query results; they may only want to be able to recognize some names.) of Maybe not to know the names of each person. citationless"person who has citationlessis reading the query results. If the husband or wife in a given family is unknown, the husband or wife field would be empty, and would not link to a People record. In this query, the familyID is sufficient to identify a family record; the the husband's and wife's last name are useful only to help a human recognize a family. 

Finally, the essential difference between INNER JOIN and LEFT JOIN is that, with INNER JOIN, if the ON condition fails, no result row will appear in the table.  With OUTER JOIN, a table row will result no matter whether the ON condition matches, and the missing record will be populated with NULL values.

See also

Relational Databases-The Least You Need To Know
Database Collation - Explain Choosing (well, this would move with the content on table definitions