CREATE TABLE some_relationship (participant1 INTEGER, participant2 INTEGER)
This is the most trivial and straightforward in-database representation of a symmetric relationship between two participants. Actually, this schema could represent an asymmetric relationship as well, but we'll write our queries and application code so that (participant1=X, participant2=Y) is treated the same as (participant1=Y, participant2=X).
This schema still leaves us with some decisions. For the simplest possible query,
SELECT participant2 FROM some_relationship WHERE participant1 = X
we would need to store each relationship in the table twice for each relationship, one of them inverted.
If you don't want to duplicate each relationship in the table (I sure don't), then we have to have a more complex query (and more complex way of processing the result).
SELECT participant1, participant2 FROM some_relationship WHERE participant1 = X OR participant2 = X
Then we must go through the results of this query, and for each of them say "If p1 = x then treat p2 as the other participant; if p2 = x then treat p1 as the other participant." I'm also not sure how well this query will perform (on any database), even if both participant columns are indexed.
Both of these are annoying. The first requires explicit duplication of data which may become inconsistent, and the second is very tedious to query for.
Here's another idea:
CREATE TABLE some_relationship (relation_id INTEGER, participant INTEGER)
This one allows us to avoid explicit duplication in the table and also allows us to just query against one participant column. For every new relationship (X <-> Y), you'd insert two rows consisting of a unique ID for this particular relation and each participant. The two relationships (50 <-> 51) and (50 <-> 52), you'd insert four rows::
INSERT INTO some_relationship VALUES (1, 50)
INSERT INTO some_relationship VALUES (1, 51)
INSERT INTO some_relationship VALUES (2, 50)
INSERT INTO some_relationship VALUES (2, 52)
Then, to get all participants related to 50, you'd do the query::
SELECT participant FROM some_relationship
WHERE relation_id IN (SELECT relation_id FROM some_relationship
WHERE participant = 50)
AND participant != 50
This query is pretty complex, but at least it gives us exactly the results we want (no post-processing to decide which participant is the "other" participant, like in the second usage of Schema 1 that I discussed above. And hey, we don't have any weird duplication in our data. By the way, if you were to use this schema, you'd want to create indexes on both relation_id and participant.
Extra data and Schema 3
However, Schema 2 has some problems which Schema 1 doesn't have. What if you want to add some extra metadata to the relationship? If you're representing "similarity", for example, you may want to add a "strength" column to represent the fact that two items are more similar than two other items. Encoding this data into Schema 1 is very simple: just add the other column, and grab it out in your queries, and it's in as convenient a form as it's going to get. Encoding it into Schema 2, however, is extremely awkward because the relationship is split over multiple rows. This can be solved, however, by sticking that data into another table entirely. We can even leave Schema 2 exactly how it is, except for adding a REFERENCES annotation to the relation_id if your database supports it and you're so inclined.
CREATE TABLE similarity (
relation_id INTEGER REFERENCES similarity_data (id),
CREATE TABLE similarity_data (
But then getting data in a useful form out of these two schemas becomes even more complex. I'll leave that as an exercise for the reader. Doing it without application-code processing and multiple queries will probably involve DISTINCT and a join.
Well what the heck
I still don't know which option I'll use for my application. Any input, bagoblog?