Representing symmetric relations in a relational database
I haven't searched terribly hard, but I can't find very much information on the Internet about clever and efficient ways of representing and querying for symmetric relationships in relational databases. I've thought up a few options, so I'll spew them onto my blog to get feedback. The keen researcher will probably want to do more googling for representations of undirected graphs in RDBs, since undirected edges are a prime example of a symmetric relationship.
Schema 1
CREATE TABLE some_relationship (participant1 INTEGER, participant2 INTEGER)
This is the most trivial and straightforward in-database representation of a symmetric relationship between two participants. Actually, this schema could represent an asymmetric relationship as well, but we'll write our queries and application code so that (participant1=X, participant2=Y) is treated the same as (participant1=Y, participant2=X).
This schema still leaves us with some decisions. For the simplest possible query,
SELECT participant2 FROM some_relationship WHERE participant1 = X
we would need to store each relationship in the table twice for each relationship, one of them inverted.
If you don't want to duplicate each relationship in the table (I sure don't), then we have to have a more complex query (and more complex way of processing the result).
SELECT participant1, participant2 FROM some_relationship WHERE participant1 = X OR participant2 = X
Then we must go through the results of this query, and for each of them say "If p1 = x then treat p2 as the other participant; if p2 = x then treat p1 as the other participant." I'm also not sure how well this query will perform (on any database), even if both participant columns are indexed.
Both of these are annoying. The first requires explicit duplication of data which may become inconsistent, and the second is very tedious to query for.
Here's another idea:
Schema 2
CREATE TABLE some_relationship (relation_id INTEGER, participant INTEGER)
This one allows us to avoid explicit duplication in the table and also allows us to just query against one participant column. For every new relationship (X <-> Y), you'd insert two rows consisting of a unique ID for this particular relation and each participant. The two relationships (50 <-> 51) and (50 <-> 52), you'd insert four rows::
INSERT INTO some_relationship VALUES (1, 50)
INSERT INTO some_relationship VALUES (1, 51)
INSERT INTO some_relationship VALUES (2, 50)
INSERT INTO some_relationship VALUES (2, 52)
Then, to get all participants related to 50, you'd do the query::
SELECT participant FROM some_relationship
WHERE relation_id IN (SELECT relation_id FROM some_relationship
WHERE participant = 50)
AND participant != 50
This query is pretty complex, but at least it gives us exactly the results we want (no post-processing to decide which participant is the "other" participant, like in the second usage of Schema 1 that I discussed above. And hey, we don't have any weird duplication in our data. By the way, if you were to use this schema, you'd want to create indexes on both relation_id and participant.
Extra data and Schema 3
However, Schema 2 has some problems which Schema 1 doesn't have. What if you want to add some extra metadata to the relationship? If you're representing "similarity", for example, you may want to add a "strength" column to represent the fact that two items are more similar than two other items. Encoding this data into Schema 1 is very simple: just add the other column, and grab it out in your queries, and it's in as convenient a form as it's going to get. Encoding it into Schema 2, however, is extremely awkward because the relationship is split over multiple rows. This can be solved, however, by sticking that data into another table entirely. We can even leave Schema 2 exactly how it is, except for adding a REFERENCES annotation to the relation_id if your database supports it and you're so inclined.
CREATE TABLE similarity (
relation_id INTEGER REFERENCES similarity_data (id),
participant INTEGER)
CREATE TABLE similarity_data (
id INTEGER,
strength INTEGER)
But then getting data in a useful form out of these two schemas becomes even more complex. I'll leave that as an exercise for the reader. Doing it without application-code processing and multiple queries will probably involve DISTINCT and a join.
Well what the heck
I still don't know which option I'll use for my application. Any input, bagoblog?
9 comments:
How about Schema 1 and query against a view that gives you the symmetry:
CREATE VIEW full_some_relationship (participant1, participant2) AS (SELECT participant1, participant2 FROM some_relationshsip UNION SELECT participant2, participant1 FROM some_relationship);
No post-processing of results, simple queries and inserts, simple addition of attributes to the relationship. At the cost of a VIEW and a UNION - maybe eventually you materialize that view, in which case you do have your double storage, but your application code doesn't have to know.
I had a similar issue just the other day where I need to store all my records in pairs, to represent the original read-only data and the edited version of that data. From either I need to find either, since given some ID I might not know if its an original or edited version.
My solution was simply to add original_id and edited_id columns, where either original_id or edited_id will be equal to id, if that row is the original or edited record, respectively. A single trigger sets things up nicely and I've had good usage of this pattern.
If a common use of this relation will be to find if there is a relationship between two given items, then you could consider using schema 1, and storing the participant with the lower ID in the first column. There would be no need to check for the IDs in both columns, so it would be just a single SELECT. It does seem like a lot of work though for only a O(1) speed improvement.
I'd go with schema 1, mostly because of the KISS principle. If someone else ever needs to maintain your code, you can explain schema 1 in less than a minute -- the other two are more complicated and unless time or space efficiency is paramount here, IMHO the extra complexity is unjustified.
store (a_id,b_id) such that a_id <= b_id, and query with left_id=? OR right_id=?
Of course, asking for advice on a particular strategy for something really ought to come with a lot more context for the application it will be used in.
Actually, my common operation is "show me all items that this items is related to" (and ordered by "strength" such as the one discussed near the end of my blog post).
So the "lower ID in a, higher ID in b" trick doesn't gain my anything.
Colin, my implementation is currently in the first schema. It's really not very simple at all.
Check this out:
def getSimilarItems(self):
similarities = self.store.query(
Similarity,
OR(Similarity.item1 == self,
Similarity.item2 == self),
sort=Similarity.strength.descending)
for similarity in similarities:
if similarity.item1 is self:
yield similarity.item2
if similarity.item2 is self:
yield similarity.item1
I don't think KISS really applies to this one. Besides, I write thorough docstrings and do all of my development TDD; this code is pretty maintainable :)
However, given that I wrote this implementation already, I will probably keep it; I just need to do some EXPLAINing to make sure that I can make the queries fast.
Do this according to the rules.
Which one fits the rules best? Schema 2, I think. If you need metadata, create another table (with relation id and metadata).
If you write code later - use some cool API like Storm or Axiom.
If you need to query everything efficiently - use Xapian or some cache table.
That will work.
_Every_ solution which involves numbering columns is bad.
TC,
--
dotz
Ramble:
Another way of thinking about the strength thing is that you are defining a kind of inverse metric on a set.
For reference, a metric in this sense is a function d : S x S -> R, where 'R' is the real numbers and S is a set.
It satisfies the constraints:
d(x, x) = 0 for all x in S
d(x, y) = d(y, x) for all x, y in S
d(x, y) + d(y, z) >= d(x, z) for all x, y, z in S
This doesn't really help, but gosh it's interesting.
Oh, one question, is there a total ordering on participant? (I guess id is an integer).
In that case, you can include a constraint so that participant1 < participant2?
Post a Comment