The customer wants to see relationships between multiple data types, groups of data, drill down into it, have a geographical view, a temporal view and a conceptual view. And much more of course. That’s the moment you realize you spent ten years of your life creating all possible variations on the subject ‘diagramming’ and now you actually need a ‘diagramming database’. A whole different beast, not the usual yada-yada, what does it entail?
The thing is, relational database systems are not very suitable to act as repositories for graph-like structures but at the same time they are unavoidable because the whole world runs on them. Sure, there is something like an OO database but it would be even less suitable for graphs. So, you choose for a RDBMS even if you have some back-thoughts. At least RDBM systems are well supported on every API level and you get tons of fun tools which on the long run could be useful. In the case of SQL Server you know things like reporting services, StreamInsight, OLAP and whatnot will show up at the end of the road.
How do you store a graph-like system whose nodes are of an arbitrary data type? Say, you wish to store relationships between telephone numbers, file locations, car types, chat messages and credit card numbers. It wouldn’t be very wise to create data tables for each data type since you know at some point the customer will come along and ask for mobile numbers, addresses, book titles and cooking recipes! Well, there are not that many solutions in this case; you can use binary serialization of some data blob and store it as varbinary or you can use string-like serialization (which includes XML). Obviously, binary stuff in your database is not a good thing if you wish it to be searchable and ultimately open to other integrations. There is, hence, no other choice but to go for XML structs which at least is a well-supported standard across the while industry.
What about the relational structure? Easy enough, you just store the ‘from’ and the ‘to’ (sink, source, target, drain, start, end, initial, final…) in a two-field table. Voila, you’ve got a graph-database on top of an RDBMS.
Well, not really.Obviously having the data in place is just the beginning; you need to calculate a spanning tree, Hamiltonian cycles, Eulerian paths, shortest distance…and more fun pseudo-math stuff. If you want to use the standard algorithms it means you will put a lot of pressure on the data exchange with your client (say, a WCF service). Argh, you realize the only way is to actually do as much as possible on the database side. Through T-SQL or use .Net assemblies inside the SQL process? Not an easy choice. You start to wonder if creating a server is not a better option. But this means…writing a query processor, heavy threading programming and more wizardry which goes far beyond one’s standard toolbox. So…let’s assume T-SQL can do it. Actually T-SQL is not too bad if you are a bit fluent with it you can go quite far into this direction.
Status: a graph-database consists of two tables and a lot of smart stored procedures.
But hold on, the customer wants it also multi-user and multi-application. That is, the total data is not always ‘on’, only a subset of if in function of the application the current user is using. More branching, more subtleties in the stored procedures.
The situation is actually comparable to having heaps of XML documents where each user and each application adds attributes to the standard XML to decorate it with the necessary application-options and security settings. On top of that lots of tuples describe relationships between these XML documents. Did I mentions the relationships are also user and application dependent?
Maybe I’m giving you the impression that it’s an hopeless complexity and an overly open system which tries to embrace too much flexibility. It’s not that bad. In fact, we have a working prototype which includes a full-fledged client application (both for Silverlight and for WPF). The feeling sticks with me, however, that as far as abstract beauty is concerned a true graph database server would be better. This exists in Java but not in .Net and even if it would, I wonder how well adjacent systems (like reporting and such) would integrate with it. So, the current prototype is a balanced solution between various poles and constraints. In a next phase we’ll develop connectors to StreamInsight, file system, legacy event aggregators and more. The purpose being to be able to store and analyze graphs which cross the boundary of the local storage and also capture real-time events (through, for example, StreamInsight). Big fun and more challenging stuff is ahead.






I recommend to use combination of two approaches. For data storing use RDMS with constrains and data integration and use XML data for visualization stuff.
Both Oracle and SQL Server allows manipulate with XML documents stored in blobs and load partially.
Such combination I’ve use in OCR management tool and get very good results and performance.
Hi, I am working for about a year now on a project that also has graph (network) in it’s core and I also have to deal with persist the data to a database (RDBMS).
As you said, OO DB like DB4O is not production-ready, but I did look into “document-oriented database” like CouchDB and MongoDB but I end up using regular RDBMS as back-end database and J2EE application as the front-end of the database. This give me the flexibility to “talk” to the client application in more “graph language” while still persist the data into tables at the end – but the client application is clean from any table knowledge.
There is no chance connecting this database to a reporting application because the reporting application was “table oriented” and not “graph oriented” so it will be able to create very limited reports on it.
If you give a molecular scientist a hamer – who far can it reach?
Thank you for the post and I’m looking forward for more.
Ido