It is always good to know at what point "Postgres as X" breaks down. For instance, I know from experience that Postgres as a timeseries DB (without add-ons) starts to break down in the low billions of rows. It would be great to know that for graph DBs as well. I think a lot of people would prefer to just use Postgres if they can get away with it.
I've done almost exactly the kind of thing described in the article for a couple million rows. It broke down when I needed to find "friends of friends" at depth 6. None of the optimizations I could come up with helped. Neither did Apache AGE.
Maybe I'm just not skilled enough to handle such a workload with Postgres. But Neo4j handled it easily.
I'm not sure if you are serious or joking. "6 degrees of separation" is the famous radius for how many steps are needed such that everyone is connected.
https://en.wikipedia.org/wiki/Six_degrees_of_separation
Except that it wasn't about people at all? I was just referring to a _type_ of query
What kind of data was it, if I may ask?
Graph databases have a very narrow use case, and it's almost always in relation to people - at least in my experience.
Though the data type isn't really important for the performance question, the amount of data selected is. So a 6-level-deep graph of connections where each node only connects to 2-3 entities would never run into performance issues. You'd be able to go way beyond that too with such a narrow window. (3 connections per entity over a 6-level join comes out to, I believe, ~750 rows.)
If you're modeling something like movies instead, with >10 actors per production you're looking at millions of rows.
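For reference, the query shape being discussed is roughly the following recursive CTE (just a sketch, assuming a hypothetical friendships(person_id, friend_id) table and starting person 42; each level multiplies the row count by the average friend count):

    WITH RECURSIVE fof AS (
        SELECT friend_id, 1 AS depth
        FROM friendships
        WHERE person_id = 42               -- hypothetical starting person
        UNION                              -- UNION (not UNION ALL) drops duplicate (friend_id, depth) rows
        SELECT f.friend_id, fof.depth + 1
        FROM fof
        JOIN friendships f ON f.person_id = fof.friend_id
        WHERE fof.depth < 6                -- the depth cap is what bounds the recursion
    )
    SELECT DISTINCT friend_id FROM fof;

With ~3 friends per person the intermediate result stays tiny (3^6 ≈ 729 rows before dedup); with hundreds of connections per node it explodes, which is presumably where the pain described above starts.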
Network analysis of financial transactions to detect "layering" (onion-routed money laundering) paths, probably.
Too funny, enjoy the vote
I tested PostgreSQL vs Neo4j with a friends-of-friends query two years ago, and to my surprise it was Neo4j that crashed, not PostgreSQL.
https://github.com/joelonsql/graph-query-benchmarks
I haven't tried the latest versions of both databases though.
I've been running into exactly that problem. Which time series add-on would you recommend looking into?
We ended up using ClickHouse after trying Timescale and InfluxDB. ClickHouse is great, but it's important to spend a day or two understanding the data model to make sure it fits what you are trying to do. I have no affiliation with ClickHouse (or any company mentioned).
We’ve been using InfluxDB 1.x since 2017 and I’m itching to get off of it. Currently we’re at about a trillion rows, and we have to ship our data to Snowflake to do large-scale aggregations (like hourly averages). I put a bunch of effort into building this on top of InfluxDB and it’s not up to the task.
Any sense if ClickHouse can scale to several TB of data, and serve giant queries on the last hour’s data across all sensors without getting OOMKilled? We’re also looking at some hacked together DuckDB abomination, or pg_duck.
A trillion rows should be no issue at all for ClickHouse. Your use case sounds a bit more typical for it (our data is simulated so there is no single / monotonically increasing clock). I don't know though, is this ~100KS/s data acquisition type stuff (i.e. sound or vibration data for instance)? If so, it wouldn't be possible to push that into ClickHouse without pre-processing.
Interesting... But ClickHouse is an entirely different database engine, no? I will look into it, thank you!
Yes, it is a columnar DB. A lot of things feel pretty familiar, as it has a SQL-like query language. The data model is different though.
The nice thing is, once you understand the data model it becomes very easy to predict if it will fit your use case or not as there is really no magic to it.
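For anyone curious, "understanding the data model" mostly means choosing the table engine and sorting key up front. A minimal sketch in ClickHouse SQL, with hypothetical sensor-data names:

    CREATE TABLE sensor_readings
    (
        sensor_id UInt32,
        ts        DateTime,
        value     Float64
    )
    ENGINE = MergeTree          -- rows are stored sorted by the ORDER BY key
    ORDER BY (sensor_id, ts);   -- so per-sensor time-range scans stay cheap

    -- hourly averages over the last hour, across all sensors
    SELECT sensor_id, toStartOfHour(ts) AS hour, avg(value) AS avg_value
    FROM sensor_readings
    WHERE ts >= now() - INTERVAL 1 HOUR
    GROUP BY sensor_id, hour;

If your common queries don't line up with the sorting key, you find out quickly - which is why the day or two mentioned above is worth spending.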
Timescale is definitely worth a look. pg_partman gets you part of the way. We ended up going with BigQuery for our workload because it solved a bigger bag of problems for our needs (data warehouse). It's very hard to beat for big… queries.
I never understood the rationale behind TimescaleDB — if you’re building a time series database using row-oriented storage, you’ve already got one hand tied behind your back.
What does your testing strategy look like with BigQuery? We use Snowflake, but the only way to iterate and develop is using Snowflake itself, which is so painful as to impact the set of features that we have the stomach to build.
Testing strategy? What's that? I kid, but just a bit. Our use case is a data warehouse. We use DBT to build everything. Each commit is built in CI to a CI target project. Each commit gets its own hash prefixed in front of dataset names. Each developer also has their own prefix for local development. The dev and CI datasets expire and are deleted after about a week. We use data tests on the actual data for "foreign keys", checking for duplicates and allowed values. But that's pretty much it. It's very difficult to do TDD for a data warehouse in SQL.
My current headache is what to do with an actually big table, 25 billion rows of JSON, for development. It's going to be some DBT hacks, I think.
God help you if you want to unit test application code that relies on BigQuery. I'm sure there are ways, but I doubt they don't hurt a lot.
Interesting strategy with appending the commit hash to the dataset name. If one of those commits is known to be good and you want to “ship” it, do you then rename it?
What are you doing with that JSON? What’s the reason why you can’t get a representative sample of rows onto your dev machine and hack on that?
"Postgres as graph DB" starts to break down when you try to do serious network analysis with it, using specialized algorithms that are heavy on math - as opposed to merely using 'graphs' as the foundion of your data model, which is what graph databases mostly get used for. It's more about "what your actual use case is" than "how much data you have".
I also found this. Computing a transitive closure, for example, on a dataset that's not a pure DAG was absurdly painful for data that I stored in any SQL database. It ended up being cheaper and much, much faster, where I could fit the data, to just buy an old workstation with more than half a terabyte of RAM and just do all the computations in memory using networkx and graph-tool.
I presume that for larger real-world graph datasets there are better algorithms and storage methods. I couldn't figure out Neo4j fast enough, and it wasn't clear that I could map all of the stuff like block modeling into it anyway, but it would be very useful for someone to figure out a better production-ready storage backend for networkx, at least one where some of the data could be cached in SQLite3.
> "Postgres as X"
This reminds me of "Just Use Postgres for Everything": https://news.ycombinator.com/item?id=33934139
Any recursive query is absurdly slow, so I think it scales wonderfully only if you match your representation to your workload.
Do you have more details about it?
My first thought is sharding and materialized views.
I mean, there are a lot of extensions and optimizations that you can pull in on a Postgres instance before you truly have to abandon it.
For timeseries I'd argue that with Timescale there are very few use cases that ever outgrow Postgres. But I might be biased.
I don't think there's a set, predictable level at which "Postgres as X" breaks down.
If you plan for it, for example by storing your graph in an export-friendly format, it should be OK.
Using Postgres as long as you can get away with it is a no-brainer. It's had so many smart people working on it for so long that it really comes through. It's nice to see the beginner- and intro-level tools for it evolving quickly, similar to what has made MySQL approachable for so many years.
I thought this was going to be about querying a Postgres table with a graph query language (SQL/PGQ).
Indeed this is coming to Postgres eventually.
https://www.postgresql.org/message-id/flat/a855795d-e697-4fa...
https://ashutoshpg.blogspot.com/2024/04/dbaag-with-sqlpgq.ht...
Like apache age (postgres with cypher support) https://age.apache.org/
What I'm talking about is going to be committed into Postgres.
Postgraphile[1] does this via GraphQL, or have I misunderstood what you're referring to?
1: https://www.graphile.org/postgraphile
Nope I'm talking about something getting committed to Postgres itself.
Ah I see, yes it'd be pretty awesome to have Postgres be even more of a Swiss army knife of databases.
There is also https://age.apache.org/, an extension to facilitate graph queries in postgresql.
Glad they used Cypher. One of the best and most intuitive query languages I've used personally.
I had to use neo4j on a project some years ago and hated it (it didn’t help that the client forced its use for really dumb and bad reasons, wasting tons of time in the process) but Cypher is great. Only thing about it I liked. It’s very good.
Yeah Cypher was so good and made everything so intuitive. Loved working with Neo4J for a prototype and then had to face the licensing issues and huge costs and had to abandon it.
There are open source alternatives to Neo4j, and also very good, scalable open source graph databases with query languages similar to Cypher. I think we are in agreement that it is a good idea to choose a platform that scales both for compute and for all other costs, like licensing costs.
Can you list them?
Not sure if everyone reading is aware, but Neo4j did end up turning Cypher into the openCypher spec, and that allowed a lot of other graph stores to adopt Cypher as a query language.
I got bitten by the same, we were looking at switching to memgraph (which is also Cypher, but C++ rather than Java), sadly abandoned for other reasons ...
What reasons? Would love to know for future ref.
If your org is amenable to Google Spanner or already using Spanner, it has preview support for Cypher and graph semantics: https://cloud.google.com/spanner/docs/graph/overview
Unrelated to the tech, just the graph-based algorithm (recommender stuff) that we had in mind did not give us the results we were hoping for ... and other things were queued up to be tried ...
I agree with you about Cypher. For a couple of decades I was an RDF and then an RDF + SPARQL advocate, but I am a fairly recent convert to property graphs and Cypher.
I was going to say, this is what SPARQL was designed for, and it's much better at it than SQL is.
It was pretty dead last time I checked.
This time it must be me... Bilbo is not Frodo's father (Frodo is actually Bilbo's first and second cousin, once removed)
Yes, but Bilbo does adopt him to make him his heir
Very good, and also Frodo refers to Bilbo as his "uncle".
It's also right up front in The Fellowship of the Ring: he adopts Frodo because he never got around to finding a partner, implied to be because of his stewardship of the Ring.
If you need some "Routing Graph" - check the pgRouting library https://pgrouting.org/
See more: https://docs.pgrouting.org/latest/en/pgRouting-concepts.html...
Is it normally preferred to use a library such as this over a more traditional approach like nested sets, adjacency lists, or the ltree module?
PG Database + "The Boost Graph Library (BGL)" C++ ( https://www.boost.org/doc/libs/1_87_0/libs/graph/doc/index.h... )+ PostGIS ==> PGRouting
Just check the Workshop : https://workshop.pgrouting.org/3.0/en/index.html
Pedestrian Routing:
https://workshop.pgrouting.org/3.0/en/basic/pedestrian.html#...
Graph views:
https://workshop.pgrouting.org/3.0/en/basic/graph_views.html
Create a Network Topology:
https://workshop.pgrouting.org/3.0/en/advanced/chapter-12.ht...
ltree is the one I ended up using. I don't like it much. The way it works allows you to set up a tree, but you can't easily move things around within the tree once they're in.
I forgot to say that it's also obviously only a tree, not a graph. It's still useful for some cases.
It has a very good way of querying, which is probably fast if you have very deep trees, but in the end we didn't need the deep trees, so it was chosen for the wrong reasons.
https://www.postgresql.org/docs/current/ltree.html
You essentially have a path field in your record which lists the ids of all the records that are parents of the current one plus the current one.
e.g. if your records are regions of the world and the id is the country name then the path might look like this:
Earth.Europe.Poland
There is support for queries like "what are all the descendants of Europe"
Or you can use pattern matching.
"CTE" refers to Postgres' "common table expression" syntax, the "WITH" keyword. See https://www.postgresql.org/docs/current/queries-with.html
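To make that concrete, a minimal ltree sketch (a hypothetical regions table, using the example data above):

    CREATE EXTENSION IF NOT EXISTS ltree;

    CREATE TABLE regions (path ltree NOT NULL);
    CREATE INDEX ON regions USING gist (path);

    INSERT INTO regions (path) VALUES
        ('Earth'),
        ('Earth.Europe'),
        ('Earth.Europe.Poland'),
        ('Earth.Asia');

    -- "what are all the descendants of Europe" (Europe itself included)
    SELECT path FROM regions WHERE path <@ 'Earth.Europe';

    -- lquery pattern matching: anything exactly two levels below Earth
    SELECT path FROM regions WHERE path ~ 'Earth.*{2}';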
Not just Postgres. It's in the SQL standard, and widely supported.
Even by SQLite!
I've been a big fan of graph databases for about 10 years. I even wrote my own "in-memory graph storage" in golang for a specific use case that none of the big graph databases could cover at the time.
That said - I WISH people would embrace the existing graph databases more and get hosters to support them as standard, rather than abusing existing relational databases for graph purposes.
And to make it clear, I'm not talking about my own experimental one, I mean stuff like Neo4j, OrientDB etc.
> rather than abusing existing relational databases for graph purposes.
In my experience, the vast majority of graphs can be embedded in relational databases just fine and most people don't want general graph querying. People just don't like optimizing queries (or equivalently the schema to enable such queries).
I personally have never seen a pitch for graph databases that makes them seem attractive for more than data exploration on your local machine.
Well, I've been working for several years on a private, let's call it "research", project which deeply relies on growing, deeply structured graphs.
I don't think that graph DBs should be a default choice, but there are cases in which they just perform better.
Could I write my research project with a relational DB? Yes - I tried - and it sucked xD
Yep, something like adjacency lists has solved a lot of dumb recursive queries/"graph queries" for me, usually because the graph doesn't change much and is just a few nodes deep.
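For that shallow, mostly static case it really is just a plain edge table plus a fixed-depth join - a sketch with hypothetical names:

    -- hypothetical adjacency list
    CREATE TABLE edges (
        src int NOT NULL,
        dst int NOT NULL,
        PRIMARY KEY (src, dst)
    );
    CREATE INDEX ON edges (dst);   -- for walking edges in the reverse direction

    -- nodes exactly two hops away from node 42; no recursion needed
    SELECT DISTINCT e2.dst
    FROM edges e1
    JOIN edges e2 ON e2.src = e1.dst
    WHERE e1.src = 42;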
When you look at the AWS Neptune database implementation details, it looks really close to the Aurora backend... :-)
https://docs.aws.amazon.com/neptune/latest/userguide/migrati...
Hrhr very true.
Also, the last time I checked Neptune, it could only be run in AWS and there was no public application image for local tinkering/dev. But that was some years ago, to be fair - it may have changed.
My personal opinion is that nobody should touch OrientDB with a 10 ft pole.
I started with OrientDB, then switched to Neo4j. OrientDB was good for starting, tbf, but we're talking 10+ years back. Now I would definitely default to Neo4j.
Why?
The only problem with representing graphs on an RDBMS is the queries. So what's the big issue if queries are added which can handle the graph cases?
That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values can be more tricky?
Sure, you can represent a graph DB in an RDBMS - but graph DBs are optimized for their specific use case. Therefore, for example, locking of resources is optimized for that purpose. When going with an "Edge" and "Node" table approach, locking is by default rather unoptimized in many RDBMSs. Just one simple example.
> That's like saying: why do we need JSON when we can also represent it in a text file, even though accessing values can be more tricky?
If your text processing stack works well enough, and JSON only solves edge cases, then I think it's a legitimate argument against adding another tech to the stack.
It's not like everyone immediately switched over to JSON even if they could have.
And the analogy may not be the best - in the case of graph databases, they don't do what an RDBMS can do, at least not as efficiently.
So if we need some graph functionality in our RDBMS, do we split the database and use a "real" graph database, or just incorporate the graph functionality into the existing DB? The choice seems quite easy to me, because there is a whole host of issues in splitting DBs (duplication of information, inconsistency, etc.).
Well, I took the best example that came to mind at that moment. It may not be the best analogy, but it is what it is.
My point was not about people enriching an existing RDBMS concept with graph features, it was about using an RDBMS as a graph DB (as its only purpose). So maybe I wasn't exact enough in my definition. Therefore:
"If your purpose is to use the benefits of a graph and you want to use it for this purpose only, use a graph DB; don't take an RDBMS and make it a graph DB."
I hope that's better now.
It’s a neat trick! A simpler choice would be to use Postgres’ own ltree data type: https://www.postgresql.org/docs/15/ltree.html
I wrote about how we use it here: https://www.yetto.app/blog/post/how-labels-work/
The issue with ltree is that it requires fanout on write when moving a node: you need to update every recursive descendant of the moved node to rewrite its ancestor path for the new location in the tree, so a move is O(subtree nodes). In exchange, you get to scan a subtree using a btree prefix search, which is great.
That makes it good for relatively static and flat data like your label hierarchy, which probably receives a new label or a repainting within a tree relatively infrequently, and might have a p95 depth of 10. That makes it also a good fit for actual human (parent, child) relationships since parent-child change is usually infrequent and the total children per layer in the tree is also small.
For a tree like a Notion workspace where depth can be quite deep, and layers can easily have hundreds or hundreds of thousands of siblings, the cost at write-time when you reparent an ancestor in the worst case isn’t feasible.
For something like a social graph I'm not sure how to use ltree at all, since a social graph isn't a DAG.
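To make the write cost concrete: reparenting a subtree looks roughly like this (continuing the hypothetical regions table from the ltree sketch further up; every row under the moved node gets rewritten, hence the O(subtree nodes) cost):

    -- move the Poland subtree from Earth.Europe to Earth.Asia
    UPDATE regions
    SET    path = 'Earth.Asia'::ltree || subpath(path, nlevel('Earth.Europe'::ltree))
    WHERE  path <@ 'Earth.Europe.Poland';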
I've found the main advantage to this over a more specialized graph database is integration and maintenance alongside the rest of your application. Real world data is rarely just (or even mostly) graph data.
If you are interested in the subject, also take a look at NetworkDisk[1], which enables users/devs of NetworkX[2] to work with graphs via SQLite.
[1] https://networkdisk.inria.fr/
[2] https://networkx.org/
Surprised to see this article with no mention of LTree: https://www.postgresql.org/docs/current/ltree.html
First-party data structure for tree (directed, acyclic graph) data
This is where you start to see major limitations of OLTP, and need to look at specialized graph DBs (or indexes really) that take advantage of index-free adjacency
At DuckCon #6 there was a talk given about implementing SQL/PGQ as an extension to DuckDB
https://m.youtube.com/watch?v=QDdTbhSR2Vo
https://duckdb.org/community_extensions/extensions/duckpgq.h...
I recently came across https://kuzudb.com/ which takes a lot of inspiration from DuckDB. Seems promising for embedded.
There was a very good discussion about PostgreSQL as graph database on Hacker News about a year ago: https://news.ycombinator.com/item?id=35386948
This doesn't seem to cover infinite recursion, which is one of the ugliest parts of doing this type of query with CTEs. You effectively need to concatenate all of the primary keys you've seen into a string, and check it each recursive iteration.
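(An array of visited keys works just as well as a concatenated string for that guard; a minimal sketch, assuming a hypothetical edges(src, dst) table:)

    WITH RECURSIVE walk AS (
        SELECT e.dst, ARRAY[e.src, e.dst] AS path
        FROM edges e
        WHERE e.src = 42                       -- hypothetical starting node
        UNION ALL
        SELECT e.dst, walk.path || e.dst
        FROM walk
        JOIN edges e ON e.src = walk.dst
        WHERE e.dst <> ALL (walk.path)         -- skip nodes already visited on this path
    )
    SELECT DISTINCT dst FROM walk;

Postgres 14 also added a built-in CYCLE clause for WITH RECURSIVE that does this bookkeeping automatically, though the per-row path baggage is still there.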
The approach I came up with was to instead use MSSQL hierarchy IDs (there's a paper describing them, so they can be ported to any database). This specifically optimizes for ancestor-of/descendant-of queries. The gist of it is:
1. Find all cycles/strong components in the graph. Store those in a separate table, identifying each cycle with an ID. Remove all the nodes in the cycle, and insert a single node with their cycle ID instead (turning the graph into a DAG). The reason we do this is because we will always visit every node in a cycle when doing an ancestor-of/descendant-of query.
            +---+
  +---------> B +-------+
+-+-+       +-^-+    +-v-+
| A |         |      | C |
+---+         |      +-+-+
            +-+-+      |
            | D <-------+
            +---+

Becomes:

+---+  +---+
| A +--> 1 |
+---+  +---+

+---+---+
| 1 | B |
| 1 | C |
| 1 | D |
+---+---+

2. You effectively want to decompose the DAG into a forest of trees. Establish a worklist and place all the nodes into it. The order in this worklist may be something that can affect performance (you want the deepest/largest tree first). Grab nodes out of this worklist and perform a depth-first search, removing nodes from the worklist as you traverse them. These nodes can then be inserted into the hierarchy ID table. You should insert all child nodes of each node, even if it has already been inserted before, but only continue traversing downwards if the node hasn't yet been inserted.

           +---+
  +-------> B +----+
+-+-+      +---+  +-v-+  +---+
| A |             | D +--> E |
+-+-+             +-^-+  +---+
  |        +---+    |
  +-------> C +----+
           +---+

Becomes:

        +---+  +---+  +---+
  +-----> B +--> D +--> E |
+-+-+   +---+  +---+  +---+
| A |
+-+-+   +---+  +---+
  +-----> C +--> D |
        +---+  +---+

Or, as hierarchy IDs:

    A    /1
    B    /1/1
    D    /1/1/1
    E    /1/1/1/1
    C    /1/2
    D    /1/2/1

Now, you do still need a recursive CTE - but it's guaranteed to terminate. The "interior" of the CTE would be a range operation over the hierarchy IDs. Basically other.hierarchy >= origin.hierarchy && other.hierarchy < next-sibling(origin.hierarchy) (for descendant-of queries). You need to recurse to handle situations involving nodes like D in the example above (if we started at C, we'd get D in the first iteration, but would need to recurse to the first D node in order to find E).

This was two orders of magnitude faster than a recursive CTE using strings to prevent infinite recursion.
The major disadvantage of this representation is that you can't modify it incrementally. It has to be recalculated from scratch each time a node in the graph is changed. The dataset I was dealing with was 100,000s of nodes per customer, and the rebuild took under a second, so that wasn't a problem. You could probably also identify changes that don't require a full rebuild (probably any change that doesn't involve a back edge or cross edge), but I never had to bother, so I didn't solve it.
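For the read side, the non-recursive core of a descendant-of lookup over such materialized paths can be a plain prefix scan. A sketch with a hypothetical node_paths table (paths stored as text with a trailing separator; the extra recursion that chases repeated nodes like D is omitted):

    -- hypothetical table holding the precomputed hierarchy IDs
    CREATE TABLE node_paths (
        node_id   int  NOT NULL,
        hierarchy text NOT NULL    -- e.g. '/1/1/'; the trailing '/' keeps '/1/1/' from matching '/1/10/'
    );
    CREATE INDEX ON node_paths (hierarchy text_pattern_ops);  -- btree usable for LIKE prefix scans

    -- all descendants of the node whose path is '/1/1/'
    SELECT node_id
    FROM node_paths
    WHERE hierarchy LIKE '/1/1/%';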