3

In Postgres, parallel query works by having multiple workers....

As such, depending on workers available, queries can run theoretically fast (workers available) or slow (workers not available) - right?

Question is: are there more optimizations like this... And is my assumption correct?

It seems like a major pain in the ass... Scalability and reliability wise. (?!)

Comments
  • 0
    I've zero knowledge on this feature. The official doc link below might help you

    https://postgresql.org/docs/10/...
  • 2
    @asgs If you ever read the docs fully you'll notice that the optimizations are sparsely documented... And I like opinions from people with hands in experience more than manuals.

    Database theory can be a tremendous difference to database practice.

    Database optimizations are very complex and as such rarely fully documented (sadly)
  • 0
    I agree with @IntrusionCM. Official documentation almost always vastly differs from real experience.

    I haven't used this feature either, and am very curious to know what people's experiences are.
  • 1
    In my experience of using queries over clusters, it depends on if the data is partitioned over the cluster enough and doesn't rely on any data in another partition. Be careful of running queries on a cluster but using things that actually have to collect all of the data into one node before it can be processed. That's a huge bottleneck.

    For example adding a unique row number to the every row in a table can only be done in the master node because all worker nodes don't know about each other. It then become non-optimal and you need to make the master node big enough to hold all data even though its parallelized.
  • 0
    @cmarshall10450

    You are talking about replication and sharding (bit unfamiliar with terminology)… Correct?

    Currently evaluating Postgres for a (wannabe) microservice architecture... Replication / failover is too risky (as the necessity is near zero, it makes no sense to open pandoras box).

    Currently I'm bombing the MySQL slowlog (>200 MB) via threaded python Script to PostGres to get a bit of a grasp of what's what.

    Stumbled over the parallel worker's case as there was too much deviation in the statements query execution ...
  • 0
    @IntrusionCM not replication or sharding, although you're right in thinking that with it being a multiple machine issue.

    I haven't used parallel execution in Postgres but a lot of my job is writing parallel ETLs for use with hadoop/hive.

    Sharding and replication is more concerned with scalability and high availability respectively, whereas parallel execution is about processing speed.

    The query execution plan can become more complex with the need to move data between nodes and network latency can become the slowest part of the process

    There's probably some other bottleneck than the databases because 200mb is tiny for what I deal with and isn't close enough for parallelism to be necessary. Is your query computationally heavy or memory heavy?
  • 0
    @cmarshall10450 ah.

    It's not one query XD

    The slowlog consists of roughly 850 (unique, fingerprinted) queries...

    To mimic the behaviour of a true SQL server and do a worst case comparison between the original MySQL based server vs the PostGres one, I wrote the python script....

    It takes the MySQL slowlog (~500 K queries, 200 MB size) and - via threading - uses multiple connections to the PostGres server and feeds the queries in parallel to it.

    So no, the database is not 200MB :) The MySQL slowlog is.

    And as the default of max parallel worker is 8 and my script spawns 8 threads... I pretty much nailed it.
  • 0
    @IntrusionCM I may be wrong as I'm going off of knowledge of Apache Spark, Hadoop and Hive but one worker is not equal to one thread. A worker node has multiple executors, each with a slice of the overall memory but at least one logical cpu (thread). The real limitation is threads per core.

    https://postgresql.org/docs/10/... is what I think we are both talking about. Correct me if I'm wrong.

    In this terminology, a Gather is a worker node and the leader is what I'd call the master.

    Id also suggest looking at https://citusdata.com/solutions/... if you haven't already
Add Comment