Show HN: Hydra 1.0 – open-source column-oriented Postgres

hydra-so.notion.site

269 points by coatue 2 years ago

hi hn, hydra ceo here

hydra is an open source, column-oriented postgres. you can set up remarkably fast aggregates on your project in minutes to query billions of rows instantly.

postgres is great, but aggregates can take minutes to hours to return results on large data sets. long-running analytical queries hog database resources and degrade performance. use hydra to run much faster analytics on postgres without making code changes. data is automatically loaded into columnar format and compressed. connect to hydra with your preferred postgres client (psql, dbeaver, etc).

following 4 months of development on hydra v0.3.0-alpha, our team is proud to share our first major version release. hydra 1.0 is under active development, but ready for use and feedback. we’re aiming to release 1.0 into general availability (ga) soon.

for testing, try the hydra free tier to create a column-oriented postgres instance on the cloud. https://dashboard.hydra.so/signup

qeternity 2 years ago

Ok so this is a Citus fork.

Where can I read about what the differences / trade offs are? I don’t see anything in the docs.

pella 2 years ago

Congratulations!

Can we expect support for gist, gin, spgist, and brin indexes sometime in the near future?

Based on the source code, it appears that they are not supported:

https://github.com/hydradatabase/hydra/blob/96056312e7c0f413...

"... Columnar supports `btree` and `hash `indexes (and the constraints requiring them) but does not support `gist`, `gin`, `spgist` and `brin` indexes."

  • nerdponx 2 years ago

    Do different kinds of indexes work better for columnar storage? Or is it the same principles for both?

therealwardo 2 years ago

how does Hydra compare to Citus? https://www.citusdata.com

  • jerrysievert 2 years ago

    generally faster across the board; a lot of work was done to expand and speed it up, plus add updates, deletes, and vacuuming.

    https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...

    • adr1an 2 years ago

      Right when I was thinking URL shorteners were out of fashion... /S

      • biugbkifcjk 2 years ago

        It's just there to make it easier for mobile users to click it.

        • setr 2 years ago

          I don’t see why the GitHub link is any harder to click than the tiny url link in that post.

          I'm pretty sure the only real reason URL shorteners exist is Twitter's character limits (and software that doesn't visually hide egregiously long URLs), but they continue to be used outside those places due to cargo culting.

    • rubiquity 2 years ago

      Since benchmarks can be misleading I want to point out that the differences between Hydra and the "tuned"[0] PostgreSQL (which are some very basic settings) are a lot less convincing, with plain old PG coming ahead on quite a few: https://tinyurl.com/eju9tht2

      I also noticed quite a bit of parity between Hydra and Citus on data set size. Is Hydra a fork of Citus columnar storage?

      0 - https://github.com/ClickHouse/ClickBench/blob/main/postgresq...

      • arp242 2 years ago

        > plain old PG coming ahead on quite a few

        I found that is common among these types of databases (e.g. Citus, Timescale, etc.) which perform well under very specific conditions, and worse for many (most?) other things, sometimes significantly worse.

        That said, Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G, the insert time is ~9 times faster, and "cold run" is quite a bit faster too. It's only "hot run" that shows a fairly small difference. I think it's fair to say Hydra "wins" that benchmark.

        > Is Hydra a fork of Citus columnar storage?

        Yes: "Hydra Columnar is a fork from Citus Columnar c. April, 2022".

        • riku_iki 2 years ago

          > Hydra does take up ~17.5G for that benchmark and "PostgreSQL tuned" about 120G

          you can run pg on compressed filesystem

          • arp242 2 years ago

            I'm sure you can, but AFAIK neither uses compression in that benchmark so it's a fair comparison. Even if filesystem compression would reduce that to 17.5G (doubtable), it won't be free in terms of CPU cycles, and no matter what it's still ~120G to load in memory, bytes to scan/update, etc.

            • riku_iki 2 years ago

              my bet is that hydra already uses compression internally, otherwise it is hard to explain where the difference comes from.

              > it won't be free in terms of CPU cycles

              it can reduce IO traffic significantly, which can be a very positive trade-off depending on circumstances.

              • arp242 2 years ago

                I had assumed that PostgreSQL is so much larger because it creates heaps of indexes (which is probably also why inserts are so much slower for it), but I don't really have a good way to confirm that quickly.

                • riku_iki 2 years ago

                  one can choose to not create "heaps of indexes".

                  • arp242 2 years ago

                    At which point your performance will drop like a brick for these types of queries – I'm pretty sure these indexes weren't added for the craic.

                    • riku_iki 2 years ago

                      it depends on your query obviously.

                      In general, I did very deep benchmarking of pg, clickhouse and duckdb, and I sure didn't make stupid mistakes like this: https://news.ycombinator.com/item?id=36990831

                      My dataset has 50B rows and 2TB of data, and I think columnar dbs are very overhyped. I chose pg because:

                      - pg performance is acceptable, maybe 2-5x slower than clickhouse and duckdb on some queries, if pg is configured correctly and run on compressed storage

                      - clickhouse and duckdb start falling apart very fast because they are specialized for a very narrow type of queries: https://github.com/ClickHouse/ClickHouse/issues/47520 https://github.com/ClickHouse/ClickHouse/issues/47521 https://github.com/duckdb/duckdb/discussions/6696

                      • zX41ZdbW 2 years ago

                        ClickHouse can do large GROUP BY queries, not limited by memory: https://clickhouse.com/docs/en/sql-reference/statements/sele...

                        • riku_iki 2 years ago

                          as explained in https://github.com/ClickHouse/ClickHouse/issues/47521#issuec... it can't; that parameter only applies to the pre-aggregation phase, not the aggregation itself.

                          Feature request is not implemented yet: https://github.com/ClickHouse/ClickHouse/issues/40588

                          • zX41ZdbW 2 years ago

                            ClickHouse uses "grace hash" GROUP BY with the number of buckets = 256.

                            It can handle aggregation state about 256 times larger than memory, because only one bucket has to be in memory while merging. It works for distributed query processing as well and is enabled by default.

                            About the linked issue - it looks like it is related to some extra optimization on top of what already exists.
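
The two-phase grace-hash scheme described in that comment can be sketched roughly like this (an illustrative Python sketch, not ClickHouse's actual code; spilling buckets to disk is elided):

```python
from collections import defaultdict

NUM_BUCKETS = 256  # the bucket count mentioned in the comment above

def grace_hash_group_by(rows, key_fn, agg_fn, init):
    """Two-phase grace-hash GROUP BY: partition rows by hash of the
    key, then aggregate one bucket at a time, so only about
    1/NUM_BUCKETS of the aggregation state is in memory at once."""
    # Phase 1: partition. A real engine would spill each bucket to
    # disk instead of holding it in a Python list.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[hash(key_fn(row)) % NUM_BUCKETS].append(row)

    # Phase 2: all rows for a given key land in the same bucket, so
    # each bucket can be aggregated and emitted independently.
    result = {}
    for bucket in buckets:
        partial = defaultdict(lambda: init)
        for row in bucket:
            k = key_fn(row)
            partial[k] = agg_fn(partial[k], row)
        result.update(partial)
    return result
```

For example, `grace_hash_group_by([("a", 1), ("b", 3), ("a", 2)], lambda r: r[0], lambda acc, r: acc + r[1], 0)` returns `{"a": 3, "b": 3}` while only ever materializing one bucket's aggregation state at a time.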

                            • riku_iki 2 years ago

                              > only one bucket has to be in memory while merging.

                              it's hard for me to judge the implementation details, but per that person's reply, memory is also multiplied by the number of threads doing the aggregation.

                      • benn0 2 years ago

                        Do you happen to have any documentation about your benchmarking? I'm also considering these options at the moment (currently using pg+timescaledb) and I'm interested in what you found.

                        • riku_iki 2 years ago

                          I don't have documentation.

                          I just created large tables, then tried to join, group by, and sort them in pg, clickhouse, and duckdb, looked at what failed or was slow, and tried to resolve it.

                          I am happy to answer specific questions, but I didn't use timescaledb.

                      • arp242 2 years ago

                        "2-5x times slower" can mean the difference from 2 seconds to 4 to 10 seconds. Two seconds is still (barely) acceptable for interactive usage, ten seconds: not so much. You're also going to need less beefy servers, or fewer servers.

                        I also "just" use PostgreSQL for all of this by the way, but the limitations are pretty obvious. You're much more limited in what you can query with good performance, unless you start creating tons of queries or pre-computed data and such, which have their own trade-offs. Columnar DBs are "overhyped" in the sense that everything in programming seems to be, but they do exist for good reasons (the reason I don't use it are because they also come with their own set of downsides, as well as just plain laziness).

mlenhard 2 years ago

Congrats on the launch!

For those who have not experimented with columnar based databases, I would highly recommend toying around with them.

The performance improvements can be substantial. Obviously there are drawbacks involved in integrating a new database into your infrastructure, so it is exciting to see columnar format introduced to Postgres. It removes the hurdle of learning, deploying, and monitoring another database.

say_it_as_it_is 2 years ago

Hydra Columnar is a fork from Citus Columnar c. April, 2022.

  • thih9 2 years ago

    Citus: https://github.com/citusdata/citus

    BTW, Citus license is GNU Affero General Public License (github lists “conditions: same license”) and hydra is Apache. How is that possible if the latter is a fork? There’s probably something about these licenses I’m not aware of and I’m curious.

    • jerrysievert 2 years ago

      hydra columnar inherits its license from citus: https://github.com/hydradatabase/hydra/blob/main/columnar/LI...

      but, hydra itself is more than just the columnar extension.

      • thih9 2 years ago

        Thanks for explaining. This is confusing to me as a GitHub user: if I saw a license in the project's description, I wouldn't expect another license in a subdirectory.

        GitHub now has UI for repos with multiple licenses: https://github.blog/changelog/2022-05-26-easily-discover-and... That would have been clearer for me.

        • mdaniel 2 years ago

          they must not look at anything other than the top directory, then, because this repo's sidebar only says "Apache-2.0 license". I grant that this is clarified down at the _bottom_ of the README, but if that multi-license feature really wants to be helpful, it should at least look one subdirectory down.

hgimenez 2 years ago

Always awesome to see folks moving Postgres forward. Congrats on the launch!

ryanb_wise 2 years ago

Talked to OP last night and played around with it this morning. This is something I've wanted to see added to postgres for a long time, and couldn't have been done by a nicer and more accommodating founder. Very excited.

zhendlin 2 years ago

awesome project - we tested Zing Data ( http://www.zingdata.com ) with Hydra to bring really fast analytical queries on Postgres to analytics users on mobile, and so far we have seen great results.

techwizrd 2 years ago

This looks really impressive, and I'm excited to see how it performs on our data!

P.S., I think the name conflicts with Hydra, the configuration management library: https://hydra.cc/

  • entuno 2 years ago

    And also the password bruteforcing tool by THC.

anagri 2 years ago

Congrats on the launch.

Curious, why not contribute to Citus? Why create a fork and have duplicated effort?

  • cosmojg 2 years ago

    Because then it would be much harder to bait investors. By forking, they can more confidently say things like, "We built Hydra to solve real problems affecting real people who will part with oodles of money to access our totally unique solution."

kevincox 2 years ago

I would love to have the ability to define "column groups" in regular Postgres. For example most of the time I just leave the table row-oriented, but when trying to optimize performance of specific tables and queries I could tell postgres to store one or a handful of columns separately similar to a column oriented database. This would allow getting quite a bit of performance out of my database for the few cases in OLTP where I need to do some scans or searches. Even just breaking out some rarely used columns into a separate group could be quite nice for keeping the caches hot.
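
As a rough illustration of that "column group" idea (a hypothetical `GroupedTable` class, not Postgres internals), a scan over the hot group never touches bytes from the rarely used columns:

```python
class GroupedTable:
    """Hypothetical sketch: frequently scanned columns are stored
    together in a 'hot' group, while wide, rarely used columns live
    in a separate 'cold' group. All names here are illustrative."""

    def __init__(self):
        self.hot = {"price": [], "qty": []}   # scanned by analytics queries
        self.cold = {"description": []}       # rarely touched, kept separate

    def insert(self, price, qty, description):
        # a logical row is split across the two groups
        self.hot["price"].append(price)
        self.hot["qty"].append(qty)
        self.cold["description"].append(description)

    def revenue(self):
        # the scan reads only the hot group, so the (potentially large)
        # description bytes never enter the cache
        return sum(p * q for p, q in zip(self.hot["price"], self.hot["qty"]))
```

For example, after inserting `(10, 2, "widget")` and `(3, 5, "gadget")`, `revenue()` returns `35` having read only the two hot columns, which is the cache-locality benefit the comment describes.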

samgaw 2 years ago

From my initial 5 minutes of getting set up with an existing install, it's all fairly easy.

# Requirements

  - lz4
  - zstd

# Building

  git clone https://github.com/hydradatabase/hydra
  cd hydra/columnar
  ./configure
  make
  make install

# Install

  CREATE EXTENSION IF NOT EXISTS columnar;

The actual extension docs are at https://docs.hydra.so/concepts/using-hydra-columnar

chrisjc 2 years ago

Very out of touch with Postgres, but is there a native column oriented table type option in Postgres so that you choose either row-based or columnar in the CREATE TABLE DDL?

joshgray 2 years ago

Congrats to the entire Hydra team on the launch! We (Artemis - https://www.artemisdata.io/) are stoked to be building with you as a partner to help data teams analyze data even faster!

mitchpatin 2 years ago

super impressive performance improvements!

do most of your customers replicate their postgres database to Hydra for analytics jobs, or what's the typical setup?

florianherrengt 2 years ago

Can you add the extension to an existing database?

  • ahmedfromtunis 2 years ago

    That's also my question. Couldn't find anything for it in the docs provided.

    Also, how to migrate data from an existing database? (Is it the usual pg_dump/psql combo?)

    • d4rkp4ttern 2 years ago

      Indeed, I am also wondering: if I have an existing Postgres DB, how do I leverage this?

dikei 2 years ago

For a Postgres extension, there's a strange lack of documentation on how to add Hydra to an existing PG installation.

s-mon 2 years ago

This looks wild! Been looking for a good event based logs DB and didn’t want to go full clickhouse. This will do!

KingOfCoders 2 years ago

How does it compare to Timescaledb? Performance? Features? (TSDB is my current Postgres data lake setup)

burcs 2 years ago

This is really cool, just played around with it a bit but excited to do a deeper dive later. Nice work guys!

  • coatue 2 years ago

    Thanks! feel free to DM me in the hydra discord (or email) anytime

zrizavi17 2 years ago

Such a game changer and useful alternative to legacy databases!

quadrature 2 years ago

How are updates handled, is it doing a merge on read ?

  • wuputah 2 years ago

    First we added a bitmask to mark rows as deleted - these rows are filtered out on read. Then updates are implemented as deletions + inserts. We have also added vacuum functions to remove/rewrite stripes that have >20% of deleted rows in order to reclaim space and optimize those stripes.
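
That delete-bitmask / delete-plus-insert / vacuum scheme can be sketched in miniature (Python with illustrative names; the real implementation operates on compressed columnar stripes, not Python lists):

```python
class Stripe:
    """Toy model of the scheme described above: a delete bitmask,
    update = delete + insert, and a vacuum that rewrites the stripe
    once more than 20% of its rows are dead."""
    VACUUM_THRESHOLD = 0.20

    def __init__(self, rows):
        self.rows = list(rows)
        self.deleted = [False] * len(self.rows)  # the bitmask

    def read(self):
        # merge-on-read: rows marked deleted are filtered out
        return [r for r, dead in zip(self.rows, self.deleted) if not dead]

    def delete(self, pred):
        for i, row in enumerate(self.rows):
            if pred(row):
                self.deleted[i] = True

    def update(self, pred, new_row):
        # an update is implemented as a deletion plus an insertion
        self.delete(pred)
        self.rows.append(new_row)
        self.deleted.append(False)

    def maybe_vacuum(self):
        # rewrite the stripe to reclaim space once enough rows are dead
        if sum(self.deleted) / len(self.deleted) > self.VACUUM_THRESHOLD:
            self.rows = self.read()
            self.deleted = [False] * len(self.rows)
```

For example, updating one of three rows leaves 25% of the stripe dead, so a subsequent vacuum rewrites it with only the live rows.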

mch82 2 years ago

Is hydra compatible with the PostGIS extension?

mdaniel 2 years ago

I wanted to say thank you for using actual Open Source licenses. It's gotten to where I treat any recent "Launch HN" or "Show HN" containing "open source" in the title as "well, I wonder which crazy license this project is using"

  • dang 2 years ago

    Can you point me to examples of Launch HNs using funky licenses? Show HNs are free-form but Launch HNs are curated by us, and I'd like to know what red flags to watch for.

    (As this is offtopic for the current Show HN, it might be better to email hn@ycombinator.com if you, or anyone, would be willing to share that way.)

    • mdaniel 2 years ago

      I sent email to avoid being a distraction, but I did want to follow up publicly and say that I apologize for lumping Launch HN into the same bucket as Show HN. For the most part the Launch ones are really Open Source and I apologize for the over generalization :-(

wrowan33 2 years ago

Congrats on the success!

pajep 2 years ago

can I ask if you guys take contributors?

  • wuputah 2 years ago

    of course :) drop by our Discord if there's something you'd like to contribute and want to chat about it beforehand, need help/have questions getting started, etc. https://hydra.so/discord

ushakov 2 years ago

Are you funded?

  • coatue 2 years ago

    Check out the "about us" section on the page for more details! "We are excited to share that Hydra raised a $3.1M seed round to drive development of columnar Postgres. We remain committed to sharing our upcoming releases to open source."

carlod 2 years ago

Congrats guys!

Iwan-Zotow 2 years ago

> to query billions of rows instantly

Rows? Rows?!? What's the point of having a columnar DB to query rows?

  • ithkuil 2 years ago

    Columnar DBs often allow you to have tables consisting of multiple columns where values in a column are correlated with values in the other columns. All such correlated values belonging to different columns are commonly called a "row" despite not being stored contiguously.

    Generally what you do is scan a column and evaluate a predicate on each value you encounter during the scan (possibly in parallel). For each value that matches the predicate, you keep track of the "position" of the value in the column (a common technique is a sparse set data structure, such as a roaring bitmap). Then you scan through another column and select the values at the saved "positions".

    As you can see, it's not a stretch to view values from different columns that belong to the same "position" as belonging to the same "row", and the "position" as the "row index" or "row id".
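
That scan can be sketched in a few lines of Python (a plain set stands in for the roaring bitmap, and everything is uncompressed and in memory):

```python
def columnar_select(columns, filter_col, pred, project_col):
    """Scan one column, record the positions whose values match the
    predicate, then fetch another column's values at those positions.
    The values sharing a position form a logical "row"."""
    # a plain Python set stands in for a sparse set / roaring bitmap
    positions = {i for i, v in enumerate(columns[filter_col]) if pred(v)}
    return [v for i, v in enumerate(columns[project_col]) if i in positions]
```

For example, `columnar_select({"price": [5, 50, 7, 90], "name": ["a", "b", "c", "d"]}, "price", lambda p: p > 10, "name")` returns `["b", "d"]`: positions 1 and 3 match the predicate on the price column, and the same positions are then read from the name column.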

    • Iwan-Zotow 2 years ago

      Sure, you could think of a row as "the tuple of values returned from all columns at the same index". But why was it so important to measure and present row access for a primarily columnar store? What's the point of measuring and counting row access? Don't existing row DBs work for that case?