PRQL as a DuckDB Extension

105 points by tosh 2 years ago

I love the idea of PRQL and having a DuckDB extension will make it a lot easier to play with since I'm already using that for my hobby data science projects.

cced 2 years ago

Can you share your use of this? I’m interesting in using duckdb and wanted a boost from people with more experience.
Currently playing with postgress and dbt.
- datadrivenangel 2 years ago
  
  You can replace local postgres with DuckDB and it should be even faster for data analysis.
  
  vgt 2 years ago
  
  And if you want a severless hosted DuckDB with managed storage, sharing, UI, and ecosystem support, check out MotherDuck[0]!
  (head of produck at MotherDuck)
  [0]https://motherduck.com/
  
  datadrivenangel 2 years ago
  
  Do y'all have a reference architecture or templates for local development with CICD pipelines?
  My last data engineering team struggled to get something like that working with BigQuery, so I'm super excited about the possibility of better data warehouse developer tooling.
- teworks 2 years ago
  
  I replaced a ton of pandas code with an embedded DuckDB for internal data application and got a massive performance boost and (arguably) cleaner code.

snthpy 2 years ago

"Querying DuckDB with PRQL" by Learn Data with Mark (6 min) https://youtu.be/Rdohw424DA4?si=jWydM-LMo2c3ulrh

PRQL: A Modern Language for Data Transformation (11 min) https://youtu.be/t4-f9vjq2lc?si=Qx3k5oAq5A9THU0G

kthejoker2 2 years ago

Disclaimer: I work at Databricks ... so hopefully my subsequent opinion Is worth even more.

PRQL and DuckDB are amazing. I've been building a Power BI clone - think low code ETL, semantic layer, dashboards + chatbot - on top of Databricks SQL with PRQL as transformation engine and DuckDB as caching / aggregation layer. (Also huge shoutout to Toby and sqlglot.)

So easy to generate things programmatically, PRQL is the perfect map to low code GUIs, Arrow in and out, easy to scale ...

Big fan. Every analytics engineer should try them.

datadrivenangel 2 years ago

The post-modern data stack is going to be PRQL + DuckDB + Prefect, and it's going to be much smaller and cheaper for most analytics.

cced 2 years ago

Thoughts on dbt?
- nerdponx 2 years ago
  
  Great idea, kind of a chaotic mess in practice. Better than nothing by far, but the industry I think will be eager to receive an improved alternative.
  The problem with any tool like Dbt that abstracts over differences in databases is that a huge amount of work goes into building "adapters" to support the various details and quirks of each supported database. That ends up being a substantial technological moat which inhibits the growth of competitor systems. Another option is to do what Datasette did and focus on supporting one specific database, gradually expanding to a second database after years of demand for it.
esafak 2 years ago

Are you sure prefect is better than flyte? How so?
https://neptune.ai/blog/best-workflow-and-pipeline-orchestra...
bonchicbongenre 2 years ago

I'm with you at least 2/3 of the way. My preferred stack is PRQL + DuckDB + Dagster. I evaluated the space for work at my current company (was originally only DE, handling ingests from ~300 sources across various systems, on order of ~1k downstream tables in dbt + hundreds of dashboards + a handful of business-critical data+app features; now leading a small team).
I came away ranking dagster first, prefect second, everything else not close. IMO dagster wins fundamentally for data engineers bc it picks the right core abstraction (software defined assets) and builds everything else around that. Prefect for me is best for general non-data-specfic orchestration as a nearly transparent layer around existing scripts.
Ofc to each their own based on their usecase.

dang 2 years ago

PRQL: Pipelined Relational Query Language - https://news.ycombinator.com/item?id=36866861 - July 2023 (209 comments)

PRQL: a simple, powerful, pipelined SQL replacement - https://news.ycombinator.com/item?id=34181319 - Dec 2022 (215 comments)

Show HN: PRQL 0.2 – a better SQL - https://news.ycombinator.com/item?id=31897430 - June 2022 (159 comments)

PRQL – A proposal for a better SQL - https://news.ycombinator.com/item?id=30060784 - Jan 2022 (292 comments)

didip 2 years ago

Can someone tell me why PRQL is better? I went here: https://github.com/PRQL/prql

It looks nice, but what's the strengths compared to SQL?

klysm 2 years ago

it sounds minor, but having `from` before `select` means you can get autocomplete
- henrydark 2 years ago
  
  I get the sentiment, but personally I can easily imagine myself writing an autocompleter that would work fine with select before from. (I don't write much sql so I don't)
  Just to clarify, my point is that when we do write sql most of us start by writing the from part, and even if we didn't I can just offer all columns from all tables I know about with some heuristic for their order when autocompleting in the select part.
- laerus 2 years ago
  
  Well, you can always type "select from some_table" and then move the cursor back after select and have auto completion. For example Jetbrains DataGrip supports this.
  
  klysm 2 years ago
  
  Yeah that’s what I do it’s just annoying though
snthpy 2 years ago

Have a look at the online playground: https://prql-lang.org/playground/
It starts you off with a very well documented example. Try commenting out one of the lines and watch how the SQL on the RHS changes.
Each line is a separate transformation and follows a logical flow from top to bottom. IMHO it combines the best of SQL and pipelined DSLs like dplyr, LINQ, Kusto, to name just a few. An advantage over things like Pandas is that it still generates SQL so you can take your compute to where your data is (Cloud DWH) and benefit from query optimisation whereas Pandas has to download all your data first and then follows an eager execution model which doesn't benefit from query optimisation. Polars fixes a lot of data but still has the data transfer probleam and it is not universal whereas PRQL can compile to different dialects of SQL like DuckDB, Postgres, BigQuery, MS SQL Server, ...
For a more in-depth overview, watch this:
PRQL: A Modern Language for Data Transformation (11 min) https://youtu.be/t4-f9vjq2lc?si=Qx3k5oAq5A9THU0G
Disclaimer: I'm a PRQL contributor and the presenter of that talk.

oulipo 2 years ago

Nice! Perhaps you could distribute a NPM package with the WASM extension too? Otherwise the alternative is to use the WASM PRQL compiler, and use that in the browser to generate queries for ducksb-wasm

amcaskill 2 years ago

I'm quite excited about this, and would also love to have it distributed as an NPM package.
I work on an OSS web framework for reporting/ decision support applications (https://github.com/evidence-dev/evidence), and we use WASM duckDB as our query engine. Several folks have asked for PRQL support, and this looks like it could be a pretty seamless way to add it.
- oulipo 2 years ago
  
  I added it using prql-js
carlopi 2 years ago

Here it is: https://carlopi.github.io/duckdb-wasm-unsigned////#queries=v...
That is, the duckdb-wasm web shell with loaded PSQL extension

tacone 2 years ago

This is probably stupid but... would scrapscript[1] be a good base for a DSL aimed at replacing SQL?

[1]: https://scrapscript.org/

edit: added link

hintymad 2 years ago
Maybe, but at least it is not immediately clear to me why scrapscript is better than SQL.
The homepage of scrascript says:
```
   > Scrapscript is best understood through a few perspectives:

   >     “it’s JSON with types and functions and hashed references”
   >     “it’s tiny Haskell with extreme syntactic consistency”
   >     “it’s a language with a weird IPFS thing”
```
It later says: "Scrapscript solves the software sharability problem."
None of the statements addresses the commonly recognized shortcomings of SQLs.
In contrast, PRQL's value proposition is clear: it improves the readability and composability of SQL by offering a linear order of transformation through a pipelined structure. In addition, it is based on relational algebra so database professionals can still apply the theoretical framework that they understand and trust to master the language.
- tacone 2 years ago
  
  Hello, thank you for your answer. There's apparently more in scrapscript than what's written on its homepage. I didn't really meant to compare it to PRQL, but at the same time it probably offers some of the propositions of PRQL: extreme composability, functional programming (so: pipelines) and some more.
surprisetalk 2 years ago

Scrapscript author here. I'm a huge fan of PRQL and indeed hope to implement a DSL called "scrounge" later this year. The DSL will be nice to have for codegen and type-safety interop inside of scrapscript, but there's no way it would feel as nice as using PRQL directly.

up2isomorphism 2 years ago

What’s the advantage to learn a new query language when most of the statements can be done in SQL?

bvrmn 2 years ago

It's not very good from prql devs to use artificially bloated (line wise) examples to show SQL is bad.

   from employees
   select {id, first_name, age}
   sort age
   take 10

    SELECT
      id,
      first_name,
      age
    FROM
      employees
    ORDER BY
      age
    LIMIT
      10

Huh? For simple queries SQL is simple. You can't beat it on this field.

aerzen 2 years ago

Yeah, we use an SQL formatter that is a bit too verbose for most tastes. But I do agree, PRQL is not really needed for simple queries - it starts to shine only when you have 5+ transforms (clauses).
- bdcravens 2 years ago
  
  I'd love to see good examples of joining across several tables and applying aggregates. These are the types of queries that are a bit tedious to write but where SQL really provides value vs other ways of storing and extracting data.
timClicks 2 years ago

This style of formatting makes more sense when queries become larger. It's also quite good when you want to have readable diffs to review when queries change.
- bvrmn 2 years ago
  
  Same reasoning then should be applied for PRQL. My issue is unfair depiction of PRQL as "simple" and "concise".
  BTW most modern differs show changed parts of line which really helps not to bother with diff-friendly formatting.

qwertox 2 years ago

"PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement" [0]

[0] https://prql-lang.org/

jeroen79 2 years ago

SQL is good enough and well established, i see no need for another sql wrapper with a different syntax.

datadrivenangel 2 years ago

There are benefits to improving SQL when you have to maintain hundreds of SQL files for a pipeline.
setr 2 years ago

In the world of application languages, we accept that many languages can exist and compete, and serve various needs and improvements. And we curse the alternative, like that the browser forces us into a single language — JS.
In the world of RDBMS languages, we say there is SQL, and only SQL, and none other need ever exist. Ignoring that the SQL standard defines more of an aesthetic of the language than anything actually useful, so you’re only ever dealing with incompatible dialects anyways — there is no one SQL language in practice. Somehow the only way to consider any possible alternative is to drop the relational model altogether, throwing the baby out with the bathwater.
I feel like DBA’s have managed to get themselves stuck in the 80’s in terms of tooling; the glory of Codd squirreled away in this great fear of changing anything at all. COBOL is derided by all, and SQL praised unquestionably — I don’t know where it all went wrong
- nerdponx 2 years ago
  
  I think the concern is more that database engines themselves have a wide range of supported features and functionality, and the variations in SQL mostly reflect the variations in the supported feature set. Developers have also found themselves burned for decades by trying to interact with databases in languages that aren't SQL, including and especially various DSLs and ORM frameworks. The skepticism level is high because the track record for SQL alternatives/replacements isn't good. Maybe PRQL is the one that finally succeeds, but the lack of success until this point is not for lack of trying or imagination.
  
  setr 2 years ago
  
  > variations in SQL mostly reflect the variations in the supported feature set
  I mean, the notion of modifying the language itself to enable various features is a kind of psychosis not seen in decades in the application-language universe -- functions becoming keywords, and flags becoming keywords, and keywords injected randomly for english-ness [TRIM(LEADING "c" FROM col)] is generally absurd. For the most part, these should be modules/libraries/etc but SQL defines an aesthetic, and that aesthetic is a smattering of new keywords for every feature/function.
  > Developers have also found themselves burned for decades by trying to interact with databases in languages that aren't SQL, including and especially various DSLs and ORM frameworks.
  I think the fundamental issue here for DSLs is that at the end of the day, they're all compile-to-SQL languages, because the database does not offer any other API than SQL. And the general confusion is that SQL pretends to be standardized, so DSLs try to support all the databases, and end up implementing only some minimal shared subset... no matter how nice the DSL might be, you end up having to leave it anyways.
  ORMs are a different story; they're not just trying to hide the SQL language, they're trying to hide the relational model itself, and the vast majority of issues you have revolve around that object-relational mismatch. If one gets turned off by the mismatch, fixes it by writing raw SQL directly, and avows to never not-write SQL again (including the avoidance of query builders), then I don't know. They've conflated the two unnecessarily.
  But if you admit to query-builders being useful (and I demand you must; who could look at C# LINQ and think "nah, I LIKE smashing strings together and carefully administering the correct number of commas and precise ordering of clauses"?), then you admit to the possibility a better API exists.
  > Maybe PRQL is the one that finally succeeds, but the lack of success until this point is not for lack of trying or imagination.
  The fundamental confusion I have is that looking at other systems with fancy underlying engines, you would typically see a division between the API/language interface and the engine itself -- the frontend, and backend. And the frontend may be swapped out cleanly. Erlang/Elixer both sit on BEAM. Java/Scala/Clojure sit on the JVM. But an RDBMS vendor will only ever implement and support the singular dialect of SQL; there is nothing else. They'll probably implement support for app-language support for procedural logic (mssql has language extensions, postgres has PL/*, etc), but to execute actual database commands you end up submitting... SQL. There is nothing else.
  It's intimately (and afaik unnecessarily) tied to database engine, and I don't really understand why. And anyone who tries to do anything different must do so as a compile-to-SQL, and inherit all of the issues of transpilation and the underlying language itself. I could understand it as market forces, except even our one hope for good in the database world, postgres, makes the same choice
  
  nerdponx 2 years ago
  
  I really liked this response and I agree with all of it, but I especially agree with this statement:
  > I think the fundamental issue here for DSLs is that at the end of the day, they're all compile-to-SQL languages, because the database does not offer any other API than SQL. And the general confusion is that SQL pretends to be standardized, so DSLs try to support all the databases, and end up implementing only some minimal shared subset... no matter how nice the DSL might be, you end up having to leave it anyways.
  Until databases offer a better interface for actually interacting with the database, we are stuck with SQL, and as long as we are stuck with SQL, X-that-compiles-to-SQL will always be challenging to design and use successfully. I know libpq/Postgres has a binary wire protocol, but I'm not sure what it consists of, or to what extent it improves on the current model of sending a blob of SQL to be parsed the server.
  My bias against query builders specifically I think comes from spending too long with Python, where SQLAlchemy Core is the only standalone query builder in town, and it comes with a lot of ORM-derived baggage and a relatively high level of abstraction (not to mention complexity). This IMO makes it not much better than string templating because it's a relatively large amount of fuss to get it working, unless you also want to take advantage of other features like uniform logging, client-side connection pooling, client-side parameter binding of arbitrary Python objects, supporting multiple database backends in an application like Airflow, etc.
  Whereas in Common Lisp I very much enjoy S-expression-based DSLs like SxQL (https://github.com/fukamachi/sxql), and I agree that they're much better than constructing SQL out of raw strings.
  I'm not aware of a Scheme equivalent, although it looks like Gauche is planning to add one in the future (https://practical-scheme.net/gauche/man/gauche-refe/SQL-pars...): "The plan is to define S-expression syntax of SQL and provides a routine to translate one form to the other."
  
  setr 2 years ago
  
  right; this is why I won't accept that SQL is good-enough, so DSL's aren't worth looking at. The current state of affairs should be despised just as much as JS-only browsers -- these DSLs are an attempt to alleviate the suffering! SQL even shares many of the same issues -- weird semantics (stuffing a trinary logic into boolean operations, enabling both false negatives and false positives), lack of a proper library/module system, inconsistent language, non-composable features, etc. The difference is only that with JS, there are many better-behaved popular languages to look to and criticize it on. With SQL, there are competing languages implementing the relational algebra, but they're mainly serve as a curiosity and few use them in practice; there is SQL and only SQL.
  You can fairly argue that the DSL cannot fundamentally succeed until RDBMS vendors finally buy into it, but you can't argue that SQL is such a great representation of the relational algebra that it isn't worth bothering with; it's just the language we happen to be stuck with.
- zokier 2 years ago
  
  As a tangent, I have this same complaint about how regex are perceived, except the situation is even worse. At least SQL is actually designed, standardized and somewhat readable. In comparison the familiar regex syntax has been organically brewing, from simple kleene star to the impenetrable linenoise it is today, with no rhyme or reason. I know there has been some movement to create new regex languages[1] but like PRQL they are very niche.
  [1] https://github.com/pomsky-lang/pomsky comes to mind, no endorsement
pie_flavor 2 years ago
Are you familiar with what the SQL for the following PRQL looks like?
```
    from employees
    group role (
        sort join_date
        take 1
    )
```
It's pretty complicated and annoying. There's loads of attempts to reinvent SQL, but PRQL is the only one that's designed for what SQL is actually for, and fixes real problems instead of 'ick' problems.
- _dain_ 2 years ago
  
  something like
  select * from employees qualify row_number() over ( partition by role order by join_date ) = 1
  although I don't disagree with you. and `qualify` isn't standard SQL iirc
  
  pie_flavor 2 years ago
  
  Right, if you don't have QUALIFY you need a nested SELECT with named outputs within a WITH. (The example is taken from the website.)
- jeroen79 2 years ago
  
  no, i can't read PRQL, and a have no interest in learning it, it also doesn't really matter if it is a few lines more, if that means more people can read and understand it.