What category theory teaches us about dataframes

mchav.github.io

190 points by mchav 2 months ago

The article starts well, trying to condense pandas' gazillion inconsistent and continuously deprecated functions, each with tens of keyword arguments, into a small set of composable operations, but then it lost me.

The more interesting nugget for me is about this project they mention: https://modin.readthedocs.io/en/latest/index.html called Modin, which apparently went to the effort of analysing common pandas uses and compressed the API into a mere handful of operations. Which sounds great!

Sadly for me the purpose seems to have been rather to then recreate the full pandas API, only running much faster, backed by things like Ray and Dask. So it's the same API, just much faster.

To me it's a shame. Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful. The speed is usually not a concern for me - slow operations often seem to be avoidable, and my data tends to fit in (a lot of) RAM.

I can't see that their more condensed API is public facing and usable.

bbkane 1 month ago

Check out polars. I find it much more intuitive than pandas as it looks closer to SQL (and I learned SQL first). Maybe you'll feel the same way!
- Lyngbakr 1 month ago
  
  Agreed — I much prefer polars, too. IIRC the latest major version of pandas even introduced some polars-style syntax.
  
  Patient0 1 month ago
  
  which makes sense because I believe that polars was written by the same guy that did pandas (hence the name - panda and polar are bears)
  
  wenc 1 month ago
  
  Polars is Ritchie Vink. Pandas is Wes McKinney.
- rich_sasha 1 month ago
  
  I've looked at Polars. My sense is that Pandas is an interactive data analysis library poorly suited to production uses, and Polars is the other way around. Seemed quite verbose for example. Sometimes doing `series["2026"]` is exactly the right thing to type.
  
  entropicdrifter 1 month ago
  
  You can do that in Polars, too
  
  mwexler 1 month ago
  
  With some of the newest 3.x changes like copy-on-write, I find pandas getting quite verbose now as well.
  In a world where AI is writing the code, I guess I shouldn't complain, but when I am discovering something the AI of choice yet again missed, both pandas and polars still feel verbose and lacking sugar.
sweezyjeezy 1 month ago

The pandas API is awful, but it's kind of interesting why. It was started as a financial time series manipulation library ('panels') in a hedge fund and a lot of the quirks come from that. For example the unique obsession with the 'index' - functions seemingly randomly returning dataframes with column data as the index, or having to write index=False every single time you write to disk, or it appending the index to the Series numpy data leading to incredibly confusing bugs. That comes from the assumption that there is almost always a meaningful index (timestamps).
- gwerbin 1 month ago
  
  > The pandas API is awful
  I hate to be the "you're holding it wrong" guy but 90% of "Pandas bad!" posts I find are either outright misinformed or mischaracterizing one person's particular opinion as some kind of common truth. This one is both!
  > That comes from the assumption that there is almost always a meaningful index (timestamps)
  The index can be literally any unique row label or ID. It's idiosyncratic among "data frames" (SQL has no equivalent concept, and the R community has disowned theirs), but it's really not such a crazy thing to have row labels built into your data table. Excel supports this in several different ways (frozen columns, VLOOKUP) and users expect it in just about any table-oriented GUI tool.
  > having to write index=False every single time you write to disk
  If you're actually using the index as it's meant to be used, you'd see why this isn't the default setting.
  > functions seemingly randomly returning dataframes with column data as the index
  I assume you're talking about the behavior of .groupby() and .rolling()? It's never been random. Under-documented and hard to reason about group_keys= and related options, yes. But not random.
  > appending the index to the Series numpy data leading to incredibly confusing bugs
  I've been using Pandas professionally almost daily since 2015 and I have no idea what this means.
  
  _diyar 1 month ago
  
  I think the commenter you are replying to might well understand these nuances. The point is not that Pandas is inscrutable, but instead that it's annoying to use in many common use-cases.
  
  sweezyjeezy 1 month ago
  
  > but it's really not such a crazy thing to have row labels built into your data table.
  Sometimes you need data in a certain order. Sometimes there is no primary key. And it is nuts how janky the pandas API is if you just want the index to mean the current order of the dataframe and nothing else. Oh you did a pivot? I'm just going to make those pivot columns a row label now if that's alright with you. I don't do that for all functions though, you're going to have to remember which ones. Oh you want to sort a dataframe? You better make damn sure you reindex if you're planning to use that with data from another dataframe (e.g. x + y on data from separate dataframes), otherwise I'm going to align the data on indices, and you can't stop me. Also - want to call pyplot.plot(df['column'])? Yeah I'm giving it the data in index order obviously I don't care about that sort you just did. Oh you want to port this data to excel? Well if your row labels aren't meaningful and you don't want "Unnamed: 0" you're going to have to tell me not to. You need to manipulate a multi-index? You're so cute. Have fun with that buddy.
  There is a reason no other dataframe library does this - because it's confusing and cognitive overhead that doesn't need to exist. I've used pandas since ~2013, had this chat with colleagues and many recommend just giving in and maintaining an index throughout. Except I've read their pandas and it sucks because now _you_ need to reason about what is currently the index - because it actually needs to change a lot to do normal things with data. I just use .reset_index copiously and try to make it behave like a normal dataframe library because it's just easier to understand later. Pandas has not earned the right to redefine what a dataframe means.
  At the absolute least, index behaviour should be opt-in, not something imposed on the user.
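
To make the alignment complaint above concrete, here is a minimal sketch, assuming a reasonably recent pandas is installed. The Series names and values are invented for illustration; only `sort_values`, `reset_index` and ordinary `+` are used.

```python
import pandas as pd

# Two Series that start out with the same default integer index.
x = pd.Series([3, 1, 2], index=[0, 1, 2])
y = pd.Series([10, 20, 30], index=[0, 1, 2])

# Sorting reorders the values but each value keeps its original label.
x_sorted = x.sort_values()  # values 1, 2, 3 carry labels 1, 2, 0

# Addition matches rows by label, so the sort is effectively ignored.
aligned = x_sorted + y

# Dropping the old labels makes the addition positional instead.
positional = x_sorted.reset_index(drop=True) + y

print(aligned.sort_index().tolist())
print(positional.tolist())
```

The two results differ, which is the behaviour being described: unless the index is reset after the sort, arithmetic pairs values by label rather than by current row order.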
rdevilla 1 month ago

> Pandas is clearly quite ergonomic for various exploratory interactive analyses, but the API is, imo, awful.
Having previously inherited (and now dispossessed) an un-disentangleable pile of Python, pandas, and SQL hacks reminiscent of a spreadsheet rammed with inscrutable Excel formulae, I have no idea how data scientists collaborate on anything with this technology. It's like when bioinformatics was full of write-only Perl code that was maybe executed successfully once for the purposes of a study or paper, and was kept around for future archaeologists to hopefully one day resuscitate when the need may arise again.
If programmers are expected to just throw garbage like this at the next asshole with the misfortune to have to maintain code that was never designed to be maintained, it's not a surprise that the industry is once again moving towards write-only code, this time produced at scale by LLMs.
It's like we're back to Visual Studio Ultimate slopping out 10k lines of XAML in response to your dragging and dropping in the WYSIWYG. There is a reason nobody does this any more.

few 1 month ago

I felt like one or two decades ago, all the rage was about rewriting programs into just two primitives: map and reduce.

For example filter can be expressed as:

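  from functools import reduce  # reduce is not a builtin in Python 3
  data = range(10)  # sample input
  is_even = lambda x: x % 2 == 0
  mapped = map(lambda x: [x] if is_even(x) else [], data)  # wrap keepers in a list, drop the rest
  filtered = reduce(lambda x, y: x + y, mapped, [])  # concatenate: [0, 2, 4, 6, 8]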

But then the world moved on from it because it was too rigid

mememememememo 1 month ago

Performance aside, it seems you could do most, maybe all, of the ops with those three. I say three because your sneaky plus is a union operation. So map, reduce and union.
But you are also allowing arbitrary code expressions. So it is less lego-like.
mrlongroots 1 month ago

MapReduce is nice but it doesn't, by itself, help you reason about pushdowns for one. Parquet, for example, can pushdown select/project/filter, and that's lost if you have MapReduce. And a reduce is just a shuffle + map, not very different from a distributed join. MapReduce as an escape hatch over what is fundamentally still relational algebra may be a good intuition.
bjourne 1 month ago

Reductions are painful because they specify a sequence of ordered operations. Runtime is O(N), where N is the sequence length, regardless of amount of hardware. So you want to work at a higher level where you can exploit commutativity and independence of some (or even most) operations.
- ux266478 1 month ago
  
  You're right it's primarily a runtime + compiler + language issue. I really don't understand why people tried to force functional programming in environments without decent algebraic reasoning mechanisms.
  Modern graph reducers have inherent confluence and aren't reliant on explicit commutation. They can do everything parallel and out of order (until they have to talk to some extrinsic thing like getting input or spitting out output), including arbitrary side-effectual mutation. We really live in the future.
- toxik 1 month ago
  
  You can reduce in parallel. That was the whole point of MapReduce. For example, the sum abcdefgh can be found by first ab, cd, ef, gh; then those results (ab)(cd), (ef)(gh); then the final result by (abcd)(efgh). That's just three steps to compute seven sums.
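
The pairwise scheme described above can be sketched in plain Python. This is only an illustration: `tree_reduce` is a made-up helper, it runs sequentially, and it merely counts how many levels a parallel runtime would need, on the assumption that the operation is associative.

```python
from operator import add


def tree_reduce(op, items):
    """Combine neighbours pairwise, level by level.

    Returns (result, number_of_levels, number_of_operations).
    All operations within one level are independent of each other,
    so a parallel runtime could execute a whole level at once.
    """
    level = list(items)
    if not level:
        raise ValueError("tree_reduce needs at least one item")
    levels = 0
    ops = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))
            ops += 1
        if len(level) % 2:  # an odd element is carried to the next level unchanged
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels, ops


total, levels, ops = tree_reduce(add, [1, 2, 3, 4, 5, 6, 7, 8])
print(total, levels, ops)  # 36 3 7
```

Eight inputs need seven additions either way, but only three dependent steps instead of seven.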
  
  bjourne 1 month ago
  
  No, you can not. Your example is correct only if addition is associative. And it is not always associative. Hence the need for higher abstractions, where you model commutativity and associativity of certain operations.
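
A concrete case of the associativity caveat is IEEE 754 floating-point addition, where regrouping can change the result. A minimal sketch, using nothing beyond built-in Python floats:

```python
import math

# Same three numbers, two different groupings.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)   # 0.6000000000000001
print(right)  # 0.6
print(left == right)              # False: the grouping changed the rounding
print(math.isclose(left, right))  # True: the difference is only rounding error
```

So a tree-shaped reduction over floats can return a slightly different answer than a left-to-right one, even though both are valid sums up to rounding.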
- heavenlyblue 1 month ago
  
  Reduce is massively parallel for commutative operations
antonvs 1 month ago

There might have been some misunderstanding there.
The point of map/reduce was that it could easily be parallelized across large numbers of machines, for processing very large amounts of data. Hadoop implemented the first open-source example of this.
The limitations on what it could do were well-known from the start. No-one who knew what they were doing proposed that programs should be rewritten that way unless you were processing enough data to need to run them distributed on a cluster, in which case that was often your best option.
Many of the limitations of pure map/reduce were overcome by adding steps to the basic map/reduce parallel pipelines. Apache Spark is one example. It still has map and reduce operations in its pipeline, but it has several other operations as well. Nothing better than map and reduce has been found for the purpose it serves in such pipelines.

pavodive 1 month ago

When I started reading about pandas complexity and the smaller set of operations needed, couldn't help but think of R's data.table simplicity.

Granted, it's got more than 15 functions, but its simplicity seems to me very similar to what the author presented in the end.

Lyngbakr 1 month ago

Back when I used to use Stackoverflow, someone would always come along with a data.table solution when I asked a question about dplyr. The terse syntax seemed so foreign compared to the obvious verb syntax of dplyr. But then I learned data.table and I've never looked back. It's a superb tool!
gwerbin 1 month ago

data.table "simplicity" is actually a huge set of features, they just have a clever and compact way to express those features in code. At the same time, there is effectively no standard-eval programmatic interface for it, which makes it a headache for building programs rather than scripting with. data.table is amazing, but it is anything but simple IMO.

caseyross 1 month ago

Interesting idea. I feel like it could be productive to categorize operations by their result shape as well:

- Row select: From N rows, produce 0-N rows.

- Column select: From N columns, produce 0-N columns.

- Table add: From MxN and OxP tables, produce at most an (M+O)x(N+P) table.

- Table subtract: From MxN and OxP tables, produce at least a 0x0 table.

This line of thinking reveals some normally hard-to-see similarities, such as `groupby` and `dedupe` sharing the same underlying mechanism. (i.e., both are "collapsing" row selects.)
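
One way to see the shared mechanism is a single "collapse" helper that both operations reuse. This is a plain-Python sketch over lists of dicts; `collapse`, the sample rows and the column names are all invented for illustration and are not taken from any library.

```python
def collapse(rows, key, combine):
    """Collapsing row select: N input rows become one output row per distinct key."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return [combine(k, group) for k, group in groups.items()]


rows = [
    {"city": "Oslo", "sales": 3},
    {"city": "Oslo", "sales": 3},
    {"city": "Rome", "sales": 5},
    {"city": "Oslo", "sales": 4},
]

# dedupe: the key is the whole row, and each group collapses to its first member.
deduped = collapse(
    rows,
    key=lambda r: tuple(sorted(r.items())),
    combine=lambda k, group: group[0],
)

# groupby + sum: the key is one column, and each group collapses to an aggregate.
totals = collapse(
    rows,
    key=lambda r: r["city"],
    combine=lambda k, group: {"city": k, "sales": sum(r["sales"] for r in group)},
)

print(deduped)
print(totals)
```

The only differences between the two calls are the choice of key and the function that turns a group into a single row.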

voxleone 1 month ago

It’s almost suspiciously elegant: focus on transformations and their composition, and the structure takes care of itself.

toxik 1 month ago

Pandas and so on exist for the same reason Django's ORM and SqlAlchemy do: people do not want to string interpolate to talk to their database. SQL is great for DBA's, and absolutely sucks for programmers. Microsoft was really onto something with LINQ, in my opinion.

getnormality 1 month ago

Hmm. Folks trying to discover the elegant core of data frame manipulation by studying... pandas usage patterns. When R's dplyr solved this over a decade ago, mostly by respecting SQL and following its lead.

The pandas API feels like someone desperately needed a wheel and had never heard of a wheel, so they made a heptagon, and now millions of people are riding on heptagon wheels. Because it's locked in now, everyone uses heptagon wheels, what can you do? And now a category theorist comes along, studies the heptagon, and says hey look, you could get by on a hexagon. Maybe even a square or a triangle. That would be simpler!

No. Stop. Data frames are not fundamentally different from database tables [1]. There's no reason to invent a completely new API for them. You'll get within 10% of optimal just by porting SQL to your language. Which dplyr does, and then closes most of the remaining optimality gap by going beyond SQL's limitations.

You found a small core of operations that generates everything? Great. Also, did you know Brainfuck is Turing-complete? Nobody cares. Not all "complete" systems are created equal. A great DSL is not just about getting down to a small number of operations. It's about getting down to meaningful operations that are grammatically composable. The relational algebra that inspired SQL already nailed this. Build on SQL. Don't make up your own thing.

Like, what is "drop duplicates"? What are duplicates? Why would anyone need to drop them? That's a pandas-brained operation. You want the distinct keys defined by a select set of key columns, like SQL and dplyr provide.

Who needs a separate select and rename? Select is already using names, so why not do your name management there? One flexible select function can do it all. Again, like both SQL and dplyr.

Who needs a separate difference operation? There's already a type of join, the anti-join, that gets that done more concisely and flexibly, and without adding a new primitive, just a variation on the concept of a join. Again, like both SQL and dplyr.

Props to pandas for helping so many people who have no choice but to do tabular data analysis in Python, but the pandas API is not the right foundation for anything, not even a better version of pandas.

[1] No, row labels and transposition are not a good enough reason to regard them as different. They are both just structures that support pivoting, which is vastly more useful, and again, implemented by both R and many popular dialects of SQL.
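
For readers unfamiliar with the three operations argued for above (distinct on key columns, select with renaming, anti-join), here is a rough plain-Python sketch over lists of dicts. The function names, signatures and sample data are hypothetical; this is not how dplyr or any SQL engine implements them.

```python
def distinct(rows, on):
    """Keep the first row for each distinct combination of the key columns."""
    seen = set()
    out = []
    for r in rows:
        k = tuple(r[c] for c in on)
        if k not in seen:
            seen.add(k)
            out.append(r)
    return out


def select(rows, **columns):
    """Select and rename in one step, written as new_name='old_name'."""
    return [{new: r[old] for new, old in columns.items()} for r in rows]


def anti_join(left, right, on):
    """Rows of `left` whose key columns have no match in `right`."""
    right_keys = {tuple(r[c] for c in on) for r in right}
    return [l for l in left if tuple(l[c] for c in on) not in right_keys]


orders = [
    {"order_id": 1, "cust": "ann"},
    {"order_id": 2, "cust": "bob"},
    {"order_id": 3, "cust": "ann"},
]
shipped = [{"order_id": 2}]

customers = distinct(orders, on=["cust"])
renamed = select(orders, id="order_id", customer="cust")
unshipped = anti_join(orders, shipped, on=["order_id"])

print(customers)
print(renamed)
print(unshipped)
```

Here the "difference" of two tables falls out of `anti_join` without a separate primitive, and renaming is just an option of `select`.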

fn-mote 1 month ago

Amen.
The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.
> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?
DangitBobby 1 month ago

I guess I have pandas brain because I definitely want to drop duplicates, 100% of the time I'm worried about duplicates and 99% of the time the only thing I want to do with duplicates is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. Close second at say, 98% of the time, I want to get a count of duplicates as a sanity check because I know to expect a certain amount of them. Pandas makes that easy too in a way SQL makes really fucking annoying. There are a lot of parts of pandas that made me stop using it long ago but first class duplicates handling is not among them.
And the API is vastly superior to SQL in some respects from a user perspective despite being all over the place in others. Dataframe select/filtering e.g. df = df[df.duplicated(keep='last')] is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is the rest of the language around it with all the indentations, newlines, loops, functions and so on can be too terse or too dense and much harder to read than SQL.
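
For reference, a small sketch of the duplicate handling being described, assuming pandas is installed. The frame and its column names are made up. One thing worth noting: indexing with `df.duplicated(keep='last')` keeps the rows flagged as duplicates, while `drop_duplicates` is the call that removes them.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 2, 3, 3], "value": ["a", "a", "b", "c", "c"]}
)

# True for every repeated row except the last occurrence of each.
dup_mask = df.duplicated(keep="last")

# Sanity-check count of duplicates.
n_dups = int(dup_mask.sum())

# The expression from the comment: selects the rows flagged as duplicates.
only_dups = df[dup_mask]

# Dropping them instead, keeping the last occurrence of each row.
deduped = df.drop_duplicates(keep="last")

print(n_dups)
print(only_dups.index.tolist())
print(deduped.index.tolist())
```

Both calls also accept a `subset=` argument listing the key columns, which is the closest pandas equivalent of a distinct over selected columns.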
- getnormality 1 month ago
  
  Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a select distinct * operation.
  And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.
  
  DangitBobby 1 month ago
  
  Oooh buddy how's the view from that ivory tower??
  But seriously I'm not always in control of upstream data, I get stuff thrown over to my side of the fence by an organization who just needs data jiggled around for one-off ops purposes. They are communicating to me via CSV file scraped from Excel files in their Shared Drive, kind of thing.
  
  getnormality 1 month ago
  
  Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.
  Of course, a lot of one-off data analysis has no rules but get a quick answer that no one will complain about!
  
  DangitBobby 1 month ago
  
  I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.
  
  getnormality 1 month ago
  
  I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!
  
  mamcx 1 month ago
  
  > Duplicates in source data are almost always a sign of bad data modeling
  Nope. Duplicates in source data (INPUT) are natural and correct, and MUST be supported, or almost all data becomes impossible to work with.
  The actual problem is the OUTPUT. Duplicates in the OUTPUT need to be controlled and explicit. In general, the OUTPUT needs rows that are unique by an N-column key but probably not unique on the rest, so, in the relational model, you need uniqueness over a combination of columns (rarely over ALL of them).
  
  doug_durham 1 month ago
  
  Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well modeled data. Pandas is a power tool for making sense of real data.
- gregw2 1 month ago
  
  You articulate your case well, thank you!
  I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.
  
  DangitBobby 1 month ago
  
  When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.
  
  michaelbarton 1 month ago
  
  Exactly. It’s not that getting rid of duplicates is bad, it's that they may be a symptom of something worse. E.g. incorrect aggregation logic
getnormality 1 month ago

On reflection I think it's possible I may have missed the potential positive value of the post a bit. Maybe analyzing pandas gets you down to a set of data frame primitives that is helpful to build any API. Maybe the API you start with doesn't matter. I don't know. When somebody works hard to make something original, you should try to see the value in it, even if the approach is not one you would expect to be helpful.
I stand by my warnings against using pandas as a foundation for thinking about tabular data manipulation APIs, but maybe the work has value regardless.
doug_durham 1 month ago

SQL only works on well defined data sets that obey relational calculus rules. Pandas is a power tool for dealing with data as you find it. Without Pandas you are stuck with tools like Excel.
mr_toad 1 month ago

> just by porting SQL to your language
You make it sound like writing an SQL parser and query engine is a trivial task. Have you ever looked at the implementation of a query engine to see what’s actually involved? You can’t just ‘build on SQL’, you have to build a substantial library of functions to build SQL on top of.
- gwerbin 1 month ago
  
  Also it's not like dplyr is anything close to a "port" of SQL. You could in theory collect dplyr verbs and compile them to SQL, sure. That's what ORMs typically do, and what the Spark API does (and its descendants such as Polars).
  "Porting" SQL to your language usually means inventing a new API for relational and/or tabular data access that feels ergonomic in the host language, and then either compiling it to SQL or executing it in some kind of array processing backend, or DataFusion if you're fancy like that.
  
  getnormality 1 month ago
  
  dplyr straightforwardly transpiles to SQL through the dbplyr package, so it's semantically pretty close to a port, even though the syntax is a bit different (better).
gwerbin 1 month ago

> There's no reason to invent a completely new API for them
Yes there is: SQL is one of many possible ways to interact with tabular data, why should it be the only one? R data frames literally pioneered an alternative API. Dplyr is fantastic for many reasons, one of those being that people like the verb-based approach.
Furthermore I argue that dplyr is not particularly similar to SQL in the way you actually use it and how it's actually interpreted/executed.
As for the rest I feel like you're just stating your preferences as fact.
BigTTYGothGF 1 month ago

"The only tool I'm willing to use is a hammer, and by god I'll turn everything into nails."
therobots927 1 month ago

I couldn’t agree more. But at the same time I try to stay quiet about it because SQL is the diamond in the rough that 95% of engineers toss into the trash. And I want minimal competition in a tight job market.

jiehong 1 month ago

Dupe of a thread from a few days ago:

- https://news.ycombinator.com/item?id=47567087

hermitcrab 1 month ago

>a dataframe is a tuple (A, R, C, D): an array of data A, row labels R, column labels C, and a vector of column domains D.

What is 'a vector of column domains D'? A description of how the data A maps to columns?

throw_await 1 month ago

I think "domain" here is like the datatype
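
Reading "domain" as "datatype", as suggested above, the tuple could be sketched like this. The `Frame` class and its `is_consistent` check are hypothetical and only illustrate the definition quoted from the article; the article's own formalism may differ in detail.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Frame:
    """Hypothetical container mirroring the (A, R, C, D) tuple."""

    A: List[List[Any]]  # data array, stored row-major
    R: List[Any]        # row labels, one per row of A
    C: List[str]        # column labels, one per column of A
    D: List[type]       # column domains, read here as one Python type per column

    def is_consistent(self) -> bool:
        """Labels must match the shape of A and every cell must lie in its column's domain."""
        if len(self.A) != len(self.R) or len(self.C) != len(self.D):
            return False
        for row in self.A:
            if len(row) != len(self.C):
                return False
            if not all(isinstance(cell, dom) for cell, dom in zip(row, self.D)):
                return False
        return True


good = Frame(A=[[1, "a"], [2, "b"]], R=["r0", "r1"], C=["n", "s"], D=[int, str])
bad = Frame(A=[[1, "a"], [2, "b"]], R=["r0", "r1"], C=["n", "s"], D=[int, int])

print(good.is_consistent())  # True
print(bad.is_consistent())   # False: column "s" holds strings, not ints
```

Under this reading, D is not a mapping from A to columns (that is C's job) but a per-column constraint on which values may appear.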

jeremyscanvic 1 month ago

It's very insightful how they explain the difference between dataframes and SQL tables / standard relational structures!

kiviuq 1 month ago

there is also ZIO Prelude and ZIO schema...

jmount 1 month ago

I like this sort of study, but it really misses the point to not give more credit for some of the observations and designs to Codd and others.

hermitcrab 1 month ago

I guess this article is an interesting exercise from a pure maths point of view. But, as someone developing a drag and drop data wrangling tool, the important thing is creating a set of composable operations/primitives that are meaningful and useful to your end user. We have ended up with 73 distinct transforms in Easy Data Transform. Sure they overlap to an extent, but we feel they are at the right semantic level for our users, who are not category theorists.

mrlongroots 1 month ago

Algebras are also nice for implementations. If you can decompose a domain into a few algebraic primitives you can write nice SIMD/CUDA kernels for those primitives.
To your point, I wonder if the 73 distinct transforms were just different defaults/usability wrappers over these. And you may also get into situations where kernels can be fused together or other batching constraints enable optimizations that nice algebraic primitives don't capture. But that's just systems---theory is useful in helping rethink API bloats and keeping us all honest.
- hermitcrab 1 month ago
  
  They are effectively high-level wrappers over the most primitive operations. High enough level that they can be used from a GUI, rather than code.
  It is a balance. Too few transforms and they become too low level for my users. Too many and you struggle to find the transform you want.
  
  jimbokun 1 month ago
  
  You don’t have to limit the transforms you offer users to just the core ones. But for your own sanity you can implement the non-core ones in terms of the core ones.
whattheheckheck 1 month ago

Have you heard of the book Mathematics of Big Data?
https://github.com/Accla/d4m
He says himself the ideas are more important than the software package
- hermitcrab 1 month ago
  
  D4M seems to be a library, not a book. Or am I missing something?
  
  esafak 1 month ago
  
  https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...
tikhonj 1 month ago

You can have both: you start with a small, mathematically inspired algebraic core, then you express the higher-level more user-friendly operations in terms of the algebraic core.
As long as your core primitives are well designed (easier said than done!), this accomplishes two things: it makes your implementation simpler, and it helps guide and constrain your user-facing design. This latter aspect is a bit unintuitive (why would you want more constraints to work around?), but I've seen it lead to much better interface designs in multiple projects. By forcing yourself to express user-level affordances in terms of a small conceptual core, you end up with a user design that is more internally consistent and composable.
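
A toy version of that layering, in plain Python over lists of dicts. The two "core" primitives and the three user-facing operations are invented names for illustration, not any library's API:

```python
# Core: two hypothetical primitives.
def filter_rows(rows, pred):
    """Keep the rows for which pred(row) is true."""
    return [r for r in rows if pred(r)]


def map_rows(rows, fn):
    """Replace every row by fn(row)."""
    return [fn(r) for r in rows]


# User-facing operations, each written only in terms of the core.
def rename(rows, old, new):
    return map_rows(rows, lambda r: {(new if k == old else k): v for k, v in r.items()})


def drop_nulls(rows, column):
    return filter_rows(rows, lambda r: r[column] is not None)


def with_column(rows, name, fn):
    return map_rows(rows, lambda r: {**r, name: fn(r)})


rows = [{"qty": 2, "price": 5.0}, {"qty": None, "price": 1.0}]

result = with_column(
    drop_nulls(rename(rows, "qty", "n"), "n"),
    "total",
    lambda r: r["n"] * r["price"],
)
print(result)  # [{'n': 2, 'price': 5.0, 'total': 10.0}]
```

Because every user-facing function returns the same kind of value the core consumes, the operations compose without special cases, which is the consistency benefit being described.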
- jimbokun 1 month ago
  
  For one thing it gives users of your library fewer concepts to learn.
  
  hermitcrab 1 month ago
  
  Yes, but fewer concepts may not be simpler in practice. E.g. assembler is simpler than C++, but I wouldn't want to write a big program in assembler.