The question I have is schema updates. The biggest pain I have had with things like Mongo is dealing with old data records.
Use case example for Uber:
1. In 2011, a driver joined. They made a bunch of trips.
2. In 2012, Uber added more detail about each trip: information that was not collected for the 2011 trips.
3. And so on; each year there are 'just a few changes'.
Given the above:
In 2016, Uber wants to run a query to reward all drivers based on some piece of information that was only present from 2014 on.
At this point the historical trip information from 2011 is in a significantly different format than in 2016.
In an RDB, at least the old columns are there, or if the db was migrated to a new schema (a pain), the issue of the missing fields was addressed at migration time.
But dealing with data in old formats was an Uber pain. And the lack of visibility into just knowing the schema used to generate that JSON object is a PITA.
God forbid if you had new code that never even knew about the old 2011 format.
Lastly, what happens if a bug slips through and some JSON field is missing, is misspelled (wrong capitalization, say), etc.?
I would love to hear about how old data is handled in Schemaless.
My experience with MongoDB was less than pleasant.
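For what it's worth, one common way people deal with this in JSON stores is to stamp each record with a version and upgrade old records when they are read. The sketch below is only an illustration of that idea, not how Schemaless actually does it: the field names (`schema_version`, `fare`, `rating`), the version numbers, and the helper names are all made up for the example.

```python
import json

CURRENT_VERSION = 3  # hypothetical: the version that new code understands


def _v1_to_v2(rec):
    # Hypothetical 2012 change: "fare" was added; 2011 trips never had it.
    rec = dict(rec)
    rec.setdefault("fare", None)
    rec["schema_version"] = 2
    return rec


def _v2_to_v3(rec):
    # Hypothetical 2014 change: "rating" was added.
    rec = dict(rec)
    rec.setdefault("rating", None)
    rec["schema_version"] = 3
    return rec


# Maps a record's version to the function that upgrades it by one step.
UPGRADES = {1: _v1_to_v2, 2: _v2_to_v3}


def load_trip(raw):
    """Parse a stored JSON trip and upgrade it to CURRENT_VERSION."""
    rec = json.loads(raw)
    # Lower-case the keys so "Distance_km" and "distance_km" are treated alike.
    rec = {key.lower(): value for key, value in rec.items()}
    # Assumption: records written before versioning existed are version 1.
    version = rec.get("schema_version", 1)
    while version < CURRENT_VERSION:
        rec = UPGRADES[version](rec)
        version = rec["schema_version"]
    return rec


old_trip = load_trip('{"driver_id": 7, "Distance_km": 3.2}')
new_trip = load_trip('{"schema_version": 3, "driver_id": 7, "fare": 12.5, "rating": 4.9}')
```

New code only ever sees the current shape, and the missing 2011 fields show up as explicit `None` values instead of missing keys. It does not recover information that was never collected, of course.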
I'm strongly against "NoSQL" for 99% of use cases/scales, but I'll admit that you would have this same problem even with a relational database any time you add a column that you can't auto-populate from pre-existing knowledge.
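A tiny illustration of that point, using SQLite from the Python standard library. The table and column names are invented for the example; the only thing it shows is that rows written before the column existed come back as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (id INTEGER PRIMARY KEY, driver_id INTEGER, year INTEGER)"
)
conn.execute("INSERT INTO trips (driver_id, year) VALUES (1, 2011)")

# Later schema change: a column that cannot be back-filled for old trips.
conn.execute("ALTER TABLE trips ADD COLUMN rating REAL")
conn.execute("INSERT INTO trips (driver_id, year, rating) VALUES (1, 2014, 4.8)")

rows = conn.execute("SELECT year, rating FROM trips ORDER BY year").fetchall()

# Any query on the new column has to decide what to do about the NULLs.
rated = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE rating IS NOT NULL"
).fetchone()[0]
```

The difference from the JSON case is that the gap is at least visible in the schema: the column exists for every row, and the old rows are explicitly NULL.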
Is it me or could they have done this way more easily by building some indexing and triggering functionality on top of Cassandra? Even two years ago when they started. Instead they built sharding, indexing, triggering and a Cassandra-like data model on top of MySQL.
I had the same thought [0] when Matt Ranney mentioned that at QCon SF this past November. Especially since this post calls out a Bigtable-like data model, I'm somewhat perplexed.
[0] https://twitter.com/_wsh/status/666329980515647488
What fun is that when you can abuse technology and then flaunt how clever you are for the abuse?
script kiddies. vitess!
Is it just me or is the reasoning behind the switch from Postgres to MySQL very vague? They describe a sharded MySQL database... Sharding Postgres isn't necessarily any more difficult; Instagram apparently runs it with many shards. You'd think storing JSON in Postgres' pretty sweet jsonb column type would be a nice bonus for querying or indexing.
I guess someone at Uber must really like MySQL, which is as good a reason as any other, I suppose. I'd love to hear what other reasons made MySQL the choice here, as I've usually gone the other way (MySQL to Postgres) for the many great features and the performance Postgres has.
They might've had really experienced MySQL DBAs. While Postgres does seem to be as nice, in my experience it's harder to find people who truly understand it than people who truly understand MySQL. Not saying MySQL is better, but there's more (deep) experience with it out there.
God, what a name. A hyphen might be in order, as in:
...at first I read it as she-males.
Same for me. "Schemalessness" is even worse.
I have no idea why this word caught on instead of "aschematic", which is much easier to parse.
An interesting system with very close semantics that Google built on top of Bigtable: http://static.googleusercontent.com/media/research.google.co... . Since that's built on top of Bigtable, you could in theory extend Schemaless to do 2PC for the cases that need it.
The implementation (using MySQL) seems very close to Vitess (http://vitess.io/overview/) which manages mysql as a series of "tablets", but exposes most MySQL features directly in the query language.
Odd that they chose MySQL when they were previously using Postgres. In particular, Postgres' JSON support is extensive (including indexing, which is now even more extensive [1]) and offers performance benefits over MySQL.
The advantage of MySQL in this situation is probably the support for multimaster replication.
[1] http://pgxn.org/dist/jsquery/
Main lesson: for a new generation of what at first look seems like OLTP business, OLTP pieces like transactional triggers and transactional indexes aren't a requirement anymore. I.e., those requirements seem to be going the same way (south) as the transactional consistency of search indexes went several years ago.