Damien Katz

October 28, 2012

How to achieve lots of code?

I get mail.

hello damien
I read about you from a book on erlang.

Your couchdb application is really a rave.

please can you help me out ,i've got questions only a working programmer can answer.

i'm shooting now:

i've been programming in java for over 3 years

i know all about the syntax and so on but recently i ran a code counter on my apps and
the code sizes were dismal. 2-3k

commercial popular apps have code sizes in the 100 of thousands.

so tell me- for you and what you know of other developers how long does it take to write those large applications ( i.e over 30k lines of code)

what does it take to write large applications - i.e move from the small code size to really large code sizes?

thank you.

Never try to make your project big. Functionality is an asset, code is a liability. What does that mean? I love this Bill Gates quote:

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.

More code than necessary will bloat your app binaries, causing larger downloads and more disk usage; it will use more memory and slow down execution with more frequent cache misses. It can make the app harder to understand and harder to debug, and it will typically have more flaws.

CouchDB, when we hit 1.0, was less than 20k lines of production code, not including dependencies. This included a storage engine (crash tolerant, highly concurrent MVCC with pauseless compactor), language agnostic map/reduce materialized indexing engine (also crash tolerant highly concurrent MVCC with pauseless compactor), master/master replication with conflict management, HTTP API with security model, and simple JS application server.

The small size is partly because it was written in Erlang, which generally requires 1/5 or less of the code of an equivalent in C or C++, and also because the original codebase was mostly written by one person (me), giving the design a level of coherency and simplicity that is harder to accomplish -- but still very possible -- in teams.

Tests are different. Lines of code are more of an asset in tests. More tests (generally) mean more reliable production code, help document code functionality in a way that can't get out of sync the way comments and design docs can (out-of-sync documentation is worse than no documentation), and don't slow down or bloat the compiled output. There are caveats to this, but generally more code in tests is a good thing.

Also, you can go overboard with trying to make code short (CouchDB has some WTFs from terseness that are my fault). But generally you should try to make code compact and modular, with clear variable and function names. Early code should be verbose enough to be understandable by those who will work on it, and no more. You should never strive for lots of code; instead, you want reliable, understandable, easily modifiable code. Sometimes that requires a lot of code. And sometimes -- often for performance reasons -- the code must be hard to understand to accomplish the project goals.

But often with careful thought and planning, you can make simple, elegant, efficient, high quality code that is easy to understand. That should be your goal.

August 30, 2012

CouchConf SF

CouchConf SF is coming.

This is our premier Couchbase event. We're going ham.

Come hear speakers from established enterprises explain how they are betting their businesses on Couchbase.

Hang out and talk with speakers, me and other Couchbase engineers in the Couchbase lounge.

I'll be talking at the closing session. Let me know what you'd like to hear about!

Killer after-party. Witness my drunken antics ;)

Some highlights:

Three tracks and nearly 30 technical sessions for dev and ops

15 customer speakers from companies like:
- McGraw Hill, who will share their experiences and demo their Couchbase Server 2.0 app, including full-text search integration among other features
- Orbitz, who will talk about how they replaced Oracle Coherence with Couchbase NoSQL software
- Sabre, discussing how they are using NoSQL to reduce mainframe costs
- Tencent, who will share their evaluation process (and results) for choosing a NoSQL solution
- Other speakers from LinkedIn, Tapjoy, TheLadders, and more

There are also training sessions for developers and admins on the two days prior to CouchConf, for those who also want more hands-on experience.

When you register, you can get the early bird rate if you use the promotional code Damien.

June 25, 2012

Reminder: Couchbase Community Summer BBQ

We're celebrating summer by throwing a huge outdoor BBQ bash at our Mountain View office: Wednesday, June 27, 2012, from 5:00 PM to 10:00 PM (PT), at our Mountain View, CA HQ.

Please RSVP

June 21, 2012

Why Database Technology Matters

Sometimes I get so down in the weeds of database technology, I forget why databases are so fascinating to me, why I found them so important to begin with. ACID. Latency, bandwidth, durability, performance, scalability. Bits and bytes. Virtual this, cloud that. Blah blah blah. Who the fuck cares?

I care.

Dear lord I care. I care so much it hurts.

"A database is an organized collection of data, today typically in digital form." -Wikipedia

I think about databases so much. So so much. New schemes for expanding their capacity, new ways of making them work, new ways of making them faster, more reliable, new ways of making them accessible to more developers and users.

I spend so much time thinking about them, it's embarrassing. As much time as I spend thinking about them, I feel like I should know so much more than I do.

HTTP, JSON, memcached, elastic clusters, developer accessibility, incremental map/reduce, distributed indexing, intra-cluster replication, cross-cluster replication, tail-append generational storage, disk fragmentation, memory fragmentation, memory/storage hierarchy, disk latency, write amplification, data compression, multi-core, multi-threading, inverted indexes, language parsing, interpreter runtimes, message passing, shared memory, recovery-oriented architectures. All that stuff that makes a database tick.

Why do I spend so much time on this? Why have I spent so many years on them?
Why do they fascinate me so much? Why did I quit my job and build an open source database engine with my own money, when I wasn't wealthy and I had a family to support?

Why the hell did I do that?

Because I think database technologies are among the most important fundamental advancements of humanity and our collective consciousness. I think databases are as important as telecommunications and the internet. I think they are as important as any scholarly library -- and that libraries are the earliest non-digital databases. I think databases are almost as important as the invention of the written word.

Forget SQL. Forget network, document or object databases. Forget the relational algebra. Forget schemas. Forget joins and normalization. Forget ACID. Forget Map/Reduce.

Think knowledge representation. Think knowledge collection, transformation, aggregation, sharing. Think knowledge discovery.

Think of humanity and its collective mind expanding.

When IBM was at the absolute height of its power, they were the richest, most powerful company on the planet. They primarily sold mainframes for a lot of money, and at the core of those mainframes were big database engines, providing a big competitive advantage their customers gladly paid for.

Google has created a database indexing the internet. They are a force because they found ways to find meaning in the massive amounts of information already available. They are a very visible example of changing the way humanity thinks.

File systems are very simple databases. People have been building all sorts of searching and aggregation technology on top of them for many years, to better unlock all the knowledge and information stored within.

Email? Email technology is essentially databases that you can send messages to. It's old fashioned and simple, and yet our email systems keep getting more clever about ways to show us what's in our unstructured personal databases.

Databases don't have to be huge to have a huge impact. SQLite makes databases accessible on small devices. It's the most deployed database on the planet. It's often easy to miss the impact: when it's billions of small installations, it starts to look like air. Something that's just there, all around us. But add it up and the impact is huge.

And of course big bad Oracle. As much as people love to hate them, they've made reliable database technology very accessible, something you can bet your business on, year after year. They are great at not just making the technology work, but the complete ecosystem around it, something necessary for enterprises and mission critical uses. There is a lot to criticize about them, but much to praise as well.

So yes, I care. I care deeply. I care about the big picture. And I care about the bits and bytes. I care about the ridiculously complex details most people will never see. I care about the boring stuff that makes the bigger stuff happen. And sometimes I forget why I care about it. Sometimes I lose sight of the big picture as I'm so focused on making the details work.

And sometimes I remember. And I feel incredibly lucky and privileged for the opportunities to have a positive impact on the collective mind of humanity. And my reward is to know, in some small way, that I've succeeded. And I want to do more. This is important stuff, the most important and effective way I know how to contribute to the world. It matters to me.

May 30, 2012

Stabilizing Couchbase Server 2.0

I'm glad to report we are now pretty much going into full-on stabilization and resource optimization mode for Couchbase Server 2.0. It's taken us a lot longer than we planned. Creating a high performance, efficient, reliable, full-featured distributed document database is a non-trivial matter ;)

In addition to the same "simple, fast, elastic" memcached and clustering technology we had in previous versions of Couchbase, we've added 3 big new features to dramatically extend its capabilities and use cases, as well as its performance and reliability.

Couchstore: High Throughput, Recovery Oriented Storage

One of the biggest obstacles for 2.0 was that the Erlang-based storage engine was too resource heavy compared to our 1.8.x release, which uses SQLite. We did a ton of optimization work and modifications, stripping out everything we could to make it as fast and efficient as possible, and in the process made our Erlang-based storage code several times faster than when we started. But the CPU and resource usage was still too high, and without lots of CPU cores, we couldn't get total system performance where our existing customers needed it.

In the end, the answer was to rewrite the core storage engine and compactor in C, using a format bit for bit compatible with our Erlang storage engine, so that updates written in one process could be read, indexed, replicated, and even compacted from Erlang. It's the same basic tail-append, recovery oriented MVCC design, so it's simple to write to it from one OS process and read it from another process. The storage format is immune to corruption caused by server crashes, OOM killers or even power loss.
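
To make the tail-append, recovery oriented idea concrete, here is a minimal Python sketch. It is not the Couchstore file format: the record layout (a length and a CRC32 in front of each payload) and the names `append_record`, `recover` and `demo` are assumptions made up for illustration. The point is only that when every write lands at the end of the file, a crash can damage at most the tail, and a reader can ignore it.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical record layout: payload length, then CRC32 of the payload.
HEADER = struct.Struct(">II")


def append_record(path, payload):
    """Append one length-prefixed, checksummed record to the end of the file."""
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())


def recover(path):
    """Return every intact record, ignoring a torn or corrupt tail."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or damaged write: stop at the last good record
        records.append(payload)
        pos = start + length
    return records


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.log")
    append_record(path, b"doc-1")
    append_record(path, b"doc-2")
    # Simulate a crash partway through a third write.
    with open(path, "ab") as f:
        f.write(HEADER.pack(100, 0) + b"partial")
    return recover(path)
```

Calling `demo()` returns the two complete records and silently drops the torn third write. Nothing already written is ever modified in place, which is also why a second process can safely read the file while the first keeps appending.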

Rewriting it in C let us break through many optimization barriers. We are easily getting 2x the write throughput of the optimized Erlang and SQLite engines, with less CPU and a fraction of the memory overhead.

Not all of this is due to C being faster than Erlang. A good chunk of the performance boost is just being able to embed the persistence engine in-process. That alone cut out a lot of CPU and overhead by avoiding transmitting data across processes and converting to Erlang in-memory structures. But it's also C, which provides good low-level control and lets us optimize much more easily. The cost is more engineering effort and low-level code, but the performance gains have proven very much worth it.

And so now we've got the same optimistically updating, MVCC capable, recovery oriented, fragmentation resistant storage engine in both Erlang and C. Reads don't block writes and writes don't block reads. Writes also happen concurrently with compaction. Getting all or incremental changes via MVCC snapshotting and the by_sequence index makes our disk I/O mostly linear for fast warmup, indexing, and cluster rebalances. It allows asynchronous indexing, and it also powers XDCR.
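
For readers who haven't seen a by-sequence index before, here is a rough in-memory sketch in Python. The class `SeqStore` and its methods are hypothetical names, and the sketch leaves out MVCC snapshotting and on-disk layout entirely. It only shows why an indexer or replicator that remembers the last sequence number it processed can ask for "everything since then" and get each changed document exactly once, in update order.

```python
class SeqStore:
    """Toy document store with a by-sequence index (hypothetical names)."""

    def __init__(self):
        self.update_seq = 0
        self.by_id = {}    # doc id -> (seq, value)
        self.by_seq = {}   # seq -> doc id, only the latest seq per doc

    def put(self, doc_id, value):
        old = self.by_id.get(doc_id)
        if old is not None:
            del self.by_seq[old[0]]  # the older sequence entry is superseded
        self.update_seq += 1
        self.by_id[doc_id] = (self.update_seq, value)
        self.by_seq[self.update_seq] = doc_id
        return self.update_seq

    def changes_since(self, since=0):
        """Yield (seq, doc_id, value) for updates after `since`, in order."""
        for seq in sorted(self.by_seq):
            if seq > since:
                doc_id = self.by_seq[seq]
                yield seq, doc_id, self.by_id[doc_id][1]


store = SeqStore()
store.put("a", 1)
store.put("b", 2)
checkpoint = store.update_seq       # an indexer has processed up to here
store.put("a", 3)                   # "a" moves to the end of the sequence
store.put("c", 4)
incremental = list(store.changes_since(checkpoint))
```

Here `incremental` holds only the updates to "a" and "c", while `store.changes_since()` with no argument walks every live document once, which is the shape of a warmup or a full index build.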

B-Superstar: Cluster Aware Incremental Map/Reduce

Another big item was bringing all the important features of CouchDB incremental map/reduce views to Couchbase, and combining them with clustering while maintaining consistency during rebalance and failover.

We started out using an index per virtual partition (vbucket), merging results across all indexes at query time, but quickly scrapped that design as it simply wouldn't bring us the performance or scalability we needed. We needed a system that supports MVCC range scans with fast multi-level key based reductions (_sum, _count, _stats, and user defined reductions) and requires the fewest index reads possible.

What we came up with uses the proven CouchDB-based view model: the same JavaScript incremental map/reduce and the same pre-indexed, memoized reductions stored in inner btree nodes for low cost range queries. Yet it can instantly exclude invalid partition results when partitions are rebalanced off a node, or are partially indexed on a new node.

We embed a bitmap partition index in each btree node that is the recursive OR of all child reductions. Due to the tail append index updates, it's a linear write to update modified leaf nodes through to root while updating all the bitmaps. Now we can tell instantly which subtrees have values emitted from a particular vbucket.
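
Here is a small Python sketch of the bitmap idea, assuming a simplified immutable tree rather than our real btree code. The `Node` class, the `subtree_has_vbucket` helper and the sample keys are all made up for illustration. Each leaf sets one bit per vBucket that emitted an entry into it, and each inner node stores the OR of its children's bitmaps.

```python
class Node:
    """B-tree-like node whose bitmap marks which vBuckets appear beneath it."""

    def __init__(self, children=None, entries=None):
        self.children = children or []
        self.entries = entries or []   # leaf only: (key, vbucket_id) pairs
        self.bitmap = 0
        if self.children:
            # Inner node: recursive OR of the child bitmaps.
            for child in self.children:
                self.bitmap |= child.bitmap
        else:
            # Leaf node: one bit per vBucket that emitted an entry here.
            for _key, vbucket_id in self.entries:
                self.bitmap |= 1 << vbucket_id


def subtree_has_vbucket(node, vbucket_id):
    """Answer from the node's own bitmap, without reading any children."""
    return bool(node.bitmap & (1 << vbucket_id))


leaf_a = Node(entries=[("apple", 0), ("berry", 2)])
leaf_b = Node(entries=[("cherry", 2), ("date", 5)])
root = Node(children=[leaf_a, leaf_b])
```

In this example `root.bitmap` is `0b100101` (vBuckets 0, 2 and 5), so a single check at the root already tells us that vBucket 1 has nothing anywhere in the index.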

During steady state we have a system that performs with nearly the same efficiency as our regular btrees (just the extra cost of 1 bit per btree node times the number of virtual partitions).

But we can exclude vBucket partitions by flipping a single bit mask, for rebalance/failover consistency, at a temporarily higher query-time cost until the indexes are re-optimized.

In the worst case, O(logN) operations become O(N) until the excluded index results are removed from the index.

The index is once again in the steady state, and queries are O(logN).
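
The query-time trade-off can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not our implementation: the tree is static, the reduction is a plain count over the whole index rather than a key range, and `Node`, `count_valid` and the sample data are hypothetical. A subtree whose bitmap doesn't intersect the excluded mask can answer from its memoized reduction; any subtree that does intersect has to be descended into.

```python
class Node:
    """Node carrying a memoized count and a bitmap of vBuckets beneath it."""

    def __init__(self, children=None, entries=None):
        self.children = children or []
        self.entries = entries or []          # leaf only: (key, vbucket_id)
        self.bitmap = 0
        if self.children:
            self.count = sum(c.count for c in self.children)  # memoized reduction
            for c in self.children:
                self.bitmap |= c.bitmap
        else:
            self.count = len(self.entries)
            for _key, vb in self.entries:
                self.bitmap |= 1 << vb


def count_valid(node, excluded_mask, visited):
    """Count entries whose vBucket is not excluded; `visited` tallies nodes read."""
    visited.append(node)
    if (node.bitmap & excluded_mask) == 0:
        return node.count                     # memoized value is still valid
    if not node.children:
        return sum(1 for _k, vb in node.entries
                   if not (excluded_mask >> vb) & 1)
    return sum(count_valid(c, excluded_mask, visited) for c in node.children)


leaves = [
    Node(entries=[("a", 0), ("b", 1)]),
    Node(entries=[("c", 0), ("d", 2)]),
    Node(entries=[("e", 1), ("f", 1)]),
]
root = Node(children=leaves)

steady_reads = []
steady_total = count_valid(root, 0, steady_reads)          # nothing excluded

excluded_reads = []
excluded_total = count_valid(root, 1 << 2, excluded_reads)  # vBucket 2 excluded
```

With nothing excluded the answer comes from the root alone (one node read, total 6). Excluding vBucket 2 forces a visit to every child of the root (four node reads, total 5), but only the leaf that actually holds vBucket 2 entries has to be filtered entry by entry. Once those entries are physically removed and the bitmaps are rebuilt, the root can answer by itself again.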

The really cool thing is this also works in reverse, so we can start inserting into a vBucket's new node's view index as it rebalances, but exclude the results until the rebalance is complete. The result is consistent view indexes and queries both during steady state and while actively failing-over or rebalancing.

Cross data center replication (XDCR)

Couchbase 2.0 will also have multi-master, cluster aware replication. It allows for geographically dispersed clusters to replicate changes incrementally, tolerant of transient network failures and independent cluster topologies.

If you have a single cluster and geographically dispersed users, latency will slow down applications for distant users. The farther away a user is and the more network hops they face, the more inherent latency they will experience. The best way to lower latency for far-away users is to bring the data closer to the user.

With Couchbase XDCR, you can have clusters in multiple data centers, spread across regions and continents, greatly reducing the application latency for users in those regions. Data can be updated at any cluster, replicating changes to remote clusters either on a fixed schedule or continuously. Edit conflicts are resolved by using a "most edited" rule, allowing all clusters to converge on the same value.
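
As a rough illustration of how a "most edited" rule lets every cluster reach the same answer without coordinating, here is a hedged Python sketch. The `Version` type, its `edit_count` field and the tie-break (comparing a SHA-256 digest of the value) are assumptions invented for this example, not a description of how XDCR stores or compares revisions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    edit_count: int   # hypothetical per-document counter of edits


def resolve(a, b):
    """Pick a winner the same way on every cluster, regardless of argument order."""
    def rank(v):
        digest = hashlib.sha256(v.value.encode("utf-8")).hexdigest()
        return (v.edit_count, digest)   # most edited wins, digest breaks ties
    return max(a, b, key=rank)


west = Version(value="blue", edit_count=3)
east = Version(value="green", edit_count=5)
winner = resolve(west, east)
```

The important property is that `resolve(west, east)` and `resolve(east, west)` always return the same version, so two clusters that see the conflicting edits in different orders still converge on one value.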

Solid Foundation

I feel like we are just getting started. There is still a ton of detail and new features I haven't gone into; these are just some of the highlights. I'm really proud and excited not just by what we have for 2.0, but by what's possible on the fast, reliable and flexible foundation we've built and the future features and technology we can now easily build. I see a very bright future.

March 27, 2012

0 to 35 million in Six Weeks

If you haven't played Draw Something, you might want to wait until you have some free time: it's creative, social and addictive :) OMGPOP released this game less than 2 months ago, and it's currently the #1 game on Facebook and the #1 app in the iOS app store.

What kind of backend lets you grow a game from nothing to #1 that fast with no downtime? Couchbase baby! Find out more here. Super proud of our guys who built our platform and made this happen. And congrats to OMGPOP for their $200 million sale to Zynga. Nice!

February 22, 2012

Couchbase Housewarming Party

Join us in celebrating our big move into brand new offices! We're throwing a HUGE housewarming party at our new space in Mountain View. And we have quite the entertainment lined up - trust us, you don't want to miss this event. RSVP now to get your spot at this event as we break in the new office (and more importantly, the KEGBOT). Any and all are invited to come celebrate with us!

Our office is pretty damn cool. RSVP now

January 18, 2012

Couchbase Meetup at new HQ

Join us Thursday January 19 at 6:30 PM at our brand new Headquarters (aka Fort Awesome). Join and RSVP here.

January 10, 2012

Why Couchbase?

So apparently my last entry ruffled some feathers, so maybe I should explain why I think Couchbase is the future?

Simple Fast Elastic.

That's pretty much it. We make it very simple to get started, we are extremely fast (and getting faster), and we really are "web scale", with the ability to add and remove machines from a cluster to rapidly scale your capacity to your workload.

The Membase product was very fast and scalable, but a bit too simple, with no reporting capability or cross-datacenter replication capability.

The CouchDB product has a lot of features, but is too slow: it is unable to keep up with high loads and unable to scale out on its own.

The combination of the 2 will hit a sweet spot to allow developers to quickly get their apps up and running, along with the reliability, speed and low cost that make running it in production cheap and worry free.

Our 2.0 product is coming soon, adding CouchDB style views and reporting with a nifty trick for extremely fast failover while maintaining full coherency with the underlying distributed data storage (we are calling it our B-Superstar index). We'll of course have lightning fast reads (same as Memcached) but also very fast durable writes. For 2 KB docs, we are currently getting sustained random insert/update rates of 25k writes/sec, fully durable, with compaction in the background so it can go all day and all night. We've got some more write work coming soon which we are hoping will give us another performance boost before 2.0. Stay tuned.

And so right now the focus is on the features and customers that pay, the things that allow us to build a real sustainable business. And that's REAL DAMN IMPORTANT. It's not enough to build some cool technology, not enough to build a community of excited technologists. You need to cross the chasm and build a real business. A business that provides support, training, documentation and of course a reliable product. A business you can call up when you have difficulty upgrading from an old version, or are getting some weird error you've never seen before at 3am. A business you know will be around to support you for years to come.

And so while we focus on the features and customers that most quickly make us a viable business (and it's growing fast), we are still looking to build the features and technology to expand our use cases and get customers and developers excited. Future versions are planned to have full CouchDB compatible replication technology, with the ability to support all sorts of mobile and embedded databases, such as our new TouchDB projects for iOS and Android. So with Couchbase you can have a fast, scalable database in the cloud that also supports the offline use of thousands, or millions, of apps on devices that drop in and out of internet connectivity, apps that can sync when connected but are still completely usable when disconnected.

That's some cool shit. Simple Fast Elastic. And Reliable. And Mobile. That's why Couchbase.

January 4, 2012

The Future of CouchDB

What's the future of CouchDB? It's Couchbase.

Huh? So what about Apache CouchDB? Well, that's a great project. I founded it, coded the earliest versions almost completely myself, and I've spent a huge amount of blood, sweat and tears on it. I'm very proud of it and the impact it's had. And now I, and the Couchbase team, are mostly moving on. It's not that we think CouchDB isn't awesome. It's that we are creating the successor to it: Couchbase Server. A product and project with similar capabilities and goals, but faster, more scalable, and more customer and developer focused. And definitely not part of Apache.

With Apache CouchDB, much of the focus has been around creating a consensus based developer community that helps govern and move the project forward. Apache has done, and is doing, a good job of that. But for us, it's no longer enough. CouchDB was something I created because I thought an easy to use, peer based, replicating document store was something the world would find useful. And it proved a lot of the ideas were possible and useful, and it's been successful beyond my wildest ambitions. But if I had it all to do again, I'd do many things differently.

If it sounds like I'm saying Apache was a mistake, I'm not. Apache was a big part of the success of CouchDB; without it CouchDB would not have enjoyed the early success it did. But in my opinion it's reached a point where the consensus based approach has limited the competitiveness of the project. It's not personal, it's business.

And now, as it turns out, I have a chance to do it all again, without the pain of starting from scratch. Building on the previous Apache CouchDB and Membase projects, throwing out what didn't work, strengthening what does, and advancing great technologies to make something that is developer friendly, high performance, designed for mission critical deployment and mobile integration, and that can move faster and more responsively to users' and customers' needs than a community based project.

Apache CouchDB, as a project and community, is in fine shape. And many of us at Couchbase are still contributing back to it. But the future, the one I'm pushing forward on, is Couchbase Server.

And what is my part in building Couchbase? Right now I'm focusing on getting Couchbase 2.0 ready for serious production use. I'm once again an engineer and coder, back in the trenches, designing and writing code, reviewing code and designs, helping other engineers and solving tough problems. And I'm dead serious about making it the easiest, fastest and most reliable NoSQL database. Easy for developers to use, easy to deploy, reliable on single machines or large clusters, and fast as hell. We are building something you can put your mission critical, customer facing business data on, and not feel like you're running a dirty hack.

Soon, to work more closely with the team (and get rid of my nasty Oakland commute), I'll be relocating my family to the Mountain View area. Shit just got real!

And I'm really excited about the work we've got in the pipeline. We are moving more and more of the core database into C/C++, while still using many of the concurrency and reliability design principles we've proven with the Erlang codebase. And Erlang is still going to be part of the product as well, particularly with cluster management, but most of the performance sensitive portions will be moving over to C code. Erlang is still a great language, but when you need top performance and low level control, C is hard to beat.

Anyway, there's so much to talk about, too much for one blog post. One of my New Year's resolutions is to blog more, and I've got a ton of interesting things to talk about. The trials and tribulations of building a startup and an engineering culture. What's wrong (and right) with Erlang. Bringing forth UnQL. TouchDB for Mobile. And yes, we'll still interoperate with Apache CouchDB and Memcached. But the future is Couchbase.

Ride with me.

Edit

As J. Chris Anderson notes in the comments, Couchbase is completely open source and Apache licensed:

Everything Couchbase does is open source, we have 2 github pages that are very active:

https://github.com/couchbaselabs

https://github.com/couchbase

Probably the most fun place to jump into development is the code review: review.couchbase.org/

Let me clarify, if you like Apache CouchDB, stick with it. I'm working on something I think you'll like a lot better. If not, well, there's still Apache CouchDB.

Everybody keeps on talking about it
Nobody's getting it done
