Thursday, 23 October 2008

Final CR for Naga - with more benchmarks

So I've decided to cut one more CR for 3.0.0. I wasn't originally planning on it, and this release would otherwise have gone gold, but a few important JIRAs made it in that have a pretty significant impact on performance (northwards, of course). JBCACHE-1419 is particularly notable: it contributed a huge performance boost to async replication.

Anyway, to cut a long story short: download this final CR here and provide feedback here. The changelog is in JIRA, and all-new documentation is also available at last. If all goes well, I hope to cut a GA towards the end of next week, so I would appreciate as much feedback as possible, both on the release itself and on the documentation, tutorials, sample configs, etc.

And now for the fun stuff - I've finally found the time to put together some replicated benchmarks, to go with the standalone benchmarks I published a while back.

Kit used

The benchmarks were run on Red Hat's lab cluster of 8 servers, connected via gigabit ethernet.

  • Quad-CPU Intel Xeon 3GHz servers with 4GB of RAM each
  • RHEL 4 x86_64 with kernel version 2.6.9-42.0.10.ELsmp
  • SUN JDK 1.5.0_11-b03 (32-bit)
  • Benchmarks generated using the CacheBenchFramework
  • A single thread on each server instance, reading 90% of the time and writing 10% of the time
NB: You can click on the images to download full-sized versions
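The workload above is easy to picture as a loop. Here is a minimal, self-contained sketch of a 90% read / 10% write mix (this is illustrative only, not the actual CacheBenchFramework code; the plain `Map` stands in for a real cache instance):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class ReadWriteMix {
    // Runs 'ops' cache operations with roughly a 90/10 read/write split
    // and returns the number of writes actually performed.
    static int run(Map<String, byte[]> cache, int ops, long seed) {
        Random random = new Random(seed);
        int writes = 0;
        for (int i = 0; i < ops; i++) {
            String key = "key-" + random.nextInt(100);
            if (random.nextInt(100) < 10) {   // 10% of operations are writes
                cache.put(key, new byte[1024]);
                writes++;
            } else {                          // the remaining 90% are reads
                cache.get(key);
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        int writes = run(new HashMap<String, byte[]>(), 10000, 42L);
        System.out.println("writes: " + writes + " / 10000");
    }
}
```

In the real benchmark each of the 8 server instances runs a loop like this against a replicated cache, and throughput is measured across the cluster.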

Benchmark 1: Comparing 2.2.0 "Poblano" and 3.0.0 "Naga" (synchronous replication)

This benchmark compares 2.2.0.GA with 3.0.0.CR2 using synchronous replication, with pessimistic locking for Poblano and MVCC for Naga.

The performance gain is between 10 and 20% for synchronous replication, and a pretty consistent 20% when using buddy replication. A large part of this comes from MVCC locking, but other improvements in the code base play a part too, including more efficient marshalling.

The total aggregate throughput chart shows that buddy replication (still) scales almost linearly. Increasing cluster size directly benefits the overall throughput the entire cluster can handle.
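The intuition behind that near-linear scaling can be captured in a toy cost model (all numbers here are made up for illustration and are not the benchmark's figures): with full replication every write fans out to all N-1 peers, so per-node throughput drops as the cluster grows, whereas with buddy replication each write goes to a fixed number of buddies, so per-node throughput stays flat and the aggregate grows linearly.

```java
public class ScalingModel {
    // Toy cost model, illustrative numbers only. Each write is replicated
    // to 'replicasPerWrite' peers; replication dominates write cost.
    static double aggregateThroughput(int nodes, int replicasPerWrite) {
        double readCost = 1.0, writeCost = 1.0;
        double replCost = 2.0 * replicasPerWrite;        // hypothetical per-replica cost
        double opCost = 0.9 * readCost + 0.1 * (writeCost + replCost); // 90/10 mix
        double perNodeOpsPerSec = 1000.0 / opCost;       // hypothetical baseline
        return perNodeOpsPerSec * nodes;
    }

    public static void main(String[] args) {
        for (int n = 2; n <= 8; n += 2) {
            // Full replication: each write hits all n-1 peers.
            // Buddy replication: each write hits a single buddy.
            System.out.printf("%d nodes: full=%.0f ops/s, buddy=%.0f ops/s%n",
                    n, aggregateThroughput(n, n - 1), aggregateThroughput(n, 1));
        }
    }
}
```

Under this model the buddy-replication aggregate quadruples going from 2 to 8 nodes, while full replication grows far more slowly, which matches the shape of the charts.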

Benchmark 2: Comparing 2.2.0 "Poblano" and 3.0.0 "Naga" (asynchronous replication)

This had to be a separate benchmark from the synchronous one: asynchronous throughput is so much higher than sync replication that a combined chart would have been unreadable!

JBCACHE-1419 is the main contributor to the phenomenal performance gains in async replication in Naga. If you use JBoss Cache with async replication, I strongly encourage you to give Naga a try! :-)

Benchmark 3: Competitive analysis

This is something a lot of people have been asking me for. For this benchmark, I've pitted JBoss Cache 3.0.0.CR2 against EHCache 1.5.0, Terracotta 2.5.0 and a popular commercial, closed-source distributed cache which will have to remain unnamed. I've called this Cache X in my charts.
I've used synchronous replication throughout since this is all that Cache X supported.

Both EHCache and Cache X outperform Naga on a 2-node cluster (only slightly in Cache X's case, rather more in EHCache's), but both quickly fall behind as the cluster size increases, particularly EHCache. This is also reflected in the overall throughput chart. Buddy replication still shows linear scalability.

All of these benchmarks are reproducible using the cache benchmark framework, but as with all benchmarks, these should be used as a guideline only. Real performance can only be measured in your environment, with your specific use case and data access patterns.



muuh said...

Would be nice to see the performance versus memcached.

Alex Miller said...

Hey Manik,

I glanced at a couple things about the Terracotta test and wanted to mention that:
- The test is using Terracotta 2.5.0 which is almost a year old. Our latest release is 2.7.0. That seems misleading to me, but maybe it's unintentional.
- The Terracotta configuration is using synchronous write locks which is a stronger level of locking than people usually use - simple write locks usually suffice.
- The TerracottaWrapper uses a HashMap with a single lock which is not a good implementation. I would suggest a ConcurrentHashMap. As with normal concurrent Java programming, striping locks when accessing a data structure will give greater concurrency.
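The difference Alex is describing can be sketched with two minimal wrappers (a hypothetical interface, not the actual TerracottaWrapper code): one that serialises everything on a single monitor, and one that delegates to `ConcurrentHashMap`, whose internal lock striping lets threads touching different segments proceed in parallel.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WrapperSketch {
    // Single-lock wrapper: every get/put synchronises on one monitor,
    // so all concurrent readers and writers queue behind each other.
    static class CoarseLockedCache {
        private final Map<Object, Object> map = new HashMap<Object, Object>();
        public synchronized Object get(Object key) { return map.get(key); }
        public synchronized void put(Object key, Object value) { map.put(key, value); }
    }

    // Striped wrapper: ConcurrentHashMap partitions its buckets into
    // independently locked segments, so operations on different keys
    // rarely contend with each other.
    static class StripedCache {
        private final ConcurrentHashMap<Object, Object> map =
                new ConcurrentHashMap<Object, Object>();
        public Object get(Object key) { return map.get(key); }
        public void put(Object key, Object value) { map.put(key, value); }
    }

    public static void main(String[] args) {
        StripedCache cache = new StripedCache();
        cache.put("session-1", "payload");
        System.out.println(cache.get("session-1"));
    }
}
```

With a single benchmark thread per node the two wrappers behave identically, which is part of why the single-thread setup can hide this kind of contention.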
- I was happy to see how much simpler the TerracottaWrapper code was than the alternatives, though. :)

I'm a little puzzled by the test and config itself, too. Why is it a single thread per node? Generally cache users tend to use multiple threads per node. Running a single-thread test hides any single-VM contention and masks what happens in a real use case.

On the cache contents, the key is a List<String>[3] and the values are byte[] - that also seems really weird to me. In practice, we see that the overwhelming use case is String->Object graph. I guess maybe you're trying to replicate the need to put serialized objects in the session cache? Seems like this setup neatly sidesteps two of Terracotta's advantages: the ability to avoid serialization plus fine-grained object change deltas without sending the whole value graph.

Seems like all of this is actually trying to simulate a clustered HTTP sessions implementation. Session data tends to be highly partitioned when sticky sessions are used, but the test is not written to be partitioned at all, so it doesn't match the use case well. Sessions are actually a use case natively supported by the Terracotta sessions product, which leverages the session ID to lock only on a per-session basis; this avoids basically all map contention and greatly increases performance.

We've actually been working on a more realistic reference web application called The Examinator, with load testing up to 20,000 concurrent users on a 16-node cluster. You can find some details about it here, although we won't be releasing officially for a little bit longer.

At some point, we'll take a look at this test and see if moving to a more modern version and better data structure / locking choice would make the Terracotta results a little more realistic.

Alex Miller
Terracotta Engineer

Manik Surtani said...

Regarding memcached, yes this is something we want to do. In fact, if you have the time/bandwidth and feel like contributing a memcached wrapper for the framework, we'd appreciate it. :-)

Manik Surtani said...

Hi Alex

Thanks for your comments. Regarding Terracotta versions: thanks for pointing this out; it's something I need to update.

Regarding the write locks, the other caches I compared with were all configured with synchronous replication, so to be fair Terracotta would have to use synchronous write locks as well.

You are correct in that the test attempts to simulate clustered HTTP session replication, and it does assume sticky sessions, hence the use of the session ID as the first element of the "path" list in the test. So I'm guessing you're suggesting synchronizing on the session ID and having a separate HashMap per session for the TerracottaWrapper? (In the case of JBoss Cache, since it is a regionalized cache, using the path when storing data is enough to provide adequate isolation.)
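The per-session partitioning being discussed here can be sketched in plain Java (this is an illustrative stand-in, not the TerracottaWrapper or the JBoss Cache implementation; the class and method names are invented for the example): one inner map per session ID, locked independently, so threads working on different sessions never contend.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionPartitionedCache {
    // One attribute map per session, keyed by session ID. Locking only
    // the inner map means threads handling different sessions proceed
    // in parallel, which is the same isolation a regionalized cache
    // gets from putting the session ID first in the path.
    private final ConcurrentHashMap<String, Map<String, Object>> sessions =
            new ConcurrentHashMap<String, Map<String, Object>>();

    private Map<String, Object> sessionMap(String sessionId) {
        Map<String, Object> attrs = sessions.get(sessionId);
        if (attrs == null) {
            sessions.putIfAbsent(sessionId, new HashMap<String, Object>());
            attrs = sessions.get(sessionId);
        }
        return attrs;
    }

    public void put(String sessionId, String attribute, Object value) {
        Map<String, Object> attrs = sessionMap(sessionId);
        synchronized (attrs) {              // lock only this session's map
            attrs.put(attribute, value);
        }
    }

    public Object get(String sessionId, String attribute) {
        Map<String, Object> attrs = sessions.get(sessionId);
        if (attrs == null) return null;
        synchronized (attrs) {
            return attrs.get(attribute);
        }
    }

    public static void main(String[] args) {
        SessionPartitionedCache cache = new SessionPartitionedCache();
        cache.put("abc123.node1", "user", "alice");
        System.out.println(cache.get("abc123.node1", "user"));
    }
}
```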

Making the test multithreaded is on my TODO list.


Alex Miller said...

Regarding the Terracotta locks, I think maybe there's a misunderstanding of how Terracotta works. Normal Terracotta write locks are fully coherent - all changes will be applied on all other nodes before they return. Synchronous write locks in Terracotta give a greater level of certainty that the data has actually been written all the way down to disk before returning from a synchronized block. Certainly the vast majority of our users find normal write locks to be the appropriate locking level. I don't believe using them here is comparable, but that's up to you.

I missed the nodeId in the sessionId for the tests, so I see now how it's testing stickiness.

I'm not really suggesting any change in the structure of the Terracotta wrapper (other than possibly the data structure). I don't think it's easy to reproduce what we're doing internally in our sessions product in this framework since it makes some assumptions that aren't really valid for us. I was really just making an observation that our session support is a lot more scalable than this implementation.

I find it hard to take seriously a benchmark like this using a single thread. Especially for a sessions simulation when you're likely to have a large number of threads per node handling web requests.

Mingfai said...

Benchmarks 1 and 2 show good improvement. I wonder if, in the future, you could run a benchmark with more than 8 nodes to find out at how many nodes the throughput begins to drop.

Manik Surtani said...

Any time you feel like providing me with access to a bigger cluster, I'd be happy to run more benchmarks. ;)

Girish Adat said...

Has a load test been conducted for JBossCache? For example, for storing huge amounts of data in the cache?