Saturday 27 August 2011

Disruptor 2.0 Released

Significantly improved performance and a cleaner API are the key takeaways for the Disruptor 2.0 concurrent programming framework for Java.  This release is the result of all the great feedback we have received from the community.  Feedback is very welcome and really improves the end product so please keep it coming.

You can find the Disruptor project here, plus we have a wiki with links to detailed blogs describing how things work.

Naming & API

Over the lifetime of the Disruptor naming has been a challenge.  The funny thing is that with the 2.0 release we have come almost full circle.  Originally we considered the Disruptor as an event processing framework that often got used as a queue replacement.  To make it understandable to queue users we adopted the nomenclature of producers and consumers.  However the consumers are not true consumers.  With this release the consensus is to return to the event processing roots and adopt the following naming changes.

Producer -> Publisher
Events are claimed in strict sequence and published to the RingBuffer.

Entry -> Event
Events represent the currency of data exchange through the dependency graph of EventProcessors.

Consumer -> EventProcessor
Events are processed by EventProcessors.  The processing of an event can be read only, but can also involve mutations on which other EventProcessors depend.

ConsumerBarrier -> DependencyBarrier
Complex graphs of dependent EventProcessors can be constructed for the processing of an Event.  The DependencyBarriers are assembled to represent the dependency graph.  This topic is the real value of the Disruptor and often misunderstood.  A fun example can be seen playing FizzBuzz in our performance tests.

The ProducerBarrier was always a one-to-one relationship with the RingBuffer so for ease of use its behaviour has been merged into the RingBuffer.  This allows direct publishing into the RingBuffer.

DSL Wizard

The most complex part of using the Disruptor is the setting up of the dependency graph of EventProcessors.   To simplify this for the most common cases we have integrated the DisruptorWizard project which provides a DSL as a fluent API for assembling the graph and assigning threads.

Performance

Significant performance tuning effort has gone into this release.  This effort has resulted in a ~2-3X improvement in throughput depending on CPU architecture.  For most use cases it is now an order of magnitude better than queue based approaches. On Sandybridge processors I've seen over 50 million events processed per second.

Sequence tracking has been completely rewritten to reduce the usage of hardware memory barriers, indirection layers, and megamorphic method calls resulting in a much more data and instruction cache friendly design.  New techniques have been employed to prevent false sharing because the previous ones got optimised out by the Oracle Java 7 JVM.

The one area not seeing a significant performance increase is the sequencer pattern.  The Disruptor is still much faster than queue based approaches for this pattern but a limitation of Java hits us hard here.   Java on x86/x64 is using LOCK CMPXCHG for CAS operations to implement the AtomicLong incrementAndGet() method which, based on my measurements, is ~2-10X slower than using LOCK XADD as contention increases.  Hopefully Oracle will see the error of SUNs ways on this and embrace x86/x64 to take advantage of such instructions.  Dave Dice at Oracle has blogged on the subject so I live in hope.

Memory Barriers


Of special note for this release is the elimination of hardware memory barriers on x86/x64 for Sequence tracking.  The beauty in the Disruptor design is that on CPU architectures that have a memory model [1] whereby:

  • loads are not reordered with older loads”, and
  • stores are not reordered with older stores”;

it is then possible to take advantage of the semantics provided by AtomicLong to avoid the use of the Java volatile keyword, and thus hardware fences on x86/x64.  The one sticky rule for concurrent algorithms, such as Dekker [2] and Peterson [3] locks, on x86/x64 is “loads can be re-ordered with older stores”.  This is not an issue given the design of the Disruptor.  The issue relates to the snooping of CPU local store buffers for older writes.  I’m likely to blog in more detail about why this is the case at a later date.  The code should be safe on other CPU architectures if the JVM implementers get the semantics of AtomicLong and Unsafe correct, however your mileage may vary for performance on other architectures compared to x64.

Roadmap

With this latest release it is becoming increasingly obvious how sensitive some CPU architectures are to processor affinity for threads.  When an EventProcessor gets rescheduled on a different core, after its time-slice is exhausted or it yields, the resulting cache pollution really hits performance.  For those who require more extreme and predictable performance I plan to release an Executor service with the Disruptor to allow the pinning of threads to CPU cores.

I'm also thinking of adding a progressive back off strategy for waiting EventProcessors as a WaitStrategy.  This strategy would first busy spin, then yield, then eventually sleep in millisecond periods to conserve CPU resource for those applications that burst for a while then go quiet.

  1. Memory Model: See Section 8.2 of http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
  2. Dekker algorithm: http://en.wikipedia.org/wiki/Dekker%27s_algorithm
  3. Peterson Algorithm: http://en.wikipedia.org/wiki/Peterson%27s_algorithm

13 comments:

  1. I just found org.axonframework.com.lmax:disruptor:2.0 in the central maven repository but http://code.google.com/p/disruptor/issues/detail?id=2 is still open.

    Is this an official artifact and the issue simply isn't updated, yet?

    ReplyDelete
  2. "Dave Dice at Oracle has blogged on the subject so I live in hope"

    An issue has been filed:

    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7023898

    Best,
    Ismael

    ReplyDelete
  3. Huxi,

    There is not an official Maven artefact for the Disruptor yet. We plan to add one soon.

    ReplyDelete
  4. I for one preferred the usage-agnostic nomenclature used previously rather than the event handling terminology adopted here. While event handling is a very common scenario for using the disruptor there are many others that are no less plausible e.g. command handling (Checkout the AxonFramework for instance). It's not every intuitive be talking about events when you are processing commands. The other name and api changes are very welcome however.

    ReplyDelete
  5. Hi guys, interesting work, however I would like to point out that this concept is not new, see link: http://www.mjmwired.net/kernel/Documentation/trace/ring-buffer-design.txt

    ReplyDelete
  6. Thenim,

    The idea of using ring buffer has been around a long time and often used in operating systems and network devices.

    There is nothing special about the Disruptor other than to bring together a group of concepts and implement it well.

    One part I've not seen done elsewhere is the ability to construct dependency graphs between event processors like the Disruptor can with the SequenceBarriers. However I'd not be surprised if someone else has done that before.

    As to the implementation, I've not seen anyone publish an implementation like the Disruptor and get anywhere near the same performance results.

    Martin...

    ReplyDelete
  7. Martin,

    I must admit, I've not had the chance to play with it, just read the doc, and got the impression that it was a new idea! :) Wasn't meant as a trolling attempt...

    As to the open implementation - that's true, I did some research in this space before having to do my own implementation late last year, the article linked above gave me the starting point.

    I will download this and test it out, I'd like to see how my implementation stacks up against it.. :)

    Nim.

    ReplyDelete
  8. Martin,

    In the memory barrier section you mention that on x86/x64 you're able to avoid the use of volatile due to the memory semantics of the x86 being strong enough to enforce the guarantees you need. You also state that on other architectures that this may not be the case and YMMV.

    Shouldn't it be the case that the JVM/JIT is able to optimise away the additional instructions for volatile on x86 because it knows it 'is' an x86/x64 JVM so it knows implicitly the underlying memory semantics.

    On other architectures it should also do the correct thing (optimising away or adding fences as necessary). This is how the GCC memory barriers work on Linux, on some architectures they are no-ops on others they add fence instructions.

    Is this is a case where the JVM implementation is not quite as we'd like it to be? Or are there other subtleties which mean that the JVM is not able to optimise away the volatile keyword?

    Regards
    David

    ReplyDelete
  9. David,

    The JVM/JIT is able to optimise and in the case of x86 the AtomicLong.lazySet() is simply a software, rather than hardware, memory barrier. On other platforms it may need a hardware memory barrier/fence, e.g. ARM.

    StoreLoad semantics are the ones you really need a memory barrier on x86 for otherwise the likes of the Dekker algorithm cannot work.

    The "issue" with the volatile keyword is the semantics are defined for the Java Memory Model regardless of underlying hardware. An example where on x86 it costs more than it should, is the cost of creating an object with a volatile field because the JVM will emit a "LOCK ADD 0" for the memory barrier. This, in my opinion, should be treated like final and not needed if the this reference does not escape during construction.

    Martin...

    ReplyDelete
  10. HI Martin,

    I went through couple of posts in your blog.
    I am a coder myself.
    I am just wondering does such deep knowledge of hardware is really required to code in our daily requirements(day job).
    Pardon my ignorance.

    Abhi

    ReplyDelete
    Replies
    1. It often does in my day job but then I'm often dealing with fast or efficient processing of data.

      Most programmers do not need very detailed knowledge everyday but an appreciation of mechanical sympathy really helps build reasonable efficient software that is also more robust. Being robust is often more important than outright performance. Driving the platform with understanding leads to less surprises and edge cases which make software fragile.

      Delete
  11. Hi Martin,
    What should be the duration of a minor GC for a trading application using the disruptor based on industry standards.

    ReplyDelete
    Replies
    1. That's a "how long is a piece of string question". Every application is different and you have the whole context of your application to consider plus what garbage collector, JVM, OS, hardware, etc. you are running on. There is no reasonable answer to that question.

      Delete