Next Generation ModeShape
rhauch Oct 31, 2011 6:04 PM
In March 2008 we announced a new project called 'JBoss DNA' whose purpose was to be a new JCR repository implementation, and one that offered new ways of storing and federating content. Since repositories often store files and metadata, JBoss DNA would also automatically examine those files, extract useful information, and store that derived information back in the repository. The project would also leverage other JBoss.org technologies, including Hibernate, JBoss Cache, JGroups, and the JBoss Application Server. We released seven minor releases of JBoss DNA.
In March 2010, the project was rebranded as ModeShape and the 1.0 release was issued, with support for all of the required and many of the optional parts of the JCR 1.0 API; this was followed by a few minor releases. Support for JCR 2.0 (JSR-283) came a few months later with the release of ModeShape 2.0 in July 2010. This was a pretty straightforward upgrade for our users, since JCR 2.0 was largely an expansion of functionality (with only a handful of methods that were deprecated without some form of replacement). During the next 15 months, we continued releasing new minor versions with bug fixes, new features, and performance improvements. ModeShape is more stable and faster than ever.
But we want to do more, and we can. We want ModeShape to be the fastest, most scalable, and most available JCR implementation there is. We want ModeShape to support large numbers of concurrent clients. We want ModeShape to support flatter hierarchies as well and as fast as deeper ones. We want ModeShape to be much easier to configure, deploy, monitor, and manage. We want seamless integration with JBoss AS7 to make it trivial to use JCR within your web apps. We want to support XA transactions and to participate in distributed transactions. We want to open the door to other kinds of persistent storage, including those that are eventually consistent. We want ModeShape to scale to massively large repositories and even very large clusters.
These goals are not small steps forward. No, they're major leaps beyond where we are now. We'll obviously keep the standard JCR API as our public API, so that will remain consistent for client applications. But we're actually at a perfect place to start considering some significant changes to the way ModeShape works under the covers. With ModeShape 3.0, we have the opportunity to change and improve how ModeShape stores its content, how the sequencers work, how and where our indexes are stored, and how content is cached. We can continue to provide fixes on the 2.x branch while we work on and stabilize the 3.x branch. And no matter what, we must commit to making it very easy to migrate a 2.x installation to a 3.0 installation.
I've spent the last four to five months considering ways we can achieve these goals. More incremental approaches reduce the effort and risk, but make it harder to achieve the goals. However, one approach was so promising that I thought it warranted a full-blown prototype to test it out, and the more I implemented the more promising it looked. Recently I completed enough of it that I could run some performance comparisons using the JCR API. I was floored by the results, even after spending no time optimizing the system:
- Time to create 10K child nodes, perform a save, and persist to disk: 0.8 seconds
- Time to create a subgraph with 1.01M nodes, periodically saving, and persisting to disk: 3.5 seconds
- Time to get a node by path from a workspace with 1M nodes: 0.00058 seconds
Running the same operations in ModeShape 2.x takes significantly longer (in some cases multiple orders of magnitude!), so I think this new design shows tremendous promise.
How does the new approach work? It takes ModeShape's existing JCR implementation and puts it on top of a simple framework that stores each node as a JSON/BSON document inside Infinispan and uses Infinispan's cache loaders for persistence. It uses Infinispan both as a distributed store and a distributed cache, so accessing nodes is quite efficient, and reading from persistent storage only happens if the cache has previously purged that value from memory. Infinispan also supports XA transactions, making it far easier for ModeShape to support them. Infinispan is a data grid that has three modes (local, replicated, and distributed) to dictate how and whether information is copied across the data grid, making it possible to use Infinispan as a massive, distributed, and available in-memory heap where all content can be kept in memory. By using Infinispan this way, a cluster of ModeShape processes effectively becomes a "content grid".
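To make the idea a little more concrete, here is a minimal sketch of the "one document per node" layout, written against plain Java collections rather than Infinispan. Everything in it is an assumption made for illustration: `NodeDocument`, `DocumentCache`, and `ContentGridSketch` are hypothetical names, the two maps merely stand in for the in-memory cache and the cache loader's persistent store, and none of this is ModeShape or Infinispan code.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a node stored as a JSON-like document:
// its own properties plus references (by key) to its children.
final class NodeDocument {
    final String key;
    final Map<String, Object> properties = new LinkedHashMap<>();
    final Map<String, String> childKeysByName = new LinkedHashMap<>();

    NodeDocument(String key) {
        this.key = key;
    }
}

// Hypothetical stand-in for a cache in front of a persistent store.
// The real design delegates this to Infinispan and its cache loaders.
final class DocumentCache {
    private final Map<String, NodeDocument> memory = new HashMap<>();
    private final Map<String, NodeDocument> persistentStore = new HashMap<>();
    private int nextId = 1;
    int storeReads = 0; // counts lookups that had to go to the store

    String newKey() {
        return "node-" + (nextId++);
    }

    void put(NodeDocument doc) {
        memory.put(doc.key, doc);
        persistentStore.put(doc.key, doc); // write-through, for simplicity
    }

    // Simulates the cache purging an entry from memory.
    void evict(String key) {
        memory.remove(key);
    }

    NodeDocument get(String key) {
        NodeDocument doc = memory.get(key);
        if (doc == null) {
            storeReads++; // only reached when the entry is not in memory
            doc = persistentStore.get(key);
            if (doc != null) {
                memory.put(key, doc);
            }
        }
        return doc;
    }
}

final class ContentGridSketch {
    static final String ROOT_KEY = "root";

    static NodeDocument addChild(DocumentCache cache, NodeDocument parent, String name) {
        NodeDocument child = new NodeDocument(cache.newKey());
        parent.childKeysByName.put(name, child.key);
        cache.put(child);
        cache.put(parent); // the parent document changed too
        return child;
    }

    // Resolves a path such as "/a/b" by following child references
    // from the root document, one cache lookup per path segment.
    static NodeDocument getByPath(DocumentCache cache, String path) {
        NodeDocument current = cache.get(ROOT_KEY);
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;
            }
            if (current == null) {
                return null;
            }
            String childKey = current.childKeysByName.get(segment);
            if (childKey == null) {
                return null;
            }
            current = cache.get(childKey);
        }
        return current;
    }
}
```

In this sketch a lookup by path touches the store only for documents that were evicted from memory, which is the behavior described above; distribution, transactions, and real serialization are deliberately left out.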
What does this mean for ModeShape's JCR implementation layer? Most importantly, we're not starting from scratch. The JCR implementation classes in the 2.x codebase have a lot of logic in them to properly implement the JCR specification, but they use the internal graph API and cache mechanism to do this. We'll keep all that logic but will refactor the classes to use the new document-oriented approach on top of Infinispan, and this means that the overall risk of this major refactoring is significantly lower than a clean-sheet rewrite.
What does this mean for ModeShape's connectors? ModeShape's Infinispan and JBoss Cache connectors are no longer needed with this new architecture, simply because Infinispan plays such an important role in the new approach. Infinispan already has a number of cache loaders that can persist cached content in a number of different ways, so some of our connectors can effectively be replaced with existing Infinispan cache loaders. For example, the JDBC cache loaders would work well in place of our JPA connector. Other connectors might not have a direct analog, but may actually be replaced with a cache loader that has more capability. For example, Infinispan has a file system cache loader that's not transactional, and so could be used in place of our disk-based storage connector. But the BerkeleyDB cache loader may actually be a far better and faster replacement that supports transactions. Plus, there are other cache loaders that offer functionality we didn't have before, including JClouds and Cassandra. The federation connector will not be needed, due to built-in support for federation within the new architecture. We can develop ModeShape-specific cache loaders to support the functionality provided by the remaining connectors (e.g., SVN, file system, JDBC metadata). If needed, we can even provide an Infinispan cache loader that works with the database schema used by the 2.x JPA connector.
What does this mean for ModeShape's sequencers? We'll likely change the sequencers to directly use the JCR API, making it much easier for developers already familiar with JCR to write new custom sequencers. We'll also deprecate the existing sequencer API and provide an adapter that can continue to use sequencers written against the older API, giving people a few releases to convert their existing custom sequencers.
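The adapter idea can be sketched with the classic adapter pattern. The interfaces below are hypothetical and intentionally tiny: `LegacySequencer` stands in for a sequencer written against the older API, `NodeSequencer` for one that writes directly to a node-like object, and `OutputNode` for the node it writes to. None of these are the actual ModeShape 2.x or 3.x types, and the real new API is expected to use the JCR `Node` rather than this placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical old-style sequencer: reads content and emits
// name/value pairs into an output map.
interface LegacySequencer {
    void sequence(String content, Map<String, Object> output);
}

// Hypothetical placeholder for the node a new-style sequencer writes to.
final class OutputNode {
    final Map<String, Object> properties = new LinkedHashMap<>();

    void setProperty(String name, Object value) {
        properties.put(name, value);
    }
}

// Hypothetical new-style sequencer: sets properties on the node directly.
interface NodeSequencer {
    void sequence(String content, OutputNode outputNode);
}

// Adapter that lets an existing old-style sequencer run wherever
// a new-style sequencer is expected.
final class LegacySequencerAdapter implements NodeSequencer {
    private final LegacySequencer delegate;

    LegacySequencerAdapter(LegacySequencer delegate) {
        this.delegate = delegate;
    }

    @Override
    public void sequence(String content, OutputNode outputNode) {
        // Collect the legacy output, then replay it onto the node.
        Map<String, Object> collected = new LinkedHashMap<>();
        delegate.sequence(content, collected);
        collected.forEach(outputNode::setProperty);
    }
}
```

A custom sequencer written the old way could then be wrapped once, for example `new LegacySequencerAdapter(myOldSequencer)`, and used unchanged until its author has time to port it.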
What about my existing ModeShape repositories? As mentioned above, we want to make it very easy for you (and your customers) to migrate from ModeShape 2.x to 3.x, so we'll be providing utilities that will help convert 2.x configuration files to the newer format and will help migrate the content from the existing stores into the new Infinispan data grid.
Where do we go from here? The next steps are to move this prototype into a branch in the ModeShape Git repository, and to complete its features and continue testing it. We'll want to add more performance tests to verify that this is heading in the right direction and to be able to measure and compare the performance relative to ModeShape 2.x and the reference implementation.
What's the schedule? I would like to issue the first 3.0.0.Alpha1 release within a few weeks, and issue a second Alpha release about 2-3 weeks later. We'll switch to Beta releases as soon as the JCR and ModeShape 2.x features are done, and continue those while we iron out the bugs. When we're confident that ModeShape 3 is functioning correctly, passing the TCK tests, performing very well, and passing all of our unit and integration tests, we can then issue Candidate Releases and move quickly towards a final release.
What can you do? As always, we welcome anyone who wants to contribute. If you're primarily interested in testing and using ModeShape 3, please follow along until we start issuing the Alpha, Beta, and Candidate releases, then start testing those releases and filing JIRA issues for any bugs you find. If, however, you want to get more involved and help implement ModeShape 3 and/or write tests, please let us know! There's lots of exciting and fulfilling work to do.
Where can I learn more? I've attached a PDF file that contains some rough documentation, written from the perspective of a user trying to use ModeShape 3. It's not really what we'll use for documentation, but it's a start at outlining the design of this new approach and what it means for users.
Finally, stay tuned, because tomorrow I'll push out the branch with the code.
Attachment: Introducing ModeShape 3.0.pdf (140.4 KB)