ModeShape roadmap and plans for 4.0

Version 11

Created by rhauch on Aug 30, 2013 12:47 PM. Last modified by escowles on Aug 26, 2014 6:27 PM.

This document is a draft and has not been finalized.

We've been working on ModeShape for well over 5 years, and in that time our community of users and contributors has grown substantially and become very active. Thank you all very much!

We started with an initial JSR-170 (JCR 1.0) implementation, and with ModeShape 2.x switched our API to the newer and expanded JSR-283 (JCR 2.0). We learned a lot from our initial architecture, and about 2 years ago we started a major effort with 3.x to rearchitect ModeShape to dramatically improve scalability and performance by leveraging Infinispan. During those 2 years we've issued 23 releases of 3.x (including alphas and betas) with lots of great features and numerous fixes.

There certainly are areas where we need and want to improve ModeShape. During all those 3.x releases, we've tried very hard to make sure that we do not make incompatible changes to our API, SPI, configuration or storage. That makes it very easy for our users to simply upgrade without having to change and recompile their applications. But we've also put off making some changes in 3.x because they would have required breaking changes to the API, SPI and/or configuration format.

We think now is the time to start working on those bigger changes, and since they will likely entail changing our API/SPI and configuration format, it also means that these will be done with a change in major version number (e.g., 4.x). Here are some of the areas we'd like to focus on:

We'd like to offer simpler ways of configuring a repository. Some specific ideas:
- Consider how we might use/include various templates that provide baselines for common topologies. Examples include a non-clustered in-memory repository, a non-clustered repository that stores all content on the local file system, and a clustered repository. (See the various Fedora 4 Repository configurations for some concrete samples.) EAP kits include several out-of-the-box configurations, so perhaps that's something our kits need to include.
- Consider layering multiple configuration files on top of one another, where the effective configuration is a merger of individual files. (This might be a more effective way of defining defaults, as well as having a library of various "default configurations" or "templates".) Other ideas are welcome!
- Provide a way to more easily load and validate all of the configuration files at once (e.g., perhaps having RepositoryConfiguration resolve the ISPN and JGroups configuration files). See MODE-2033.
- Better documentation and examples. Perhaps separate configuration guides for some common topologies. Figure out and document the do's and don'ts of ISPN clustering.
- Provide a way for sequencers, connectors, and other extension points to be instantiated and "injected" into a running repository. One way would be to move the instantiation of components into Environment, where it could be overridden. Another option might be to allow registering already-configured components with a repository instance. Or, consider a standards-based dependency injection framework (e.g., CDI), though that might require too much baggage and might conflict with applications' environments.
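As a rough illustration of the configuration layering idea in the list above, here is a minimal sketch of how an effective configuration could be computed from several layers. It assumes each configuration file has already been parsed into nested maps. The `LayeredConfig` class and its merge rules (nested maps merge recursively, any other value is replaced by the later layer) are hypothetical and are not part of ModeShape's API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for merging configuration layers; not a ModeShape class. */
final class LayeredConfig {
    private LayeredConfig() {}

    /**
     * Returns a new map in which each later layer overrides the earlier ones.
     * Nested maps are merged key by key; any other value is replaced whole.
     * The input layers are never modified.
     */
    @SafeVarargs
    static Map<String, Object> merge(Map<String, Object>... layers) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layers) {
            mergeInto(effective, layer);
        }
        return effective;
    }

    @SuppressWarnings("unchecked")
    private static void mergeInto(Map<String, Object> target, Map<String, Object> overrides) {
        for (Map.Entry<String, Object> entry : overrides.entrySet()) {
            Object existing = target.get(entry.getKey());
            Object incoming = entry.getValue();
            if (existing instanceof Map && incoming instanceof Map) {
                // Both sides are nested sections: merge them key by key.
                mergeInto((Map<String, Object>) existing, (Map<String, Object>) incoming);
            } else if (incoming instanceof Map) {
                // Copy nested sections so the caller's layer is never mutated later.
                Map<String, Object> copy = new LinkedHashMap<>();
                mergeInto(copy, (Map<String, Object>) incoming);
                target.put(entry.getKey(), copy);
            } else {
                // Scalars and lists from the later layer simply win.
                target.put(entry.getKey(), incoming);
            }
        }
    }
}
```

With this approach a "template" layer could supply defaults such as a storage section, and a small user-provided layer would only need to name the keys it changes. Whether lists should be replaced or concatenated is a design choice this sketch does not try to settle.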
Clustering is still far more difficult than it should be. Infinispan is indeed complicated and we can't do a whole lot about that, but we think that clustering the Lucene indexes is actually one of the most difficult and complicated parts of ModeShape. Our goal is that whatever indexes are used internally should be clustered automatically when a ModeShape cluster is defined. (Yes, there may be some options for tuning how the indexes are clustered, but those should be optional.) Secondly, adding a process to the cluster should not require reindexing the whole repository; ModeShape should be able to seed the new process from one of the other processes. And if a process is brought back into the cluster after being shut down for a period of time, its state should be brought up to date far more intelligently and quickly (MODE-1683, MODE-1903). We can also make ModeShape more aware of cluster/network partitions, perhaps making a partition go into a read-only mode if it is disconnected from most of the rest of the cluster.
- Better documentation probably is important here, too. Especially describing the differences between replication and distribution, and the do's and don'ts of clustering.
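To make the partition-awareness idea above a little more concrete, here is a minimal sketch of one possible rule: a process stays writable only while it can see a strict majority of the expected cluster. The majority rule is one interpretation of "disconnected from most of the rest of the cluster", and the `PartitionPolicy` class, its `Mode` enum, and the way membership counts are obtained are all hypothetical rather than anything ModeShape, Infinispan, or JGroups provides.

```java
/** Hypothetical policy that decides whether a partition may accept writes. */
final class PartitionPolicy {
    enum Mode { READ_WRITE, READ_ONLY }

    private PartitionPolicy() {}

    /**
     * @param visibleMembers      processes this process can currently reach, including itself
     * @param expectedClusterSize number of processes in the fully connected cluster
     * @return READ_WRITE when a strict majority is visible, otherwise READ_ONLY
     */
    static Mode modeFor(int visibleMembers, int expectedClusterSize) {
        if (expectedClusterSize <= 0) {
            throw new IllegalArgumentException("expectedClusterSize must be positive");
        }
        if (visibleMembers < 0 || visibleMembers > expectedClusterSize) {
            throw new IllegalArgumentException("visibleMembers must be between 0 and expectedClusterSize");
        }
        // An exact half is treated as a minority, so two equal halves both go read-only.
        return visibleMembers * 2 > expectedClusterSize ? Mode.READ_WRITE : Mode.READ_ONLY;
    }
}
```

Treating an even split as read-only on both sides trades availability for safety, since neither half can tell whether the other is still accepting writes.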
We're proud of our query engine, but we'd like to expand the query language even more to include aggregates (MODE-1904, MODE-1905), more complex criteria, group by, etc. We'd like the query planner to take into account more optimizations so that it produces results more efficiently and more quickly. And we'd like to be able to push down queries to connectors that talk to systems with their own query/search capabilities (MODE-1686). One way we'd like to do this is by embedding a query engine that is far more powerful and capable, and that has its own active community (MODE-1869).
We currently rely upon Lucene indexes to answer all queries, even when the queries don't involve full-text search. This uses Lucene in non-optimal ways, and it prevents us from leveraging other search/query technology. We should change how ModeShape uses indexes (MODE-2018) and make them extensible (MODE-2023):
- No longer require indexing all content. Instead, allow specific indexes to be defined for a specific property definition (perhaps on a specific nodetype) and then used during query execution. If no indexes are used, the query system still should work (even for full-text search?) but will likely result in slower queries that must scan the entire workspace (e.g., like a "table scan" in a relational database). Slow queries could then be explained and improved by adding/changing indexes.
- Allow for "query index providers" to manage/own one or more indexes, and for a repository to simultaneously use one or more index services. This would be an extension mechanism, allowing us to provide index services that use internal indexes, embedded Lucene (for full-text search indexes), external systems such as Solr and ElasticSearch, or even custom services. Then move the Lucene-based provider into a separate optional module, and upgrade to Lucene 4.x and Hibernate Search 5.x (MODE-2063).
- The query engine would need to incorporate index scoring into the planning process. For example, we might want to evaluate different plans based upon which indexes are available and how well those indexes are scored. This might need a standard definition of indexes, such as knowing which indexes are full-text search or which ones apply to a particular property on a node type.
- It is not clear that we need a standard definition of indexes across all index services, and if so what that would look like (e.g., using CND extensions, or completely separate definitions). If not, then perhaps each index service can define its own index structures.
- Move the query index provider that uses Lucene (and Hibernate Search) components into a separate optional module.
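The indexing ideas in the list above can be sketched in a few lines: providers advertise which properties they can answer and at what estimated cost, the planner picks the cheapest applicable index, and when no index applies it falls back to scanning every node. This is only an illustration under simplifying assumptions (equality criteria on a single string property, an in-memory workspace). The `IndexProvider` interface, `SinglePropertyIndex`, and `SimplePlanner` are hypothetical names, not ModeShape's SPI.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalDouble;

/** Hypothetical extension point; a real provider SPI may look quite different. */
interface IndexProvider {
    /** Estimated cost of an equality lookup on the property, or empty if no index applies. */
    OptionalDouble estimateCost(String property);

    /** Identifiers of nodes whose property equals the value. */
    List<String> lookup(String property, String value);
}

/** Toy provider that indexes one property in memory with a fixed cost estimate. */
final class SinglePropertyIndex implements IndexProvider {
    private final String property;
    private final double cost;
    private final Map<String, List<String>> idsByValue = new HashMap<>();

    SinglePropertyIndex(String property, double cost) {
        this.property = property;
        this.cost = cost;
    }

    void add(String nodeId, String value) {
        idsByValue.computeIfAbsent(value, k -> new ArrayList<>()).add(nodeId);
    }

    @Override
    public OptionalDouble estimateCost(String p) {
        return property.equals(p) ? OptionalDouble.of(cost) : OptionalDouble.empty();
    }

    @Override
    public List<String> lookup(String p, String value) {
        if (!property.equals(p)) {
            throw new IllegalArgumentException("No index on property " + p);
        }
        return new ArrayList<>(idsByValue.getOrDefault(value, new ArrayList<>()));
    }
}

/** Picks the cheapest applicable index, or scans every node when none applies. */
final class SimplePlanner {
    /** Outcome of one query: which plan ran ("index" or "scan") and which nodes matched. */
    static final class Result {
        final String plan;
        final List<String> nodeIds;

        Result(String plan, List<String> nodeIds) {
            this.plan = plan;
            this.nodeIds = nodeIds;
        }
    }

    private final List<IndexProvider> providers;
    private final Map<String, Map<String, String>> workspace; // node id -> properties

    SimplePlanner(List<IndexProvider> providers, Map<String, Map<String, String>> workspace) {
        this.providers = providers;
        this.workspace = workspace;
    }

    Result findByProperty(String property, String value) {
        IndexProvider best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (IndexProvider provider : providers) {
            OptionalDouble cost = provider.estimateCost(property);
            if (cost.isPresent() && cost.getAsDouble() < bestCost) {
                best = provider;
                bestCost = cost.getAsDouble();
            }
        }
        if (best != null) {
            return new Result("index", best.lookup(property, value));
        }
        // No usable index: the equivalent of a relational "table scan".
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> node : workspace.entrySet()) {
            if (value.equals(node.getValue().get(property))) {
                matches.add(node.getKey());
            }
        }
        return new Result("scan", matches);
    }
}
```

Exposing the chosen plan in the result is what would let a slow query be explained and then improved by defining an index on the property it filters on.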
Support JCR's event journal optional feature (MODE-2019). This will largely be enabled by some of the functionality to support our own indexes. We can either store events centrally (e.g., a database) or locally for each repository in each process. If we use a file format for the latter, that file format should likely be append-only (for performance and to enable copying events to newly-started processes).
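For the locally stored option, a minimal sketch of an append-only journal file might look like the following. It assumes one event per line in the form `timestamp<TAB>payload`. The `AppendOnlyJournal` class and this line format are purely illustrative assumptions, not a proposed ModeShape format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical append-only event journal backed by a single text file. */
final class AppendOnlyJournal {
    private final Path file;

    AppendOnlyJournal(Path file) {
        this.file = file;
    }

    /** Appends one event; existing bytes in the file are never rewritten. */
    synchronized void append(long timestampMillis, String event) throws IOException {
        if (event.indexOf('\n') >= 0 || event.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("Event payload must fit on one line");
        }
        String line = timestampMillis + "\t" + event + "\n";
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Returns the payloads of all events recorded at or after the given timestamp. */
    synchronized List<String> readSince(long sinceMillis) throws IOException {
        if (!Files.exists(file)) {
            return Collections.emptyList();
        }
        List<String> events = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            long timestamp = Long.parseLong(line.substring(0, tab));
            if (timestamp >= sinceMillis) {
                events.add(line.substring(tab + 1));
            }
        }
        return events;
    }
}
```

Because the file only ever grows at the end, a newly started process could copy it as-is and then ask for events since the last timestamp it has seen. This sketch reads the whole file on every query; a real journal would likely need rolling files or an offset index.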
Remove our REST client for Java applications. It does not really offer much functionality, and it would need to be rewritten to make it easier for Java clients to interact with a remote ModeShape repository. So we either spend a lot of time rewriting it (MODE-1652, MODE-1718) or we remove it altogether. (Right now we're leaning towards removing it.)
Upgrade from Infinispan 5.x to 6.x, and upgrade JGroups accordingly (MODE-2066). Yes, it might be possible to upgrade to these in 3.x, but at this time it is not clear whether or how ISPN's API and configuration files will change. We think it's better to upgrade to these in our 4.0 effort. See Infinispan Roadmap.
Better local file storage options. Infinispan's FileCacheStore performs pretty poorly, though the Infinispan community is looking at several alternatives in 6.0. (There used to be a JDBM cache store, but JDBM has died off and been rewritten as MapDB; we're looking at MapDB for use in indexes, so maybe we'd like to contribute a MapDB cache store.) We may also want to consider a cache store that is quite a bit more transparent in what and how it stores nodes on the file system (e.g., a file per node); this would be tricky to make properly concurrent, and we'd probably have to assume it would never be shared between multiple processes.
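To show what a "file per node" layout might involve, here is a minimal sketch that writes each node to its own file by first writing a temporary file and then renaming it into place. The `FilePerNodeStore` class, the `.node` suffix, and the identifier rules are hypothetical. The sketch only keeps readers from seeing half-written files; it does not address concurrent writers to the same node, and it assumes a single process owns the directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical store that keeps one file per node in a single directory. */
final class FilePerNodeStore {
    private final Path directory;

    FilePerNodeStore(Path directory) throws IOException {
        this.directory = Files.createDirectories(directory);
    }

    /** Writes the node content to a temporary file, then renames it over the node's file. */
    void write(String nodeId, byte[] content) throws IOException {
        Path target = fileFor(nodeId);
        Path temp = Files.createTempFile(directory, "node-", ".tmp");
        try {
            Files.write(temp, content);
            // Whether an existing target is replaced is implementation specific;
            // on POSIX file systems the rename replaces it atomically.
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up if the write or move failed.
            Files.deleteIfExists(temp);
        }
    }

    /** Returns the stored content, or null when the node has no file. */
    byte[] read(String nodeId) throws IOException {
        try {
            return Files.readAllBytes(fileFor(nodeId));
        } catch (NoSuchFileException e) {
            return null;
        }
    }

    private Path fileFor(String nodeId) {
        // Restrict identifiers so they cannot escape the store directory.
        if (!nodeId.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Unsupported node identifier: " + nodeId);
        }
        return directory.resolve(nodeId + ".node");
    }
}
```

A single flat directory would not scale to large repositories, so a real implementation would probably shard files into subdirectories by identifier prefix.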
Add distributions/kits for Wildfly (MODE-2065), and accept from the community any other kits that make deployment easier. Please let us know if you're interested in these or other web/app servers, especially if you have experience with them and are interested in contributing.
Support for JSR-333, aka "JCR 2.1" (MODE-2064). We've participated in the expert group, which just recently submitted the proposed final draft, and I expect it will pass in the next few weeks or months. JCR 2.1 is largely a minor improvement over 2.0 with some welcome changes and clarifications, some of which we already have implemented. At this point, none of the JCR 1.0 methods deprecated in JCR 2.0 have been removed, so it looks like applications that still use some JCR 1.0 methods will still be able to use ModeShape 4.0.
Performance. Spend time finding the biggest bottlenecks in ModeShape, then work to remove them and measure the performance gains. This is a continual effort.
Federation improvements, including sources with their own search (MODE-1686), their own identities (MODE-1803), and their own versioning (MODE-1188).
Project website (MODE-1625), and optionally move our documentation into it (in Asciidoc format managed in the site's Git repository).

These are just some of the improvements and changes we're considering. See the list of issues currently targeted for 4.0, and remember that this is by no means definitive. Also, some of the issues currently targeted for 3.6 might better be accomplished in 4.0. If you have other ideas for features and improvements, please let us know in the comments below or by editing this document directly.

We plan to keep the same storage formats! This is important: we absolutely want to make it very easy to upgrade a 3.x repository to 4.0. Infinispan will likely offer more cache store options in 6.x, and you can always use ModeShape's backup/restore functionality if you want to change which cache store you use. Your configuration files will need to change, though we're open to requirements and suggestions about how best to solve this (some of this will be influenced by the configuration simplification work described above, but it will largely be driven by the changes in the indexing technology).

As for timeframe, we're starting to prototype some of these changes now, though they're still very rough. At some point, we'll start releasing 4.0 alphas on a regular basis and, when feature complete, will transition to releasing betas. Only after we're feature complete and things are looking stable/solid will we start issuing release candidates, and only when we're happy with those will we issue the final release. Note that even while we work on 4.0, we'll continue to fix issues and add minor enhancements to 3.6 (and possibly 3.7, depending upon how quickly 4.0 betas or CRs appear).

The next steps are:

(DONE) Discuss these and other features/improvements, logging JIRA issues for each separate task. See Initial planning meeting for 4.0.
Triage the issues in 3.6 to see which, if any, would better be done in 4.0.
Triage the issues in 4.0 to prioritize the effort and track any inter-issue dependencies.
Create a new branch for 4.0 work. (TBD whether we create a new branch for 3.x and use 'master' for 4.0, or use 'master' for 3.x and a separate branch for 4.0 work. I prefer the former, especially if most new feature work is targeted to 4.0.) Once we do this, all changes for 3.x will need to be merged onto the 4.0 branch as well.
Start working the 4.0 features, using our normal methodology by assigning issues. We should plan on creating design documents (e.g., wiki pages) for each larger or complex feature.
Fork our documentation. (We'll do this as late as possible, just before we want to start documenting the 4.0 features.)

The following documents describe the design of the 4.0 features:
