It depends on what you mean by "huge amount of data": is it nodes & properties, or binary content? ModeShape stores the two differently: nodes and properties (the default JCR idioms) are stored in the main repository cache (via Infinispan), while all binary information (for example, the content of files) is stored in a separate binary store for performance reasons: https://docs.jboss.org/author/display/MODE40/Binary+values
Separate repositories mean a high degree of separation and configuration overhead, and whether they make sense depends very much on your use case. If you're planning on storing 20TB of data as nodes & properties, that's probably a good idea, but I'm highly skeptical that you can anticipate, in bytes, how much space your application's nodes & properties will take up.
If, on the other hand, the 20TB is binary content (i.e. binary JCR properties), then as mentioned above the question comes down to binary storage, for which IMO you don't need separate repositories. But you do need to clarify the following:
- If you were not using ModeShape, what would you choose to store the 20TB in? Once you know that, it's very likely you can configure a binary store in ModeShape to use it (be it FS, database, Infinispan, Mongo, Cassandra etc.)
- ModeShape has a "composite binary store" implementation, which is basically an aggregate of different binary stores, each mapped under a key. So you have the option of using different binary stores to store different data. See composite-binary-storage.json (master branch) in the ModeShape/modeshape GitHub repository for an example.
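To make the "aggregate of binary stores mapped under a key" idea concrete, here is a minimal, self-contained Java sketch of the concept. It does not use ModeShape's API: `BinaryStore`, `InMemoryStore`, `CompositeStore`, the `"default"` fallback name and the `storeName:key` format are all hypothetical names chosen for illustration, and the in-memory maps stand in for whatever backend (FS, database, etc.) you would really use.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these types are ModeShape classes. */
final class CompositeStoreSketch {

    /** Minimal stand-in for a binary store: saves bytes and hands back a key. */
    interface BinaryStore {
        String put(byte[] content);
        byte[] get(String key);
    }

    /** In-memory store, used here in place of a file system, database, etc. */
    static final class InMemoryStore implements BinaryStore {
        private final Map<String, byte[]> data = new HashMap<>();
        private int next = 0;

        @Override
        public String put(byte[] content) {
            String key = "bin-" + (next++);
            data.put(key, content.clone());
            return key;
        }

        @Override
        public byte[] get(String key) {
            byte[] value = data.get(key);
            return value == null ? null : value.clone();
        }

        int size() {
            return data.size();
        }
    }

    /** Aggregates named stores and routes each write by a caller-supplied hint. */
    static final class CompositeStore {
        static final String DEFAULT = "default"; // assumed fallback name

        private final Map<String, BinaryStore> named = new LinkedHashMap<>();

        CompositeStore(Map<String, BinaryStore> stores) {
            if (!stores.containsKey(DEFAULT)) {
                throw new IllegalArgumentException("a '" + DEFAULT + "' store is required");
            }
            named.putAll(stores);
        }

        /** Returns "storeName:key" so a later read knows which store to ask. */
        String put(byte[] content, String hint) {
            String name = (hint != null && named.containsKey(hint)) ? hint : DEFAULT;
            return name + ":" + named.get(name).put(content);
        }

        /** Looks up the owning store from the key prefix, then reads from it. */
        byte[] get(String compositeKey) {
            int sep = compositeKey.indexOf(':');
            if (sep < 0) {
                throw new IllegalArgumentException("not a composite key: " + compositeKey);
            }
            BinaryStore store = named.get(compositeKey.substring(0, sep));
            return store == null ? null : store.get(compositeKey.substring(sep + 1));
        }
    }
}
```

In this sketch, a write with the hint `"videos"` lands in the store registered under that name, while an unknown or missing hint falls back to `"default"`. How ModeShape itself selects and configures the underlying stores is described in the binary values documentation linked above.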
Actually, I think in the long term it will be more like 1-2TB of data as nodes and 18TB as binary files. So 1-2TB as nodes is not a big deal?
Anyway thank you for your answer
1-2TB worth of node data is significant, but I'm very curious how you arrived at this size estimate. Normally one would be able to estimate the number of nodes, not their size (the data is stored as BSON documents in Infinispan and there's no easy way to estimate the size).
You can use multiple repositories, but it's really hard to tell how practical this will be in the long run. When dealing with a large number of nodes, how you structure your nodes within a repository is also very important. See also http://modeshape.wordpress.com/2014/08/14/improving-performance-with-large-numbers-of-child-nodes/
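As one illustration of what "structuring" can mean, a common general technique for avoiding a single parent with a huge number of children is to spread nodes across intermediate "bucket" levels derived from a hash of the node name. The sketch below is generic Java, not ModeShape API, and not necessarily the approach described in the linked post; `BucketedPathSketch`, `bucketedPath`, the `/files` root and the choice of SHA-1 with two-hex-digit buckets are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a nested path so siblings are spread over buckets. */
final class BucketedPathSketch {

    /**
     * Builds root/xx/yy/.../name, where each xx is one byte of the SHA-1 of the
     * name written as two hex digits. Each level has at most 256 buckets.
     */
    static String bucketedPath(String root, String name, int levels) {
        if (levels < 0 || levels > 20) { // SHA-1 yields 20 bytes
            throw new IllegalArgumentException("levels must be between 0 and 20");
        }
        byte[] digest = sha1(name);
        StringBuilder path = new StringBuilder(root);
        for (int i = 0; i < levels; i++) {
            path.append('/').append(String.format("%02x", digest[i] & 0xff));
        }
        return path.append('/').append(name).toString();
    }

    private static byte[] sha1(String text) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

With two levels, a million nodes end up spread over up to 65,536 leaf folders instead of sitting under one parent. Whether this layout fits depends on how your application looks nodes up, so it is worth measuring as part of a proof of concept.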
I would recommend doing a POC with 1 repository and looking at the amount of data that you can manipulate in that, based on your use-case.