2 Replies Latest reply on Apr 20, 2011 11:51 PM by gciuloaica

    Distributed file storage

    gciuloaica

      Hi,

       

      I'm looking for an advise to the following approach/problems:

      Story

      Distributed File Storage [DFS] is a simple distributed storage of unstructured data (files). It should expose a simple CRUD interface to be accessed from different platforms (browser, mobile apps...). The storage should be able to scale horizontally easily such it should be able to accommodate increasing need of storage.

      The service should be able to respond to a client request under 30 ms.

      It should be able to store billions of small files (under 50kb) and large files (Gb).

      Medium load request that should be supported is 300 req/s.

      System should be resilient to data centre failure - to be able to replicate data cross -data centre.

      Approach

      After some architecture research following non- functional requirements and approaches has been identified:

       

      Scalability: Shared nothing architecture - this will allow to horizontally scale the system theoretically infinitely by adding new nodes. HDFS model is not ideal, because it use NameNodes to identify the location of the file in the cluster, that in case of a large system may become a bottleneck, and also because HDFS is optimized for large files.

      High Availability: Async replication of data inside data centre and cross-data centre. To evaluate if 3 copies inside a data centre (in different racks) and 2 copies in other 2 different data centres will be a reasonable approach.

      Routing: - client side routing - it will be hard to support dumb clients like browsers/ mobile clients

      • server side routing - How to resolve the file location and file retrieving with a minimum of network calls? Ideal it should be able to retrieve a file content in one hope in more 50% of cases... to be able support the 30ms requirement for response, it should not touch disk at all...

      Clustering: - few hundred nodes - there are plenty of systems available to handle properly medium size clusters.

      • thousands of nodes - not aware of any available system...

       

       

       

      In order to be able to build a shared nothing system we will have to store routing information in a distributed data grid. We'll not be able to store files in the grid. Most of the data grids are using consistent hashing to split the keys in all available nodes. All consistent hashing algorithms that I'm aware of, are assuming that value is a reasonable size blob, but in our case we can't assume this. How to build a routing algorithm that is able to not route write requests to full nodes?In order to support load requested, it should be able to read from all 3 copies from a datacentre.Re-balancing is not a solution to spread the data uniform in the system. This is because it may end-up moving gigabytes of data between notes.

      Concepts, Terminology

      • Replication area - a group of nodes that mirrors same data.
      • Extended replication area - a replication area that has associated nodes from other data centres.

      Practical approaches:

      HTTP SERVER THAT OFFER CRUD OPERATIONS ON EACH NODE.

      • Netty used to implement HTTP server, taking advantages of zero-copy byte buffer support.
      • Routing information kept in Hazelcast

      Impressed by the number of concurrent request that I was able to achieve on a single node, without using any special caching (just native files system caching). Able to get up to 12k concurrent connection on a regular desktop machine (Dell Dimension, quad code proc, 6 Gb Ram, running CentOS, updated to allow to have unlimited number of connections.). Hazelcast running in same jvm with Netty

      Issues: - Hazelcast is not stable at all - serialization of routing data is not properly working. - I was not able to keep a cluster of 3 nodes up and running enough time to be able to run a performance test on the cluster.

      HTTP SERVER THAT OFFER CRUD OPERATIONS ON EACH NODE.

      • Netty used to implement HTTP server, taking advantages of zero-copy byte buffer support.
      • Routing information kept in Voldemort

      Issues: - Voldemort is more stable than Hazelcast but consume more resource and much complex to install/configure.

      HTTP SERVER THAT OFFER CRUD OPERATIONS ON EACH NODE. - TBD

      • Netty used to implement HTTP server, taking advantages of zero-copy byte buffer support.
      • Routing information kept in Infinispan

       

      So, will Infinispan be able to provide a reliable distributed storage for routing information?

       

      Thanks,
      Gabi

        • 1. Distributed file storage
          mircea.markus

          Interesting problem.

           

          If the amount of data cannot be accomodated by a single cluster, you can use a layer of them: if information is not found at the top layer ask the layer below etc. So once you've reached the horisontal limit of scale by adding more servers to the same cluster, you can go to the next level and create another cluster and link it to the first one.

          Infinispan supports this approach (i.e. layered clusters) using ClusterCacheLoader    : if a node is configured with such a loader and the requested data is not found on it then it tries to look it up in another ISPN cluster.

          For large objects (GB that can not be acomodated to a single node) support for large object is implemented as we speak.

          • 2. Distributed file storage
            gciuloaica

            Thanks Mircea for reply.

             

            layered  clusters seems to be a solution for 2 problems - splitting cluster in small subcluster (similar to sharding) and also for accesing data from a cluster from other data center.

            large object - like is mention in the issue iSPN-78, there is a need for a streaming interface. I have added a comment about redudant copies to it.