11 Replies Latest reply on Feb 17, 2010 5:52 PM by michaelneale

    Auto scaling with CirrAS

    altes-kind

      Hi,

       

      thanks for providing the CirrAS AMIs - they work like a charm.

       

      Now I'm wondering if it's possible to build a JBoss AS cluster that scales up and down automatically depending on the workload of the nodes. I was thinking about monitoring the workload (e.g. CPU load) of the nodes on the RHQ Server (management appliance) and then automatically launching or terminating back-end appliance instances. In general, is it possible to do anything like that?

       

      Thanks,

      Matthias

        • 1. Re: Auto scaling with CirrAS
          goldmann
          thanks for providing the CirrAS AMIs - they work like a charm.

          Great to hear that!

          Now I'm wondering if it's possible to build a JBoss AS cluster that scales up and down automatically depending on the workload of the nodes. I was thinking about monitoring the workload (e.g. CPU load) of the nodes on the RHQ Server (management appliance) and then automatically launching or terminating back-end appliance instances. In general, is it possible to do anything like that?

          Yes, we're thinking about such a feature. It's definitely doable.

           

          Every node calculates its load (mod_cluster stuff) and sends it to the mod_cluster module in Apache. We should be able to grab those values from JBoss AS via JMX calls, collect them from the whole cluster with a fairly simple script, and send them back to the management appliance. From there we could launch additional instances.

           

          Moreover, we can write an RHQ plugin which does this for us and presents the data as graphs. We could configure this RHQ plugin with our AWS credentials and a list of the AMIs we're using right now. Launching a new instance would be easy then.

           

          But we also need a way to define the threshold values for launching and stopping instances, and to calculate the costs of our cluster. An example: a company doesn't want to spend more than $200 per month on a running cluster.
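A minimal sketch of how such threshold values and a monthly cost cap could fit together, in Python. Everything here is an assumption for illustration: the names `ScalingPolicy` and `decide`, the hourly price, the 730-hour month, and the convention that node load is normalised to 0.0 (idle) through 1.0 (saturated). It is not CirrAS or RHQ code.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # rough average month length, assumption for illustration


@dataclass
class ScalingPolicy:
    scale_up_load: float         # launch a node when the average load rises above this
    scale_down_load: float       # stop a node when the average load falls below this
    monthly_budget: float        # e.g. 200.0 (dollars per month)
    hourly_instance_cost: float  # hypothetical price of one back-end instance

    def max_instances(self) -> int:
        """Largest cluster size that still fits into the monthly budget."""
        monthly_cost_per_instance = self.hourly_instance_cost * HOURS_PER_MONTH
        return int(self.monthly_budget // monthly_cost_per_instance)


def decide(policy: ScalingPolicy, loads: list) -> int:
    """Return +1 (launch a node), -1 (stop a node) or 0 (do nothing)."""
    if not loads:
        return 0
    average = sum(loads) / len(loads)
    # Scale up only while the budget still allows one more instance.
    if average > policy.scale_up_load and len(loads) < policy.max_instances():
        return 1
    # Never scale below a single running node.
    if average < policy.scale_down_load and len(loads) > 1:
        return -1
    return 0
```

With a $200 budget and a hypothetical price of $0.10 per hour, `max_instances()` allows two nodes, so a second overloaded node would not trigger a third launch.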

           

          There are many ways to do auto scaling. We need to choose the right one, so let's start the discussion!

          • 2. Re: Auto scaling with CirrAS
            ray.ploski
            In the past I've built an integration from RHQ into Drools using RHQ's ManagementReport object to parse statistics. The idea was to use the CEP of Fusion to define your own performance tuning heuristics. Activated performance statistics then invoke the appropriate MBean to size a pool appropriately. It would not be too far of a stretch to extend the rulesets based upon a company's runtime governance policy ("Don't exceed $500", etc.) and make the calls to spin up or shut down instances. Let me know if it's of interest and I'll throw the code up on GitHub.
            • 3. Re: Auto scaling with CirrAS
              goldmann
              In the past I've built an integration from RHQ into Drools using RHQ's ManagementReport object to parse statistics. The idea was to use the CEP of Fusion to define your own performance tuning heuristics. Activated performance statistics then invoke the appropriate MBean to size a pool appropriately. It would not be too far of a stretch to extend the rulesets based upon a company's runtime governance policy ("Don't exceed $500", etc.) and make the calls to spin up or shut down instances. Let me know if it's of interest and I'll throw the code up on GitHub.

              I think that's a great use case for CEP. I saw a nice demo some time ago, and it made a good impression!

               

              It would be great to have some code to get started. We could reuse it in CirrAS and make a great new feature!

              • 4. Re: Auto scaling with CirrAS
                altes-kind

                Sounds great.

                 

                But why take the detour via mod_cluster for determining the load? RHQ already monitors things like cpu load, memory usage etc. Isn't there a way of using this data?

                • 5. Re: Auto scaling with CirrAS
                  goldmann

                  The answer is pretty simple. Using the mod_cluster load factor (or using all available metrics and computing our own factor) we can be more accurate. CPU load and memory usage aren't sufficient IMHO. We need to know more. Just take a look at all the available mod_cluster metrics: connection pool, traffic, active sessions, busy connectors, etc.
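To illustrate what "computing our own factor" from several metrics could look like, here is a small Python sketch. It is only an assumption about one reasonable scheme: a weighted average over capacity-normalised metric values. The function name `combined_load` and the sample numbers are made up, and mod_cluster's own calculation may differ.

```python
def combined_load(metrics):
    """Combine several metrics into one load value between 0.0 and 1.0.

    metrics is a list of (value, capacity, weight) tuples, for example
    (busy connectors, connector pool size, weight of this metric).
    """
    total_weight = sum(weight for _, _, weight in metrics)
    if total_weight == 0:
        return 0.0  # no weighted metrics, treat the node as idle
    weighted = sum(
        weight * min(value / capacity, 1.0)  # cap each metric at full capacity
        for value, capacity, weight in metrics
    )
    return weighted / total_weight
```

For example, 10 of 20 busy connectors with weight 1 and 150 of 200 active sessions with weight 3 would combine to 0.6875, so the session count dominates the result.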

                  • 6. Re: Auto scaling with CirrAS
                    michaelneale

                    Great timing Matthias !

                     

                    I have started to look into this as part of CoolingTower - like what Ray said - am looking at using rules. Ray - you want to jump in a bit more?

                     

                    Marek - what project should this scaling stuff live in? I was parking it in CT - and CT would run as an (optional) part of the management appliance I guess, but perhaps an autoscaling service is more generic than that? In any case, using rules/CEP I think is on the right track.

                     

                    Here are some ideas I mocked up today:

                     

                    So the model:

                     

                    // case classes, so the constructor parameters become readable fields
                    case class AppCluster(id: String) // just an overall grouping (for now)
                    case class Server(id: String) // a placeholder for the instance an AS runs on
                    case class AppServerInstance(id: String,
                                                 busyConnectors: Int,
                                                 heapUsage: Long,
                                                 sessionCount: Int,
                                                 cpuLoad: Int,
                                                 server: Server,
                                                 cluster: AppCluster) // a working app server, with some sample metrics
                    

                     

                    with the above "relational" model, we can write rules like:

                     

                    // fires when no app server has fewer than 20 busy connectors
                    rule "All loaded up"
                        when
                            not AppServerInstance(busyConnectors < 20)
                        then
                            requestNewApplicationServer("Connection busy");
                    end

                    // fires for a server that no longer hosts any app server
                    rule "Empty server"
                        when
                            srv: Server()
                            not AppServerInstance(server == srv)
                        then
                            shutdown(srv);
                    end

                    // fires for a nearly idle app server while another server still has room
                    rule "Can scale down"
                        when
                            srv: Server()
                            as: AppServerInstance(sessionCount < 2, server == srv)
                            exists AppServerInstance(sessionCount < 5, server != srv)
                        then
                            shutdownApplicationServer(as);
                    end
                    
                    
                    #and so on...
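To make the intent of the three rules above easier to check, here is a rough Python equivalent that evaluates them against one in-memory snapshot. It is not Drools and does not replace the rule engine. The function `evaluate` and the tuple-based action list are made up for illustration, and it assumes every instance's server is also present in `servers`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    id: str


@dataclass(frozen=True)
class AppServerInstance:
    id: str
    busy_connectors: int
    session_count: int
    server: Server


def evaluate(servers, instances):
    """Return the actions the three rules would request for one snapshot."""
    actions = []
    # "All loaded up": there is no instance with fewer than 20 busy connectors.
    if not any(i.busy_connectors < 20 for i in instances):
        actions.append(("requestNewApplicationServer", "Connection busy"))
    # "Empty server": a server that hosts no app server instance.
    for srv in servers:
        if not any(i.server == srv for i in instances):
            actions.append(("shutdown", srv.id))
    # "Can scale down": a nearly idle instance, while an instance on a
    # different server still has fewer than 5 sessions.
    for inst in instances:
        has_room_elsewhere = any(
            other.session_count < 5 and other.server != inst.server
            for other in instances
        )
        if inst.session_count < 2 and has_room_elsewhere:
            actions.append(("shutdownApplicationServer", inst.id))
    return actions
```

Like the `not` pattern in the first rule, the Python check also fires when there are no instances at all, which may or may not be what we want for an empty cluster.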
                    

                     

                    This would most certainly run on the management server - so should we still call it part of the CT utility?

                    Anyone want to jump in with other ideas?

                     

                    In terms of a model for the rules: the above is just some snapshot data from the mod_proxy page JMX stuff (which is accumulated). If we had more realtime feeds we could do things like CEP to be more responsive (I don't know enough about what data is available in what timeframe to say at this stage). But Ray's and Marek's ideas are excellent I think (Ray - you have made some of this work before, at least for analysis; this is really taking it to the next step).

                    • 7. Re: Auto scaling with CirrAS
                      goldmann

                      Marek - what project should this scaling stuff live in? I was parking it in CT - and CT would run as a (optional) part of the management appliance I guess, but perhaps an autoscaling service is more generic than that? In any case, using rules/CEP I think is on the right track.

                      Agreed. I think CoolingTower is the right project for this too. We can deploy CT on CirrAS.

                      This would most certainly run on the management server - so should we still call it part of the CT utility?

                      Anyone want to jump in with other ideas?

                      This stuff could be reported using the management service (CirrAS). But if we want to go with RHQ (like Ray), a plugin (CT) would be better.

                      In terms of a model for the rules - the above is just some snapshot data from the mod_proxy page JMX stuff  (which is accumulated) - but if we had more realtime feeds we could do things like CEP to be more responsive (I don't know enough as to what data is available in what timeframe to say at this stage) - but Ray/Mareks ideas are excellent I think (Ray - you have made some of this work before - at least for analysis - this is really taking it to the next step).

                      Which service were you looking at in the JMX Console? Take a look at the metrics. For example:

                      jboss.web:metric=BusyConnectors,provider=LoadBalanceFactor,service=ModCluster

                      There you have:

                      http://img.skitch.com/20100212-jmdy7ksbah1uu1cq9dkq9apktc.png

                      If we use the RHQ agent to grab that data, we can execute JMX calls directly. But if we go with our own solution, we can add it to, for example, the management service and use Twiddle to get (or set!) values:

                      [root@ip-10-212-75-32 bin]# ./twiddle.sh -s ip-10-212-75-32 get jboss.web:metric=BusyConnectors,provider=LoadBalanceFactor,service=ModCluster Capacity
                      Capacity=1.0
                      [root@ip-10-212-75-32 bin]# ./twiddle.sh -s ip-10-212-75-32 get jboss.web:metric=BusyConnectors,provider=LoadBalanceFactor,service=ModCluster Weight
                      Weight=1

                      Of course, those are only the weights used to compute the load. See "Exposing load metrics via JMX" for how we can get the load for a metric via JMX.

                       

                      By default only a few mod_cluster metrics are deployed. For our usage we can alter that and deploy more (or all).
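If we went the Twiddle route, the management service would need to build the command and parse the `Name=value` lines shown in the transcript above. A minimal Python sketch of both steps follows. The function names are hypothetical, the command simply mirrors the invocation in the transcript, and actually running it (for example via `subprocess`) is left out.

```python
def twiddle_command(host, mbean, attribute):
    """Build the argument list for the 'get' call shown in the transcript."""
    return ["./twiddle.sh", "-s", host, "get", mbean, attribute]


def parse_twiddle_output(text):
    """Turn lines such as 'Capacity=1.0' into a {name: float} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, raw = line.strip().partition("=")
        if not sep:
            continue  # not a name=value line
        try:
            values[name] = float(raw)
        except ValueError:
            continue  # prompt lines and non-numeric attributes are skipped
    return values
```

Shell prompt lines also contain `=` (inside the MBean name), which is why anything that does not parse as a number is skipped rather than treated as an error.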

                      • 8. Re: Auto scaling with CirrAS
                        ray.ploski
                        I'm stuck under a heavy load of previous commitments for the next two weeks.  I'll do my best to get the work I've already done up in a place where you can all mock my hastily whipped up Jopr / Drools plugin.
                        • 9. Re: Auto scaling with CirrAS
                          michaelneale

                          OK - so Question: these stats are per node in the cluster right? we don't have/want an aggregate of them in mod_cluster? (mod_cluster uses them to distribute load evenly, we want to know when there is surplus power or insufficient "free load" for scaling - kind of the opposite to what mod_cluster wants).

                           

                          So we would use RHQ/JOPR to get the stats for a given node (and this can also look at CPU usage for app process, not just internal stuff to the AS itself): http://www.jopr.org/display/JOPR2/Operating+System+Service#OperatingSystemService-Metrics and http://www.jopr.org/display/JOPR2/Web+Application+(WAR)+Service#WebApplication%28WAR%29Service-Metrics

                           

                          Does the above mean that we can get from JOPR/RHQ the metrics/load for a node - without going direct to JMX on the nodes (which will require a patch as was mentioned in the forum here) ?

                           

                          These data points/snapshots would be fed into the rules (in fact, we can probably have traps in JOPR that trigger it to happen) and then it can decide if it is time to crack open a new instance. Also - when load is below threshold it can un-deploy from that node (and perhaps after a time remove the unused instance - but in some cases we may leave it running in case another application requires it - to reduce latency). We can have some "hysteresis" in the rules so that new servers don't come and go too often... (once again - how people tune their cloud very much depends on whether it is public "infinite" infrastructure like EC2/Rackspace or a private RHEV-M/vSphere cloud).

                          • 10. Re: Auto scaling with CirrAS
                            goldmann

                            OK - so Question: these stats are per node in the cluster right? we don't have/want an aggregate of them in mod_cluster? (mod_cluster uses them to distribute load evenly, we want to know when there is surplus power or insufficient "free load" for scaling - kind of the opposite to what mod_cluster wants).

                            The example shown above doesn't give us useful stats. If we expose the getLoad() method via JMX (or wait until it's baked into mod_cluster, as described here), we could get the right stats per node rather than for the whole cluster, which isn't (very) useful for us.

                             

                            So we would use RHQ/JOPR to get the stats for a given node (and this can also look at CPU usage for app process, not just internal stuff to the AS itself): http://www.jopr.org/display/JOPR2/Operating+System+Service#OperatingSystemService-Metrics and http://www.jopr.org/display/JOPR2/Web+Application+(WAR)+Service#WebApplication%28WAR%29Service-Metric

                            mod_cluster has many metrics we can use. Take a look at AverageSystemLoadMetric or SystemMemoryUsageLoadMetric. Of course, more metrics are always useful; with them we could build a more fine-grained solution. But if we stick with the mod_cluster metrics for now, it should be easier to implement (we don't have to normalize metric values). Just my opinion.

                            Does the above mean that we can get from JOPR/RHQ the metrics/load for a node - without going direct to JMX on the nodes (which will require a patch as was mentioned in the forum here) ?

                            No, I don't think so. mod_cluster metrics give us more flexibility IMO. I don't think of the change mentioned there as a patch. We can create our own JAR file with our beans and a few simple classes to achieve this; I personally wouldn't change the mod_cluster source. Besides, the mod_cluster devs are really fast, so you can expect it to be implemented soon.

                            These data points/snapshots would be fed into the rules (in fact, we can probably have traps in JOPR that trigger it to happen) and then it can decide if it is time to crack open a new instance. Also - when load is below threshold it can un-deploy from that node (and perhaps after a time remove the unused instance - but in some cases we may leave it running in case another application requires it - to reduce latency). We can have some "hysteresis" in the rules so that new servers don't come and go too often... (once again - how people tune their cloud very much depends on whether it is public "infinite" infrastructure like EC2/Rackspace or a private RHEV-M/vSphere cloud).

                            Exactly, writing good rules is the hardest part of this work. We should prepare a default rule set and the ability to change some properties in the UI. For example, a grace period for a node: the node's load is below the threshold and we need to shut it down, but we'll wait X minutes to see whether that changes.
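The grace period idea could be sketched like this in Python. The class name `GracePeriod`, its parameters and the use of plain seconds for timestamps are assumptions for illustration; a real version would live in the rule set or the management service.

```python
class GracePeriod:
    """Report True only after the load stayed below the threshold long enough."""

    def __init__(self, threshold, grace_seconds):
        self.threshold = threshold
        self.grace_seconds = grace_seconds
        self._below_since = None  # time at which the load first dropped below

    def should_shutdown(self, load, now):
        """Feed one load sample taken at time 'now' (in seconds)."""
        if load >= self.threshold:
            self._below_since = None  # load recovered, restart the clock
            return False
        if self._below_since is None:
            self._below_since = now
        return now - self._below_since >= self.grace_seconds
```

A short spike above the threshold resets the timer, so a node is only shut down after an uninterrupted quiet period, which keeps servers from coming and going too often.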

                             

                            I think we shouldn't split clouds in that way. We're not interested in this. Moreover, if we move to Deltacloud with CirrAS, we won't care where we are running IMHO. Maybe I'm wrong; could you please tell me more about the platform dependencies for the tuning part?

                            • 11. Re: Auto scaling with CirrAS
                              michaelneale

                              OK, let's go with the mod_cluster-defined metrics; they seem spot on for us. I am working on the ruleset based on this, trying out a few models.

                               

                              I think we shouldn't split clouds in that way. We're not interested in this. More, if we move to Deltacloud with CirrAS we will be not interested where we are running IMHO. Maybe I'm wrong, could you please tell me more about the platform dependencies for tuning part?

                              You are right, there isn't really a difference. I was just trying to capture the little nuances between running something in, say, EC2, where you can ALWAYS ask for more resources (for a tiny incremental cost), and a "private cloud" built on virtualised servers, which are running on paid-for servers. Should that cloud be exhausted, the cost of extra resources (e.g. at peak) is MUCH greater... so rules may be adjusted (but that is all it should be, rule adjustments).

                               

                              ED. There is a JIRA to track this being made available on the beans: https://jira.jboss.org/jira/browse/MODCLUSTER-130