3 Replies Latest reply on Sep 18, 2015 4:55 AM by esatikka

    Adding a new storage-node failed and will not seem to recover

    esatikka

      I'm in the process of moving our RHQ (single node, only a couple of agents) to a new server, and I thought the best way would be to add a new node in an HA setup and, once everything works on the new server, shut down / undeploy the old one. Adding a new RHQ server went fine, but I'm having trouble with the storage node. RHQ is version 4.11 and it is installed on Win2008 (the new server; the old one is...older) with Oracle JDK 1.7.0_80.

       

      After getting the new RHQ server up, I ran rhqctl.bat install --storage. No errors. After starting the new node, however, it logged things like:

       

      ERROR [WrapperSimpleAppMain] 2015-09-07 13:58:28,239 CassandraDaemon.java (line 466) Exception encountered during startup

      FSWriteError in ..\..\..\rhq-data\data\system\local\system-local-tmp-ic-2-Data.db

        at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:112)

        at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:103)

        at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:140)

        at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:400)

        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:242)

        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:449)

        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:492)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

        at org.tanukisoftware.wrapper.WrapperSimpleApp.run(WrapperSimpleApp.java:244)

        at java.lang.Thread.run(Thread.java:745)

      Caused by: java.io.IOException: Failed to delete D:\rhq-server-4.11.0\rhq-storage\bin\..\..\..\rhq-data\data\system\local\system-local-tmp-ic-2-Data.db

        ... 13 more

      [...]

      ERROR [COMMIT-LOG-ALLOCATOR] 2015-09-07 13:58:42,019 CassandraDaemon.java (line 194) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]

      FSWriteError in ..\..\..\rhq-data\commit_log\CommitLog-2-1441623520488.log

        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:132)

        at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:196)

        at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:94)

        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

        at java.lang.Thread.run(Thread.java:745)

      Caused by: java.io.IOException: Rename from ..\..\..\rhq-data\commit_log\CommitLog-2-1441623502313.log to 1441623520488 failed

        at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:110)

        ... 4 more

      [...]

      ERROR [WrapperSimpleAppMain] 2015-09-07 13:58:42,630 CassandraDaemon.java (line 354) Fatal configuration error

      org.apache.cassandra.exceptions.ConfigurationException: S306/194.252.1.170:7100 is in use by another process.  Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services

        at org.apache.cassandra.net.MessagingService.getServerSocket(MessagingService.java:442)

        at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:390)

        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:589)

        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)

        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)

        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:350)

        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:449)

        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:492)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:606)

        at org.tanukisoftware.wrapper.WrapperSimpleApp.run(WrapperSimpleApp.java:244)

        at java.lang.Thread.run(Thread.java:745)

      ERROR [StorageServiceShutdownHook] 2015-09-07 13:58:42,636 CassandraDaemon.java (line 194) Exception in thread Thread[StorageServiceShutdownHook,5,main]

      java.lang.NullPointerException

        at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)

        at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)

        at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)

        at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:519)

        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

        at java.lang.Thread.run(Thread.java:745)

      [...]

      ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,739 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-2-Data.db (it will be removed on server restart; we'll also retry after GC)

      ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,741 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-3-Data.db (it will be removed on server restart; we'll also retry after GC)

      ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,745 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-1-Data.db (it will be removed on server restart; we'll also retry after GC)

       

      The node generally just didn't seem to recover or be able to advance. I tried restarting the service, but it still complained about the same file with the FSWriteError seen above. I shut the service down and tried to delete the file manually - okay, there was some java.exe process which had it locked. I couldn't figure out what it was, since even after shutting down all the services (storage node, RHQ server and agent) it still remained. I then killed the process manually and restarted the services.
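
      (For anyone hitting the same thing: I didn't think of it at the time, but something like the Sysinternals handle utility would probably have told me which process was holding the file open, roughly along the lines of:)

          handle.exe system-local-tmp-ic-2-Data.db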

       

      Now the node seemed to start syncing things (plenty of log lines with "repair ... Received merkle tree...", plus CPU usage and network traffic). After a while that stopped, but the storage node is still marked DOWN in the RHQ UI with Operation Mode : ADD_MAINTENANCE and Cluster Status : DOWN. There's also an alert "StorageNodeMaintenanceOperationsFailure" which says "Maintenance operation failed for an RHQ Storage Node. When fired please see documentation for the proper corrective action."

       

      The best corrective action I could find in the docs was to manually Deploy the new node again from the UI, which seems to complete OK, but that doesn't change the node status or mode.

       

      I also wonder why the old node seems to have considerably more data than the new one (even when disregarding snapshots). The new node has tens of megabytes in a couple of files under one-hour-metrics, while the old one has hundreds of megabytes in a couple of files...

       

      Since things seem to have started going wrong already at install time (the stray java.exe blocking the tmp file was probably started by the installer while the installed Windows service was not running), I might as well start from scratch, but I don't know how to purge the storage node installation without producing an even bigger mess. Or, if someone thinks there's a way to fix the node with a less drastic operation, I could try that too.

       


      Update (following morning):

       

      Last night both RHQ servers lost the storage cluster, which didn't seem to function too well with the half-done new node (which was missing e.g. schema_version data files). After finding the "Undeploying Storage Nodes" section of https://docs.jboss.org/author/display/RHQ/Deploying+Storage+Nodes (and especially the part on undeploying nodes whose state is not NORMAL) I decided to try that out. I succeeded in pulling the node out of the cluster and deleting it using RHQ operations, so I believe it was "clean enough". Next I tried installing the storage node again, and this time it seems to have gone OK (no issues seen in the logs, all data got streamed in, the log says "Node S306/194.252.1.170 state jump to normal" and "Startup completed! Now serving reads."). However, the new node is now stuck in the RHQ UI with Operation Mode : BOOTSTRAP, Cluster Status : DOWN. The RHQ UI displays an alert "Deployment error: Deployment has been aborted due to failed resource operation [Prepare For Bootstrap]." but it is rather odd that the timestamp of the alert is in the middle of a streaming session which shows no errors in rhq-storage.log. Could that be due to a timeout?


      Now that the node seems to be otherwise fine, I'd like to know if the mode and status shown in the RHQ UI can somehow be corrected. I tried running the addNodeMaintenance operation on both nodes (as the doc linked above suggests), and the repair seems to run OK, but after a while this error shows up in rhq-server.log (together with a StorageNodeMaintenanceOperationsFailure alert at the same time) and the node status stays the same:

      11:56:02,182 INFO  [org.rhq.enterprise.server.operation.OperationManagerBean] (RHQScheduler_Worker-4) Operation execution seems to have been orphaned - timing it out: ResourceOperationHistory: resource=[Resource[id=12732, uuid=dc8426f5-cff4-4dad-b908-3c5fde6ba587, type={RHQStorage}RHQ Storage Node, key=RHQ Storage Node(193.201.145.86), name=RHQ Storage Node(193.201.145.86), parent=koms70, version=4.11.0]], group-history=[null], id=[56813], job-name=[rhq-resource-12732-356401360-1441701582753], job-group=[rhq-resource-12732], status=[In Progress], subject-name=[rhqadmin], ctime=[Tue Sep 08 11:39:42 EEST 2015], mtime=[Tue Sep 08 11:39:43 EEST 2015], duration-millis=[979023], error-message=[null]

       

      I remember specifying 30 minutes as the operation timeout, yet the error occurs after around 15 minutes. I also tried setting repair=false and update seeds=false, but to no avail (actually, coincidence or not, when trying this the old node killed its Gossiper upon an FSWriteError in D:\rhq\storage\data\rhq\raw_metrics\rhq-raw_metrics-ic-4911-Index.db, so that node was dropped from the cluster, which also left the RHQ servers without storage).

        • 1. Re: Adding a new storage-node failed and will not seem to recover
          john.sanda

          Not sure why but it looks like my reply from earlier did not make it into the discussion. Reposting...

           

           

          You may have run into bugs where the operation mode and cluster status do not get properly updated. This should be fixed in more recent versions of RHQ. It sounds like you need to check your replication settings, which unfortunately is not easy and requires some manual steps. Here are the steps:


           

          1) Get storage node username/password

           


          Open <RHQ_SERVER_HOME>/bin/rhq-server.properties and find the rhq.storage.username and rhq.storage.password properties, and make a note of the values.
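
          The relevant lines will look something like this (the username and value below are just made-up placeholders; note that the password is stored in obfuscated form, hence step 2):

              rhq.storage.username=rhqadmin
              rhq.storage.password=1d4357a9e8f2c6b0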

           


          2) Decode storage node password

           


          You can use the RHQ CLI to decode the password. Log into the CLI and then call,

           

             

              org.rhq.core.util.obfuscation.Obfuscator.decode(storage_password)


           

          where storage_password is the value of the rhq.storage.password property
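
          If I remember the CLI options correctly, on Windows the whole thing looks roughly like this (run from the CLI's bin directory; the decode argument is the obfuscated value from step 1, passed as a string):

              rhq-cli.bat -u rhqadmin -p <rhqadmin password>
              org.rhq.core.util.obfuscation.Obfuscator.decode('<value of rhq.storage.password>')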

           


          3) Enable thrift server

           


          You will use a command line utility called cqlsh to query the replication settings. The storage node(s) need to be up and running from this point on. cqlsh sends requests over the storage node’s internal thrift server which is turned off by default. We need to turn it on using the nodetool utility.

           


              cd <RHQ_SERVER_HOME>/rhq-storage/bin

           

              nodetool -p 7299 enablethrift
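
          While you are at it, nodetool (same JMX port) can also confirm that both nodes see each other as up and joined; both should show UN (Up/Normal):

              nodetool -p 7299 status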

           


          4) Query replication settings


           

          Log into cqlsh,

           

           

              cqlsh -u <storage_username> -p <storage_password>

           


          where storage_username is the value of the rhq.storage.username property and storage_password is the value of the rhq.storage.password property. Then execute the following command,


           

              DESCRIBE KEYSPACE rhq;

           


          Scroll to the top of the output which should look something like,


           

              cqlsh> DESCRIBE KEYSPACE rhq;

           

              CREATE KEYSPACE rhq WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;


           

          If you have two storage nodes deployed, then replication_factor should be 2. If it is 1, that would explain the disparity in the amount of data on the two nodes.

           


          5) Fix replication_factor if necessary

           


          If the replication_factor is wrong, then execute the following in cqlsh,

           

              ALTER KEYSPACE rhq WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};


           

          6) Run repair

           


          Lastly, you need to run the repair operation on the storage nodes so that the data is made consistent between nodes. Data should get transferred to the new node during this process. You can do this from the RHQ CLI.

           

             

              StorageNodeManager.runClusterMaintenance()


           

          The method call in the CLI will return right away because it simply schedules the work. The actual repair operation, which is run on each node sequentially, can take a while to complete. It largely depends on the amount of data you have.
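
          If you want to keep an eye on the repair from the command line as well, nodetool on either node should show the streaming and compaction activity while it runs (purely an optional sanity check):

              nodetool -p 7299 netstats

              nodetool -p 7299 compactionstats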

          • 2. Re: Adding a new storage-node failed and will not seem to recover
            esatikka

            Thanks for the response, but it seems like the replication_factor is already OK:

             

            cqlsh>  DESCRIBE KEYSPACE rhq;

             

            CREATE KEYSPACE rhq WITH replication = {

              'class': 'SimpleStrategy',

              'replication_factor': '2'

            };

             

            Had I foreseen the difficulty of adding a node, I might have considered an RHQ upgrade first (mainly since 4.13 would allow proper management of storage snapshots etc.), but before I could do that I'd need to make sure RTI is OK with RHQ 4.13...

            • 3. Re: Adding a new storage-node failed and will not seem to recover
              esatikka

              (reporting this with the correct account, sorry mods)

               

              It seems that the new node is now finally up and running. I needed to

              • add some resources to the new server (to avoid timeouts and all sorts of crashes due to resource shortage, as shuffling the 3.2 GB of data back and forth seems to be a pretty intensive operation compared to what the node requires in normal operation),
              • read Deploying Storage Nodes - RHQ - Project Documentation Editor very carefully, along with the parameters of the storage operations automatically created by the deployment process, so that I could manually carry on from where the deployment process crashed (the automatically started addNodeMaintenance still times out each time, and I wonder if the timeout can even be adjusted anywhere other than the source code),
              • remember to restart both storage nodes after each storage operation timeout, to make sure both were running normally before retrying (something like the commands sketched after this list),
              • exercise a lot of patience whenever one of the nodes hit an error deleting a file (for no apparent reason), which either shut down the gossiper or got the server JVM restarted by the wrapper watchdog ("JVM appears hung"), making RHQ lose track of the process - and thus requiring me to start the same operation all over again, and
              • have enough luck that some of the glitches actually made the addNodeMaintenance timeout go unnoticed by the system, since it ran for more than a day before finally ending up with both storage nodes having status and mode NORMAL.
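
              In practice the restart/check cycle between retries boils down to something like this on each node (assuming the same JMX port 7299 as the commands earlier in this thread; both nodes should show UN in the status output before retrying):

                  rhqctl.bat stop --storage
                  rhqctl.bat start --storage
                  nodetool -p 7299 status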

               

              I know the above sounds too much like what you get in a bad support ticket, but there really was little consistency in the way the process crashed (other than the timeouts, excluding the one lucky glitch). I think the storage cluster was already fully functional before the last 1-2 attempts, since shutting down one node didn't seem to affect the availability of storage to the RHQ server, but I wanted the node status to also appear normal. I'm willing to attribute plenty of the fuss to the bugs John mentioned above, and I wonder if the JVM issue might be due to using Oracle JDK 1.7.0_80 instead of the 1.7.0_55 which the old server has. I might try downgrading the JVM if the issue keeps bugging me in normal operation. Anyway, thanks to John for pointing out (even if implicitly) some things which led me to other resources while trying to figure out the situation.

               

              I will now shut down the old storage node and wait a few days to see if that causes any issues. If not, I'll finally be able to ditch that old server. I'll go for upgrading RHQ after the next RTI release, which should support RHQ 4.13.