Adding a new storage node failed and it does not seem to recover
esatikka Sep 8, 2015 5:56 AM
I'm in the process of moving our RHQ installation (single node, only a couple of agents) to a new server, and I thought the best way would be to add a new node in an HA setup and, once everything works on the new server, shut down / undeploy the old one. Adding a new RHQ server went fine, but I'm having trouble with the storage node. RHQ is version 4.11 and it is installed on Win2008 (the new server; the old one is... older) with Oracle JDK 1.7.0_80.
After getting the new RHQ server up, I did rhqctl.bat install --storage. No errors. After starting the new node, however, it logged things like:
ERROR [WrapperSimpleAppMain] 2015-09-07 13:58:28,239 CassandraDaemon.java (line 466) Exception encountered during startup
FSWriteError in ..\..\..\rhq-data\data\system\local\system-local-tmp-ic-2-Data.db
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:112)
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:103)
at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:140)
at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:400)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:242)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:449)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:492)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.tanukisoftware.wrapper.WrapperSimpleApp.run(WrapperSimpleApp.java:244)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to delete D:\rhq-server-4.11.0\rhq-storage\bin\..\..\..\rhq-data\data\system\local\system-local-tmp-ic-2-Data.db
... 13 more
[...]
ERROR [COMMIT-LOG-ALLOCATOR] 2015-09-07 13:58:42,019 CassandraDaemon.java (line 194) Exception in thread Thread[COMMIT-LOG-ALLOCATOR,5,main]
FSWriteError in ..\..\..\rhq-data\commit_log\CommitLog-2-1441623520488.log
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:132)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$3.run(CommitLogAllocator.java:196)
at org.apache.cassandra.db.commitlog.CommitLogAllocator$1.runMayThrow(CommitLogAllocator.java:94)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Rename from ..\..\..\rhq-data\commit_log\CommitLog-2-1441623502313.log to 1441623520488 failed
at org.apache.cassandra.db.commitlog.CommitLogSegment.<init>(CommitLogSegment.java:110)
... 4 more
[...]
ERROR [WrapperSimpleAppMain] 2015-09-07 13:58:42,630 CassandraDaemon.java (line 354) Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: S306/194.252.1.170:7100 is in use by another process. Change listen_address:storage_port in cassandra.yaml to values that do not conflict with other services
at org.apache.cassandra.net.MessagingService.getServerSocket(MessagingService.java:442)
at org.apache.cassandra.net.MessagingService.listen(MessagingService.java:390)
at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:589)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:350)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:449)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:492)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.tanukisoftware.wrapper.WrapperSimpleApp.run(WrapperSimpleApp.java:244)
at java.lang.Thread.run(Thread.java:745)
ERROR [StorageServiceShutdownHook] 2015-09-07 13:58:42,636 CassandraDaemon.java (line 194) Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:370)
at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:519)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.lang.Thread.run(Thread.java:745)
[...]
ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,739 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-2-Data.db (it will be removed on server restart; we'll also retry after GC)
ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,741 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-3-Data.db (it will be removed on server restart; we'll also retry after GC)
ERROR [NonPeriodicTasks:1] 2015-09-07 13:58:42,745 SSTableDeletingTask.java (line 72) Unable to delete ..\..\..\rhq-data\data\system\local\system-local-ic-1-Data.db (it will be removed on server restart; we'll also retry after GC)
Generally it just didn't seem to recover or be able to advance. I tried restarting the service, but it still complained about the same file with the FSWriteError seen above. I shut the service down and tried to delete the file manually - it turned out some java.exe process had it locked. I couldn't figure out what it was, since it remained even after shutting down all the services (storage node, RHQ server and agent). I then killed the process manually and restarted the services.
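In case someone else hits the same thing, this is roughly how such a stray process can be hunted down (handle.exe is a separate download from Sysinternals; the netstat line would also cover the "is in use by another process" error on port 7100; 1234 below is a placeholder PID):

rem Which process is holding the storage port? The last column is the PID.
netstat -ano | findstr :7100

rem Which process has the sstable file locked? (Sysinternals handle.exe)
handle.exe system-local-tmp-ic-2-Data.db

rem Identify and, if it really is a leftover, kill the offender.
tasklist /FI "PID eq 1234"
taskkill /PID 1234 /F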
Now the node seemed to start syncing things (plenty of log lines with "repair ... Received merkle tree...", plus CPU usage and network traffic). After a while that stopped, but the storage node is still marked DOWN in the RHQ UI with Operation Mode: ADD_MAINTENANCE and Cluster Status: DOWN. There's also an alert "StorageNodeMaintenanceOperationsFailure" which says "Maintenance operation failed for an RHQ Storage Node. When fired please see documentation for the proper corrective action."
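For what it's worth, Cassandra's own view of the cluster can be checked with nodetool against the storage node's JMX port (7299 should be the RHQ default, but adjust if your install differs):

rem Run from the storage node's bin directory; -p is the JMX port.
nodetool -h 194.252.1.170 -p 7299 status

rem Both nodes showing UN (Up/Normal) would mean Cassandra itself is fine
rem and the DOWN state exists only in RHQ's bookkeeping.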
The best corrective action I found in the docs was to manually run Deploy on the new node again from the UI, which seems to complete OK, but that doesn't change the node's status or mode.
I also wonder why the old node seems to have plenty more data than the new one (even disregarding snapshots). The new node has tens of megabytes in a couple of files under one_hour_metrics, while the old one has hundreds of megabytes in a couple of files...
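For a fair comparison, nodetool cfstats should report the live data size per column family on each node, ignoring snapshots and not-yet-compacted leftovers (JMX port again assumed to be the default 7299; substitute the old node's address):

rem Compare "Space used (live)" for the rhq keyspace between the nodes.
nodetool -h 194.252.1.170 -p 7299 cfstats
nodetool -h <old-node-address> -p 7299 cfstats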
As it seems that things started to go wrong right from the installer (the stray java.exe blocking the tmp file was probably started by the installer while the installed Windows service was not running), I might as well start from scratch, but I don't know how to purge the storage node installation without producing an even bigger mess. Or, if someone thinks there's a way to fix the node with a less drastic operation, I could try that out, too.
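My naive guess at a full purge would be something along these lines - the service name is a guess (check services.msc), the paths are taken from the log excerpts above, and I suspect the node's entry in the RHQ database would still need removing on top of this:

rem Stop and remove the storage Windows service (name may differ).
net stop rhqstorage
sc delete rhqstorage

rem Wipe the storage install and its data directory.
rmdir /s /q D:\rhq-server-4.11.0\rhq-storage
rmdir /s /q D:\rhq-data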
Update (following morning):
Last night both RHQ servers lost the storage cluster, which didn't seem to function too well with the half-done new node (which was missing e.g. schema_version data files). After finding the "Undeploying Storage Nodes" section of https://docs.jboss.org/author/display/RHQ/Deploying+Storage+Nodes (and especially the part on undeploying nodes whose state is not NORMAL), I decided to try that out. I succeeded in pulling the node out of the cluster and deleting it using RHQ operations, so I believe it was "clean enough". Next I tried installing the storage node again, and this time it seems to have gone OK (no issues seen in the logs, all data got streamed in, the log says "Node S306/194.252.1.170 state jump to normal" and "Startup completed! Now serving reads."). However, the new node is now stuck in the RHQ UI with Operation Mode: BOOTSTRAP, Cluster Status: DOWN. The RHQ UI displays an alert "Deployment error: Deployment has been aborted due to failed resource operation [Prepare For Bootstrap].", but it is rather odd that the timestamp of the alert falls in the middle of a streaming session which shows no errors in rhq-storage.log. Could that be due to a timeout?
Now that the node seems otherwise fine, I'd like to know if the mode and status shown in the RHQ UI can somehow be corrected. I tried running the addNodeMaintenance operation on both nodes (as the doc linked above suggests), and repair seems to run OK, but after a while this error shows up in rhq-server.log (together with a StorageNodeMaintenanceOperationsFailure alert at the same time) and the node status stays the same:
11:56:02,182 INFO [org.rhq.enterprise.server.operation.OperationManagerBean] (RHQScheduler_Worker-4) Operation execution seems to have been orphaned - timing it out: ResourceOperationHistory: resource=[Resource[id=12732, uuid=dc8426f5-cff4-4dad-b908-3c5fde6ba587, type={RHQStorage}RHQ Storage Node, key=RHQ Storage Node(193.201.145.86), name=RHQ Storage Node(193.201.145.86), parent=koms70, version=4.11.0]], group-history=[null], id=[56813], job-name=[rhq-resource-12732-356401360-1441701582753], job-group=[rhq-resource-12732], status=[In Progress], subject-name=[rhqadmin], ctime=[Tue Sep 08 11:39:42 EEST 2015], mtime=[Tue Sep 08 11:39:43 EEST 2015], duration-millis=[979023], error-message=[null]
I remember specifying 30 minutes as the operation timeout, but the error occurs after around 15 minutes. I also tried setting repair=false and update seeds=false, but to no avail (actually, coincidence or not, when trying this the old node killed its Gossiper upon an FSWriteError in D:\rhq\storage\data\rhq\raw_metrics\rhq-raw_metrics-ic-4911-Index.db, so that node was dropped from the cluster too, again leaving the RHQ servers without storage).
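If nothing else, running the repair by hand would at least separate Cassandra-side problems from RHQ's operation scheduling (JMX port 7299 again assumed from the default install):

rem Anti-entropy repair of the rhq keyspace on the new node.
nodetool -h 194.252.1.170 -p 7299 repair rhq

rem Afterwards, both nodes should report UN here.
nodetool -h 194.252.1.170 -p 7299 status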