10 Replies Latest reply on Nov 9, 2013 12:45 PM by jay shaughnessy

    RHQ Storage; Compaction failure

    Elias Ross Master

      Got this gem. I'm not sure what to do with it. (Didn't get much help on the snapshot issue on the mailing list either.)


      2013-11-06 03:48:36,722 ERROR [ResourceContainer.invoker.nonDaemon-17] (StorageNodeComponent)- An error occurred while running cleanup on rhq keyspace
org.mc4j.ems.connection.EmsInvocationException: Exception on invocation of [forceTableCleanup]javax.management.MBeanException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
              at org.mc4j.ems.impl.jmx.connection.bean.operation.DOperation.invoke(DOperation.java:126)
              at org.rhq.plugins.cassandra.util.KeyspaceService.cleanup(KeyspaceService.java:89)
              at org.rhq.plugins.storage.StorageNodeComponent.cleanupKeyspace(StorageNodeComponent.java:602)
              at org.rhq.plugins.storage.StorageNodeComponent.performTopologyChangeMaintenance(StorageNodeComponent.java:516)
              at org.rhq.plugins.storage.StorageNodeComponent.nodeAdded(StorageNodeComponent.java:475)
              at org.rhq.plugins.storage.StorageNodeComponent.invokeOperation(StorageNodeComponent.java:128)
      ...
      Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
              at java.util.concurrent.FutureTask.get(FutureTask.java:83)
              at org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:274)
              at org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:312)
              at org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:967)
              at org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:2148)
      
        • 1. Re: RHQ Storage; Compaction failure
          John Sanda Apprentice

          Can you search rhq-storage.log for rhq-six_hour_metrics-ic-1-Data.db to see what happened with the file prior to the error?
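          A search along these lines will pull every mention of the file. The two sample log lines below are copied from later in this thread so the sketch runs anywhere; point `log` at your real rhq-storage.log instead:

```shell
# Stand-in for a real rhq-storage.log; the two lines are taken from this thread.
log=/tmp/rhq-storage.log.sample
cat > "$log" <<'EOF'
INFO [Streaming to /17.172.21.186:371] 2013-11-06 02:39:57,730 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
INFO [CompactionExecutor:3945] 2013-11-06 03:48:36,689 CompactionManager.java (line 587) Cleaning up SSTableReader(path='/data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db')
EOF

# -n adds line numbers so events can be ordered against the 03:48:36 error.
grep -n 'rhq-six_hour_metrics-ic-1-Data.db' "$log"
```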

          • 2. Re: Re: RHQ Storage; Compaction failure
            Elias Ross Master

            Looks like the data was getting streamed. Maybe after that it was deleted? The cleanup was the job that failed.


            
            INFO [Streaming to /17.172.21.186:371] 2013-11-06 02:39:57,730 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
            INFO [AntiEntropyStage:1] 2013-11-06 02:40:43,352 StreamOut.java (line 184) Stream context metadata [/data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=87 progress=0/138804 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-88-Data.db sections=282 progress=0/134589 - 0%, /data02/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-78-Data.db sections=282 progress=0/509733 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=271 progress=0/1220679 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-7-Data.db sections=117 progress=0/2133486 - 0%, /data04/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-83-Data.db sections=281 progress=0/134118 - 0%, /data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=271 progress=0/1220679 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-411-Data.db sections=2 progress=0/354 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=87 progress=0/138804 - 0%], 13 sstables.
            INFO [Streaming to /17.172.21.186:377] 2013-11-06 02:40:44,180 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
            INFO [AntiEntropyStage:1] 2013-11-06 02:41:25,988 StreamOut.java (line 184) Stream context metadata [/data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=506 progress=0/163692 - 0%, /data04/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-83-Data.db sections=515 progress=0/244773 - 0%, /data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=479 progress=0/4496757 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-88-Data.db sections=519 progress=0/246657 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=479 progress=0/4496757 - 0%, /data02/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-78-Data.db sections=520 progress=0/926121 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-7-Data.db sections=245 progress=0/4379049 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=506 progress=0/163692 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-411-Data.db sections=6 progress=0/1062 - 0%], 12 sstables.
            INFO [Streaming to /17.172.21.187:196] 2013-11-06 02:41:26,510 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.187
            INFO [CompactionExecutor:3945] 2013-11-06 03:48:36,689 CompactionManager.java (line 587) Cleaning up SSTableReader(path='/data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db')
            java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
            Caused by: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
            
            • 3. Re: RHQ Storage; Compaction failure
              Elias Ross Master

              May be a case of https://issues.apache.org/jira/browse/CASSANDRA-4857


              The mailing list suggests a create-drop-create sequence may have happened. Is it possible that RHQ would have dropped my existing data, even if by mistake?


              I have been seeing fairly quirky behavior where some queries are turning up no data, but after a few tries data comes back.


              Is there a way to basically 'refresh' a node, meaning rebuild the data directory area from scratch?

              • 4. Re: RHQ Storage; Compaction failure
                John Sanda Apprentice

                Elias Ross wrote:


                Is it possible that RHQ would have dropped my existing data, even if by mistake?

                If you mean dropped as in dropping the keyspace as described in CASSANDRA-4857, then no, that would not happen. It is possible, however, for a replica to miss data. If, for example, you have 3 replicas for a given key (i.e., schedule id) and one of the replicas goes down while data is being written for that key, then that node will be inconsistent when it comes back up.
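                As a toy model of that miss (plain files stand in for replicas here; the file names are made up for illustration, but the reconciliation rule mirrors how Cassandra resolves replicas, by highest cell timestamp):

```shell
# Three "replicas" of one cell, each stored as "timestamp value".
# replica3 was down during the second write, so it still holds the old value.
cd "$(mktemp -d)"
echo "200 v2" > replica1
echo "200 v2" > replica2
echo "100 v1" > replica3

# On read, the newest-timestamped cell wins; read repair would then push
# that winner back to the stale replica.
cat replica1 replica2 replica3 | sort -rn | head -n1 | cut -d' ' -f2   # prints: v2
```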


                On the node where the cleanup error occurs, try running,


                nodetool -p 7299 scrub rhq


                That will rebuild the data files and should remove anything that is broken.

                • 5. Re: RHQ Storage; Compaction failure
                  Elias Ross Master

                  $ ./nodetool -p 7299 scrub rhq

                  Exception in thread "main" java.lang.RuntimeException: Tried to create duplicate hard link to /data05/rhq/data/rhq/six_hour_metrics/snapshots/pre-scrub-1383787540489/rhq-six_hour_metrics-ic-6-Summary.db

                  No such luck. Is it possible to simply rm -rf it all and do the scrub?

                  • 6. Re: RHQ Storage; Compaction failure
                    John Sanda Apprentice

                    There is an offline scrub that you can try.

                    1. Shut down the node.
                    2. cd <rhq-server-home>/rhq-storage/bin
                    3. ./sstablescrub rhq six_hour_metrics
                    4. restart storage node
                    5. ./nodetool -p 7299 repair -pr rhq six_hour_metrics
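                    The same sequence written out as a script sketch (RHQ_SERVER_HOME is a placeholder for the thread's <rhq-server-home>; the script is only syntax-checked here, since it assumes a stopped node and a restart in the middle):

```shell
# Write the sequence to a file and syntax-check it; do NOT run it until the
# storage node is actually shut down.
cat > /tmp/offline-scrub.sh <<'EOF'
#!/bin/sh
set -e
: "${RHQ_SERVER_HOME:?set to your <rhq-server-home>}"
cd "$RHQ_SERVER_HOME/rhq-storage/bin"
# node must already be shut down at this point
./sstablescrub rhq six_hour_metrics
# ... restart the storage node here, then re-sync its primary ranges ...
./nodetool -p 7299 repair -pr rhq six_hour_metrics
EOF
sh -n /tmp/offline-scrub.sh   # parse check only
```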


                    If that does not work, you can try an rm -rf approach. Here is how I would do it.

                    1. nodetool -p 7299 disablebinary
                    2. nodetool -p 7299 flush rhq six_hour_metrics
                    3. on each of the other nodes in the cluster run, nodetool -p 7299 repair rhq
                    4. Shut down the node
                    5. rm -rf <rhq-data-dir>/data/rhq/six_hour_metrics
                    6. restart the node
                    7. nodetool -p 7299 repair -pr rhq
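                    And the rm -rf sequence as a script sketch (RHQ_DATA_DIR is a placeholder for the thread's <rhq-data-dir>; again only syntax-checked, since steps 3, 4, and 6 involve other nodes and a restart):

```shell
cat > /tmp/rebuild-keyspace.sh <<'EOF'
#!/bin/sh
set -e
: "${RHQ_DATA_DIR:?set to your <rhq-data-dir>}"
nodetool -p 7299 disablebinary                # 1. stop accepting client requests
nodetool -p 7299 flush rhq six_hour_metrics   # 2. flush memtables to disk
# 3. on each of the OTHER nodes: nodetool -p 7299 repair rhq
# 4. shut this node down
rm -rf "$RHQ_DATA_DIR"/data/rhq/six_hour_metrics   # 5. drop the broken files
# 6. restart this node
nodetool -p 7299 repair -pr rhq               # 7. rebuild from the replicas
EOF
sh -n /tmp/rebuild-keyspace.sh   # parse check only; do not run blindly
```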
                    • 7. Re: RHQ Storage; Compaction failure
                      mazz Master

                      John - that smells like a good FAQ entry - hint hint

                      • 8. Re: RHQ Storage; Compaction failure
                        Elias Ross Master

                        Thanks. It seems to be ignoring errors and powering through, which I like. I was thinking I might have to patch the server to get it to keep grinding through.

                        • 9. Re: RHQ Storage; Compaction failure
                          Elias Ross Master

                          The issue was that I had somehow symlinked two of the data directories to the same physical drive. User error. Luckily it only took me a week to figure out.
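                          For anyone else chasing this: one way to spot the collision is to canonicalize each data directory and look for duplicates. The directories below are fabricated to reproduce the misconfiguration; substitute your real /data0* paths:

```shell
# Fabricated setup: two "data dirs" that are symlinks to the same drive.
base=$(readlink -f "$(mktemp -d)")
mkdir "$base/drive_a"
ln -s "$base/drive_a" "$base/data05"
ln -s "$base/drive_a" "$base/data06"   # oops: same physical location

# Resolve every data dir to its real path; any line uniq -d prints is a
# location that two directories share, i.e. SSTables seen twice.
for d in "$base"/data0*; do readlink -f "$d"; done | sort | uniq -d
```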

                          • 10. Re: RHQ Storage; Compaction failure
                            jay shaughnessy Expert

                            Elias, ugh.  Thanks for following up; I'm sure John will appreciate it when he sees it.  If nothing else, he came up with a potential scrubbing FAQ entry out of it.