10 Replies Latest reply on Nov 9, 2013 12:45 PM by jay shaughnessy

    RHQ Storage; Compaction failure

    Elias Ross Master

      Got this gem. I'm not sure what to do with it. (Didn't get much help on the snapshot issue on the mailing list either.)


      2013-11-06 03:48:36,722 ERROR [ResourceContainer.invoker.nonDaemon-17] (StorageNodeComponent)- An error occurred while running cleanup on rhq keyspace
org.mc4j.ems.connection.EmsInvocationException: Exception on invocation of [forceTableCleanup]javax.management.MBeanException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
              at org.mc4j.ems.impl.jmx.connection.bean.operation.DOperation.invoke(DOperation.java:126)
              at org.rhq.plugins.cassandra.util.KeyspaceService.cleanup(KeyspaceService.java:89)
              at org.rhq.plugins.storage.StorageNodeComponent.cleanupKeyspace(StorageNodeComponent.java:602)
              at org.rhq.plugins.storage.StorageNodeComponent.performTopologyChangeMaintenance(StorageNodeComponent.java:516)
              at org.rhq.plugins.storage.StorageNodeComponent.nodeAdded(StorageNodeComponent.java:475)
              at org.rhq.plugins.storage.StorageNodeComponent.invokeOperation(StorageNodeComponent.java:128)
      ...
      Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
              at java.util.concurrent.FutureTask.get(FutureTask.java:83)
              at org.apache.cassandra.db.compaction.CompactionManager.performAllSSTableOperation(CompactionManager.java:274)
              at org.apache.cassandra.db.compaction.CompactionManager.performCleanup(CompactionManager.java:312)
              at org.apache.cassandra.db.ColumnFamilyStore.forceCleanup(ColumnFamilyStore.java:967)
              at org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:2148)
      
        • 1. Re: RHQ Storage; Compaction failure
          John Sanda Apprentice

          Can you search rhq-storage.log for rhq-six_hour_metrics-ic-1-Data.db to see what happened with the file prior to the error?
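          A search along these lines will pull every mention of the file. The two sample log lines below are copied from later in this thread so the sketch runs anywhere; point `log` at your real rhq-storage.log instead:

```shell
# Stand-in for a real rhq-storage.log; the two lines are taken from this thread.
log=/tmp/rhq-storage.log.sample
cat > "$log" <<'EOF'
INFO [Streaming to /17.172.21.186:371] 2013-11-06 02:39:57,730 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
INFO [CompactionExecutor:3945] 2013-11-06 03:48:36,689 CompactionManager.java (line 587) Cleaning up SSTableReader(path='/data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db')
EOF

# -n adds line numbers so events can be ordered against the 03:48:36 error.
grep -n 'rhq-six_hour_metrics-ic-1-Data.db' "$log"
```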

          • 2. Re: Re: RHQ Storage; Compaction failure
            Elias Ross Master

            Looks like the data was getting streamed. Maybe after that it was deleted? The cleanup was the job that failed.


            
            INFO [Streaming to /17.172.21.186:371] 2013-11-06 02:39:57,730 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
            INFO [AntiEntropyStage:1] 2013-11-06 02:40:43,352 StreamOut.java (line 184) Stream context metadata [/data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=87 progress=0/138804 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-88-Data.db sections=282 progress=0/134589 - 0%, /data02/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-78-Data.db sections=282 progress=0/509733 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=271 progress=0/1220679 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-7-Data.db sections=117 progress=0/2133486 - 0%, /data04/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-83-Data.db sections=281 progress=0/134118 - 0%, /data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=271 progress=0/1220679 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-411-Data.db sections=2 progress=0/354 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=87 progress=0/138804 - 0%], 13 sstables.
            INFO [Streaming to /17.172.21.186:377] 2013-11-06 02:40:44,180 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.186
            INFO [AntiEntropyStage:1] 2013-11-06 02:41:25,988 StreamOut.java (line 184) Stream context metadata [/data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=506 progress=0/163692 - 0%, /data04/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-83-Data.db sections=515 progress=0/244773 - 0%, /data06/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=479 progress=0/4496757 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-88-Data.db sections=519 progress=0/246657 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-6-Data.db sections=479 progress=0/4496757 - 0%, /data02/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-78-Data.db sections=520 progress=0/926121 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-7-Data.db sections=245 progress=0/4379049 - 0%, /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db sections=506 progress=0/163692 - 0%, /data03/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-411-Data.db sections=6 progress=0/1062 - 0%], 12 sstables.
            INFO [Streaming to /17.172.21.187:196] 2013-11-06 02:41:26,510 StreamReplyVerbHandler.java (line 44) Successfully sent /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db to /17.172.21.187
            INFO [CompactionExecutor:3945] 2013-11-06 03:48:36,689 CompactionManager.java (line 587) Cleaning up SSTableReader(path='/data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db')
            java.lang.RuntimeException: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
            Caused by: java.io.FileNotFoundException: /data05/rhq/data/rhq/six_hour_metrics/rhq-six_hour_metrics-ic-1-Data.db (No such file or directory)
            
            • 3. Re: RHQ Storage; Compaction failure
              Elias Ross Master

              May be a case of https://issues.apache.org/jira/browse/CASSANDRA-4857


              The mailing list suggests a create-drop-create sequence may have happened. Is it possible that RHQ would have dropped my existing data, even if by mistake?


              I have been seeing fairly quirky behavior where some queries are turning up no data, but after a few tries data comes back.


              Is there a way to basically 'refresh' a node, meaning rebuild the data directory area from scratch?

              • 4. Re: RHQ Storage; Compaction failure
                John Sanda Apprentice

                Elias Ross wrote:


                Is it possible that RHQ would have dropped my existing data, even if by mistake?

                If you mean dropped as in dropping the keyspace as described in CASSANDRA-4857, then no, that would not happen. It is possible, however, for a replica to miss data. If, for example, you have 3 replicas for a given key (i.e., schedule id) and one of the replicas goes down while data is being written for that key, then that node will be inconsistent when it comes back up.
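                As a toy model of that miss (plain files stand in for replicas here; the file names are made up for illustration, but the reconciliation rule mirrors how Cassandra resolves replicas, by highest cell timestamp):

```shell
# Three "replicas" of one cell, each stored as "timestamp value".
# replica3 was down during the second write, so it still holds the old value.
cd "$(mktemp -d)"
echo "200 v2" > replica1
echo "200 v2" > replica2
echo "100 v1" > replica3

# On read, the newest-timestamped cell wins; read repair would then push
# that winner back to the stale replica.
cat replica1 replica2 replica3 | sort -rn | head -n1 | cut -d' ' -f2   # prints: v2
```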


                On the node where the cleanup error occurs, try running,


                nodetool -p 7299 scrub rhq


                That will rebuild the data files and should remove anything that is broken.

                • 5. Re: RHQ Storage; Compaction failure
                  Elias Ross Master

                  $ ./nodetool -p 7299 scrub rhq

                  Exception in thread "main" java.lang.RuntimeException: Tried to create duplicate hard link to /data05/rhq/data/rhq/six_hour_metrics/snapshots/pre-scrub-1383787540489/rhq-six_hour_metrics-ic-6-Summary.db

                  No such luck. Is it possible to simply rm -rf it all and do the scrub?

                  • 6. Re: RHQ Storage; Compaction failure
                    John Sanda Apprentice

                    There is an offline scrub that you can try.

                    1. Shut down the node.
                    2. cd <rhq-server-home>/rhq-storage/bin
                    3. ./sstablescrub rhq six_hour_metrics
                    4. restart storage node
                    5. ./nodetool -p 7299 repair -pr rhq six_hour_metrics
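                    The same sequence written out as a script sketch (RHQ_SERVER_HOME is a placeholder for the thread's <rhq-server-home>; the script is only syntax-checked here, since it assumes a stopped node and a restart in the middle):

```shell
# Write the sequence to a file and syntax-check it; do NOT run it until the
# storage node is actually shut down.
cat > /tmp/offline-scrub.sh <<'EOF'
#!/bin/sh
set -e
: "${RHQ_SERVER_HOME:?set to your <rhq-server-home>}"
cd "$RHQ_SERVER_HOME/rhq-storage/bin"
# node must already be shut down at this point
./sstablescrub rhq six_hour_metrics
# ... restart the storage node here, then re-sync its primary ranges ...
./nodetool -p 7299 repair -pr rhq six_hour_metrics
EOF
sh -n /tmp/offline-scrub.sh   # parse check only
```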


                    If that does not work, you can try an rm -rf approach. Here is how I would do it.

                    1. nodetool -p 7299 disablebinary
                    2. nodetool -p 7299 flush rhq six_hour_metrics
                    3. on each of the other nodes in the cluster run, nodetool -p 7299 repair rhq
                    4. Shut down the node
                    5. rm -rf <rhq-data-dir>/data/rhq/six_hour_metrics
                    6. restart the node
                    7. nodetool -p 7299 repair -pr rhq
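                    And the rm -rf sequence as a script sketch (RHQ_DATA_DIR is a placeholder for the thread's <rhq-data-dir>; again only syntax-checked, since steps 3, 4, and 6 involve other nodes and a restart):

```shell
cat > /tmp/rebuild-keyspace.sh <<'EOF'
#!/bin/sh
set -e
: "${RHQ_DATA_DIR:?set to your <rhq-data-dir>}"
nodetool -p 7299 disablebinary                # 1. stop accepting client requests
nodetool -p 7299 flush rhq six_hour_metrics   # 2. flush memtables to disk
# 3. on each of the OTHER nodes: nodetool -p 7299 repair rhq
# 4. shut this node down
rm -rf "$RHQ_DATA_DIR"/data/rhq/six_hour_metrics   # 5. drop the broken files
# 6. restart this node
nodetool -p 7299 repair -pr rhq               # 7. rebuild from the replicas
EOF
sh -n /tmp/rebuild-keyspace.sh   # parse check only; do not run blindly
```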
                    • 7. Re: RHQ Storage; Compaction failure
                      mazz Master

                      John - that smells like a good FAQ entry - hint hint

                      • 8. Re: RHQ Storage; Compaction failure
                        Elias Ross Master

                        Thanks. It seems to be ignoring errors and powering through, which I like. I was thinking I might have to patch the server to get it to keep grinding through.

                        • 9. Re: RHQ Storage; Compaction failure
                          Elias Ross Master

                          The issue was that I had somehow symlinked two of the data directories to the same physical drive. User error. Luckily it only took me a week to figure out.
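                          For anyone else chasing this: one way to spot the collision is to canonicalize each data directory and look for duplicates. The directories below are fabricated to reproduce the misconfiguration; substitute your real /data0* paths:

```shell
# Fabricated setup: two "data dirs" that are symlinks to the same drive.
base=$(readlink -f "$(mktemp -d)")
mkdir "$base/drive_a"
ln -s "$base/drive_a" "$base/data05"
ln -s "$base/drive_a" "$base/data06"   # oops: same physical location

# Resolve every data dir to its real path; any line uniq -d prints is a
# location that two directories share, i.e. SSTables seen twice.
for d in "$base"/data0*; do readlink -f "$d"; done | sort | uniq -d
```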

                          • 10. Re: RHQ Storage; Compaction failure
                            jay shaughnessy Expert

                            Elias, ugh.  Thanks for following up; I'm sure John will appreciate it when he sees it.  If nothing else, he came up with a potential scrubbing FAQ entry out of it.