A week ago we reported (ISPN 6183) an unexpected state transfer exception that was logged when a new node tried to join an existing cluster under very light load. Several attempts were made to join, including with an increased timeout. In the end the original nodes were restarted, which finally allowed the new node to join them. There's more info in the ticket.
My question is whether it is possible to recover from such a situation without having to restart the entire original cluster.
In searching for the underlying problem, should we be looking at the original cluster nodes, at irregularities in the newly provisioned node, or elsewhere?
We would also like to be better prepared in case we get another incident like this. Since we don't have direct control of, or access to, the environment where the software is deployed (we are a dev company), is there anything we could add to the config, or some additional monitoring tool, that would make diagnosis easier and improve our chances of recovering from such an exception?