Description of problem:
After experiencing some issues with hawkular-metrics pod and deleting it to deploy a new one, the new pod could not connect to the cassandra pod which appeared healthy.
The logs (to be attached) show no apparent errors or failures, and running `nodetool status` after rsh'ing into the pod hangs. To correct the issue, the cassandra pod had to be deleted and a new one deployed.
The node the pod was deployed to ran out of Docker Storage, in case that is an additional factor.
@ejones when you say the pod ran out of storage, do you mean the Cassandra pod? Or the Hawkular Metrics pods?
Running out of storage is usually an error condition and applications are most likely going to malfunction without disk space.
@Matt sorry, the phrasing of that statement was poor. The pod was deployed to nodeA and nodeA's Docker Storage filled up (i.e. lvs reported >90% usage), therefore the pods on nodeA started to experience problems.
I am not sure exactly how this should have been handled. Without disk space, I don't think any error messages could be written to the logs, so not seeing error messages there may be expected.
Running out of disk space is expected to cause problems. What exactly would you have expected to have occurred here?
I suppose my expectation was that Cassandra would fail and a new pod would be deployed. If all of this is expected behavior and you don't see a way to improve the situation, then I suppose we can go ahead and mark this closed/notabug.
We don't have a liveness probe on our Cassandra instance, so if we run into an error condition we don't restart it. I believe this was done by design, as we don't necessarily want to reboot the Cassandra pod in these situations. Normally, if Cassandra is not functioning, then it's going to require something more than just restarting the pod.
@jsanda: do you think it would be wise to check the health of Cassandra and potentially reboot it if it encounters an error condition? Or is it safer not to do that?
(In reply to Matt Wringe from comment #6)
> We don't have a liveness probe on our Cassandra instance, so if we run into
> an error condition we don't restart it. I believe this was done by design, as
> we don't necessarily want to reboot the Cassandra pod in these situations.
> Normally, if Cassandra is not functioning, then it's going to require something
> more than just restarting the pod.
> @jsanda: do you think it would be wise to check the health of
> Cassandra and potentially reboot it if it encounters an error condition? Or
> is it safer not to do that?
What kinds of errors are we talking about? And should replication be part of this conversation? If we have multiple replicas, we can potentially restart a node without any disruption of service in terms of writing metrics.
By default we are not using replication, and we might not know if replication has been applied or not. We should be using replication if available, but I thought we needed to wait until some Cassandra management was implemented in Hawkular Metrics itself.
At this point I am going to assume that leaving out liveness probes is probably going to be the safer option.
I am lowering the priority of this issue. For now, the expected behaviour is that Cassandra pods do not use a liveness probe and so are not automatically restarted when they become unhealthy.
I want to make sure I understand correctly. Cassandra will automatically be restarted if its Java process crashes. Is this correct? I was previously thinking that a restart might get triggered simply by an error getting logged in system.log.
If the Cassandra binary crashes or exits, it will automatically restart. That is the default behaviour.
We can add extra checks that periodically run a script to check health beyond whether the binary is running, such as checking the state reported by nodetool status, running cqlsh commands, or basically anything else you want. These are the checks we do not have in place right now.
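For illustration only, a periodic health check of the kind described above might look roughly like the sketch below. This is not something that exists in the current images. The function names and the 30-second timeout are hypothetical, and it assumes `nodetool` is on the PATH inside the pod and that node rows in the `nodetool status` output start with a two-letter state code such as `UN` (Up/Normal). The timeout matters because, as in the original report, `nodetool status` can hang rather than fail.

```python
import subprocess

# Two-letter codes used by `nodetool status`: Up/Down + Normal/Leaving/Joining/Moving.
NODE_STATES = {status + state for status in "UD" for state in "NLJM"}


def all_nodes_up_normal(status_output: str) -> bool:
    """Return True if the output has at least one node row and every row is UN."""
    states = []
    for line in status_output.splitlines():
        tokens = line.split()
        # Header lines (Datacenter:, Status=..., --) do not start with a state code.
        if tokens and tokens[0] in NODE_STATES:
            states.append(tokens[0])
    return bool(states) and all(state == "UN" for state in states)


def check_cassandra(timeout_seconds: float = 30.0) -> bool:
    """Run `nodetool status`; treat a hang, a missing binary, or a non-zero exit as unhealthy."""
    try:
        result = subprocess.run(
            ["nodetool", "status"],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # hypothetical value; a hung nodetool counts as a failure
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and all_nodes_up_normal(result.stdout)
```

A probe wrapper could call `check_cassandra()` and exit non-zero when it returns False. Note that this sketch looks at every node row in the output, so in a multi-node cluster it would also flag the local pod when a different node is down. Whether acting on such a check by restarting the pod is actually desirable is the open question discussed in this bug.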
What is in place now sounds good.
From comments https://bugzilla.redhat.com/show_bug.cgi?id=1386406#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1386406#c11 it seems like this is the behaviour we want for now.
This is something we may refer back to in the future once we have more monitoring and management capabilities built into Cassandra.