1386406 – Cassandra Pod has no evident errors or failures in logs but hawkular-metrics cannot reach it

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Matt Wringe
QA Contact:	Peng Li
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-10-18 20:45 UTC by Eric Jones
Modified:	2020-03-11 15:18 UTC
CC List:	3 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-02-09 20:22:45 UTC
Target Upstream Version:
Embargoed:

Description Eric Jones 2016-10-18 20:45:18 UTC

Description of problem:
After experiencing some issues with the hawkular-metrics pod and deleting it to deploy a new one, the new pod could not connect to the Cassandra pod, which appeared healthy.
The logs (to be attached) do not show any apparent errors or failures, and after rsh'ing into the pod, running `nodetool status` hangs. To correct the issue, the Cassandra pod had to be deleted and a new one deployed.

Additional Information:
The node the pod was deployed to ran out of Docker Storage, in case that is a contributing factor.

Comment 2 Matt Wringe 2016-10-27 14:21:15 UTC

@ejones when you say the pod ran out of storage, do you mean the Cassandra pod? Or the Hawkular Metrics pods?

Running out of storage is usually an error condition and applications are most likely going to malfunction without disk space.

Comment 3 Eric Jones 2016-10-27 14:26:20 UTC

@Matt sorry, the phrasing of that statement was poor. The pod was deployed to nodeA and nodeA's Docker Storage filled up (i.e. lvs reported >90% usage), therefore the pods on nodeA started to experience problems.

Comment 4 Matt Wringe 2016-10-27 14:56:16 UTC

I am not sure exactly how this should have been handled. Without disk space, I don't think any error messages could be written to the logs, so not seeing error messages in the logs may be expected here.

Running out of disk space is expected to cause problems. What exactly would you have expected to have occurred here?

Comment 5 Eric Jones 2016-10-27 14:59:22 UTC

I suppose my expectation was that Cassandra would fail and a new pod would be deployed. If everything about this is totally expected behavior and you don't see a way to improve this situation, then I suppose we can go ahead and mark this closed/notabug.

Comment 6 Matt Wringe 2016-10-27 16:10:45 UTC

We don't have a liveness probe on our Cassandra instance, so if we run into an error condition we don't restart it. I believe this was done by design, as we don't necessarily want to reboot the Cassandra pod in these situations. Normally, if Cassandra is not functioning, then it's going to require something more than just restarting the pod.

@jsanda: do you think it would be a wise decision to check the health of Cassandra and potentially reboot it if it encounters an error condition? Or is it safer to not do that?

Comment 7 John Sanda 2016-10-27 17:52:48 UTC

(In reply to Matt Wringe from comment #6)
> We don't have a liveness probe on our Cassandra instance, so if we run into
> an error condition we don't restart it. I believe this was done by design as
> we don't necessarily want to reboot the Cassandra pod in these situations.
> Normally if Cassandra is not functioning then it's going to require something
> more than just restarting the pod.
> 
> @jsanda: do you think it would be a wise decision to check the health of
> Cassandra and potentially reboot it if it encounters an error condition? Or
> is it safer to not do that?

What kinds of errors are we talking about? And should replication be part of this conversation? If we have multiple replicas, we can potentially restart a node without any disruption of service in terms of writing metrics.

Comment 8 Matt Wringe 2016-10-31 16:01:33 UTC

By default we are not using replication, and we might not know if replication has been applied or not. We should be using replication if available, but I thought we needed to wait until some Cassandra management was implemented in Hawkular Metrics itself.

At this point I am going to assume that leaving out liveness probes is probably going to be the safer option.

I am lowering the priority of this issue. For now, the expected behaviour is that Cassandra pods do not use a liveness probe and so are not automatically restarted when they hit an error condition.

Comment 9 John Sanda 2016-11-01 13:37:51 UTC

I want to make sure I understand correctly. Cassandra will automatically be restarted if its Java process crashes. Is this correct? I was previously thinking that a restart might get triggered simply by an error getting logged in system.log.

Comment 10 Matt Wringe 2016-11-01 13:44:27 UTC

If the Cassandra binary crashes or exits, it will automatically restart. That is the default behaviour.

We can add extra checks that periodically run a script to check the health beyond just whether the binary is running or not, such as checking the state reported by nodetool status, running cqlsh commands, or basically checking anything you want. These are the checks we do not have in place right now.
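
For illustration only, a check of the kind described above might look like the following Python sketch. It is not the script shipped with the Cassandra image; the function names (`parse_nodetool_status`, `is_healthy`, `check_cassandra`) and the 30-second timeout are assumptions made for this example. It runs `nodetool status` with a timeout, since the original report describes that command hanging, and treats the node as healthy only if every listed node is in the `UN` (Up/Normal) state.

```python
import subprocess

# `nodetool status` prefixes each node line with a two-letter code:
# first letter Up/Down, second letter Normal/Leaving/Joining/Moving.
VALID_STATUS = {"U", "D"}
VALID_STATE = {"N", "L", "J", "M"}


def parse_nodetool_status(output):
    """Return (code, address) pairs for each node line in the output."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        code = fields[0]
        # Skip header lines such as "Datacenter: ..." or "--  Address ...".
        if len(code) == 2 and code[0] in VALID_STATUS and code[1] in VALID_STATE:
            nodes.append((code, fields[1]))
    return nodes


def is_healthy(output):
    """Healthy means at least one node is listed and all of them are UN."""
    nodes = parse_nodetool_status(output)
    return bool(nodes) and all(code == "UN" for code, _ in nodes)


def check_cassandra(timeout_seconds=30):
    """Run `nodetool status`; a hang, a missing binary, or a non-zero
    exit code is reported as unhealthy. The timeout is an assumed value."""
    try:
        result = subprocess.run(
            ["nodetool", "status"],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and is_healthy(result.stdout)
```

A script like this only reports a result. Whether a failed check should trigger a restart is the open question in this thread, so the sketch deliberately does nothing beyond returning `True` or `False`.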

Comment 11 John Sanda 2016-11-01 15:17:26 UTC

What is in place now sounds good.

Comment 12 Matt Wringe 2017-02-09 20:22:45 UTC

From comments https://bugzilla.redhat.com/show_bug.cgi?id=1386406#c5 and https://bugzilla.redhat.com/show_bug.cgi?id=1386406#c11 it seems like this is the behaviour we want for now.

This is something we may refer back to in the future once we have more monitoring and management capabilities built into Cassandra.