Description of problem: We are running into an issue where the memory/usage metric is missing for a node in a cluster. It's a 15-node cluster, and only one node seems to be affected. The metric appears correctly in the Heapster API; it's just not being stored in Hawkular Metrics.
In a previous Heapster log there was an error message about this node not being started yet, but after restarting the Heapster pod this error message is no longer in the logs. The Hawkular Metrics logs don't appear to show any errors either (aside from a few "Connection reset by peer" errors and a warning about "Multiple resource methods"). The Cassandra logs look good as well.
It also seems to be affecting:
cpu/node_reservation
cpu/node_utilization
cpu/request
memory/node_reservation
memory/node_utilization

Heapster version: 1.2.0
@micke: can you look through the Heapster sink code and see if anything there makes it obvious why the in-memory sink is able to display these metrics but the Hawkular sink is not able to get them?
I couldn't repeat this with a simple unit test, so it must be something more complex. Apart from filtering, we don't really reject any metrics (and errors should be logged), so I will need to test this in a real environment to see whether there's perhaps a data truncation, naming function, or Heapster data-enricher issue.
Researching this more, the metric does actually exist: its metric definition is there and it is collecting metrics. The problem is that, for some reason, it is not showing up when doing a tag query.
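For reference, the inconsistency described here can be checked mechanically by comparing the full set of metric definitions against the IDs a tag query returns. This is a minimal offline sketch: the metric IDs, tag names, and data shapes are illustrative assumptions loosely modeled on what the Hawkular Metrics REST API returns, not verbatim output.

```python
# Sketch of the symptom: a metric whose definition exists, but which
# a tag query fails to return. IDs and tags below are made up.

# All gauge definitions (e.g. what GET .../gauges would list).
definitions = [
    {"id": "node1/memory/usage", "tags": {"type": "node"}},
    {"id": "node2/memory/usage", "tags": {"type": "node"}},
]

# Result of a tag query (e.g. GET .../gauges?tags=type:node).
# node2 is missing even though its definition and data exist.
tag_query_result = [
    {"id": "node1/memory/usage"},
]

def missing_from_tag_query(defs, tagged):
    """Return IDs that have a definition but are absent from the tag query."""
    tagged_ids = {m["id"] for m in tagged}
    return sorted(d["id"] for d in defs if d["id"] not in tagged_ids)

print(missing_from_tag_query(definitions, tag_query_result))
# A non-empty result is exactly the inconsistency described above.
```

Running this comparison against a real cluster would pinpoint which nodes' metrics have fallen out of the tag index.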
Upstream Hawkular Metrics issue: https://issues.jboss.org/browse/HWKMETRICS-660
This has been fixed in our OCP 3.6 images. I have cloned the issue for the work that needs to be done for OCP 3.5.
@Matt, I would like to verify this using the following steps:
1. Deploy metrics on an N-node (N >= 15) cluster.
2. Create pods on these nodes and ensure every node has at least one pod.
3. Check the pods' memory/cpu/network usage on every node, and make sure no memory/cpu/network metric is lost.

Is that enough, or do you have other comments? Thanks
(In reply to Junqi Zhao from comment #18)
> @Matt,
>
> I would like to verify this using the following steps:
> 1. Deploy metrics on an N-node (N >= 15) cluster.
> 2. Create pods on these nodes and ensure every node has at least one pod.
> 3. Check the pods' memory/cpu/network usage on every node, and make sure no
> memory/cpu/network metric is lost.
>
> Is that enough, or do you have other comments?

I don't believe there is any way to actually test and verify this issue.
(In reply to Matt Wringe from comment #19)
> I don't believe there is any way to actually test and verify this issue.

So, how can we test this defect?
(In reply to Junqi Zhao from comment #20)
> (In reply to Matt Wringe from comment #19)
> > I don't believe there is any way to actually test and verify this issue.
>
> So, how can we test this defect?

I don't know if you can, at least not easily.

Basically, the problem is that you can send a single REST call to Hawkular Metrics where Hawkular Metrics then does two writes to Cassandra. If one of those writes fails, then you will get this issue.

You might be able to reproduce it by getting Cassandra into a state where it's under heavy load and a significant number of writes are failing.

@jsanda: any thoughts on this?
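The failure mode described above can be sketched in a few lines: one ingest call fans out into two independent writes (the data table and the tag index), and if the index write fails without the error being surfaced, the metric's data exists while tag queries miss it. The class and method names here are illustrative assumptions, not the actual Hawkular Metrics or Cassandra code.

```python
class FakeCassandra:
    """Toy store with the two 'tables' that one ingest call writes to."""
    def __init__(self):
        self.data = {}        # metric id -> list of data points
        self.tag_index = {}   # tag -> set of metric ids
        self.fail_next_index_write = False

    def write_data(self, metric_id, point):
        self.data.setdefault(metric_id, []).append(point)

    def write_tag_index(self, metric_id, tags):
        if self.fail_next_index_write:
            self.fail_next_index_write = False
            raise TimeoutError("simulated Cassandra write timeout")
        for tag in tags:
            self.tag_index.setdefault(tag, set()).add(metric_id)

def ingest(store, metric_id, tags, point):
    """One REST call triggers two writes. Swallowing the index failure
    models the bug: no error reaches the client, so nothing retries."""
    store.write_data(metric_id, point)
    try:
        store.write_tag_index(metric_id, tags)
    except TimeoutError:
        pass  # bug: the failure is not reported back to the sender

store = FakeCassandra()
ingest(store, "node1/memory/usage", ["type:node"], 1024)
store.fail_next_index_write = True
ingest(store, "node2/memory/usage", ["type:node"], 2048)

print("node2/memory/usage" in store.data)                    # True: data exists
print("node2/memory/usage" in store.tag_index["type:node"])  # False: tag query misses it
```

The last two lines show the exact observed symptom: the metric definition and data are present, but the metric never appears in tag-query results.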
(In reply to Matt Wringe from comment #21)
> (In reply to Junqi Zhao from comment #20)
> > (In reply to Matt Wringe from comment #19)
> > > I don't believe there is any way to actually test and verify this issue.
> >
> > So, how can we test this defect?
>
> I don't know if you can, at least not easily.
>
> Basically, the problem is that you can send a single REST call to Hawkular
> Metrics where Hawkular Metrics then does two writes to Cassandra. If one of
> those writes fails, then you will get this issue.
>
> You might be able to reproduce it by getting Cassandra into a state where
> it's under heavy load and a significant number of writes are failing.
>
> @jsanda: any thoughts on this?

Unfortunately, this is probably the only way from a black-box testing perspective.
@jsanda, it seems we do not need to test on a large cluster; we just need to send a single REST call to Hawkular Metrics that causes Hawkular Metrics to do two writes to Cassandra, and make one of those writes fail. So, I would like to know how to do that. Thanks
That's not all that's required. You also need to somehow cause it to not return an error code (or cause Heapster to fail to parse the error code), so that the information isn't repaired automatically like it should be.

That was with the original implementation. We couldn't find a way to trigger all these cases at the same time.
(In reply to Michael Burman from comment #24)
> That's not all that's required. You also need to somehow cause it to not
> return an error code (or cause Heapster to fail to parse the error code), so
> that the information isn't repaired automatically like it should be.
>
> That was with the original implementation. We couldn't find a way to trigger
> all these cases at the same time.

Michael, I think your suggestion is not easy to carry out from a black-box testing perspective. Do you have an easy and stable way to do that?
No, there's no easy way to trigger this scenario; we couldn't repeat it with tests. I guess my approach would be to attach a debugger to Cassandra, catch all the queries, and then time out the processing of one of them. Or something like that. It might work, but I'm not sure in which part of Cassandra one should catch it so that the driver doesn't detect it. There is no easy solution to testing this.
@Michael, I tested, but I cannot set up the preconditions to trigger this error. I think we could insert some code to make it happen, which would make it easy to test from a white-box testing perspective. Do you know how to do it that way?
@Jeff, since we are not able to reproduce this error, how should we handle this defect now? Here are my thoughts: I could let metrics run for a few hours, and if no such error is found, I will close this defect. If we find this error again in the future, we can reopen it.
Other than patching Cassandra so it can predictably reproduce the error, comment 26 is likely the easiest way to reproduce this. If you want to try it, let me know. I am happy to assist.
@John, OK, we can try to test based on comment 26, but I don't know how to time out processing of one of the queries.
(In reply to John Sanda from comment #29)
> Other than patching Cassandra so it can predictably reproduce the error,
> comment 26 is likely the easiest way to reproduce this. If you want to try
> it, let me know. I am happy to assist.

OK, we can try to test based on comment 26, but I don't know how to time out processing of one of the queries.
(In reply to Junqi Zhao from comment #31)
> (In reply to John Sanda from comment #29)
>
> OK, we can try to test based on comment 26, but I don't know how to time out
> processing of one of the queries.

From comment 26 you basically need to attach a debugger to Cassandra, insert a breakpoint at a specific area, monitor the requests coming in, decide to wait on some of those requests but not others, and also manipulate how Heapster is doing things, since it will try to fix this problem in normal situations (the only case where Heapster should stop trying is if it's restarted).

It's basically impossible to do this manually as described in comment 26. I think the only real option here is to create a test Cassandra image that has been hacked up to fail every other request, and then use the Hawkular Metrics REST endpoint directly. That should allow you to reproduce this issue, but it's going to be really difficult to set up and verify.
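The "fail every other request" idea above could be prototyped before building a patched image by wrapping the write path in a small fault injector. This is only an illustrative sketch of the approach under the stated assumption (a deterministic alternating failure); a real test would still need the hacked Cassandra image and the Hawkular Metrics REST endpoint.

```python
import itertools

class FailEveryOther:
    """Fault injector: lets one write through, fails the next, repeats.
    Mimics the proposed test Cassandra image's deterministic behavior."""
    def __init__(self, backend_write):
        self.backend_write = backend_write
        self.counter = itertools.count()

    def write(self, *args, **kwargs):
        if next(self.counter) % 2 == 1:
            raise TimeoutError("injected failure")
        return self.backend_write(*args, **kwargs)

written, failed = [], []

def real_write(row):
    written.append(row)

injector = FailEveryOther(real_write)
for i in range(6):
    try:
        injector.write(("metric-%d" % i, i))
    except TimeoutError:
        failed.append(i)

print(len(written), failed)  # half the writes succeed, every other one fails
```

Driving the two-writes-per-ingest path through an injector like this guarantees that one of each pair of writes fails, which is the condition needed to reproduce the missing-metric state deterministically.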
@jcantril, @pweil: I am not able to verify this defect right now; as you can see from comment 32, it is not easy to verify. Shall I just do our usual functional testing and leave it ON_QA until I can verify it?
@junqi If you cannot reproduce this with functional testing, I would verify the issue. If you'd like to keep the issue around for some soak time, so you don't have to reopen it if it's seen again, then that's fine too.
(In reply to Paul Weil from comment #34)
> @junqi If you cannot reproduce this with functional testing, I would verify
> the issue. If you'd like to keep the issue around for some soak time, so you
> don't have to reopen it if it's seen again, then that's fine too.

Paul, thanks. Please help me verify it; I will change the QA contact to you.