Description of problem: We are running into an issue where the memory/usage metric is missing for a node in a cluster. It's a 15-node cluster, and only one node seems to be affected. The metric appears correctly in the Heapster API; it's just not being stored in Hawkular Metrics.
In a previous Heapster log there was an error message about this node not being started yet, but after restarting the Heapster pod this error message is no longer in the logs. The Hawkular Metrics logs don't appear to show any errors either (aside from a few "Connection reset by peer" errors and a warning about "Multiple resource methods"). The Cassandra logs look good as well.
It also seems to be affecting:
cpu/node_reservation
cpu/node_utilization
cpu/request
memory/node_reservation
memory/node_utilization

Heapster version: 1.2.0
@micke: can you look through the Heapster sink code and see if anything there makes it obvious why the in-memory sink is able to display these metrics but the Hawkular sink is not able to get them?
I couldn't repeat this with a simple unit test, so it must be something more complex. Apart from filtering, we don't really reject any metrics (and errors should be logged), so I will need to test this in a real environment to see whether there's perhaps a data truncation, naming function, or Heapster data-enricher issue.
Researching this more, the metric does actually exist: its metric definition is there and it is collecting metrics. The problem is that, for some reason, it is not showing up when doing a tag query.
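For reference, the inconsistency described here can be checked mechanically by comparing the full set of metric definitions against the IDs a tag query returns. This is a minimal offline sketch: the metric IDs, tag names, and data shapes are illustrative assumptions loosely modeled on what the Hawkular Metrics REST API returns, not verbatim output.

```python
# Sketch of the symptom: a metric whose definition exists, but which
# a tag query fails to return. IDs and tags below are made up.

# All gauge definitions (e.g. what GET .../gauges would list).
definitions = [
    {"id": "node1/memory/usage", "tags": {"type": "node"}},
    {"id": "node2/memory/usage", "tags": {"type": "node"}},
]

# Result of a tag query (e.g. GET .../gauges?tags=type:node).
# node2 is missing even though its definition and data exist.
tag_query_result = [
    {"id": "node1/memory/usage"},
]

def missing_from_tag_query(defs, tagged):
    """Return IDs that have a definition but are absent from the tag query."""
    tagged_ids = {m["id"] for m in tagged}
    return sorted(d["id"] for d in defs if d["id"] not in tagged_ids)

print(missing_from_tag_query(definitions, tag_query_result))
# A non-empty result is exactly the inconsistency described above.
```

Running this comparison against a real cluster would pinpoint which nodes' metrics have fallen out of the tag index.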
Upstream Hawkular Metrics issue: https://issues.jboss.org/browse/HWKMETRICS-660
This has been fixed in our OCP 3.6 images. I have cloned the issue for the work that needs to be done for OCP 3.5.
@Matt, I would like to verify this using the following steps:
1. Deploy metrics on an N-node (N >= 15) cluster.
2. Create pods on these nodes and ensure every node has at least one pod.
3. Check the pods' memory/cpu/network usage on every node, and make sure no memory/cpu/network metric is lost.

Is that enough, or do you have other comments? Thanks
(In reply to Junqi Zhao from comment #18)
> @Matt,
>
> I would like to verify this using the following steps:
> 1. Deploy metrics on an N-node (N >= 15) cluster.
> 2. Create pods on these nodes and ensure every node has at least one pod.
> 3. Check the pods' memory/cpu/network usage on every node, and make sure no
> memory/cpu/network metric is lost.
>
> Is that enough, or do you have other comments?

I don't believe there is any way to actually test and verify this issue.
(In reply to Matt Wringe from comment #19)
> I don't believe there is any way to actually test and verify this issue.

So, how can we test this defect?
(In reply to Junqi Zhao from comment #20)
> (In reply to Matt Wringe from comment #19)
> > I don't believe there is any way to actually test and verify this issue.
>
> So, how can we test this defect?

I don't know if you can, at least not easily.

Basically, the problem is that you can send a single REST call to Hawkular Metrics where Hawkular Metrics then does two writes to Cassandra. If one of those writes fails, then you will get this issue.

You might be able to reproduce it by getting Cassandra into a state where it's under heavy load and a significant number of writes are failing.

@jsanda: any thoughts on this?
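The failure mode described above can be sketched in a few lines: one ingest call fans out into two independent writes (the data table and the tag index), and if the index write fails without the error being surfaced, the metric's data exists while tag queries miss it. The class and method names here are illustrative assumptions, not the actual Hawkular Metrics or Cassandra code.

```python
class FakeCassandra:
    """Toy store with the two 'tables' that one ingest call writes to."""
    def __init__(self):
        self.data = {}        # metric id -> list of data points
        self.tag_index = {}   # tag -> set of metric ids
        self.fail_next_index_write = False

    def write_data(self, metric_id, point):
        self.data.setdefault(metric_id, []).append(point)

    def write_tag_index(self, metric_id, tags):
        if self.fail_next_index_write:
            self.fail_next_index_write = False
            raise TimeoutError("simulated Cassandra write timeout")
        for tag in tags:
            self.tag_index.setdefault(tag, set()).add(metric_id)

def ingest(store, metric_id, tags, point):
    """One REST call triggers two writes. Swallowing the index failure
    models the bug: no error reaches the client, so nothing retries."""
    store.write_data(metric_id, point)
    try:
        store.write_tag_index(metric_id, tags)
    except TimeoutError:
        pass  # bug: the failure is not reported back to the sender

store = FakeCassandra()
ingest(store, "node1/memory/usage", ["type:node"], 1024)
store.fail_next_index_write = True
ingest(store, "node2/memory/usage", ["type:node"], 2048)

print("node2/memory/usage" in store.data)                    # True: data exists
print("node2/memory/usage" in store.tag_index["type:node"])  # False: tag query misses it
```

The last two lines show the exact observed symptom: the metric definition and data are present, but the metric never appears in tag-query results.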
(In reply to Matt Wringe from comment #21)
> (In reply to Junqi Zhao from comment #20)
> > (In reply to Matt Wringe from comment #19)
> > > I don't believe there is any way to actually test and verify this issue.
> >
> > So, how can we test this defect?
>
> I don't know if you can, at least not easily.
>
> Basically, the problem is that you can send a single REST call to Hawkular
> Metrics where Hawkular Metrics then does two writes to Cassandra. If one of
> those writes fails, then you will get this issue.
>
> You might be able to reproduce it by getting Cassandra into a state where
> it's under heavy load and a significant number of writes are failing.
>
> @jsanda: any thoughts on this?

Unfortunately, this is probably the only way from a black-box testing perspective.
@jsanda, it seems we do not need to test on a large cluster; we just need to send a single REST call to Hawkular Metrics that causes Hawkular Metrics to do two writes to Cassandra, and make one of those writes fail. So, I would like to know how to do that. Thanks
That's not all that's required. You also need to somehow cause it to not return an error code (or cause Heapster to fail to parse the error code), so that the information isn't repaired automatically like it should be.

That was with the original implementation. We couldn't find a way to trigger all these cases at the same time.
(In reply to Michael Burman from comment #24)
> That's not all that's required. You also need to somehow cause it to not
> return an error code (or cause Heapster to fail to parse the error code), so
> that the information isn't repaired automatically like it should be.
>
> That was with the original implementation. We couldn't find a way to trigger
> all these cases at the same time.

Michael, I think your suggestion is not easy to carry out from a black-box testing perspective. Do you have an easy and stable way to do that?
No, there's no easy way to trigger this scenario; we couldn't repeat it with tests. I guess my approach would be to attach a debugger to Cassandra, catch all the queries, and then time out the processing of one of them. Or something like that. It might work, but I'm not sure in which part of Cassandra one should catch it so that the driver doesn't detect it. There is no easy solution to testing this.
@Michael, I tested, but I cannot set up the preconditions to trigger this error. I think we could insert some code to make it happen, which would make it easy to test from a white-box testing perspective. Do you know how to do it that way?
@Jeff, since we are not able to reproduce this error, how should we handle this defect now? Here are my thoughts: I could let metrics run for a few hours, and if no such error is found, I will close this defect. If we find this error again in the future, we can reopen it.
Other than patching Cassandra so it can predictably reproduce the error, comment 26 is likely the easiest way to reproduce this. If you want to try it, let me know. I am happy to assist.
@John, OK, we can try to test based on comment 26, but I don't know how to time out processing of one of the queries.
(In reply to John Sanda from comment #29)
> Other than patching Cassandra so it can predictably reproduce the error,
> comment 26 is likely the easiest way to reproduce this. If you want to try
> it, let me know. I am happy to assist.

OK, we can try to test based on comment 26, but I don't know how to time out processing of one of the queries.
(In reply to Junqi Zhao from comment #31)
> (In reply to John Sanda from comment #29)
>
> OK, we can try to test based on comment 26, but I don't know how to time out
> processing of one of the queries.

From comment 26 you basically need to attach a debugger to Cassandra, insert a breakpoint at a specific area, monitor the requests coming in, decide to wait on some of those requests but not others, and also manipulate how Heapster is doing things, since it will try to fix this problem in normal situations (the only case where Heapster should stop trying is if it's restarted).

It's basically impossible to do this manually as described in comment 26. I think the only real option here is to create a test Cassandra image that has been hacked up to fail every other request, and then use the Hawkular Metrics REST endpoint directly. That should allow you to reproduce this issue, but it's going to be really difficult to set up and verify.
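The "fail every other request" idea above could be prototyped before building a patched image by wrapping the write path in a small fault injector. This is only an illustrative sketch of the approach under the stated assumption (a deterministic alternating failure); a real test would still need the hacked Cassandra image and the Hawkular Metrics REST endpoint.

```python
import itertools

class FailEveryOther:
    """Fault injector: lets one write through, fails the next, repeats.
    Mimics the proposed test Cassandra image's deterministic behavior."""
    def __init__(self, backend_write):
        self.backend_write = backend_write
        self.counter = itertools.count()

    def write(self, *args, **kwargs):
        if next(self.counter) % 2 == 1:
            raise TimeoutError("injected failure")
        return self.backend_write(*args, **kwargs)

written, failed = [], []

def real_write(row):
    written.append(row)

injector = FailEveryOther(real_write)
for i in range(6):
    try:
        injector.write(("metric-%d" % i, i))
    except TimeoutError:
        failed.append(i)

print(len(written), failed)  # half the writes succeed, every other one fails
```

Driving the two-writes-per-ingest path through an injector like this guarantees that one of each pair of writes fails, which is the condition needed to reproduce the missing-metric state deterministically.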
@jcantril, @pweil: I am not able to verify this defect right now; as you can see from comment 32, it is not easy to verify. Shall I just do our usual functional testing and leave it ON_QA until I can verify it?
@junqi If you cannot reproduce this with functional testing, I would verify the issue. If you'd like to keep the issue around for some soak time, so you don't have to reopen it if it's seen again, then that's fine too.
(In reply to Paul Weil from comment #34)
> @junqi If you cannot reproduce this with functional testing, I would verify
> the issue. If you'd like to keep the issue around for some soak time, so you
> don't have to reopen it if it's seen again, then that's fine too.

Paul, thanks. Please help me verify it; I will change the QA contact to you.