Bug 1284700 - Heapster is using the deprecated externalID value to identify metrics
Summary: Heapster is using the deprecated externalID value to identify metrics
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.2.1
Assignee: Matt Wringe
QA Contact: Qixuan Wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-11-23 23:38 UTC by Federico Simoncelli
Modified: 2016-05-25 06:19 UTC (History)
12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 16:25:32 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
  System: Red Hat Product Errata
  ID: RHSA-2016:1064
  Private: 0
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: Important: Red Hat OpenShift Enterprise 3.2 security, bug fix, and enhancement update
  Last Updated: 2016-05-12 20:19:17 UTC

Description Federico Simoncelli 2015-11-23 23:38:24 UTC
Description of problem:
According to the Heapster code (and empirical tests), Heapster is using the node ExternalID:

https://github.com/kubernetes/kubernetes/blob/release-1.0/pkg/api/v1/types.go#L1256

to identify nodes (LabelHostID).

As a side effect, Hawkular uses the ExternalID to generate the resource identifiers for the metrics, e.g.:

  machine/<ExternalID>/memory/usage

The ExternalID is deprecated (see the documentation in the link above) and non-deterministic (see bug #1284614 and bug #1284621).

For these reasons, Heapster should use the node name, which is unique and deterministic, as the node resource identifier.
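
For illustration, with a hypothetical node whose name is node1.example.com and whose ExternalID is 9656520216661250444 (both values made up), the identifier stored today and the requested one would look like:

  machine/9656520216661250444/memory/usage    (current: ExternalID-based)
  machine/node1.example.com/memory/usage      (requested: node-name-based)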

How reproducible:
100%

Steps to Reproduce:
1. Set the ExternalID of a node
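
For example, a minimal node-config.yaml sketch (hypothetical node name; the externalID value is arbitrary), followed by a restart of the node service:

nodeName: node1.example.com
externalID: "9656520216661250444"

On a node using a cloud provider, the ExternalID is instead set by the provider itself.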

Actual results:
The node metrics are stored with a resource id generated using the ExternalID.

Expected results:
The node metrics should be stored with a resource id generated using the node name.

Comment 1 Jeff Cantrill 2015-11-30 14:31:56 UTC
Opened upstream issue per request from Jimmi Dyson: https://github.com/kubernetes/heapster/issues/743

Comment 2 Jeff Cantrill 2015-11-30 14:34:03 UTC
Closed new upstream issue in favor of https://github.com/kubernetes/heapster/issues/731

Comment 3 Jeff Cantrill 2016-01-06 20:36:10 UTC
Lowering from 'must fix' as there is an open PR with a fix (https://github.com/kubernetes/heapster/pull/749) and, per mwringe, 'once openshift fixes the ansible installer, then the value you get back should be exactly the same as the changes they are requesting'. Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1284621

Comment 4 Lutz Lange 2016-03-11 13:04:49 UTC
I'm running OSE on GCE. I set up my metrics stack and can see pod and container metrics on the CloudForms side. I can't get any node info, though.

My node spec looks like this:

spec:
  externalID: "9656520216661250444"
  providerID: gce://jens-walkthrough/europe-west1-c/osemaster

So the externalID that I have does not reflect the node name. 

This setup is important to us: it is to be used for customer-facing events with Google, and we need to get the demo ready. Having no data in the Overview page of CF is not an option.

There will be 10+ events and we are expecting more than 100 delegates at each of these events.

Comment 5 Federico Simoncelli 2016-03-11 13:24:12 UTC
Indexing the node metrics on an obsolete key (externalID) that is not the node's identifier (the node name), besides being an issue that should be resolved in its own right, is also preventing CloudForms from collecting the metrics.

This is high severity for CloudForms because it's preventing the dashboard from displaying the metrics of the cluster.

Comment 6 Federico Simoncelli 2016-03-11 14:26:02 UTC
(In reply to Jeff Cantrill from comment #3)
> Lowering from 'must fix' as there is an open PR to fix
> https://github.com/kubernetes/heapster/pull/749 and per mwringe ' once
> openshfit fixes the ansible installer, then the value you get back should be
> exactly the same as the changes they are requesting'. Ref:
> https://bugzilla.redhat.com/show_bug.cgi?id=1284621

The openshift-ansible part was fixed, but the bug still stands for those users who configured a cloud provider (as per comment 4).

Comment 14 Lutz Lange 2016-03-16 05:21:43 UTC
The documentation on how to configure AWS persistent storage lists a cloud-config entry. This is not mentioned in the GCE part. I imagine that I need that to make full use of CreateVolume and DeleteVolume.

What would the workflow be like? Would just having the PVC be enough, with OSE then going off and creating a matching volume for me?
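
The kind of claim I mean would look roughly like this (just a sketch; name and size are arbitrary):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi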

Comment 15 Matt Wringe 2016-03-18 17:08:48 UTC
Our 3.2 images should now be using the nodename when generating the metric id.

Comment 16 Qixuan Wang 2016-03-21 09:41:59 UTC
I set up an environment on OpenStack using OSE 3.1.1.6/3.2.0.4 images; even after adding a numeric externalID, the master still returned the nodeName as the externalID. But when tested on GCE (OpenShift 3.2.0.3), the problem isn't fixed.


OpenStack

1. [root@openshift-129 ~]# oc get node
NAME                               STATUS                     AGE
openshift-129.lab.sjc.redhat.com   Ready,SchedulingDisabled   3d
openshift-137.lab.sjc.redhat.com   Ready                      3d

2. Add externalID: "9656520216661250444" into node-config.yaml on the node (openshift-137.lab.sjc.redhat.com), like this, then restart the node service:
nodeName: openshift-137.lab.sjc.redhat.com
externalID: "9656520216661250444"


3. The master reports an externalID with the same value as the nodeName:
[root@openshift-129 ~]# oc get node openshift-137.lab.sjc.redhat.com -o yaml | grep external
  externalID: openshift-137.lab.sjc.redhat.com



===================================================
GCE cloud-provider

1. [root@ose-32-dma-master ~]# oc get node
NAME                                               STATUS                     AGE
ose-32-dma-master.c.openshift-gce-devel.internal   Ready,SchedulingDisabled   5d
ose-32-dma-node-1.c.openshift-gce-devel.internal   Ready                      25m

2. [root@ose-32-dma-master ~]# oc get node ose-32-dma-node-1.c.openshift-gce-devel.internal -o yaml | grep external
  externalID: "13946820683979383815"

Comment 17 Matt Wringe 2016-03-21 14:18:59 UTC
This issue has nothing to do with what is being returned by 'oc get node'. 'oc get node' is going to return the externalId here, and that will remain the case until upstream Kubernetes removes this deprecated value.

What this issue is about is the metricId being stored in Hawkular Metrics using the 'externalId' instead of the 'node name'.

To verify:

1) set up your system so that, when calling 'oc get node', the 'externalId' does not match the node name

2) deploy the 3.2 metrics images

3) fetch the system-level metrics and check whether the ids for those nodes use the hostname and not the externalId.

e.g.:
curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -X GET https://${HAWKULAR_METRICS_HOSTNAME_OR_IP_ADDRESS}/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"

The id should be in the form: foo/${HOSTNAME}/bar/.. and not foo/${EXTERNAL_ID}/bar/...
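
If the Hawkular Metrics hostname is not handy, it can usually be read from the route exposed for it (assuming the default metrics deployment in the openshift-infra project):

oc get route -n openshift-infra

and taking the HOST/PORT value of the hawkular-metrics route.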

Comment 18 Qixuan Wang 2016-03-23 10:57:02 UTC
Tested the metrics 3.2 image on OpenShift on OpenStack. The ExternalID is always overwritten by the node name (comment 16), even if a value that does not match the node name is added, so I think the metricId always gets the right value.

[root@dhcp-136-211 qwang]# curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -k -X GET https://hawkular-metrics.0318-02v.qe.rhcloud.com/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69289    0 69289    0     0  28828      0 --:--:--  0:00:02 --:--:-- 28834
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/uptime",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/uptime",
        "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx_errors",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx_errors",
        "id": "machine/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx_errors",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx_errors",
        "id": "machine/openshift-137.lab.sjc.redhat.com/uptime",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/working_set",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/working_set",
        "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/working_set",


I'm running into some trouble on GCE and still need time to verify this bug.

Comment 19 Qixuan Wang 2016-03-24 10:48:47 UTC
I think comment 18 was based on a misunderstanding on my part. Please ignore the above.

Added OpenStack as the cloud provider (node configuration sketched below), made sure the node's ExternalID does not match the node name, and got the results that follow. So the bug is verified, thanks.
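
For reference, the cloud provider was enabled on the node roughly like this in node-config.yaml (a sketch; the cloud.conf path is an assumption):

kubeletArguments:
  cloud-provider:
    - "openstack"
  cloud-config:
    - "/etc/cloud.conf"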

[root@openshift-129 ~]# oc get node
NAME                               STATUS                     AGE
openshift-129.lab.sjc.redhat.com   Ready,SchedulingDisabled   6d
openshift-137.lab.sjc.redhat.com   Ready                      2m

[root@openshift-129 ~]# oc describe node/openshift-137.lab.sjc.redhat.com|grep ExternalID
ExternalID:			57a3d2f5-ef9e-4fc0-bb0e-6568390f1831

[root@dhcp-136-211 qwang]# curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -k -X GET https://hawkular-metrics.0318-02v.qe.rhcloud.com/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69297    0 69297    0     0  19947      0 --:--:--  0:00:03 --:--:-- 19953
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/working_set",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/working_set",
        "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/limit",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/working_set",
        "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/limit",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/working_set",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/uptime",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "kubelet/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "kubelet/openshift-137.lab.sjc.redhat.com/uptime",
        "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/usage",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "machine/openshift-129.lab.sjc.redhat.com/memory/page_faults",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx_errors",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx",
        "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx_errors",
        "id": "machine/openshift-129.lab.sjc.redhat.com/uptime",
        "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/usage",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
        "id": "machine/openshift-137.lab.sjc.redhat.com/memory/page_faults",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx_errors",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx",
        "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx_errors",
        "id": "machine/openshift-137.lab.sjc.redhat.com/uptime",

Comment 21 errata-xmlrpc 2016-05-12 16:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064

