Description of problem:
According to the heapster code (and empirical tests), heapster uses the node ExternalID (https://github.com/kubernetes/kubernetes/blob/release-1.0/pkg/api/v1/types.go#L1256) to identify nodes (LabelHostID). As a side effect, Hawkular uses the ExternalID to generate the resource identifiers for the metrics, e.g.:

    machine/<ExternalID>/memory/usage

The ExternalID is deprecated (see the documentation in the link above) and non-deterministic (see bug #1284614 and bug #1284621). For these reasons heapster should use the node name, which is unique and deterministic, as the node resource identifier.

How reproducible:
100%

Steps to Reproduce:
1. Set the ExternalID of a node

Actual results:
The node metrics are stored with a resource id generated from the ExternalID.

Expected results:
The node metrics should be stored with a resource id generated from the node name.
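The naming scheme at issue can be sketched as follows. This is only an illustration of the id convention described above, not heapster's actual code (the real logic lives in heapster's Hawkular sink); the sample ExternalID and node name are made up.

```shell
# Illustration only: how a metric resource id is assembled from its parts.
metric_id() {
  # $1 = container group (e.g. machine), $2 = node key, $3 = metric path
  printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Current (broken) behaviour: keyed on the deprecated, non-deterministic ExternalID
metric_id machine 9656520216661250444 memory/usage
# prints: machine/9656520216661250444/memory/usage

# Expected behaviour: keyed on the stable, unique node name
metric_id machine node-1.example.com memory/usage
# prints: machine/node-1.example.com/memory/usage
```

Because consumers such as CloudForms look nodes up by name, only the second form lets them correlate stored metrics back to the node.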
Opened upstream issue per request from Jimmi Dyson: https://github.com/kubernetes/heapster/issues/743
Closed new upstream issue in favor of https://github.com/kubernetes/heapster/issues/731
Lowering from 'must fix' as there is an open PR to fix this (https://github.com/kubernetes/heapster/pull/749) and, per mwringe, "once OpenShift fixes the ansible installer, then the value you get back should be exactly the same as the changes they are requesting". Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1284621
I'm running OSE on GCE. I set up my metrics stack and can see pod and container metrics on the CloudForms side, but I can't get any node info. My node spec looks like this:

    spec:
      externalID: "9656520216661250444"
      providerID: gce://jens-walkthrough/europe-west1-c/osemaster

So the externalID that I have does not reflect the node name. This setup is important to us: it is to be used for customer-facing events with Google, and we need to get the demo ready. Having no data on the Overview page of CF is not an option. There will be 10+ events, and we are expecting more than 100 delegates at each of them.
Indexing the node metrics on an obsolete key (externalID), which is not the id of the node (the node name), is not only an issue that should be resolved in its own right; it is also preventing CloudForms from collecting the metrics. This is high severity for CloudForms because it prevents the dashboard from displaying the metrics of the cluster.
(In reply to Jeff Cantrill from comment #3) > Lowering from 'must fix' as there is an open PR to fix > https://github.com/kubernetes/heapster/pull/749 and per mwringe ' once > openshfit fixes the ansible installer, then the value you get back should be > exactly the same as the changes they are requesting'. Ref: > https://bugzilla.redhat.com/show_bug.cgi?id=1284621 The part in openshift-ansible was fixed but the bug still stands for those users that configured a cloudprovider (as per comment 4).
The documentation on how to configure AWS persistent storage lists a cloud-config entry. This is not mentioned in the GCE part. I do imagine that I need that to make full use of CreateVolume and DeleteVolume. What would the workflow be like? Would just having the PVC be enough, and would OSE then go off and create a matching volume for me?
Our 3.2 images should now be using the node name when generating the metric id.
I set up an environment on OpenStack using the OSE 3.1.1.6/3.2.0.4 images; even after adding externalID=<number>, the master still got the nodeName back as the externalID. But when tested on GCE (openshift 3.2.0.3), the problem isn't fixed.

OpenStack:

1. [root@openshift-129 ~]# oc get node
   NAME                               STATUS                     AGE
   openshift-129.lab.sjc.redhat.com   Ready,SchedulingDisabled   3d
   openshift-137.lab.sjc.redhat.com   Ready                      3d

2. Add externalID: "9656520216661250444" into node-config.yaml on the node (openshift-137.lab.sjc.redhat.com), like this, then restart the node service:
   nodeName: openshift-137.lab.sjc.redhat.com
   externalID: "9656520216661250444"

3. The master gets an externalID with the same value as the nodeName:
   [root@openshift-129 ~]# oc get node openshift-137.lab.sjc.redhat.com -o yaml | grep external
     externalID: openshift-137.lab.sjc.redhat.com

===================================================

GCE cloud-provider:

1. [root@ose-32-dma-master ~]# oc get node
   NAME                                               STATUS                     AGE
   ose-32-dma-master.c.openshift-gce-devel.internal   Ready,SchedulingDisabled   5d
   ose-32-dma-node-1.c.openshift-gce-devel.internal   Ready                      25m

2. [root@ose-32-dma-master ~]# oc get node ose-32-dma-node-1.c.openshift-gce-devel.internal -o yaml | grep external
     externalID: "13946820683979383815"
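A quick way to see whether a node's externalID diverges from its name is to compare the two fields on the node object. The sketch below runs against a saved copy of the node YAML; the inline sample here stands in for real `oc get node <name> -o yaml` output (values taken from the GCE node above), since this snippet does not talk to a live cluster.

```shell
# Inline sample standing in for 'oc get node <name> -o yaml' output.
cat > node.yaml <<'EOF'
metadata:
  name: ose-32-dma-node-1.c.openshift-gce-devel.internal
spec:
  externalID: "13946820683979383815"
EOF

# Extract the two fields; gsub strips the quotes around the externalID value.
name=$(awk '$1 == "name:" {print $2}' node.yaml)
ext=$(awk '$1 == "externalID:" {gsub(/"/, ""); print $2}' node.yaml)

if [ "$name" != "$ext" ]; then
  echo "mismatch: nodeName=$name externalID=$ext"
fi
```

On a cloud-provider setup like the GCE one above, this prints a mismatch line; on the OpenStack setup, where the externalID was overwritten with the node name, it prints nothing.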
This issue has nothing to do with what is returned by 'oc get node'. 'oc get node' is going to return the externalID here, and that will remain the case until upstream Kubernetes removes this deprecated value. What this issue is about is the metric id being stored in Hawkular Metrics using the externalID instead of the node name.

To verify:

1) Set up your system so that, when calling 'oc get node', the externalID does not match the node name.
2) Deploy the 3.2 metrics images.
3) Fetch the system-level metrics and check whether the ids for those nodes use the hostname and not the externalID, e.g.:

curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -X GET https://${HAWKULAR_METRICS_HOSTNAME_OR_IP_ADDRESS}/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"

The id should be in the form foo/${HOSTNAME}/bar/... and not foo/${EXTERNAL_ID}/bar/...
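The grep in step 3 still requires eyeballing the output. As a rough automated check for GCE-style setups (where the ExternalID is purely numeric, as in the examples in this bug), any id whose node segment is all digits is suspect. This is a self-contained sketch over sample ids, not a live query, and the digit heuristic would not catch UUID-style OpenStack ExternalIDs:

```shell
# Sample ids standing in for the output of the curl above; the second one is
# what a broken, ExternalID-keyed id looks like on GCE.
ids='machine/openshift-137.lab.sjc.redhat.com/uptime
machine/13946820683979383815/cpu/usage'

# Print any id whose second path segment (the node key) is all digits.
echo "$ids" | awk -F/ '$2 ~ /^[0-9]+$/ { print "suspect:", $0 }'
# prints: suspect: machine/13946820683979383815/cpu/usage
```

An empty result from this filter on the real curl output would mean no node ids are keyed on a numeric ExternalID.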
Tested the metrics 3.2 image on OpenShift on OpenStack. The ExternalID is always overwritten by the node name (comment 16), even when a value mismatching the node name is added, so I think the metricId always gets the right value.

[root@dhcp-136-211 qwang]# curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -k -X GET https://hawkular-metrics.0318-02v.qe.rhcloud.com/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69289    0 69289    0     0  28828      0 --:--:--  0:00:02 --:--:-- 28834
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/uptime",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/uptime",
    "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx_errors",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx_errors",
    "id": "machine/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx_errors",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx_errors",
    "id": "machine/openshift-137.lab.sjc.redhat.com/uptime",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/working_set",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/working_set",
    "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/working_set",

I'm running into some trouble on GCE; I still need time to verify this bug.
I think I had a misunderstanding about comment 18; please ignore the above. I added openstack as the cloud-provider, made sure its ExternalID does not match the node name, and got the following result. So, verifying the bug. Thanks.

[root@openshift-129 ~]# oc get node
NAME                               STATUS                     AGE
openshift-129.lab.sjc.redhat.com   Ready,SchedulingDisabled   6d
openshift-137.lab.sjc.redhat.com   Ready                      2m

[root@openshift-129 ~]# oc describe node/openshift-137.lab.sjc.redhat.com | grep ExternalID
ExternalID:    57a3d2f5-ef9e-4fc0-bb0e-6568390f1831

[root@dhcp-136-211 qwang]# curl --insecure -H "Authorization: Bearer `oc whoami -t`" -H "Hawkular-tenant: _system" -k -X GET https://hawkular-metrics.0318-02v.qe.rhcloud.com/hawkular/metrics/metrics | python -m json.tool | grep -i \"id\"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69297    0 69297    0     0  19947      0 --:--:--  0:00:03 --:--:-- 19953
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/working_set",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/working_set",
    "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/filesystem/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/limit",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/working_set",
    "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/filesystem/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/limit",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/working_set",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "docker-daemon/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "docker-daemon/openshift-137.lab.sjc.redhat.com/uptime",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "kubelet/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "kubelet/openshift-137.lab.sjc.redhat.com/uptime",
    "id": "machine/openshift-129.lab.sjc.redhat.com/cpu/usage",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "machine/openshift-129.lab.sjc.redhat.com/memory/page_faults",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/rx_errors",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx",
    "id": "machine/openshift-129.lab.sjc.redhat.com/network/tx_errors",
    "id": "machine/openshift-129.lab.sjc.redhat.com/uptime",
    "id": "machine/openshift-137.lab.sjc.redhat.com/cpu/usage",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/major_page_faults",
    "id": "machine/openshift-137.lab.sjc.redhat.com/memory/page_faults",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/rx_errors",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx",
    "id": "machine/openshift-137.lab.sjc.redhat.com/network/tx_errors",
    "id": "machine/openshift-137.lab.sjc.redhat.com/uptime",
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064