Created attachment 1485102 [details]
data is empty in web UI

Description of problem:
Deploy metrics v3.11.11 on HA OCP v3.11.11. Metrics data is empty in the web UI. There is no such issue on a non-HA env.

Take memory usage for example; inspecting the network data for the request
https://hawkular-metrics.apps.0920-umo.qe.rhcloud.com/hawkular/metrics/gauges/router%2Fe98e76bd-bc97-11e8-b0d7-fa163ec29712%2Fmemory%2Fusage/data?bucketDuration=120000ms&start=-60mn
the response is the following; the data is empty:
Object
  start 1537437161289
  end   1537437281289
  empty true

env info
*************************************************************
atomic-openshift version: v3.11.11
Operating System: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Cluster Install Method: rpm
Docker Version: docker-1.13.1-74.git6e3bb8e.el7.x86_64
Docker Storage Driver: overlay2
OpenvSwitch Version: openvswitch-2.9.0-55.el7fdp.x86_64
etcd Version: etcd-3.2.22-1.el7.x86_64
Network Plugin: redhat/openshift-ovs-networkpolicy
Auth Method: LDAP_IPA
Registry Deployment Method: deploymentconfig
Secure Registry: True
Registry Backend Storage: cinder
Load Balancer: Haproxy
Kubelet Container Runtime: crio
CRI-O rpm version: cri-o-1.11.4-2.rhaos3.11.gite0c89d8.el7_5.x86_64
Firewall Service: firewalld
*************************************************************

Version-Release number of selected component (if applicable):
metrics components version: v3.11.11-1
openshift v3.11.11
openshift-ansible-3.11.11-1

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics v3.11.11 on HA OCP v3.11.11

Actual results:
Metrics data is empty in the web UI.

Expected results:
Metrics data should be shown in the web UI.

Additional info:
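For reference, a minimal sketch of reproducing the same query from the command line; the route and metric id are taken from the failing request above, while the token and the Hawkular-Tenant value are placeholders (the tenant must be the project that owns the pod):

# Sketch only: TOKEN and <project> are placeholders.
TOKEN=$(oc whoami -t)
curl -k -H "Authorization: Bearer $TOKEN" -H "Hawkular-Tenant: <project>" \
  "https://hawkular-metrics.apps.0920-umo.qe.rhcloud.com/hawkular/metrics/gauges/router%2Fe98e76bd-bc97-11e8-b0d7-fa163ec29712%2Fmemory%2Fusage/data?bucketDuration=120000ms&start=-60mn"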
Blocks metrics testing on HA env
Created attachment 1485103 [details] metrics pods log
After some initial debugging, it appears that heapster is not sending metrics for any project. The only metric data stored in Cassandra is for the _system tenant, which does come from heapster. This explains why the graphs are empty. I increased logging for heapster and am not seeing any errors. We need to figure out whether heapster is actually collecting pod-level metrics from the different projects; based on the logs, I suspect it is not.
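A sketch of how one might check this from the heapster side (openshift-infra is the default namespace for openshift_metrics; the pod name below is a placeholder):

# Sketch only: the heapster pod name is a placeholder.
oc -n openshift-infra get pods | grep heapster
oc -n openshift-infra logs heapster-xxxxx --tail=200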
Tested with openshift_crio_var_sock: "/var/run/crio/crio.sock"

The qe-master configmap in openshift-node still has the unix:// prefix:
*****************************************************
container-runtime-endpoint:
- unix:///var/run/crio/crio.sock
image-service-endpoint:
- unix:///var/run/crio/crio.sock
*****************************************************
and /etc/origin/node/node-config.yaml on all nodes has the unix:// prefix:
*****************************************************
container-runtime-endpoint:
- unix:///var/run/crio/crio.sock
image-service-endpoint:
- unix:///var/run/crio/crio.sock
*****************************************************
After removing the unix:// prefix and restarting atomic-openshift-node.service on all nodes, we can see the data in the web UI. Maybe there are other playbooks that need to be changed.
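A sketch of the manual workaround described above, to be run on every node (the sed expression assumes the endpoint values appear exactly as quoted):

# Sketch only: strip the unix:// prefix from the kubelet endpoints, then restart the node service.
sed -i 's|unix:///var/run/crio/crio.sock|/var/run/crio/crio.sock|g' /etc/origin/node/node-config.yaml
systemctl restart atomic-openshift-node.service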
We should also modify the config maps for the different node groups, i.e. the one referenced by:
# cat /etc/sysconfig/atomic-openshift-node | grep BOOTSTRAP_CONFIG_NAME
To change an existing cluster, change the config maps for the nodes i.e. BOOTSTRAP_CONFIG_NAME and restart the atomic-openshift-node service on each node. If you don't want to restart the services, the sync DS pod should restart the kubelet when it eventually notices (3m max) that the config map has changed.
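A sketch of that config map route (the config map name is a placeholder; take it from BOOTSTRAP_CONFIG_NAME):

# Sketch only: <bootstrap-config-name> is a placeholder.
grep BOOTSTRAP_CONFIG_NAME /etc/sysconfig/atomic-openshift-node
oc -n openshift-node edit configmap <bootstrap-config-name>   # remove the unix:// prefix here
systemctl restart atomic-openshift-node.service               # optional; the sync DS pod picks it up otherwise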
Tested again; the issue is actually fixed. See the attached picture: metrics data can be shown in the web UI. The reason I thought it was not fixed and needed the workaround mentioned in Comment 9 is that there was an error in our OCP environment build template.

Although the issue is fixed, we can still see the warning info; it recommends using the format "unix:///var/run/crio/crio.sock":
# master-logs etcd etcd
W0925 07:16:36.726552   19184 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".

# openshift version
openshift v3.11.14
openshift-ansible-3.11.14-1

Please change it to ON_QA, then we can close it.
Created attachment 1486724 [details] metrics data is shown in web UI
Adding the crio version: cri-o://1.11.5
There are 3 namespaces (kube-system, openshift-node, openshift-sdn) that can show the network metrics graph, but it is empty for the other namespaces. See the attached pictures. The workaround is to restart atomic-openshift-node on every node, after which the network metrics graph is shown (a sketch follows after the attachments below). Maybe there should be a fix in openshift-ansible.
Created attachment 1498395 [details] network metrics graph could be shown - kube-system
Created attachment 1498396 [details] There is no network metrics graph - openshift-infra
Created attachment 1498397 [details] network metrics graph could be shown after restarting atomic-openshift-node.service - openshift-infra
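A sketch of the restart workaround from the comment above, assuming SSH access to all nodes (the loop is illustrative only):

# Sketch only: restart the node service on every node.
for node in $(oc get nodes -o name | cut -d/ -f2); do
  ssh "$node" 'systemctl restart atomic-openshift-node.service'
done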
Per comment #15.
Related fix: https://github.com/openshift/origin/pull/21398

This bug is tracking the openshift-ansible fix, so this is just FYI.
(In reply to Seth Jennings from comment #26)
> Related fix:
> https://github.com/openshift/origin/pull/21398

Does this PR fix the empty network graph issue?
My understanding of the matter is that the installer, in order to stop emitting deprecation warnings, updated the default crio socket to unix:///var/run/crio/crio.sock, and that broke metrics gathering. Seth's PRs to openshift-ansible reverted that change, so configs are now generated with a format that works for metrics gathering. Clusters that had previously generated their configmaps will need to either patch them or re-create them in order for the change to be distributed via the sync pod and the services restarted automatically.

Further, the PR referenced in comment 26 updates the kubelet to work with the new format. It has not merged, and I have no idea when it will, so we have to rely on the installer fix for now to address this issue.
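A sketch of how to check whether an existing bootstrap config map still carries the broken format (node-config-compute is only an example name; list the actual ones with oc -n openshift-node get configmaps):

# Sketch only: the config map name is an example.
oc -n openshift-node get configmap node-config-compute -o yaml | grep -A1 container-runtime-endpoint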
Will verify this bug after the PR referenced in comment 26 is merged.
I think this bug should be tested as is. We have resolved the problem for new installs and upgrades. The pull request referenced in comment 26 is supplementary but not required.
The issue in Comment 20 is reported in Bug 1646886; closing this defect.