1631300 – metrics data is empty in CRI-O env

Summary: metrics data is empty in CRI-O env

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Russell Teague
QA Contact:	Johnny Liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-20 11:09 UTC by Junqi Zhao
Modified:	2021-12-10 17:35 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-03-05 14:39:05 UTC
Target Upstream Version:
Embargoed:

Attachments
data is empty in web UI (187.85 KB, image/png) 2018-09-20 11:09 UTC, Junqi Zhao
metrics pods log (33.28 KB, application/x-gzip) 2018-09-20 11:13 UTC, Junqi Zhao
metrics data is shown in web UI (96.60 KB, image/png) 2018-09-25 11:38 UTC, Junqi Zhao
network metrics graph could be shown - kube-system (177.37 KB, image/png) 2018-10-29 02:16 UTC, Junqi Zhao
There is no network metrics graph - openshift-infra (200.19 KB, image/png) 2018-10-29 02:17 UTC, Junqi Zhao
network metrics graph could be shown after restarting atomic-openshift-node.service - openshift-infra (195.43 KB, image/png) 2018-10-29 02:33 UTC, Junqi Zhao

Description Junqi Zhao 2018-09-20 11:09:58 UTC

Created attachment 1485102 [details]
data is empty in web UI

Description of problem:
Deployed metrics v3.11.11 on HA OCP v3.11.11. Metrics data is empty in the web UI.
There is no such issue in a non-HA env.

Taking memory usage as an example, inspect the network data for this request:
https://hawkular-metrics.apps.0920-umo.qe.rhcloud.com/hawkular/metrics/gauges/router%2Fe98e76bd-bc97-11e8-b0d7-fa163ec29712%2Fmemory%2Fusage/data?bucketDuration=120000ms&start=-60mn

The response is as follows; the data is empty:
Object
start	1537437161289
end	1537437281289
empty	true
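
To check such a response outside the browser, a minimal sketch follows. It assumes the endpoint returns a JSON array of bucket objects shaped like the one above (start, end, empty). Both function names are hypothetical, the URL helper only mirrors the query parameters visible in the request above, and any headers the service requires are out of scope here.

```python
from urllib.parse import quote, urlencode


def gauge_data_url(base, metric_id, bucket_duration="120000ms", start="-60mn"):
    """Build a gauge data URL like the request above (hypothetical helper)."""
    # The metric id contains slashes, so it is percent-encoded in full.
    query = urlencode({"bucketDuration": bucket_duration, "start": start})
    return f"{base}/hawkular/metrics/gauges/{quote(metric_id, safe='')}/data?{query}"


def all_buckets_empty(buckets):
    """Return True when every bucket reports empty (or no buckets came back).

    Assumption: each bucket is a dict with an "empty" boolean, as in the
    response shown above. A bucket without that key is treated as non-empty.
    """
    return all(bucket.get("empty", False) for bucket in buckets)
```

Feeding the parsed JSON body to all_buckets_empty gives a quick yes/no for the symptom reported here.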


env info
*************************************************************
atomic-openshift version: v3.11.11
Operating System: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Cluster Install Method: rpm
Docker Version: docker-1.13.1-74.git6e3bb8e.el7.x86_64
Docker Storage Driver:  overlay2
OpenvSwitch Version: openvswitch-2.9.0-55.el7fdp.x86_64
etcd Version: etcd-3.2.22-1.el7.x86_64
Network Plugin: redhat/openshift-ovs-networkpolicy
Auth Method: LDAP_IPA
Registry Deployment Method: deploymentconfig
Secure Registry: True
Registry Backend Storage: cinder
Load Balancer: Haproxy
Kubelet Container Runtime: crio
CRI-O rpm version: cri-o-1.11.4-2.rhaos3.11.gite0c89d8.el7_5.x86_64
Firewall Service: firewalld
*************************************************************

Version-Release number of selected component (if applicable):
metrics components version: v3.11.11-1
openshift v3.11.11
openshift-ansible-3.11.11-1

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics v3.11.11 on HA OCP v3.11.11

Actual results:
Metrics data is empty in web UI.

Expected results:
Metrics data should be shown in web UI.

Additional info:

Comment 1 Junqi Zhao 2018-09-20 11:12:30 UTC

Blocks metrics testing on HA env

Comment 2 Junqi Zhao 2018-09-20 11:13:35 UTC

Created attachment 1485103 [details]
metrics pods log

Comment 4 John Sanda 2018-09-20 16:21:03 UTC

After some initial debugging, it appears that heapster is not sending metrics for any project. The only metric data stored in Cassandra is for the _system tenant which does come from heapster. This explains why the graphs are empty. I increased logging for heapster. I am not seeing any errors. We need to figure out whether or not heapster is actually collecting pod level metrics from different projects. Based on the logs I suspect it is not collecting pod metrics.

Comment 9 Junqi Zhao 2018-09-21 05:57:04 UTC

Tested with 
openshift_crio_var_sock: "/var/run/crio/crio.sock"

The configmap qe-master in openshift-node still has the unix:// prefix:
*****************************************************
      container-runtime-endpoint:
      - unix:///var/run/crio/crio.sock

      image-service-endpoint:
      - unix:///var/run/crio/crio.sock
*****************************************************

and /etc/origin/node/node-config.yaml on all nodes has the unix:// prefix:
*****************************************************
  container-runtime-endpoint:
  - unix:///var/run/crio/crio.sock

  image-service-endpoint:
  - unix:///var/run/crio/crio.sock
*****************************************************

After removing the unix:// prefix and restarting atomic-openshift-node.service on all nodes, we could see the data in the web UI.

Maybe other playbooks also need to be changed.
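
The manual edit described in this comment can be sketched as a small text rewrite. This is an illustration only, not the openshift-ansible fix: the function name is hypothetical, and as a simplification it rewrites every YAML list item that starts with unix://, not only the two endpoint keys shown above.

```python
import re

# Matches a YAML list item such as "  - unix:///var/run/crio/crio.sock".
_UNIX_ITEM = re.compile(r"^(\s*-\s*)unix://(/.*)$")


def strip_unix_prefix(config_text):
    """Drop the unix:// scheme from list items, leaving the bare socket path."""
    out = []
    for line in config_text.splitlines():
        match = _UNIX_ITEM.match(line)
        # Keep the original indentation and dash, replace only the value.
        out.append(match.group(1) + match.group(2) if match else line)
    return "\n".join(out)
```

The node service still has to be restarted afterwards for the change to take effect, as noted above.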

Comment 10 Junqi Zhao 2018-09-21 06:20:24 UTC

Different nodes may use different configmaps, so the configmap for each node should be modified as well:

# cat /etc/sysconfig/atomic-openshift-node  | grep BOOTSTRAP_CONFIG_NAME
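
The same lookup can be sketched in Python for scripting across nodes. It assumes the sysconfig file uses plain KEY=value lines; the function name and the sample value in the usage note are hypothetical.

```python
def bootstrap_config_name(sysconfig_text):
    """Return the BOOTSTRAP_CONFIG_NAME value from KEY=value text, or None."""
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        # Skip comments and anything that is not an assignment.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "BOOTSTRAP_CONFIG_NAME":
            # Tolerate optional surrounding quotes around the value.
            return value.strip().strip("\"'")
    return None
```

For example, text containing the line BOOTSTRAP_CONFIG_NAME=node-config-master (a made-up value) yields "node-config-master", which is the configmap to edit for that node.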

Comment 11 Seth Jennings 2018-09-21 15:45:47 UTC

To change an existing cluster, change the config maps for the nodes i.e. BOOTSTRAP_CONFIG_NAME and restart the atomic-openshift-node service on each node.  If you don't want to restart the services, the sync DS pod should restart the kubelet when it eventually notices (3m max) that the config map has changed.

Comment 15 Junqi Zhao 2018-09-25 11:37:50 UTC

Tested again; the issue is actually fixed. See the attached picture: metrics data is shown in the web UI. I thought it was not fixed and needed the workaround mentioned in Comment 9 because of an error in our OCP environment build template.

Although the issue is fixed, a warning is still logged recommending the format "unix:///var/run/crio/crio.sock".
# master-logs etcd etcd
W0925 07:16:36.726552   19184 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
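
The warning implies that both spellings resolve to the same socket, with the bare path treated as deprecated. A minimal sketch of that idea follows; it is not the actual util_unix.go logic, and the function name and return shape are hypothetical.

```python
from urllib.parse import urlparse


def normalize_endpoint(endpoint):
    """Return (socket_path, deprecated) for a bare path or a unix:// URL."""
    parsed = urlparse(endpoint)
    if parsed.scheme == "unix":
        # Full URL form, the one the warning recommends.
        return parsed.path, False
    if parsed.scheme == "":
        # Bare path: accepted, but this is the form the warning calls deprecated.
        return endpoint, True
    raise ValueError(f"unsupported endpoint scheme: {parsed.scheme!r}")
```

A consumer that only understands one of the two spellings would fail on the other, which matches the behaviour discussed in this bug.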

# openshift version
openshift v3.11.14

openshift-ansible-3.11.14-1

Please change the status to ON_QA so that we can close it.

Comment 16 Junqi Zhao 2018-09-25 11:38:42 UTC

Created attachment 1486724 [details]
metrics data is shown in web UI

Comment 17 Junqi Zhao 2018-09-25 11:40:49 UTC

add crio version
cri-o://1.11.5

Comment 20 Junqi Zhao 2018-10-29 02:14:29 UTC

Three namespaces (kube-system, openshift-node, openshift-sdn) show the network metrics graph, but it is empty for the other namespaces.
See the attached pictures.

The workaround is to restart atomic-openshift-node on every node, after which the network metrics graph is shown. Maybe this should be fixed in openshift-ansible.

Comment 21 Junqi Zhao 2018-10-29 02:16:25 UTC

Created attachment 1498395 [details]
network metrics graph could be shown - kube-system

Comment 22 Junqi Zhao 2018-10-29 02:17:11 UTC

Created attachment 1498396 [details]
There is no network metrics graph - openshift-infra

Comment 24 Junqi Zhao 2018-10-29 02:33:35 UTC

Created attachment 1498397 [details]
network metrics graph could be shown after restarting atomic-openshift-node.service - openshift-infra

Comment 25 Seth Jennings 2018-10-31 16:54:41 UTC

per c#15

Comment 26 Seth Jennings 2018-10-31 16:59:53 UTC

Related fix:
https://github.com/openshift/origin/pull/21398

This bug is tracking the openshift-ansible fix so this is just fyi.

Comment 28 Junqi Zhao 2018-11-01 00:43:43 UTC

(In reply to Seth Jennings from comment #26)
> Related fix:
> https://github.com/openshift/origin/pull/21398

Does this PR fix the empty network graph issue?

Comment 29 Scott Dodson 2018-11-01 12:47:10 UTC

My understanding of the matter is that the installer, in order to stop emitting deprecation warnings, updated the default crio socket to be unix:///var/run/crio/crio.sock, and that broke metrics gathering. Seth's PRs to openshift-ansible reverted that change, so configs are now generated with a format that works for metrics gathering.

If they had previously generated their configmaps they will either need to patch them or re-create them in order for this change to be distributed via the sync pod and services restarted automatically.

Further, the PR referenced in comment 26 updates the kubelet to work with the new format. It has not merged, I have no idea when it will, so we have to rely on the installer fix for now to address this issue.

Comment 30 Junqi Zhao 2018-11-02 00:36:31 UTC

Will verify this bug after the PR referenced in comment 26 is merged.

Comment 31 Scott Dodson 2018-11-02 14:25:52 UTC

I think this bug should be tested as is. We have resolved the problem for new installs and upgrades. The pull request referenced in comment 26 is supplementary but not required.

Comment 32 Junqi Zhao 2018-11-06 08:50:52 UTC

The issue in Comment 20 is reported in Bug 1646886; closing this defect.
