Bug 1631300

Summary: metrics data is empty in CRI-O env
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Installer
Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible
QA Contact: Johnny Liu <jialiu>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, dapark, jokerman, mmccomas, sdodson, ssadhale, xtian
Version: 3.11.0
Keywords: Regression, Reopened
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-05 14:39:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:

Description	Flags
data is empty in web UI	none
metrics pods log	none
metrics data is shown in web UI	none
network metrics graph could be shown - kube-system	none
There is no network metrics graph - openshift-infra	none
network metrics graph could be shown after restarting atomic-openshift-node.service - openshift-infra	none

Description Junqi Zhao 2018-09-20 11:09:58 UTC

Created attachment 1485102 [details]
data is empty in web UI

Description of problem:
Deployed metrics v3.11.11 on an HA OCP v3.11.11 cluster; metrics data is empty in the web UI.
The issue does not occur in a non-HA environment.

Take memory usage as an example. Inspecting the network data for this request:
https://hawkular-metrics.apps.0920-umo.qe.rhcloud.com/hawkular/metrics/gauges/router%2Fe98e76bd-bc97-11e8-b0d7-fa163ec29712%2Fmemory%2Fusage/data?bucketDuration=120000ms&start=-60mn

The response is as follows; the data is empty:
Object
start	1537437161289
end	1537437281289
empty	true
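
A check like the one done by hand above can be sketched in Python. This is only an illustration: it assumes the response body is JSON containing one bucket object, or a list of them, each with the `start`, `end`, and `empty` fields shown in the inspector. The function name `all_buckets_empty` and the sample body are made up for this sketch.

```python
import json


def all_buckets_empty(body):
    """Return True when every bucket in the JSON response is flagged empty."""
    buckets = json.loads(body)
    # Accept a single bucket object as well as a list of buckets.
    if isinstance(buckets, dict):
        buckets = [buckets]
    # A bucket without an "empty" field is treated as carrying data.
    return all(bucket.get("empty", False) for bucket in buckets)


# Sample body built from the values shown in the inspector (assumed JSON shape).
sample = '[{"start": 1537437161289, "end": 1537437281289, "empty": true}]'
print(all_buckets_empty(sample))
```

If this returns True for every metric of a pod, the graph has nothing to draw, which matches what the web UI shows here.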


env info
*************************************************************
atomic-openshift version: v3.11.11
Operating System: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Cluster Install Method: rpm
Docker Version: docker-1.13.1-74.git6e3bb8e.el7.x86_64
Docker Storage Driver:  overlay2
OpenvSwitch Version: openvswitch-2.9.0-55.el7fdp.x86_64
etcd Version: etcd-3.2.22-1.el7.x86_64
Network Plugin: redhat/openshift-ovs-networkpolicy
Auth Method: LDAP_IPA
Registry Deployment Method: deploymentconfig
Secure Registry: True
Registry Backend Storage: cinder
Load Balancer: Haproxy
Kubelet Container Runtime: crio
CRI-O rpm version: cri-o-1.11.4-2.rhaos3.11.gite0c89d8.el7_5.x86_64
Firewall Service: firewalld
*************************************************************

Version-Release number of selected component (if applicable):
metrics components version: v3.11.11-1
openshift v3.11.11
openshift-ansible-3.11.11-1

How reproducible:
Always

Steps to Reproduce:
1. Deploy metrics v3.11.11 on HA OCP v3.11.11

Actual results:
Metrics data is empty in web UI.

Expected results:
Metrics data should be shown in web UI.

Additional info:

Comment 1 Junqi Zhao 2018-09-20 11:12:30 UTC

Blocks metrics testing on HA env

Comment 2 Junqi Zhao 2018-09-20 11:13:35 UTC

Created attachment 1485103 [details]
metrics pods log

Comment 4 John Sanda 2018-09-20 16:21:03 UTC

After some initial debugging, it appears that heapster is not sending metrics for any project. The only metric data stored in Cassandra is for the _system tenant which does come from heapster. This explains why the graphs are empty. I increased logging for heapster. I am not seeing any errors. We need to figure out whether or not heapster is actually collecting pod level metrics from different projects. Based on the logs I suspect it is not collecting pod metrics.

Comment 9 Junqi Zhao 2018-09-21 05:57:04 UTC

Tested with 
openshift_crio_var_sock: "/var/run/crio/crio.sock"

The configmap qe-master in openshift-node still has the unix:// prefix:
*****************************************************
      container-runtime-endpoint:
      - unix:///var/run/crio/crio.sock

      image-service-endpoint:
      - unix:///var/run/crio/crio.sock
*****************************************************

and /etc/origin/node/node-config.yaml on all nodes also has the unix:// prefix:
*****************************************************
  container-runtime-endpoint:
  - unix:///var/run/crio/crio.sock

  image-service-endpoint:
  - unix:///var/run/crio/crio.sock
*****************************************************

After removing the unix:// prefix and restarting atomic-openshift-node.service on all nodes, we could see the data in the web UI.

Maybe other playbooks also need to be changed.
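
The manual edit described in this comment can be sketched in Python as a transformation on the two endpoint settings. This is a rough illustration, not the openshift-ansible change: it assumes the settings have already been loaded into a dict of lists, and the names `strip_unix_prefix`, `UNIX_PREFIX`, and `ENDPOINT_KEYS` are hypothetical.

```python
UNIX_PREFIX = "unix://"
# The two settings quoted above; any other keys are left untouched.
ENDPOINT_KEYS = ("container-runtime-endpoint", "image-service-endpoint")


def strip_unix_prefix(kubelet_args):
    """Return a copy with the unix:// scheme removed from the endpoint values."""
    cleaned = dict(kubelet_args)
    for key in ENDPOINT_KEYS:
        if key in cleaned:
            cleaned[key] = [
                value[len(UNIX_PREFIX):] if value.startswith(UNIX_PREFIX) else value
                for value in cleaned[key]
            ]
    return cleaned


args = {
    "container-runtime-endpoint": ["unix:///var/run/crio/crio.sock"],
    "image-service-endpoint": ["unix:///var/run/crio/crio.sock"],
}
print(strip_unix_prefix(args))
```

The input dict is not modified, so the original and cleaned values can be compared before anything is written back to a configmap or to node-config.yaml.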

Comment 10 Junqi Zhao 2018-09-21 06:20:24 UTC

Different nodes use different configmaps, so each node's configmap should also be modified:

# cat /etc/sysconfig/atomic-openshift-node  | grep BOOTSTRAP_CONFIG_NAME
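
For scripting the same lookup, a minimal Python sketch follows. It assumes the sysconfig file uses plain `KEY=value` lines. The function name `bootstrap_config_name` and the sample file contents are invented for illustration and are not taken from a real node.

```python
def bootstrap_config_name(sysconfig_text):
    """Return the BOOTSTRAP_CONFIG_NAME value from sysconfig-style text, or None."""
    for line in sysconfig_text.splitlines():
        line = line.strip()
        # Skip comments and lines that are not KEY=value assignments.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "BOOTSTRAP_CONFIG_NAME":
            # Drop optional surrounding quotes around the value.
            return value.strip().strip("\"'")
    return None


# Made-up file contents for illustration only.
sample = "OPTIONS=--loglevel=2\nBOOTSTRAP_CONFIG_NAME=qe-master\n"
print(bootstrap_config_name(sample))
```

Running this against each node's file gives the set of configmaps that need the endpoint change.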

Comment 11 Seth Jennings 2018-09-21 15:45:47 UTC

To change an existing cluster, change the config maps for the nodes i.e. BOOTSTRAP_CONFIG_NAME and restart the atomic-openshift-node service on each node.  If you don't want to restart the services, the sync DS pod should restart the kubelet when it eventually notices (3m max) that the config map has changed.

Comment 15 Junqi Zhao 2018-09-25 11:37:50 UTC

Tested again; the issue is actually fixed. See the attached picture: metrics data is shown in the web UI. I previously thought it was not fixed and needed the workaround from Comment 9 because of an error in our OCP environment build template.

Although the issue is fixed, a warning is still logged recommending the format "unix:///var/run/crio/crio.sock":
# master-logs etcd etcd
W0925 07:16:36.726552   19184 util_unix.go:75] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".

# openshift version
openshift v3.11.14

openshift-ansible-3.11.14-1

Please change the status to ON_QA so that we can close it.

Comment 16 Junqi Zhao 2018-09-25 11:38:42 UTC

Created attachment 1486724 [details]
metrics data is shown in web UI

Comment 17 Junqi Zhao 2018-09-25 11:40:49 UTC

Adding the CRI-O version:
cri-o://1.11.5

Comment 20 Junqi Zhao 2018-10-29 02:14:29 UTC

Only 3 namespaces (kube-system, openshift-node, openshift-sdn) show the network metrics graph; it is empty for all other namespaces.
See the attached pictures.

The workaround is to restart atomic-openshift-node on every node, after which the network metrics graph is shown. Maybe there should be a fix in openshift-ansible.

Comment 21 Junqi Zhao 2018-10-29 02:16:25 UTC

Created attachment 1498395 [details]
network metrics graph could be shown - kube-system

Comment 22 Junqi Zhao 2018-10-29 02:17:11 UTC

Created attachment 1498396 [details]
There is no network metrics graph - openshift-infra

Comment 24 Junqi Zhao 2018-10-29 02:33:35 UTC

Created attachment 1498397 [details]
network metrics graph could be shown after restarting atomic-openshift-node.service - openshift-infra

Comment 25 Seth Jennings 2018-10-31 16:54:41 UTC

per c#15

Comment 26 Seth Jennings 2018-10-31 16:59:53 UTC

Related fix:
https://github.com/openshift/origin/pull/21398

This bug is tracking the openshift-ansible fix so this is just fyi.

Comment 28 Junqi Zhao 2018-11-01 00:43:43 UTC

(In reply to Seth Jennings from comment #26)
> Related fix:
> https://github.com/openshift/origin/pull/21398

Does this PR fix the empty network graph issue?

Comment 29 Scott Dodson 2018-11-01 12:47:10 UTC

My understanding of the matter is that the installer, in order to stop emitting deprecation warnings, updated the default crio socket to be unix:///var/run/crio/crio.sock, and that broke metrics gathering. Seth's PRs to openshift-ansible reverted that change, so configs are now generated with a format that works for metrics gathering.

If they had previously generated their configmaps they will either need to patch them or re-create them in order for this change to be distributed via the sync pod and services restarted automatically.

Further, the PR referenced in comment 26 updates the kubelet to work with the new format. It has not merged, I have no idea when it will, so we have to rely on the installer fix for now to address this issue.

Comment 30 Junqi Zhao 2018-11-02 00:36:31 UTC

Will verify this bug after the PR referenced in comment 26 is merged.

Comment 31 Scott Dodson 2018-11-02 14:25:52 UTC

I think this bug should be tested as is. We have resolved the problem for new installs and upgrades. The pull request referenced in comment 26 is supplementary but not required.

Comment 32 Junqi Zhao 2018-11-06 08:50:52 UTC

The issue in Comment 20 is reported in Bug 1646886; closing this defect.