Bug 1502924

Summary: [free-int] Metrics pods are not in running status - Corrupt layer on aws-reg
Product: OpenShift Online
Reporter: Junqi Zhao <juzhao>
Component: Unknown
Assignee: Abhishek Gupta <abhgupta>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 3.x
CC: aos-bugs, jokerman, jupierce, mmccomas, twiest
Target Milestone: ---
Keywords: OnlineStarter, TestBlocker
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-09 18:47:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Junqi Zhao 2017-10-17 04:12:07 UTC
Description of problem:
On the free-int cluster, the metrics pods are not in Running status.
Output of "oc get pod -n openshift-infra":

NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-24qck   0/1       CrashLoopBackOff   63         11h
hawkular-cassandra-2-xt6lg   0/1       CrashLoopBackOff   63         11h
hawkular-metrics-k48g8       0/1       CrashLoopBackOff   128        11h
heapster-v0421               0/1       ImagePullBackOff   0          11h
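A check like this can be scripted rather than read by eye. The following is a minimal sketch, assuming the default column layout of "oc get pod" shown above; unhealthy_pods and SAMPLE are hypothetical names used only for illustration, not part of any OpenShift tooling.

```python
def unhealthy_pods(table):
    """Return (name, status) for every pod that is neither Completed
    nor Running with all of its containers ready."""
    rows = [ln.split() for ln in table.splitlines() if ln.strip()]
    header, body = rows[0], rows[1:]
    name_i, ready_i, status_i = (
        header.index(col) for col in ("NAME", "READY", "STATUS")
    )
    bad = []
    for cols in body:
        status = cols[status_i]
        ready, total = cols[ready_i].split("/")
        healthy = status == "Completed" or (status == "Running" and ready == total)
        if not healthy:
            bad.append((cols[name_i], status))
    return bad


# Table copied from the description above.
SAMPLE = """\
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-24qck   0/1       CrashLoopBackOff   63         11h
hawkular-cassandra-2-xt6lg   0/1       CrashLoopBackOff   63         11h
hawkular-metrics-k48g8       0/1       CrashLoopBackOff   128        11h
heapster-v0421               0/1       ImagePullBackOff   0          11h
"""

for name, status in unhealthy_pods(SAMPLE):
    print(name, status)
```

The parser splits on whitespace, so it only holds for column values without embedded spaces, as in the output here.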



Version-Release number of selected component (if applicable):
OpenShift Master:v3.7.0-0.147.0 (online version 3.6.0.38)
Kubernetes Master:v1.7.6+a08f5eeb62 

How reproducible:
Always

Steps to Reproduce:
1. Check metrics pods' status by: oc get pod -n openshift-infra

Actual results:
metrics pods are not in running status

Expected results:
metrics pods should be healthy

Additional info:

Comment 1 Seth Jennings 2017-10-17 05:04:30 UTC
heapster-v0421 is failing with:

Failed to pull image "registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0": rpc error: code = 2 desc = unknown blob

The rest are failing with:

invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:327: setting cgroup config for procHooks process caused \\\"failed to write 372500 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2cd335f8_b26d_11e7_bbc3_0ac586c2eb16.slice/docker-2f970e8ebd2c125fe5b438a0237d8f28d3cfb422dd1cc321856c9702fb4bbc5a.scope/cpu.cfs_quota_us: invalid argument\\\"\"\n"
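The number in this error is a CFS quota in microseconds. As a rough sketch, it can be translated back into the CPU limit it would represent, assuming the kubelet's default CFS period of 100000 microseconds (the period is not shown in this report, so treat it as an assumption). The helper names below are hypothetical.

```python
import re

# Assumption: default CFS period of 100 ms; not confirmed by this report.
CFS_PERIOD_US = 100_000


def quota_from_error(message):
    """Extract the quota value the runtime tried to write, or None."""
    match = re.search(r"failed to write (\d+) to cpu\.cfs_quota_us", message)
    return int(match.group(1)) if match else None


def quota_to_millicores(quota_us, period_us=CFS_PERIOD_US):
    """Convert a CFS quota (microseconds per period) into millicores."""
    return quota_us * 1000 / period_us


# Shortened copy of the error text above.
ERROR = "failed to write 372500 to cpu.cfs_quota_us: write /sys/fs/cgroup/...: invalid argument"

quota = quota_from_error(ERROR)
print(quota, quota_to_millicores(quota))  # 372500 3725.0
```

Under that assumption, 372500 corresponds to a limit of 3725m, roughly 3.7 cores. This only decodes the value; it does not explain why the kernel rejected the write.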

All the pods are stacked up on one infra node:
$ oc get pod -o wide
NAME                         READY     STATUS             RESTARTS   AGE       IP             NODE
hawkular-cassandra-1-24qck   0/1       CrashLoopBackOff   125        16h       10.129.0.208   ip-172-31-61-50.ec2.internal
hawkular-cassandra-2-xt6lg   0/1       CrashLoopBackOff   125        16h       10.129.0.207   ip-172-31-61-50.ec2.internal
hawkular-metrics-k48g8       0/1       CrashLoopBackOff   189        16h       10.129.0.205   ip-172-31-61-50.ec2.internal
heapster-v0421               0/1       ImagePullBackOff   0          16h       10.129.0.211   ip-172-31-61-50.ec2.internal
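To confirm at a glance how pods are distributed, the NODE column can be tallied. This is a minimal sketch, assuming the "oc get pod -o wide" layout shown above; pods_per_node and WIDE are hypothetical names.

```python
from collections import Counter


def pods_per_node(table):
    """Count pods per node from 'oc get pod -o wide' output."""
    rows = [ln.split() for ln in table.splitlines() if ln.strip()]
    node_i = rows[0].index("NODE")
    return Counter(cols[node_i] for cols in rows[1:])


# Table copied from the output above.
WIDE = """\
NAME                         READY     STATUS             RESTARTS   AGE       IP             NODE
hawkular-cassandra-1-24qck   0/1       CrashLoopBackOff   125        16h       10.129.0.208   ip-172-31-61-50.ec2.internal
hawkular-cassandra-2-xt6lg   0/1       CrashLoopBackOff   125        16h       10.129.0.207   ip-172-31-61-50.ec2.internal
hawkular-metrics-k48g8       0/1       CrashLoopBackOff   189        16h       10.129.0.205   ip-172-31-61-50.ec2.internal
heapster-v0421               0/1       ImagePullBackOff   0          16h       10.129.0.211   ip-172-31-61-50.ec2.internal
"""

for node, count in pods_per_node(WIDE).items():
    print(node, count)
```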

I don't have the privileges to delete pods in this namespace, so I can't check whether they come up once their pod sandboxes are recreated.

Comment 5 Stefanie Forrester 2017-10-18 16:57:23 UTC
Luckily I had a good copy of this image on online-int. I pushed that image out and now I can pull it from any host I try. That seems to have fixed the issue. 

[root@free-stg-master-03fb6 ~]# docker pull registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
Trying to pull repository registry.reg-aws.openshift.com:443/openshift3/metrics-heapster ... 
sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f: Pulling from registry.reg-aws.openshift.com:443/openshift3/metrics-heapster

4b8ec2c40f02: Pull complete 
9a825c117595: Pull complete 
5321a89381ba: Pull complete 
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Downloaded newer image for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0

Comment 6 Junqi Zhao 2017-10-20 07:40:35 UTC
hawkular-cassandra-1 and heapster are in CrashLoopBackOff:
oc get pod -n openshift-infra
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-mp9h1   0/1       CrashLoopBackOff   186        2d
hawkular-cassandra-2-shlq1   1/1       Running            1          2d
hawkular-metrics-7bm1j       1/1       Running            0          2d
heapster-5126g               0/1       CrashLoopBackOff   456        2d

env:
OpenShift Master:v3.7.0-0.147.0 (online version 3.6.0.38)
Kubernetes Master:v1.7.6+a08f5eeb62

Comment 7 Stefanie Forrester 2017-10-20 19:28:50 UTC
I'm not seeing an image-pull issue anymore. I think that part has been resolved.

I was able to ssh to the node where hawkular-cassandra-1-mp9h1 was experiencing the CrashLoopBackOff, and I pulled the image successfully.

[root@free-int-node-infra-70a2b ~]# docker pull registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
Trying to pull repository registry.reg-aws.openshift.com:443/openshift3/metrics-heapster ...
sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f: Pulling from registry.reg-aws.openshift.com:443/openshift3/metrics-heapster
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Image is up to date for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
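One way to check that two hosts resolved the tag to the same image is to compare the Digest lines of their pull output. The sketch below assumes the "Digest: sha256:..." line format that docker pull prints in this thread; pulled_digest and the two sample strings are hypothetical names, with the text abbreviated from the outputs in comments 5 and 7.

```python
import re

DIGEST_RE = re.compile(r"^Digest:\s*(sha256:[0-9a-f]{64})\s*$", re.MULTILINE)


def pulled_digest(output):
    """Return the image digest reported by 'docker pull', or None."""
    match = DIGEST_RE.search(output)
    return match.group(1) if match else None


# Abbreviated from the pull on free-stg-master-03fb6 (comment 5).
PULL_STG = """\
5321a89381ba: Pull complete
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Downloaded newer image for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
"""

# Abbreviated from the pull on free-int-node-infra-70a2b (comment 7).
PULL_INT = """\
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Image is up to date for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
"""

print(pulled_digest(PULL_STG) == pulled_digest(PULL_INT))  # True
```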

The pod is crashing because of this error:

[root@free-int-node-infra-70a2b ~]# docker logs 270a2ed773d6

container_linux.go:247: starting container process caused "process_linux.go:327: setting cgroup config for procHooks process caused \"failed to write 698400 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6c5a4841_b33f_11e7_bbc3_0ac586c2eb16.slice/docker-270a2ed773d67a1c5eae08c685ced636b740023887ef0d5adac336b415d8263b.scope/cpu.cfs_quota_us: invalid argument\""

I deleted the pod hawkular-cassandra-1-mp9h1, and a replacement pod came up successfully.

So the remaining issue is cgroups, which appears to be a system configuration issue, since the pod is able to run on some nodes but not others.

[root@free-int-master-3c664 ~]# oc get pods -n openshift-infra
NAME                         READY     STATUS             RESTARTS   AGE
analytics-1-3fx3c            1/1       Running            0          1d
analytics-1-build            0/1       Completed          0          1d
hawkular-cassandra-1-c27cx   1/1       Running            0          6m
hawkular-cassandra-2-shlq1   1/1       Running            2          3d
hawkular-metrics-7bm1j       1/1       Running            0          3d
heapster-5126g               0/1       CrashLoopBackOff   595        3d
hibernation-1-2g725          1/1       Running            0          1d
hibernation-1-build          0/1       Completed          0          1d

The state of heapster points to another potential system issue:

[root@free-int-master-3c664 ~]# oc logs heapster-5126g -n openshift-infra
failed to open log file "/var/log/pods/6c5a4841-b33f-11e7-bbc3-0ac586c2eb16/heapster_595.log": open /var/log/pods/6c5a4841-b33f-11e7-bbc3-0ac586c2eb16/heapster_595.log: no such file or directory
[root@free-int-master-3c664 ~]#
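The cgroup path in the docker logs error and the log path in this message can be tied to the same pod by comparing pod UIDs. With the systemd cgroup driver, the dashes in the UID appear as underscores in the slice name. The following sketch does the comparison; the function names are hypothetical and the regular expressions assume the path layouts seen in this comment.

```python
import re


def uid_from_cgroup_path(path):
    """Pod UID from a kubepods systemd slice path, or None."""
    match = re.search(r"-pod([0-9a-f_]+)\.slice", path)
    return match.group(1).replace("_", "-") if match else None


def uid_from_log_path(path):
    """Pod UID from a /var/log/pods/<uid>/... path, or None."""
    match = re.search(r"/var/log/pods/([0-9a-f-]+)/", path)
    return match.group(1) if match else None


# Paths copied from the two errors in this comment.
CGROUP_PATH = (
    "/sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pod6c5a4841_b33f_11e7_bbc3_0ac586c2eb16.slice/"
    "docker-270a2ed773d67a1c5eae08c685ced636b740023887ef0d5adac336b415d8263b.scope/"
    "cpu.cfs_quota_us"
)
LOG_PATH = "/var/log/pods/6c5a4841-b33f-11e7-bbc3-0ac586c2eb16/heapster_595.log"

print(uid_from_cgroup_path(CGROUP_PATH) == uid_from_log_path(LOG_PATH))  # True
```

Both paths carry the UID 6c5a4841-b33f-11e7-bbc3-0ac586c2eb16, and the log file name heapster_595.log lines up with the 595 restarts listed for heapster-5126g.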

I'm going to be on PTO next week, so I'm handing this back to Abhishek for delegation.

Comment 8 Justin Pierce 2017-10-23 18:48:24 UTC
Moving back to ON_QA, since the image corruption fix can be verified and this specific issue closed.

For the current issue affecting metrics, I think we are seeing: https://bugzilla.redhat.com/show_bug.cgi?id=1501550

Comment 9 Junqi Zhao 2017-10-24 00:48:15 UTC
The metrics pods are running and the metrics charts are displayed in the web console.

Output of "oc get pod -n openshift-infra":

NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-pkm7d   1/1       Running   1          3h
hawkular-cassandra-2-hxhsj   1/1       Running   4          3h
hawkular-metrics-5x4b2       1/1       Running   0          3h
heapster-f7tzq               1/1       Running   0          3h

env:
OpenShift Master: v3.7.0-0.176.0 (online version 3.6.0.45)
Kubernetes Master: v1.7.6+a08f5eeb62