Bug 1502924
| Summary: | [free-int] Metrics pods are not in running status - Corrupt layer on aws-reg | | |
|---|---|---|---|
| Product: | OpenShift Online | Reporter: | Junqi Zhao <juzhao> |
| Component: | Unknown | Assignee: | Abhishek Gupta <abhgupta> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.x | CC: | aos-bugs, jokerman, jupierce, mmccomas, twiest |
| Target Milestone: | --- | Keywords: | OnlineStarter, TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-09 18:47:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Junqi Zhao
2017-10-17 04:12:07 UTC
heapster-v0421 is failing with:

```
Failed to pull image "registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0": rpc error: code = 2 desc = unknown blob
```

The rest are failing with:

```
invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:327: setting cgroup config for procHooks process caused \\\"failed to write 372500 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod2cd335f8_b26d_11e7_bbc3_0ac586c2eb16.slice/docker-2f970e8ebd2c125fe5b438a0237d8f28d3cfb422dd1cc321856c9702fb4bbc5a.scope/cpu.cfs_quota_us: invalid argument\\\"\"\n"
```

All the pods are stacked up on one infra node:

```
$ oc get pod -o wide
NAME                         READY     STATUS             RESTARTS   AGE       IP             NODE
hawkular-cassandra-1-24qck   0/1       CrashLoopBackOff   125        16h       10.129.0.208   ip-172-31-61-50.ec2.internal
hawkular-cassandra-2-xt6lg   0/1       CrashLoopBackOff   125        16h       10.129.0.207   ip-172-31-61-50.ec2.internal
hawkular-metrics-k48g8       0/1       CrashLoopBackOff   189        16h       10.129.0.205   ip-172-31-61-50.ec2.internal
heapster-v0421               0/1       ImagePullBackOff   0          16h       10.129.0.211   ip-172-31-61-50.ec2.internal
```

I don't have the privilege to delete pods in this namespace, so I can't check whether they come up once their pod sandboxes are recreated.

Luckily, I had a good copy of this image on online-int. I pushed that image out, and now I can pull it from any host I try. That seems to have fixed the issue.

```
[root@free-stg-master-03fb6 ~]# docker pull registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
Trying to pull repository registry.reg-aws.openshift.com:443/openshift3/metrics-heapster ...
sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f: Pulling from registry.reg-aws.openshift.com:443/openshift3/metrics-heapster
4b8ec2c40f02: Pull complete
9a825c117595: Pull complete
5321a89381ba: Pull complete
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Downloaded newer image for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
```

hawkular-cassandra-1 and heapster are in CrashLoopBackOff:

```
$ oc get pod -n openshift-infra
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-mp9h1   0/1       CrashLoopBackOff   186        2d
hawkular-cassandra-2-shlq1   1/1       Running            1          2d
hawkular-metrics-7bm1j       1/1       Running            0          2d
heapster-5126g               0/1       CrashLoopBackOff   456        2d
```

Env:
OpenShift Master: v3.7.0-0.147.0 (online version 3.6.0.38)
Kubernetes Master: v1.7.6+a08f5eeb62

I'm not seeing an image-pull issue anymore; I think that part has been resolved. I was able to ssh to the node where hawkular-cassandra-1-mp9h1 was experiencing the CrashLoopBackOff, and I pulled the image successfully.

```
[root@free-int-node-infra-70a2b ~]# docker pull registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
Trying to pull repository registry.reg-aws.openshift.com:443/openshift3/metrics-heapster ...
sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f: Pulling from registry.reg-aws.openshift.com:443/openshift3/metrics-heapster
Digest: sha256:0ffa9b57b60f93357d1de1f3d19d20f130db5f5df35ed9af2c45a74be571389f
Status: Image is up to date for registry.reg-aws.openshift.com:443/openshift3/metrics-heapster:v3.7.0-0.147.0
```

The pod is crashing because of this error:

```
[root@free-int-node-infra-70a2b ~]# docker logs 270a2ed773d6
container_linux.go:247: starting container process caused "process_linux.go:327: setting cgroup config for procHooks process caused \"failed to write 698400 to cpu.cfs_quota_us: write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6c5a4841_b33f_11e7_bbc3_0ac586c2eb16.slice/docker-270a2ed773d67a1c5eae08c685ced636b740023887ef0d5adac336b415d8263b.scope/cpu.cfs_quota_us: invalid argument\""
```

I deleted the pod hawkular-cassandra-1-mp9h1, and a replacement pod came up successfully. So the remaining issue is cgroups, which appears to be a system configuration issue, since the pod is able to run on some nodes but not others.

```
[root@free-int-master-3c664 ~]# oc get pods -n openshift-infra
NAME                         READY     STATUS             RESTARTS   AGE
analytics-1-3fx3c            1/1       Running            0          1d
analytics-1-build            0/1       Completed          0          1d
hawkular-cassandra-1-c27cx   1/1       Running            0          6m
hawkular-cassandra-2-shlq1   1/1       Running            2          3d
hawkular-metrics-7bm1j       1/1       Running            0          3d
heapster-5126g               0/1       CrashLoopBackOff   595        3d
hibernation-1-2g725          1/1       Running            0          1d
hibernation-1-build          0/1       Completed          0          1d
```

The state of heapster points to another potential system issue:

```
[root@free-int-master-3c664 ~]# oc logs heapster-5126g -n openshift-infra
failed to open log file "/var/log/pods/6c5a4841-b33f-11e7-bbc3-0ac586c2eb16/heapster_595.log": open /var/log/pods/6c5a4841-b33f-11e7-bbc3-0ac586c2eb16/heapster_595.log: no such file or directory
```

I'm going to be on PTO next week, so I'm handing this back to Abhishek for delegation.
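For context on where values like 372500 and 698400 come from: the runtime is writing the pod's CPU limit into the cgroup as a CFS quota, where the quota is the fraction of each CFS period (100000 microseconds by default) the container may use. The sketch below is a minimal illustration of that conversion, not the kubelet's actual code; the millicore values shown are simply the ones that reproduce the numbers in these logs.

```python
# Sketch: how a Kubernetes CPU limit in millicores maps to a cpu.cfs_quota_us
# value. This mirrors the idea behind the kubelet's conversion but is not
# copied from it. The default CFS period is 100000 microseconds (100 ms).

DEFAULT_CFS_PERIOD_US = 100000

def milli_cpu_to_quota(milli_cpu: int, period_us: int = DEFAULT_CFS_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to a CFS quota in microseconds.

    quota / period equals milli_cpu / 1000, i.e. the allowed share of CPU time.
    """
    return milli_cpu * period_us // 1000

# The values that the runtime failed to write in this bug correspond to
# limits of 3725m and 6984m under the default period:
print(milli_cpu_to_quota(3725))  # 372500
print(milli_cpu_to_quota(6984))  # 698400
```

Both values are well-formed quotas, which is consistent with the comment above: the write failing with `invalid argument` on only some nodes points at those nodes' kernel/cgroup configuration rather than at the pod spec.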
Moving back to ON_QA, since the image corruption can be verified and this specific issue closed. For the current issue affecting metrics, I think we are seeing https://bugzilla.redhat.com/show_bug.cgi?id=1501550.

Metrics pods are running, and the metrics diagram can be shown on the web console. The result of `oc get pod -n openshift-infra` is as below:

```
NAME                         READY     STATUS    RESTARTS   AGE
hawkular-cassandra-1-pkm7d   1/1       Running   1          3h
hawkular-cassandra-2-hxhsj   1/1       Running   4          3h
hawkular-metrics-5x4b2       1/1       Running   0          3h
heapster-f7tzq               1/1       Running   0          3h
```

Env:
OpenShift Master: v3.7.0-0.176.0 (online version 3.6.0.45)
Kubernetes Master: v1.7.6+a08f5eeb62
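On verifying image corruption like the original "unknown blob" failure: one way to check a registry copy without pulling the whole image is the Docker Registry HTTP API v2, which exposes each layer at `/v2/<repository>/blobs/<digest>` where a HEAD request should return 200. The sketch below assumes anonymous access and takes an already-fetched manifest as input; the registry host and repository in the usage comment are just the ones from this bug, and any auth token handling is omitted.

```python
# Sketch: verify that every layer blob referenced by an image manifest is
# servable by the registry, using the Docker Registry HTTP API v2.
# Assumes anonymous (or pre-authenticated) access; auth is not handled here.
import urllib.error
import urllib.request

def blob_urls(registry: str, repository: str, manifest: dict) -> list:
    """Build the v2 blob URL for every layer referenced by a manifest."""
    digests = [layer["digest"] for layer in manifest.get("layers", [])]
    return [f"https://{registry}/v2/{repository}/blobs/{d}" for d in digests]

def check_blobs(registry: str, repository: str, manifest: dict) -> dict:
    """HEAD each layer blob; anything other than 200 suggests a missing or
    corrupt layer, like the 'unknown blob' pull failure in this bug."""
    results = {}
    for url in blob_urls(registry, repository, manifest):
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req) as resp:
                results[url] = resp.status
        except urllib.error.HTTPError as err:
            results[url] = err.code
    return results

# Example shape (not executed here):
# check_blobs("registry.reg-aws.openshift.com:443",
#             "openshift3/metrics-heapster", manifest_dict)
```

Comparing the per-blob results from two registries (for example, the aws-reg copy against the known-good online-int copy that was re-pushed above) would pinpoint exactly which layer was corrupt.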