Bug 1994454
| Summary: | During an upgrade from 4.6 to 4.7 to 4.8 with the worker MCP "paused=true", CRI-O reports "panic: close of closed channel", which sends a master node into a restart loop | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | MinLi <minmli> |
| Component: | Node | Assignee: | Peter Hunt <pehunt> |
| Node sub component: | CRI-O | QA Contact: | MinLi <minmli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | aos-bugs, pehunt, rmanes, saniyer |
| Version: | 4.7 | ||
| Target Milestone: | --- | ||
| Target Release: | 4.9.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:46:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1994728 | ||
Description
MinLi
2021-08-17 10:21:27 UTC
crio log:
Aug 17 09:28:50 ip-10-0-139-77 crio[1298394]: time="2021-08-17 09:28:50.783221573Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:cecb7839b484569a1c2826af500ff1107f811725296c58d13e0f5df6a10e882b" id=863cd639-eff9-496e-9200-b90cae4a3db1 name=/runtime.v1alpha2.ImageService/ImageStatus
Aug 17 09:28:50 ip-10-0-139-77 crio[1298394]: time="2021-08-17 09:28:50.785152581Z" level=info msg="Image status: &{0xc0001e4a10 map[]}" id=863cd639-eff9-496e-9200-b90cae4a3db1 name=/runtime.v1alpha2.ImageService/ImageStatus
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: panic: close of closed channel
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: goroutine 1323 [running]:
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: panic(0x5573ffea0e20, 0x55740013b840)
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: /usr/lib/golang/src/runtime/panic.go:1065 +0x565 fp=0xc00063d5b0 sp=0xc00063d4e8 pc=0x5573fe02b4a5
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: runtime.closechan(0xc00185aba0)
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: /usr/lib/golang/src/runtime/chan.go:363 +0x3f5 fp=0xc00063d5f0 sp=0xc00063d5b0 pc=0x5573fdff97b5
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: github.com/cri-o/cri-o/internal/oci.(*runtimeOCI).ReopenContainerLog(0xc0005cbe30, 0x5574001b0240, 0xc000d1fad0, 0xc000dee000, 0x55740015e5f8, 0xc002024d50)
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/internal/oci/runtime_oci.go:1050 +0x7c6 fp=0xc00063d758 sp=0xc00063d5f0 pc=0x5573ff65e186
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: github.com/cri-o/cri-o/internal/oci.(*Runtime).ReopenContainerLog(0xc0006aff50, 0x5574001b0240, 0xc000d1fad0, 0xc000dee000, 0x0, 0x0)
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/internal/oci/oci.go:433 +0x93 fp=0xc00063d798 sp=0xc00063d758 pc=0x5573ff654113
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: github.com/cri-o/cri-o/server.(*Server).ReopenContainerLog(0xc0001b6000, 0x5574001b0240, 0xc000d1fad0, 0xc00063d848, 0x55740000e300, 0x5574000c9c80)
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/server/container_reopen_log.go:28 +0x14f fp=0xc00063d818 sp=0xc00063d798 pc=0x5573ff70b88f
Aug 17 09:28:51 ip-10-0-139-77 crio[1298394]: github.com/cri-o/cri-o/server/cri/v1alpha2.(*service).ReopenContainerLog(0xc0000134e0, 0x5574001b0240, 0xc000d1fad0, 0xc002024ca8, 0xc0000134e0, 0x1, 0x1)
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify.(*fdPoller).wait(0xc001073660, 0x0, 0x0, 0x0)
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify/inotify_poller.go:86 +0x93 fp=0xc0011cdc98 sp=0xc0011cdbd0 pc=0x559fb5efa893
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify.(*Watcher).readEvents(0xc0006b4780)
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify/inotify.go:192 +0x206 fp=0xc0011ddfd8 sp=0xc0011cdc98 pc=0x559fb5ef9aa6
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: runtime.goexit()
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: /usr/lib/golang/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc0011ddfe0 sp=0xc0011ddfd8 pc=0x559fb5709901
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: created by github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify.NewWatcher
Aug 17 09:39:44 ip-10-0-139-77 crio[1417705]: /builddir/build/BUILD/cri-o-3b87110e2208fd889e35829d92f4400742568fa6/_output/src/github.com/cri-o/cri-o/vendor/github.com/fsnotify/fsnotify/inotify.go:59 +0x1ab
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Main process exited, code=killed, status=6/ABRT
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Failed with result 'signal'.
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Consumed 4.162s CPU time
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Service RestartSec=100ms expired, scheduling restart.
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Scheduled restart job, restart counter is at 742.
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: Stopped Open Container Initiative Daemon.
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: crio.service: Consumed 4.162s CPU time
Aug 17 09:39:45 ip-10-0-139-77 systemd[1]: Starting Open Container Initiative Daemon...
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.465278484Z" level=info msg="Starting CRI-O, version: 1.21.2-9.rhaos4.8.git3b87110.el8, git: ()"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.465418754Z" level=info msg="Node configuration value for hugetlb cgroup is true"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.465429178Z" level=info msg="Node configuration value for pid cgroup is true"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.465449326Z" level=info msg="Node configuration value for memoryswap cgroup is true"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.473891730Z" level=info msg="Node configuration value for systemd CollectMode is true"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.481865984Z" level=info msg="Node configuration value for systemd AllowedCPUs is true"
Aug 17 09:39:45 ip-10-0-139-77 crio[1419537]: time="2021-08-17 09:39:45.492177440Z" level=info msg="Using default capabilities: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FSETID, CAP_FOWNER, CAP_SETGID, CAP_SETUID, CAP_SETPCAP, CAP_NET_BIND_SERVICE, CAP_KILL"
kubelet log:
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.229468 1307272 plugins.go:639] Loaded volume plugin "kubernetes.io/csi"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.229599 1307272 server.go:1190] "Started kubelet"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.230147 1307272 kubelet.go:1311] "Image garbage collection failed once. Stats initialization may not have completed yet" err="failed to get imageFs info: unable to find data in memory cache"
Aug 17 09:29:32 ip-10-0-139-77 systemd[1]: Started Kubernetes Kubelet.
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.232665 1307272 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.234984 1307272 server.go:405] "Adding debug handlers to kubelet server"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.241854 1307272 certificate_manager.go:282] Certificate rotation is enabled.
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.241889 1307272 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.241897 1307272 certificate_manager.go:556] Certificate expiration is 2021-09-16 02:38:43 +0000 UTC, rotation deadline is 2021-09-08 05:22:04.727863975 +0000 UTC
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.342568 1307272 kubelet_node_status.go:453] "Adding node label from cloud provider" labelKey="failure-domain.beta.kubernetes.io/region" labelValue="us-east-2"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.342579 1307272 kubelet_node_status.go:455] "Adding node label from cloud provider" labelKey="topology.kubernetes.io/region" labelValue="us-east-2"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.372827 1307272 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod548969ba_a0c1_4a2a_b164_66a148ceea54.slice/crio-2d3b24a34ee60d22699478bbe8cc3c89075fb7642f89b0f71d5a59180b7f81c5.scope: Error finding container 2d3b24a34ee60d22699478bbe8cc3c89075fb7642f89b0f71d5a59180b7f81c5: Status 404 returned error &{%!s(*http.body=&{0xc000c64048 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77baa0) %!s(func() error=0x77ba20)}
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.389743 1307272 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod3c996063_2362_45f6_a84f_e71dd5df4f9d.slice/crio-9ca73cbb686e2c9fd01ac8a779d8493d6ac8efeae2a818a616efc61fabca537f.scope: Error finding container 9ca73cbb686e2c9fd01ac8a779d8493d6ac8efeae2a818a616efc61fabca537f: Status 404 returned error &{%!s(*http.body=&{0xc000536dc8 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77baa0) %!s(func() error=0x77ba20)}
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.392662 1307272 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod376e132a_5317_453f_b851_05b1fa45e49b.slice/crio-092318ed8473a8596c035c2dbdc9ea7fec00437cc89b7c5024f183fa18eab39a.scope: Error finding container 092318ed8473a8596c035c2dbdc9ea7fec00437cc89b7c5024f183fa18eab39a: Status 404 returned error &{%!s(*http.body=&{0xc000eb45d0 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77baa0) %!s(func() error=0x77ba20)}
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.399610 1307272 manager.go:1127] Failed to create existing container: /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode34dd56c_05bb_41a5_9304_910162ffb342.slice/crio-3c7e04fdca0944613e666ba7d0c961b7b9f2ee889a8d2349230cdd3d10d53a00.scope: Error finding container 3c7e04fdca0944613e666ba7d0c961b7b9f2ee889a8d2349230cdd3d10d53a00: Status 404 returned error &{%!s(*http.body=&{0xc000c64558 <nil> <nil> false false {0 0} false false false <nil>}) {%!s(int32=0) %!s(uint32=0)} %!s(bool=false) <nil> %!s(func(error) error=0x77baa0) %!s(func() error=0x77ba20)}
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.413027 1307272 fsHandler.go:114] failed to collect filesystem stats - rootDiskErr: <nil>, extraDiskErr: could not stat "/var/log/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-139-77.us-east-2.compute.internal_376e132a-5317-453f-b851-05b1fa45e49b/kube-apiserver-insecure-readyz/1.log" to get inode usage: stat /var/log/pods/openshift-kube-apiserver_kube-apiserver-ip-10-0-139-77.us-east-2.compute.internal_376e132a-5317-453f-b851-05b1fa45e49b/kube-apiserver-insecure-readyz/1.log: no such file or directory
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.415080 1307272 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.415105 1307272 status_manager.go:157] "Starting to sync pod status with apiserver"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.415124 1307272 kubelet.go:1858] "Starting kubelet main sync loop"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: E0817 09:29:32.415161 1307272 kubelet.go:1882] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Aug 17 09:29:32 ip-10-0-139-77 hyperkube[1307272]: I0817 09:29:32.415497 1307272 reflector.go:219] Starting reflector *v1.RuntimeClass (0s) from k8s.io/client-go/informers/factory.go:134
The must-gather is available at: http://file.nay.redhat.com/~minmli/eus-upgrade-must-gather.tar.gz

Fixed in the attached PR; will be backported to 4.9-4.6.

PR merged.

Verified: upgraded from 4.8.8 to 4.9.0-0.nightly-2021-08-29-010334 with mcp worker "paused=true"; the issue does not reproduce. No matching error logs were found in 4.9 jobs via https://search.ci.openshift.org/?search=close+of+closed+channel&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

$ oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-08-29-010334 --force=true --allow-explicit-upgrade=true

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.8     True        True          44m     Working towards 4.9.0-0.nightly-2021-08-29-010334: 584 of 732 done (79% complete), waiting on network

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-08-29-010334   True        False         56m     Cluster version is 4.9.0-0.nightly-2021-08-29-010334

*** Bug 2013487 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759