Bug 1822872
| Summary: | cgroup sandbox has not been cleaned up | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lukasz Szaszkiewicz <lszaszki> |
| Component: | Node | Assignee: | Ted Yu <zyu> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Qin Ping <piqin> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | aojeagar, aos-bugs, bbennett, jokerman, jsafrane, kir, rphillips, zyu |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-18 07:58:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Lukasz Szaszkiewicz
2020-04-10 08:43:54 UTC
I'm not sure if this might be related, but I noticed weird kubelet behavior during the same test run - https://bugzilla.redhat.com/show_bug.cgi?id=1822865

(In reply to Lukasz Szaszkiewicz from comment #1)
> I'm not sure if this might be related but I noticed weird kubelet behavior
> during the same test run -
> https://bugzilla.redhat.com/show_bug.cgi?id=1822865

Might be related. I noticed the NFS provisioner cleanup failed at Mar 25 03:13:49.888, which is suspiciously close to the other bug. It might also be related to this kernel error:

    Mar 25 03:09:38.212093 ip-10-0-143-25 kernel: ------------[ cut here ]------------
    Mar 25 03:09:38.212152 ip-10-0-143-25 kernel: WARNING: CPU: 3 PID: 36351 at fs/dcache.c:1597 umount_check.cold+0x2e/0x3a
    Mar 25 03:09:38.212205 ip-10-0-143-25 kernel: Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache nfsd auth_rpcgss nfs_acl lockd grace sunrpc ipt_REJECT nf_reject_ipv4 veth xt_nat geneve ip6_udp_tunnel udp_tunnel openvswitch nsh nf_conncount iptable_mangle xt_comment xt_mark xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill overlay vfat fat intel_rapl_msr intel_rapl_common sb_edac snd_pcsp snd_pcm cirrus snd_timer drm_kms_helper snd intel_rapl_perf soundcore drm i2c_piix4 ip_tables xfs crct10dif_pclmul crc32_pclmul xen_blkfront ghash_clmulni_intel serio_raw ixgbevf ata_generic pata_acpi btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c crc32c_intel zstd_decompress fuse
    Mar 25 03:09:38.212265 ip-10-0-143-25 kernel: CPU: 3 PID: 36351 Comm: umount Tainted: G W 5.5.9-200.fc31.x86_64 #1
    Mar 25 03:09:38.212317 ip-10-0-143-25 kernel: Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
    Mar 25 03:09:38.212366 ip-10-0-143-25 kernel: RIP: 0010:umount_check.cold+0x2e/0x3a
    Mar 25 03:09:38.212415 ip-10-0-143-25 kernel: Code: 48 85 c0 74 2d 48 8b 50 40 48 8b 46 68 48 c7 c7 28 b1 39 9c 48 8b 48 28 48 05 e8 03 00 00 50 4c 8b 09 48 89 f1 e8 a5 87 e1 ff <0f> 0b 58 e9 67 c9 ff ff 31 d2 eb d3 49 8b 44 24 28 48 8b 75 28 48
    Mar 25 03:09:38.212465 ip-10-0-143-25 kernel: RSP: 0018:ffff9c6080f3bdb8 EFLAGS: 00010246
    Mar 25 03:09:38.212541 ip-10-0-143-25 kernel: RAX: 000000000000004f RBX: ffff908c5a2bd840 RCX: 0000000000000000
    Mar 25 03:09:38.212599 ip-10-0-143-25 kernel: RDX: 0000000000000000 RSI: ffff908d7fad9cc8 RDI: ffff908d7fad9cc8
    Mar 25 03:09:38.212644 ip-10-0-143-25 kernel: RBP: ffff908c5a149718 R08: 0000000000000632 R09: 0000000000000000
    Mar 25 03:09:38.212703 ip-10-0-143-25 kernel: R10: 0000000000000002 R11: ffffffff9c0eaed0 R12: ffff908c5a1496c0
    Mar 25 03:09:38.212757 ip-10-0-143-25 kernel: R13: ffff908d72474150 R14: ffff908c5a149750 R15: ffff908c55097180
    Mar 25 03:09:38.212809 ip-10-0-143-25 kernel: FS: 00007f4fbe0a8880(0000) GS:ffff908d7fac0000(0000) knlGS:0000000000000000
    Mar 25 03:09:38.212853 ip-10-0-143-25 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 25 03:09:38.212903 ip-10-0-143-25 kernel: CR2: 0000560f2759c700 CR3: 00000002d4e22004 CR4: 00000000001606e0
    Mar 25 03:09:38.212946 ip-10-0-143-25 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    Mar 25 03:09:38.212987 ip-10-0-143-25 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Mar 25 03:09:38.213029 ip-10-0-143-25 kernel: Call Trace:
    Mar 25 03:09:38.213073 ip-10-0-143-25 kernel: d_walk+0xd7/0x2a0
    Mar 25 03:09:38.213134 ip-10-0-143-25 kernel: ? select_collect+0x80/0x80
    Mar 25 03:09:38.213180 ip-10-0-143-25 kernel: do_one_tree+0x20/0x40
    Mar 25 03:09:38.213223 ip-10-0-143-25 kernel: shrink_dcache_for_umount+0x28/0x80
    Mar 25 03:09:38.213268 ip-10-0-143-25 kernel: generic_shutdown_super+0x1a/0x100
    Mar 25 03:09:38.213317 ip-10-0-143-25 kernel: kill_anon_super+0x14/0x30
    Mar 25 03:09:38.213367 ip-10-0-143-25 kernel: nfsd_umount+0x12/0x40 [nfsd]
    Mar 25 03:09:38.213419 ip-10-0-143-25 kernel: deactivate_locked_super+0x36/0x70
    Mar 25 03:09:38.213463 ip-10-0-143-25 kernel: cleanup_mnt+0x104/0x160
    Mar 25 03:09:38.213529 ip-10-0-143-25 kernel: task_work_run+0x8a/0xb0
    Mar 25 03:09:38.213586 ip-10-0-143-25 kernel: exit_to_usermode_loop+0x102/0x130
    Mar 25 03:09:38.213631 ip-10-0-143-25 kernel: do_syscall_64+0x1a6/0x1c0
    Mar 25 03:09:38.213679 ip-10-0-143-25 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Mar 25 03:09:38.213723 ip-10-0-143-25 kernel: RIP: 0033:0x7f4fbd316c47
    Mar 25 03:09:38.213775 ip-10-0-143-25 kernel: Code: 72 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 72 2c 00 f7 d8 64 89 01 48
    Mar 25 03:09:38.213825 ip-10-0-143-25 kernel: RSP: 002b:00007fff5f754918 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    Mar 25 03:09:38.213868 ip-10-0-143-25 kernel: RAX: 0000000000000000 RBX: 0000563b7894e060 RCX: 00007f4fbd316c47
    Mar 25 03:09:38.213917 ip-10-0-143-25 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000563b78953040
    Mar 25 03:09:38.213966 ip-10-0-143-25 kernel: RBP: 0000563b78953040 R08: 0000563b78952fb0 R09: 0000000000000000
    Mar 25 03:09:38.214010 ip-10-0-143-25 kernel: R10: 00007fff5f7543a0 R11: 0000000000000246 R12: 00007f4fbde91d58
    Mar 25 03:09:38.214059 ip-10-0-143-25 kernel: R13: 0000000000000000 R14: 0000563b7894e160 R15: 0000563b7894e060
    Mar 25 03:09:38.214103 ip-10-0-143-25 kernel: ---[ end trace 78afcac9a5dcf0d0 ]---
    Mar 25 03:09:38.214157 ip-10-0-143-25 kernel: BUG: Dentry 0000000084b00a22{i=27,n=6} still in use (1) [unmount of nfsd nfsd]
    Mar 25 03:09:38.214208 ip-10-0-143-25 kernel: ------------[ cut here ]------------

It started at 03:09:34.731093, failing every now and then until 03:13:49.588848 - again, very close to the weird kubelet behavior.

For a started container, the log went like this:
```
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + Name: "webserver",
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + State: v1.ContainerState{Running: s"&ContainerStateRunning{StartedAt:2020-03-25 03:41:47 +0000 UTC,}"},
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + Ready: true,
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + Image: "docker.io/library/httpd:2.4.39-alpine",
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + ImageID: "docker.io/library/httpd@sha256:addd70e4ee83f3bc9a4c1c7c41e37927ba47faf639312fc936df3afad7926f5a",
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + ContainerID: "cri-o://265c02e29e5ca5457ef21ac7ef0635cd86ecdb7a01cd09482fd1593ceeaadb12",
Mar 25 03:55:54.716171 ip-10-0-146-242 hyperkube[1344]: + Started: &true,
Mar 25 03:55:54.725124 ip-10-0-146-242 hyperkube[1344]: I0325 03:55:54.724491 1344 status_manager.go:569] Patch status for pod "ss2-0_e2e-statefulset-6797(2e26ea01-d5cc-4401-b92e-4bbf9b4c33e8)" with "{\"metadata\":{\"uid\":\"2e26ea01-d5cc-4401-b92e-4bbf9b4c33e8\"},\"status\":{\"containerStatuses\":[{\"containerID\":\"cri-o://265c02e29e5ca5457ef21ac7ef0635cd86ecdb7a01cd09482fd1593ceeaadb12\",\"image\":\"docker.io/library/httpd:2.4.39-alpine\",\"imageID\":\"docker.io/library/httpd@sha256:addd70e4ee83f3bc9a4c1c7c41e37927ba47faf639312fc936df3afad7926f5a\",\"lastState\":{},\"name\":\"webserver\",\"ready\":true,\"restartCount\":0,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-03-25T03:41:47Z\"}}}]}}"
Mar 25 03:55:54.725124 ip-10-0-146-242 hyperkube[1344]: I0325 03:55:54.725006 1344 status_manager.go:577] Status for pod "ss2-0_e2e-statefulset-6797(2e26ea01-d5cc-4401-b92e-4bbf9b4c33e8)" updated successfully: (4, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-03-25 03:41:44 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-03-25 03:41:48 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-03-25 03:41:48 +0000 UTC Reason: Message:} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-03-25 03:41:44 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.146.242 PodIP:10.131.0.19 PodIPs:[{IP:10.131.0.19}] StartTime:2020-03-25 03:41:44 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:webserver State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2020-03-25 03:41:47 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:docker.io/library/httpd:2.4.39-alpine ImageID:docker.io/library/httpd@sha256:addd70e4ee83f3bc9a4c1c7c41e37927ba47faf639312fc936df3afad7926f5a ContainerID:cri-o://265c02e29e5ca5457ef21ac7ef0635cd86ecdb7a01cd09482fd1593ceeaadb12 Started:0xc0006a5cd0}] QOSClass:BestEffort EphemeralContainerStatuses:[]})
```
Note the 'updated successfully' message.

It seems that around Apr 10 the CRI-O log level hadn't been bumped yet.

We would need more detailed logs to determine why some containers got stuck in ContainerStateWaiting.
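As an aside, the per-container state in the patched status above is what separates a healthy container from one stuck in Waiting. Below is a minimal sketch of scanning a PodStatus for such containers, assuming only the standard k8s.io/api/core/v1 types; it is not kubelet code, just an illustration of the fields involved.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// stuckInWaiting returns the names of containers that are still in
// ContainerStateWaiting, together with the reason reported by the kubelet
// (e.g. ContainerCreating, CrashLoopBackOff).
func stuckInWaiting(status v1.PodStatus) map[string]string {
	waiting := map[string]string{}
	for _, cs := range status.ContainerStatuses {
		if cs.State.Waiting != nil {
			waiting[cs.Name] = cs.State.Waiting.Reason
		}
	}
	return waiting
}

func main() {
	// Example status shaped like the "webserver" container in the log above.
	started := true
	status := v1.PodStatus{
		ContainerStatuses: []v1.ContainerStatus{{
			Name:    "webserver",
			Ready:   true,
			Started: &started,
			State: v1.ContainerState{
				Running: &v1.ContainerStateRunning{},
			},
		}},
	}
	fmt.Println(stuckInWaiting(status)) // map[] -- nothing stuck
}
```

For the `webserver` container above, State.Running is set and State.Waiting is nil, so nothing is flagged; a container stuck in ContainerStateWaiting would show up with its reported Reason.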
I looked at it for 2+ hours and was not able to find anything :-\

My main theory ATM is that systemd was just slow to remove the pod scope. The last container from the pod was removed at 02:55:33.600532, and the (failed) check for the pod cgroup being gone was performed at 02:55:34.464622, i.e. a second later. One second might not be enough if a lot is going on in the system. And it looks like it was:

    $ for f in $(seq 0 9); do d="02:55:3$f"; echo -n $d ' '; grep -c $d masters-journal; done
    02:55:30 106
    02:55:31 132
    02:55:32 722
    02:55:33 846 <----
    02:55:34 188
    02:55:35 100
    02:55:36 231
    02:55:37 179
    02:55:38 184
    02:55:39 535

Also, nothing really bad happened; the pod was removed right after (at 02:55:34.856335).

To clarify, in the logs I am looking at, there are 5 different pods that have the message `Pod XXX is terminated, but pod cgroup sandbox has not been cleaned up`; the above comment is about the first such pod (d123d1c7-6142-4431-a13d-1d286320256a). Concerning the pod 23d2bdfd-34f8-45e5-9b58-73cc1b89f73c, which this bug report is about, I can't find the related logs :(

If we are unable to find a root cause of the issue, I'm fine with closing this BZ.

Found the same stack trace in the Kubernetes upstream CI, but using containerd: https://github.com/kubernetes/kubernetes/issues/91236#issuecomment-713619858
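To make the timing theory above concrete: the kubelet only considers a terminated pod reclaimed once its pod-level cgroup is gone, and with the systemd cgroup driver that removal happens asynchronously. The following is a rough sketch, not the kubelet's actual implementation, of polling for the cgroup directory to disappear; the slice path used in main is a hypothetical example for a besteffort pod, and the exact layout depends on the node's cgroup configuration. It illustrates why a single check roughly one second after the last container is removed can race with a busy systemd.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// podCgroupGone reports whether the pod-level cgroup directory has been
// removed, polling until the timeout expires. A single check shortly after
// the last container exits can easily race with systemd still tearing down
// the pod scope/slice on a busy node.
func podCgroupGone(cgroupPath string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(cgroupPath); os.IsNotExist(err) {
			return true // systemd has removed the pod cgroup
		}
		if time.Now().After(deadline) {
			return false // still present: "pod cgroup sandbox has not been cleaned up"
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	// Hypothetical path for a besteffort pod under the systemd cgroup driver;
	// the actual layout depends on the node's cgroup setup.
	path := "/sys/fs/cgroup/systemd/kubepods.slice/kubepods-besteffort.slice/" +
		"kubepods-besteffort-podd123d1c7_6142_4431_a13d_1d286320256a.slice"

	// With a ~1s budget the check can fail even though cleanup succeeds moments later.
	fmt.Println("cgroup gone within 1s: ", podCgroupGone(path, 1*time.Second))
	fmt.Println("cgroup gone within 10s:", podCgroupGone(path, 10*time.Second))
}
```

In the journal counts above, 02:55:33 alone produced 846 entries, so a roughly one-second window between the last container removal and the cgroup check being plausibly too tight fits the data.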