On 4.2.0-0.ci-2019-05-21-145642 I ran 'oc delete pod' on an etcd-member pod. When I deleted it, over the next 10-20 seconds the new pod was continually deleted and recreated; after about 25s that stopped happening. This looks like a skew problem, if only because I've seen other pods do this. It's fairly concerning and needs debugging. If we don't debug this before we fix the kubelet, we need to set up a skew test for 4.2 that verifies that the 4.2 control plane works with 4.1 nodes. This is a 4.2 exit blocker.
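For anyone else reproducing this, a minimal sketch (pod and namespace names are taken from the logs below; adjust for your cluster):

  # Reproduction sketch, assuming an etcd-member static pod named as in the
  # logs below. Deleting the mirror pod should be a no-op for the static pod,
  # but on affected builds the mirror pod churns through delete/recreate.
  oc -n openshift-etcd get pods                      # pick an etcd-member-* pod
  oc -n openshift-etcd delete pod etcd-member-master-0
  oc -n openshift-etcd get pods -w                   # watch the mirror pod flap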
I'll continue to look at it as well.
Roughly 3 iterations of the loop in the kubelet logs:

18:53:55.607060 1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.607501 1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)" because it is outdated
18:53:55.607512 1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:55.737681 1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)"
18:53:55.779707 1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)"
18:53:55.888796 1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.888951 1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.889111 1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.897317 1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.448576 1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.449115 1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)" because it is outdated
18:53:56.449131 1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:56.476909 1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.498377 1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.512463 1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.512577 1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.512713 1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.518594 1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.453726 1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.454498 1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)" because it is outdated
18:53:57.454514 1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:57.485778 1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.490318 1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.504153 1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c9d6e86d-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.504692 1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.504777 1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.504890 1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"

This happens at 1s intervals for a variable amount of time; I've seen the loop go for anywhere from 2s to 39s. The static pod UID does not change. The mirror pod UID does change.
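To watch the loop on the node itself, something along these lines works (a sketch; assumes the kubelet runs under systemd on the host and that these messages are emitted at the configured log verbosity; "master-0" is a placeholder node name):

  # Debugging sketch: follow the kubelet journal from a debug pod on the
  # affected master and filter for the mirror-pod churn.
  oc debug node/master-0 -- chroot /host journalctl -u kubelet -f \
    | grep -E 'Deleting mirror pod|SyncLoop \((ADD|DELETE|REMOVE)' \
    | grep etcd-member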
Reproduced with current master.
Broadly speaking, oc get pods shows the pod in question (or rather, the mirror pod reflecting it) in Pending state while this is going on. When this stops, or shortly after, the pod transitions to Running. This is using a 4.2 installation, so no skew. Clayton, are you seeing evidence of something more happening, or is it just the log messages that Seth noted?
Confirmed, using the build specified above (4.2.0-0.ci-2019-05-21-145642). In this case it took about 80 seconds to stabilize:

pod "etcd-member-ip-10-0-133-116.ec2.internal" deleted
 1: Mon 17 Jun 2019 10:27:52 AM EDT   0/2   Pending   0   1s
 2: Mon 17 Jun 2019 10:27:53 AM EDT   0/2   Pending   0   0s
 3: Mon 17 Jun 2019 10:27:54 AM EDT   0/2   Pending   0   0s
 4: Mon 17 Jun 2019 10:27:55 AM EDT   0/2   Pending   0   1s
...
68: Mon 17 Jun 2019 10:29:19 AM EDT   0/2   Pending   0   1s
69: Mon 17 Jun 2019 10:29:21 AM EDT   0/2   Pending   0   0s
70: Mon 17 Jun 2019 10:29:22 AM EDT   0/2   Pending   0   0s
71: Mon 17 Jun 2019 10:29:23 AM EDT   0/2   Pending   0   2s
72: Mon 17 Jun 2019 10:29:25 AM EDT   0/2   Pending   0   3s
73: Mon 17 Jun 2019 10:29:26 AM EDT   2/2   Running   0   4s
...
80: Mon 17 Jun 2019 10:29:35 AM EDT   2/2   Running   0   13s
81: Mon 17 Jun 2019 10:29:37 AM EDT   2/2   Running   0   15s

Also reproduced with a simple static pod of my own (see the sketches below). At first glance, the node logs look very much like those of a pure 4.2 cluster, and there's no evidence that the underlying static pod is affected. However, this condition continues for rather longer than on a pure cluster, where it's mostly bound by the sync interval.
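Output in that format can be produced with a loop along these lines (a sketch, not the exact script used):

  # Watch-loop sketch: after deleting the mirror pod, print a numbered,
  # timestamped status line roughly once a second.
  POD=etcd-member-ip-10-0-133-116.ec2.internal
  oc -n openshift-etcd delete pod "$POD"
  i=1
  while true; do
    echo "$i: $(date) $(oc -n openshift-etcd get pod "$POD" --no-headers 2>/dev/null | awk '{print $2, $3, $4, $5}')"
    i=$((i + 1))
    sleep 1
  done

The simple static pod was along these lines (a hypothetical manifest; any image works), dropped into the kubelet's static pod directory on a node:

  # Static pod sketch: write a minimal manifest into /etc/kubernetes/manifests
  # on the node, wait for the mirror pod to appear, then delete it via the API
  # as above.
  cat <<'EOF' > /etc/kubernetes/manifests/sleeper.yaml
  apiVersion: v1
  kind: Pod
  metadata:
    name: sleeper
    namespace: default
  spec:
    containers:
    - name: sleeper
      image: registry.access.redhat.com/ubi8/ubi-minimal
      command: ["sleep", "infinity"]
  EOF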
Changing severity to low as this only affects mirror pods of static pods and does not result in any loss of service of either the actual static pod or the system as a whole. Furthermore, 'oc delete' of a static pod is not a particularly standard operation to perform.
xref https://github.com/kubernetes/kubernetes/issues/79147
xref https://github.com/kubernetes/kubernetes/pull/79148
xref https://github.com/openshift/origin/pull/23262
Tested with 4.2.0-0.nightly-2019-07-01-102521. etcd pods get recreated within 5-10 seconds after deletion.

$ oc get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-142-240.ap-south-1.compute.internal   2/2     Running   0          84s
etcd-member-ip-10-0-144-156.ap-south-1.compute.internal   2/2     Running   0          5s
etcd-member-ip-10-0-162-128.ap-south-1.compute.internal   2/2     Running   0          39m
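To put a number on the recreation time, a sketch like this works (pod name taken from the output above):

  # Verification sketch: measure seconds from deleting the mirror pod until
  # the recreated pod reports Running again.
  POD=etcd-member-ip-10-0-144-156.ap-south-1.compute.internal
  oc -n openshift-etcd delete pod "$POD"
  start=$(date +%s)
  until [ "$(oc -n openshift-etcd get pod "$POD" -o jsonpath='{.status.phase}' 2>/dev/null)" = Running ]; do
    sleep 1
  done
  echo "Running again after $(( $(date +%s) - start ))s"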
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922