Bug 1712528 - oc delete of a static pod results in multiple attempts to re-create the mirror pod before succeeding
Summary: oc delete of a static pod results in multiple attempts to re-create the mirror pod before succeeding
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.2.0
Assignee: Robert Krawitz
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1723553
 
Reported: 2019-05-21 17:21 UTC by Clayton Coleman
Modified: 2019-10-16 06:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1723553
Environment:
Last Closed: 2019-10-16 06:29:13 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:29:33 UTC

Description Clayton Coleman 2019-05-21 17:21:59 UTC
On 4.2.0-0.ci-2019-05-21-145642 I ran 'oc delete pod' on an etcd-member. Over the next 10-20 seconds a new pod was continually deleted and recreated; after about 25s that stopped happening.

This looks like a skew problem, if only because I've seen other pods do this. It's fairly concerning and needs debugging.

If we don't debug this before we fix the kubelet, we need to set up a skew test for 4.2 that verifies that the 4.2 control plane works with 4.1 nodes.

This is a 4.2 exit blocker.

Comment 1 Seth Jennings 2019-05-21 18:23:29 UTC
I'll continue to look at it as well.

Comment 2 Seth Jennings 2019-05-21 19:14:45 UTC
Roughly three iterations of the loop from the kubelet logs:

18:53:55.607060    1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.607501    1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)" because it is outdated
18:53:55.607512    1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:55.737681    1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)"
18:53:55.779707    1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)"
18:53:55.888796    1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.888951    1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.889111    1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:55.897317    1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.448576    1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.449115    1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(29feae11-7bf8-11e9-a6ee-fa163edbbb22)" because it is outdated
18:53:56.449131    1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:56.476909    1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.498377    1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)"
18:53:56.512463    1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.512577    1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.512713    1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:56.518594    1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.453726    1001 kubelet_pods.go:1329] Generating status for "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.454498    1001 kubelet.go:1654] Deleting mirror pod "etcd-member-master-0_openshift-etcd(c8d92884-7bf9-11e9-8987-fa163edbbb22)" because it is outdated
18:53:57.454514    1001 mirror_client.go:85] Deleting a mirror pod "etcd-member-master-0_openshift-etcd"
18:53:57.485778    1001 kubelet.go:1931] SyncLoop (DELETE, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.490318    1001 kubelet.go:1925] SyncLoop (REMOVE, "api"): "etcd-member-master-0_openshift-etcd(c940152c-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.504153    1001 kubelet.go:1915] SyncLoop (ADD, "api"): "etcd-member-master-0_openshift-etcd(c9d6e86d-7bf9-11e9-8987-fa163edbbb22)"
18:53:57.504692    1001 volume_manager.go:350] Waiting for volumes to attach and mount for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.504777    1001 volume_manager.go:383] All volumes are attached and mounted for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"
18:53:57.504890    1001 kuberuntime_manager.go:585] computePodActions got {KillPod:false CreateSandbox:false SandboxID:c292a3666e77cbd57e50591cd9ab2a4652c1394bf68bd4cec8424e1e4c8177d0 Attempt:0 NextInitContainerToStart:nil ContainersToStart:[] ContainersToKill:map[]} for pod "etcd-member-master-0_openshift-etcd(8299824b99346356080f9f99d749160c)"

Happens at 1s intervals for a variable amount of time. I've seen the loop go on for anywhere from 2s to 39s.

The static pod UID does not change. The mirror pod UID does change.

Comment 3 Robert Krawitz 2019-06-11 16:07:18 UTC
Reproduced with current master.

Comment 4 Robert Krawitz 2019-06-12 21:04:55 UTC
Broadly speaking, oc get pods shows the pod in question (or rather, the mirror pod reflecting it) in Pending state while this is going on. When this stops, or shortly after, it shows the pod transitioned to Running state. This is using a 4.2 installation, so there is no skew.

Clayton, are you seeing evidence of something more happening, or is it just the log messages that Seth noted?

Comment 5 Robert Krawitz 2019-06-17 15:04:00 UTC
Confirmed, using the build specified above (4.2.0-0.ci-2019-05-21-145642). In this case it took about 80 seconds to stabilize:

pod "etcd-member-ip-10-0-133-116.ec2.internal" deleted
 1: Mon 17 Jun 2019 10:27:52 AM EDT    0/2   Pending   0     1s
 2: Mon 17 Jun 2019 10:27:53 AM EDT    0/2   Pending   0     0s
 3: Mon 17 Jun 2019 10:27:54 AM EDT    0/2   Pending   0     0s
 4: Mon 17 Jun 2019 10:27:55 AM EDT    0/2   Pending   0     1s
...
68: Mon 17 Jun 2019 10:29:19 AM EDT    0/2   Pending   0     1s
69: Mon 17 Jun 2019 10:29:21 AM EDT    0/2   Pending   0     0s
70: Mon 17 Jun 2019 10:29:22 AM EDT    0/2   Pending   0     0s
71: Mon 17 Jun 2019 10:29:23 AM EDT    0/2   Pending   0     2s
72: Mon 17 Jun 2019 10:29:25 AM EDT    0/2   Pending   0     3s
73: Mon 17 Jun 2019 10:29:26 AM EDT    2/2   Running   0     4s
...
80: Mon 17 Jun 2019 10:29:35 AM EDT    2/2   Running   0     13s
81: Mon 17 Jun 2019 10:29:37 AM EDT    2/2   Running   0     15s

Also reproduced with a simple static pod of my own. At first glance, the node logs look very much like those from a pure 4.2 cluster, and there's no evidence that the underlying static pod is affected. However, this condition continues rather longer than on a pure cluster, where it's mostly bounded by the sync interval.

Comment 10 Robert Krawitz 2019-06-18 14:48:04 UTC
Changing severity to low, as this only affects mirror pods of static pods and does not result in any loss of service for either the actual static pod or the system as a whole. Furthermore, 'oc delete' of a static pod is not a particularly standard operation to perform.

Comment 11 Robert Krawitz 2019-06-18 22:16:15 UTC
xref https://github.com/kubernetes/kubernetes/issues/79147

Comment 12 Robert Krawitz 2019-06-18 22:25:04 UTC
xref https://github.com/kubernetes/kubernetes/pull/79148

Comment 13 Robert Krawitz 2019-06-24 18:58:34 UTC
xref https://github.com/openshift/origin/pull/23262

Comment 18 Sunil Choudhary 2019-07-08 08:07:14 UTC
Tested with 4.2.0-0.nightly-2019-07-01-102521. etcd pods get recreated within 5-10 seconds after deletion.

$ oc get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-142-240.ap-south-1.compute.internal   2/2     Running   0          84s
etcd-member-ip-10-0-144-156.ap-south-1.compute.internal   2/2     Running   0          5s
etcd-member-ip-10-0-162-128.ap-south-1.compute.internal   2/2     Running   0          39m

Comment 20 errata-xmlrpc 2019-10-16 06:29:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

