e2e failure: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22819/pull-ci-openshift-origin-master-e2e-aws/8711/

The timeline of events from my debugging so far:

# detach from old node (and old pod) is started
I0514 19:30:17.938746 1 reconciler.go:230] attacherDetacher.DetachVolume started for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") on node "ip-10-0-144-99.ec2.internal"

# kubelet sees a new pod that uses the same volume
May 14 19:30:18 ip-10-0-144-99 hyperkube[998]: I0514 19:30:18.215594 998 status_manager.go:527] Status for pod "volume-client_e2e-tests-volumes-ffhmv(b49b3cca-767e-11e9-9339-129c3af80d5c)" updated successfully: (1, {Phase:Pending Conditions:[{Type:Initialized Status:True LastProb

# detach from old node (and old pod) succeeds
I0514 19:30:25.001324 1 aws.go:2244] waitForAttachmentStatus returned non-nil attachment with state=detached: {
  AttachTime: 2019-05-14 19:29:50 +0000 UTC,
  DeleteOnTermination: false,
  Device: "/dev/xvdbl",
  InstanceId: "i-018e05c99167cc3f9",
  State: "detaching",
  VolumeId: "vol-04f136c2462a930a1"
}
I0514 19:30:25.326374 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") on node "ip-10-0-144-99.ec2.internal"

# controller starts a new attach for the new pod (but on the same node)
I0514 19:30:25.364349 1 reconciler.go:289] attacherDetacher.AttachVolume started for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") from node "ip-10-0-144-99.ec2.internal"

# kubelet's VerifyControllerAttachedVolume succeeds with the old device path
May 14 19:30:34 ip-10-0-144-99 hyperkube[998]: I0514 19:30:34.388088 998 operation_generator.go:1199] Controller attach succeeded for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") pod "volume-client" (UID: "b49b3cca-767e-11e9-9339-129c3af80d5c") device path: "/dev/xvdbl"

# in the controller, the attach for the second pod eventually succeeds
I0514 19:30:59.804475 1 operation_generator.go:339] AttachVolume.Attach succeeded for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") from node "ip-10-0-144-99.ec2.internal"

The second pod is now stuck waiting for the device to be attached at /dev/xvdbl, whereas the new attach call succeeded on a different device path.

There are a couple of other "problematic" things going on:

1. The usual eventual consistency of "DescribeVolume" API calls strikes again. The detach succeeds at 19:30:25:

I0514 19:30:25.326374 1 operation_generator.go:423] DetachVolume.Detach succeeded for volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1b/vol-04f136c2462a930a1") on node "ip-10-0-144-99.ec2.internal"

and yet:

I0514 19:30:25.510323 1 aws.go:1726] volume ID: vol-04f136c2462a930a1: device mappings from EC2 Instance: map[a:vol-00a246b72606dab0b ca:vol-049251631ad7e7638]
I0514 19:30:25.510335 1 aws.go:1738] volume ID: vol-04f136c2462a930a1: Full device mappings: map[a:vol-00a246b72606dab0b ca:vol-049251631ad7e7638]
I0514 19:30:25.510356 1 aws.go:1779] Assigned mount device bv -> volume vol-04f136c2462a930a1
E0514 19:30:25.732642 1 aws.go:2133] "Error attaching EBS volume \"vol-04f136c2462a930a1\"" to instance "i-018e05c99167cc3f9" since volume is currently attached to "i-018e05c99167cc3f9"

So the second DescribeVolume API call is actually lying.

2. Traditionally we have disabled the reconciler resync mechanism in OpenShift - https://github.com/openshift/openshift-ansible/pull/5387 .
It throws one more variable into the mix:

I0514 19:30:56.098777 1 operation_generator.go:268] VerifyVolumesAreAttached.BulkVerifyVolumes failed for node "ip-10-0-144-99.ec2.internal" and volume "pvc-a007686c-767e-11e9-9339-129c3af80d5c"

At 19:30:56, that verification mechanism in the controller failed to verify the volume's attachment to the node, and hence it removed the volume from the ASW (actual state of the world). I will open a PR against controller-manager to disable this again. The volume verification mechanism is known to race with other operations in the reconciler.
This is not a blocker. We will fix this in 4.1.z
Okay, I think I know why the kubelet saw the old device path. The problem is: when the attach for the second pod failed with "Error attaching EBS volume \"vol-04f136c2462a930a1\" to instance \"i-018e05c99167cc3f9\" since volume is currently attached to \"i-018e05c99167cc3f9\"", this caused a dangling volume error from AWS. The dangling volume error is supposed to auto-correct any volume that is attached to a node it shouldn't be attached to. But in this case the dangling volume error was triggered even though the volume was not actually attached to the node, because the `describeVolume` call for the second attach returned stale data (AWS eventual consistency again). So the dangling volume exception caused the volume to be added to the actual state of the world (so that it can be detached later), which in turn caused the volume to be reported in the node's Status field. That made the second pod assume the volume was attached at an invalid device path, and hence the second pod was stuck for 10 minutes waiting for the volume to attach at a device path (/dev/xvdbl) that does not exist.
Logged https://github.com/kubernetes/kubernetes/issues/77949 in upstream to track it and discuss a fix.
Opened a PR for fixing this - https://github.com/kubernetes/kubernetes/pull/78595
Fix for openshift. Already merged - https://github.com/openshift/origin/pull/23013
@Hemant Could you advise QE on how to reproduce and/or verify this bug? Thanks.
Checked the builds in https://openshift-gce-devel.appspot.com/builds/origin-ci-test/pr-logs/pull/22819/pull-ci-openshift-origin-master-e2e-aws/ . All the builds after 8711 (the failed one reported in comment 0) - 8716, 8720, 8722, and 8725 - passed. Moving the bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2417