Bug 1523142

Summary: timeout expired waiting for volumes to attach/mount for pod
Product: OpenShift Container Platform
Reporter: Vladislav Walek <vwalek>
Component: Storage
Assignee: Tomas Smetana <tsmetana>
Status: CLOSED ERRATA
QA Contact: Qin Ping <piqin>
Severity: high
Docs Contact:
Priority: high
Version: 3.7.0
CC: aos-bugs, aos-storage-staff, bchilds, cstark, dzhukous, hekumar, nnosenzo, tsmetana
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When a node with an OpenStack Cinder persistent volume attached was shut down or crashed, the attached volume was never detached. Consequence: Pods could not be migrated from the failed node because their persistent volumes remained unavailable, and the volumes could not be accessed from any other node or pod. Fix: The problem was fixed in the OpenShift code. Result: When a node fails, all of its attached OpenStack Cinder volumes are correctly detached after a time-out.
Story Points: ---
Clone Of:
: 1590243 (view as bug list)
Environment:
Last Closed: 2018-03-28 14:14:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1590243    

Description Vladislav Walek 2017-12-07 10:02:56 UTC
Description of problem:

Cinder volumes take too much time to be reloaded.
Related PR in GitHub for k8s:
https://github.com/kubernetes/kubernetes/pull/56846

Possibly related to 
https://bugzilla.redhat.com/show_bug.cgi?id=1481729

Version-Release number of selected component (if applicable):
OpenShift Container Platform 3.7

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 9 Hemant Kumar 2018-01-10 21:54:08 UTC
The PR https://github.com/kubernetes/kubernetes/pull/56846 is ready for merge. We are just waiting for someone with approver access to approve it (I have already LGTM'd it).

Comment 22 Hemant Kumar 2018-01-17 12:49:56 UTC
Yeah, I was about to post: the aforementioned patch isn't supposed to fix the Multi-Attach error. It fixes two cases:

1. On Cinder, we were never detaching volumes from shut-down nodes. So if a node was running a DC and you brought the node down, the pod on the new node would fail to start. Can we verify that this is fixed?
2. If volume information is lost from the A/D controller's ActualStateOfWorld, the patch uses the same dangling-volume mechanism as on AWS to correct the error.
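The first case amounts to this rule: once a node has been unreachable longer than some timeout, the attach/detach controller should force-detach its volumes so other nodes can use them. A minimal sketch of that idea, assuming a simplified model (this is not the actual Kubernetes controller code; the timeout value, class names, and fields are illustrative only):

```python
# Hypothetical sketch of force-detach-after-timeout; not the real
# Kubernetes attach/detach controller implementation.
from dataclasses import dataclass, field
from typing import Optional, Set, List, Tuple

FORCE_DETACH_TIMEOUT = 6 * 60  # seconds; assumed value for illustration


@dataclass
class Node:
    name: str
    # None means the node is healthy; otherwise the time it went unreachable.
    unreachable_since: Optional[float] = None
    attached_volumes: Set[str] = field(default_factory=set)


def reconcile(nodes: List[Node], now: float) -> List[Tuple[str, str]]:
    """Detach all volumes from nodes that have been down past the timeout.

    Returns the list of (node, volume) pairs that were detached.
    """
    detached = []
    for node in nodes:
        if node.unreachable_since is None:
            continue  # healthy node: leave its volumes attached
        if now - node.unreachable_since >= FORCE_DETACH_TIMEOUT:
            for vol in sorted(node.attached_volumes):
                detached.append((node.name, vol))
            node.attached_volumes.clear()
    return detached
```

With this rule in place, a volume stuck on a crashed node becomes free to attach elsewhere after the timeout, which is the behavior described in the Doc Text above.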

Comment 23 Tomas Smetana 2018-01-17 13:06:32 UTC
What I did:

1. Started up a cluster with 1 master and 2 nodes
2. Created a cinder PVC/PV
3. Created a pod using the PVC
4. Shut down the node the pod was running on and waited for the pod to disappear from the API server
5. Started the same pod (using the same, already attached PV) again

I verified the pod came up again. This looks to be case #1. I guess I need one more test (restarting the controller after the pod disappears).
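The reproduction above can be sketched with manifests like the following. This is a minimal illustration only: the claim name, pod name, image, and volume size are assumptions, not taken from the original report, and the cluster's default storage class is assumed to provision Cinder volumes.

```yaml
# Hypothetical Cinder-backed PVC and pod for the reproduction; all names
# and sizes are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cinder-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: cinder-pod
spec:
  containers:
    - name: app
      image: registry.access.redhat.com/rhel7/rhel-tools
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: cinder-claim
```

After shutting down the node running the pod and recreating the pod, one would check whether the volume detaches from the failed node and the new pod reaches Running (e.g. with `oc get pod -o wide` and `oc describe pod`).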

Comment 25 Tomas Smetana 2018-01-17 17:01:28 UTC
https://github.com/openshift/origin/pull/18140

Comment 27 Qin Ping 2018-02-05 07:21:03 UTC
In OCP version v3.9.0-0.36.0, the pod's status changed to Running after 8 minutes.
In OCP version v3.7.27, the pod's status was still ContainerCreating after 22 minutes.

So, changed the bug to verified.

Comment 30 errata-xmlrpc 2018-03-28 14:14:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489