Bug 1542781 - Pod with Azure Persistent Volume stuck in "Container creating" after node shutdown [NEEDINFO]
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.9.0
Assigned To: hchen
QA Contact: Wenqi He
Keywords: Reopened, UpcomingRelease
Depends On:
Blocks:
 
Reported: 2018-02-06 20:01 EST by Greg Rodriguez II
Modified: 2018-03-28 10:26 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-03-28 10:26:32 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
hchen: needinfo? (rhowe)


Attachments: none


External Trackers:
  Red Hat Product Errata RHBA-2018:0489; Priority: None; Status: None; Summary: None; Last Updated: 2018-03-28 10:26 EDT

Description Greg Rodriguez II 2018-02-06 20:01:57 EST
Description of problem:
We have an OCP 3.6 cluster in Azure. The nodes, master API and master controller are configured for Azure, and we use dynamically provisioned Azure disks as persistent volumes.
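
(For context, dynamic Azure disk provisioning in a setup like this is normally driven by an azure-disk StorageClass plus a PVC. The snippet below is only an illustrative sketch, not a dump from this cluster: the class name azure-standard, the claim name postgresql-data, and the 5Gi size are assumptions, and exact StorageClass parameters vary by OCP release.)

oc create -f - <<'EOF'
# Illustrative StorageClass for dynamically provisioned Azure disks (assumed name and parameters)
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azure-standard
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Standard_LRS
---
# Illustrative PVC that the PostgreSQL deployment would mount as its data volume
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: postgresql-data
  namespace: testprj2
spec:
  storageClassName: azure-standard
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF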

A PostgreSQL pod with an Azure disk as a persistent volume was running on node01. As a test, we shut down node01 in the Azure portal. OpenShift created a replacement pod on node03, but the PV cannot be attached to node03. In the log file of node03, we see this repeatedly:

Jan 18 17:16:53 node03 atomic-openshift-node[1506]: E0118 17:16:53.340451    1506 kubelet.go:1556] Unable to mount volumes for pod "postgresql-1-flt8m_testprj2(0533a2bc-fbf6-11e7-b2a9-000d3a3612dc)": timeout expired waiting for volumes to attach/mount for pod "testprj2"/"postgresql-1-flt8m". list of unattached/unmounted volumes=[postgresql-data]; skipping pod

Jan 18 17:16:53 node03 atomic-openshift-node[1506]: E0118 17:16:53.340495    1506 pod_workers.go:182] Error syncing pod 0533a2bc-fbf6-11e7-b2a9-000d3a3612dc ("postgresql-1-flt8m_testprj2(0533a2bc-fbf6-11e7-b2a9-000d3a3612dc)"), skipping: timeout expired waiting for volumes to attach/mount for pod "testprj2"/"postgresql-1-flt8m". list of unattached/unmounted volumes=[postgresql-data]

On the active master controller, we see these log entries:

Jan 18 16:16:54 master1 atomic-openshift-master-controllers[39459]: I0118 16:16:54.091419   39459 actual_state_of_world.go:310] Volume "kubernetes.io/azure-disk/kubernetes-dynamic-pvc-ed84ece7-fbf4-11e7-b20f-000d3a371192.vhd" is already added to attachedVolume list to node "node01", update device path "2"

Jan 18 16:16:54 master1 atomic-openshift-master-controllers[39459]: W0118 16:16:54.092262   39459 reconciler.go:269] (Volume : "kubernetes.io/azure-disk/kubernetes-dynamic-pvc-ed84ece7-fbf4-11e7-b20f-000d3a371192.vhd") from node "node03" failed to attach - volume is already exclusively attached to another node

Jan 18 16:16:54 master1 atomic-openshift-master-controllers[39459]: I0118 16:16:54.092553   39459 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"testprj2", Name:"postgresql-1-flt8m", UID:"0533a2bc-fbf6-11e7-b2a9-000d3a3612dc", APIVersion:"v1", ResourceVersion:"8340", FieldPath:""}): type: 'Warning' reason: 'FailedAttachVolume' (Volume : "kubernetes.io/azure-disk/kubernetes-dynamic-pvc-ed84ece7-fbf4-11e7-b20f-000d3a371192.vhd") from node "node03" failed to attach - volume is already exclusively attached to another node

Jan 18 16:16:54 master1 atomic-openshift-master-controllers[39459]: I0118 16:16:54.107198   39459 node_status_updater.go:136] Updating status for node "node01" succeeded. patchBytes: "{}" VolumesAttached: [{kubernetes.io/azure-disk/kubernetes-dynamic-pvc-ed84ece7-fbf4-11e7-b20f-000d3a371192.vhd 2}]

The annotation volumes.kubernetes.io/controller-managed-attach-detach=true is set for all nodes, so the master controller should be able to detach the volume from node01.
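
A quick way to confirm both points from the CLI (a sketch; node01 is the shut-down node from the description):

# Confirm the kubelet on the node delegates attach/detach to the controller
oc describe node node01 | grep controller-managed-attach-detach

# See which volumes the controller still records as attached to the shut-down node
oc get node node01 -o yaml | grep -A 3 volumesAttached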

Version-Release number of selected component (if applicable):
openshift v3.6.173.0.83
kubernetes v1.6.1+5115d708d7
etcd 3.2.1

How reproducible:
Reproducible; verified by the customer.

Comment 2 hchen 2018-02-08 11:22:56 EST
This is similar to a known issue with Cinder volumes [1].

1. https://github.com/kubernetes/kubernetes/issues/57497
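
Until a fix is available, the usual manual workaround for this class of problem (not specific to this bug, and only a sketch using the node name from the description; <old-postgresql-pod> is a placeholder) is to make the attach/detach controller forget the dead node; detaching the disk from the VM in the Azure portal has the same effect:

# Delete the shut-down node's API object so its attached-volume record is dropped
oc delete node node01

# If the original pod is still shown on the dead node (stuck in Terminating/Unknown),
# force-delete it so the replacement pod can take over the claim
oc delete pod <old-postgresql-pod> -n testprj2 --grace-period=0 --force
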
Comment 13 errata-xmlrpc 2018-03-28 10:26:32 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489
