1540080 – OCP pods fail to mount gluster block volumes after cluster being rebooted

Summary: OCP pods fail to mount gluster block volumes after cluster being rebooted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	3.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	3.9.0
Assignee:	Prasanna Kumar Kalever
QA Contact:	Wenkai Shi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1570976

Reported:	2018-01-30 08:58 UTC by Ture Karlsson
Modified:	2019-05-09 11:13 UTC
CC List:	19 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1570976
Environment:
Last Closed:	2018-03-28 14:23:55 UTC
Target Upstream Version:
Embargoed:
Flags:	ansverma: needinfo+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:0489	0	None	None	None	2018-03-28 14:24:35 UTC

Description Ture Karlsson 2018-01-30 08:58:09 UTC

Description of problem:

I had a working installation of CNS 3.6 on OCP 3.6 and had successfully migrated registry storage to glusterfs storage, and metrics and logging to glusterblock storage. Due to maintenance on the underlying infrastructure, all OCP nodes needed to be shut down for a couple of hours. When powering the cluster back on, the glusterblock volumes fail to mount in the cassandra and elasticsearch pods, while the registry pod can mount its glusterfs volume again without problems.

# oc get pods --all-namespaces | egrep "es-data|cassandra"
logging                   logging-es-data-master-2yyubnfn-1-z90sp   0/1       ContainerCreating   0          3d
logging                   logging-es-data-master-80izip4w-1-bbqhr   0/1       ContainerCreating   0          3d
logging                   logging-es-data-master-is16luj3-1-th2gb   0/1       ContainerCreating   0          3d
openshift-infra           hawkular-cassandra-1-5ctm1                0/1       ContainerCreating   0          3d
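
A minimal sketch of how the stuck pods in a listing like the one above could be picked out programmatically. It assumes the six whitespace-separated columns shown (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); the function name `stuck_pods` is hypothetical and not part of any OpenShift tooling.

```python
def stuck_pods(listing, status="ContainerCreating"):
    """Return (namespace, pod) pairs whose STATUS column equals `status`.

    Assumes the six-column layout of `oc get pods --all-namespaces`:
    NAMESPACE NAME READY STATUS RESTARTS AGE
    """
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and lines with an unexpected shape.
        if len(fields) != 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, pod_status, _restarts, _age = fields
        if pod_status == status:
            found.append((namespace, name))
    return found


# Two rows copied from the listing in this report.
sample = """\
logging                   logging-es-data-master-is16luj3-1-th2gb   0/1       ContainerCreating   0          3d
openshift-infra           hawkular-cassandra-1-5ctm1                0/1       ContainerCreating   0          3d
"""

for namespace, pod in stuck_pods(sample):
    print(namespace, pod)
```

Each pair returned can then be inspected with `oc describe pod <name> -n <namespace>`, as done for the elasticsearch pod in this report.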

Similar error messages are seen in each of these pods:

# oc describe pod logging-es-data-master-is16luj3-1-th2gb -n logging
...
Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----				-------------	--------	------		-------
  3d		23m		2281	kubelet, cns01.example.com			Warning		FailedMount	MountVolume.SetUp failed for volume "kubernetes.io/iscsi/9dd4d291-0286-11e8-9b78-001a4a160453-pvc-002a3213-01b2-11e8-92a8-001a4a160352" (spec.Name: "pvc-002a3213-01b2-11e8-92a8-001a4a160352") pod "9dd4d291-0286-11e8-9b78-001a4a160453" (UID: "9dd4d291-0286-11e8-9b78-001a4a160453") with: failed to get any path for iscsi disk, last err seen:
Could not attach disk: Timeout after 10s
  3d	11m	2529	kubelet, cns01.example.com	Warning	FailedSync	Error syncing pod
  3d	2m	2531	kubelet, cns01.example.com	Warning	FailedMount	Unable to mount volumes for pod "logging-es-data-master-is16luj3-1-th2gb_logging(9dd4d291-0286-11e8-9b78-001a4a160453)": timeout expired waiting for volumes to attach/mount for pod "logging"/"logging-es-data-master-is16luj3-1-th2gb". list of unattached/unmounted volumes=[elasticsearch-storage]

Version-Release number of selected component (if applicable):

Openshift Container Platform 3.6
Container Native Storage 3.6

How reproducible:

1 out of 1 try for me

Steps to Reproduce:

1. Install OCP 3.6 cluster according to documentation: 
https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html

2. Deploy CNS 3.6 according to documentation:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html/container-native_storage_for_openshift_container_platform/chap-documentation-install_upgrade_matrix_red_hat_gluster_storage_container_native_with_openshift_platform-introduction_containerized_rhgs#idm140179699791088

3. Configure block storage according to documentation:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#Block_Storage

4. Migrate Registry to CNS backed glusterfs volume according to documentation:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#chap-Documentation-Red_Hat_Gluster_Storage_Container_Native_with_OpenShift_Platform-Updating_Registry

5. Migrate Metrics and Logging to CNS backed glusterblock volumes according to documentation:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#Logging_Metrics

6. Reboot each node in the cluster.

Actual results:

The registry pod successfully mounts the glusterfs volume again.

The cassandra and elasticsearch pods fail to mount the glusterblock volumes with the following error:

# oc describe pod hawkular-cassandra-1-5ctm1 -n openshift-infra
...
  3d        4m        1895    kubelet, cns01.example.com     Warning        FailedMount    MountVolume.SetUp failed for volume "kubernetes.io/iscsi/19bb1a09-029c-11e8-aecf-001a4a160352-pvc-3684bb5a-01bc-11e8-a7a9-001a4a160352" (spec.Name: "pvc-3684bb5a-01bc-11e8-a7a9-001a4a160352") pod "19bb1a09-029c-11e8-aecf-001a4a160352" (UID: "19bb1a09-029c-11e8-aecf-001a4a160352") with: failed to get any path for iscsi disk, last err seen:
Could not attach disk: Timeout after 10s

In /var/log/messages on the CNS node, the following errors appear:

Jan 29 19:52:38 cns01.example.com journal: E0129 19:52:38.145119   14821 iscsi_util.go:272] iscsi: failed to get any path for iscsi disk, last err seen:
Jan 29 19:52:38 cns01.example.com atomic-openshift-node: E0129 19:52:38.145618   14821 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/iscsi/9dd4d291-0286-11e8-9b78-001a4a160453-pvc-002a3213-01b2-11e8-92a8-001a4a160352\" (\"9dd4d291-0286-11e8-9b78-001a4a160453\")" failed. No retries permitted until 2018-01-29 19:54:38.145189855 +0100 CET (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/iscsi/9dd4d291-0286-11e8-9b78-001a4a160453-pvc-002a3213-01b2-11e8-92a8-001a4a160352" (spec.Name: "pvc-002a3213-01b2-11e8-92a8-001a4a160352") pod "9dd4d291-0286-11e8-9b78-001a4a160453" (UID: "9dd4d291-0286-11e8-9b78-001a4a160453") with: failed to get any path for iscsi disk, last err seen:
Jan 29 19:52:38 cns01.example.com atomic-openshift-node: Could not attach disk: Timeout after 10s
Jan 29 19:52:38 cns01.example.com journal: Could not attach disk: Timeout after 10s
Jan 29 19:52:38 cns01.example.com journal: E0129 19:52:38.145129   14821 disk_manager.go:50] failed to attach disk
Jan 29 19:52:38 cns01.example.com journal: E0129 19:52:38.145132   14821 iscsi.go:247] iscsi: failed to setup
Jan 29 19:52:38 cns01.example.com journal: E0129 19:52:38.145618   14821 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/iscsi/9dd4d291-0286-11e8-9b78-001a4a160453-pvc-002a3213-01b2-11e8-92a8-001a4a160352\" (\"9dd4d291-0286-11e8-9b78-001a4a160453\")" failed. No retries permitted until 2018-01-29 19:54:38.145189855 +0100 CET (durationBeforeRetry 2m0s). Error: MountVolume.SetUp failed for volume "kubernetes.io/iscsi/9dd4d291-0286-11e8-9b78-001a4a160453-pvc-002a3213-01b2-11e8-92a8-001a4a160352" (spec.Name: "pvc-002a3213-01b2-11e8-92a8-001a4a160352") pod "9dd4d291-0286-11e8-9b78-001a4a160453" (UID: "9dd4d291-0286-11e8-9b78-001a4a160453") with: failed to get any path for iscsi disk, last err seen:
Jan 29 19:52:38 cns01.example.com journal: Could not attach disk: Timeout after 10s
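
The node log repeats the same failure through several loggers, so it can help to reduce it to the set of affected volumes. The following is a rough sketch, assuming the message format seen in the log lines above (`MountVolume.SetUp failed ... spec.Name: "pvc-..."`); the helper name `failed_iscsi_volumes` is hypothetical.

```python
import re

# Captures the volume name kubelet reports, e.g. spec.Name: "pvc-...".
SPEC_NAME = re.compile(r'spec\.Name: "([^"]+)"')


def failed_iscsi_volumes(log_text):
    """Return the sorted, de-duplicated volume names from iSCSI mount failures."""
    names = set()
    for line in log_text.splitlines():
        # Only consider kubelet mount failures for iSCSI-backed volumes.
        if "MountVolume.SetUp failed" in line and "kubernetes.io/iscsi/" in line:
            match = SPEC_NAME.search(line)
            if match:
                names.add(match.group(1))
    return sorted(names)


# Message text taken from the cassandra pod event in this report, repeated
# twice to show that duplicates collapse to a single entry.
message = (
    'MountVolume.SetUp failed for volume '
    '"kubernetes.io/iscsi/19bb1a09-029c-11e8-aecf-001a4a160352-'
    'pvc-3684bb5a-01bc-11e8-a7a9-001a4a160352" '
    '(spec.Name: "pvc-3684bb5a-01bc-11e8-a7a9-001a4a160352") '
    'with: failed to get any path for iscsi disk'
)
sample_log = message + "\n" + message + "\n"

print(failed_iscsi_volumes(sample_log))
```

The resulting names correspond to the persistent volumes whose glusterblock (iSCSI) targets could not be reached from the node.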

Expected results:

The glusterblock volumes should successfully mount in the cassandra and elasticsearch pods.

Comment 3 Ture Karlsson 2018-01-30 09:36:47 UTC

I can add that the environment above is running on RHHI v1.1.

Comment 33 Jose A. Rivera 2018-02-19 14:23:53 UTC

PR is upstream: https://github.com/openshift/openshift-ansible/pull/7198

Comment 34 Jose A. Rivera 2018-02-19 14:33:24 UTC

Moving this BZ to OCP as this is a bug with the openshift-ansible installer.

Comment 35 Scott Dodson 2018-02-20 21:51:37 UTC

Fix is in openshift-ansible-3.9.0-0.46.0

Comment 37 Wenkai Shi 2018-02-24 02:54:35 UTC

Will verify this once BZ #1547229 is fixed.

Comment 38 Wenkai Shi 2018-02-28 08:18:07 UTC

Verified with version openshift-ansible-3.9.1-1.git.0.9862628.el7; the code is merged. Once the installation is done, the target mount has been added for gluster block.

# oc export ds glusterfs-registry
...
        - mountPath: /etc/target
          name: glusterfs-target
...
      - hostPath:
          path: /etc/target
          type: ""
        name: glusterfs-target
...
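
The same verification can be sketched as a check on the DaemonSet definition. This assumes the object has already been loaded into a Python dict (for example from JSON output) with the usual `spec.template.spec` layout; the function name `has_target_mount` is hypothetical.

```python
def has_target_mount(daemonset, path="/etc/target"):
    """True if a container mounts a hostPath volume for `path` at `path`."""
    pod_spec = daemonset.get("spec", {}).get("template", {}).get("spec", {})
    # Names of volumes backed by a hostPath at the expected location.
    host_volumes = {
        vol.get("name")
        for vol in pod_spec.get("volumes", [])
        if (vol.get("hostPath") or {}).get("path") == path
    }
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount.get("mountPath") == path and mount.get("name") in host_volumes:
                return True
    return False


# Reduced example mirroring the fragments shown above; the container name
# is a placeholder and all other fields of the DaemonSet are omitted.
fixed_ds = {
    "spec": {"template": {"spec": {
        "containers": [{
            "name": "glusterfs",
            "volumeMounts": [
                {"mountPath": "/etc/target", "name": "glusterfs-target"},
            ],
        }],
        "volumes": [
            {"hostPath": {"path": "/etc/target", "type": ""},
             "name": "glusterfs-target"},
        ],
    }}}
}

print(has_target_mount(fixed_ds))
```

A DaemonSet deployed before the fix would lack both the volume and the mount, so the check would return False for it.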

Comment 47 errata-xmlrpc 2018-03-28 14:23:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 48 Oonkwee Lim 2018-04-13 02:04:42 UTC

Hello Prasanna,

What can we do for this customer, who re-opened the case?

I'm reopening this because I see that the BZ is now closed: https://bugzilla.redhat.com/show_bug.cgi?id=1540080
With errata:  https://access.redhat.com/errata/RHBA-2018:0489

The errata is only for OCP 3.9, but we need this for 3.6. We can't upgrade our cluster because RHMAP that we are running on top of OCP is only supported on 3.6.


Thanks and Regards

Oonkwee Lim
Enterprise Cloud Support

Comment 49 Mangirdas Judeikis 2018-04-13 08:50:03 UTC

https://github.com/openshift/openshift-ansible/pull/7767

This has been merged, so once a new release is out this should work for 3.6.
