Bug 1858195
| Summary: | [GSS] registry pod stuck in ContainerCreating due to pvc from cephfs storage class fail to mount | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Prashant Dhange <pdhange> |
| Component: | csi-driver | Assignee: | Mudit Agarwal <muagarwa> |
| Status: | CLOSED ERRATA | QA Contact: | akarsha <akrai> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | akrai, edonnell, hchiramm, jdillama, jwang, khiremat, madam, mrajanna, muagarwa, nberry, oarribas, ocs-bugs, pdonnell, ratamir, r.martinez, srangana, suchaudh, tdesala, tnielsen, wwurzbac |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.6.0 | Flags: | khiremat: needinfo-, khiremat: needinfo- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | .`ceph-mgr` no longer causes errors during requests: Previously, certain `ceph-mgr` modules (`fs`) always connected to the MONs that were passed in as part of the initial `ceph-mgr` pod creation. Therefore, when the MON endpoints were changed, these modules failed to connect to the Red Hat Ceph Storage cluster for various requests, such as provisioning and staging CephFS volumes, causing errors. With this update, `ceph-mgr` has been fixed to keep its MON endpoints updated as they change, and to not rely only on the initial MON addresses that are passed during pod creation, so `ceph-mgr` operations continue to work as expected. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-17 06:23:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1858891 | | |
| Bug Blocks: | 1882359 | | |
Comment 6
Mudit Agarwal
2020-07-17 14:03:04 UTC
There is something fishy going on with this cluster. The mon endpoints are never expected to change, yet they appear to have changed. The clue is that the mon services were created after the creation of the cluster. The OSD logs show that they were started on July 7th with the original a, b, c endpoints. However, the services.yaml in the must-gather shows that the mon services were created at these timestamps:

- mon-a: creationTimestamp "2020-07-11T04:44:38Z", clusterIP 172.30.55.194
- mon-b: creationTimestamp "2020-07-09T23:45:55Z", clusterIP 172.30.81.162
- mon-h: creationTimestamp "2020-07-15T23:33:30Z", clusterIP 172.30.10.122

The services for a and b clearly should have been created on July 7th or earlier, when the cluster was created. The only explanation is that the services were deleted or otherwise tampered with and were created again by Rook when found missing. But creating a new service results in a new mon endpoint, which results in a cluster that stops working. The operator logs are only from today and do not show any history back to the date when the services would have been corrupted.

A workaround may be to delete these mon services, re-create them with the original clusterIPs, update them in the configmap "rook-ceph-mon", and then restart the operator and the daemons.

After looking again, I agree with Patrick that instead of reverting to the original mon endpoints we should keep the updated endpoints and update the clients to also pick them up. The core ceph cluster surprisingly looks healthy despite the endpoints being changed; I would have expected that changing a mon endpoint would cause that mon to fall out of quorum. But the first priority is to keep core ceph healthy, which it is.
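On a live cluster (rather than from a must-gather), the same check can be repeated by comparing the mon services' creation timestamps and cluster IPs with what the operator is tracking. A rough sketch, assuming the default openshift-storage namespace and the standard Rook mon label; the configmap name is the one referenced above and may differ between Rook versions:

```
# Mon services with their creation time and ClusterIP.
oc -n openshift-storage get svc -l app=rook-ceph-mon \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,CLUSTER-IP:.spec.clusterIP

# Endpoints currently recorded by the operator, for comparison.
oc -n openshift-storage get cm rook-ceph-mon -o yaml
```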
Restarting the Ceph daemon pods that are having issues (is it just the mgr?) should get them to pick up the new mons and resolve their issues.
For the clients, the CSI driver reads the endpoints from the configmap "rook-ceph-csi-config"; all the CSI provisioner and plugin pods consult it to find the mon endpoints when mounting volumes. The must-gather does show that it correctly specifies the latest mon endpoints:
csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.55.194:6789","172.30.81.162:6789"]}]'
The CSI driver was designed to pick up the new mon endpoints dynamically, IIRC whenever the volume is mounted.
@Shyam do we just need to restart the app pods to get them to pick up the new endpoints?
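For reference, one way to confirm which mon list the CSI sidecars will use for new requests is to read that configmap directly. A minimal sketch, assuming the default openshift-storage namespace; the configmap and key names are taken from the must-gather output above:

```
# Print the mon list CSI consults for provision/stage requests.
oc -n openshift-storage get cm rook-ceph-csi-config \
  -o jsonpath='{.data.csi-cluster-config-json}{"\n"}'
```

If the monitors listed there already match the current mon services, only components that cached the old addresses at startup (the mgr, in this bug) still need a restart.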
(In reply to Travis Nielsen from comment #16)
> For the clients, the CSI driver reads the endpoints from the configmap
> "rook-ceph-csi-config"; all the CSI provisioner and plugin pods consult it
> to find the mon endpoints when mounting volumes. The must-gather does show
> that it correctly specifies the latest mon endpoints:
>
> csi-cluster-config-json:
> '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.55.194:6789","172.30.81.162:6789"]}]'
>
> The CSI driver was designed to pick up the new mon endpoints dynamically,
> IIRC whenever the volume is mounted.
> @Shyam do we just need to restart the app pods to get them to pick up the
> new endpoints?

The app pods, as in the "image-registry" in this case? I would say that is not required: CSI re-reads the MON endpoints from the config for each request, and as you note it is already up to date on this. So the staging and mount will proceed once the stale request from the mgr returns (which it should as soon as the mgr is restarted). We will have to wait for the current exponential timeout to expire, which may take up to 5 minutes (from the logs in this case it looks like NodeStage should be retried every 2 minutes). The above should keep the number of "pods to restart" lower for the customer.

Thanks for the update Prasanth.

@Shyam, what are the next steps here? Do we need to keep this open for any code changes? In any case, this doesn't look like a candidate for 4.5. Moving this to 4.6.

(In reply to Mudit Agarwal from comment #22)
> @Shyam, what are the next steps here? Do we need to keep this open for any
> code changes?

The mgr/volumes tracker, to ensure this command does not get stuck due to incorrect exception handling, needs to be fixed. Patrick has added the tracker for the same: https://tracker.ceph.com/issues/46607

@travis should we consider alternatives to '--mon-host' for the MGR pod(s), such that MON updates are picked up automatically? (I am not sure about this, but asking anyway)

(In reply to Shyamsundar from comment #25)
> @travis should we consider alternatives to '--mon-host' for the MGR pod(s),
> such that MON updates are picked up automatically? (I am not sure about
> this, but asking anyway)

I think Jason already answered it; agreed on his points. I don't see a change needed in Rook.
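As a side note on the '--mon-host' question, the addresses the mgr was started with can be read from its pod spec and compared with the mon service listing shown earlier. A sketch, assuming the openshift-storage namespace and the standard app=rook-ceph-mgr label; exactly where the address list appears (a flag versus an environment variable) depends on the Rook version:

```
# Dump the mgr pod spec and pick out how the MON address list was injected
# at pod creation time.
oc -n openshift-storage get pod -l app=rook-ceph-mgr -o yaml \
  | grep -iE 'mon[-_]host|mon[-_]endpoints'
```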
Prasanth, this can be closed. Nothing is pending from our side. See comments https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28

(In reply to Mudit Agarwal from comment #29)
> Prasanth, this can be closed. Nothing is pending from our side. See comments
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28

I think we'll want to validate that this is fixed for OCS. The issue should be reproduced internally and the fix verified. Ceph (RHCS) will have its own validation but that won't be sufficient for OCS, IMO.

(In reply to Patrick Donnelly from comment #30)
> I think we'll want to validate that this is fixed for OCS. The issue should
> be reproduced internally and the fix verified. Ceph (RHCS) will have its own
> validation but that won't be sufficient for OCS, IMO.

Sure, let's keep it open then.

Reducing severity and priority based on comment #22.

Please let me know the steps and validations needed to verify the bz. Thank you in advance.

This is a ceph BZ, Kotresh might help here.

(In reply to akarsha from comment #38)
> Please let me know the steps and validations needed to verify the bz. Thank
> you in advance.

The problem arises due to the MON addresses changing completely, without any restart of the ceph-mgr pod. To reproduce and validate this, the test case would need to create new MONs and retire(?) the older MONs, such that the ceph cluster is running with newer MONs than the ones it was bootstrapped with. In this state the ceph-mgr, and more specifically the cephfs set of commands, should still be responsive; for example, creating a PVC backed by CephFS should work and not error out.
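That check (a CephFS-backed PVC should provision and bind even though the MONs have moved) can be run directly. A minimal sketch; the storage class name ocs-storagecluster-cephfs and the throwaway project name are assumptions, so adjust them to what `oc get sc` reports on the cluster:

```
# A throwaway project for the check (hypothetical name).
oc new-project cephfs-pvc-check

# Create a small CephFS-backed PVC; per the validation described above, on a
# fixed build this should reach Bound even though the MON endpoints changed.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-mon-move-check
  namespace: cephfs-pvc-check
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
EOF

# Watch until the claim reports Bound (Ctrl-C to stop watching).
oc -n cephfs-pvc-check get pvc cephfs-mon-move-check -w
```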
Shyam, please fill the doc_text for this BZ.

(In reply to Mudit Agarwal from comment #42)
> Shyam, please fill the doc_text for this BZ.

Updated the doc text.

There are no well-known steps to reproduce the issue. These are the steps tried to reproduce it, as suggested:

a. Scale down mon deployments: no mon endpoints changed once the mons came back up and running.
b. Prolonged shutdown of worker nodes: not every time; observed one of the mon endpoints change (1/3) and was able to create a cephfs pvc.
c. Add extra nodes (total 6 OCS nodes), then prolonged shutdown and drain of nodes: no mon endpoints changed once the mons came back up and running, and was able to create a cephfs pvc successfully.
d. Etcd stability (restarting pods, scaling down, shutdown of a master node): no mon endpoints changed once the mons came back up and running, and was able to create a cephfs pvc successfully.
e. Certificate expiry for masters (as well as for OCS nodes): as suggested, kept the OCS nodes shut down for a day and didn't hit the issue.

One more test was performed as suggested by Travis: manually deleting the mon services successfully changes the endpoints, but the cluster goes into a bad state; raised a separate bz #1897029 for it as discussed.

All of the above steps were performed on OCS 4.6, 4.5 and 4.4, with the same behavior observed. Versions:

OCP: 4.4.29, OCS: ocs-operator.v4.4.2
OCP: 4.5.16, OCS: ocs-operator.v4.5.1
OCP: 4.6.0-0.nightly-2020-11-07-035509, OCS: ocs-operator.v4.6.0-156.ci

Since we don't have exact steps to reproduce the issue, the ceph bz has been verified, and all the tier1 and tier4 tests are expected to pass on the same build, can we move the bz to the verified state based on the sanity tests if there are no objections? If anyone knows other ways to reproduce the issue, please let me know.

Here's how I'd try:
1. Disconnect the MGR node from the rest of the cluster (specifically, from nodes running the MONs).
2. Restart one of the nodes running one of the MONs - my assumption is that the MON will re-launch on a different node.
3. Repeat for the other nodes running MONs - so now they all run elsewhere.
4. Reconnect the MGR node - it should now have an outdated list of MON addresses.

(In reply to Yaniv Kaul from comment #46)

Thanks Yaniv for the suggestion. We tried this scenario of restarting the nodes where the mons are hosted and it didn't change the mon endpoints. Neither did prolonged shutdown of the nodes.

Here is what I tried:

OCP: 4.6.0-0.nightly-2020-11-18-154058
OCS: 4.6.0-160.ci

> 1. Disconnect the MGR node from the rest of the cluster (specifically, from
> nodes running the MONs).

The mgr pod is running on a different node, say compute-3 (the mons are not hosted on the same node).

> 2. Restart one of the nodes running one of the MONs - my assumption is that
> the MON will re-launch on a different node.
> 3. Repeat for the other nodes running MONs - so now they all run elsewhere.

Noted the mon endpoints before the restarts and restarted the nodes where the mons are running one by one. Once all mons were up, checked the endpoints and they remained the same. Observed similar behavior with prolonged shutdown of the nodes hosting mons: the mon endpoints remained the same.

> 4. Reconnect the MGR node - it should now have an outdated list of MON
> addresses.

Again tried the second method given in comment #48, with the same observation: when the MONs had new addresses and the mgr was still using the old MON addresses, cephfs pvc creation succeeded and pvc mount requests also succeeded, with no errors seen in the mgr logs. After the restart of the mgr pod, it picked up the new mon endpoints. Based on these observations, moving the bz to verified. Logs here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1858195/bz/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605