Log location (other than supportshell): https://drive.google.com/file/d/1Y7mnk7Pihq6c5zdE-eVOrfftepuponxO/view?usp=sharing

The cephfs plugin logs show that NodeStageVolume for the volume in question has been stuck since 07/13; that was the first time the cephfs plugin issued a command to ceph, and it never came back:
-------------------------
utils.go:157] ID: 77 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 GRPC call: /csi.v1.Node/NodeStageVolume
utils.go:158] ID: 77 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-68cc31f8-80eb-41b3-a17c-7c22475b9900/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":5}},"volume_context":{"clusterID":"openshift-storage","fsName":"ocs-storagecluster-cephfilesystem","storage.kubernetes.io/csiProvisionerIdentity":"1592512686709-8081-openshift-storage.cephfs.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07"}
util.go:48] ID: 77 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 cephfs: EXEC ceph [-m 172.30.55.194:6789,172.30.81.162:6789 --id csi-cephfs-node --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs dump --format=json]
util.go:48] ID: 77 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 cephfs: EXEC ceph [-m 172.30.55.194:6789,172.30.81.162:6789 --id csi-cephfs-node --keyfile=***stripped*** -c /etc/ceph/ceph.conf fs ls --format=json]
utils.go:157] ID: 81 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 GRPC call: /csi.v1.Node/NodeStageVolume
utils.go:158] ID: 81 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-68cc31f8-80eb-41b3-a17c-7c22475b9900/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":5}},"volume_context":{"clusterID":"openshift-storage","fsName":"ocs-storagecluster-cephfilesystem","storage.kubernetes.io/csiProvisionerIdentity":"1592512686709-8081-openshift-storage.cephfs.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07"}
nodeserver.go:83] ID: 81 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 already exists
utils.go:161] ID: 81 Req-ID: 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 already exists
-------------------------------------

For the same volume I can see a lot of "fs subvolume getpath" commands in the ceph-mgr logs (rook-ceph-mgr-a-c4d7f55c-nhlch/mgr/mgr/logs/current.log):
--------------------------------
log_channel(audit) log [DBG] : from='client.1450493 -' entity='client.csi-cephfs-node' cmd=[{"prefix": "fs subvolume getpath", "vol_name": "ocs-storagecluster-cephfilesystem", "sub_name": "csi-vol-bff09dcd-b1c2-11ea-aec4-0a580afe0d07", "group_name": "csi", "target": ["mgr", ""]}]: dispatch
log_channel(audit) log [DBG] : from='client.1450493 -' entity='client.csi-cephfs-node' cmd=[{"prefix": "fs subvolume getpath", "vol_name": "ocs-storagecluster-cephfilesystem", "sub_name": "csi-vol-bff09dcd-b1c2-11ea-aec4-0a580afe0d07", "group_name": "csi", "target": ["mgr", ""]}]: dispatch
2020-07-14T17:54:53.039570434Z debug 2020-07-14 17:54:53.038 7f9817b8c700 -1 mgr.server reply reply (22) Invalid argument
Traceback (most recent call last):
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in get_fs_handle
2020-07-14T17:54:53.039570434Z     conn.connect()
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 83, in connect
2020-07-14T17:54:53.039570434Z     self.fs.init()
2020-07-14T17:54:53.039570434Z   File "cephfs.pyx", line 663, in cephfs.LibCephFS.init
2020-07-14T17:54:53.039570434Z cephfs.Error: error calling ceph_init: Connection timed out [Errno 110]
2020-07-14T17:54:53.039570434Z
2020-07-14T17:54:53.039570434Z During handling of the above exception, another exception occurred:
2020-07-14T17:54:53.039570434Z
2020-07-14T17:54:53.039570434Z Traceback (most recent call last):
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/mgr_module.py", line 974, in _handle_command
2020-07-14T17:54:53.039570434Z     return self.handle_command(inbuf, cmd)
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/module.py", line 249, in handle_command
2020-07-14T17:54:53.039570434Z     return handler(inbuf, cmd)
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/module.py", line 316, in _cmd_fs_subvolume_getpath
2020-07-14T17:54:53.039570434Z     group_name=cmd.get('group_name', None))
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 205, in subvolume_getpath
2020-07-14T17:54:53.039570434Z     with open_volume(self, volname) as fs_handle:
2020-07-14T17:54:53.039570434Z   File "/lib64/python3.6/contextlib.py", line 81, in __enter__
2020-07-14T17:54:53.039570434Z     return next(self.gen)
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 283, in open_volume
2020-07-14T17:54:53.039570434Z     fs_handle = vc.connection_pool.get_fs_handle(volname)
2020-07-14T17:54:53.039570434Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in get_fs_handle
2020-07-14T17:54:53.039570434Z     raise VolumeException(-e.args[0], e.args[1])
2020-07-14T17:54:53.039570434Z TypeError: bad operand type for unary -: 'str'
2020-07-14T17:54:53.039570434Z
2020-07-14T17:54:53.043327448Z debug 2020-07-14 17:54:53.042 7f981838d700 0 log_channel(audit) log [DBG] : from='client.1450493 -' entity='client.csi-cephfs-node' cmd=[{"prefix": "fs subvolume getpath", "vol_name": "ocs-storagecluster-cephfilesystem", "sub_name": "csi-vol-bff09dcd-b1c2-11ea-aec4-0a580afe0d07", "group_name": "csi", "target": ["mgr", ""]}]: dispatch
------------------------------

@Jason, can you please take a look? @Prasanth, here too the rook-ceph logs are only from 07/16 and we have no information on ceph health. Please gather that along with the things Madhu had asked for.
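To confirm the hang is on the mgr side rather than in the CSI plugin, the same mgr command can be issued by hand from the rook toolbox. A minimal sketch, assuming the toolbox pod is deployed and labelled app=rook-ceph-tools (the names are copied from the logs above; the label and the use of `timeout` are assumptions):
--------------------------------
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)

# If the mgr is wedged, this hangs (hence the timeout) instead of printing the subvolume path
oc -n openshift-storage exec -it ${TOOLS} -- timeout 60 ceph fs subvolume getpath \
    ocs-storagecluster-cephfilesystem csi-vol-bff09dcd-b1c2-11ea-aec4-0a580afe0d07 csi
--------------------------------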
There is something fishy going on with this cluster. The mon endpoints are never expected to change, yet they appear to have changed. The clue is that the mon services were created after the creation of the cluster. The OSD logs show that they were started on July 7th with the original a,b,c endpoints. However, the services.yaml in the must-gather shows that the mon services were created at these timestamps:
- mon-a creationTimestamp: "2020-07-11T04:44:38Z" clusterIP: 172.30.55.194
- mon-b creationTimestamp: "2020-07-09T23:45:55Z" clusterIP: 172.30.81.162
- mon-h creationTimestamp: "2020-07-15T23:33:30Z" clusterIP: 172.30.10.122

The services for a,b clearly should have been created on July 7th or earlier when the cluster was created. The only explanation is that the services were deleted or otherwise tampered with and created again by Rook when found missing. But creating a new service will result in a new mon endpoint, which results in a cluster that stops working. The operator logs are only from today and do not show any history back to the date when the services would have been corrupted.

A workaround may be to delete these mon services and create them with the original clusterIP, update them in the configmap "rook-ceph-mon", and then restart the operator and the daemons.
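A rough sketch of that workaround, assuming oc access to the openshift-storage namespace (the configmap that carries the mon endpoints is assumed to be "rook-ceph-mon-endpoints" here, and the service manifest is copied from a surviving mon service rather than written from scratch):
--------------------------------
# Use a surviving mon service as a template, then adjust it for mon "a"
oc -n openshift-storage get svc rook-ceph-mon-h -o yaml > rook-ceph-mon-a-svc.yaml
# Edit rook-ceph-mon-a-svc.yaml: set metadata.name, labels and selector to mon "a",
# set spec.clusterIP to the original address, and drop uid/resourceVersion/creationTimestamp.
# Note: an explicit spec.clusterIP is only accepted if that address is still unused in the service CIDR.

oc -n openshift-storage delete svc rook-ceph-mon-a
oc -n openshift-storage create -f rook-ceph-mon-a-svc.yaml

# Point the mon endpoints configmap back at the original address
oc -n openshift-storage edit cm rook-ceph-mon-endpoints

# Restart the operator (and then the ceph daemon pods) so they reconcile against the restored endpoints
oc -n openshift-storage delete pod -l app=rook-ceph-operator
--------------------------------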
After looking again, I agree with Patrick that instead of reverting to the original mon endpoints we should keep the updated endpoints and update the clients to pick them up as well. The core ceph cluster surprisingly is looking healthy despite the endpoints being changed; I would have expected that changing a mon endpoint would result in that mon falling out of quorum. But the first priority is to keep core ceph healthy, which it is. Restarting the Ceph daemon pods that are having issues (is it just the mgr?) should get them to pick up the new mons.

For the clients, the CSI driver picks up the endpoints from the configmap "rook-ceph-csi-config", which all the CSI provisioner and plugin pods read to find the mon endpoints when mounting volumes. The must-gather does show that it correctly specifies the latest mon endpoints:

csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.55.194:6789","172.30.81.162:6789"]}]'

The CSI driver was designed to pick up the new mon endpoints dynamically, IIRC whenever the volume is mounted. @Shyam do we just need to restart the app pods to get them to pick up the new endpoints?
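A minimal sketch of those two checks/actions (the configmap name and data key are taken from the must-gather above; the mgr pod label is an assumption):
--------------------------------
# Confirm the CSI config already carries the current mon addresses
oc -n openshift-storage get cm rook-ceph-csi-config \
    -o jsonpath='{.data.csi-cluster-config-json}'; echo

# Restart the daemon(s) still holding the stale addresses -- here the mgr --
# so they come back up with the new mon endpoints
oc -n openshift-storage delete pod -l app=rook-ceph-mgr
--------------------------------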
(In reply to Travis Nielsen from comment #16)
> For the clients, the CSI driver picks up the endpoints from the configmap
> "rook-ceph-csi-config", which all the CSI provisioner and plugin pods read
> to find the mon endpoints when mounting volumes. The must-gather does show
> that it correctly specifies the latest mon endpoints:
>
> csi-cluster-config-json:
> '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.
> 55.194:6789","172.30.81.162:6789"]}]'
>
> The CSI driver was designed to pick up the new mon endpoints dynamically,
> IIRC whenever the volume is mounted.
> @Shyam do we just need to restart the app pods to get them to pick up the
> new endpoints?

The app pods, as in the "image-registry" in this case? I would say that is not required: CSI re-reads the MON endpoints from the config for each request, and as you note the config is already up to date. So the staging and mount will proceed once the stale request to the mgr returns (which it should as soon as the mgr is restarted). We will have to wait for the current exponential backoff to expire, which may take up to 5 minutes (from the logs in this case it looks like NodeStage should be retried every 2 minutes). This should keep the number of pods the customer has to restart lower.
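For reference, one way to watch the retry land after the mgr restart (the pod label and container name are assumptions based on a standard OCS/rook-ceph CSI deployment):
--------------------------------
# Find the cephfs plugin pod on the node where the volume is being staged
oc -n openshift-storage get pods -l app=csi-cephfsplugin -o wide

# Follow its logs; a successful retry shows NodeStageVolume completing instead of
# "an operation with the given Volume ID ... already exists"
oc -n openshift-storage logs <csi-cephfsplugin-pod-on-that-node> -c csi-cephfsplugin -f | grep -i nodestage
--------------------------------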
Thanks for the update, Prasanth.

@Shyam, what are the next steps here? Do we need to keep this open for any code changes? In any case, this doesn't look like a candidate for 4.5. Moving this to 4.6.
(In reply to Mudit Agarwal from comment #22)
> Thanks for the update, Prasanth.
>
> @Shyam, what are the next steps here? Do we need to keep this open for any
> code changes?

The mgr/volumes issue, where this command gets stuck due to incorrect exception handling, needs to be fixed; Patrick has filed a tracker for it: https://tracker.ceph.com/issues/46607

@travis should we consider alternatives to '--mon-host' for the MGR pod(s), such that MON updates are picked up automatically? (I am not sure about this, but asking anyway.)
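For context, a quick way to see how the mgr pod is currently being fed its mon addresses (the label and grep patterns are assumptions; this only inspects, it changes nothing):
--------------------------------
# Dump the mgr deployment and look for any mon address configuration passed via args or environment
oc -n openshift-storage get deploy -l app=rook-ceph-mgr -o yaml \
    | grep -iE 'mon[-_]host|mon_initial|6789'
--------------------------------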
(In reply to Shyamsundar from comment #25)
> (In reply to Mudit Agarwal from comment #22)
> > Thanks for the update, Prasanth.
> >
> > @Shyam, what are the next steps here? Do we need to keep this open for any
> > code changes?
>
> The mgr/volumes issue, where this command gets stuck due to incorrect
> exception handling, needs to be fixed; Patrick has filed a tracker for it:
> https://tracker.ceph.com/issues/46607
>
> @travis should we consider alternatives to '--mon-host' for the MGR pod(s),
> such that MON updates are picked up automatically? (I am not sure about
> this, but asking anyway.)

I think Jason already answered it, and I agree with his points. I don't see a change needed in Rook.
Prasanth, this can be closed. Nothing is pending from our side. See comments https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28
(In reply to Mudit Agarwal from comment #29)
> Prasanth, this can be closed. Nothing is pending from our side. See comments
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28

I think we'll want to validate that this is fixed for OCS. The issue should be reproduced internally and the fix verified. Ceph (RHCS) will have its own validation but that won't be sufficient for OCS, IMO.
(In reply to Patrick Donnelly from comment #30)
> (In reply to Mudit Agarwal from comment #29)
> > Prasanth, this can be closed. Nothing is pending from our side. See comments
> > https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and
> > https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28
>
> I think we'll want to validate that this is fixed for OCS. The issue should
> be reproduced internally and the fix verified. Ceph (RHCS) will have its own
> validation but that won't be sufficient for OCS, IMO.

Sure, let's keep it open then.
Reducing severity and priority based on comment #22
Please let me know the steps and validations needed to verify the bz. Thank you in advance
This is a ceph BZ, Kotresh might help here.
(In reply to akarsha from comment #38)
> Please let me know the steps and validations needed to verify the bz. Thank
> you in advance

The problem arises when the MON addresses change completely without any restart of the ceph-mgr pod. To reproduce and validate this, the test case would need to create new MONs and retire(?) the older MONs, such that the ceph cluster is running with newer MONs than the ones it was bootstrapped with. In this state the ceph-mgr, and more specifically the cephfs set of commands, should still be responsive; for example, creating a PVC backed by CephFS should work and not error out.
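One way to attempt this on a test cluster, as a rough sketch (the storageclass name, the failover timing, and the effectiveness of scaling a mon down are all assumptions; as the later comments show, getting the mon addresses to actually change is not guaranteed):
--------------------------------
# Take one mon down and leave it down past the operator's failover timeout so a
# replacement mon (new deployment, new service, new clusterIP) gets created
oc -n openshift-storage scale deploy rook-ceph-mon-a --replicas=0
# ...wait ~10 minutes, then check whether a new mon/service appeared...
oc -n openshift-storage get svc -l app=rook-ceph-mon

# With the mgr left untouched (still holding the old addresses), CephFS commands
# and PVC creation should still work
cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-mon-change-test
  namespace: openshift-storage
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
EOF
oc -n openshift-storage get pvc cephfs-mon-change-test -w
--------------------------------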
Shyam, please fill in the doc_text for this BZ.
(In reply to Mudit Agarwal from comment #42)
> Shyam, please fill in the doc_text for this BZ.

Updated the doc text.
There are no well-known steps to reproduce the issue. These are the steps tried to reproduce it, as suggested:

a. Scale down mon deployments
   No mon endpoints changed once the mon came back up and running.
b. Prolonged shutdown of worker nodes
   Not every time; observed one of the mon endpoints change (1/3) and was still able to create a cephfs PVC.
c. Add extra nodes (total 6 OCS nodes), then prolonged shutdown and drain of nodes
   No mon endpoints changed once the mon came back up and running, and was able to create a cephfs PVC successfully.
d. Etcd stability (restarting pods, scaling down, shutdown of a master node)
   No mon endpoints changed once the mon came back up and running, and was able to create a cephfs PVC successfully.
e. Certificate expiry for masters (as well as for OCS nodes)
   As suggested, kept the OCS nodes shut down for a day and didn't hit the issue.

One more test was performed as suggested by Travis: manually deleting the mon services does change the endpoints, but the cluster goes into a bad state; raised a separate #bz1897029 for it as discussed.

All the above steps were performed on OCS 4.6, 4.5 and 4.4, and the same behavior was observed.

Versions:
OCP: 4.4.29                              OCS: ocs-operator.v4.4.2
OCP: 4.5.16                              OCS: ocs-operator.v4.5.1
OCP: 4.6.0-0.nightly-2020-11-07-035509   OCS: ocs-operator.v4.6.0-156.ci

Since we don't have the exact steps to reproduce the issue, the ceph bz has been verified, and all the tier1 and tier4 tests are expected to pass on the same build, then based on the sanity tests, if there are no objections, can we move the bz to the verified state? If anyone knows other ways to reproduce the issue, please let me know.
Here's how I'd try:
1. Disconnect the MGR node from the rest of the cluster (specifically, from the nodes running the MONs).
2. Restart one of the nodes running one of the MONs - my assumption is that the MON will re-launch on a different node.
3. Repeat for the other nodes running MONs - so now they all run elsewhere.
4. Reconnect the MGR node - it should now have an outdated list of MON addresses.
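A very rough sketch of those steps (node names are placeholders; the network "disconnect" is approximated here by cordoning the MGR node, which is an assumption and may not fully isolate it):
--------------------------------
# 1. Keep new workloads off the mgr node while the mons move (a true network
#    disconnect would need firewalling at the node level)
oc adm cordon <mgr-node>

# 2./3. Restart each node currently hosting a mon, one at a time
oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide
oc debug node/<mon-node> -- chroot /host systemctl reboot

# 4. Bring the mgr node back and compare what the mons look like now
oc adm uncordon <mgr-node>
oc -n openshift-storage get svc -l app=rook-ceph-mon
--------------------------------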
(In reply to Yaniv Kaul from comment #46)

Thanks Yaniv for the suggestion. We already tried the scenario of restarting the nodes hosting the mons, and it didn't change the mon endpoints; neither did a prolonged shutdown of those nodes.

Here is what I tried:
OCP: 4.6.0-0.nightly-2020-11-18-154058
OCS: 4.6.0-160.ci

> Here's how I'd try:
> 1. Disconnect the MGR node from the rest of the cluster (specifically, from
> the nodes running the MONs).

The mgr pod is running on a different node, say compute-3 (the mons are not hosted on the same node).

> 2. Restart one of the nodes running one of the MONs - my assumption is that
> the MON will re-launch on a different node.
> 3. Repeat for the other nodes running MONs - so now they all run elsewhere.

Noted the mon endpoints before the restart and then restarted, one by one, the nodes where the mons were running. Once all the mons were back up, checked the endpoints and they remained the same. Observed similar behavior with a prolonged shutdown of the nodes hosting the mons; the mon endpoints remained the same.

> 4. Reconnect the MGR node - it should now have an outdated list of MON
> addresses.
Again tried the second method given in comment#48, with the same observation: when the MONs had new addresses and the mgr was still using the old MON addresses, cephfs PVC creation succeeded and PVC mount requests also succeeded, with no errors seen in the mgr logs. After the restart of the mgr pod, it picked up the new mon endpoints. Based on this observation, moving the bz to verified.

Logs here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1858195/bz/
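A rough sketch of sanity checks for that last step (not the exact commands used during verification; the toolbox label is an assumption):
--------------------------------
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n1)

# Mon addresses as ceph itself reports them after the changes
oc -n openshift-storage exec -it ${TOOLS} -- ceph mon dump

# Current mon service clusterIPs, to compare against what the restarted mgr is using
oc -n openshift-storage get svc -l app=rook-ceph-mon
--------------------------------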
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605