Bug 1858195
| Summary: | [GSS] registry pod stuck in ContainerCreating due to pvc from cephfs storage class fail to mount | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Prashant Dhange <pdhange> |
| Component: | csi-driver | Assignee: | Mudit Agarwal <muagarwa> |
| Status: | CLOSED ERRATA | QA Contact: | akarsha <akrai> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4 | CC: | akrai, edonnell, hchiramm, jdillama, jwang, khiremat, madam, mrajanna, muagarwa, nberry, oarribas, ocs-bugs, pdonnell, ratamir, r.martinez, srangana, suchaudh, tdesala, tnielsen, wwurzbac |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.6.0 | Flags: | khiremat: needinfo-, khiremat: needinfo- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | .`ceph-mgr` no longer causes errors during requests: Previously, certain `ceph-mgr` modules (`fs`) always connected to the MONs that were passed in as part of the initial `ceph-mgr` pod creation. Therefore, when the MON endpoints were changed, these modules failed to connect to the Red Hat Ceph Storage cluster for various requests, such as provisioning and staging CephFS volumes, causing errors. With this update, `ceph-mgr` has been fixed to keep its MON endpoints updated as they change, and to not rely only on the initial MON addresses that are passed during pod creation, so `ceph-mgr` operations continue to work as expected. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-17 06:23:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1858891 | | |
| Bug Blocks: | 1882359 | | |
Comment 6
Mudit Agarwal
2020-07-17 14:03:04 UTC
There is something fishy going on with this cluster. The mon endpoints are never expected to change, yet they appear to have changed. The clue is that the mon services were created after the creation of the cluster. The OSD logs show that they were started on July 7th with the original a, b, c endpoints. However, the services.yaml in the must-gather shows that the mon services were created at these timestamps:

- mon-a: creationTimestamp "2020-07-11T04:44:38Z", clusterIP 172.30.55.194
- mon-b: creationTimestamp "2020-07-09T23:45:55Z", clusterIP 172.30.81.162
- mon-h: creationTimestamp "2020-07-15T23:33:30Z", clusterIP 172.30.10.122

The services for a and b clearly should have been created on July 7th or earlier, when the cluster was created. The only explanation is that the services were deleted or otherwise tampered with and were created again by Rook when found missing. But creating a new service results in a new mon endpoint, which results in a cluster that stops working. The operator logs are only from today and do not show any history back to the date when the services would have been corrupted.

A workaround may be to delete these mon services, re-create them with the original clusterIPs, update them in the configmap "rook-ceph-mon", and then restart the operator and the daemons.

After looking again, I agree with Patrick that instead of reverting to the original mon endpoints we should keep the updated endpoints and update the clients to also pick them up. The core ceph cluster surprisingly looks healthy despite the endpoints being changed; I would have expected that changing a mon endpoint would cause that mon to fall out of quorum. But the first priority is to keep core ceph healthy, which it is.
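On a live cluster (rather than from a must-gather), the same check can be repeated by comparing the mon services' creation timestamps and cluster IPs with what the operator is tracking. A rough sketch, assuming the default openshift-storage namespace and the standard Rook mon label; the configmap name is the one referenced above and may differ between Rook versions:

```
# Mon services with their creation time and ClusterIP.
oc -n openshift-storage get svc -l app=rook-ceph-mon \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,CLUSTER-IP:.spec.clusterIP

# Endpoints currently recorded by the operator, for comparison.
oc -n openshift-storage get cm rook-ceph-mon -o yaml
```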
Restarting the Ceph daemon pods that are having issues (is it just the mgr?) should get them to pick up the new mons and resolve their issues.
For the clients, the CSI driver reads the endpoints from the configmap "rook-ceph-csi-config"; all the CSI provisioner and plugin pods consult it to find the mon endpoints when mounting volumes. The must-gather does show that it correctly specifies the latest mon endpoints:
csi-cluster-config-json: '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.55.194:6789","172.30.81.162:6789"]}]'
The CSI driver was designed to pick up the new mon endpoints dynamically, IIRC whenever the volume is mounted.
@Shyam do we just need to restart the app pods to get them to pick up the new endpoints?
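For reference, one way to confirm which mon list the CSI sidecars will use for new requests is to read that configmap directly. A minimal sketch, assuming the default openshift-storage namespace; the configmap and key names are taken from the must-gather output above:

```
# Print the mon list CSI consults for provision/stage requests.
oc -n openshift-storage get cm rook-ceph-csi-config \
  -o jsonpath='{.data.csi-cluster-config-json}{"\n"}'
```

If the monitors listed there already match the current mon services, only components that cached the old addresses at startup (the mgr, in this bug) still need a restart.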
(In reply to Travis Nielsen from comment #16)
> For the clients, the CSI driver reads the endpoints from the configmap
> "rook-ceph-csi-config"; all the CSI provisioner and plugin pods consult it
> to find the mon endpoints when mounting volumes. The must-gather does show
> that it correctly specifies the latest mon endpoints:
>
> csi-cluster-config-json:
> '[{"clusterID":"openshift-storage","monitors":["172.30.10.122:6789","172.30.55.194:6789","172.30.81.162:6789"]}]'
>
> The CSI driver was designed to pick up the new mon endpoints dynamically,
> IIRC whenever the volume is mounted.
> @Shyam do we just need to restart the app pods to get them to pick up the
> new endpoints?

The app pods, as in the "image-registry" in this case? I would say that is not required: CSI re-reads the MON endpoints from the config for each request, and as you note it is already up to date on this. So the staging and mount will proceed once the stale request from the mgr returns (which it should as soon as the mgr is restarted). We will have to wait for the current exponential timeout to expire, which may take up to 5 minutes (from the logs in this case it looks like NodeStage should be retried every 2 minutes). The above should keep the number of "pods to restart" lower for the customer.

Thanks for the update Prasanth.

@Shyam, what are the next steps here? Do we need to keep this open for any code changes? In any case, this doesn't look like a candidate for 4.5. Moving this to 4.6.

(In reply to Mudit Agarwal from comment #22)
> @Shyam, what are the next steps here? Do we need to keep this open for any
> code changes?

The mgr/volumes tracker, to ensure this command does not get stuck due to incorrect exception handling, needs to be fixed. Patrick has added the tracker for the same: https://tracker.ceph.com/issues/46607

@travis should we consider alternatives to '--mon-host' for the MGR pod(s), such that MON updates are picked up automatically? (I am not sure about this, but asking anyway)

(In reply to Shyamsundar from comment #25)
> @travis should we consider alternatives to '--mon-host' for the MGR pod(s),
> such that MON updates are picked up automatically? (I am not sure about
> this, but asking anyway)

I think Jason already answered it; agreed on his points. I don't see a change needed in Rook.
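As a side note on the '--mon-host' question, the addresses the mgr was started with can be read from its pod spec and compared with the mon service listing shown earlier. A sketch, assuming the openshift-storage namespace and the standard app=rook-ceph-mgr label; exactly where the address list appears (a flag versus an environment variable) depends on the Rook version:

```
# Dump the mgr pod spec and pick out how the MON address list was injected
# at pod creation time.
oc -n openshift-storage get pod -l app=rook-ceph-mgr -o yaml \
  | grep -iE 'mon[-_]host|mon[-_]endpoints'
```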
Prasanth, this can be closed. Nothing is pending from our side. See comments https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28

(In reply to Mudit Agarwal from comment #29)
> Prasanth, this can be closed. Nothing is pending from our side. See comments
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c27 and
> https://bugzilla.redhat.com/show_bug.cgi?id=1858195#c28

I think we'll want to validate that this is fixed for OCS. The issue should be reproduced internally and the fix verified. Ceph (RHCS) will have its own validation but that won't be sufficient for OCS, IMO.

(In reply to Patrick Donnelly from comment #30)
> I think we'll want to validate that this is fixed for OCS. The issue should
> be reproduced internally and the fix verified. Ceph (RHCS) will have its own
> validation but that won't be sufficient for OCS, IMO.

Sure, let's keep it open then.

Reducing severity and priority based on comment #22.

Please let me know the steps and validations needed to verify the bz. Thank you in advance.

This is a ceph BZ, Kotresh might help here.

(In reply to akarsha from comment #38)
> Please let me know the steps and validations needed to verify the bz. Thank
> you in advance.

The problem arises due to the MON addresses changing completely, without any restart of the ceph-mgr pod. To reproduce and validate this, the test case would need to create new MONs and retire(?) the older MONs, such that the ceph cluster is running with newer MONs than the ones it was bootstrapped with. In this state the ceph-mgr, and more specifically the cephfs set of commands, should still be responsive; for example, creating a PVC backed by CephFS should work and not error out.
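That check (a CephFS-backed PVC should provision and bind even though the MONs have moved) can be run directly. A minimal sketch; the storage class name ocs-storagecluster-cephfs and the throwaway project name are assumptions, so adjust them to what `oc get sc` reports on the cluster:

```
# A throwaway project for the check (hypothetical name).
oc new-project cephfs-pvc-check

# Create a small CephFS-backed PVC; per the validation described above, on a
# fixed build this should reach Bound even though the MON endpoints changed.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-mon-move-check
  namespace: cephfs-pvc-check
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
EOF

# Watch until the claim reports Bound (Ctrl-C to stop watching).
oc -n cephfs-pvc-check get pvc cephfs-mon-move-check -w
```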
Shyam, please fill the doc_text for this BZ.

(In reply to Mudit Agarwal from comment #42)
> Shyam, please fill the doc_text for this BZ.

Updated the doc text.

There are no well-known steps to reproduce the issue. These are the steps tried to reproduce it, as suggested:

a. Scale down mon deployments: no mon endpoints changed once the mons came back up and running.
b. Prolonged shutdown of worker nodes: not every time; observed one of the mon endpoints change (1/3) and was able to create a cephfs pvc.
c. Add extra nodes (total 6 OCS nodes), then prolonged shutdown and drain of nodes: no mon endpoints changed once the mons came back up and running, and was able to create a cephfs pvc successfully.
d. Etcd stability (restarting pods, scaling down, shutdown of a master node): no mon endpoints changed once the mons came back up and running, and was able to create a cephfs pvc successfully.
e. Certificate expiry for masters (as well as for OCS nodes): as suggested, kept the OCS nodes shut down for a day and didn't hit the issue.

One more test was performed as suggested by Travis: manually deleting the mon services successfully changes the endpoints, but the cluster goes into a bad state; raised a separate bz #1897029 for it as discussed.

All of the above steps were performed on OCS 4.6, 4.5 and 4.4, with the same behavior observed. Versions:

OCP: 4.4.29, OCS: ocs-operator.v4.4.2
OCP: 4.5.16, OCS: ocs-operator.v4.5.1
OCP: 4.6.0-0.nightly-2020-11-07-035509, OCS: ocs-operator.v4.6.0-156.ci

Since we don't have exact steps to reproduce the issue, the ceph bz has been verified, and all the tier1 and tier4 tests are expected to pass on the same build, can we move the bz to the verified state based on the sanity tests if there are no objections? If anyone knows other ways to reproduce the issue, please let me know.

Here's how I'd try:
1. Disconnect the MGR node from the rest of the cluster (specifically, from nodes running the MONs).
2. Restart one of the nodes running one of the MONs - my assumption is that the MON will re-launch on a different node.
3. Repeat for the other nodes running MONs - so now they all run elsewhere.
4. Reconnect the MGR node - it should now have an outdated list of MON addresses.

(In reply to Yaniv Kaul from comment #46)

Thanks Yaniv for the suggestion. We tried this scenario of restarting the nodes where the mons are hosted and it didn't change the mon endpoints. Neither did prolonged shutdown of the nodes.

Here is what I tried:

OCP: 4.6.0-0.nightly-2020-11-18-154058
OCS: 4.6.0-160.ci

> 1. Disconnect the MGR node from the rest of the cluster (specifically, from
> nodes running the MONs).

The mgr pod is running on a different node, say compute-3 (the mons are not hosted on the same node).

> 2. Restart one of the nodes running one of the MONs - my assumption is that
> the MON will re-launch on a different node.
> 3. Repeat for the other nodes running MONs - so now they all run elsewhere.

Noted the mon endpoints before the restarts and restarted the nodes where the mons are running one by one. Once all mons were up, checked the endpoints and they remained the same. Observed similar behavior with prolonged shutdown of the nodes hosting mons: the mon endpoints remained the same.

> 4. Reconnect the MGR node - it should now have an outdated list of MON
> addresses.

Again tried the second method given in comment #48, with the same observation: when the MONs had new addresses and the mgr was still using the old MON addresses, cephfs pvc creation succeeded and pvc mount requests also succeeded, with no errors seen in the mgr logs. After the restart of the mgr pod, it picked up the new mon endpoints. Based on these observations, moving the bz to verified. Logs here: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-1858195/bz/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605