Bug 2168892
| Summary: | OSD and mon scheduling errors during drain of node in one zone, 1 OSD out of 5 and MON stayed in pending state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Neha Berry <nberry> |
| Component: | rook | Assignee: | Parth Arora <paarora> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.10 | CC: | jijoy, lgangava, ocs-bugs, odf-bz-bot, paarora, sapillai, tnielsen |
| Target Milestone: | --- | Flags: | nberry: needinfo? (sapillai), tnielsen: needinfo? (nberry) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2169267 (view as bug list) | Environment: | |
| Last Closed: | 2023-03-14 15:10:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2169267 | | |
Parth,

- The only ask is: if an OSD is not in a Running state, will that affect mon failover?
- The scenario I observed consistently is that when an OSD is in Pending state, mon failover took more than 20 min, which is longer than usual.

Thanks,
Leela.

(In reply to Leela Venkaiah Gangavarapu from comment #6)
> Parth,
>
> - The only ask is: if an OSD is not in a Running state, will that affect mon failover?

No. Rook achieves mon quorum first before starting the OSDs, so an OSD being down should not affect mon failover.

> - The scenario I observed consistently is that when an OSD is in Pending state, mon failover took more than 20 min, which is longer than usual.

The after-drain logs attached in comment 4 are not sufficient. They may not be the latest logs, and they do not provide much data on why the mon took more time. The last message around mon failover was `op-mon: mon "a" not found in quorum, waiting for timeout (504 seconds left) before failover`, and nothing after that. So we need more up-to-date logs to debug further.

> Thanks,
> Leela.

Hi Leela. We discussed today verifying the Ceph versions in the CephCluster CR to ensure that Rook has performed the upgrade successfully. After the daemons have upgraded and Ceph health is OK, we can start the drain. This will avoid getting stuck in the `ok-to-stop` loop. Let me know if my observation from our conversation above is correct. I'm up for a call whenever possible.

What's the conclusion from the discussion? Is there a fix needed?

Please reopen if more discussion is needed.
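As a rough illustration of the pre-drain check discussed above, here is a minimal sketch. The CephCluster status field paths, the `openshift-storage` namespace, and the `rook-ceph-tools` label are assumptions based on a default ODF deployment and may differ:

```sh
# Check the Ceph version and health that Rook reports in the CephCluster CR
# (status field paths assumed from the Rook CephCluster CRD; adjust if they differ).
oc get cephcluster -n openshift-storage \
  -o jsonpath='{.items[0].status.version.version}{"\n"}{.items[0].status.ceph.health}{"\n"}'

# Confirm every Ceph daemon reports the same version and the cluster is healthy,
# via the toolbox pod (assumes a rook-ceph-tools deployment is present).
TOOLS=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
oc exec -n openshift-storage "$TOOLS" -- ceph versions
oc exec -n openshift-storage "$TOOLS" -- ceph health detail
```

Only once `ceph versions` shows a single version for all daemons and health is OK would the drain begin, which is what should keep Rook out of the `ok-to-stop` wait loop.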
Description of problem:
==============================
On a managed service cluster with v2.0.10, 5 OSDs and 1 MON were present in each zone. The instance type was m5.4xlarge, with HostNetwork=true.

To migrate the cluster on stage from v2.0.10 to v2.0.11 (with changes for odfms-55), the following steps were performed:

a) Merge the (ocs-{provider,consumer}) addon manifest MR. While the MR was being merged, we created a new machineconfigpool with m5.2xlarge:
   i) An SRE-P member created a new pool (temp-pool) of 6 x m5.2xl (referring to the existing default pool).
   ii) Update the endpoint at all consumers attached to the provider.
b) Cordon all old nodes on the provider, drain the first node, wait for the cluster to be healthy, and repeat for the other nodes.

Node drained: ip-10-0-13-78.us-east-2.compute.internal
Approximate time = 2023-02-10 10:30:24.222893

2023-02-10 10:30:13.691113 I | op-mon: checking for basic quorum with existing mons
2023-02-10 10:30:13.691119 I | op-mon: setting mon endpoints for hostnetwork mode
2023-02-10 10:30:13.691126 I | op-mon: setting mon endpoints for hostnetwork mode
2023-02-10 10:30:13.691129 I | op-mon: setting mon endpoints for hostnetwork mode

Observation:
=================
>> Out of the 5 OSDs in us-east-2a, 1 OSD stayed down for hours (it is still down):

$ oc get pods -o wide|grep Pending
rook-ceph-mon-a-6d6994c977-xd4wm   0/2   Pending   0   14m     <none>        <none>                                      <none>   <none>
rook-ceph-osd-4-65c6f857c-64k45    0/2   Pending   0   13m     <none>        <none>                                      <none>   <none>

$ oc get pods -o wide -n openshift-storage|grep mon
rook-ceph-mon-a-6d6994c977-xd4wm   0/2   Pending   0   8m23s   <none>        <none>                                      <none>   <none>
rook-ceph-mon-b-8559bfdb6f-kbg9t   2/2   Running   0   18m     10.0.16.217   ip-10-0-16-217.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-7947f768fb-757m6   2/2   Running   0   18m     10.0.23.83    ip-10-0-23-83.us-east-2.compute.internal    <none>   <none>

ceph osd tree (us-east-2a branch):
ID   CLASS  WEIGHT    TYPE NAME                          STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default
 -5         60.00000      region us-east-2
-12         20.00000          zone us-east-2a
-21          4.00000              host default-0-data-1rf469
  3    ssd   4.00000                  osd.3              up      1.00000   1.00000
-17          4.00000              host default-0-data-24v9kv
  5    ssd   4.00000                  osd.5              up      1.00000   1.00000
-11          4.00000              host default-1-data-17tsq4
  4    ssd   4.00000                  osd.4              down    1.00000   1.00000
-19          4.00000              host default-1-data-45xk9k
  6    ssd   4.00000                  osd.6              up      1.00000   1.00000
-23          4.00000              host default-2-data-2vltnr
  7    ssd   4.00000                  osd.7              up      1.00000   1.00000

$ oc describe pod rook-ceph-osd-4-65c6f857c-64k45 -n openshift-storage
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  35m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  35m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  31m   default-scheduler  0/15 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector.
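To understand the FailedScheduling messages above, one might inspect the scheduling constraints Rook placed on the OSD deployment and which workers in the zone are actually schedulable. This is only a hypothetical sketch: the deployment name is inferred from the pod above, and the zone/worker labels are assumed to be the standard OpenShift ones.

```sh
# Topology spread constraints and node affinity on the osd.4 deployment
oc get deployment rook-ceph-osd-4 -n openshift-storage \
  -o jsonpath='{.spec.template.spec.topologySpreadConstraints}{"\n"}'
oc get deployment rook-ceph-osd-4 -n openshift-storage \
  -o jsonpath='{.spec.template.spec.affinity.nodeAffinity}{"\n"}'

# Worker nodes in us-east-2a and their schedulability
# (cordoned nodes show SchedulingDisabled in the STATUS column)
oc get nodes -o wide \
  -l 'topology.kubernetes.io/zone=us-east-2a,node-role.kubernetes.io/worker='
```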
The same pod, described again later, still showed the scheduling failures:

$ oc describe pod rook-ceph-osd-4-65c6f857c-64k45 -n openshift-storage
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  24m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  19m   default-scheduler  0/15 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector.

>> Since mon-a is tied to a node (hostnetwork=true), it also stayed in Pending state for more than 70 minutes, after which mon-d was created.

Version-Release number of selected component (if applicable):
OCP = 4.10.50
Before deployer upgrade = 2.0.10
After upgrade = 2.0.11-7
ODF version before upgrade = 4.10.5
ODF after deployer upgrade = 4.10.9

How reproducible:
======================
We have already seen the following issues with 2.0.11-4 (pre-RC build):
3. Bug 2167045 - All OSDs were not created in MS provider cluster of size 20.
4. Bug 2167347 - OSDs marked as down, not equally distributed in ceph osd tree output of a size 20 cluster.

Steps to Reproduce:
=========================
1. Merge the (ocs-{provider,consumer}) addon manifest MR.
2. SRE-P needs to check whether the consumer is updated or not.
3. Reach out to SRE-P for creation of a new pool (temp-pool) of 6 x m5.2xl (referring to the existing default pool).
4. Update the endpoint at all consumers attached to the provider.
5. Cordon all nodes on the provider, drain the first node, wait for the cluster to be healthy, and repeat for the other nodes.
6. Verify the cluster is healthy and delete the default pool (we did not reach this step).

Actual results:
=======================
osd.4 stayed down in us-east-2a, out of the 5 OSDs that were in us-east-2a.
mon-a also stayed down for 70+ minutes before mon-d came up.

Expected results:
=======================
After the drain of a node, the 5 OSD and 1 MON pods should have come up on the 2 new m5.2xlarge worker nodes in the same zone.
Instead, only 4 OSDs recovered, and the mon could not come up for more than 60 minutes (as portable is false).
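For context on why mon-a stayed Pending rather than failing over sooner when portable is false, here is a minimal sketch of what one could inspect. The resource names and keys (`rook-ceph-mon-endpoints`, `ROOK_MON_OUT_TIMEOUT` in `rook-ceph-operator-config`) are assumptions based on a default Rook/ODF deployment and may differ by version:

```sh
# With portable=false and hostNetwork=true, the mon deployment is pinned to one
# node via a nodeSelector, so it stays Pending while that node is unschedulable.
oc get deployment rook-ceph-mon-a -n openshift-storage \
  -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'

# The mon-to-node assignment Rook tracks (the "mapping" key is assumed).
oc get configmap rook-ceph-mon-endpoints -n openshift-storage \
  -o jsonpath='{.data.mapping}{"\n"}'

# How long Rook waits before failing over an out-of-quorum mon; the
# "504 seconds left" message quoted earlier is consistent with a ~600s timeout.
# Empty output would mean the setting is not overridden in the configmap.
oc get configmap rook-ceph-operator-config -n openshift-storage \
  -o jsonpath='{.data.ROOK_MON_OUT_TIMEOUT}{"\n"}'
```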