Bug 2168892

Summary: OSD and MON scheduling errors during drain of a node in one zone; 1 of 5 OSDs and the MON stayed in Pending state
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Neha Berry <nberry>
Component: rook
Assignee: Parth Arora <paarora>
Status: CLOSED NOTABUG
QA Contact: Neha Berry <nberry>
Severity: urgent
Priority: unspecified
Version: 4.10
CC: jijoy, lgangava, ocs-bugs, odf-bz-bot, paarora, sapillai, tnielsen
Target Milestone: ---
Flags: nberry: needinfo? (sapillai), tnielsen: needinfo? (nberry)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2169267
Environment:
Last Closed: 2023-03-14 15:10:43 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 2169267    

Description Neha Berry 2023-02-10 12:34:40 UTC
Description of problem:
==============================

On a managed service cluster running deployer v2.0.10, each zone had 5 OSDs and 1 MON. The instance type was m5.4xlarge, with hostNetwork: true.

To migrate the cluster on the stage environment from v2.0.10 to v2.0.11 (with changes for odfms-55), the following steps were performed:

a) Merge the (ocs-{provider,consumer}) addon manifest MR
   While the MR was being merged, we created a new machineconfigpool with m5.2xlarge instances:
   i)  An SRE-P member created the new pool (temp-pool) of 6 x m5.2xlarge nodes (referencing the existing default pool)
   ii) The endpoint was updated at all consumers attached to the provider
b) Cordon all old nodes on the provider, drain the first node, wait for the cluster to be healthy, and repeat for the other nodes (rough commands sketched below)
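
For reference, a rough sketch of the cordon/drain loop for step b (node names and drain flags are illustrative, not necessarily the exact ones SRE-P used):

# Cordon every old m5.4xlarge worker first (node list is illustrative)
$ for node in ip-10-0-13-78.us-east-2.compute.internal <other-old-nodes>; do
      oc adm cordon "$node"
  done

# Drain one node at a time, then wait for ceph HEALTH_OK before the next
$ oc adm drain ip-10-0-13-78.us-east-2.compute.internal \
      --ignore-daemonsets --delete-emptydir-data --force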


Node drained : ip-10-0-13-78.us-east-2.compute.internal

Approximate time  = 2023-02-10 10:30:24.222893

2023-02-10 10:30:13.691113 I | op-mon: checking for basic quorum with existing mons
2023-02-10 10:30:13.691119 I | op-mon: setting mon endpoints for hostnetwork mode
2023-02-10 10:30:13.691126 I | op-mon: setting mon endpoints for hostnetwork mode
2023-02-10 10:30:13.691129 I | op-mon: setting mon endpoints for hostnetwork mode


Observation:
=================
>> Out of the 5 OSDs in us-east-2a, 1 OSD stayed down for hours (it is still down), and the following pods remained Pending:

$ oc get pods -o wide|grep Pending
rook-ceph-mon-a-6d6994c977-xd4wm                                  0/2     Pending     0          14m   <none>        <none>                                      <none>           <none>

rook-ceph-osd-4-65c6f857c-64k45                                   0/2     Pending     0          13m   <none>        <none>                                      <none>           <none>



$ oc get pods -o wide -n openshift-storage|grep mon
rook-ceph-mon-a-6d6994c977-xd4wm                                  0/2     Pending     0          8m23s   <none>        <none>                                      <none>           <none>
rook-ceph-mon-b-8559bfdb6f-kbg9t                                  2/2     Running     0          18m     10.0.16.217   ip-10-0-16-217.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-7947f768fb-757m6                                  2/2     Running     0          18m     10.0.23.83    ip-10-0-23-83.us-east-2.compute.internal    <none>           <none>


ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default                                                     
 -5         60.00000      region us-east-2                                             
-12         20.00000          zone us-east-2a                                          
-21          4.00000              host default-0-data-1rf469                           
  3    ssd   4.00000                  osd.3                       up   1.00000  1.00000
-17          4.00000              host default-0-data-24v9kv                           
  5    ssd   4.00000                  osd.5                       up   1.00000  1.00000
-11          4.00000              host default-1-data-17tsq4                           
  4    ssd   4.00000                  osd.4                     down   1.00000  1.00000
-19          4.00000              host default-1-data-45xk9k                           
  6    ssd   4.00000                  osd.6                       up   1.00000  1.00000
-23          4.00000              host default-2-data-2vltnr                           
  7    ssd   4.00000                  osd.7                       up   1.00000  1.00000
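
osd.4 above maps to the Pending rook-ceph-osd-4 deployment. A quick way (an assumed check, not taken from the must-gather) to see which node/topology that deployment is pinned to:

# Deployment name assumed to follow the usual rook-ceph-osd-<id> pattern
$ oc get deployment rook-ceph-osd-4 -n openshift-storage -o yaml \
      | grep -A20 'nodeSelector\|affinity\|topologySpreadConstraints'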


$ oc describe pod rook-ceph-osd-4-65c6f857c-64k45 -n openshift-storage
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  35m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  35m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  31m   default-scheduler  0/15 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector.

$ oc describe pod rook-ceph-osd-4-65c6f857c-64k45 -n openshift-storage
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  24m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  24m   default-scheduler  0/15 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 6 node(s) didn't match Pod's node affinity/selector.
  Warning  FailedScheduling  19m   default-scheduler  0/15 nodes are available: 2 node(s) didn't match pod topology spread constraints, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) were unschedulable, 4 node(s) didn't match Pod's node affinity/selector.
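
To correlate the scheduler message with the new temp-pool nodes, a rough check of zone labels and taints on the candidate workers (label keys are the standard well-known ones; adjust if the MS setup uses others):

# Workers in the affected zone, using the well-known topology label
$ oc get nodes -l topology.kubernetes.io/zone=us-east-2a -o wide

# Taints on worker nodes, to spot anything the OSD pod does not tolerate
$ oc describe nodes -l node-role.kubernetes.io/worker | grep -A3 '^Taints'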


>> Since mon-a is tied to a specific node (hostNetwork: true), it also stayed in Pending state for more than 70 minutes, after which mon-d was created (see the check sketched below).
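
Because portable is false here, each mon deployment is pinned to its node via a nodeSelector. A sketch (field path assumed from the usual Rook-generated deployment) to confirm which node mon-a is tied to:

# Prints the hostname nodeSelector the mon-a pod is constrained to (assumed layout)
$ oc get deployment rook-ceph-mon-a -n openshift-storage \
      -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'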



Version-Release number of selected component (if applicable):
OCP = 4.10.50
Before deployer upgrade = 2.0.10
After upgrade = 2.0.11-7
ODF version before upgrade  = 4.10.5
ODF after deployer upgrade  = 4.10.9

How reproducible:
======================
We have already seen the following issues with 2.0.11-4 (a pre-RC build):

3. Bug 2167045 - All OSDs were not created in MS provider cluster of size 20.
4. Bug 2167347 - OSDs marked as down, not equally distributed in ceph osd tree output of a size 20 cluster.

Steps to Reproduce:
=========================
1. Merge the (ocs-{provider,consumer}) addon manifest MR
2. SRE-P needs to check whether the consumer has been updated
3. Reach out to SRE-P to create the new pool (temp-pool) of 6 x m5.2xlarge nodes (referencing the existing default pool)
4. Update the endpoint at all consumers attached to the provider
5. Cordon all nodes on the provider, drain the first node, wait for the cluster to be healthy, and repeat for the other nodes
6. Verify the cluster is healthy and delete the default pool (we did not reach this step; a rough health check is sketched below)
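
For step 6, a rough health check, assuming the rook-ceph-tools pod is deployed (pod label and commands are the usual ones; adjust if the MS setup differs):

# Locate the toolbox pod and check overall ceph health and OSD layout
$ TOOLS=$(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
$ oc rsh -n openshift-storage "$TOOLS" ceph status
$ oc rsh -n openshift-storage "$TOOLS" ceph osd tree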

Actual results:
=======================
Out of the 5 OSDs in us-east-2a, osd.4 stayed down.
mon-a also stayed down for 70+ minutes before mon-d came up.


Expected results:
=======================
After the node drain, the 5 OSD pods and 1 MON pod should have come up on the 2 new m5.2xlarge worker nodes in the same zone.

Instead, only 4 OSDs recovered, and the mon could not come up for more than 60 minutes (as portable is false).

Comment 6 Leela Venkaiah Gangavarapu 2023-02-13 07:11:36 UTC
Parth,

- The only question is: if an OSD is not in Running state, will that affect mon failover?
- The scenario I observed consistently is that when an OSD is in Pending state, mon failover took more than the usual 20 minutes.

Thanks,
Leela.

Comment 8 Santosh Pillai 2023-02-13 10:07:59 UTC
(In reply to Leela Venkaiah Gangavarapu from comment #6)
> Parth,
> 
> - The only question is: if an OSD is not in Running state, will that affect mon
> failover?

No. Rook achieves mon quorum first, before starting the OSDs, so an OSD being down should not affect mon failover.

> - The scenario I observed consistently is that when an OSD is in Pending state, mon
> failover took more than the usual 20 minutes.


The after-drain logs attached in comment 4 are not sufficient. They may not be the latest logs, and they do not provide much data on why the mon took more time.
The last message around mon failover was `op-mon: mon "a" not found in quorum, waiting for timeout (504 seconds left) before failover`, with nothing after that, so we might need more up-to-date logs to debug further (a collection sketch follows below).

> Thanks,
> Leela.
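
To capture more up-to-date data for the failover window, a sketch of the log collection that would help (standard rook operator deployment name assumed):

# Operator mon-health/failover messages during and after the drain
$ oc logs deployment/rook-ceph-operator -n openshift-storage --since=2h | grep op-mon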

Comment 13 Santosh Pillai 2023-02-14 15:22:35 UTC
Hi Leela. We discussed today verifying the ceph versions in the CephCluster CR to ensure that Rook has performed the upgrade successfully. Once the daemons have upgraded and ceph health is OK, the drain can be started. This will avoid getting stuck in the `ok-to-stop` loop (a rough check is sketched below).

Let me know if my above observation from our conversation is correct.
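
A rough way to do that check before starting the drain (assuming the standard rook-ceph-tools deployment is present):

# CephCluster CR summary; columns (including health and ceph version) vary by Rook version
$ oc get cephcluster -n openshift-storage

# Confirm all daemons report the same ceph version
$ oc rsh -n openshift-storage deployment/rook-ceph-tools ceph versions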

Comment 16 Leela Venkaiah Gangavarapu 2023-02-17 07:53:21 UTC
- I'm up for a call whenever possible

Comment 18 Travis Nielsen 2023-03-07 15:13:47 UTC
What's the conclusion from the discussion? Is there a fix needed?

Comment 19 Travis Nielsen 2023-03-14 15:10:43 UTC
Please reopen if more discussion is needed.