Description of problem (please be as detailed as possible and provide log snippets):
- The logic to increase the PDB "max unavailable" as OSDs are added to a node is handled by the OCS/ODF operator.
- We should not expect manual changes to it in order to allow nodes to be drained, for example when upgrading OCP nodes.

Version of all relevant components (if applicable):
- OCP 4.10.23
- ODF 4.10.5

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
- The customer needs to manually increase the number of unavailable OSDs in the PDB object definition, or manually delete the OSD pod running on the node which is being drained.

Is there any workaround available to the best of your knowledge?
- Manually increase the number of unavailable OSDs if they run on the same node which is being drained, or delete the pod manually.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?
- Unknown

Can this issue be reproduced from the UI?
- Unknown

If this is a regression, please provide more details to justify this:

Steps to Reproduce in the customer environment:
1. Have OCP with ODF installed.
2. Have more than 1 OSD per storage node.
3. Upgrade/drain 1 storage node.

Actual results:
- The node won't be drained since the OSD cannot be evicted from the node due to the number of unavailable OSDs allowed in the PDB object, which defaults to 1.

~~~
error when evicting pods/"rook-ceph-osd-5-86455f59c4-t4lbb" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
~~~

Expected results:
- If the customer adds OSDs to ODF such that there are two OSDs per node, the PDB should increase the `max unavailable` value accordingly to allow the node to be drained, for example in an OCP node upgrade scenario.
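For reference, the manual workaround described above boils down to something like the following. This is a sketch only: the PDB name `rook-ceph-osd` is the assumed default, and the pod name is taken from the log snippet above.

~~~
# Inspect the OSD disruption budget in the ODF namespace
oc get pdb -n openshift-storage

# Workaround 1 (assumed default PDB name): temporarily raise maxUnavailable so the drain can proceed
oc patch pdb rook-ceph-osd -n openshift-storage --type merge -p '{"spec":{"maxUnavailable":2}}'

# Workaround 2: delete the OSD pod stuck on the node being drained (pod name from the log above)
oc delete pod rook-ceph-osd-5-86455f59c4-t4lbb -n openshift-storage
~~~

The point of this bug is that neither manual step should be required.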
This is by design. Once a node starts being drained, one of the OSDs is expected to go down; the rook operator will then adjust the PDBs dynamically to allow all OSDs on that node (or in that failure domain) to go down as well. If the rook operator was on the same drained node, the adjustment may take a bit longer while the operator starts on another node.

See the design here: https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md

Do you not see the PDBs adjusted automatically when an OSD pod goes down?
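The expected behavior can be verified during a drain with something like the following (a sketch only; it assumes the default namespace and PDB names seen elsewhere in this bug, `openshift-storage` for ODF or `rook-ceph` upstream):

~~~
# Before the drain: a single OSD PDB with MAX UNAVAILABLE = 1 is expected
oc get pdb -n openshift-storage

# In another terminal, start the drain, e.g.:
#   oc adm drain <node> --ignore-daemonsets --delete-emptydir-data

# While the drain runs, the operator should replace the single OSD PDB with
# per-failure-domain "blocking" PDBs (maxUnavailable=0) covering the OTHER
# failure domains, which lets the remaining OSDs on the drained node be evicted.
watch -n2 'oc get pdb -n openshift-storage'

# If eviction stays blocked, check whether the operator itself was evicted from
# the drained node and is still being rescheduled:
oc get pods -n openshift-storage -o wide | grep rook-ceph-operator
~~~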
Santosh PTAL, thanks
I'll check and get back to you as soon as I can.
Got occupied with some other tasks. I'll check and get back to you this week. Is https://bugzilla.redhat.com/show_bug.cgi?id=2116358#c19 the latest update from the customer, or have there been new instances where the customer is not seeing the correct behavior with PDBs?
Correct, the latest update from the customer is https://bugzilla.redhat.com/show_bug.cgi?id=2116358#c19. At least the customer did not report any new scenario involving PDBs.
Hi Javier,

Sorry for the delay. I'm looking at the must-gather logs `inspect.local.359676569811226791` for comment 19. Can you please confirm that these logs are related to comment 19?

Looking at the rook-operator logs at `inspect.local.359676569811226791/namespaces/openshift-storage/pods/rook-ceph-operator-6985f85bcb-ncrms/rook-ceph-operator/rook-ceph-operator/logs/current.log`, a few things I observed:

1. One of the mon pods is down and it never came back:

-------
2022-11-07T09:15:03.483731749Z 2022-11-07 09:15:03.483630 E | op-mon: failed to schedule mon "g". failed to schedule canary pod(s)
2022-11-07T09:15:03.489879925Z 2022-11-07 09:15:03.489782 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-g-canary"
2022-11-07T09:15:03.511533755Z 2022-11-07 09:15:03.511507 I | op-mon: scaling the mon "c" deployment to replica 1
2022-11-07T09:15:03.532785605Z 2022-11-07 09:15:03.532714 E | op-mon: failed to failover mon "c". failed to place new mon on a node: failed to schedule mons
2022-11-07T09:15:03.532785605Z 2022-11-07 09:15:03.532726 I | op-mon: allow voluntary mon drain after failover
.....
...
..
op-mon: mon "c" not found in quorum, waiting for timeout (554 seconds left) before failover
-------

2. The events suggest that:

```
07:59:40 openshift-storage rook-ceph-mon-g-canary-5c7c748cfb-c2tn8 FailedScheduling 0/5 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules.
```

So my assumption is that the last drained node never came back up, which is why we are seeing the unexpected behavior.
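If it helps to confirm this in the customer environment, something like the following would show whether a node was left cordoned and where the mon/operator pods ended up (a sketch; it only uses standard `oc` queries against the `openshift-storage` namespace):

~~~
# Any node still cordoned (SchedulingDisabled) after the drain?
oc get nodes

# Current placement of the mon and operator pods
oc get pods -n openshift-storage -o wide | grep -E 'rook-ceph-mon|rook-ceph-operator'

# Recent scheduling failures, e.g. for the mon canary pod
oc get events -n openshift-storage --field-selector reason=FailedScheduling
~~~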
Hi Santosh, Happy new year!

Thanks for looking at this.

The latest time frame when this scenario was seen in the customer environment was on Nov 4th around 11 AM, when the worker-12 node was brought down:

~~~
I1104 11:15:54.830513 20381 drain.go:44] Initiating cordon on node (currently schedulable: true)
I1104 11:15:54.867419 20381 drain.go:66] cordon succeeded on node (currently schedulable: false)
I1104 11:15:54.867442 20381 update.go:1956] Node has been successfully cordoned
I1104 11:15:54.869633 20381 update.go:1956] Update prepared; beginning drain
E1104 11:15:58.555501 20381 daemon.go:335] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-7s5ck, openshift-cnv/bridge-marker-wpvj7, openshift-cnv/kube-cni-linux-bridge-plugin-j4xj5, openshift-cnv/nmstate-handler-72lpw, openshift-cnv/virt-handler-d5hcf, openshift-controller-manager/controller-manager-xnsz2, openshift-dns/dns-default-fxvnk, openshift-dns/node-resolver-d27mx, openshift-image-registry/node-ca-75xs8, openshift-ingress-canary/ingress-canary-b6sw6, openshift-local-storage/diskmaker-discovery-zw5hp, openshift-local-storage/diskmaker-manager-v44f8, openshift-machine-api/metal3-image-cache-xh5vz, openshift-machine-config-operator/machine-config-daemon-m99n2, openshift-machine-config-operator/machine-config-server-h5sn8, openshift-monitoring/node-exporter-tqhsk, openshift-multus/multus-additional-cni-plugins-fbcqf, openshift-multus/multus-admission-controller-nbxwz, openshift-multus/multus-z9lt7, openshift-multus/network-metrics-daemon-6pbmh, openshift-network-diagnostics/network-check-target-82slr, openshift-sdn/sdn-48f4f, openshift-sdn/sdn-controller-z7drv, openshift-sriov-network-operator/network-resources-injector-rnkg9, openshift-sriov-network-operator/sriov-device-plugin-pxgw6, openshift-sriov-network-operator/sriov-network-config-daemon-8qbxj, openshift-storage/csi-cephfsplugin-f88ln, openshift-storage/csi-rbdplugin-b2hsk; deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-kube-apiserver/kube-apiserver-guard-worker-12, openshift-kube-controller-manager/kube-controller-manager-guard-worker-12, openshift-kube-scheduler/openshift-kube-scheduler-guard-worker-12, openshift-marketplace/certified-operators-m5cz9, openshift-marketplace/redhat-operators-pmcgc
I1104 11:15:58.556703 20381 daemon.go:335] evicting pod openshift-storage/rook-ceph-osd-5-86455f59c4-d4tpj
...
I1104 11:15:58.558112 20381 daemon.go:335] evicting pod openshift-storage/rook-ceph-osd-1-785c747d65-hw9lx
E1104 11:15:58.575057 20381 daemon.go:335] error when evicting pods/"rook-ceph-osd-5-86455f59c4-d4tpj" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
E1104 11:16:04.794348 20381 daemon.go:335] error when evicting pods/"rook-ceph-osd-1-785c747d65-hw9lx" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
...
I1104 11:17:16.672382 20381 daemon.go:320] Evicted pod openshift-storage/rook-ceph-osd-1-785c747d65-hw9lx
I1104 11:17:20.970957 20381 daemon.go:335] evicting pod openshift-storage/rook-ceph-osd-5-86455f59c4-d4tpj
E1104 11:17:21.735658 20381 daemon.go:335] error when evicting pods/"rook-ceph-osd-5-86455f59c4-d4tpj" -n "openshift-storage" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
~~~

This happened on a Friday, so the customer left it over the weekend to give the drain process enough time to complete.
This process didn't complete. The mon pod that didn't come back was the mon-c pod, which was drained from the worker-12 node on Friday 11/4:

~~~
2022-11-07T09:15:03.483731749Z 2022-11-07 09:15:03.483630 E | op-mon: failed to schedule mon "g". failed to schedule canary pod(s)
2022-11-07T09:15:03.489879925Z 2022-11-07 09:15:03.489782 I | op-mon: cleaning up canary monitor deployment "rook-ceph-mon-g-canary"
2022-11-07T09:15:03.511533755Z 2022-11-07 09:15:03.511507 I | op-mon: scaling the mon "c" deployment to replica 1
2022-11-07T09:15:03.532785605Z 2022-11-07 09:15:03.532714 E | op-mon: failed to failover mon "c". failed to place new mon on a node: failed to schedule mons
2022-11-07T09:15:03.532785605Z 2022-11-07 09:15:03.532726 I | op-mon: allow voluntary mon drain after failover
~~~

The event you shared shows the worker-12 node still in unschedulable status because the drain was hung trying to evict the second OSD from the node. Thus, the node didn't reboot, and the unschedulable condition was not removed until the second OSD pod was forcefully killed to allow the node to be drained and rebooted.

~~~
07:59:40 openshift-storage rook-ceph-mon-g-canary-5c7c748cfb-c2tn8 FailedScheduling 0/5 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector, 2 node(s) didn't match pod anti-affinity rules.
~~~

I hope this clarifies the scenario; if not, please let me know.
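For completeness, while the drain was stuck in that state, the PDB side could have been checked with something like the following (a sketch, assuming the default PDB names; per the design linked earlier, the single `rook-ceph-osd` PDB should have been replaced with per-host blocking PDBs once the first OSD on worker-12 went down):

~~~
# Is the drain still blocked on the OSD PDB(s)?
oc get pdb -n openshift-storage

# Which OSD pods are still on the node being drained?
oc get pods -n openshift-storage -o wide | grep rook-ceph-osd | grep worker-12

# Last resort actually used here: delete the remaining OSD pod so the drain can
# finish (the manual step this bug argues should not be necessary)
oc delete pod rook-ceph-osd-5-86455f59c4-d4tpj -n openshift-storage
~~~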
(In reply to Javier Coscia from comment #34)
> Hi Santosh, Happy new year!
>
> Thanks for looking at this.
>
> The latest time frame when this scenario was seen in the customer environment
> was on Nov 4th around 11 AM, when the worker-12 node was brought down.
>
> I hope this clarifies the scenario; if not, please let me know.

Thanks for the clarification. I'll try to set up a similar cluster (multiple OSDs on a node) and try to reproduce the behavior from comment 19 again. I tried to reproduce this last time but couldn't; I'll give it another shot. I'll get back to you with the results by tomorrow.
Tried testing this out locally on minikube. The failure domain was Node and each node had 2 OSDs.

oc get pods -o wide -n rook-ceph
NAME                                                     READY   STATUS      RESTARTS   AGE     IP               NODE           NOMINATED NODE   READINESS GATES
csi-cephfsplugin-b5b8g                                   2/2     Running     0          23m     192.168.50.115   minikube-m03   <none>           <none>
csi-cephfsplugin-provisioner-569f96898b-bcc55            5/5     Running     0          23m     10.244.3.4       minikube-m04   <none>           <none>
csi-cephfsplugin-provisioner-569f96898b-lq2rs            5/5     Running     0          23m     10.244.2.4       minikube-m03   <none>           <none>
csi-cephfsplugin-qh6r9                                   2/2     Running     0          23m     192.168.50.36    minikube-m02   <none>           <none>
csi-cephfsplugin-w4qgt                                   2/2     Running     0          23m     192.168.50.70    minikube-m04   <none>           <none>
csi-rbdplugin-25g7w                                      2/2     Running     0          23m     192.168.50.70    minikube-m04   <none>           <none>
csi-rbdplugin-d2jtn                                      2/2     Running     0          23m     192.168.50.115   minikube-m03   <none>           <none>
csi-rbdplugin-provisioner-5d4578b479-2wx4v               5/5     Running     0          23m     10.244.2.3       minikube-m03   <none>           <none>
csi-rbdplugin-provisioner-5d4578b479-9dlph               5/5     Running     0          18m     10.244.3.14      minikube-m04   <none>           <none>
csi-rbdplugin-x7t5v                                      2/2     Running     0          23m     192.168.50.36    minikube-m02   <none>           <none>
rook-ceph-crashcollector-minikube-m02-56d95c749f-jmt89   1/1     Running     0          2m      10.244.1.25      minikube-m02   <none>           <none>
rook-ceph-crashcollector-minikube-m03-58db9f774-2qhsl    1/1     Running     0          22m     10.244.2.7       minikube-m03   <none>           <none>
rook-ceph-crashcollector-minikube-m04-58fc88874f-h48jh   1/1     Running     0          21m     10.244.3.12      minikube-m04   <none>           <none>
rook-ceph-mgr-a-b8d58d8f9-g7wch                          3/3     Running     0          22m     10.244.3.7       minikube-m04   <none>           <none>
rook-ceph-mgr-b-b767d5f96-b5jw6                          3/3     Running     0          18m     10.244.2.12      minikube-m03   <none>           <none>
rook-ceph-mon-a-58f64dbb87-5s7gq                         2/2     Running     0          23m     10.244.3.6       minikube-m04   <none>           <none>
rook-ceph-mon-b-644b5ddf94-bcjhv                         2/2     Running     0          3m55s   10.244.1.24      minikube-m02   <none>           <none>
rook-ceph-mon-c-5cb6444c94-vfqv9                         2/2     Running     0          22m     10.244.2.6       minikube-m03   <none>           <none>
rook-ceph-operator-66d89f9c7c-lbvl4                      1/1     Running     0          3m55s   10.244.2.14      minikube-m03   <none>           <none>
rook-ceph-osd-0-7dc8d5dd97-txm84                         2/2     Running     0          3m29s   10.244.1.22      minikube-m02   <none>           <none>
rook-ceph-osd-1-7f84fdcdb6-4f94p                         2/2     Running     0          21m     10.244.3.10      minikube-m04   <none>           <none>
rook-ceph-osd-2-f6979b7b-n2jrt                           2/2     Running     0          21m     10.244.2.9       minikube-m03   <none>           <none>
rook-ceph-osd-3-6b76ff8696-2mk8g                         2/2     Running     0          3m55s   10.244.1.23      minikube-m02   <none>           <none>
rook-ceph-osd-4-7545d757d8-6m977                         2/2     Running     0          21m     10.244.3.11      minikube-m04   <none>           <none>
rook-ceph-osd-5-59d69d86f9-q8qg5                         2/2     Running     0          21m     10.244.2.10      minikube-m03   <none>           <none>
rook-ceph-osd-prepare-minikube-m03--1-j4m97              0/1     Completed   0          3m26s   10.244.2.16      minikube-m03   <none>           <none>
rook-ceph-osd-prepare-minikube-m04--1-d8bgv              0/1     Completed   0          3m22s   10.244.3.18      minikube-m04   <none>           <none>
rook-ceph-tools-598f4566db-989v8                         1/1     Running     0          18m     10.244.3.15      minikube-m04   <none>           <none>
----------------------------------

Observe node `minikube-m03`. It has osd-2 and osd-5, and the rook operator is also running on this node.
oc get pods -o wide -n rook-ceph | grep minikube-m03
csi-cephfsplugin-b5b8g                                   2/2     Running     0          24m     192.168.50.115   minikube-m03   <none>           <none>
csi-cephfsplugin-provisioner-569f96898b-lq2rs            5/5     Running     0          24m     10.244.2.4       minikube-m03   <none>           <none>
csi-rbdplugin-d2jtn                                      2/2     Running     0          24m     192.168.50.115   minikube-m03   <none>           <none>
csi-rbdplugin-provisioner-5d4578b479-2wx4v               5/5     Running     0          24m     10.244.2.3       minikube-m03   <none>           <none>
rook-ceph-crashcollector-minikube-m03-58db9f774-2qhsl    1/1     Running     0          23m     10.244.2.7       minikube-m03   <none>           <none>
rook-ceph-mgr-b-b767d5f96-b5jw6                          3/3     Running     0          20m     10.244.2.12      minikube-m03   <none>           <none>
rook-ceph-mon-c-5cb6444c94-vfqv9                         2/2     Running     0          24m     10.244.2.6       minikube-m03   <none>           <none>
rook-ceph-operator-66d89f9c7c-lbvl4                      1/1     Running     0          5m12s   10.244.2.14      minikube-m03   <none>           <none>
rook-ceph-osd-2-f6979b7b-n2jrt                           2/2     Running     0          23m     10.244.2.9       minikube-m03   <none>           <none>
rook-ceph-osd-5-59d69d86f9-q8qg5                         2/2     Running     0          23m     10.244.2.10      minikube-m03   <none>           <none>
rook-ceph-osd-prepare-minikube-m03--1-j4m97              0/1     Completed   0          4m43s   10.244.2.16      minikube-m03   <none>           <none>
---------------------

PDBs before the tests:

Every 2.0s: oc get pdb -n rook-ceph                                      localhost.localdomain: Wed Jan 4 10:38:21 2023

NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mgr-pdb   N/A             1                 1                     24m
rook-ceph-mon-pdb   N/A             1                 1                     25m
rook-ceph-osd       N/A             1                 1                     3m45s
----------------------------------------------------

Tests:

- Drained minikube-m03.

Every 2.0s: oc get pdb -n rook-ceph                                      localhost.localdomain: Wed Jan 4 10:39:29 2023

NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mgr-pdb   N/A             1                 0                     25m
rook-ceph-mon-pdb   N/A             1                 0                     26m
rook-ceph-osd       N/A             1                 0                     4m53s

Initially no blocking PDBs were created because the rook operator was also removed. Once the operator got deployed on another node, it created the blocking PDBs on the other nodes.

Every 2.0s: oc get pdb -n rook-ceph                                      localhost.localdomain: Wed Jan 4 10:40:35 2023

NAME                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mgr-pdb                 N/A             1                 1                     26m
rook-ceph-mon-pdb                 N/A             1                 0                     27m
rook-ceph-osd-host-minikube-m02   N/A             0                 0                     48s
rook-ceph-osd-host-minikube-m04   N/A             0                 0                     48s

And the node minikube-m03 was drained successfully.

$ kubectl drain minikube-m03 --ignore-daemonsets --delete-local-data --force
Flag --delete-local-data has been deprecated, This option is deprecated and will be deleted. Use --delete-emptydir-data.
node/minikube-m03 cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-l5vdb, kube-system/kube-proxy-hpjkn, rook-ceph/csi-cephfsplugin-b5b8g, rook-ceph/csi-rbdplugin-d2jtn
evicting pod rook-ceph/rook-ceph-osd-prepare-minikube-m03--1-j4m97
evicting pod rook-ceph/rook-ceph-mgr-b-b767d5f96-b5jw6
evicting pod rook-ceph/csi-cephfsplugin-provisioner-569f96898b-lq2rs
evicting pod rook-ceph/csi-rbdplugin-provisioner-5d4578b479-2wx4v
evicting pod rook-ceph/rook-ceph-crashcollector-minikube-m03-58db9f774-2qhsl
evicting pod rook-ceph/rook-ceph-operator-66d89f9c7c-lbvl4
evicting pod rook-ceph/rook-ceph-mon-c-5cb6444c94-vfqv9
evicting pod rook-ceph/rook-ceph-osd-2-f6979b7b-n2jrt
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/rook-ceph-osd-prepare-minikube-m03--1-j4m97 evicted
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
pod/rook-ceph-mgr-b-b767d5f96-b5jw6 evicted
pod/csi-cephfsplugin-provisioner-569f96898b-lq2rs evicted
pod/csi-rbdplugin-provisioner-5d4578b479-2wx4v evicted
pod/rook-ceph-crashcollector-minikube-m03-58db9f774-2qhsl evicted
I0104 10:39:20.152678  150443 request.go:682] Waited for 1.0774003s due to client-side throttling, not priority and fairness, request: GET:https://192.168.50.186:8443/api/v1/namespaces/rook-ceph/pods/rook-ceph-operator-66d89f9c7c-lbvl4
pod/rook-ceph-operator-66d89f9c7c-lbvl4 evicted
pod/rook-ceph-mon-c-5cb6444c94-vfqv9 evicted
pod/rook-ceph-osd-2-f6979b7b-n2jrt evicted
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
error when evicting pods/"rook-ceph-osd-5-59d69d86f9-q8qg5" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-osd-5-59d69d86f9-q8qg5
pod/rook-ceph-osd-5-59d69d86f9-q8qg5 evicted
node/minikube-m03 drained
--------------

Note that it took some time to evict the `rook-ceph-osd-5-59d69d86f9-q8qg5` pod because the rook operator was also down. Once the operator got deployed on another node, it created the blocking PDBs correctly, and `rook-ceph-osd-5-59d69d86f9-q8qg5` got evicted as well.

I tried draining multiple nodes one at a time; the same result was observed.

This was local testing with rook. Next I'll try to test this with ODF with the same configuration as the customer. I'll update the results soon.
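For the follow-up ODF test, the drain/uncordon cycle can be scripted roughly as follows. This is only a sketch: the node names and the `openshift-storage` namespace are assumptions based on the customer setup, not the exact commands that will be run.

~~~
#!/usr/bin/env bash
# Drain storage nodes one at a time and watch whether the operator swaps in the
# per-host blocking PDBs before the second OSD on the node is evicted.
set -euo pipefail

NAMESPACE=openshift-storage            # use rook-ceph for the upstream test above
NODES=(worker-10 worker-11 worker-12)  # assumed node names; adjust to the cluster

for node in "${NODES[@]}"; do
  echo "### Draining ${node}"
  oc adm drain "${node}" --ignore-daemonsets --delete-emptydir-data --timeout=30m &
  drain_pid=$!

  # Poll the PDBs while the drain is running
  while kill -0 "${drain_pid}" 2>/dev/null; do
    oc get pdb -n "${NAMESPACE}"
    sleep 10
  done
  wait "${drain_pid}" || echo "drain of ${node} did not complete cleanly"

  echo "### Uncordoning ${node}"
  oc adm uncordon "${node}"

  # Check Ceph health (.status.ceph.health) before moving to the next node
  oc get cephcluster -n "${NAMESPACE}"
done
~~~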