Bug 1861104 - OCS podDisruptionBudget prevents successful OCP upgrades
Summary: OCS podDisruptionBudget prevents successful OCP upgrades
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Santosh Pillai
QA Contact: Aviad Polak
URL:
Whiteboard:
Duplicates: 1889518
Depends On:
Blocks: 1899743 1915851 1916585 1938134
 
Reported: 2020-07-27 20:19 UTC by Dan Yocum
Modified: 2024-03-25 16:13 UTC (History)
33 users

Fixed In Version: 4.7.0-191.ci
Doc Type: Enhancement
Doc Text:
.OSD PDB redesign
Previously, the OCS OSD Pod Disruption Budgets (PDBs) had `maxUnavailable=0` by default and only allowed the OSDs on a single node to be rebooted at a time. This caused the OCP console to constantly show a warning about nodes not being able to restart. With this update, the OSD PDBs are redesigned as follows:
* Initially there is a single OSD PDB, which allows only one OSD to be down at any time.
* Once an OSD goes down, its failure domain is determined and blocking OSD PDBs are created for the other failure domains.
* The original OSD PDB is then deleted. As a result, all the OSDs in the affected failure domain are allowed to go down.
With the new design, users can drain multiple nodes in the same failure domain.
Clone Of:
Cloned to: 1899738 1899743 1915851
Environment:
Last Closed: 2021-05-19 09:14:56 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github red-hat-storage ocs-ci pull 3888 0 None None None 2021-09-02 06:02:42 UTC
Github rook rook pull 6497 0 None closed ceph: OSD PDB reconciler changes 2021-02-18 21:26:45 UTC
Red Hat Product Errata RHSA-2021:2041 0 None None None 2021-05-19 09:16:07 UTC

Internal Links: 1788126

Description Dan Yocum 2020-07-27 20:19:04 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Setting a PodDisruptionBudget with minAvailable of 1 (or maxUnavailable of 0) on a deployment that has only 1 replica prevents eviction of the storage pod, which causes OCP upgrades to stall during the upgrade process.

apiVersion: v1
items:
- apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    creationTimestamp: 2020-07-27T15:45:18Z
    generation: 1
    name: rook-ceph-mds-ocs-storagecluster-cephfilesystem
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: false
      kind: CephFilesystem
      name: ocs-storagecluster-cephfilesystem
      uid: 25f0ef12-f543-44b8-993b-6990426ca5b4
    resourceVersion: "124052"
    selfLink: /apis/policy/v1beta1/namespaces/openshift-storage/poddisruptionbudgets/rook-ceph-mds-ocs-storagecluster-cephfilesystem
    uid: 407e6ffc-7e7e-4ef0-9f78-a97594254b3c
  spec:
    minAvailable: 1
    selector:
      matchLabels:
        rook_file_system: ocs-storagecluster-cephfilesystem
  status:
    currentHealthy: 1
    desiredHealthy: 1
    disruptionsAllowed: 0
    expectedPods: 1
    observedGeneration: 1
- apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    creationTimestamp: 2020-07-27T15:45:18Z
    generation: 1
    name: rook-ceph-mon-pdb
    namespace: openshift-storage
    resourceVersion: "144553"
    selfLink: /apis/policy/v1beta1/namespaces/openshift-storage/poddisruptionbudgets/rook-ceph-mon-pdb
    uid: b1e2278e-3b50-41eb-b346-f4d9d54600e0
  spec:
    minAvailable: 2
    selector:
      matchLabels:
        app: rook-ceph-mon
  status:
    currentHealthy: 2
    desiredHealthy: 2
    disruptionsAllowed: 0
    expectedPods: 4
    observedGeneration: 1
- apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    creationTimestamp: 2020-07-27T15:45:19Z
    generation: 1
    labels:
      app: rook-ceph-osd-pdb
      ceph-osd-id: "1"
    name: rook-ceph-osd-1
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: f62f5027-cf93-4a92-8093-6a1f69c2e160
    resourceVersion: "124057"
    selfLink: /apis/policy/v1beta1/namespaces/openshift-storage/poddisruptionbudgets/rook-ceph-osd-1
    uid: 1af20bfa-d5b6-4b12-b376-5b244603834d
  spec:
    maxUnavailable: 0
    selector:
      matchLabels:
        app: rook-ceph-osd
        ceph-osd-id: "1"
        rook_cluster: openshift-storage
  status:
    currentHealthy: 1
    desiredHealthy: 1
    disruptionsAllowed: 0
    expectedPods: 1
    observedGeneration: 1
- apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    creationTimestamp: 2020-07-27T15:45:19Z
    generation: 1
    labels:
      app: rook-ceph-osd-pdb
      ceph-osd-id: "2"
    name: rook-ceph-osd-2
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: f62f5027-cf93-4a92-8093-6a1f69c2e160
    resourceVersion: "124059"
    selfLink: /apis/policy/v1beta1/namespaces/openshift-storage/poddisruptionbudgets/rook-ceph-osd-2
    uid: cb1ba853-f8e6-4ae0-98f0-0eb776dbaf71
  spec:
    maxUnavailable: 0
    selector:
      matchLabels:
        app: rook-ceph-osd
        ceph-osd-id: "2"
        rook_cluster: openshift-storage
  status:
    currentHealthy: 1
    desiredHealthy: 1
    disruptionsAllowed: 0
    expectedPods: 1
    observedGeneration: 1
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""


Those are the PDBs for the storage project. You can see a few with maxUnavailable of 0 and minAvailable of 1, both of which cause errors like the following in the machine-config-daemon pod on the nodes when they are being drained:

I0727 16:30:27.799957    2371 update.go:92] error when evicting pod "rook-ceph-osd-2-66f6984746-b75xq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget. 



Version of all relevant components (if applicable):

4.4

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

No

Is there any workaround available to the best of your knowledge?

The only way to successfully upgrade the OCP cluster is to manually delete the osd pod.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCS
2. Attempt z-stream upgrade of OCP
3. Fail


Actual results:

I0727 16:30:27.799957    2371 update.go:92] error when evicting pod "rook-ceph-osd-2-66f6984746-b75xq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Expected results:

No errors, OCP is upgraded successfully w/o manual intervention.


Additional info:

Comment 4 Rohan CJ 2020-07-29 07:19:27 UTC
The upgrade is expected to throw those errors a few times; the eviction attempts are then detected,
and if the OCS cluster's data (placement groups) is in a clean state, the PodDisruptionBudget will automatically be deleted.

> I0727 16:30:27.799957    2371 update.go:92] error when evicting pod "rook-ceph-osd-2-66f6984746-b75xq" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The log message above is expected. The eviction will eventually succeed.

We will suppress the alert in an upcoming release: https://bugzilla.redhat.com/show_bug.cgi?id=1856764

This will cause the upgrade to take longer as each eviction will be blocked between node reboots to allow the data replication to occur.


If the ceph cluster is unable to become healthy for some reason, you will have to either fix the cluster or manually evict the pods.

If the ceph cluster is able to become healthy, but the upgrade is still failing (and not simply taking a longer time), that is a bug.

In either case, please provide the must-gather.

Comment 5 Rohan CJ 2020-07-29 07:22:51 UTC
(Apparently we need to have more discussion about fixing the alert; the alert-hiding bug I linked has been closed.)

Comment 6 Neha Berry 2020-07-29 11:12:25 UTC
BTW, some past issues reported for similar behavior:


>> In OCP upgrade, during MCO upgrade, we still wait for PGs to be active clean indefinitely, and this affects OCP upgrade

 Bug 1785151 - With OCS 4.2 configured, One node left behind in "Ready,SchedulingDisabled" state during OCP upgrade 4.2.10->4.2.12

>> RFE for better recovery time for PGs as Upgrade is prolonged and the issue is aggravated with ongoing IO 
-----------------------------------------
Bug 1809064 - [RFE]Speed up OCS recovery/drain during OCP upgrade: (closed as Duplicate of bug 1856254)

Comment 9 Rohan CJ 2020-08-11 08:20:10 UTC
Hey, sorry for the delay! Getting access to dropbox.redhat.com

Comment 10 Rohan CJ 2020-08-11 08:40:17 UTC
Seems like you can't access dropbox.redhat.com directly through VPN @Bipin, could you make this available on the support shell?

Comment 12 Rohan CJ 2020-08-12 12:30:27 UTC
Hey, this is the OCP must-gather. We need the OCS must-gather

Comment 13 Mudit Agarwal 2020-08-17 03:03:18 UTC
Removing blocker flag based on an offline discussion.

Comment 14 Rohan CJ 2020-08-17 05:50:22 UTC
Nothing to indicate a bug so far; most likely a misunderstanding caused by the incorrect alert.
Moving out of 4.5.
Keeping this open to root-cause the issue when the OCS must-gather is provided.

Comment 20 Rohan CJ 2020-09-21 07:44:34 UTC
The problem with allowing only one OSD to be down at a time is that this could lead to data unavailability. 

Between OSD downtimes, we need to wait for the data to rebalance. This can't be done using PDBs.
The only way to do this was to block all evictions, until the attempts are detected and deemed safe by rook.
We will be looking into solutions for this (possibly by contributing to upstream k8s) in the future.

We implemented a workaround that allows one failure domain to drain at a time in response to detected drains (using a canary pod on the node).

The design for this is here: https://github.com/rook/rook/blob/master/design/ceph/ceph-managed-disruptionbudgets.md

You can look at the rook operator logs and the configmap which tracks which failure domain has been allowed to drain (and when). 
The configmap is called something like pdbstatemap.

Comment 21 daniel 2020-09-21 15:25:00 UTC
Thanks Rohan, quite helpful!

Comment 22 Travis Nielsen 2020-09-30 20:08:32 UTC
@Rohan As I recall, the main reason we have multiple PDBs for OSDs is that there could be multiple OSDs per node, which means we need one PDB per node. If there were only ever one OSD per node, we could have gone with the easy solution like mons and etcd, where there is only a single PDB.

@Dan @daniel @Annette Are you still seeing this issue? I'm understanding that the upgrade is able to proceed, but there are just some error and warning messages along the way that make it look like it's not working. Can you confirm if there is still a critical upgrade issue here for OCS 4.6?

Comment 23 Dan Yocum 2020-09-30 20:47:33 UTC
Per the original report from Experian: 

"The only way to successfully upgrade the OCP cluster is to manually delete the osd pod."

Comment 24 Travis Nielsen 2020-09-30 21:34:28 UTC
@Dan Is it a bare metal cluster or AWS? 

@Rohan What about the following design:
1. Create a single PDB that covers all OSDs, similar to mons 
2. Set the maxUnavailable to the max number of OSDs that have been scheduled on a node. 
3. Watch for OSD pods going down
4. If PGs become unhealthy for any reason, change the maxUnavailable to 0 to block any further OSDs from going down until the operator decides it is safe
5. When the PGs are healthy again, return to step 2 to restore the maxUnavailable.

As long as we assume even spread of OSDs (which will be the case for OSDs after topology spread constraints are applied), only one node would be allowed to go down at a time.

Since the operator itself might be evicted at the same time as OSDs, the operator is expected to respond soon enough to modify the PDB after it is restarted (when it is rescheduled to a new node after being evicted). Since the operator can run on any node, it will be scheduled to a new node while the drained node is still being restarted.

This also would get rid of the warning in the OCP console from always having a max unavailable of 0.
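
For illustration, a minimal sketch of the single OSD PDB from steps 1-2 could look like the following, reusing the app: rook-ceph-osd selector from the PDBs above; the name and the maxUnavailable value of 2 are purely illustrative (the real value would be derived from the maximum number of OSDs scheduled on any one node):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd                 # illustrative name for a single PDB covering all OSDs
  namespace: openshift-storage
spec:
  maxUnavailable: 2                   # illustrative: the max number of OSDs scheduled on a single node
  selector:
    matchLabels:
      app: rook-ceph-osd              # selects every OSD pod, unlike the per-OSD PDBs above
      rook_cluster: openshift-storage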

Let's discuss...

Comment 25 Dan Yocum 2020-09-30 23:00:11 UTC
@Travis 

It's AWS.

Comment 26 Travis Nielsen 2020-10-05 15:59:55 UTC
There does not appear to be an easy fix with dev freeze tomorrow for 4.6. We will need to address the redesign in 4.7.

Comment 27 daniel 2020-10-06 06:11:02 UTC
> @Dan @daniel @Annette Are you still seeing this issue? I'm understanding
> that the upgrade is able to proceed, but there are just some error and
> warning messages along the way that make it look like it's not working. Can
> you confirm if there is still a critical upgrade issue here for OCS 4.6?

@Travis, in my last tests it worked without any issues apart from an RHV issue (1880860), but the OCP upgrades went well, see c#18. Yesterday I upgraded from 4.5.8 through 4.5.11 to 4.5.13, which went smoothly as well.
If it helps I can retest on AWS as well

Comment 28 Rohan CJ 2020-10-14 07:28:31 UTC
@Travis, PDB specs cannot be modified, only created or deleted.

Comment 29 Travis Nielsen 2020-10-15 18:18:34 UTC
Per discussion today with Rohan and Santosh, there are a couple approaches to explore.

First, keep the current design, but suppress the OCP warning. After all, per the PDB docs, using maxUnavailable=0 is a documented option.
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
"Set PDB with maxUnavailable=0. Have an understanding (outside of Kubernetes) that the cluster operator needs to consult you before termination. When the cluster operator contacts you, prepare for downtime, and then delete the PDB to indicate readiness for disruption. Recreate afterwards."

Second, here is a possible design to explore (ignore my earlier design in comment 24):

Example scenario:
Zone x
- Node a
  - osd.0
  - osd.1
Zone y
- Node b
  - osd.2
  - osd.3
Zone z
- Node c
  - osd.4
  - osd.5

1. Rook creates a single PDB that covers all OSDs with maxUnavailable=1. This would avoid the alert, and would also allow one osd to be down at any given time, which is fully supported by OCS already.
2. When Rook sees an OSD go down (for example, osd.0 goes down):
   a. Create a PDB for each failure domain (zones y and z) with maxUnavailable=0 where the OSD did *not* go down. 
   b. Delete the original PDB that covers all OSDs
   c. Now all remaining OSDs in zone x would be allowed to be drained
3. When Rook sees the OSDs are back up and all PGs are clean
   a. Restore the PDB that covers all OSDs with maxUnavailable=1
   b. Delete the PDBs (in zone y and z) where maxUnavailable=0

Now the OSDs on another node could proceed to be drained.
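
For illustration, the blocking PDB created in step 2a for a failure domain where no OSD is down could look roughly like the sketch below, assuming the OSD pods carry a failure-domain label such as topology-location-zone (the exact label and PDB name used by Rook may differ):

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd-zone-y          # illustrative name for the blocking PDB in zone y
  namespace: openshift-storage
spec:
  maxUnavailable: 0                   # block all OSD evictions in this failure domain
  selector:
    matchLabels:
      app: rook-ceph-osd
      rook_cluster: openshift-storage
      topology-location-zone: "y"     # assumed failure-domain label on the OSD pods

Zone x would carry no PDB while its drain is in progress; step 3 then deletes the blocking PDBs and re-creates the single maxUnavailable=1 PDB.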

The main question for this design is whether there is a concern with the PDBs that will overlap for a split second, which documentation says should be avoided.
https://kubernetes.io/docs/tasks/run-application/configure-pdb/
"when there are multiple PDBs in a namespace, you must be careful not to create PDBs whose selectors overlap."

Comment 30 W. Trevor King 2020-10-15 19:11:21 UTC
> The main question for this design is whether there is a concern with the PDBs that will overlap for a split second, which documentation says should be avoided.

That sounds like an upstream bug; at least a docs bug.  Because overlapping PDBs during transition seems much safer to me than "drop all PDB coverage and hope to get your new PDB landed before anything moves on the opening".

Comment 31 Michael Gugino 2020-10-15 19:28:25 UTC
Having multiple PDBs covering the same pod will result in the inability to evict said pod.  The documentation upstream is correct; you should avoid overlapping PDBs if you still want to evict a pod.  If you want to ensure a pod doesn't get evicted, multiple PDBs would do it.  Note, this is somewhat of a side effect; there is some desire to try to handle multiple PDBs, but I think it should continue behaving as it does today.

The steps to create an additional PDB where maxUnavailable = 0 seem redundant in this particular case.  If all the pods are covered with maxUnavailable=1, then no more pods will be allowed to be disrupted anyway.

Step 'C' "Now all remaining OSDs in zone x would be allowed to drain" seems dangerous.  If we only care about draining and destroying the node where the OSD failed, then this patch will fix that issue: https://github.com/kubernetes/kubernetes/pull/94381

What you can do today:

Delete the broken OSD pod via some other controller under the right conditions.  This will cause the OSD pod to terminate, which will unblock drain if using the machine-api.

Comment 32 Travis Nielsen 2020-10-15 20:28:35 UTC
(In reply to W. Trevor King from comment #30)
> > The main question for this design is whether there is a concern with the PDBs that will overlap for a split second, which documentation says should be avoided.
> 
> That sounds like an upstream bug; at least a docs bug.  Because overlapping
> PDBs during transition seems much safer to me than "drop all PDB coverage
> and hope to get your new PDB landed before anything moves on the opening".

If the only side effect of having multiple PDBs cover the same pod is that the pod won't be evicted while either PDB would be violated, this is fine. We will delete one of the PDBs immediately anyway. We just wanted to confirm if there are any other side effects since it wasn't very clear.


(In reply to Michael Gugino from comment #31)
> Having multiple PDBs covering the same pod will result in the inability to
> evict said pod.  The documentation upstream is correct, you should avoid
> overlapping PDBs if you still want to evict a pod.  If you want to ensure a
> pod doesn't get evicted, multiple PDBs would do it.  Note, this is somewhat
> of a side-effect, there is some desire to try to handle multiple PDBs, I
> think it should continue behaving as today.
> 
> The steps to create an additional PDB where maxUnavailable = 0 seems
> redundant in this particular case.  If the all pods are covered with
> maxUnavailable=1, then no more pods will be allowed to be disrupted anyway.

To clarify the example, after step 2 is complete, we would end up with two PDBs:
- zone y with maxUnavailable=0
- zone z with maxUnavailable=0

Zone 'x' would not have any PDBs, so osd.1 could now be stopped to finish draining node 'a'. 

> Step 'C' "Now all remaining OSDs in zone x would be allowed to drain" seems
> dangerous.  If we only care about draining and destroying the node where the
> OSD failed, then this patch will fix that issue:
> https://github.com/kubernetes/kubernetes/pull/94381

After osd.0 was stopped, we need osd.1 to be stopped to allow the node to complete the drain. If the OSDs were perfectly healthy before the drain started, not sure I see how that PR will help us.


> What you can do today:
> 
> Delete the broken OSD pod via some other controller under the right
> conditions.  This will cause the OSD pod to terminate, which will unblock
> drain if using the machine-api.

Let's assume there are no broken OSD pods, the OSDs just need to be removed when the node is drained. This should just fit into the expected drain scenario rather than try to manage the pods with our own controller.

Comment 33 Michael Gugino 2020-10-15 20:29:28 UTC
Here's a scenario that I think will work for all cases.

First, determine how many OSDs you can lose simultaneously before failure.  That's your PDB count. (Feel free to choose any lower number as long as it's 1 or more).

Let's say that number (N) is '2'.

So, all OSDs covered by PDB with maxUnavailable = 2.

Next, in addition to the OSDs being covered by the PDB, also deploy N canary pods (In our example case, 2) to be covered under the same PDBs.

The canary pods listen to rook or whatever to determine 'PGs are clean' state.  If 'PGs are clean' == false, set pod readiness to False.  This will count as disruptions against the PDB and all draining will be blocked until the PGs are clean.

Once the aforementioned patch merges, the nodes with the broken OSDs will drain properly after the canary pods report ready.

One thing to keep in mind: if the kubelet goes unreachable and your pods do not tolerate NoExecute and NoSchedule, they will be marked deleted, and after 5 minutes the machine-api will ignore any pods on an unreachable kubelet with a deletion timestamp.  In this case, drain would not be blocked (by the machine-api, or manually if using the right options).  This might not be desirable for you; you might wish to have the OSDs tolerate those taints.  The net effect would be that those OSDs block drain indefinitely until the PDB is no longer in violation (since those OSDs will not be marked deleted).  See the etcd-quorum-guard PDB for an example.

Comment 34 Travis Nielsen 2020-10-15 20:50:54 UTC
(In reply to Michael Gugino from comment #33)
> Here's a scenario that I think will work for all cases.
> 
> First, determine how many OSDs you can lose simultaneously before failure. 
> That's your PDB count. (Feel free to choose any lower number as long as it's
> 1 or more).
> 
> Let's say that number (N) is '2'.
> 
> So, all OSDs covered by PDB with maxUnavailable = 2.

What causes downtime is if there are any OSDs that are down in different zones. A single PDB just can't represent that behavior across zones since it only counts the total number of pods that can be down. Any number of OSDs can be down in the same zone, but not across zones. 

Therefore, the idea is that when any OSD goes down in a single zone, we effectively remove the PDB from that zone while using PDBs to block OSDs from going down in other zones.

Comment 35 W. Trevor King 2020-10-15 23:19:45 UTC
> If the only side effect of having multiple pdbs cover the same pod is that the pod won't be evicted if either pdb is satisfied, this is fine...

I'm attempting to soften the upstream docs wording in [1].

[1]: https://github.com/kubernetes/website/pull/24589

Comment 36 Santosh Pillai 2020-10-19 13:20:00 UTC
Looking into it.

Comment 37 Lars Kellogg-Stedman 2020-10-21 12:18:54 UTC
*** Bug 1889518 has been marked as a duplicate of this bug. ***

Comment 38 Travis Nielsen 2020-10-27 19:57:36 UTC
Note that the design proposal in https://bugzilla.redhat.com/show_bug.cgi?id=1861104#c29 only applies to clusters that have at least three failure domains above the host level (usually a zone or rack). If there is no higher-level failure domain (or when portable: false is set on the storageClassDeviceSets), we need a different approach for a host failure domain. 

For host failure domain, let's consider the following design to only allow a single node to go down at a time:
- An osd canary pod is deployed on each node where OSDs are deployed. The canary pod is needed since there could be multiple OSDs on the same node, though not necessarily the same number of OSDs on every node.
- A PDB is created for the osd canary selector that will set maxUnavailable=1. The PDB is static and never needs to be updated.

This does not strictly enforce that the PGs are in a healthy state, but I expect it to be a sufficient approximation of having OSDs up during the rolling node updates. We must strike a balance between allowing nodes to be drained and accepting the occasional intermittent short time window of down PGs if the PGs need time to settle after a node restart.
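
For illustration, a minimal sketch of the static canary PDB described above; the pod label and PDB name are placeholders rather than the exact names Rook would use:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd-canary          # placeholder name for the canary PDB
  namespace: openshift-storage
spec:
  maxUnavailable: 1                   # only one node's canary pod may be evicted at a time
  selector:
    matchLabels:
      app: rook-ceph-osd-canary       # assumed label shared by the per-node canary pods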

Comment 42 Santosh Pillai 2020-10-28 11:55:58 UTC
(In reply to Travis Nielsen from comment #29)
> Per discussion today with Rohan and Santosh, there are a couple approaches
> to explore.
> 

> 
> Example scenario:
> Zone x
> - Node a
>   - osd.0
>   - osd.1
> Zone y
> - Node b
>   - osd.2
>   - osd.3
> Zone z
> - Node c
>   - osd.4
>   - osd.5
> 
> 1. Rook creates a single PDB that covers all OSDs with maxUnavailable=1.
> This would avoid the alert, and would also allow one osd to be down at any
> given time, which is fully supported by OCS already.
> 2. When Rook sees an OSD go down (for example, osd.0 goes down):
>    a. Create a PDB for each failure domain (zones y and z) with
> maxUnavailable=0 where the OSD did *not* go down. 
>    b. Delete the original PDB that covers all OSDs
>    c. Now all remaining OSDs in zone x would be allowed to be drained
> 3. When Rook sees the OSDs are back up and all PGs are clean

@Travis Between 2(c) and 3, if the PGs take time to become active+clean and another node (say Node y) is drained, then this drain will be blocked because of 2(a) and we will still see the alert. What do you think?

>    a. Restore the PDB that covers all OSDs with maxUnavailable=1
>    b. Delete the PDBs (in zone y and z) where maxUnavailable=0
>

Comment 43 Travis Nielsen 2020-10-28 17:31:27 UTC
After further discussion with Rohan and Santosh, we agree that the design in https://bugzilla.redhat.com/show_bug.cgi?id=1861104#c29 will also cover the case of the host failure domain. Each host will be considered its own failure domain in that case (instead of zone), so a PDB would be created with maxUnavailable=0 for all hosts where the drain is not in progress. Therefore, we don't need the separate design using canary pods as described in Comment #38. There was also a limitation with the design in #38 where we couldn't watch for PG health before continuing to the next node. With the design in #29 we will be able to monitor the PG health. 

> @Travis Between 2(c) and 3, if the PGs take time to become active+clean and another node (say Node y) is drained, then this drain will be blocked because of 2(a) and we will still see the alert. What do you think?

@Santosh Per our discussion, we believe it is ok to see the alerts during node drain since this is a temporary condition. The alerts were mainly a problem when they always existed even when nodes were not being drained. 

The final question is how to deal with PG health during the node drains. Different clusters may have different sensitivity to any storage downtime, while some admins may prefer to proceed with drains even if there is some level of downtime. I don't believe one solution is possible to satisfy all clusters, therefore I recommend the following approach...

The OCS UI (and the StorageCluster/CephCluster CRs) need to expose a setting for how to handle node drains, in order from most to least restrictive:
1. Most restrictive: Always wait for PGs to be clean in a failure domain before allowing the next failure domain to continue with node drains.
2. Restrictive with a timeout: Wait for PGs to be clean between each failure domain restart, but continue after a timeout (e.g. 5 minutes). This would allow clusters to proceed and drain by default, but use a best-effort approach to keep the data available.
3. Restricted only by PDBs: Use PDBs as proposed, but don't ever wait for PGs.
4. Least restrictive: Don't even create PDBs for OSDs. Allow storage downtime during node drain, OCP upgrades, etc.

As the default, I would recommend approach 2 to avoid nodes being blocked indefinitely.
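
For reference, a rough sketch of how such a setting could be expressed on the CephCluster CR, using the disruptionManagement section that upstream Rook exposes (field names and availability may vary by version; pgHealthCheckTimeout is assumed here as the knob matching option 2):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ocs-storagecluster-cephcluster
  namespace: openshift-storage
spec:
  disruptionManagement:
    managePodBudgets: true        # let the Rook operator create and manage the OSD PDBs
    osdMaintenanceTimeout: 30     # minutes a draining failure domain keeps its OSDs flagged for maintenance (noout)
    pgHealthCheckTimeout: 10      # assumed: minutes to wait for clean PGs before unblocking the next failure domain; 0 would wait indefinitely (option 1)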

@Eran Thoughts on making this configurable in the OCS UI as described?

A separate improvement discussed with Annette is to deploy the Mons on the master nodes so that they will be treated directly as part of the control plane. That is already possible today through the placement settings of the StorageCluster CR, but is not widely used.

Comment 44 Ben England 2020-10-29 00:17:01 UTC
about cmt 43:

option 2 needs a longer timeout.  For a baremetal cluster, you can easily spend 5 min rebooting, most of it in the BIOS.

option 1 should be the default - we should *always* default to data safety. The other options create some risk of data unavailability, and the user should be warned and forced to choose that. For example, what if the system is already recovering from an OSD/node failure at the time of the upgrade? The larger the cluster, the more likely this scenario is. As explained in comment 39, this can lead to data being temporarily unwritable during the upgrade process (PGs with only 1 active OSD), enough to take an application offline.

For option 2, depending on how much data is in the Ceph storage pools it may take much longer to bring the PGs to a clean state than it does to proceed to the next node, so you may have cascading PG failures, massive movement of data, etc. that will slow down and disrupt the cluster and even lead to temporary data unavailability (PGs with 1 or even 0 active OSDs).  

Sorry but distributed storage is a very different thing than migrating applications - if  an application has to be taken down, you just restart it, or even live-migrate it, to another node, but data is very expensive and slow to relocate if there is a lot of it.

Comment 45 Travis Nielsen 2020-10-29 16:16:38 UTC
(In reply to Ben England from comment #44)
> about cmt 43:
> 
> option 2 needs a longer timeout.  For a baremetal cluster, you can easily
> spend 5 min rebooting, most of it in the BIOS.

Agreed, perhaps 10 min as a default, but we can make it configurable. 

> option 1 should be the default - we should *always* default to data safety. 
> The other options create some risk of data unavailability, and the user
> should be warned and forced to choose that.   For example, what if the
> system is already recovering from a OSD/node failure at the time of the
> upgrade?   The larger the cluster, the more likely this scenario is.  As
> explained in comment 39, this can lead to data being temporarily unwritable
> during the upgrade process (PGs with only 1 active OSD), enough to take an
> application offline.

We aren't talking about data safety, just data availability, right?

> For option 2, depending on how much data is in the Ceph storage pools it may
> take much longer to bring the PGs to a clean state than it does to proceed
> to the next node, so you may have cascading PG failures, massive movement of
> data, etc. that will slow down and disrupt the cluster and even lead to
> temporary data unavailability (PGs with 1 or even 0 active OSDs).  
> 
> Sorry but distributed storage is a very different thing than migrating
> applications - if  an application has to be taken down, you just restart it,
> or even live-migrate it, to another node, but data is very expensive and
> slow to relocate if there is a lot of it.

Certainly we need to avoid moving data around during the node restarts since it is so expensive, and to avoid cascading failures. But if we can be smart about it in the operator and make option 2 smarter, I would still vote for that as the default in OCS where we don't have a dedicated storage admin. For RHCS I would see option 1 as the default since there is a dedicated admin. I'll defer to PM/field input on what the default should be.

Comment 48 Yaniv Kaul 2020-11-07 15:57:12 UTC
How does the design work with multiple non-overlapping (distinct) pools?
Say I have pool A and pool B, and they are on 3 different zones.
I should be able to restart multiple OSDs (a single node?) per pool - and I do not need to wait for the whole cluster to be in a good shape - I need per pool health.

Comment 49 Travis Nielsen 2020-11-09 19:18:43 UTC
(In reply to Yaniv Kaul from comment #48)
> How does the design work with multiple non-overlapping (distinct) pools?
> Say I have pool A and pool B, and they are on 3 different zones.
> I should be able to restart multiple OSDs (a single node?) per pool - and I
> do not need to wait for the whole cluster to be in a good shape - I need per
> pool health.

Yaniv The design is based on the CRUSH hierarchy, of which the pools are consumers, and all pools in the cluster likely overlap, based on CRUSH rules. All we need to know about the pools is what their failure domain is, and we find the lowest common denominator. If all pools have failureDomain=zone, then we can allow all nodes in the same CRUSH zone to be down. If a pool has failureDomain=rack, that means we can only allow all nodes in a given CRUSH rack to be down.
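
For illustration, the failure domain the operator reads comes from the pool definitions; a pool with a zone failure domain is declared roughly like this (names and values illustrative):

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: ocs-storagecluster-cephblockpool
  namespace: openshift-storage
spec:
  failureDomain: zone       # CRUSH failure domain for this pool; could also be rack or host
  replicated:
    size: 3                 # three replicas spread across failure domains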

Comment 50 Yaniv Kaul 2020-11-09 19:51:58 UTC
(In reply to Travis Nielsen from comment #49)
> (In reply to Yaniv Kaul from comment #48)
> > How does the design work with multiple non-overlapping (distinct) pools?
> > Say I have pool A and pool B, and they are on 3 different zones.
> > I should be able to restart multiple OSDs (a single node?) per pool - and I
> > do not need to wait for the whole cluster to be in a good shape - I need per
> > pool health.
> 
> Yaniv The design is based on the CRUSH hierarchy, of which the pools are
> consumers and all pools in the cluster likely overlap, based on CRUSH rules.
> All we need to know about the pools is what their failure domain is, and we
> find the lowest common denominator. If all pools have failureDomain=zone,
> then we can all nodes in the same CRUSH zone to be down. If a pool has
> failureDomain=rack, that means we can only allow all nodes in a given CRUSH
> rack to be down.

I was under the impression that in the future (4.7? 4.8?) we will support distinct locations, so pools will NOT be shared. I do hope the design takes care of this.

Comment 51 Travis Nielsen 2020-11-09 21:13:38 UTC
(In reply to Yaniv Kaul from comment #50)
> (In reply to Travis Nielsen from comment #49)
> > (In reply to Yaniv Kaul from comment #48)
> > > How does the design work with multiple non-overlapping (distinct) pools?
> > > Say I have pool A and pool B, and they are on 3 different zones.
> > > I should be able to restart multiple OSDs (a single node?) per pool - and I
> > > do not need to wait for the whole cluster to be in a good shape - I need per
> > > pool health.
> > 
> > Yaniv The design is based on the CRUSH hierarchy, of which the pools are
> > consumers and all pools in the cluster likely overlap, based on CRUSH rules.
> > All we need to know about the pools is what their failure domain is, and we
> > find the lowest common denominator. If all pools have failureDomain=zone,
> > then we can all nodes in the same CRUSH zone to be down. If a pool has
> > failureDomain=rack, that means we can only allow all nodes in a given CRUSH
> > rack to be down.
> 
> I was under the impression that in the future (4.7? 4.8?) we will support
> distinct locations, so pools will NOT be shared. I do hope the design takes
> care of this.

Are you referring to pools on different media? (hdd/ssd/nvme)
Those pools won't share OSDs, but the OSDs will be included in the same CRUSH map, so this design would also apply.

Comment 52 Chris Blum 2020-11-10 11:55:06 UTC
If different disk types (hdd/ssd/nvme) are part of the same CRUSH map but do not share the same data (for example, because they back different Ceph pools and are selected by different CRUSH rules), then you can take these OSDs down independently of each other --> HDDs and SSDs can be down at the same time without putting data at risk.

Comment 53 Travis Nielsen 2020-11-13 21:43:33 UTC
The implementation is in final review, testing looking good. https://github.com/rook/rook/pull/6497

Comment 55 Yaniv Kaul 2020-11-15 15:49:45 UTC
(In reply to Travis Nielsen from comment #51)
> (In reply to Yaniv Kaul from comment #50)
> > (In reply to Travis Nielsen from comment #49)
> > > (In reply to Yaniv Kaul from comment #48)
> > > > How does the design work with multiple non-overlapping (distinct) pools?
> > > > Say I have pool A and pool B, and they are on 3 different zones.
> > > > I should be able to restart multiple OSDs (a single node?) per pool - and I
> > > > do not need to wait for the whole cluster to be in a good shape - I need per
> > > > pool health.
> > > 
> > > Yaniv The design is based on the CRUSH hierarchy, of which the pools are
> > > consumers and all pools in the cluster likely overlap, based on CRUSH rules.
> > > All we need to know about the pools is what their failure domain is, and we
> > > find the lowest common denominator. If all pools have failureDomain=zone,
> > > then we can all nodes in the same CRUSH zone to be down. If a pool has
> > > failureDomain=rack, that means we can only allow all nodes in a given CRUSH
> > > rack to be down.
> > 
> > I was under the impression that in the future (4.7? 4.8?) we will support
> > distinct locations, so pools will NOT be shared. I do hope the design takes
> > care of this.
> 
> Are you referring to pools on different media? (hdd/ssd/nvme)
> Those pools won't share OSDs, but the OSDs will be included in the same
> CRUSH map, so this design would also apply.

Same media.

Comment 56 Travis Nielsen 2020-11-19 20:47:52 UTC
The fix has been pushed to the downstream OCS release-4.7 branch. Now to clone the BZ for 4.6.z...

Comment 57 Travis Nielsen 2020-11-19 21:43:22 UTC
Cloned to 4.6.z with https://bugzilla.redhat.com/show_bug.cgi?id=1899743

Comment 61 Travis Nielsen 2021-01-04 18:29:54 UTC
The PDB redesign discussed in Comment 29 and later really addresses the following BZs:
- PDB alert continuously firing: https://bugzilla.redhat.com/show_bug.cgi?id=1788126
- Backport of the redesign to 4.6.z (clone of this BZ): https://bugzilla.redhat.com/show_bug.cgi?id=1899743

The original report of this BZ is only partially addressed by the redesign. The fundamental issue with this BZ is that when the PGs are not clean, OCP will be prevented from upgrading. This will still be true. OCS still uses the PDBs to prevent more OSDs from being evicted than what the cluster can tolerate. If the PGs are unclean, the PDBs may still block the OCP upgrade from happening because OCS is trying to prevent downtime. The case where the OCP upgrade will not be blocked is if the upgrade takes down node(s) in the same zone where the OSD(s) are already down and causing the PGs to be unclean. The OCP upgrade in that zone will be allowed to continue. After that zone is completed, if the OSDs all come up again and the PGs become clean, then the OCP upgrade could continue in the other zones.

Having unclean PGs is not expected regularly and when it does happen, the situation should be resolved to get the PGs to a clean state again. If the upgrade really does need to be forced even while PGs are unclean, the admin needs to be aware of the potential downtime and make this decision explicitly. Currently the only way to override the behavior is to manually delete the PDBs or force delete the OSD pods. IMO this is sufficient control for the admin to override the behavior if downtime really is acceptable. 

@Dan The scenario is improved, though still requires manual override if the PGs are not clean. How about if we close this BZ with the verification of the redesign? If you see the need for a UI option to override the PDBs, I'd suggest opening a new BZ for consideration.

Comment 67 Aviad Polak 2021-01-10 14:34:23 UTC
Tested OCP upgrade with OCS build 4.7.0-222.ci.

before upgrade:
oc exec rook-ceph-tools-8575486ffd-wh4fm -- ceph status
  cluster:
    id:     8b1345d7-5a43-45e8-8fea-c82e719ae68e
    health: HEALTH_WARN
            Degraded data redundancy: 58286/290997 objects degraded (20.030%), 88 pgs degraded
 
  services:
    mon: 3 daemons, quorum a,b,c (age 108m)
    mgr: a(active, since 108m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 2s), 3 in (since 107m)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   3 pools, 192 pgs
    objects: 97.00k objects, 377 GiB
    usage:   1.1 TiB used, 405 GiB / 1.5 TiB avail
    pgs:     35.938% pgs not active
             58286/290997 objects degraded (20.030%)
             88 active+undersized+degraded
             69 peering
             35 active+undersized
 
  io:
    client:   921 B/s rd, 371 KiB/s wr, 1 op/s rd, 2 op/s wr

after upgrade:
oc exec rook-ceph-tools-8575486ffd-2pwx2 -- ceph status
  cluster:
    id:     8b1345d7-5a43-45e8-8fea-c82e719ae68e
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 18m)
    mgr: a(active, since 18m)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 17m), 3 in (since 2h)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   3 pools, 192 pgs
    objects: 97.12k objects, 378 GiB
    usage:   1.1 TiB used, 404 GiB / 1.5 TiB avail
    pgs:     192 active+clean
 
  io:
    client:   853 B/s rd, 9.0 KiB/s wr, 1 op/s rd, 1 op/s wr


PDB before upgrade:
Sun 10 Jan 12:55:05 UTC 2021
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     74m
rook-ceph-mon-pdb                                 2               N/A               1                     74m
rook-ceph-osd                                     N/A             1                 1                     70m


PDB after upgrade:
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     162m
rook-ceph-mon-pdb                                 2               N/A               1                     162m
rook-ceph-osd                                     N/A             1                 1                     15m


upgrade job: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/16145/

It looks good to me.

Comment 78 errata-xmlrpc 2021-05-19 09:14:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

