Bug 2026179

Summary:	Machineset modification requires scale to 2 which deletes attached local storage
Product:	OpenShift Container Platform	Reporter:	Peter Larsen <plarsen>
Component:	Cloud Compute	Assignee:	OCP on RHV Team <ocprhvteam>
Cloud Compute sub component:	oVirt Provider	QA Contact:	Michael Burman <mburman>
Status:	CLOSED WONTFIX	Docs Contact:
Severity:	unspecified
Priority:	unspecified	CC:	jpasztor
Version:	4.9
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-31 11:47:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Peter Larsen 2021-11-24 01:40:22 UTC

Description of problem:
According to https://docs.openshift.com/container-platform/4.9/machine_management/modifying-machineset.html the proper procedure changing a machineset requires scaling the machineset to 0 to allow the nodes to reconfigure according to the changes (more memory, cpu etc). However, this process deletes all localstorage volumes - most likely as a cascading delete of storage used by the node.

Deleting the volumes means what-ever was running looses state data. For ODF in particular, it means this option is not valid or available.

Errors like this in the event log:
0s Warning FailedMount pod/rook-ceph-mon-b-7c5d8c8567-f55z8 Unable to attach or mount volumes: unmounted volumes=[ceph-daemon-data], unattached volumes=[rook-ceph-log rook-ceph-crash ceph-daemon-data kube-api-access-dwlcz rook-config-override rook-ceph-mons-keyring]: timed out waiting for the condition
0s Warning FailedAttachVolume pod/rook-ceph-mon-b-7c5d8c8567-f55z8 AttachVolume.Attach failed for volume "pvc-e7abf995-779e-4db3-bf9d-457eb53e71c0" : rpc error: code = Unknown desc = Fault reason is "Operation Failed". Fault detail is "[Cannot attach Virtual Disk. The specified disk does not exist.]". HTTP response code is "400". HTTP response message is "400 Bad Request".

Version-Release number of selected component (if applicable):

4.9.8

How reproducible:
Yes :D

Steps to Reproduce:
1. Setup ODF - using 3 nodes, make the nodes infra nodes etc.
2. Once ODF is provisioned, create a sample database that uses cephfs or similar storage classes.
3. Edit the machineset for the storage/infra nodes. Add a new node label or something simple.
4. Scale machineset to 0 - wait for all nodes to be removed.
5. Scale machineset to 3 - the nodes get created using the new machineset - HOWEVER ODF will not initialize as existing PV metadata is no longer available on RHV.

Actual results:
ODF will not come backup as the PVs being used are no longer present on the cluster.

Expected results:
Cluster comes up and the existing PVs are reattached to ODF allowing the storage platform to continue with the changed machineset.

Additional info:
I would expect a similar issue can easily exist on other platforms (not just oVirt). I have not tested it there.

Suspect changing the process of modifying machinesets so it doesn't require scaling down, or only delete the "osdisk" of a VM when the machine-operator removes a VM. Do not delete all attached volumes - detach everything and just delete the osDisk.

Comment 1 Gal Zaidman 2021-11-24 07:12:32 UTC

So this BZ will be solved as a side effect of https://bugzilla.redhat.com/show_bug.cgi?id=2024328.
But I'm still interested in understanding what is going on here because that might point to a different issue.
If I understand correctly you have a 3 master 3 workers openshift cluster on oVirt, the 3 workers run some workload which requires persistent storage like a database, I assume that the pods which run the workload use PVC/PV to get the volume, and in the oVirt engine you see 3 VMs each with 2 Disks, one for the VM OS and one as the PVC.
Then you scale the workers down to 0.
What should happen, is that the nodes will drain and the pods will:
1. move to some Pending state
OR
2. move to different nodes that are available if there are any (on your setup I guess there aren't any additional schedulable workers).
In either case, when the pods are drained from the node then the Disk should be detached by the CSI driver and the PV should remain.

So the question that comes to mind is how did the PVC disk remain attached to the VM on deletion? did you see that the draining process of the workers was successful? or did you somehow force the deletion of the nodes?
Can you please reproduce this and provide a must gather output so we can try and take a look?

Comment 2 Peter Larsen 2021-11-24 20:36:30 UTC

(In reply to Gal Zaidman from comment #1)
> So this BZ will be solved as a side effect of
> https://bugzilla.redhat.com/show_bug.cgi?id=2024328.
> But I'm still interested in understanding what is going on here because that
> might point to a different issue.
> If I understand correctly you have a 3 master 3 workers openshift cluster on
> oVirt, the 3 workers run some workload which requires persistent storage
> like a database, I assume that the pods which run the workload use PVC/PV to
> get the volume, and in the oVirt engine you see 3 VMs each with 2 Disks, one
> for the VM OS and one as the PVC.
> Then you scale the workers down to 0.
> What should happen, is that the nodes will drain and the pods will:
> 1. move to some Pending state
> OR
> 2. move to different nodes that are available if there are any (on your
> setup I guess there aren't any additional schedulable workers).
> In either case, when the pods are drained from the node then the Disk should
> be detached by the CSI driver and the PV should remain.
> 
> So the question that comes to mind is how did the PVC disk remain attached
> to the VM on deletion? did you see that the draining process of the workers
> was successful? or did you somehow force the deletion of the nodes?
> Can you please reproduce this and provide a must gather output so we can try
> and take a look?

Thanks for the BZ - indeed that looks like the root cause and will resolve my BZ. 
To answer you directly, I am not sure it matters if a node is successfully drained or not. Existing PV definitions should _not_ be removed by the MachineController when a machine/VM is deleted. I cannot think of any use-case where existing K8S metadata would be impacted by adding/removing compute resources, particular when it comes to persistent data. I had data across 3 nodes - and none of the PVs (7 I think in total) were existing after the scale operation. In other words, this wasn't a exception handling failure - but it seems every node was processed the same way, even when drained successfully.  Why the PV would remain on the node/VM I don't know. I did not look at the RHV metadata before the VMs were removed - but even if they were present on the VM, they should _not_ have been removed (detached, yes). 

Even if the drain failed I would not expect all attached PVs to be deleted, just as BZ #2024328 mentions. The only disk I expect to be deleted is the overlay COW file system that holds the osDisk data, since a new provisioned machine would recreate all of this.

My exact use-case when I found this issues was that a cluster-update failed to update the "infra" nodes, that I used for Red Hat's OpenShift Data Foundation (formerly known as OpenShift Container Storage). This is a CEPH based storage system where an oVirt CSI StorageClass is used to provision a series of PVs for the data management. ODF requires a special set of compute nodes (using a label). It was these 3 nodes that failed to update during an cluster upgrade.  They also lacked the required affinityGroups. Both changes were done to the machineset, but this does not modify the nodes/machines, which requires scaling the machineset to 0.  This results in a lot of pods going to pending state - a few will go in loopback failure as they weren't bound to the nodelabel but requires resources only made available by the pods running on these nodes. After a long wait the VMs/machines where all removed from RHV, and when I scaled the machineset back up to 3, everything had the required new labels and the nodes had the right version. But all the disk-volumes used by the PVs were gone, resulting in errors like those I indicated in this BZ.  This means the disk-volumes were deleted as part of the process of deleting the VMs. To repeat myself, this is not the expected behavior.

The only time I expect all existing PVs/disk-volumes of a cluster to be removed is during an "openshift-install destroy cluster" operation. Not even if I delete a machineset would that be expected behavior.  As long as there is metadata in etcd pointing to valid RHV volumes they should remain 'as is'.  Once the pod(s) that need the storage is recreated on the cluster, the storage would follow.

Comment 3 Gal Zaidman 2021-11-25 07:10:49 UTC

(In reply to Peter Larsen from comment #2)
> (In reply to Gal Zaidman from comment #1)
> > So this BZ will be solved as a side effect of
> > https://bugzilla.redhat.com/show_bug.cgi?id=2024328.
> > But I'm still interested in understanding what is going on here because that
> > might point to a different issue.
> > If I understand correctly you have a 3 master 3 workers openshift cluster on
> > oVirt, the 3 workers run some workload which requires persistent storage
> > like a database, I assume that the pods which run the workload use PVC/PV to
> > get the volume, and in the oVirt engine you see 3 VMs each with 2 Disks, one
> > for the VM OS and one as the PVC.
> > Then you scale the workers down to 0.
> > What should happen, is that the nodes will drain and the pods will:
> > 1. move to some Pending state
> > OR
> > 2. move to different nodes that are available if there are any (on your
> > setup I guess there aren't any additional schedulable workers).
> > In either case, when the pods are drained from the node then the Disk should
> > be detached by the CSI driver and the PV should remain.
> > 
> > So the question that comes to mind is how did the PVC disk remain attached
> > to the VM on deletion? did you see that the draining process of the workers
> > was successful? or did you somehow force the deletion of the nodes?
> > Can you please reproduce this and provide a must gather output so we can try
> > and take a look?
> 
> Thanks for the BZ - indeed that looks like the root cause and will resolve
> my BZ. 
> To answer you directly, I am not sure it matters if a node is successfully
> drained or not. Existing PV definitions should _not_ be removed by the
> MachineController when a machine/VM is deleted.

You are correct it should never be removed, but it does matter if the drain occurred or not, since the machine/node should never be deleted (meaning the delete operation in the cloud provider should never have been called) until:
1. The machine drain was successful.
2. The machine is not responsive to the cluster, like can't be reached due to network or the VM is shut down
If the drain occurred then the PV should have been detached because the pod is not on that machine any more.
I wanted to get the logs to debug it, since I belive that there might be a bug behind this case in one of the flows.
** unless you know the machines were not reachable by the cluster(the 2 case).

> I cannot think of any use-case where existing K8S metadata would be impacted by adding/removing
> compute resources, particular when it comes to persistent data. I had data
> across 3 nodes - and none of the PVs (7 I think in total) were existing
> after the scale operation. In other words, this wasn't a exception handling
> failure - but it seems every node was processed the same way, even when
> drained successfully.  Why the PV would remain on the node/VM I don't know.
> I did not look at the RHV metadata before the VMs were removed - but even if
> they were present on the VM, they should _not_ have been removed (detached,
> yes). 
> 
> Even if the drain failed I would not expect all attached PVs to be deleted,
> just as BZ #2024328 mentions. The only disk I expect to be deleted is the
> overlay COW file system that holds the osDisk data, since a new provisioned
> machine would recreate all of this.
 
If the drain failed I wouldn't expect the machine to be removed

> My exact use-case when I found this issues was that a cluster-update failed
> to update the "infra" nodes, that I used for Red Hat's OpenShift Data
> Foundation (formerly known as OpenShift Container Storage). This is a CEPH
> based storage system where an oVirt CSI StorageClass is used to provision a
> series of PVs for the data management. ODF requires a special set of compute
> nodes (using a label). It was these 3 nodes that failed to update during an
> cluster upgrade.  They also lacked the required affinityGroups. Both changes
> were done to the machineset, but this does not modify the nodes/machines,
> which requires scaling the machineset to 0.  This results in a lot of pods
> going to pending state - a few will go in loopback failure as they weren't
> bound to the nodelabel but requires resources only made available by the
> pods running on these nodes. After a long wait the VMs/machines where all
> removed from RHV,

Wonder what happened here, how much time did it take? did you force the removal?

> and when I scaled the machineset back up to 3, everything
> had the required new labels and the nodes had the right version. But all the
> disk-volumes used by the PVs were gone, resulting in errors like those I
> indicated in this BZ.  This means the disk-volumes were deleted as part of
> the process of deleting the VMs. To repeat myself, this is not the expected
> behavior.
> 
> The only time I expect all existing PVs/disk-volumes of a cluster to be
> removed is during an "openshift-install destroy cluster" operation. Not even
> if I delete a machineset would that be expected behavior.  As long as there
> is metadata in etcd pointing to valid RHV volumes they should remain 'as
> is'.  Once the pod(s) that need the storage is recreated on the cluster, the
> storage would follow.

Comment 4 Peter Larsen 2021-11-25 16:33:59 UTC

(In reply to Gal Zaidman from comment #3)
> (In reply to Peter Larsen from comment #2)
> > 
> > Thanks for the BZ - indeed that looks like the root cause and will resolve
> > my BZ. 
> > To answer you directly, I am not sure it matters if a node is successfully
> > drained or not. Existing PV definitions should _not_ be removed by the
> > MachineController when a machine/VM is deleted.
> 
> You are correct it should never be removed, but it does matter if the drain
> occurred or not, since the machine/node should never be deleted (meaning the
> delete operation in the cloud provider should never have been called) until:
> 1. The machine drain was successful.
> 2. The machine is not responsive to the cluster, like can't be reached due
> to network or the VM is shut down
> If the drain occurred then the PV should have been detached because the pod
> is not on that machine any more.
> I wanted to get the logs to debug it, since I belive that there might be a
> bug behind this case in one of the flows.
> ** unless you know the machines were not reachable by the cluster(the 2
> case).

Unfortunately that cluster no longer exists (for good reasons - no storage means everything started failing). But I can try to reproduce it. I'm still not sure I follow your exact point. I understand that if everything goes well there shouldn't be any outstanding volumes, however you in #2 explain that cases where a node is unresponsive it's removed too, and in THIS case I think my argument is still valid: It should not remove any volume-disks that have associated PVs.  So getting an answer to if the drain succeeded or not doesn't seem to matter - since there is a valid scenario where we cannot just remove everything associated with a VM.

I'll see if I get some cycles this week to reproduce it, but I still don't understand why that matters given what I just wrote?
> > Even if the drain failed I would not expect all attached PVs to be deleted,
> > just as BZ #2024328 mentions. The only disk I expect to be deleted is the
> > overlay COW file system that holds the osDisk data, since a new provisioned
> > machine would recreate all of this.
>  
> If the drain failed I wouldn't expect the machine to be removed

In my case it took probably an hour or so before the scale-down was complete - there was definitely something "not right" preventing some state to be achieved - not sure if that was the drain part or something else. but it eventually did decide to finish up. However, I would definitely not be surprised if an admin in a production environment would "hurry it along" by stopping a VM manually to let the machine-controller remove it. I think that's the scenario you're describing in #1 above as some state/pods may be left on the host and hence when a cascade-delete is done on the VM, the storage is killed.  Even if we're talking about an unexpected error state, that's still the wrong action to take?

> Wonder what happened here, how much time did it take? did you force the
> removal?

A looong time (about 1 hour or so). One machine was removed within 10 minutes or so, but the other 2 stayed around for a long time. Even though as far as I could tell all the pods were in pending state.  Still, for arguments sake, let's assume the machines where stopped not fully drained. The disks should still not have been deleted - detached yes, but not deleted?

Comment 6 Janos Bonic 2022-03-31 11:47:43 UTC

We do not recommend using the local storage operator for production use. Please re-file this as an RFE to support dynamic VM size updates with a customer case attached if there is one.

As a workaround, you can update the VM size from the RHV Manager if needed.

Comment 7 Peter Larsen 2022-03-31 14:42:58 UTC

(In reply to Janos Bonic from comment #6)
> We do not recommend using the local storage operator for production use.
> Please re-file this as an RFE to support dynamic VM size updates with a
> customer case attached if there is one.
> 
> As a workaround, you can update the VM size from the RHV Manager if needed.

Janos - while I agree to close this BZ, I don't agree with your comment. This was resolved via https://bugzilla.redhat.com/show_bug.cgi?id=2024328 and hence not related to the argument presented here.  Support or not - wiping out allocated storage used by CSI on OCP is _not_ the correct procedure, and the BZ I mentioned resolved that.