Bug 1782241 - HCO-managed configmaps are dropped and recreated, dropping all user-provided configuration
Summary: HCO-managed configmaps are dropped and recreated, dropping all user-provided configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Installation
Version: 2.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 2.3.0
Assignee: Simone Tiraboschi
QA Contact: Irina Gulina
URL:
Whiteboard:
Depends On: 1781336
Blocks:
 
Reported: 2019-12-11 13:45 UTC by Igor Braginsky
Modified: 2020-05-04 19:10 UTC
CC List: 17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 19:10:37 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHEA-2020:2011 (last updated 2020-05-04 19:10:48 UTC)

Description Igor Braginsky 2019-12-11 13:45:36 UTC
Description of problem:
There are several issues happening:
1. VMs are totally removed from the cluster
2. Customizations of the configmap are lost; no VDDK-INIT image is mentioned there


Version-Release number of selected component (if applicable): CNV 2.2.0-10


How reproducible: Occasionally


Steps to Reproduce:
1. Configure cluster for migration
2. Migrate one or more VMs
3. Wait some time - up to several days

Actual results:


Expected results:


Additional info:

Comment 1 Nelly Credi 2019-12-11 14:21:58 UTC
@Igor, do you only see this with imported VMs (vs newly created CNV regular VMs)?

Comment 2 Nelly Credi 2019-12-11 14:26:12 UTC
and can you please add must-gather logs?

Comment 3 Igor Braginsky 2019-12-12 09:47:25 UTC
@Nelly:
must-gather returns multiple errors; I can't make it work.
oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.2.0-5 --dest-dir=~/cnv_logs
....
[must-gather-drssm] POD error: the server doesn't have a resource type "inspect"
[must-gather-drssm] POD WARNING: openshift-must-gather has been DEPRECATED. Use `oc adm inspect` instead.
[must-gather-drssm] POD error: the server doesn't have a resource type "inspect"
[must-gather-drssm] OUT gather logs unavailable: unexpected EOF
[must-gather-drssm] OUT waiting for gather to complete
[must-gather-drssm] OUT gather never finished: timed out waiting for the condition
UnauthorizedUnauthorizederror: gather never finished for pod must-gather-drssm: timed out waiting for the condition

Can you or someone else connect to the env and check?
ssh -lcnv-qe-jenkins -i ~/.ssh/cnv-qe-jenkins.key 10.0.98.50

Comment 4 Marek Libra 2019-12-16 12:14:12 UTC
I believe we should focus on finding the reason why the configmap is reset to its default.
Is kubevirt being reinstalled (removed or updated)? Is the VM CRD stable?

Comment 5 Radim Hrazdil 2019-12-18 15:21:58 UTC
I believe I have just witnessed the same issue as Igor has described.

During a test run of the kubevirt console integration tests I noticed at one point that the 'Virtual Machines' page started to display 404: Not Found. I checked the pods and noticed that a bunch of pods were being terminated and recreated, and HCO Readiness said 'ContainersNotReady'.

See this pastebin with the openshift-cnv pods: http://pastebin.test.redhat.com/822851 (the pods with age 66m are the ones that were recreated).

Since we often install CNV from brew (I'm not 100% sure that Igor's environment has CNV_SOURCE set to brew as well, but it likely does), when a new container image is available in brew, HCO pulls the new image and updates itself.
During the update, some of the operators must have recreated the VirtualMachine CRD, which explains the 404: Not Found displayed on the Virtual Machines page in the UI.
There may be other situations where the UI displays this error (@Tomas Jelinek could surely elaborate on that), but a missing CRD is definitely one of them.

Now the important thing: when you create some VMs and then delete the VirtualMachine CRD, all the virtual machines are deleted as well. Apparently, this is how VMs may disappear after some time.

Additionally, when all the pods came back up I noticed that the kubevirt-storage-class-defaults configMap had been overwritten to its default state.
The whole process is nicely captured in this Jenkins run, where none of the test cases that require a DataVolume work after the update, because hostpath-provisioner requires accessMode: ReadWriteOnce and volumeMode: Filesystem, which is not the default setting of this configMap.


This doesn't seem to be related to V2V or User Interface, moving to Installation.

Comment 6 Maayan Hadasi 2020-01-01 09:33:37 UTC
This issue hasn't reproduced since I started working on the CNV 2.2.0-11 environment on Monday, 2 days ago.
I was the first one to use this CNV version after it was successfully deployed. So far (~48 hours) the bug hasn't occurred.

Comment 8 Maayan Hadasi 2020-01-12 09:28:02 UTC
(In reply to Maayan Guetta from comment #6)
> This issue hasn't reproduced since I started working on the CNV 2.2.0-11
> environment on Monday, 2 days ago.
> I was the first one to use this CNV version after it was successfully deployed.
> So far (~48 hours) the bug hasn't occurred.


Update regarding bug reproduction:
The bug reproduced ~13 days after the environment was deployed.
(The issue wasn't seen between Dec 28 and Jan 10.)

Comment 9 Maya Rashish 2020-01-12 10:27:16 UTC
Duplicate of bz1786475?

Comment 11 Radim Hrazdil 2020-01-12 13:53:37 UTC
Nelly, currently when we install CNV the kubevirt-storage-class-defaults configMap contains the following by default:
data:
  accessMode: ReadWriteMany
  volumeMode: Block
  local-sc.accessMode: ReadWriteOnce
  local-sc.volumeMode: Filesystem

Which means that DataVolumes created with rook-ceph-block are able to bind, but DVs with the hostpath-provisioner SC cannot.

hostpath-provisioner.accessMode: ReadWriteOnce
hostpath-provisioner.volumeMode: Filesystem 
need to be added in order to use the hostpath storage class (see the sketch below).
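
A minimal sketch of the resulting configMap, assuming the defaults quoted above (the hostpath-provisioner.* keys are the additions):

apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-storage-class-defaults
  namespace: openshift-cnv
data:
  accessMode: ReadWriteMany
  volumeMode: Block
  local-sc.accessMode: ReadWriteOnce
  local-sc.volumeMode: Filesystem
  hostpath-provisioner.accessMode: ReadWriteOnce
  hostpath-provisioner.volumeMode: Filesystem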

I've been told that configuration of the configmap is the admin's responsibility, so I don't consider it necessary to open a bug for having each SC used by QE included by default - users would most likely need to add their own configuration anyway.

Comment 12 Maya Rashish 2020-01-12 13:57:26 UTC
(In reply to Radim Hrazdil from comment #11)
> Nelly, currently when we install CNV the kubevirt-storage-class-defaults configMap contains
> the following by default:
> data:
>   accessMode: ReadWriteMany
>   volumeMode: Block
>   local-sc.accessMode: ReadWriteOnce
>   local-sc.volumeMode: Filesystem
> 
> Which means that DataVolumes created with rook-ceph-block are able to bind,
> but DVs with hostpath-provisioner SC cannot.
> 
> hostpath-provisioner.accessMode: ReadWriteOnce
> hostpath-provisioner.volumeMode: Filesystem 
> need to be added in order to use the hostpath storage class.
> 
> I've been told that configuration of the configmap is the admin's responsibility,
> so I don't consider it necessary to open a bug for having
> each SC used by QE included by default - users would most likely need to add
> their own configuration anyway.

this is caused by https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L1045-L1074

Comment 13 Nelly Credi 2020-01-12 14:22:25 UTC
It is OK if this is the admin's responsibility; my question is whether at some point this configuration will get overridden,
e.g. what will happen on upgrade?
Maya, do you know what will happen in the above situation?

Comment 14 Brett Thurber 2020-01-13 03:02:40 UTC
(In reply to Nelly Credi from comment #13)
> It is OK if this is admin responsibility, my question is if at some point
> this configuration will get overridden?
> e.g what will happen in upgrade?
> Maya, do you know what will happen in the above situation?

Our code checks to see if the config map exists and if so it doesn't overwrite it.
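
For context, a minimal sketch of that create-if-missing pattern (assuming a controller-runtime client; this is an illustration, not the actual HCO code):

package hco

import (
  "context"

  corev1 "k8s.io/api/core/v1"
  apierrors "k8s.io/apimachinery/pkg/api/errors"
  "k8s.io/apimachinery/pkg/types"
  "sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureConfigMap creates `desired` only when no configmap with that name
// exists yet; an existing (possibly user-edited) configmap is left untouched.
func ensureConfigMap(ctx context.Context, c client.Client, desired *corev1.ConfigMap) error {
  found := &corev1.ConfigMap{}
  key := types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}
  err := c.Get(ctx, key, found)
  if apierrors.IsNotFound(err) {
    return c.Create(ctx, desired)
  }
  // err is nil when the configmap was found; nothing is written in that case.
  return err
}

So for the intermittent loss reported here, the configmap has to actually be gone (or appear gone) at the time of the Get.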

Comment 15 Dan Kenigsberg 2020-01-13 08:02:32 UTC
(In reply to Radim Hrazdil from comment #11)

> 
> Which means that DataVolumes created with rook-ceph-block are able to bind,
> but DVs with hostpath-provisioner SC cannot.
> 
> hostpath-provisioner.accessMode: ReadWriteOnce
> hostpath-provisioner.volumeMode: Filesystem 
> need to be added in order to use the hostpath storage class.
> 
> I've been told that configuration of the configmap is the admin's responsibility,
> so I don't consider it necessary to open a bug for having
> each SC used by QE included by default - users would most likely need to add
> their own configuration anyway.

Radim, on which CR are you configuring this? I believe you should have a new StorageClass for HPP, as documented in https://github.com/openshift/openshift-docs/pull/18445/files#diff-de904d837fcd269743acf30b44b4127cR6 . You should not edit the default StorageClass generated by HCO.

Comment 16 Radim Hrazdil 2020-01-13 09:09:10 UTC
Hello Dan, this is configured in kubevirt-storage-class-defaults configMap which is in openshift-cnv namespace.
The console UI resolves the accessMode and volumeMode for DVs from this configMap.

Comment 17 Nelly Credi 2020-01-13 09:24:40 UTC
If the UI requires editing this, I think we need to figure out how to do it as part of the deployment (maybe the HPP SC deployment), so it will be persistent.
@Tomas, how should we handle this?

Comment 18 Igor Braginsky 2020-01-13 09:30:43 UTC
Nelly, we can't do that.
As part of the flow, we need to compile a docker image which uses a proprietary VMware library.
Each user needs to download it themselves; the license doesn't allow Red Hat to share it with anyone else.

Comment 19 Tomas Jelinek 2020-01-13 09:34:06 UTC
@Nelly, Igor is right for the v2v config map.

And you cannot do it at deployment time for the kubevirt-storage-class-defaults config map either. That config map contains the defaults per storage class, and if at some point the admin decides to create a new storage class, they should have the ability to configure their preferred access/volume mode for it.

Comment 20 Nelly Credi 2020-01-13 09:49:50 UTC
Ack, so my last question is whether we *think* it is persistent, i.e. won't get overridden after upgrade.
Hopefully Simone can answer this one.

Comment 22 Maya Rashish 2020-01-13 10:37:26 UTC
We spent some time looking at the UI code.

It looks like the code will only use the configmap volumeMode if the storageclass doesn't have its own volumeMode.
Perhaps if we define volumeMode to Filesystem in the HPP storageclass it will do the right thing.

Comment 23 Nelly Credi 2020-01-13 10:57:03 UTC
sounds good for HPP, thanks Maya!
what should we do about the v2v configmap? Or what do we think will happen to this configmap after upgrade?

Comment 24 Tomas Jelinek 2020-01-13 12:45:59 UTC
(In reply to Maya Rashish from comment #22)
> We spent some time looking at the UI code.
> 
> It looks like the code will only use the configmap volumeMode if the
> storageclass doesn't have its own volumeMode.

where do you see this logic? AFAIK the UI:
1: looks at the config map to check the value for the given storage class
2: if the given storage class is not in the config map, the UI looks into the same config map and checks the default mode in it

> Perhaps if we define volumeMode to Filesystem in the HPP storageclass it
> will do the right thing.

Comment 25 Maya Rashish 2020-01-13 13:03:00 UTC
The code I am looking at is:
https://github.com/kubevirt/web-ui-components/blob/master/src/selectors/configmap/sc-defaults.js#L10-L13

It looks like:
1. If the storageClass has a default volumeMode, use that
2. Otherwise, use some defaults from the config map (bad values for us)

But perhaps this is totally the wrong code, I don't have a UI setup around to try changes out.

Comment 26 Tomas Jelinek 2020-01-13 13:11:55 UTC
(In reply to Maya Rashish from comment #25)
> The code I am looking at is:
> https://github.com/kubevirt/web-ui-components/blob/master/src/selectors/
> configmap/sc-defaults.js#L10-L13

this is the correct code

> 
> It looks like:
> 1. If the storageClass has a default volumeMode, use that
> 2. Otherwise, use some defaults from the config map (bad values for us)
> 

nope:

first line:
const defaultVolumeMode = get(storageClassConfigMap, ['data', 'volumeMode'], PVC_VOLUMEMODE_DEFAULT);
this looks at the "kubevirt-storage-class-defaults" config map and reads the default volume mode. If it is not there, it falls back to the ultimate default hardcoded in the code (which is Filesystem).

second line:
return get(storageClassConfigMap, ['data', `${scName}.volumeMode`], defaultVolumeMode);
this looks into the same "kubevirt-storage-class-defaults" config map and returns the volume mode configured for the given storage class. If it is not there, the default value calculated on the first line is returned.

long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.
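
Paraphrased as a sketch in Go (not the actual UI code):

// volumeModeFor mirrors the two-step lookup described above.
func volumeModeFor(data map[string]string, scName string) string {
  // Global default from the configmap, else the hardcoded
  // ultimate default ("Filesystem").
  def, ok := data["volumeMode"]
  if !ok {
    def = "Filesystem"
  }
  // Per-storage-class override, e.g. "hostpath-provisioner.volumeMode".
  if mode, ok := data[scName+".volumeMode"]; ok {
    return mode
  }
  return def
}

So with the defaults from comment 11 (volumeMode: Block, no hostpath-provisioner.* keys), volumeModeFor(data, "hostpath-provisioner") returns "Block", which is why the per-SC keys have to be added.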

> But perhaps this is totally the wrong code, I don't have a UI setup around
> to try changes out.

Comment 27 Brett Thurber 2020-01-13 13:58:52 UTC
(In reply to Nelly Credi from comment #23)
> sounds good for HPP, thanks Maya!
> what should we do about v2v configmap? or what do we think will happen to
> this configmap after upgrade?

My understanding is that if the config map exists it shouldn't be overwritten.  Asking Jason for input as he put this together.

Comment 28 Ric Garcia 2020-01-13 15:38:17 UTC
Not a blocker for CNV 2.2. Moving to CNV 2.3.

Comment 29 Dan Kenigsberg 2020-01-13 16:15:10 UTC
I assume we would not be seeing the configmap reset now that we make sure to deploy only one HCO (never both upstream and downstream) and have solved the cross-namespace ownership bugs. However, I believe we have a UI/storage bug here.

> long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.

In my opinion this is a mistake. We may have two storage classes: one for NFS, the other for iSCSI. The first needs volumeMode=file and the second volumeMode=block. We cannot use the same value for both. IMHO, the UI should read from the relevant storage class; if volumeMode is missing there, it can resort to a global default.

Specifically, searching for `site:docs.openshift.com "kubevirt-storage-class-default"` in a famous search engine leads to a single release note; I could not find any documentation of its content, so we should not expect customers to modify it. If my previous paragraph is wrong, we should at least document the configmap.

Comment 30 Nelly Credi 2020-01-13 16:35:28 UTC
We will test to verify that this is indeed the case, 
but if it is not, we will insist on moving it back to 2.2.0 as a release blocker.

I'm resetting the priority & severity based on the assumption mentioned above.

Comment 32 Adam Litke 2020-01-13 21:36:39 UTC
(In reply to Dan Kenigsberg from comment #29)
> I assume we would not be seeing the configmap reset now that we make sure to
> deploy one HCO (never both upstream and downstream), and having solved the
> cross-namespace ownership bugs. However, I believe we have a UI/storage bug
> here
> 
> > long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.
> 
> In my opinion this is a mistake. We may have two storage classes: one for
> NFS, the other for iSCSI. The first needs volumeMode=file and the second -
> volumeMode=block. We cannot use the same value from both. IMHO, UI should
> read from the relevant storage class. If volumeMode is missing there it can
> resort to a global default.

This is exactly how it works.  That configMap has per-storageclass configuration and a global default.  Keep in mind that the original default was set with OCS in mind (with ReadWriteMany/Block to ensure that live migration works out of the box).  Is this something we want to change?

> 
> Specifically searching for `site:docs.openshift.com
> "kubevirt-storage-class-default"` in a famous search engine leads to a
> single release note; I could not find any documentation for its content, so
> we should not expect customers to modify it. If my former paragraph is
> wrong, we should at least document the configmap.

Sounds like a good idea.

Comment 33 Adam Litke 2020-01-13 21:38:32 UTC
Created https://issues.redhat.com/browse/CNV-3781 for the documentation.  I put it against a 2.3 epic.  Let me know if we want to press the documentation team to have it ready for 2.2 instead.

Comment 34 Jason Montleon 2020-01-14 14:13:40 UTC
(In reply to Brett Thurber from comment #27)
> (In reply to Nelly Credi from comment #23)
> > sounds good for HPP, thanks Maya!
> > what should we do about v2v configmap? or what do we think will happen to
> > this configmap after upgrade?
> 
> My understanding is that if the config map exists it shouldn't be
> overwritten.  Asking Jason for input as he put this together.

Yes, this should be true of the v2v-vmware configmap as well.
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L988-L998

Unless there is a subtle bug here that's causing intermittent issues, the configmap shouldn't be getting overwritten. From logs on affected systems we've seen several passes of the configmap being found; then it seems to be gone for some reason and gets recreated. We haven't been able to get to the bottom of why the configmap is gone, or why HCO believes it is.

Comment 35 Tomas Jelinek 2020-01-14 15:12:48 UTC
@Adam: I remember there was a discussion about whether we want to read the volume/access mode from a storage class or from a config map. Do you remember why the config map won?

Comment 36 Adam Litke 2020-01-14 21:17:16 UTC
Yes.  Storage Classes are cluster scoped resources and most users do not have permission to examine them.  Read access to a configmap can be granted without exposing potentially sensitive storage class information to unprivileged users.
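
For illustration, such a read grant could look roughly like this (the Role name here is hypothetical):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-storage-class-defaults  # hypothetical name
  namespace: openshift-cnv
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["kubevirt-storage-class-defaults"]
  verbs: ["get"]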

Comment 37 Tomas Jelinek 2020-01-15 07:27:13 UTC
Thank you Adam, that makes a lot of sense! In light of this answer, resetting the title of the BZ.

Comment 38 Dan Kenigsberg 2020-01-15 07:45:23 UTC
> This is exactly how it works.  That configMap has per-storageclass configuration and a global default. 

I did not understand that the configMap has per-storageclass configuration - I stand corrected.

It still smells wrong to place this in a kubevirt-specific entity. I think a nicer option would have been to annotate the StorageClass with this information (and possibly have a kubevirt controller copy it to the configmap); see the sketch below.
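
For illustration only, that suggestion might look something like this (the annotation keys are hypothetical; no such API exists today):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hostpath-provisioner
  annotations:
    kubevirt.io/default-access-mode: ReadWriteOnce  # hypothetical key
    kubevirt.io/default-volume-mode: Filesystem     # hypothetical key
provisioner: kubevirt.io/hostpath-provisioner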

> Unless there is a subtle bug here that's causing intermittent issues, the configmap shouldn't be getting overwritten. From logs on affected systems we've seen several passes of the
> configmap being found; then it seems to be gone for some reason and gets recreated. We haven't been able to get to the bottom of why the configmap is gone, or why HCO believes it is.

I would guess that the configmap was rewritten because it was first removed together with the rest of HCO.

Comment 39 Nelly Credi 2020-01-15 18:39:46 UTC
I agree with Dan that this configmap doesn't sound like the proper place to put user (even admin) configuration.
Can we open RFEs (one for storage and one for migration) to change this in the next release (or the current one, if we think this will break the upgrade flow)?

Second, should this bug move to ON_QA to verify that, now that owner refs are fixed, the configmap doesn't get overridden anymore? Or are we expecting another fix for this?

Third, should we have an RFE for the UI to not read from the CM?

Comment 40 Tomas Jelinek 2020-01-16 08:48:35 UTC
> third, should we have an RFE for UI to not read from the CM?

I would wait to see if and how the backend will be changed. If it is changed as Dan suggests, no UI change is needed.

Comment 43 Nelly Credi 2020-01-22 13:46:42 UTC
We need to verify that after the upgrade the configmap stays as-is.

Comment 45 Irina Gulina 2020-01-28 14:42:10 UTC
Verification logs attached in the previous comment.

Mind that it was tested in the scope of the general upgrade test via CLI:

An OCP 4.2 cluster with CNV 2.1 + 2 VMs on different nodes + an edited storage cm was successfully upgraded to OCP 4.3 and CNV 2.2. The VMs were still there and running; they were then stopped to complete the virt-launcher update, and started again. The edited cm was not reset to default.

Comment 46 Adam Litke 2020-01-28 18:41:04 UTC
As I explained in comment 36, we considered storage classes when initially designing this feature and determined that they are not the correct place to annotate this information, due to their privileged nature in Kubernetes. As such, I do not agree with the request to create an RFE to change the current design to use Storage Classes. I would be happy to discuss your other ideas for improved usability outside the context of this bug.

Comment 49 errata-xmlrpc 2020-05-04 19:10:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2011

