Description of problem:
There are several issues happening:
1. VMs are totally removed from the cluster
2. Customizations of the configmap are lost; no VDDK-INIT image is mentioned there

Version-Release number of selected component (if applicable):
CNV 2.2.0-10

How reproducible:
Occasionally, from time to time

Steps to Reproduce:
1. Configure the cluster for migration
2. Migrate one or more VMs
3. Wait some time - up to several days

Actual results:

Expected results:

Additional info:
@Igor, do you only see this with imported VMs (vs newly created CNV regular VMs)?
and can you please add must-gather logs?
@Nelly: must-gather is returning multiple errors, I can't make it work.

oc adm must-gather --image=registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-cnv-must-gather-rhel8:v2.2.0-5 --dest-dir=~/cnv_logs
....
[must-gather-drssm] POD error: the server doesn't have a resource type "inspect"
[must-gather-drssm] POD WARNING: openshift-must-gather has been DEPRECATED. Use `oc adm inspect` instead.
[must-gather-drssm] POD error: the server doesn't have a resource type "inspect"
[must-gather-drssm] OUT gather logs unavailable: unexpected EOF
[must-gather-drssm] OUT waiting for gather to complete
[must-gather-drssm] OUT gather never finished: timed out waiting for the condition
UnauthorizedUnauthorizederror: gather never finished for pod must-gather-drssm: timed out waiting for the condition

Can you or someone else connect to the env and check?
ssh -lcnv-qe-jenkins -i ~/.ssh/cnv-qe-jenkins.key 10.0.98.50
I believe we should focus on finding the reason why the configmap is reset to its default. Is kubevirt being reinstalled (removed or updated)? Is the VM CRD stable?
I believe I have just witnessed the same issue as Igor has described. During a test run of the kubevirt console integration tests I noticed at one point that the 'Virtual Machines' page started to display 404: Not Found. I checked the pods and noticed that a bunch of pods were being terminated and recreated; HCO Readiness said 'ContainersNotReady'. See this pastebin with the openshift-cnv pods: http://pastebin.test.redhat.com/822851 . The pods with age 66m are the ones that were recreated.

Since we often install CNV from brew (I'm not 100% sure that Igor's environment has CNV_SOURCE set to brew as well, but it's likely that it does), when a new container image is available in brew, HCO pulls the new image and updates it. During the update, some of the operators must have recreated the VirtualMachine CRD, which explains the 404: Not Found displayed on the Virtual Machines page in the UI. There may be other situations where the UI displays this error and @Tomas Jelinek would surely be able to elaborate on that, but a missing CRD is definitely one of them.

Now the important thing. When you create some VMs and delete the VirtualMachine CRD, all the virtual machines are deleted as well. Apparently, this is how VMs may disappear after some time.

Additionally, when all the pods came back up I noticed that the kubevirt-storage-class-defaults configMap was overwritten to its default state. The whole process is nicely captured in this jenkins run, where none of the test cases that require a DataVolume work after the update, because hostpath-provisioner requires accessMode: ReadWriteOnce and volumeMode: Filesystem, which is not the default setting of this configMap.

This doesn't seem to be related to V2V or User Interface, moving to Installation.
This issue hasn't reproduced since I started working on the CNV 2.2.0-11 environment on Monday, 2 days ago. I was the first one to use this CNV version after it was successfully deployed. Until now (~48 hours) this bug hasn't occurred.
(In reply to Maayan Guetta from comment #6)
> This issue wasn't reproduced since I worked on the CNV 2.2.0-11 environment
> on Monday - 2 days ago
> I was the 1st one who used this CNV version after it successfully deployed.
> Till now (~ 48 hours) this bug didn't occur

Update regarding bug reproduction:
The bug reproduced ~13 days after the environment was deployed (the issue wasn't seen between Dec 28 and Jan 10).
Duplicate of bz1786475?
Nelly, currently when we install CNV the storage configMap contains the following by default:

data:
  accessMode: ReadWriteMany
  volumeMode: Block
  local-sc.accessMode: ReadWriteOnce
  local-sc.volumeMode: Filesystem

This means that DataVolumes created with rook-ceph-block are able to bind, but DVs with the hostpath-provisioner SC cannot.

hostpath-provisioner.accessMode: ReadWriteOnce
hostpath-provisioner.volumeMode: Filesystem
need to be added in order to use the hostpath storage class.

I've been told that configuration of the configMap is the admin's responsibility, so I don't consider it necessary to open a bug for having each SC used by QE included by default - users would most likely need to add their own configuration anyway.
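For reference, a sketch of what the edited configMap could look like with the hostpath-provisioner entries added. This is an assumed full-manifest reconstruction based on the keys mentioned in this comment, not a dump from a real cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubevirt-storage-class-defaults
  namespace: openshift-cnv
data:
  # cluster-wide defaults (set with OCS in mind)
  accessMode: ReadWriteMany
  volumeMode: Block
  # per-storage-class overrides shipped by default
  local-sc.accessMode: ReadWriteOnce
  local-sc.volumeMode: Filesystem
  # entries that need to be added for the hostpath storage class
  hostpath-provisioner.accessMode: ReadWriteOnce
  hostpath-provisioner.volumeMode: Filesystem
```

Any `<scName>.accessMode` / `<scName>.volumeMode` pair overrides the cluster-wide default for that storage class only.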
(In reply to Radim Hrazdil from comment #11)
> Nelly, currently when we install CNV the storage configMap contains
> the following by default:
> data:
>   accessMode: ReadWriteMany
>   volumeMode: Block
>   local-sc.accessMode: ReadWriteOnce
>   local-sc.volumeMode: Filesystem
>
> Which means this datavolumes created with rook-ceph-block are able to bind,
> but DVs with hostpath-provisioner SC cannot.
>
> hostpath-provisioner.accessMode: ReadWriteOnce
> hostpath-provisioner.volumeMode: Filesystem
> needs to be added in order to use hostpath storage class.
>
> I've been told that configuration of the configmap is admin responsibility,
> so I don't consider it neccessary to open a bug for having
> each SC used by QE included by default - users would most likely need to add
> their own configuration anyway.

this is caused by https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L1045-L1074
It is OK if this is the admin's responsibility; my question is whether at some point this configuration will get overridden, e.g. what will happen on upgrade?
Maya, do you know what will happen in the above situation?
(In reply to Nelly Credi from comment #13)
> It is OK if this is admin responsibility, my question is if at some point
> this configuration will get overridden?
> e.g what will happen in upgrade?
> Maya, do you know what will happen in the above situation?

Our code checks to see if the config map exists and if so it doesn't overwrite it.
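A minimal sketch of the "create only if missing" pattern described above. The real implementation is Go inside the HCO reconcile loop (linked in an earlier comment); the function and data names here are illustrative stand-ins, not the actual operator code:

```python
# Sketch: HCO-style reconcile that creates the configMap with defaults
# only when it does not exist, and leaves an existing one (including
# admin edits) untouched.

DEFAULT_DATA = {
    "accessMode": "ReadWriteMany",
    "volumeMode": "Block",
}


def ensure_configmap(cluster: dict, name: str) -> dict:
    """Create the configMap with defaults only if it is missing."""
    if name not in cluster:
        # configMap is gone (or never existed): recreate with defaults
        cluster[name] = dict(DEFAULT_DATA)
    return cluster[name]
```

Note the corollary this bug hinges on: if the configMap object is ever deleted for some other reason, the next reconcile recreates it with defaults, which would look exactly like the "reset" that was observed.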
(In reply to Radim Hrazdil from comment #11)
> Which means this datavolumes created with rook-ceph-block are able to bind,
> but DVs with hostpath-provisioner SC cannot.
>
> hostpath-provisioner.accessMode: ReadWriteOnce
> hostpath-provisioner.volumeMode: Filesystem
> needs to be added in order to use hostpath storage class.
>
> I've been told that configuration of the configmap is admin responsibility,
> so I don't consider it neccessary to open a bug for having
> each SC used by QE included by default - users would most likely need to add
> their own configuration anyway.

Radim, on which CR are you configuring this? I believe you should have a new StorageClass for HPP, as documented in https://github.com/openshift/openshift-docs/pull/18445/files#diff-de904d837fcd269743acf30b44b4127cR6 . You should not edit the default StorageClass generated by HCO.
Hello Dan, this is configured in kubevirt-storage-class-defaults configMap which is in openshift-cnv namespace. The console UI resolves the accessMode and volumeMode for DVs from this configMap.
If the UI requires editing this, I think we need to figure out how to do it as part of the deployment (maybe the HPP SC deployment), so it will be persistent.
@Tomas, how should we handle this?
Nelly, we can't do that. As part of the flow, we need to compile a docker image which uses a VMware proprietary library. Each user needs to download it themselves; the license doesn't allow Red Hat to share it with anyone else.
@Nelly, Igor is right for the v2v config map.

And you can't do it at deployment time for the kubevirt-storage-class-defaults config map either. The kubevirt-storage-class-defaults config map contains the defaults per storage class, and if at some point the admin decides to create a new storage class, they should have the ability to configure their preferred access/volume mode for it.
Ack, so my last question is whether we *think* it is persistent - that it won't get overridden after upgrade. Hopefully Simone can answer this one.
We spent some time looking at the UI code. It looks like the code will only use the configmap volumeMode if the storageclass doesn't have its own volumeMode. Perhaps if we define volumeMode to Filesystem in the HPP storageclass it will do the right thing.
Sounds good for HPP, thanks Maya!
What should we do about the v2v configmap? Or what do we think will happen to this configmap after upgrade?
(In reply to Maya Rashish from comment #22)
> We spent some time looking at the UI code.
>
> It looks like the code will only use the configmap volumeMode if the
> storageclass doesn't have its own volumeMode.

where do you see this logic?

AFAIK the UI:
1: looks at the config map to check the value for the given storage class
2: if the given storage class is not in the config map, the UI looks into the same config map and checks the default mode in it

> Perhaps if we define volumeMode to Filesystem in the HPP storageclass it
> will do the right thing.
The code I am looking at is:
https://github.com/kubevirt/web-ui-components/blob/master/src/selectors/configmap/sc-defaults.js#L10-L13

It looks like:
1. If the storageClass has a default volumeMode, use that
2. Otherwise, use some defaults from the config map (bad values for us)

But perhaps this is totally the wrong code; I don't have a UI setup around to try changes out.
(In reply to Maya Rashish from comment #25)
> The code I am looking at is:
> https://github.com/kubevirt/web-ui-components/blob/master/src/selectors/
> configmap/sc-defaults.js#L10-L13

this is the correct code

> It looks like:
> 1. If the storageClass has a default volumeMode, use that
> 2. Otherwise, use some defaults from the config map (bad values for us)

nope:

first line:
const defaultVolumeMode = get(storageClassConfigMap, ['data', 'volumeMode'], PVC_VOLUMEMODE_DEFAULT);
this looks at the "kubevirt-storage-class-defaults" config map and reads the default volume mode. If it is not there, it falls back to the ultimate default hardcoded in the code (which is filesystem)

second line:
return get(storageClassConfigMap, ['data', `${scName}.volumeMode`], defaultVolumeMode);
this looks into the same "kubevirt-storage-class-defaults" config map and returns the volume mode configured for the given storage class. If it is not there, the default value calculated on the first line is returned.

long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.

> But perhaps this is totally the wrong code, I don't have a UI setup around
> to try changes out.
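The two-line lookup Tomas walks through can be sketched as follows. This is a Python stand-in for the JavaScript in sc-defaults.js, written only to illustrate the resolution order (per-SC key, then configMap default, then the hardcoded fallback); it is not the actual UI code:

```python
# Illustrative mirror of the sc-defaults.js lookup logic.

PVC_VOLUMEMODE_DEFAULT = "Filesystem"  # ultimate hardcoded fallback


def get_volume_mode(storage_class_config_map: dict, sc_name: str) -> str:
    data = storage_class_config_map.get("data", {})
    # "first line": default from the configMap, else the hardcoded fallback
    default_volume_mode = data.get("volumeMode", PVC_VOLUMEMODE_DEFAULT)
    # "second line": per-storage-class key, else the default computed above
    return data.get(f"{sc_name}.volumeMode", default_volume_mode)
```

So with the default CNV 2.2 configMap, a DV on `local-sc` resolves to `Filesystem` via its per-SC key, while any storage class without its own key (e.g. hostpath-provisioner before the admin edit) falls back to the cluster-wide `Block`.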
(In reply to Nelly Credi from comment #23)
> sounds good for HPP, thanks Maya!
> what should we do about v2v configmap? or what do we think will happen to
> this configmap after upgrade?

My understanding is that if the config map exists it shouldn't be overwritten. Asking Jason for input as he put this together.
Not a blocker for CNV 2.2. Moving to CNV 2.3.
I assume we would not be seeing the configmap reset now that we make sure to deploy only one HCO (never both upstream and downstream), and having solved the cross-namespace ownership bugs. However, I believe we have a UI/storage bug here.

> long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.

In my opinion this is a mistake. We may have two storage classes: one for NFS, the other for iSCSI. The first needs volumeMode=file and the second needs volumeMode=block. We cannot use the same value for both. IMHO, the UI should read from the relevant storage class. If volumeMode is missing there, it can resort to a global default.

Specifically, searching for `site:docs.openshift.com "kubevirt-storage-class-default"` in a famous search engine leads to a single release note; I could not find any documentation for its content, so we should not expect customers to modify it. If my previous paragraph is wrong, we should at least document the configmap.
We will test to verify that this is indeed the case, but if it is not, we will insist on moving it back to 2.2.0 as a release blocker. I'm resetting the priority & severity with the current assumption mentioned above.
(In reply to Dan Kenigsberg from comment #29)
> I assume we would not be seeing the configmap reset now that we make sure to
> deploy one HCO (never both upstream and downstream), and having solved the
> cross-namespace ownership bugs. However, I believe we have a UI/storage bug
> here
>
> > long story short, the "kubevirt-storage-class-defaults" config map is the only place we can add any configuration the UI reads.
>
> In my opinion this is a mistake. We may have two storage classes: one for
> NFS, the other for iSCSI. The first needs volumeMode=file and the second -
> volumeMode=block. We cannot use the same value from both. IMHO, UI should
> read from the relevant storage class. If volumeMode is missing there it can
> resort to a global default.

This is exactly how it works. That configMap has per-storageclass configuration and a global default.

Keep in mind that the original default was set with OCS in mind (with ReadWriteMany/Block to ensure that live migration works out of the box). Is this something we want to change?

> Specifically searching for `site:docs.openshift.com
> "kubevirt-storage-class-default"` in a famous search engine leads to a
> single release note; I could not find any documentation for its content, so
> we should not expect customers to modify it. If my former paragraph is
> wrong, we should at least document the configmap.

Sounds like a good idea.
Created https://issues.redhat.com/browse/CNV-3781 for the documentation. I put it against a 2.3 epic. Let me know if we want to press the documentation team to have it ready for 2.2 instead.
(In reply to Brett Thurber from comment #27)
> (In reply to Nelly Credi from comment #23)
> > sounds good for HPP, thanks Maya!
> > what should we do about v2v configmap? or what do we think will happen to
> > this configmap after upgrade?
>
> My understanding is that if the config map exists it shouldn't be
> overwritten. Asking Jason for input as he put this together.

Yes, this should be true of the v2v-vmware configmap as well.
https://github.com/kubevirt/hyperconverged-cluster-operator/blob/master/pkg/controller/hyperconverged/hyperconverged_controller.go#L988-L998

Unless there is a subtle bug here that's causing intermittent issues, the configmap shouldn't be getting overwritten. From logs on affected systems we've seen several passes of the configmap being found, and then it seems to be gone for some reason and gets recreated. We haven't been able to get to the bottom of why the configmap is gone or why HCO believes it is.
@Adam: I remember there was a discussion about whether we want to read the volume/access mode from a storage class or from a config map. Do you remember why the config map won?
Yes. Storage Classes are cluster scoped resources and most users do not have permission to examine them. Read access to a configmap can be granted without exposing potentially sensitive storage class information to unprivileged users.
Thank you Adam, that makes a lot of sense! In light of this answer, resetting the title of the BZ.
> This is exactly how it works. That configMap has per-storageclass configuration and a global default.

I did not understand that the configMap has per-storageclass configuration - I stand corrected. It still smells wrong to place this in a kubevirt-specific entity. I think that a nicer option would have been to annotate the StorageClass with this information (and possibly have a kubevirt controller copy it to the configmap).

> Unless there is a subtle bug here that's causing intermittent issues the configmap shouldn't be getting overwritten. From logs on affected systems we've seen several passes of the configmap being found and then it seems to be gone for some reason and gets recreated. We haven't been able to get to the bottom of why the configmap is gone or why HCO believes it is.

I would guess that the configmap was rewritten because it was first removed together with the rest of HCO.
I agree with Dan that this configmap doesn't sound like the proper place to put user (even if it's an admin) configuration. Can we open RFEs (one for storage & one for migration) to change this in the next release (or the current one, if we think this will break the upgrade flow)?

Second, should this bug move to ON_QA to verify that, now that owner refs are fixed, the configmap doesn't get overridden anymore? Or are we expecting another fix for this?

Third, should we have an RFE for the UI to not read from the CM?
> third, should we have an RFE for UI to not read from the CM?

I would wait to see if and how the backend will be changed. If it is changed as Dan suggests, no UI change is needed.
We need to verify that after upgrade the configmap stays as-is.
Verification logs attached in the previous comment. Note that it was tested in the scope of the general upgrade test via CLI: an OCP 4.2 cluster with CNV 2.1 + 2 VMs on different nodes + an edited storage CM was successfully upgraded to OCP 4.3 and CNV 2.2. The VMs are still there and running; the VMs were then stopped to complete the virt-launcher update, and started again. The edited CM has not been reset to default.
As I explained in comment 36, we considered storage classes when initially designing this feature and determined that they are not the correct place to annotate this information due to their privileged nature in kubernetes. As such I do not agree with the request to create an RFE to change the current design to use Storage Classes. I would be happy to discuss your other ideas for improved usability outside the context of this bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:2011