Description of problem:
A running VM shuts down after running for several minutes, with a reported SIGTERM being sent to all processes.

Version-Release number of selected component (if applicable):

oc get csv -n openshift-cnv
NAME                                           DISPLAY                         VERSION             REPLACES                                       PHASE
kubevirt-hyperconverged-operator.4.14.0-1876   OpenShift Virtualization        4.14.0-1876         kubevirt-hyperconverged-operator.4.14.0-1867   Succeeded
odr-cluster-operator.v4.14.0-123.stable        Openshift DR Cluster Operator   4.14.0-123.stable   odr-cluster-operator.v4.14.0-117.stable        Succeeded
openshift-pipelines-operator-rh.v1.11.1        Red Hat OpenShift Pipelines     1.11.1                                                             Succeeded
volsync-product.v0.7.4                         VolSync                         0.7.4               volsync-product.v0.7.3                         Succeeded

Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-08-11-055332
Kubernetes Version: v1.27.4+deb2c60

How reproducible:
100%

Steps to Reproduce:
1. Deploy the VM to the OpenShift Virtualization cluster from the RHACM hub - the VM is successfully deployed
2. Start the VM with 'virtctl start vm' - the VM is running
3. Access the VM console with 'virtctl console vm', log in, and write data files
4. After about 10 minutes the VM shuts down with the message below
5. Restart and access the VM; the same thing happens. Reproduced this multiple times.

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[  687.879014] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[  687.880156] sd 1:0:0:0: [sda] Stopping disk
[  687.973945] reboot: Power down
You were disconnected from the console. This has one of the following reasons:
 - another user connected to the console of the target vm
 - network issues
websocket: close 1006 (abnormal closure): unexpected EOF

Actual results:
VM shuts down unexpectedly with SIGTERM sent to all processes.

Expected results:
VM should remain up and running.

Additional info:

oc get pvc -n kevin-dr
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
sample-vm-pvc   Bound    pvc-c8112912-8ac9-4537-adaf-c9fd6089dee7   2Gi        RWX            ocs-external-storagecluster-ceph-rbd   42h
tmp-pvc         Bound    pvc-b08f240f-e828-49bb-9cf4-44ed8e8d9174   954Mi      RWO            ocs-external-storagecluster-ceph-rbd   7d22h

oc get vm -n kevin-dr
NAME        AGE   STATUS    READY
sample-vm   42h   Stopped   False

[kgoldbla@localhost Metro_DR]$ virtctl start sample-vm -n kevin-dr
VM sample-vm was scheduled to start

[kgoldbla@localhost Metro_DR]$ virtctl console sample-vm -n kevin-dr
Successfully connected to sample-vm console. The escape sequence is ^]

login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
sample-vm login: cirros
Password:
$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   1M  0 disk
vda     252:0    0   2G  0 disk
|-vda1  252:1    0   2G  0 part /
`-vda15 252:15   0   8M  0 part

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[  687.879014] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[  687.880156] sd 1:0:0:0: [sda] Stopping disk
[  687.973945] reboot: Power down
You were disconnected from the console.
This has one of the following reasons:
 - another user connected to the console of the target vm
 - network issues
websocket: close 1006 (abnormal closure): unexpected EOF

oc get vm sample-vm -n kevin-dr -oyaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    apps.open-cluster-management.io/hosting-subscription: kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1
    apps.open-cluster-management.io/reconcile-option: merge
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
  creationTimestamp: "2023-09-04T16:08:30Z"
  finalizers:
  - kubevirt.io/virtualMachineControllerFinalize
  generation: 13
  labels:
    app: kev-vm-dvtemplate-odr-metro-2
    app.kubernetes.io/part-of: kev-vm-dvtemplate-odr-metro-2
    appname: vm-dvtemplate-odr-metro
    apps.open-cluster-management.io/reconcile-rate: medium
  name: sample-vm
  namespace: kevin-dr
  resourceVersion: "26056866"
  uid: cdbe619e-31f7-4778-a354-a6a2e11cacfd
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      labels:
        appname: vm-dvtemplate-odr-metro
      name: sample-vm-pvc
    spec:
      source:
        registry:
          url: docker://quay.io/alitke/cirros:latest
      storage:
        resources:
          requests:
            storage: 2Gi
        storageClassName: ocs-external-storagecluster-ceph-rbd
  running: false
  template:
    metadata:
      annotations:
        vm.kubevirt.io/flavor: small
        vm.kubevirt.io/os: fedora
        vm.kubevirt.io/workload: server
      creationTimestamp: null
      labels:
        kubevirt.io/size: small
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootdisk
          - disk: {}
            name: cloudinit
          interfaces:
          - macAddress: 02:69:36:00:00:00
            masquerade: {}
            model: virtio
            name: default
          networkInterfaceMultiqueue: true
          rng: {}
        features:
          acpi: {}
        machine:
          type: pc-q35-rhel8.6.0
        resources:
          requests:
            memory: 2Gi
      evictionStrategy: LiveMigrate
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - name: rootdisk
        persistentVolumeClaim:
          claimName: sample-vm-pvc
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            user: cirros
            password: drftw!
            chpasswd:
              expire: false
        name: cloudinit
status:
  conditions:
  - lastProbeTime: "2023-09-06T11:06:40Z"
    lastTransitionTime: "2023-09-06T11:06:40Z"
    message: VMI does not exist
    reason: VMINotExists
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: LiveMigratable
  desiredGeneration: 13
  observedGeneration: 13
  printableStatus: Stopped
  volumeSnapshotStatuses:
  - enabled: true
    name: rootdisk
  - enabled: false
    name: cloudinit
    reason: Snapshot is not supported for this volumeSource type [cloudinit]
@kbidarka Hi, any updates on this bz?
Hi Kevin,
A few questions which you could update in the bug:

1) What is the RHACM hub?
2) How is this VM different from a normal VM created on the cluster? Asking as we can see: apps.open-cluster-management.io/hosting-subscription: kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1
3) In your view, is this a blocker bug for 4.14? Is it preventing you from testing any other feature?
(In reply to Kedar Bidarkar from comment #2)
> Hi Kevin,
> A few questions which you could update in the bug:
>
> 1) What is the RHACM hub?

Red Hat Advanced Cluster Management.

> 2) How is this VM different from a normal VM created on the cluster?
> Asking as we can see: apps.open-cluster-management.io/hosting-subscription:
> kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1

It should not be different. On the Red Hat Advanced Cluster Management hub I create a subscription-based application (which includes the VM YAML), which deploys the VM to the target OpenShift cluster.

> 3) In your view, is this a blocker bug for 4.14? Is it preventing you from
> testing any other feature?

I am able to continue my testing; I am able to restart and access the VM after it abruptly shuts down. That said, the VM just shutting down is serious, so the bug should definitely get attention.
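For reference, a minimal sketch of what such a subscription-based application can look like on the ACM hub (the names, namespace, and git URL below are placeholders for illustration, not the actual resources used in this setup; placement is omitted for brevity):

apiVersion: apps.open-cluster-management.io/v1
kind: Channel
metadata:
  name: vm-repo                      # placeholder channel name
  namespace: kevin-dr
spec:
  type: Git
  pathname: https://example.com/org/vm-manifests.git   # placeholder repo containing the VM YAML
---
apiVersion: apps.open-cluster-management.io/v1
kind: Subscription
metadata:
  name: vm-subscription              # placeholder subscription name
  namespace: kevin-dr
spec:
  channel: kevin-dr/vm-repo          # <namespace>/<channel-name>
  # placement/placementRef omitted for brevity

ACM then applies whatever VM definition is committed in the referenced repo to the managed cluster and keeps reconciling it against that definition.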
@Kevin, also, could you please share the must-gather log the next time you hit this issue?
The must-gather link shared was: https://drive.google.com/drive/folders/1LGIxKdDXQFkWqlEagMjOa-BPi4Wru1BE
@kgoldbla I cannot see anything relevant in the uploaded must-gather; in fact, I see no VMs running at all, and the namespace 'kevin-dr' does not exist. Can you share something that exhibits the issue?
I think this may be due to a misunderstanding of how to manipulate VMs that are part of a GitOps application. In Git, the VM is defined as powered off. When you use virtctl to start it, you are changing the app from its original definition. Eventually ACM will notice the difference and reconcile it (stopping the VM). If you want to start the VM, you should update the spec in the git repository.
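As a concrete illustration (a minimal sketch; 'running' is the standard KubeVirt VM spec field visible in the dump above, but the file path is hypothetical), starting the VM the GitOps way would mean editing the VM manifest committed in the application's repo, e.g. vm/sample-vm.yaml:

spec:
  # was: running: false  -> ACM keeps reconciling the VM back to Stopped
  running: true           # commit and push; ACM propagates the change and the VM starts

rather than running 'virtctl start' against the cluster, which ACM will eventually revert.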
Kevin, please see if you can reproduce this without using virtctl start/stop. You can use virtctl console, but lifecycle operations must be driven by updates to the VM definition stored in the application's git repo.
After discussing with Kevin, we concluded that this is not a bug: the VMs are being reconciled by RH ACM. This is happening because in the git repo the VM spec has the `running` field set to `false`, which leads to the system always trying to stop the VM as soon as it is started by other means (using virtctl or the UI). Thanks, Adam, for the suggestion, by the way. Having said that, to improve usability, the UX could be improved by docs and the UI possibly reflecting that the VM is owned by GitOps and that any modification will be reverted. @alitke thoughts? I'm closing this as 'NOTABUG' and Kevin will open an RFE for the UI to improve this situation.
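For reference, the GitOps ownership is already visible on the VM object itself: any resource carrying the ACM hosting-subscription annotation (as seen in the dump above) will be reconciled back to its git-stored definition:

metadata:
  annotations:
    apps.open-cluster-management.io/hosting-subscription: kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1
    apps.open-cluster-management.io/reconcile-option: merge

The proposed docs/UI improvement could key off this annotation to warn that manual start/stop operations will be reverted.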
(In reply to Adam Litke from comment #8)
> Kevin, please see if you can reproduce this without using virtctl
> start/stop. You can use virtctl console, but lifecycle operations must be
> driven by updates to the VM definition stored in the application's git repo.

@alitke I will try to reproduce this once I get the environment back from the ODF team. I will add this step to my test cases.