Bug 2237705 - Running VM shuts down with sigterm error
Summary: Running VM shuts down with sigterm error
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: sgott
QA Contact: Kedar Bidarkar
URL:
Whiteboard: virtualization
Depends On:
Blocks:
Reported: 2023-09-06 13:09 UTC by Kevin Alon Goldblatt
Modified: 2023-10-11 08:57 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-10-11 08:54:10 UTC
Target Upstream Version:
Embargoed:


Links
Red Hat Issue Tracker CNV-32696 (last updated 2023-09-06 13:09:34 UTC)

Description Kevin Alon Goldblatt 2023-09-06 13:09:17 UTC
Description of problem:
A running VM shuts down after several minutes of uptime, with the guest console reporting SIGTERM sent to all processes.

Version-Release number of selected component (if applicable):
oc get csv -n openshift-cnv
NAME                                           DISPLAY                         VERSION             REPLACES                                       PHASE
kubevirt-hyperconverged-operator.4.14.0-1876   OpenShift Virtualization        4.14.0-1876         kubevirt-hyperconverged-operator.4.14.0-1867   Succeeded
odr-cluster-operator.v4.14.0-123.stable        Openshift DR Cluster Operator   4.14.0-123.stable   odr-cluster-operator.v4.14.0-117.stable        Succeeded
openshift-pipelines-operator-rh.v1.11.1        Red Hat OpenShift Pipelines     1.11.1                                                             Succeeded
volsync-product.v0.7.4                         VolSync                         0.7.4               volsync-product.v0.7.3                         Succeeded


Client Version: 4.14.0-ec.3
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-08-11-055332
Kubernetes Version: v1.27.4+deb2c60

How reproducible:
100%

Steps to Reproduce:
1. Deploy the VM to the OpenShift Virtualization cluster from the RHACM hub - the VM deploys successfully
2. Start the VM with 'virtctl start vm' - the VM is running
3. Access the VM console with 'virtctl console vm', log in, and write data files
4. After about 10 minutes the VM shuts down with the message below
5. Restart and access the VM - the same thing happens; reproduced multiple times

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[  687.879014] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[  687.880156] sd 1:0:0:0: [sda] Stopping disk
[  687.973945] reboot: Power down

You were disconnected from the console. This has one of the following reasons:
 - another user connected to the console of the target vm
 - network issues
websocket: close 1006 (abnormal closure): unexpected EOF

Actual results:
VM shuts down unexpectedly with SIGTERM sent to all processes

Expected results:
VM should remain up and running

Additional info:

oc get pvc -n kevin-dr
NAME            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
sample-vm-pvc   Bound    pvc-c8112912-8ac9-4537-adaf-c9fd6089dee7   2Gi        RWX            ocs-external-storagecluster-ceph-rbd   42h
tmp-pvc         Bound    pvc-b08f240f-e828-49bb-9cf4-44ed8e8d9174   954Mi      RWO            ocs-external-storagecluster-ceph-rbd   7d22h

oc get vm -n kevin-dr
NAME        AGE   STATUS    READY
sample-vm   42h   Stopped   False
[kgoldbla@localhost Metro_DR]$ virtctl start sample-vm -n kevin-dr
VM sample-vm was scheduled to start
[kgoldbla@localhost Metro_DR]$ virtctl console sample-vm -n kevin-dr
Successfully connected to sample-vm console. The escape sequence is ^]

login as 'cirros' user. default password: 'gocubsgo'. use 'sudo' for root.
sample-vm login: cirros
Password: 
$ lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda       8:0    0   1M  0 disk 
vda     252:0    0   2G  0 disk 
|-vda1  252:1    0   2G  0 part /
`-vda15 252:15   0   8M  0 part 

The system is going down NOW!
Sent SIGTERM to all processes
Sent SIGKILL to all processes
Requesting system poweroff
[  687.879014] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[  687.880156] sd 1:0:0:0: [sda] Stopping disk
[  687.973945] reboot: Power down

You were disconnected from the console. This has one of the following reasons:
 - another user connected to the console of the target vm
 - network issues
websocket: close 1006 (abnormal closure): unexpected EOF



oc get vm sample-vm -n kevin-dr -oyaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    apps.open-cluster-management.io/hosting-subscription: kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1
    apps.open-cluster-management.io/reconcile-option: merge
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
  creationTimestamp: "2023-09-04T16:08:30Z"
  finalizers:
  - kubevirt.io/virtualMachineControllerFinalize
  generation: 13
  labels:
    app: kev-vm-dvtemplate-odr-metro-2
    app.kubernetes.io/part-of: kev-vm-dvtemplate-odr-metro-2
    appname: vm-dvtemplate-odr-metro
    apps.open-cluster-management.io/reconcile-rate: medium
  name: sample-vm
  namespace: kevin-dr
  resourceVersion: "26056866"
  uid: cdbe619e-31f7-4778-a354-a6a2e11cacfd
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      labels:
        appname: vm-dvtemplate-odr-metro
      name: sample-vm-pvc
    spec:
      source:
        registry:
          url: docker://quay.io/alitke/cirros:latest
      storage:
        resources:
          requests:
            storage: 2Gi
        storageClassName: ocs-external-storagecluster-ceph-rbd
  running: false
  template:
    metadata:
      annotations:
        vm.kubevirt.io/flavor: small
        vm.kubevirt.io/os: fedora
        vm.kubevirt.io/workload: server
      creationTimestamp: null
      labels:
        kubevirt.io/size: small
    spec:
      architecture: amd64
      domain:
        cpu:
          cores: 1
          sockets: 1
          threads: 1
        devices:
          disks:
          - disk:
              bus: virtio
            name: rootdisk
          - disk: {}
            name: cloudinit
          interfaces:
          - macAddress: 02:69:36:00:00:00
            masquerade: {}
            model: virtio
            name: default
          networkInterfaceMultiqueue: true
          rng: {}
        features:
          acpi: {}
        machine:
          type: pc-q35-rhel8.6.0
        resources:
          requests:
            memory: 2Gi
      evictionStrategy: LiveMigrate
      networks:
      - name: default
        pod: {}
      terminationGracePeriodSeconds: 180
      volumes:
      - name: rootdisk
        persistentVolumeClaim:
          claimName: sample-vm-pvc
      - cloudInitNoCloud:
          userData: |
            #cloud-config
            user: cirros
            password: drftw!
            chpasswd:
              expire: false
        name: cloudinit
status:
  conditions:
  - lastProbeTime: "2023-09-06T11:06:40Z"
    lastTransitionTime: "2023-09-06T11:06:40Z"
    message: VMI does not exist
    reason: VMINotExists
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: null
    status: "True"
    type: LiveMigratable
  desiredGeneration: 13
  observedGeneration: 13
  printableStatus: Stopped
  volumeSnapshotStatuses:
  - enabled: true
    name: rootdisk
  - enabled: false
    name: cloudinit
    reason: Snapshot is not supported for this volumeSource type [cloudinit]

Comment 1 Kevin Alon Goldblatt 2023-09-18 11:44:13 UTC
@kbidarka Hi, any updates on this BZ?

Comment 2 Kedar Bidarkar 2023-09-18 13:02:16 UTC
Hi Kevin,
A few questions; please update the bug with the answers.

1) What is the RHACM hub?

2) How is this VM different from a normal VM created on the cluster?
Asking because we can see: apps.open-cluster-management.io/hosting-subscription: kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1

3) In your view, is this a blocker bug for 4.14? Is it preventing you from testing any other feature?

Comment 3 Kevin Alon Goldblatt 2023-09-18 14:12:48 UTC
(In reply to Kedar Bidarkar from comment #2)
> Hi Kevin,
> A few questions; please update the bug with the answers.
> 
> 1) What is the RHACM hub?
Red Hat Advanced Cluster Management.

> 2) How is this VM different from a normal VM created on the cluster?
> Asking because we can see: apps.open-cluster-management.io/hosting-subscription:
> kevin-dr/kev-vm-dvtemplate-odr-metro-2-subscription-1
It should not be different. On the Red Hat Advanced Cluster Management hub I create a subscription-based application (which includes the VM YAML) that deploys the VM to the target OpenShift cluster.

> 3) In your view, is this a blocker bug for 4.14? Is it preventing you from
> testing any other feature?
I am able to continue my testing; I can restart and access the VM after it abruptly shuts down. That said, a VM shutting down on its own is serious, so the bug should definitely get attention.

Comment 4 Kedar Bidarkar 2023-09-21 17:15:43 UTC
@Kevin, could you also please share the must-gather log the next time you hit this issue?

Comment 5 Kedar Bidarkar 2023-10-06 17:12:00 UTC
The must-gather link shared was: https://drive.google.com/drive/folders/1LGIxKdDXQFkWqlEagMjOa-BPi4Wru1BE

Comment 6 Antonio Cardace 2023-10-10 12:58:32 UTC
@kgoldbla I cannot see anything relevant in the uploaded must-gather; in fact, I see no VMs running at all, and the namespace 'kevin-dr' does not exist.

Can you share something that exhibits the issue?

Comment 7 Adam Litke 2023-10-10 15:23:14 UTC
I think this may be due to a misunderstanding of how to manipulate VMs that are part of a GitOps application. In Git, the VM is defined as powered off. When you use virtctl to start it, you are changing the app from its original definition. Eventually ACM will notice the difference and reconcile it (stopping the VM). If you want to start the VM, you should update the spec in the git repository.
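
For illustration, a minimal sketch of that git-side change, assuming the manifest quoted in the description is what lives in the application's repo (the file path is hypothetical; only the `running` field changes):

# In the application's git repo (path hypothetical, e.g. vms/sample-vm.yaml)
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: sample-vm
  namespace: kevin-dr
spec:
  # Only this field changes; dataVolumeTemplates and template stay exactly
  # as in the full manifest quoted in the description.
  running: true   # was: false -- the value ACM keeps reconciling back

Once this is committed and pushed, ACM converges the cluster VM to the running state instead of reverting a virtctl start.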

Comment 8 Adam Litke 2023-10-10 15:25:08 UTC
Kevin, please see if you can reproduce this without using virtctl start/stop. You can use virtctl console, but lifecycle operations must be driven by updates to the VM definition stored in the application's git repo.

Comment 9 Antonio Cardace 2023-10-11 08:54:10 UTC
After discussing with Kevin we concluded that this is not a bug: the VMs are being reconciled by RHACM. This happens because in the git repo the VM spec has the `running` field set to `false`, which leads to the system always trying to stop the VM as soon as it is started by any other means (virtctl or the UI).
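
For reference, a sketch of an alternative that KubeVirt itself supports (not something proposed in this thread, so treat it as an assumption to verify against ACM's reconcile behavior): replacing `running` with `runStrategy: Manual` in the git-side spec. The two fields are mutually exclusive, and under Manual, virtctl start/stop are recorded in the VM's status subresource rather than its spec, so ACM would see no spec drift to revert:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: sample-vm
  namespace: kevin-dr
spec:
  # runStrategy replaces spec.running; the two cannot both be set.
  # With Manual, virtctl start/stop act via status.stateChangeRequests,
  # leaving the spec in git untouched for ACM to reconcile against.
  runStrategy: Manual
  # dataVolumeTemplates and template unchanged from the manifest above.

The documented route remains changing the desired state in git, as Adam describes.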

Thanks for the suggestion, Adam.

Having said that, to improve usability, the docs and the UI could reflect that the VM is owned by GitOps and that any modification will be reverted.
@alitke thoughts?

I'm closing this as 'NOTABUG'; Kevin will open an RFE for the UI to improve this situation.

Comment 10 Kevin Alon Goldblatt 2023-10-11 08:57:02 UTC
(In reply to Adam Litke from comment #8)
> Kevin, please see if you can reproduce this without using virtctl
> start/stop. You can use virtctl console, but lifecycle operations must be
> driven by updates to the VM definition stored in the application's git repo.

@alitke I will try to reproduce this once I get the environment back from the ODF team. I will add this step to my test cases.

