1930630 – VMs are shutdown an (roughly) hour after upgrades

Keywords:
Status:	CLOSED DUPLICATE of bug 1906496
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	2.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	2.6.2
Assignee:	Jed Lejosne
QA Contact:	Israel Pinto
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-02-19 10:06 UTC by Fabian Deutsch
Modified:	2021-03-19 09:24 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-03-19 09:24:20 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Fabian Deutsch 2021-02-19 10:06:25 UTC

Description of problem that Ruth has discovered:
Tested with  4.7.0-0.nightly-2021-02-17-224627.
Everything was running after the upgrade, but after some time the 2 migratable VMs were signaled to be shut down.

VMI with runstrategy: Always:
{"component":"virt-handler","level":"info","msg":"Processing event b4-ugrade/win10-vm-ocs","pos":"vm.go:1175","timestamp":"2021-02-18T16:09:47.623885Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"win10-vm-ocs","namespace":"b4-ugrade","pos":"vm.go:1177","timestamp":"2021-02-18T16:09:47.623910Z","uid":"83dc4f8d-415b-4e0a-a983-2bf61a97bc74"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Running, reason: Unknown\n","name":"win10-vm-ocs","namespace":"b4-ugrade","pos":"vm.go:1182","timestamp":"2021-02-18T16:09:47.623931Z","uid":"83dc4f8d-415b-4e0a-a983-2bf61a97bc74"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Received Domain Event of type MODIFIED","name":"win10-vm-ocs","namespace":"b4-ugrade","pos":"server.go:78","timestamp":"2021-02-18T16:09:47.633615Z","uid":"83dc4f8d-415b-4e0a-a983-2bf61a97bc74"}
{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-vm-ocs","name":"win10-vm-ocs","namespace":"b4-ugrade","pos":"vm.go:1649","timestamp":"2021-02-18T16:09:47.659708Z","uid":"83dc4f8d-415b-4e0a-a983-2bf61a97bc74"}


VMI with runstrategy: Manual:
{"component":"virt-handler","level":"info","msg":"Processing event b4-ugrade/fed-nfs-vm","pos":"vm.go:1175","timestamp":"2021-02-18T10:44:20.705166Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1177","timestamp":"2021-02-18T10:44:20.705199Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Paused, reason: Migration\n","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1182","timestamp":"2021-02-18T10:44:20.705219Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Received Domain Event of type MODIFIED","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"server.go:78","timestamp":"2021-02-18T10:44:21.179196Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain is in state Shutoff reason Migrated","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:2175","timestamp":"2021-02-18T10:44:21.179311Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","level":"info","msg":"Processing event b4-ugrade/fed-nfs-vm","pos":"vm.go:1175","timestamp":"2021-02-18T10:44:21.179384Z"}
{"component":"virt-handler","kind":"","level":"info","msg":"VMI is in phase: Running\n","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1177","timestamp":"2021-02-18T10:44:21.179454Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Shutoff, reason: Migrated\n","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1182","timestamp":"2021-02-18T10:44:21.179474Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","kind":"VirtualMachineInstance","level":"info","msg":"Using cached UID for vmi found in domain cache","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1350","timestamp":"2021-02-18T10:44:21.207149Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
{"component":"virt-handler","level":"info","msg":"Processing event b4-ugrade/fed-nfs-vm","pos":"vm.go:1175","timestamp":"2021-02-18T10:44:21.207216Z"}
{"component":"virt-handler","kind":"Domain","level":"info","msg":"Domain status: Shutoff, reason: Migrated\n","name":"fed-nfs-vm","namespace":"b4-ugrade","pos":"vm.go:1182","timestamp":"2021-02-18T10:44:21.207263Z","uid":"218bb948-3110-4f77-ab9b-0d31403eae89"}
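
To pick the shutdown signal out of a longer virt-handler log, something like the following could be used. This is only a minimal sketch: it assumes the log has been saved with one JSON object per line (as in the excerpts above), and the helper name find_shutdown_signals is hypothetical, not part of KubeVirt or any tooling.

```python
import json


def find_shutdown_signals(lines):
    """Return (timestamp, namespace, name) for every virt-handler
    entry whose message starts with 'Signaled graceful shutdown'."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Wrapped or truncated log lines are skipped, not guessed at.
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if (entry.get("component") == "virt-handler"
                and msg.startswith("Signaled graceful shutdown")):
            hits.append((entry.get("timestamp"),
                         entry.get("namespace"),
                         entry.get("name")))
    return hits


# Two lines copied from the win10-vm-ocs excerpt above.
sample = [
    '{"component":"virt-handler","level":"info","msg":"Processing event b4-ugrade/win10-vm-ocs","pos":"vm.go:1175","timestamp":"2021-02-18T16:09:47.623885Z"}',
    '{"component":"virt-handler","kind":"","level":"info","msg":"Signaled graceful shutdown for win10-vm-ocs","name":"win10-vm-ocs","namespace":"b4-ugrade","pos":"vm.go:1649","timestamp":"2021-02-18T16:09:47.659708Z","uid":"83dc4f8d-415b-4e0a-a983-2bf61a97bc74"}',
]

for timestamp, namespace, name in find_shutdown_signals(sample):
    print(timestamp, f"{namespace}/{name}")
```

Run against the two excerpts above, only the win10-vm-ocs log yields a match; the fed-nfs-vm excerpt ends at "Shutoff, reason: Migrated" on the migration source and contains no shutdown signal.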



- 3 running VMs:
Windows10, OCS, runstrategy: Always
Fedora33, NFS, runstrategy: Manual
Rhel8.3, HPP

Started off from OCP 4.6.17, CNV 2.5.3
Upgraded OCP
VMs were live migrated (checked running process in the migrated VMIs):
  Type     Reason            Age                    From                         Message
  ----     ------            ----                   ----                         -------
  Normal   SuccessfulCreate  4h8m                   disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-78g8k
  Normal   SuccessfulCreate  4h8m                   virtualmachine-controller    Created virtual machine pod virt-launcher-win10-vm-ocs-vxjps
  Normal   Started           4h8m                   virt-handler                 VirtualMachineInstance started.
  Warning  SyncFailed        162m                   virt-handler                 unknown error encountered sending command SyncVMI: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   Created           126m (x141 over 4h8m)  virt-handler                 VirtualMachineInstance defined.
  Normal   SuccessfulCreate  126m                   disruptionbudget-controller  Created Migration kubevirt-evacuation-xjj9z
  Normal   PreparingTarget   123m (x2 over 123m)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   PreparingTarget   123m                   virt-handler                 Migration Target is listening at 10.131.0.5, on ports: 39759,40051
  Warning  SyncFailed        122m                   virt-handler                 server error. command Migrate failed: "migration job already executed"
  Normal   SuccessfulCreate  122m                   disruptionbudget-controller  Created Migration kubevirt-evacuation-wfhcz
  Normal   PreparingTarget   120m (x2 over 120m)    virt-handler                 VirtualMachineInstance Migration Target Prepared.
  Normal   PreparingTarget   120m                   virt-handler                 Migration Target is listening at 10.129.2.4, on ports: 34763,43953
  Normal   Created           27m (x132 over 119m)   virt-handler                 VirtualMachineInstance defined.
  Normal   ShuttingDown      25s (x369 over 27m)    virt-handler                 Signaled Graceful Shutdown
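
The age column of the last event ("x369 over 27m") shows that the shutdown signal was re-sent many times rather than once. A rough sketch of how the average interval could be derived from that column follows; the function names are hypothetical, and the parser only assumes the "LAST (xCOUNT over SPAN)" layout visible in the event list above.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_age(text):
    """Convert an age such as '4h8m' or '27m' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", text)
    if not parts:
        raise ValueError(f"unrecognised age: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def mean_interval(age_column):
    """For 'LAST (xCOUNT over SPAN)', return (count, seconds per event)."""
    match = re.search(r"\(x(\d+) over ([0-9dhms]+)\)", age_column)
    if match is None:
        raise ValueError(f"no repeat count in: {age_column!r}")
    count = int(match.group(1))
    return count, parse_age(match.group(2)) / count


count, interval = mean_interval("25s (x369 over 27m)")
print(f"{count} ShuttingDown events, one every {interval:.1f}s on average")
```

For the ShuttingDown event this works out to 369 events over 1620 seconds, i.e. one signal roughly every 4.4 seconds.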


$ oc get node
NAME                         STATUS   ROLES    AGE   VERSION
ssp09-c7g7r-master-0         Ready    master   26h   v1.20.0+ba45583
ssp09-c7g7r-master-1         Ready    master   26h   v1.20.0+ba45583
ssp09-c7g7r-master-2         Ready    master   26h   v1.20.0+ba45583
ssp09-c7g7r-worker-0-624qp   Ready    worker   26h   v1.20.0+ba45583
ssp09-c7g7r-worker-0-kwzsk   Ready    worker   26h   v1.20.0+ba45583
ssp09-c7g7r-worker-0-ndrjw   Ready    worker   26h   v1.20.0+ba45583

$ oc get vmi
NAME           AGE     PHASE     IP            NODENAME
fed-nfs-vm     8h      Running   10.129.2.46   ssp09-c7g7r-worker-0-624qp
rhel8-hpp-vm   53m     Running   10.131.0.16   ssp09-c7g7r-worker-0-ndrjw
win10-vm-ocs   3h10m   Running   10.129.2.48   ssp09-c7g7r-worker-0-624qp


Version-Release number of selected component (if applicable):
2.6.0

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 Fabian Deutsch 2021-03-19 09:24:20 UTC

We were seeing bug #1913532 and/or bug #1906496

There were quite a few OOM issues around Prometheus in 4.7. These led to nodes becoming unready, which led to pods being deleted, which in turn led to VMs being shut down.

*** This bug has been marked as a duplicate of bug 1906496 ***