Bug 1926746

Summary: VM connect to SSH and consoles is not responsive after VM is up for 25 days
Product: Container Native Virtualization (CNV) Reporter: Israel Pinto <ipinto>
Component: Virtualization Assignee: Antonio Cardace <acardace>
Status: CLOSED ERRATA QA Contact: Ying Cui <ycui>
Severity: high Docs Contact:
Priority: high    
Version: 2.6.0 CC: acardace, cnv-qe-bugs, fdeutsch, kbidarka, rmohr, sgott
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 14:23:09 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1901335    
Bug Blocks:    
Attachments:
Description: virt-launcher and virt-handler logs (Flags: none)

Description Israel Pinto 2021-02-09 12:01:25 UTC
Description of problem:
A VM has been running for 25 days in a longevity test. Checking connectivity to the VM via SSH and the serial and VNC consoles shows that the VM is unresponsive.

A soft reboot with virsh reset did not help; the VM is stuck:
cannot acquire state change lock (held by monitor=remoteDispatchConnectGetAllDomainStats)"
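
For log triage, the job holding the lock can be pulled out of messages like the one above. This is only a minimal sketch: it assumes the message looks exactly like the quoted line, and the helper name lock_holder is hypothetical, not part of libvirt or KubeVirt.

```python
import re

# Pattern based solely on the error line quoted in this report (assumption:
# other libvirt versions may word the message differently).
LOCK_RE = re.compile(
    r"cannot acquire state change lock \(held by (?P<holder>[^)]+)\)"
)

def lock_holder(line):
    """Return the lock holder found in a log line, or None if absent.

    For a holder written as 'kind=name' the result splits the two parts;
    otherwise 'kind' is None and 'name' carries the whole holder text.
    """
    match = LOCK_RE.search(line)
    if match is None:
        return None
    holder = match.group("holder")
    kind, sep, name = holder.partition("=")
    if sep:
        return {"kind": kind, "name": name}
    return {"kind": None, "name": holder}
```

Running it over the virt-launcher log and counting the distinct holders would show whether the same job keeps the lock for the whole period.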


Version-Release number of selected component (if applicable):
CNV 2.6

How reproducible:


Steps to Reproduce:
1. Create a VM with a disk on OCS and run it for more than 25 days
2. Check connectivity to the VM
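
The connectivity check in step 2 could be scripted roughly as follows. This is a sketch under assumptions, not the check used in the longevity test: the function name probe_ssh_banner, the default port 22 and the timeout are illustrative. It only tells apart "port unreachable", "connection accepted but no SSH identification string received" and "identification string received". A guest with a blocked disk may still send the identification string and hang later during authentication, so a "banner" result does not prove that login works.

```python
import socket

def probe_ssh_banner(host, port=22, timeout=5.0):
    """Classify an SSH endpoint as 'banner', 'no-banner' or 'unreachable'.

    'unreachable': the TCP connection could not be established.
    'no-banner':   connected, but no data starting with b"SSH-" arrived
                   within the timeout (or the peer closed or reset).
    'banner':      the server sent its SSH identification string.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(256)
            except OSError:
                # Covers read timeouts and connection resets.
                return "no-banner"
    except OSError:
        # Covers refused connections, connect timeouts and routing errors.
        return "unreachable"
    return "banner" if data.startswith(b"SSH-") else "no-banner"
```

Calling it periodically against the VM address and logging the result with a timestamp would narrow down when the guest stopped answering.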

Actual results:
Cannot connect via SSH or the consoles

Adding logs


Additional info:

Comment 1 Israel Pinto 2021-02-09 12:09:48 UTC
Created attachment 1755903 [details]
virt-launcher and virt-handler logs

Comment 2 Roman Mohr 2021-02-09 12:10:06 UTC
More notes on this one:

Before the reboot we could kind of connect to the VM. VNC and console connections were established and the inputs reached the VM. On the console nothing was returned, and on VNC it got stuck after typing in the user name and hitting enter. In combination with the restart issue, it looks like the storage may have become unavailable. This may become easier to diagnose once we have switched to setting a "stop" error policy for when the disks become unresponsive.

Comment 3 Fabian Deutsch 2021-03-23 16:15:14 UTC
I'm a little worried here that we can not connect to SSH

@ipinto via which network/NIC did you try to connect to the VM?
Was the VM migrated?

Comment 4 Roman Mohr 2021-03-23 19:09:23 UTC
(In reply to Fabian Deutsch from comment #3)
> I'm a little worried here that we can not connect to SSH
> 
> @ipinto via which network/NIC did you try to connect to the VM?
> Was the VM migrated?

The VM was reachable from the serial console, for instance. You could even see the username prompt. Entering the username and password also worked, but as soon as it tried to authenticate, it got stuck. This is typical behaviour when the disk is blocked. I could not find any other hint that something was wrong, but also no clear proof that it was the disk before the VM got rebooted.

Comment 5 Antonio Cardace 2021-03-24 09:52:08 UTC
Hi Israel, can you give me access to the cluster this VM is running on? Thanks.

Comment 6 sgott 2021-04-13 14:40:53 UTC
Israel, there's still an open needinfo on this. Does Antonio have what he needs?

Comment 11 Ying Cui 2021-07-13 10:51:26 UTC
VERIFIED this bug on CNV 4.8.0-372 and OCP 4.8.0-fc.7
We created a few VMs (CentOS 7, RHEL 8, Fedora 33, Windows2k16, Windows10, Windows19) in the longevity environment and performed live migration and snapshot actions on them. The environment has been running for 30+ days.

The issue in bug description can NOT be reproduced on this env.

Comment 14 errata-xmlrpc 2021-07-27 14:23:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

Comment 15 Red Hat Bugzilla 2023-09-15 01:00:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days