Bug 1278895

Summary:

The RHEL4 guest loses virsh console along with SSH IPv4 and IPv6 connectivity when using > 1 vCPU.

Product:

Red Hat Enterprise Linux 7

Reporter:

MikeBoswell <mboswell>

Component:

qemu-kvm

Assignee:

Laine Stump <laine>

Status:

CLOSED NOTABUG

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

7.1

CC:

alemay, huding, jburke, jstancek, jsuchane, juzhang, knoel, mboswell, rbalakri, virt-maint, weliao, xfu

Target Milestone:

Target Release:

---

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-01-18 19:33:14 UTC

Type:

Bug

Attachments:

Description	Flags
File with logs and configs attached.	none
qemu-kvm commands	none
rhel4 XML	none

Description MikeBoswell 2015-11-06 16:55:15 UTC

Created attachment 1090729 [details]
File with logs and configs attached.

Description of problem:

The RHEL4 guest loses virsh console along with SSH IPv4 and IPv6 connectivity when I use > 1 vCPU.

Version-Release number of selected component (if applicable):

Hypervisor is RHEL7.1 running 3.10.0-229.14.1.el7.x86_64 with libvirt-1.2.8-16.el7_1.4.x86_64 and qemu-kvm-1.5.3-86.el7_1.8.x86_64. The 4 guests are RHEL 4.9, 5.11, 6.7, and 7.1. The network is a bridge with a single physical interface, em2.

How reproducible:

Not immediately, but some time after the vCPU count is increased. I have reproduced this with only the RHEL4 VM running and the others shut off. It also happens with the other VMs running.

Steps to Reproduce:
1. Increase the vCPU count to > 1 on all 4 guests, or on just the RHEL4 guest.
2. Halt and start the VMs.
3. Wait for the following symptoms to appear.
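
The steps above can be sketched with virsh on the host. This is a minimal sketch and not taken from the report: the helper name plan_vcpu_change is hypothetical, and it only prints the commands so they can be reviewed and run by hand. The domain name in the example is the one used with "virsh console" in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the virsh commands that raise a guest's vCPU
# count in its persistent config and then power-cycle the guest.
# Nothing is executed here; the lines are meant to be run by hand.
plan_vcpu_change() {
    domain=$1
    count=$2
    # Raise the maximum first, then the current count, both in the stored config.
    echo "virsh setvcpus $domain $count --config --maximum"
    echo "virsh setvcpus $domain $count --config"
    # A --config change only takes effect after a full stop and start.
    echo "virsh shutdown $domain"
    # shutdown returns before the guest is off; repeat domstate until it
    # reports "shut off" before starting the guest again.
    echo "virsh domstate $domain"
    echo "virsh start $domain"
}

# Example: plan_vcpu_change rhts-nfs-vm01 2
```

Printing the plan instead of running it keeps the sketch safe to try on a host with production guests.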

Actual results:

'virsh console rhts-nfs-vm01' does not give a prompt.

==> ping
rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
==> ping6
rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
==> nmap port 22 ipv4
22/tcp open  ssh
==> nmap port 22 ipv6
22/tcp open  ssh
==> ssh -4
^CKilled by signal 2.
==> ssh -6
^CKilled by signal 2.
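
The check output above has a simple shape: a "==> label" header followed by the result of one probe. A minimal sketch of such a harness follows. It is not the reporter's script; the function name probe and the 10 second default are assumptions, and it relies on the coreutils timeout command so that a hanging ssh is cut off without needing ctrl-C.

```shell
#!/bin/sh
# Hypothetical probe harness: print a "==> label" header, then run one check
# under a time limit so that a hung connection cannot stall the whole run.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}   # seconds; an arbitrary assumed default

probe() {
    label=$1
    shift
    echo "==> $label"
    rc=0
    # timeout exits with status 124 when it had to kill the command.
    timeout "$PROBE_TIMEOUT" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMED OUT after ${PROBE_TIMEOUT}s"
    fi
    return "$rc"
}

# Example (host name from the report; BatchMode avoids password prompts):
#   probe "ssh -4" ssh -4 -o BatchMode=yes root@rhts-nfs-vm01.rhts.eng.bos.redhat.com true
#   probe "ssh -6" ssh -6 -o BatchMode=yes root@rhts-nfs-vm01.rhts.eng.bos.redhat.com true
```

A distinct exit status for the timeout case makes it possible to tell "connection refused" apart from "connected but hung", which is the symptom described here.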


Expected results:


Additional info:

A destroy / start of the RHEL4 VM brings it back to life, but only temporarily. SELinux is permissive.

The last failure recorded in the logs was before Fri Nov  6 11:35:31 EST 2015.

Comment 3 MikeBoswell 2015-11-09 13:32:36 UTC

Created attachment 1091748 [details]
qemu-kvm commands

Comment 4 MikeBoswell 2015-11-09 13:40:19 UTC

Thanks. Here is the brctl config. I have attached the qemu-kvm commands, and also the libvirt XML for the RHEL4 VM, which BTW is included in the FILES.tar attached earlier. Am I providing the correct XML?

[root@rhts-nfs network-scripts]# brctl show
bridge name	bridge id		STP enabled	interfaces
br0_vlan175		8000.c81f66f23621	no		em2
							vnet0
virbr0		8000.525400be613a	yes		virbr0-nic
[root@rhts-nfs network-scripts]# cat ifcfg-br0_vlan175
DEVICE=br0_vlan175
TYPE=Bridge
BOOTPROTO=static
ONBOOT=yes
[root@rhts-nfs network-scripts]# cat ifcfg-em2
DEVICE=em2
BOOTPROTO=static
ONBOOT=yes
HWADDR=C8:1F:66:F2:36:21
BRIDGE=br0_vlan175
NM_CONTROLLED=no

Comment 5 MikeBoswell 2015-11-09 13:42:09 UTC

Created attachment 1091752 [details]
rhel4 XML

Comment 6 Laine Stump 2015-11-09 15:09:38 UTC

Some random questions that may or may not lead somewhere:

You say that "virsh console" doesn't respond, nor does ssh. Do you have any indication that the guest isn't completely dead, any reason to believe it is otherwise working? (without other info, I would be inclined to guess that the entire machine is somehow dead, not just networking and the system console).

Can you run virt-viewer on this guest? Does the video console in virt-viewer show anything interesting?

Does the qemu process show up high in "top"? If so, a "thread apply all bt" of the qemu process may or may not reveal something.

I notice that you have a set of USB2 devices set up for the VM but aren't using them. Just to remove variables, would it be possible to remove those from the config? Also, does RHEL4 not support virtio disks? (I'm curious since I see that it's using IDE, but I've never run RHEL4, so don't have firsthand info)

I looked at the system logs of the guest from the sosreport and see nothing interesting.

Comment 7 MikeBoswell 2015-11-09 20:39:00 UTC

Thanks Laine,  

When SSH and virsh console are not responding I still see pings and open ports (see below).  

I still see the qemu-kvm consuming CPU/memory in top.

Virt-viewer shows a prompt for the username, which I can enter, followed by the password, but after that it hangs.

Virt-manager has the 'remove' button grayed out for the USB devices. Can I safely remove them via 'virsh edit'? Which lines?

Can you give more detail on how to get a thread backtrace? I've installed the gdb package.

And I get a kernel panic when IDE is switched to virtio.

AFTER SYSTEM GOES 'UNRESPONSIVE'
Mon Nov  9 14:47:10 EST 2015
==> ping
rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
==> ping6
rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
==> showmount
==> nmap port 22 ipv4
22/tcp open  ssh
==> nmap port 22 ipv6
22/tcp open  ssh
==> ssh -4
^CKilled by signal 2.
==> ssh -6
^CKilled by signal 2.

Comment 8 Laine Stump 2015-11-10 17:21:25 UTC

Can you start up virt-viewer *before* it hangs, make sure it is working, then leave it running? Then check back with it again after it hangs. (maybe even run a loop in the shell there so that you can see if it's still functioning but not accepting any input, e.g. "while true; do date; sleep 1; done" or something like that).

The fact that the console is also unresponsive says to me that it is unrelated to the network. Could it be that something in the guest needed to be rebuilt/reconfigured for SMP but hasn't been? (That seems more likely).

Answers to your questions:

* You can remove all the USB controllers by running "virsh edit rhts-nfs-vm01.rhts.eng.bos.redhat.com" and replacing *all* of the "<controller type='usb' ..." sections with a single element like this:

   <controller type='usb' model='none'/>

* How to get a backtrace of a running qemu process (which is really a Hail-Mary approach because I'm pawing at thin air :-):

(all of this as root, of course)

1) "debuginfo-install qemu-kvm"

2) get the pid of the appropriate qemu process

3) "gdb -p $PID_OF_QEMU"

4) (gdb) "thread apply all bt"

5) Hit enter until you get back to a (gdb) prompt.

6) (gdb) "detach"

7) ctrl-D to get out of gdb and paste all the output into a file.
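
Steps 3 through 7 can also be run non-interactively. The sketch below only prints the command line rather than running it; the helper name qemu_bt_command and the default output file name are hypothetical. It relies on gdb's -batch and -ex options and assumes the debuginfo from step 1 is already installed. Note that the guest is paused for as long as gdb is attached.

```shell
#!/bin/sh
# Hypothetical helper: build a one-shot gdb command that attaches to a running
# process, dumps a backtrace of every thread, and detaches again on exit.
qemu_bt_command() {
    pid=$1
    outfile=${2:-qemu-backtrace.txt}
    # -batch makes gdb exit after the -ex commands, which detaches the process.
    echo "gdb -p $pid -batch -ex 'thread apply all bt' > $outfile 2>&1"
}

# Example, as root on the host. pgrep -f matches the full command line, so
# check that it returns exactly one PID before using it (12345 is a placeholder):
#   pgrep -f 'qemu-kvm.*rhts-nfs-vm01'
#   qemu_bt_command 12345 /tmp/qemu-bt.txt
```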

BTW, I've been told that RHEL4 should support virtio disks, so if you're getting a kernel panic, that is a problem. You'll find that disk performance is *much* better with virtio than with IDE. (one recommendation was "maybe he needs to rebuild initrd?")

Comment 9 MikeBoswell 2015-11-17 14:57:50 UTC

Hi,
I now have the RHEL4 VM using virtio for the disk, and I removed the USB controllers from the XML. virt-viewer is not showing any console messages. I have noticed that when ssh goes down, virt-viewer (and likewise virsh console) remains somewhat operational as long as I have already entered the username and password. If I am not logged in yet, it hangs after the password is entered. For example, ls, ll, find /, dmesg, and tail /var/log/messages all work.
"sudo su -" hangs and responds to ctrl-c.
"su -" hangs and does not respond to ctrl-c.
The top command hangs without displaying anything on the screen. It does break out on ctrl-c.
During this time I can touch a file, but after a reboot the file is gone.

Comment 10 Laine Stump 2015-11-19 14:49:52 UTC

Okay, if the guest is still somewhat responding, then it will be better to troubleshoot from that side. (BTW, this sounds very similar to something I experienced on Fedora a few releases ago - "yum update" would hang when I gave the guest >1 vCPU. If I remember correctly, it turned out there was some sort of strange deadlock regression in a string function in glibc that only showed up on a multi-CPU virtual machine (didn't affect real hardware).)

If you hadn't said that "find /" worked, I would have suggested that possibly an NFS server was failing to respond. But if "find /" and "df" both complete and return to the shell prompt, that's likely not the issue.

You say that "top" hangs, but you can ctl-C out of it. Can you try running top under gdb, then when it hangs, typing ctrl-C and issuing the following gdb command:

    thread apply all bt

? This will hopefully tell us what it's hung on, and that may lead to something useful. Additionally, we *might* get something from the output of "ps -AlF" (all of this run on the guest, BTW). Since you likely won't be able to create any new logins after the "lockup", you should probably prep for it by logging into shells in multiple virtual consoles on the guest as soon as it's booted (hopefully ctl+alt+Fn will switch virtual consoles in RHEL4; you'll need to use the "send-key" menu option in virt-viewer to send it to the guest).

BTW, did you ever run this guest with multiple vCPUs on an older release of RHEL / qemu? (I'm wondering if this is a regression or if it's always been like this).

Comment 11 MikeBoswell 2015-11-19 19:58:44 UTC

Laine, here is what gdb returned. Not much is shown (incomplete?). BTW, I had run this same scenario on a RHEL6 host and saw the same ssh hangs for the RHEL4 guest in that environment. I had also a rhel 5, 6 and 7 in that case and rhel 6 had misbehaved similarly.

(gdb) thread apply all bt full
(gdb) thread apply all bt 
(gdb) bt
#0  0x0000002a9599ed85 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x0000000000408b54 in ?? ()
#2  0x0000002a958f84cb in __libc_start_main () from /lib64/tls/libc.so.6
#3  0x000000000040215a in ?? ()
#4  0x0000007fbffffb98 in ?? ()
#5  0x000000000000001c in ?? ()
#6  0x0000000000000001 in ?? ()
#7  0x0000007fbffffd80 in ?? ()
#8  0x0000000000000000 in ?? ()
(gdb) 



[root@rhts-nfs-vm01 ~]# ps -Alf
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
4 S root         1     0  0  76   0 -  1194 109952 14:25 ?        00:00:00 init [3]                                                     
1 S root         2     1  0 -40   - -     0 migrat 14:25 ?        00:00:00 [migration/0]
1 S root         3     1  0  94  19 -     0 ksofti 14:25 ?        00:00:00 [ksoftirqd/0]
1 S root         4     1  0 -40   - -     0 migrat 14:25 ?        00:00:00 [migration/1]
1 S root         5     1  0  94  19 -     0 ksofti 14:25 ?        00:00:00 [ksoftirqd/1]
1 S root         6     1  0  65 -10 -     0 worker 14:25 ?        00:00:00 [events/0]
1 S root         7     1  0  65 -10 -     0 worker 14:25 ?        00:00:00 [events/1]
1 S root         8     1  0  67 -10 -     0 worker 14:25 ?        00:00:00 [khelper]
1 S root         9     1  0  66 -10 -     0 worker 14:25 ?        00:00:00 [kthread]
1 S root        10     9  0  75 -10 -     0 worker 14:25 ?        00:00:00 [kacpid]
1 S root        24     9  0  65 -10 -     0 worker 14:25 ?        00:00:00 [kblockd/0]
1 S root        25     9  0  65 -10 -     0 worker 14:25 ?        00:00:00 [kblockd/1]
1 S root        26     1  0  75   0 -     0 hub_th 14:25 ?        00:00:00 [khubd]
1 S root        53     9  0  80   0 -     0 pdflus 14:25 ?        00:00:00 [pdflush]
1 S root        54     9  0  75   0 -     0 pdflus 14:25 ?        00:00:00 [pdflush]
1 S root        55     1  0  85   0 -     0 kswapd 14:25 ?        00:00:00 [kswapd0]
1 S root        56     9  0  69 -10 -     0 worker 14:25 ?        00:00:00 [aio/0]
1 S root        57     9  0  65 -10 -     0 worker 14:25 ?        00:00:00 [aio/1]
1 S root       201     1  0  84   0 -     0 serio_ 14:25 ?        00:00:00 [kseriod]
1 S root       473     1  0  75   0 -     0 kjourn 14:25 ?        00:00:00 [kjournald]
1 S root       991     9  0  65 -10 -     0 kaudit 14:25 ?        00:00:00 [kauditd]
4 S root      1085     1  0  76   0 -   912 -      14:25 ?        00:00:00 udevd
1 S root      2345     1  0  79   0 -     0 kjourn 14:25 ?        00:00:00 [kjournald]
1 S root      2346     1  0  79   0 -     0 kjourn 14:25 ?        00:00:00 [kjournald]
1 S root      2673     1  0  76   0 -   914 -      14:25 ?        00:00:00 syslogd -m 0
5 S root      2677     1  0  75   0 -   641 syslog 14:25 ?        00:00:00 klogd -x
1 S root      2690     1  0  76   0 -   647 109952 14:25 ?        00:00:00 irqbalance
5 S rpc       2701     1  0  76   0 -  1195 -      14:25 ?        00:00:00 portmap
5 S rpcuser   2710     1  0  78   0 -  1728 -      14:25 ?        00:00:00 rpc.statd
1 S root      2739     1  0  76   0 -  5242 109952 14:25 ?        00:00:00 rpc.idmapd
5 S root      2786     1  0  76   0 - 19240 -      14:25 ?        00:00:00 ypbind
5 S nscd      2849     1  0  76   0 - 45465 cache_ 14:25 ?        00:00:00 /usr/sbin/nscd
1 S root      2859     1  0  78   0 -   641 -      14:25 ?        00:00:00 /usr/sbin/acpid
5 S root      2923     1  0  76   0 -  5493 -      14:25 ?        00:00:00 /usr/sbin/sshd
1 S root      2967     1  0  77   0 -  2189 -      14:25 ?        00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
5 S ntp       2980     1  0  76   0 -  5216 -      14:25 ?        00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
5 S root      2998     1  0  76   0 -  8763 109952 14:25 ?        00:00:00 sendmail: accepting connections
1 S smmsp     3008     1  0  79   0 -  6962 pause  14:25 ?        00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
1 S root      3018     1  0  76   0 - 14309 184467 14:25 ?        00:00:00 crond
1 S root      3027     1  0  94  19 -   641 rt_sig 14:25 ?        00:00:00 anacron -s
5 S root      3035     1  0  78   0 -  2235 109952 14:25 ?        00:00:00 /usr/sbin/atd
0 S root      3043     1  0  76   0 -   640 -      14:25 ttyS0    00:00:00 /sbin/agetty ttyS0 115200 vt100-nav
4 S root      3044     1  0  78   0 -   637 -      14:25 tty1     00:00:00 /sbin/mingetty tty1
4 S root      3045     1  0  78   0 -   637 -      14:25 tty2     00:00:00 /sbin/mingetty tty2
4 S root      3046     1  0  78   0 -   637 -      14:25 tty3     00:00:00 /sbin/mingetty tty3
4 S root      3047     1  0  80   0 -   637 -      14:25 tty4     00:00:00 /sbin/mingetty tty4
4 S root      3051     1  0  78   0 -   637 -      14:25 tty5     00:00:00 /sbin/mingetty tty5
4 S root      3052     1  0  82   0 -   637 -      14:25 tty6     00:00:00 /sbin/mingetty tty6
4 S root      3717  2923  0  76   0 - 10037 -      14:25 ?        00:00:00 sshd: root@pts/0 
4 S root      3719  3717  0  76   0 - 13279 wait   14:25 pts/0    00:00:00 -bash
0 S root      3750  3719  0  76   0 - 15811 -      14:25 pts/0    00:00:00 gdb top
4 T root      3751  3750  0  76   0 -  1330 ptrace 14:26 pts/0    00:00:00 /usr/bin/top
4 S root      3754  2923  0  76   0 - 10037 -      14:26 ?        00:00:00 sshd: root@pts/1 
4 S root      3756  3754  0  76   0 - 13279 wait   14:26 pts/1    00:00:00 -bash
4 R root      3791  3756  0  78   0 -  1153 -      14:31 pts/1    00:00:00 ps -Alf

Comment 12 Laine Stump 2015-11-20 14:40:23 UTC

It looks like you don't have debuginfo installed for anything, and I don't know if the debuginfo packages are as easily available for RHEL4. If we could perform this experiment with debuginfo installed, we might learn something.

Note that the output of ps shows the wchan of /usr/bin/top as "ptrace" because it's currently stopped in gdb. If you resumed top it could show something different, which may or may not be interesting.
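
To compare wait channels before and after resuming a process, the WCHAN column can be pulled out of "ps -Alf" output. A minimal sketch, assuming the 15-column layout shown in comment 11 (PID is field 4, WCHAN is field 11, CMD starts at field 15); the function name wchan_of is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: read "ps -Alf" output on stdin and print
# "PID STATE WCHAN CMD" for every process whose command contains a pattern.
wchan_of() {
    # The numeric PID test skips the header line; index() does a plain
    # substring match, so "/" in the pattern needs no escaping.
    awk -v pat="$1" '$4 ~ /^[0-9]+$/ && index($15, pat) { print $4, $2, $11, $15 }'
}

# Example: ps -Alf | wchan_of top
```

Only the first word of CMD is matched and printed, which is enough to pick out a single program such as top.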

> I had also a rhel 5, 6 and 7 in that case and rhel 6 had misbehaved similarly. 

Are you still talking about the OS of the host, or of the guest? If you could get this same behavior with a RHEL6 guest it might be easier to investigate.

Comment 13 MikeBoswell 2015-11-20 15:28:50 UTC

> Are you still talking about the OS of the host, or of the guest? If you
> could get this same behavior with a RHEL6 guest it might be easier to
> investigate.

Laine, yes, the host was RHEL6 and the misbehaving guests were RHEL4 and RHEL6. I'll look a little harder at getting the debuginfo onto the RHEL4 guest. Otherwise I can reprovision the host with RHEL6. THX

Comment 14 MikeBoswell 2016-01-18 19:33:14 UTC

Laine, as it turns out, all of this may have been caused by a bad top-of-rack switch. Since its removal the environment has been stable. I'll close this as it is not a bug.