Bug 1278895
| Field | Value |
|---|---|
| Summary | The RHEL4 guest will lose virsh console along w/ SSH ipv4 and ipv6 connectivity when using > 1 vcpu. |
| Product | Red Hat Enterprise Linux 7 |
| Component | qemu-kvm |
| Version | 7.1 |
| Status | CLOSED NOTABUG |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | MikeBoswell <mboswell> |
| Assignee | Laine Stump <laine> |
| QA Contact | Virtualization Bugs <virt-bugs> |
| CC | alemay, huding, jburke, jstancek, jsuchane, juzhang, knoel, mboswell, rbalakri, virt-maint, weliao, xfu |
| Target Milestone | rc |
| Target Release | --- |
| Doc Type | Bug Fix |
| Story Points | --- |
| Last Closed | 2016-01-18 19:33:14 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |

Attachments:
Created attachment 1091748
qemu-kvm commands

Thanks. Here are the brctl configs. I've attached the qemu-kvm commands, and also the libvirt XML for the rhel4 VM, which BTW is included in the FILES.tar attached earlier. Am I providing the correct XML?

    [root@rhts-nfs network-scripts]# brctl show
    bridge name     bridge id           STP enabled   interfaces
    br0_vlan175     8000.c81f66f23621   no            em2
                                                      vnet0
    virbr0          8000.525400be613a   yes           virbr0-nic

    [root@rhts-nfs network-scripts]# cat ifcfg-br0_vlan175
    DEVICE=br0_vlan175
    TYPE=Bridge
    BOOTPROTO=static
    ONBOOT=yes

    [root@rhts-nfs network-scripts]# cat ifcfg-em2
    DEVICE=em2
    BOOTPROTO=static
    ONBOOT=yes
    HWADDR=C8:1F:66:F2:36:21
    BRIDGE=br0_vlan175
    NM_CONTROLLED=no

Created attachment 1091752
rhel4 XML
Some random questions that may or may not lead somewhere:

You say that "virsh console" doesn't respond, nor does ssh. Do you have any indication that the guest isn't completely dead, any reason to believe it is otherwise working? (Without other info, I would be inclined to guess that the entire machine is somehow dead, not just networking and the system console.)

Can you run virt-viewer on this guest? Does the video console in virt-viewer show anything interesting?

Does the qemu process show up high in "top"? If so, a "thread apply all bt" of the qemu process may or may not reveal something.

I notice that you have a set of USB2 devices set up for the VM but aren't using them. Just to remove variables, would it be possible to remove those from the config?

Also, does RHEL4 not support virtio disks? (I'm curious since I see that it's using IDE, but I've never run RHEL4, so I don't have firsthand info.)

I looked at the system logs of the guest from the sosreport and see nothing interesting.

Thanks Laine,

When SSH and virsh console are not responding I still see pings and open ports (see below). I still see qemu-kvm consuming CPU/memory in top. Virt-viewer has a prompt for username, which I can go ahead w/, and then passwd, but after that it hangs.

Virt-manager has the 'remove' button grayed out for the USB. Can I safely remove it via 'virsh edit'? Which lines?

Can you help w/ more detail on how to get a thread backtrace? I've installed the gdb package.

And I get a kernel panic when IDE is switched to virtio.

AFTER SYSTEM GOES 'UNRESPONSIVE'

    Mon Nov 9 14:47:10 EST 2015
    ==> ping
    rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
    ==> ping6
    rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
    ==> showmount
    ==> nmap port 22 ipv4
    22/tcp open ssh
    ==> nmap port 22 ipv6
    22/tcp open ssh
    ==> ssh -4
    ^CKilled by signal 2.
    ==> ssh -6
    ^CKilled by signal 2.

Can you start up virt-viewer *before* it hangs, make sure it is working, then leave it running? Then check back with it again after it hangs. (Maybe even run a loop in the shell there so that you can see if it's still functioning but not accepting any input, e.g. "while true; do date; sleep 1; done" or something like that.)

The fact that the console is also unresponsive says to me that it is unrelated to the network. Could it be that something in the guest needed to be rebuilt/reconfigured for SMP but hasn't been? (That seems more likely.)

Answers to your questions:

* You can remove all the USB controllers by running "virsh edit rhts-nfs-vm01.rhts.eng.bos.redhat.com" and replacing *all* of the "<controller type='usb' ..." sections with a single element like this:

    <controller type='usb' model='none'/>

* How to get a backtrace of a running qemu process (which is really a Hail-Mary approach because I'm pawing at thin air :-). All of this as root, of course:

  1) "debuginfo-install qemu-kvm"
  2) Get the pid of the appropriate qemu process.
  3) "gdb -p $PID_OF_QEMU"
  4) (gdb) "thread apply all bt"
  5) Hit enter until you get back to a (gdb) prompt.
  6) (gdb) "detach"
  7) ctrl-D to get out of gdb, and paste all the output into a file.

BTW, I've been told that RHEL4 should support virtio disks, so if you're getting a kernel panic, that is a problem. You'll find that disk performance is *much* better with virtio than with IDE. (One recommendation was "maybe he needs to rebuild initrd?")

Hi,

I now have the RHEL4 VM using virtio for the disk. I removed the USB controllers from the XML. virt-viewer is not showing any console messages.

I have noticed that when ssh goes down, virt-viewer (and alternatively virsh console) will remain somewhat operational as long as I have already passed username and passwd. If not logged in yet, it will hang after the passwd is entered.

For example ls, ll, find / , dmesg, tail /var/log/messages all work.

"sudo su -" hangs and will respond to ctrl-c.
"su -" hangs and will not respond to ctrl-c.
The top command hangs w/o displaying anything to the screen. It does break out on ctrl-c.

During this time I can touch a file, but after reboot the file is gone.

Okay, if the guest is still somewhat responding, then it will be better to troubleshoot from that side. (BTW, this sounds very similar to something I experienced on Fedora a few releases ago: "yum update" would hang when I gave the guest >1 vCPU. If I remember correctly, it turned out there was some sort of strange deadlock regression in a string function in glibc that only showed up on a multi-CPU virtual machine and didn't affect real hardware.)

If you hadn't said that "find /" worked, I would have suggested that possibly an NFS server was failing to respond. But if "find /" and "df" both complete and return to the shell prompt, that's likely not the issue.

You say that "top" hangs, but you can ctrl-C out of it. Can you try running top under gdb, then when it hangs, typing ctrl-C and issuing the following gdb command:

    thread apply all bt

This will hopefully tell us what it's hung on, and that may lead to something useful. Additionally, we *might* get something from the output of "ps -AlF" (all of this run on the guest, BTW).

Since you likely won't be able to create any new logins after the "lockup", you should probably prep for it by logging into shells in multiple virtual consoles on the guest as soon as it's booted. (Hopefully ctrl+alt+Fn will switch virtual consoles in RHEL4. You'll need to use the "send-key" menu option in virt-viewer to send it to the guest.)

BTW, did you ever run this guest with multiple vCPUs on an older release of RHEL / qemu? (I'm wondering if this is a regression or if it's always been like this.)

Laine,

Here is what was returned from gdb. Not much shown. (Incomplete?)

BTW I had run this same scenario on Red Hat 6 and saw the same ssh hangs in that env. for the rhel4 guest. I also had rhel 5, 6 and 7 guests in that case, and rhel 6 had misbehaved similarly.

    (gdb) thread apply all bt full
    (gdb) thread apply all bt
    (gdb) bt
    #0 0x0000002a9599ed85 in __select_nocancel () from /lib64/tls/libc.so.6
    #1 0x0000000000408b54 in ?? ()
    #2 0x0000002a958f84cb in __libc_start_main () from /lib64/tls/libc.so.6
    #3 0x000000000040215a in ?? ()
    #4 0x0000007fbffffb98 in ?? ()
    #5 0x000000000000001c in ?? ()
    #6 0x0000000000000001 in ?? ()
    #7 0x0000007fbffffd80 in ?? ()
    #8 0x0000000000000000 in ?? ()
    (gdb)

    [root@rhts-nfs-vm01 ~]# ps -Alf
    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
    4 S root 1 0 0 76 0 - 1194 109952 14:25 ? 00:00:00 init [3]
    1 S root 2 1 0 -40 - - 0 migrat 14:25 ? 00:00:00 [migration/0]
    1 S root 3 1 0 94 19 - 0 ksofti 14:25 ? 00:00:00 [ksoftirqd/0]
    1 S root 4 1 0 -40 - - 0 migrat 14:25 ? 00:00:00 [migration/1]
    1 S root 5 1 0 94 19 - 0 ksofti 14:25 ? 00:00:00 [ksoftirqd/1]
    1 S root 6 1 0 65 -10 - 0 worker 14:25 ? 00:00:00 [events/0]
    1 S root 7 1 0 65 -10 - 0 worker 14:25 ? 00:00:00 [events/1]
    1 S root 8 1 0 67 -10 - 0 worker 14:25 ? 00:00:00 [khelper]
    1 S root 9 1 0 66 -10 - 0 worker 14:25 ? 00:00:00 [kthread]
    1 S root 10 9 0 75 -10 - 0 worker 14:25 ? 00:00:00 [kacpid]
    1 S root 24 9 0 65 -10 - 0 worker 14:25 ? 00:00:00 [kblockd/0]
    1 S root 25 9 0 65 -10 - 0 worker 14:25 ? 00:00:00 [kblockd/1]
    1 S root 26 1 0 75 0 - 0 hub_th 14:25 ? 00:00:00 [khubd]
    1 S root 53 9 0 80 0 - 0 pdflus 14:25 ? 00:00:00 [pdflush]
    1 S root 54 9 0 75 0 - 0 pdflus 14:25 ? 00:00:00 [pdflush]
    1 S root 55 1 0 85 0 - 0 kswapd 14:25 ? 00:00:00 [kswapd0]
    1 S root 56 9 0 69 -10 - 0 worker 14:25 ? 00:00:00 [aio/0]
    1 S root 57 9 0 65 -10 - 0 worker 14:25 ? 00:00:00 [aio/1]
    1 S root 201 1 0 84 0 - 0 serio_ 14:25 ? 00:00:00 [kseriod]
    1 S root 473 1 0 75 0 - 0 kjourn 14:25 ? 00:00:00 [kjournald]
    1 S root 991 9 0 65 -10 - 0 kaudit 14:25 ? 00:00:00 [kauditd]
    4 S root 1085 1 0 76 0 - 912 - 14:25 ? 00:00:00 udevd
    1 S root 2345 1 0 79 0 - 0 kjourn 14:25 ? 00:00:00 [kjournald]
    1 S root 2346 1 0 79 0 - 0 kjourn 14:25 ? 00:00:00 [kjournald]
    1 S root 2673 1 0 76 0 - 914 - 14:25 ? 00:00:00 syslogd -m 0
    5 S root 2677 1 0 75 0 - 641 syslog 14:25 ? 00:00:00 klogd -x
    1 S root 2690 1 0 76 0 - 647 109952 14:25 ? 00:00:00 irqbalance
    5 S rpc 2701 1 0 76 0 - 1195 - 14:25 ? 00:00:00 portmap
    5 S rpcuser 2710 1 0 78 0 - 1728 - 14:25 ? 00:00:00 rpc.statd
    1 S root 2739 1 0 76 0 - 5242 109952 14:25 ? 00:00:00 rpc.idmapd
    5 S root 2786 1 0 76 0 - 19240 - 14:25 ? 00:00:00 ypbind
    5 S nscd 2849 1 0 76 0 - 45465 cache_ 14:25 ? 00:00:00 /usr/sbin/nscd
    1 S root 2859 1 0 78 0 - 641 - 14:25 ? 00:00:00 /usr/sbin/acpid
    5 S root 2923 1 0 76 0 - 5493 - 14:25 ? 00:00:00 /usr/sbin/sshd
    1 S root 2967 1 0 77 0 - 2189 - 14:25 ? 00:00:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
    5 S ntp 2980 1 0 76 0 - 5216 - 14:25 ? 00:00:00 ntpd -u ntp:ntp -p /var/run/ntpd.pid -g
    5 S root 2998 1 0 76 0 - 8763 109952 14:25 ? 00:00:00 sendmail: accepting connections
    1 S smmsp 3008 1 0 79 0 - 6962 pause 14:25 ? 00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
    1 S root 3018 1 0 76 0 - 14309 184467 14:25 ? 00:00:00 crond
    1 S root 3027 1 0 94 19 - 641 rt_sig 14:25 ? 00:00:00 anacron -s
    5 S root 3035 1 0 78 0 - 2235 109952 14:25 ? 00:00:00 /usr/sbin/atd
    0 S root 3043 1 0 76 0 - 640 - 14:25 ttyS0 00:00:00 /sbin/agetty ttyS0 115200 vt100-nav
    4 S root 3044 1 0 78 0 - 637 - 14:25 tty1 00:00:00 /sbin/mingetty tty1
    4 S root 3045 1 0 78 0 - 637 - 14:25 tty2 00:00:00 /sbin/mingetty tty2
    4 S root 3046 1 0 78 0 - 637 - 14:25 tty3 00:00:00 /sbin/mingetty tty3
    4 S root 3047 1 0 80 0 - 637 - 14:25 tty4 00:00:00 /sbin/mingetty tty4
    4 S root 3051 1 0 78 0 - 637 - 14:25 tty5 00:00:00 /sbin/mingetty tty5
    4 S root 3052 1 0 82 0 - 637 - 14:25 tty6 00:00:00 /sbin/mingetty tty6
    4 S root 3717 2923 0 76 0 - 10037 - 14:25 ? 00:00:00 sshd: root@pts/0
    4 S root 3719 3717 0 76 0 - 13279 wait 14:25 pts/0 00:00:00 -bash
    0 S root 3750 3719 0 76 0 - 15811 - 14:25 pts/0 00:00:00 gdb top
    4 T root 3751 3750 0 76 0 - 1330 ptrace 14:26 pts/0 00:00:00 /usr/bin/top
    4 S root 3754 2923 0 76 0 - 10037 - 14:26 ? 00:00:00 sshd: root@pts/1
    4 S root 3756 3754 0 76 0 - 13279 wait 14:26 pts/1 00:00:00 -bash
    4 R root 3791 3756 0 78 0 - 1153 - 14:31 pts/1 00:00:00 ps -Alf

It looks like you don't have debuginfo installed for anything, and I don't know if the debuginfo packages are as easily available for RHEL4. If we could perform this experiment with debuginfo installed, we might learn something.
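
If the experiment is repeated with debuginfo installed, the interactive gdb steps given earlier in this thread can also be run non-interactively, which makes the output easier to capture in a file. The following is only a minimal sketch: it assumes a gdb recent enough to accept `-batch` and `-ex` (the gdb shipped with RHEL4 may need a command file passed with `-x` instead), and the function name `dump_backtrace` and the `GDB_BIN` override variable are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Sketch only: write backtraces of all threads of a running process to a file.
# dump_backtrace and GDB_BIN are illustrative names, not part of gdb.

dump_backtrace() {
    local pid="$1"
    local out="$2"
    # -batch makes gdb exit (and detach from the process) after running the
    # -ex command, so no interactive "detach" or ctrl-D is needed.
    "${GDB_BIN:-gdb}" -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}

# Example (as root), using the PID of /usr/bin/top from the ps listing:
#   dump_backtrace 3751 /tmp/top-bt.txt
```

The same call should work on the host against the qemu-kvm PID, provided the matching debuginfo packages are installed there.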
Note that the output of ps shows the wchan of /usr/bin/top as "ptrace" because it's currently stopped in gdb. If you resumed top it could show something different, which may or may not be interesting.
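
To look at the wait channel of a single process after resuming it, without reading through the whole process list, ps can be asked for just that column. This is a rough sketch, assuming a procps `ps` that accepts `-o` with a column width; `show_wchan` and the `PS_BIN` override are hypothetical names.

```shell
#!/bin/bash
# Sketch only: print state and wait channel for one PID.
# show_wchan and PS_BIN are illustrative names, not part of procps.

show_wchan() {
    local pid="$1"
    # stat is the process state (S, T, R, ...); wchan:20 widens the
    # wait-channel column, which "ps -Alf" truncates to six characters.
    "${PS_BIN:-ps}" -o pid,stat,wchan:20,args -p "$pid"
}

# Example: show_wchan 3751
```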
> I had also a rhel 5, 6 and 7 in that case and rhel 6 had misbehaved similarly.
Are you still talking about the OS of the host, or of the guest? If you could get this same behavior with a RHEL6 guest it might be easier to investigate.
> Are you still talking about the OS of the host, or of the guest? If you
> could get this same behavior with a RHEL6 guest it might be easier to
> investigate.
Laine, yes, the host was RHEL6 and the misbehaving guests were RHEL4 and RHEL6. I'll look a little harder at getting the debuginfo onto the RHEL4 guest. Otherwise I can reprovision the host w/ RHEL6. THX
Laine, as it turns out, all this may have been caused by a bad top-of-rack switch. Since its removal the environment has been stable. I'll close this as it is not a bug.
Created attachment 1090729
File with logs and configs attached.

Description of problem:
The RHEL4 guest will lose virsh console along w/ SSH ipv4 and ipv6 connectivity when I use > 1 vcpu.

Version-Release number of selected component (if applicable):
Hypervisor is RHEL7.1 running 3.10.0-229.14.1.el7.x86_64 with libvirt-1.2.8-16.el7_1.4.x86_64 and qemu-kvm-1.5.3-86.el7_1.8.x86_64. The 4 guests are RHEL4.9, 5.11, 6.7, and 7.1. Network is a bridge w/ a single phy, em2.

How reproducible:
Not immediate, but some time following the increase in vcpu. I have reproduced this when only the RHEL4 VM is running and the others are shut off. It also happens with the other VMs running.

Steps to Reproduce:
1. Increase the vcpu count to > 1 on all 4 guests, or on just the one.
2. Halt and start the VMs.
3. Wait and see the following.

Actual results:
'virsh console rhts-nfs-vm01' will not give a prompt.

    ==> ping
    rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
    ==> ping6
    rhts-nfs-vm01.rhts.eng.bos.redhat.com is alive
    ==> nmap port 22 ipv4
    22/tcp open ssh
    ==> nmap port 22 ipv6
    22/tcp open ssh
    ==> ssh -4
    ^CKilled by signal 2.
    ==> ssh -6
    ^CKilled by signal 2.

Expected results:

Additional info:
Destroy / start of the rhel4 VM brings it back to life, but only temporarily. SELinux is permissive. The last failure recorded in the logs was before Fri Nov 6 11:35:31 EST 2015.
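
The manual checks in "Actual results" can be made repeatable so that the moment of failure is easier to pin down. What follows is a minimal sketch, not the script the reporter used: `probe` and `check_guest` are hypothetical names, the timeout values are arbitrary, and `BatchMode=yes` assumes key-based ssh login to the guest is already set up.

```shell
#!/bin/bash
# Sketch only: repeatable version of the manual ping/ssh checks.
# probe and check_guest are illustrative names.

# Run a command quietly and print one status line for it.
probe() {
    local label="$1"
    shift
    if "$@" > /dev/null 2>&1; then
        echo "==> $label: ok"
    else
        echo "==> $label: FAILED (exit $?)"
    fi
}

check_guest() {
    local host="$1"
    date
    probe "ping"   ping  -c 1 -W 2 "$host"
    probe "ping6"  ping6 -c 1 -W 2 "$host"
    # ConnectTimeout bounds connection setup; timeout(1) additionally bounds
    # a session that is established but then hangs (it exits 124 then).
    probe "ssh -4" timeout 15 ssh -4 -o BatchMode=yes -o ConnectTimeout=5 "$host" true
    probe "ssh -6" timeout 15 ssh -6 -o BatchMode=yes -o ConnectTimeout=5 "$host" true
}

# Example: check_guest rhts-nfs-vm01.rhts.eng.bos.redhat.com
```

Wrapping ssh in `timeout` matters for this symptom in particular, since port 22 stays open and the session hangs only after the connection is made.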