Bug 770345

Summary:	Host got call trace when using kexec to start a new kernel in host with kvm guest running on iscsi disk
Product:	Red Hat Enterprise Linux 5	Reporter:	Qunfang Zhang <qzhang>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED NOTABUG	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	medium
Version:	5.8	CC:	gleb, jasowang, juzhang, michen, qcai
Target Milestone:	rc	Flags:	gleb: needinfo+
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-12-26 08:55:04 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Qunfang Zhang 2011-12-26 06:05:37 UTC

Description of problem:
A KVM guest is running on the host with its image on an iSCSI disk. When kexec is then used to start a new kernel on the host, the host sometimes reports a call trace.
With a KVM guest installed on a local disk, the issue did not occur in about 7 attempts.

Version-Release number of selected component (if applicable):
kernel-2.6.18-301.el5
kernel-2.6.18-302.el5
kvm-83-246.el5

How reproducible:
Sometimes

Steps to Reproduce:
1. Make sure two kernel versions are installed on the host.
# rpm -qa | grep kernel
kernel-2.6.18-301.el5
kernel-2.6.18-302.el5

2. Start the kvm process on the host (a KVM guest installed on an iSCSI disk).
 /usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate now -name 2k8r2-64 -smp 2,cores=2 -k en-us -m 2048 -boot c -net nic,vlan=1,macaddr=00:1a:4a:40:1a:16,model=virtio -net tap,vlan=1,ifname=virtio_10_1,script=/etc/qemu-ifup,downscript=no -drive file=/dev/vgtest-qzhang/lvtest-2k8r2,media=disk,if=ide,cache=off,format=qcow2,werror=stop -cpu qemu64,+sse2 -M rhel5.6.0 -uuid 3d292e57-5fb4-42f8-8b96-b7d84c016e96 -notify all -balloon none -monitor stdio -vnc :10

2. On the host, get the current command line.
# cat /proc/cmdline
ro root=/dev/VolGroup00/LogVol00 crashkernel=128M@16M console=tty0 console=ttyS0,115200nb

3. Load a new kernel.
# uname -r
2.6.18-302.el5
# kexec -l /boot/vmlinuz-2.6.18-301.el5 --append="ro root=/dev/VolGroup00/LogVol00 crashkernel=128M@16M console=tty0 console=ttyS0,115200nb" --initrd=/boot/initrd-2.6.18-301.el5.img

4. Run the currently loaded kernel.
# kexec -e

5. If the host boots the new kernel (kernel-301) successfully, repeat steps 1-4 to boot the host back into kernel-302.

6. Repeat steps 1-5.
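
The load step above reuses the running kernel's command line for the target kernel. A minimal sketch of that composition follows, assuming the /boot/vmlinuz-VERSION and /boot/initrd-VERSION.img naming shown in step 3. build_kexec_cmd is a hypothetical helper, not part of kexec-tools, and it only prints the command so that it can be reviewed before being run.

```shell
#!/bin/sh
# Sketch only: print (do not run) the kexec load command for a target kernel,
# reusing a kernel command line read from a file (default: /proc/cmdline).
# build_kexec_cmd is a hypothetical helper name, not part of kexec-tools.
build_kexec_cmd() {
    target="$1"                          # kernel version, e.g. 2.6.18-301.el5
    cmdline_file="${2:-/proc/cmdline}"   # where to read the command line from
    cmdline=$(cat "$cmdline_file") || return 1
    # Paths follow the naming used in step 3 of this report (assumption).
    printf 'kexec -l /boot/vmlinuz-%s --append="%s" --initrd=/boot/initrd-%s.img\n' \
        "$target" "$cmdline" "$target"
}

# Example (prints the command for review):
#   build_kexec_cmd 2.6.18-301.el5
```

Printing rather than executing keeps the sketch safe to try on a production host; the printed line can be pasted into a root shell and followed by kexec -e as in step 4.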


Actual results:
The host got a call trace. I hit this issue on the 2nd kexec attempt.

Expected results:
The host works well and boots into the new kernel successfully.

Additional info:
Call trace:

localhost.localdomain login: Synchronizing SCSI cache for disk sdi: 
 connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488421, last ping 4299493421, now 4299498421
 connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488422, last ping 4299493422, now 4299498432
 connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488422, last ping 4299493422, now 4299498442
 connection5:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488423, last ping 4299493423, now 4299498453
 connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488424, last ping 4299493424, now 4299498463
 connection7:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488425, last ping 4299493425, now 4299498474
 connection8:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299488426, last ping 4299493426, now 4299498484
 connection6:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4299490185, last ping 4299495185, now 4299500185
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdh: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdg: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdf: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sde: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdd: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdc: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sdb: 
FAILED
  status = 0, message = 00, host = 15, driver = 00
  <5>Synchronizing SCSI cache for disk sda: 
Starting new kernel
INFO: task events/2:16 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/2      D 0000000000000000     0    16      1            17    15 (L-TLB)
 ffff810127ff3d30 0000000000000046 0000000300000000 0000000000000000
 0000000000000000 000000000000000a ffff810127fce7e0 ffff810127c81820
 000007fd05e545fa 00000000000007b9 ffff810127fce9c8 0000000227ff3e10
Call Trace:
 [<ffffffff80240a78>] linkwatch_event+0x0/0x30
 [<ffffffff80063171>] wait_for_completion+0x79/0xa2
 [<ffffffff8008ee54>] default_wake_function+0x0/0xe
 [<ffffffff800a178e>] synchronize_rcu+0x30/0x36
 [<ffffffff800a12ca>] wakeme_after_rcu+0x0/0x9
 [<ffffffff80248585>] dev_deactivate+0x82/0xb1
 [<ffffffff80240a3a>] __linkwatch_run_queue+0x1a1/0x1df
 [<ffffffff80240aa2>] linkwatch_event+0x2a/0x30
 [<ffffffff8004d293>] run_workqueue+0x9e/0xfb
 [<ffffffff80049a9c>] worker_thread+0x0/0x122
 [<ffffffff80049b8c>] worker_thread+0xf0/0x122
 [<ffffffff8008ee54>] default_wake_function+0x0/0xe
 [<ffffffff8003265f>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80032561>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

INFO: task events/2:16 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
events/2      D 0000000000000000     0    16      1            17    15 (L-TLB)
 ffff810127ff3d30 0000000000000046 0000000300000000 0000000000000000
 0000000000000000 000000000000000a ffff810127fce7e0 ffff810127c81820
 000007fd05e545fa 00000000000007b9 ffff810127fce9c8 0000000227ff3e10
Call Trace:
 [<ffffffff80240a78>] linkwatch_event+0x0/0x30
 [<ffffffff80063171>] wait_for_completion+0x79/0xa2
 [<ffffffff8008ee54>] default_wake_function+0x0/0xe
 [<ffffffff800a178e>] synchronize_rcu+0x30/0x36
 [<ffffffff800a12ca>] wakeme_after_rcu+0x0/0x9
 [<ffffffff80248585>] dev_deactivate+0x82/0xb1
 [<ffffffff80240a3a>] __linkwatch_run_queue+0x1a1/0x1df
 [<ffffffff80240aa2>] linkwatch_event+0x2a/0x30
 [<ffffffff8004d293>] run_workqueue+0x9e/0xfb
 [<ffffffff80049a9c>] worker_thread+0x0/0x122
 [<ffffffff80049b8c>] worker_thread+0xf0/0x122
 [<ffffffff8008ee54>] default_wake_function+0x0/0xe
 [<ffffffff8003265f>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80032561>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
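
The "last ping" and "now" values in the iSCSI timeout lines above appear to be jiffies counters. A small sketch of the arithmetic follows, assuming HZ=1000 (an assumption, though it is consistent with the "5 secs" stated in the messages). ping_gap_secs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: convert the gap between "last ping" and "now" (taken from an
# iSCSI ping timeout line) into whole seconds.
# HZ defaults to 1000, which is an assumption about this kernel's tick rate.
ping_gap_secs() {
    last_ping="$1"
    now="$2"
    hz="${3:-1000}"
    echo $(( (now - last_ping) / hz ))
}

# Example with the connection2:0 values from the log:
#   ping_gap_secs 4299493421 4299498421
```

For connection2:0 the gap is 4299498421 - 4299493421 = 5000 ticks, which matches the 5 second ping timeout reported in the message under the HZ=1000 assumption.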

Comment 1 Qian Cai 2011-12-26 08:53:48 UTC

Please test kdump rather than kexec directly, using a modified kdump.conf with a dump target. We don't support direct reboot via kexec.
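
As an illustration of what a kdump.conf with a dump target could look like, here is a minimal sketch that prints one for a local ext3 target. make_kdump_conf is a hypothetical helper, and the device and path are placeholders that must be adapted to the host; none of these values come from this report.

```shell
#!/bin/sh
# Sketch only: print a minimal kdump.conf with a local ext3 dump target.
# make_kdump_conf is a hypothetical helper; the device argument and the
# default path are placeholders, not values taken from this bug.
make_kdump_conf() {
    dump_dev="$1"                  # block device holding the dump filesystem
    dump_path="${2:-/var/crash}"   # directory on that filesystem for vmcores
    cat <<EOF
ext3 $dump_dev
path $dump_path
core_collector makedumpfile -c
EOF
}

# Example (review the output before writing it to /etc/kdump.conf):
#   make_kdump_conf /dev/sdX1
```

After the output is placed in /etc/kdump.conf, restarting the service with "service kdump restart" and triggering a test crash with "echo c > /proc/sysrq-trigger" would exercise the kdump path. Note that the trigger crashes the host on purpose.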

Comment 2 Qian Cai 2011-12-26 08:55:04 UTC

If you try the above and the problem still occurs, please re-open this bug.

Comment 3 Gleb Natapov 2011-12-27 09:45:51 UTC

(In reply to comment #1)
> Please test kdump not kexec directly by using a modified kdump.conf with a dump
> target. We don't support kexec direct reboot.
Is it unsupported only on virt, or on real HW too? What is the reason it is not supported? It is very useful for quickly rebooting machines with a very slow BIOS.

Comment 4 Qian Cai 2011-12-27 10:35:54 UTC

Let me clarify. The kexec() syscall is supported, but as I said, it is hard to get right manually on different machines because of the complicated arguments required, whereas kdump takes care of those arguments automatically.

The use case you described is not really what customers/partners will use AFAICT, so it is more of a convenience debugging feature for kernel hackers, similar to kgdb or the kernel-debug variants, and is therefore a lower priority in terms of engineering resources.

But again, if kdump works and direct kexec does not, then the likely cause is that the arguments were not right.