From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Description of problem:
During network configuration the following output is displayed and installation crashes:
lcs: Query IPAssist failed. Assuming unsupported!
lcs: check on device 0.0.02e4, dstat=0xE, cstat=0x0
lcs: Recovery of device 0.0.02e4 started...
lcs: LCS device eth0 without IPv6 support
lcs: LCS device eth0 without Multicast support
lcs: Device eth0 successfully recovered!
Badness in linkwatch_run_queue at net/core/link_watch.c:79
000000001e43fde0 000000001e43fd58 0700000000000ae4 0000000100000000
0000000000107874 000000001e43fc98 000000001e43fc98 000000000010475a
0000000000000000 0000000000000000 0000000000000000 0000000000383224
000000001e43fde0 0000000000000008 000000000000000e 000000001e43fd78
0000000000450a50 000000000010475a 000000001e43fd00 000000001e43fd40
Call Trace:
([<000000000010477a>] dump_stack+0x2a2/0x35c)
 [<00000000003833e4>] linkwatch_event+0x1c0/0x294
 [<0000000000165488>] worker_thread+0x1e8/0x280
 [<0000000000170da0>] kthread+0x118/0x14c
 [<0000000000107462>] kernel_thread_starter+0x6/0xc
 [<000000000010745c>] kernel_thread_starter+0x0/0xc
Version-Release number of selected component (if applicable): kernel 2.6.18-1.2747.el5
How reproducible: Always
Steps to Reproduce:
1. Start RHEL5 Beta 2 installation on zSeries system (z/VM installation type). Network is of LCS type.
2. Select LCS network type and enter appropriate information.
Actual Results: Installation crashes with output shown in 'Description'. CP prompt appears (z/VM).
Expected Results: Network adapter should become operational. Installation should continue.
Additional info:
This is the original dump on RHEL5. In Description I've put SLES10 dump by mistake. It is however the same problem (installation failure due to lcs network setup crash).
Enter the relative port number of your LCS device (required for OSA-Express ATM cards only): 0
lcs: Loading LCS driver
lcs: Query IPAssist failed. Assuming unsupported!
lcs: check on device 0.0.02e4, dstat=0xE, cstat=0x0
lcs: Recovery of device 0.0.02e4 started...
specification exception: 0006 [#1]
CPU: 0 Not tainted
Process lcs_recover (pid: 166, task: 0000000001224148, ksp: 0000000000c6fc78)
Krnl PSW : 0004000180000000 0000000000000152 (0x152)
Krnl GPRS: 0000000000000000 0000000000000000 0000000000000000 0000000000000140
           0000000000000000 0000000000000000 0000000000000004 0000000001166df0
           0000000000000000 0000000000000000 0000000000000000 0000000001166840
           0000000001166800 000000000024c8c0 0000000000787f98 00000000005cbda0
Krnl Code: 00 01 80 00 00 00 00 00 00 00 00 00 01 52 00 00 00 00 00 00
Call Trace:
([<0000000000000080>] 0x80)
 [<00000000001a9fa8>] net_tx_action+0x138/0x180
 [<000000000003d88c>] __do_softirq+0x78/0x110
 [<000000000001ed5c>] do_softirq+0x98/0xb0
 [<000000000001f5b0>] io_return+0x0/0x10
 [<00000000000f1ee0>] sysfs_get_name+0x58/0xac
([<0000000000c6fc78>] 0xc6fc78)
 [<00000000000f36bc>] sysfs_dirent_exist+0x44/0x94
 [<00000000000f2996>] sysfs_add_file+0x56/0xa0
 [<00000000000f5414>] sysfs_create_group+0x104/0x164
 [<000000000016aed4>] class_device_add+0x2fc/0x504
 [<00000000001a92f4>] register_netdevice+0x2a4/0x3d0
 [<00000000001a94a4>] register_netdev+0x84/0xa0
 [<00000000208937fe>] lcs_new_device+0x9ea/0xb08 [lcs]
 [<0000000020895a12>] lcs_recovery+0x126/0x170 [lcs]
 [<00000000000184ce>] kernel_thread_starter+0x6/0xc
 [<00000000000184c8>] kernel_thread_starter+0x0/0xc
<0>Kernel panic - not syncing: Fatal exception in interrupt
HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 00015E24
From email thread <LINUX-390.EDU>:
From: Brad Hinson <bhinson>
To: Linux on 390 Port <LINUX-390.EDU>
Subject: Re: Installation problems (zSeries & lcs network)
Date: Tue, 27 Feb 2007 12:41:13 -0500
Here are a few theories as to why this is happening. If there are kernel folks reading, feel free to correct or chime in:
1.) In RHEL 5 (kernel 2.6.18), register_netdevice() calls might_sleep(). This is a change from RHEL 4 (2.6.9), where might_sleep() was not called. This may be related because the top of the stack (most recently called function) is an interrupt routine.
2.) In kernel 2.6.18, drivers/s390/net/lcs.c, lcs_new_device():2160, the function netif_carrier_on() is called. This was not called in the 2.6.9 lcs code. I can't find the bug report that necessitated this change, but perhaps this introduced a regression.
3.) In drivers/s390/net/lcs.c, in lcs_recovery(), there is:
[snip]
rc = __lcs_shutdown_device(gdev, 1);
rc = lcs_new_device(gdev);
[..]
This makes me wonder if there is a possible race condition, since the device is destroyed and then immediately recreated, and your crash is in lcs_new_device() after ultimately attempting to check if the sysfs group exists (maybe it's still lingering from __lcs_shutdown_device?). Note the 2nd argument to __lcs_shutdown_device() is 1. Looking at __lcs_shutdown_device(), it does lcs_wait_for_threads() only when the 2nd argument is 0. Looking through the qeth code, it appears qeth_recover() does something similar but does call qeth_wait_for_threads(). Perhaps lcs_recovery() should do the same (i.e. call __lcs_shutdown_device with 0).
Created attachment 148933: Test kernel images
This contains:
images/orig-rc/kernel.img: Post-beta2 kernel (for reference)
images/netdev-sched/kernel.img: Remove might_sleep() in register_netdevice()
images/lcs-netif-carrier/kernel.img: Remove netif_carrier_on() in lcs_new_device()
images/lcs-recovery/kernel.img: Call __lcs_shutdown_device() with 0 instead of 1 in lcs_recovery()
Can you test these? Thanks
Unfortunately all kernels (including one for reference) failed with the same output. I've used your kernels with initrd.img and redhat.parm from Beta 2 CD. --- OUTPUT --- ... ... Enter the relative port number of your LCS device (required for OSA-Express ATM cards only): 0 Could not detect LCS interface, aborting... Kernel panic - not syncing: Attempted to kill init! HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 0003AD16 --- END OF OUTPUT ---
The output in comment 4 is different from comment 2. In my tests I get the "Could not detect LCS interface, aborting..." because I don't have an LCS chpid defined. Is there anything different about your process that could lead to the LCS subchannels not being detected correctly?
In response to Comment #5: I believe that nothing is different now since I get almost the same output with kernel from Beta 2 CD.
--- OUTPUT ---
0
lcs: Loading LCS driver
lcs: Query IPAssist failed. Assuming unsupported!
lcs: check on device 0.0.02e4, dstat=0xE, cstat=0x0
lcs: Recovery of device 0.0.02e4 started...
lcs: LCS device eth0 without IPv6 support
lcs: LCS device eth0 without Multicast support
lcs: Device eth0 successfully recovered!
BUG: warning at net/core/link_watch.c:117/linkwatch_run_queue() (Not tainted)
0000000000000000 000000001eeb3cb8 0000000000000002 0000000000000000
000000001eeb3d58 000000001eeb3cd0 000000001eeb3cd0 000000000003748e
0000000000380218 000000000037f9a4 0000000000000000 000000000000000b
0000000000000008 0000000000000000 000000001eeb3cb8 000000001eeb3d30
000000000024c380 00000000000164b2 000000001eeb3cb8 000000001eeb3d08
Call Trace:
([<000000000001640a>] show_trace+0x11e/0x130)
 [<00000000001b3d70>] linkwatch_run_queue+0x1c0/0x22c
 [<00000000001b3e42>] linkwatch_event+0x66/0x74
 [<000000000004ca5a>] run_workqueue+0xea/0x150
 [<000000000004d8b0>] worker_thread+0x114/0x154
 [<0000000000051990>] kthread+0x118/0x14c
 [<00000000000184ce>] kernel_thread_starter+0x6/0xc
 [<00000000000184c8>] kernel_thread_starter+0x0/0xc
--- END OF OUTPUT ---
Network settings are correct (work for SLES9 installation program).
The output in comment 6 looks like the original output from SLES. Are you now getting that output from RHEL 5? Are you able to repeatedly reproduce the output in comment 1, or is the output constantly changing?
Today I was able to reproduce the output (from Comment #7) twice on two separate z/VMs. I'll repeat those tests tomorrow to see if this is now constant behavior. Do you have any idea why there is a difference between the reference image and the image from Beta 2 CD?
The reference image is a post-beta2 kernel. The output in comment 6 does not indicate a kernel panic. Does the install continue much further?
There is no kernel panic, it just hangs. Like in the case of SLES10.
Created attachment 148998: Two different crash behaviors
Today I repeated the test with the kernel from the Beta 2 CD and got two different behaviors. See the attachment (https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=148998). I also have an additional question: is it OK to use just a modified kernel while the old lcs.ko is still there (in initrd.img from the Beta 2 CD)? Is it possible that your reference kernel failed because of a clash between new code in the kernel and old modules in the initrd?
Good point about the initrd and lcs.ko. Instead of attaching a new initrd (which won't match your installation tree), I'll get started on rebuilding the test kernels based on beta 2.
Created attachment 149042: Test kernel images based on beta 2
(In reply to comment #14)
> Created an attachment (id=149042)
> Test kernel images based on beta 2
Can you please check the new attachment? It has the same checksum as the old one (sent on 2007-02-28)...
Created attachment 149139: Test kernel images based on beta 2
Oops, these are the right ones.
Created attachment 149263: Test results from the last kernel.
It is the same result, except that the reference kernel now crashes like the one from the Beta 2 CD. This makes me wonder whether it is enough just to provide me with a new kernel. I believe that there should also be a new initrd with fixed drivers.
Created attachment 149362: initrd-lcs-carrier
You are correct. Here are updated initrd images for testing.
Created attachment 149363: initrd-lcs-recovery
Created attachment 149364: initrd-netdev-sched
Unfortunately I get a message:
....
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
RAMDISK: Compressed image found at block 0
No filesystem could mount root, tried: ext2 iso9660
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(1,0)
HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 00358E10
My assumption is that something can be wrong with the packaging. Earlier I see this:
...
cpu 0 phys_idx=0 vers=FF ident=00018F machine=1247 unused=0000
Brought up 1 CPUs
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
Freeing initrd memory: 11961k freed
...
Created attachment 149591: initrd-lcs-carrier
Apologies, used an incorrect flag when creating the initrd. I've tested these, they work correctly.
Created attachment 149592: initrd-lcs-recovery
Created attachment 149593: initrd-netdev-sched
This time I got something different. The netdev-sched and lcs-recovery kernels and initrds produced more or less the standard output:
Enter the relative port number of your LCS device (required for OSA-Express ATM cards only): 0
lcs: Loading LCS driver
lcs: Query IPAssist failed. Assuming unsupported!
lcs: check on device 0.0.02e4, dstat=0xE, cstat=0x0
lcs: Recovery of device 0.0.02e4 started...
lcs: LCS device eth0 without IPv6 support
lcs: LCS device eth0 without Multicast support
lcs: Device eth0 successfully recovered!
BUG: warning at net/core/link_watch.c:117/linkwatch_run_queue() (Not tainted)
0000000000000000 0000000001303cb8 0000000000000002 0000000000000000
0000000001303d58 0000000001303cd0 0000000001303cd0 000000000003748e
0000000000380218 000000000037f9a4 0000000000000000 000000000000000b
0000000000000008 0000000000000000 0000000001303cb8 0000000001303d30
000000000024c378 00000000000164b2 0000000001303cb8 0000000001303d08
Call Trace:
([<000000000001640a>] show_trace+0x11e/0x130)
 [<00000000001b3d98>] linkwatch_run_queue+0x1c0/0x22c
 [<00000000001b3e6a>] linkwatch_event+0x66/0x74
 [<000000000004ca5a>] run_workqueue+0xea/0x150
 [<000000000004d8b0>] worker_thread+0x114/0x154
 [<0000000000051990>] kthread+0x118/0x14c
 [<00000000000184ce>] kernel_thread_starter+0x6/0xc
 [<00000000000184c8>] kernel_thread_starter+0x0/0xc
~~~
However, there has been no crash with lcs-netif-carrier:
Enter the relative port number of your LCS device (required for OSA-Express ATM cards only): 0
lcs: Loading LCS driver
lcs: Query IPAssist failed. Assuming unsupported!
lcs: check on device 0.0.02e4, dstat=0xE, cstat=0x0
lcs: Recovery of device 0.0.02e4 started...
lcs: LCS device eth0 without IPv6 support
lcs: LCS device eth0 without Multicast support
lcs: Device eth0 successfully recovered!
Unfortunately nothing happened after that. I even left the system running overnight, but nothing happened. All this suggests to me that perhaps the lcs driver is not the (main) culprit.
What should happen in a normal installation after loading the network driver and setting up the network?
The output from comment 25 looks very similar to the SLES 10 output (BUG at net/core/link_watch.c, linkwatch_run_queue). Can you confirm that this output is from RHEL 5? If so, this is significantly different from the originally reported RHEL 5 output, and I just want to make sure everything else has stayed consistent except the new kernel and initrd. After loading the LCS driver, the installer should continue asking for more info, like subchannels, IP address, etc.
Created attachment 149769: 13 experiments with "lcs-recovery" kernel&initrd
Yes, I confirm that the output was obtained from RHEL. I performed 13 additional experiments in a row with the "lcs-recovery" kernel and initrd (see the attachment). There are 5 cases with a crash that returns you to the CP prompt and 8 cases that do not. My guess is that there are two separate problems. I also believe that I have no direct control over which one occurs. All experiments were performed on the same z/VM with the same z/VM user. You mentioned that after loading the LCS driver, the installer should continue asking for more info, like subchannels and IP address. This is actually done *before* lcs loading. What happens *after*? Is there something critical? One additional question: if you log out of a particular z/VM user and then log in again, does it mean that the environment has been cleaned, so that nothing is "remembered" from the previous RHEL boot? Can you advise me what else to try, or what additional information I should provide from a particular crash?
After entering the lcs information, the module is loaded and the interface brought online. Also, logging out of z/VM does clear all relevant information, so nothing is remembered that is not written to disk. I will attempt to get an lcs chpid defined here to try and reproduce, just to make sure it's not Flex-ES related.
Created attachment 153281: Test kernel/initrd
Please test this kernel and patched initrd.img. This solves the issue for me here. It is based on a newer tree than GA, so you won't be able to perform a full install yet, but I'd like to know if you get past the original error. Thanks
Created attachment 153285: Output of the last test
Unfortunately it didn't help. I was running: "Linux version 2.6.18-8.1.3.el5 (brewbuilder.redhat.com) (gcc version 4.1.1 20070105 (Red Hat 4.1.1-52)) #1 SMP Mon Apr 16 15:55:25 EDT 2007". Output is in the output.txt attachment. Installation then hangs. Regards, Ludvik
See comment #30.
Looks like there's a problem with your particular LCS device. It's failing here (from lcs.c):
    rc = lcs_get_problem(cdev, irb);
    if (rc || (dstat & DEV_STAT_UNIT_EXCEP)) {
        PRINT_WARN("check on device %s, dstat=0x%X, cstat=0x%X \n",
                   cdev->dev.bus_id, dstat, cstat);
        if (rc) {
            lcs_schedule_recovery(card);
            wake_up(&card->wait_q);
            return;
        }
    }
The dstat= and cstat= indicate the error code. cstat is subchannel status (0x0) and dstat is device status (0xE). What details do you have for these devices? Can you access them in a different LPAR? Since I have mine working here, I can compare our IOCDS data with yours if you'd like to post the config.
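For anyone following the lcs.c excerpt in this comment, here is a minimal user-space sketch of its branch structure. It is not the driver code: `classify_irq` and the `irq_outcome` labels are hypothetical names made up for illustration, `rc` simply stands in for whatever lcs_get_problem() returned, and the value 0x01 for DEV_STAT_UNIT_EXCEP is taken from the s390 channel I/O header as I remember it, so please check it against your tree.

```c
/* Device status bit; assumed to be 0x01 as in the s390 cio header. */
#define DEV_STAT_UNIT_EXCEP 0x01

/* Hypothetical outcome labels for this sketch; they do not exist in lcs.c. */
enum irq_outcome {
    IRQ_NO_PROBLEM,       /* condition not met, normal processing continues */
    IRQ_WARN_ONLY,        /* "check on device" warning printed, no recovery */
    IRQ_WARN_AND_RECOVER  /* warning printed and recovery scheduled */
};

/*
 * Mirrors the if/if nesting of the quoted snippet.
 * rc models the return value of lcs_get_problem().
 */
enum irq_outcome classify_irq(int rc, unsigned int dstat)
{
    if (rc || (dstat & DEV_STAT_UNIT_EXCEP)) {
        if (rc)
            return IRQ_WARN_AND_RECOVER;
        return IRQ_WARN_ONLY;
    }
    return IRQ_NO_PROBLEM;
}
```

With the reported dstat=0xE the unit-exception bit is clear, so under this reading the warning can only have been printed because rc was non-zero. That would also explain why every log in this report shows "Recovery of device 0.0.02e4 started..." right after the "check on device" line.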
Here's more info: dstat=0xE corresponds to these three flags being set:
DEV_STAT_DEV_END
DEV_STAT_CHN_END
DEV_STAT_UNIT_CHECK
..which corresponds to state CH_STATE_STOPPED for the device. Most likely the device is offline or not defined for this LPAR.
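As a quick cross-check of that decoding, here is a small sketch that turns a dstat value into flag names. The bit values are the ones I recall from the s390 cio.h header (worth verifying against your kernel tree); `decode_dstat` and `struct dstat_flag` are hypothetical helper names, not kernel symbols.

```c
#include <stdio.h>
#include <stddef.h>

/* One device-status bit and its symbolic name. */
struct dstat_flag {
    unsigned int mask;
    const char *name;
};

/* Assumed bit layout of the s390 device status byte. */
static const struct dstat_flag dstat_flags[] = {
    { 0x80, "DEV_STAT_ATTENTION" },
    { 0x40, "DEV_STAT_STAT_MOD" },
    { 0x20, "DEV_STAT_CU_END" },
    { 0x10, "DEV_STAT_BUSY" },
    { 0x08, "DEV_STAT_CHN_END" },
    { 0x04, "DEV_STAT_DEV_END" },
    { 0x02, "DEV_STAT_UNIT_CHECK" },
    { 0x01, "DEV_STAT_UNIT_EXCEP" },
};

/*
 * Writes the names of all flags set in dstat into buf, separated by
 * single spaces, and returns how many names were written. Stops early
 * if buf is too small for the next name.
 */
int decode_dstat(unsigned int dstat, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    if (len == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < sizeof(dstat_flags) / sizeof(dstat_flags[0]); i++) {
        if (!(dstat & dstat_flags[i].mask))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         count ? " " : "", dstat_flags[i].name);
        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';  /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Calling decode_dstat(0xE, buf, sizeof buf) with a 128-byte buffer returns 3 and leaves the three flag names from this comment in buf, since 0xE = 0x08 | 0x04 | 0x02.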
I would like to stress that on the same VM with the same LCS address it is possible to set up networking with an older kernel (SLES9, 31-bit). So I believe that the device is not offline. Can you please advise me how to perform more diagnostics? Since I am not very experienced with z/VM: how do I provide IOCDS data for LCS devices? I believe that #cp q ctca is not enough? Please also bear in mind that all this is running on FLEX-ES. Regards, Ludvik
p.s. #cp q ctca
CTCA 02E0 ON DEV 02E0 SUBCHANNEL = 0008
CTCA 02E1 ON DEV 02E1 SUBCHANNEL = 0009
The fact that it doesn't work in either distro (both based on newer 2.6 kernels) leads me to believe Flex-ES needs to get involved. The fact that this is not reproducible on the z9 supports this point. The kernel is seeing the subchannels as stopped. What is their response to the status of the flags above on the LCS subchannels? Also, disregard my comment about IOCDS, which is specific to System z hardware configuration. Since it's an emulator, Flex-ES may use another method.
While I am waiting for a response regarding flex-es, I wonder whether it is realistic that the issue is about emulation. Please note that it is possible to install and run previous versions of RHEL (and SLES) in the very same environment. I cannot help but ask whether something was changed in the kernel (lcs.c, specifically) that prevents RHEL5 (and SLES10 too) from installing on zSeries. It is possible that the kernel sees the channel as stopped because it failed to put the device online. You mentioned that "this is not reproducible on the z9". Does that mean that you were not able to reproduce the problem on your system using an LCS type network?
That's correct, I was unable to reproduce this on a z9 using subchannels on an LCS type network. There are not many changes to lcs.c between kernels 2.6.9 and 2.6.18; these are the changes we tested early in this bug report. I admit I don't know enough about how Flex-ES emulates the subchannels to narrow down the problem, but the fact that it works on z9 points to an emulation problem. Just to throw the question out there, is there any way around lcs on Flex-ES, or is this still a hard requirement for you?
Hi! I am still waiting for a response regarding the FLEX-ES emulator... Unfortunately, lcs is a hard requirement because we have other working systems on the same physical machine that would be affected by a network adapter change. At this point I would like to do some experiments myself. Can you please advise me how to set up a cross-compiling environment to build a kernel from vanilla sources myself? Or can you please direct me to appropriate documentation? Regards, Ludvik
If you have gcc and binutils, you should be able to compile. Grab the kernel from http://www.kernel.org/ . The general order is that you untar, run 'make menuconfig', 'make bzImage' (on s390 the corresponding target is, as far as I know, 'make image'), 'make modules', then 'make modules_install'. There are other steps, like copying out the vmlinuz and generating the initrd, but these are the main ones. If you'd like to rebuild a Red Hat kernel, the process is a lot easier. Just run rpmbuild --rebuild <kernel.src.rpm>
Hi, Any response from the Flex-ES team?
I think this can be closed. I'm not sure about the state of FLEX-ES anyway.