Description of problem:
System rebooted itself after appearing hung. This was during a VM install while another VM was rebooting.

Version-Release number of selected component (if applicable):
RHEL 5.3 RC2 x86_64

How reproducible:
Not sure, but I have the vmcore file.

Steps to Reproduce:
1. Install RC2 for Dom0
2. Install RC1-para for VM1
3. Start installing RC1-para for VM2
4. While VM2 is installing, reboot VM1

Actual results:
System mouse froze, graphics colors got all funky, and after a couple of minutes the system rebooted itself.

Expected results:
The system to stay up.

Additional info:
This was all on a local disk so no boot from SAN was involved. HP BL480c with 12GB of RAM, one disk. I have the vmcore file, which I will attach after logging this issue. It is 12GB long so it is taking a bit to FTP to my laptop. It may be quicker to have me drive it up to Westford and have someone copy it off my USB disk than to have me add it as an attachment, but I'll try the attachment route. If there is a RH FTP site you prefer I load it to, let me know. Is there any other file RH would like off this system? Jim in Marlborough
This priority needs to be set to Urgent but I can't seem to change it.
"The file you are trying to attach is 560265 kilobytes (KB) in size. Non-patch attachments cannot be more than 20000 KB." Even after compressing the file it is too big for the upload. I had to upload it to the dropbox ftp site. vmcore file is 479754.gz
*** Bug 479343 has been marked as a duplicate of this bug. ***
Mostly notes, but: I started to take a look at this core. Unfortunately, in this particular case the core isn't hugely helpful, but it's at least somewhat. I extracted the hypervisor logs from it, and I see this:

(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000005fb08: ed=ffff8300ceefa080(0), sd=ffff8300cefc6080, caf=80000002, taf=00000000e8000001
(XEN)
(XEN) mm.c:649:d0 Error getting mfn 5fb08 (pfn 40dd) from L1 entry 001000005fb08067 for dom0
(XEN) mm.c:649:d0 Error getting mfn 5fb09 (pfn 40dc) from L1 entry 001000005fb09067 for dom0

So from a high level, what happened is that some domain tried to do a steal_page (despite the fact that it says "gnttab_transfer", there are actually two ways to get here), but that failed. Later on, when dom0 went to use that page, it crashed because the page wasn't mapped into the address space. Now, to take a look at why that is. Given the above information, we can see that caf=0x80000002, which is "x" in the source code. And there is this check:

    x = y;
    if (unlikely((x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated)) ||
        unlikely(_nd != _d)) {
        MEM_LOG("gnttab_transfer: Bad page %p: ed=%p(%u), sd=%p,"
                " caf=%08x, taf=%" PRtype_info "\n",
                (void *) page_to_mfn(page), d, d->domain_id,
                unpickle_domptr(_nd), x, page->u.inuse.type_info);

PGC_count_mask|PGC_allocated == 0x9fffffff, and 1|PGC_allocated == 0x80000001. So we see that x & 0x9fffffff gives us back 0x80000002, which does not equal 1|PGC_allocated, which means we hit the check and it's all downhill from there. Now, the real problem here is that x == 0x80000002, when this code is clearly expecting it to be 0x80000001. That, in turn, means that the page count is too high, meaning that someone mapped the page twice (or something like that). I'll have to continue to dig further to see what is going on. One quick thing that might be useful, though, is to try to find out when this started occurring. We know RC1 and RC2 had it. Did the 5.3 Beta kernel have it?
Did 5.2 have it? At least a rough estimation like this can help us narrow it down somewhat. Chris Lalancette
Chris, I didn't test xen with 5.2 so I can't help there and I didn't see it with any of the snapshots. RC1 was my first time seeing it. jim
I encountered a very similar issue with RH 5.3 snapshot 5 (see bugzilla 476294) but was not able to reproduce in snapshot 6. I also tested 5.2 and do not recall seeing this issue, but I don't recall if I was ever rebooting a VM while installing another.
Is there any chance I can get remote access to one of the blades that this is happening on? I can't reproduce it internally, and the core isn't giving me a whole bunch more information, so I think I just need to spend some time with one of these blades and see what I can do. Chris Lalancette
Chris, I don't know of a mechanism to give you remote access however, if you are in Westford and can come down to Marlborough I can give you hands on access to the machine. You can set up any logging you want. Let me know. jim
Jim, Chris is in the UK.
I have discussed this issue with my counterparts and off the top we can't think of a way to give someone outside of HP access to our internal network. Does Red Hat have some access I don't know about? Other than that, Chris may have to send me the commands he'd like me to try, or someone from Westford could come to my lab and have me shadow them for the day while s/he communicates with Chris.
Jim, OK, let's just start with something simple. Let's start with RHEL 5.2, and see if the problem happens there. If it doesn't, then I can feed you some kernel packages to try to narrow down which patch between 5.2 and 5.3 started causing the problem. So I just need you to install the RHEL-5.2 kernel (which should be 2.6.18-92.el5xen), and re-run the test. Let me know if you don't have access to that; I can give it to you otherwise. Thanks, Chris Lalancette
Oh, I forgot to add... I see that the boxes you've had problems on here have a bit of memory. As another test, can you try the RHEL 5.3 kernel, but pass "mem=4G" on the *hypervisor* command-line, and then run the test again? That might show us if it is a larger memory related problem. Chris Lalancette
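For reference, hypervisor options like mem=4G go on the `kernel` line of the Xen grub stanza (the dom0 kernel's own options go on the first `module` line). A sketch of what such an entry might look like; the root device and volume path here are placeholders, not the reporter's actual values:

```
title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-128.el5 mem=4G
        module /vmlinuz-2.6.18-128.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-128.el5xen.img
```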
I am also seeing this problem on 5.3 x86_64 on HP BL460c G1 blades with 16 GB of RAM. It mainly happens when installing a Xen guest, which is also x86_64, although some guest installs work just fine. The customer installed guests using 5.2 before my arrival onsite on the same hardware and had no issues. More to come as I gather info.
Oh yeah. Setting mem=4G made a WORLD of difference. I just installed three VMs at the same time in a serial fashion, then went through rebooting them as they became available. Last test was to simultaneously boot all three VMs. That worked great. I'd say load up a server with beaucoup of memory and you'll reproduce it. These were all para-virt RH5.3 x86_64 VMs. jim
Interestingly, on different blades in the same chassis, each with 16 GB of RAM and the same proc setup, we were able to install 2 VMs, each on different blades, without problems. About to install 4 more on machines with only 8 GB of RAM.
(In reply to comment #18)
> Oh yeah. Setting mem=4G made a WORLD of difference.
>
> I just installed three VMs at the same time in a serial fashion, then went
> through rebooting them as they became available. Last test was to
> simultaneously boot all three VMs. That worked great.
>
> I'd say load up a server with beaucoup of memory and you'll reproduce it.

Still no luck reproducing here. We'll have to continue on your systems.

OK, so at this point, we know that 5.2 doesn't have the problem, 5.3 does, and restricting the hypervisor to < 4GB makes the problem go away. So, there are a few things I would like to see next:

1. Please get an sosreport right after booting the 5.3 dom0. This will just give me some additional information (like the output from xm dmesg), so I don't have to bother you with getting some of that information.

2. I'm not sure if these machines have EPT or not, but that was one of the big features that went in for 5.3. Run 'xm dmesg | grep -i "Hardware Assisted Paging"'. If that comes up with something, then you are using EPT. In that case, try to boot again, but pass "hap=0" on the hypervisor command-line. The previous command should then say "Hardware Assisted Paging detected, but disabled". Then try the test again to see if it makes a difference.

3. One thing I've found helpful in the past was to narrow down the problem between the hypervisor and the kernel. In this case, I really am thinking this is a problem in the hypervisor, but it would be good to get confirmation. Try to boot with a 5.3 hypervisor, but a 5.2 dom0 kernel, run the test, and see if it makes a difference. Then swap, and boot with a 5.2 hypervisor and a 5.3 dom0. This will at least narrow down where the regression is.

I'll continue to try to reproduce here, but it hasn't been looking good so far. Thanks, Chris Lalancette
Created attachment 334003 [details] sosreport The md5sum is: 6cdaa016b7f4aab7f88d0f18e46e7dda
1. sosreport is now there 2. There is no hardware listed in xm dmesg 3. Will try to get to this soon jim
(In reply to comment #22)
> 1. sosreport is now there
>
> 2. There is no hardware listed in xm dmesg
>
> 3. Will try to get to this soon

Great, thanks. One interesting thing I noticed in your xm dmesg is that VMX is disabled. That's not a widely tested configuration, since people generally want to do full-virt as well as paravirt. Maybe while you are at it you can add:

4. Enable VMX (in the BIOS), power-off and power-on the machine, and then try the test again to see if that makes a difference.

Chris Lalancette
Chris, I'm embarrassed to say I forgot to enable that as part of the initial system setup. It is part of my checklist and somehow was overlooked. I checked all of my other Xen servers and it was enabled, so this one was missed. That said, after correcting that I was able to reboot all three of my existing VMs while installing an additional three. I thought we had it until the system crashed while installing the 6th VM, just after the anaconda line. I'm uploading the vmcore file now to the dropbox incoming directory and will post the file name when it has completed. jim
479754-vmcore-during-vm-install.tgz has been uploaded
Jim, Can you try a test kernel? I have a couple of suspect patches that went in between 5.2 and 5.3, and I'd like a test with one of them reverted. If you grab the kernel at: http://people.redhat.com/clalance/bz479754 and install it, can you run a test to see if it makes a difference? FYI, I've finally been able to reproduce it here, but on the machine I have locally it takes quite a bit of time to reproduce. So if you can do some of these tests quickly, that would help a lot. Thanks, Chris Lalancette
Was anything gleaned from the March 4th vmcore? As for the new debug kernel you provided, the system crashed running it but never generated the vmcore file. What was in messages was a bunch of:

kernel: xen_net: Memory squeeze in netback driver.

right before the system rebooted. I'm going to attach the messages file for today and the whole xend.log file. Maybe it will help you determine what happened with the Mar 4th panic as well. I had 2 new VMs installing RHEL 5.3 para x86_64 while trying to reboot 4 older RHEL 5.3 para VMs.

10:25 rebooted system after installing new kernel
10:49 system rebooted but no vmcore generated

jim
Created attachment 335112 [details] today's messages file showing memory squeeze message
Created attachment 335114 [details] the whole xend.log file
Jim, OK. So based on your test, we know the problem isn't in the "[xen] avoid dom0 hang when tearing down domains" patch; that's actually good news, since that patch was one of the more tricky ones. I'll try again to reproduce by doing what you did: keep rebooting a few domains while installing two others. In terms of the second core, I haven't yet had time to look at it, but I will soon. I have some ideas for gathering more information with another debug kernel, but that may take a little time to code up. I'll keep you posted. Have you attempted with a 5.2 hypervisor and a 5.3 kernel yet? I would like to try to pin down which of the two it is, and that would help. Thanks, Chris Lalancette
Chris, I have not had much time on this system due to other priorities but I do try to get to it when I have a chance. I'll try some more installs and reboots the way it is. I have not had a chance to install a 5.2 hypervisor and try that. jim
Chris, This reboot was even easier with your kernel, and I got the same "Memory squeeze in netback driver" message too. I deleted VM5 and VM6 from the previous reboot so I can have LUNs to install to. I brought up VM1 through VM4 and got them ready for reboot but didn't start the process yet. I started installing VM5, got to selecting the LUN to install to, and it went down. I'll attach today's portion of the xend.log. jim
Created attachment 335572 [details] today's xend.log
The March 4th vmcore is basically the same as the previous vmcore.

(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000010306d: ed=ffff8300ceef8080(0), sd=ffff8300cefba080, caf=80000003, taf=00000000e8000001
(XEN) mm.c:649:d0 Error getting mfn 10306d (pfn b3b6) from L1 entry 001000010306d067 for dom0
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000010fc85: ed=ffff8300ceef8080(0), sd=ffff8300ceef8080, caf=80000003, taf=00000000e8000002
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 000000000005e4e6: ed=ffff8300ceef8080(0), sd=ffff8300cfdce080, caf=80000002, taf=00000000e8000001
(XEN) mm.c:649:d0 Error getting mfn 5e4e6 (pfn 1faff) from L1 entry 001000005e4e6067 for dom0
(In reply to comment #32)
> Chris,
>
> This reboot was even easier with your kernel. I got the same message too
>
> "Memory squeeze in netback driver"
>
> I deleted vm5 and vm6 from the previous reboot so I can have luns to install
> to. I brought up vm1 through vm4 and got them ready for reboot but didn't
> start the process yet. I started installing vm5 and got to selecting the lun
> to install to and it went down.
>
> I'll attach today's portion of the xend.log.

Just for future reference, the xend.log isn't really very interesting for this problem; you have sufficiently explained what it is you are doing, and I can't reproduce reliably, so we'll just keep going on with what you are doing. What *is* interesting is the serial console log; if you can get that from every time we try and fail, that would be useful. It's mostly to ensure that the crash signature remains constant; I want to make sure that patches that I am adding/removing from these test builds don't cause other problems.

In any case, I've now built another kernel with a different set of patches backed out, namely, the 2MB page table stuff that went into 5.3. This is another place that is touching page tables in a number of places, so is another good candidate for a "forgotten" put_page. It's available here (you want the -135 version):

http://people.redhat.com/clalance/bz479754

If you could give that a whirl (along with doing the 5.2 HV with 5.3 kernel test), that would be great. Thanks, Chris Lalancette
I've also encountered what looks like the same problem as described in comments #4 and #34. Adding myself so I can track this.
Chris, I had a chance to install the -135 kernel, and with a quick check the system rebooted by itself again. Now to work on setting up the serial console for you. I brought up the initial four VMs and had them sitting there. I started installing two VMs. On one I got to the initial screen, and on the other I got to where it showed "Anaconda" and that seemed to freeze. When I see that, I know the system is going down in about 5-10 seconds. Same "netback driver" comments in the messages file. I want to get you a serial console output of this before installing the 5.2 HV. jim
Created attachment 336288 [details] serial output of crash
Chris, I've attached the serial console output of this crash. I captured a boot sequence, the crash, then the following boot sequence.

xen_net: Memory squeeze in netback driver.
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000123f81: ed=ffff8300cefd4080(0), sd=ffff8300cefd4080, caf=80000002, taf=00000000e8000001
(XEN) printk: 3 messages suppressed.
xen_net: Memory squeeze in netback driver.
printk: 4 messages suppressed.
xen_net: Memory squeeze in netback driver.
(XEN) mm.c:1808:d0 Bad type (saw 00000000e8000001 != exp 0000000080000000) for mfn 123f81 (pfn 1fff81)
(XEN) mm.c:2098:d0 Error while pinning mfn 123f81
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at arch/x86_64/mm/../../i386/mm/hypervisor.c:197
invalid opcode: 0000 [1] SMP
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

I'll wait to hear from you before installing anything else.

Method of causing the crash:
1. Bring up the 4 existing VMs and open a terminal window in each.
2. Start setting up for a new install to VM5 but don't hit the finish button.
3. Start setting up for a new install to VM6 and watch the screen pause before the crash.

I don't know if I need to open terminal windows or if I need all 4 VMs booted, but since I can crash this machine this way every time, I keep doing it to be consistent and to ensure I actually get the crash quickly. jim
OK, good. Your serial output shows the same crash. That's all I want to do here; make sure that when we crash, we are getting the same crash, and not running into something more. Also, whatever your test method is that reliably reproduces it, I wouldn't change it :).

The good news is that I've created a script that can now semi-reliably reproduce it locally. It takes between 2 and 4 hours to do so, though. The bad news is that given the continued failures, I don't have good ideas at the moment. So, going forward, I would like to see 2 tests:

1) Start up the 5.3 Xen kernel as normal. However, before you start the test that usually causes the crash, run:

# xm mem-set 0 3000

which will balloon dom0 down to 3000 MB before the test. I'm thinking that the "Memory squeeze" messages may be related to the crash, and hand-ballooning dom0 might actually avoid the problem. This should just be a quick test.

2) On your machine, please run the test where you boot with a 5.2 HV (2.6.18-92.el5xen) and a 5.3 kernel (2.6.18-128.el5xen), and let's see what happens. Again, that will let us concentrate on the HV vs. the kernel proper. Thanks, Chris Lalancette
OOPS. Ignore #41. I hit refresh and it must have re-written my previous one. Well, I did the "xm mem-set 0 3000" and the behavior was indeed different. I was able to install both VM5 and VM6 together along with rebooting VMs 1 through 3. VM1's screen said it crashed but it was flickering between "run" and "pause". While it was in the "run" screen I was eventually able to log in and try a shutdown. It was still ugly so I rebooted the whole server to try again. jim
OK. Well, if guests are crashing, that's actually a vast improvement; that's probably some bug with the tools. The important bit is that dom0 is not crashing. That's good that "xm mem-set" improved the situation for you; that agrees with my testing results locally. Interestingly, I ran the test with a 5.2 HV and a 5.3 dom0 kernel, and I still got the crash (it looked ever so slightly different, but I think it is the same root cause). I'm trying again now with a 5.3 HV and a 5.2 dom0 kernel to see if that is any better. I'll keep running the test with different combinations now that I have a reproducer. However, I would still like to see the results of 5.2 HV and 5.3 dom0 on your hardware, and vice-versa. At least it's an additional data point against my testing, which is always good to have. Chris Lalancette
Chris, I re-ran the test from yesterday without VM1. That one becomes a zombie domain so I'm going to blow that away and rebuild it. I was able to install VM5 and VM6 while rebooting VMs 2, 3 and 4 using xm mem-set. Now for the RH5.2 HV. I went looking in my RH52/Server directory for a xen rpm and only see xen-lib rpms. Same thing for RH53/Server. Do the xen rpms get loaded right from Red Hat when using the RH number during the installation, are they hidden elsewhere or is it not a xen-#.#...rpm file? What will be the procedure to install the RH5.2 HV on RH5.3? jim
Ah, yeah, I should have mentioned. So, actually, the hypervisor is included as part of the kernel, so you don't want to mess with the xen package (that is only the userland tools). Basically, to run a 5.2 hypervisor with a 5.3 dom0, I install:

kernel-xen-2.6.18-92.el5xen
kernel-xen-2.6.18-128.el5xen

And then I add a grub.conf entry that looks like:

title Red Hat Enterprise Linux Server (2.6.18-128.el5xen)
        root (hd0,2)
        kernel /xen.gz-2.6.18-92.el5
        module /vmlinuz-2.6.18-128.el5xen ro root=/dev/HostGroup/RHEL5x86_64
        module /initrd-2.6.18-128.el5xen.img

You can, of course, then switch it to a 5.3 hypervisor with a 5.2 dom0 by having an entry like:

title Red Hat Enterprise Linux Server (2.6.18-92.el5xen)
        root (hd0,2)
        kernel /xen.gz-2.6.18-128.el5
        module /vmlinuz-2.6.18-92.el5xen ro root=/dev/HostGroup/RHEL5x86_64
        module /initrd-2.6.18-92.el5xen.img

Chris Lalancette
Chris, I updated the grub.conf with the appropriate files and was able to get in the RH53 kernel/RH52 HV test this afternoon. I'll do the other tomorrow. I was able to install both VM5 and VM6 ok but my VM3 is now a zombie too. First thing tomorrow is to reinstall VM1 and VM3 before proceeding. VM2 and VM4 were rebooting during the installations. I did see two "Memory squeeze in Netback Driver" messages on the console but at least the server stayed up. jim
Chris, I reinstalled VM1 and VM3. I booted the RH52 kernel and 53 hypervisor. I booted VMs 1 through 4 I set up for VM5. Before that was ready to start the installation I set up for VM6 so that I could start configuring as soon as VM5 begun. As soon as I hit "finish" on VM5 to begin installing I clicked on "finish" for VM6 so that I could start setting up. At that point VM5 was stuck at "starting the install process" and VM6 showed "that directory could not be mounted from the server". The console spewed out the memory squeeze message for about 10 minutes before I killed both VM5 and VM6 before deleting them. I was expecting the server to go down but it didn't. I also ftp'd into that directory so the server was fine. I then started the same process but waited until the files started loading on VM5 before beginning to set up for VM6 and that went fine. No memory squeeze issues at any time. I rebooted VMs 1-4 a couple of times so there was a lot of activity going on. Installations finished without any extra messages on the console. jim
Chris, I tried to duplicate the issue I logged above and I can't. I restarted the test and both VM5 and 6 are installing like they should. No console messages. So that means I can't crash the system with RH53 kernel + RH52 HV, nor with RH52 kernel + RH53 HV. Only RH53 kernel and HV. Let's hope I haven't lost the touch. jim
Yes I can still crash it. I tried it again with your -135 test kernel. Same scenario:

1. Bring up 4 VMs and open a terminal window.
2. Get VM5 all queued up.
3. Get VM6 ready to install.
4. Start VM5 and as soon as that begins, start the VM6 process.
5. VM6 gets to the anaconda line and the system goes down shortly afterwards.

printk: 4 messages suppressed.
xen_net: Memory squeeze in netback driver.
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000105556: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000002, taf=00000000e8000001
(XEN)
(XEN) mm.c:649:d0 Error getting mfn 105556 (pfn 1326a) from L1 entry 0010000105556067 for dom0
(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 00000000000c1f00: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000004, taf=00000000e8000001Unable to handle kernel paging request at ffff8801fe351f58 RIP: (XEN) [<ffffffff80261f19>] __memcpy+0x15/0xac (XEN) PGD 2df4067 mm.c:649:d0 Error getting mfn c1f00 (pfn 1e0f7) from L1 entry 00100000c1f00067 for dom0PUD 3bfc067 PMD 3dee067 PTE 0(XEN) mm.c:2768:d0 gnttab_transfer: Bad page 00000000000bda28: ed=ffff8300cefd4080(0), sd=ffff8300cf122080, caf=80000002, taf=00000000e8000001Oops: 0002 [1] (XEN) SMP (XEN) mm.c:649:d0 Error getting mfn bda28 (pfn 125cf) from L1 entry 00100000bda28067 for dom0 last sysfs file: /class/fc_host/host5/speed CPU 1 (XEN) mm.c:2768:d0 gnttab_transfer: Bad page 0000000000111182: ed=ffff8300cefd4080(0), sd=ffff8300cefb6080, caf=80000002, taf=00000000e8000001 (XEN) Modules linked in: nfs lockd(XEN) mm.c:649:d0 Error getting mfn 111182 (pfn 163) from L1 entry 0010000111182067 for dom0 fscache nfs_acl xt_physdev netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport joydev tg3 i5000_edac libphy hpilo bnx2
edac_mc serio_raw sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod shpchp qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 10422, comm: python Not tainted 2.6.18-135.el5bz479754xen #1 RIP: e030:[<ffffffff80261f19>] [<ffffffff80261f19>] __memcpy+0x15/0xac RSP: e02b:ffff8802c947fde8 EFLAGS: 00010203 RAX: ffff8801fe351f58 RBX: ffff8801fe350000 RCX: 0000000000000001 RDX: 00000000000000a8 RSI: ffff8802c947ff58 RDI: ffff8801fe351f58 RBP: ffff8802b20a7040 R08: 000000001e972a50 R09: ffff8802c947ff58 R10: 0000000000010800 R11: 0000000000001000 R12: ffff8801fe351f58 R13: ffff8802c48a17e0 R14: 0000000041387250 R15: 00000000003d0f00 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
Created attachment 336695 [details] latest console capture for the latest crash I started this capture after the four VMs were booted and before creating VM5
OK, thanks for the testing. Actually, we seem to have reversed positions; I can definitely reproduce the error with a 5.2 HV and a 5.3 kernel, and I can also reproduce it with a 5.3 HV and a 5.2 kernel. That seems to say we've had this problem for a while. In any case, I've been adding some debug in to try to narrow this down. I'll let you know what I find. Thanks again, Chris Lalancette
*** Bug 485956 has been marked as a duplicate of this bug. ***
*** Bug 483279 has been marked as a duplicate of this bug. ***
I've uploaded a new test kernel that has a possible fix here: http://people.redhat.com/clalance/bz479754 Can people who are affected by this bug please download the appropriate kernel from there, and see if it makes a difference in testing for them? Besides testing to see if it solves the crash, I would also appreciate any performance data people are able to share with this kernel in place. There is a portion of the patch that, in theory, has the possibility to cause a performance regression. In practice, I don't expect it to change very much, but it would be good to confirm that. Thanks, Chris Lalancette
Chris, Much better. I tried twice and both VM5 and VM6 install with VMs 1-4 rebooting. On occasion I did see what I thought were performance-related issues. One was when I clicked on "next" during the install of VM5 and it seemed to take at least a minute where it should have been seconds. VM6, being set up at the same time, didn't see that delay. Another time was after both 5 and 6 were installed but before the first reboot: the mouse seemed elusive. It was behind a window and it took a bunch of coaxing to get it into view. I couldn't see it so I couldn't tell if it was moving or not. jim
Can anyone check if bug 496741 is a dup of this bug? I am not sure.
I've had a similar problem (bug 496741) and having 2.6.18-138.el5bz479754xen installed I do not run into the issue anymore.
*** Bug 496700 has been marked as a duplicate of this bug. ***
*** Bug 496741 has been marked as a duplicate of this bug. ***
The test kernel seems to work fine on my T500. As an added bonus, my wireless LED now illuminates. You guys rock! Bob
*** Bug 454285 has been marked as a duplicate of this bug. ***
Chris, I have been using 2.6.18-141.el5bz479754perfxen without problems. Today I have tried 2.6.18-144.el5bz479754perf, and with that kernel my virtual machines do not get any network access. The error messages are:

printk: 7 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 7 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 9 messages suppressed.
netfront: rx->offset: 0, size: 4294967295
printk: 7 messages suppressed.
in kernel-2.6.18-146.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
kernel-2.6.18-146.el5 is fine for me. I had to reboot my domUs in order to get rid of the error mentioned in comment #65
the test kernel (2.6.18-144.el5bz479754perf) didn't help, except that it pointed out the intel hda driver on my T61 as being the issue. As soon as I blacklisted the intel_hda kernel modules, and limited dom0 to 768MB of RAM, I have been rock solid. The memory limiting didn't help until I removed the intel_hda driver.
(In reply to comment #69)
> the test kernel (2.6.18-144.el5bz479754perf) didn't help, except that it
> pointed out the intel hda driver on my T61 as being the issue. As soon as I
> blacklisted the intel_hda kernel modules, and limited dom0 to 768MB of RAM, I
> have been rock solid. The memory limiting didn't help until I removed the
> intel_hda driver.

Hm, then this sounds like a different bug. Can you set up a serial console to collect a stack trace, or set up kdump to collect a core? That way we can tell if it has the same trace (which I suspect it will not), and then we can open a different bug about it. Chris Lalancette
Which error? I could not even boot dom0 with the -141 test kernel, which is how I saw the error. I can try one of the test kernels and just copy some stack info off the screen, since I don't know if I can get kdump without dom0 fully booting, or I can set up kdump and see what I get with -128.1.10 and the intel driver not blacklisted.
(In reply to comment #71)
> Which error? I could not even boot dom0 with the -141 test kernel, which is
> how I saw the error.

I'm looking to get a stack trace of what causes this to fail. Since you can't even boot, it sounds like a very different bug.

> I can try one of the test kernels, and just copy some stack info off the screen

The test kernels aren't worth it. They are much older now than anything in the 5.4 kernel. What I would like to see is a boot with the latest 5.4 kernel (it should be -151 at this point), and get the stack trace.

> since I don't know if I can get kdump without dom0 fully booting, or I can
> setup kdump and see what i get with -128.1.10 and the intel driver not
> blacklisted.

You are right, kdump won't work without fully booting first. But now I'm confused; what are you talking about with the -128.1.10 kernel? I thought this only started happening with the 5.4 preview kernels? In any case, please open a new BZ with all of your information at this point. It's almost certainly a different bug, and we are just cluttering this one up. Chris Lalancette
Setting to POST to pickup a small revert in this patch
in kernel-2.6.18-153.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
*** Bug 503139 has been marked as a duplicate of this bug. ***
The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel hangs at xend during boot up. Let me know if and how I can provide more details.
(In reply to comment #77)
> The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel
> hangs at xend during boot up. Let me know if and how I can provide more
> details.

Please open a new Bugzilla for that, since we definitely want to track it, but it's almost certainly a different bug. Chris Lalancette
I have the same or a similar bug on 32-bit. I'll check the kernel 153 from Don and let you know if that helped. I'll add the serial console output to the attachments.
Created attachment 348405 [details] Serial console output for wgold's comment.
> The 152 x86_64 kernel seems to behave on my T500 laptop but the 153 kernel
> hangs at xend during boot up. Let me know if and how I can provide more
> details.

Both the 154 and 155 work with my T105 (x86_64). Bob
(Jim Evans - HP) As the original reporter, would you be able to test the -155.el5 kernel located at http://people.redhat.com/dzickus/el5/ and verify you no longer see the initial problem on your hardware?
I am happy to report in the 86th entry of this bugzilla that I do not see the issue after installing the -155 kernel. Simultaneously I had four VMs installing while two existing ones were rebooting. Earlier I could panic the machine in a couple of minutes. Hats off to the engineers at Red Hat! Thank you. jim
Thanks everyone for testing.
*** Bug 505352 has been marked as a duplicate of this bug. ***
Created attachment 351083 [details] crash in ssh session I saw several crashes since upgrading to 5.3, see attached log of the latest one. There was one domU running to where I was logged in via ssh from dom0, started Firefox in domU (tunneling X through ssh), started browsing redhat.com and a few minutes later it crashed. DomU is 64-bit and uses the bridged network setup.
Yep, that's exactly this bug. Should be fixed in 5.4 now. Chris Lalancette
Is this patch also part of the latest official kernel update, or do we get a patched version with the latest security update applied from your repository?
The patch has not yet been released in an update kernel for RHEL 5.3. It is currently targeted for the next update around the end of the month.
*** Bug 506859 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
*** Bug 645043 has been marked as a duplicate of this bug. ***