Bug 545692 - /distribution/virt/install/install_rhel5u4_x86_64_pv always failed on dell-pe1950-01
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.4.z
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Andrew Jones
QA Contact: Red Hat Kernel QE team
Keywords: Reopened
Duplicates: 501162 553719
Depends On:
Blocks: 514490 5.5_Known-Issues
Reported: 2009-12-09 00:30 EST by Caspar Zhang
Modified: 2011-02-16 07:06 EST
CC: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
blktap may not function as expected, resulting in slow disk I/O causing the guest to operate slowly also. To work around this issue guests should be installed using a physical disk (i.e. a real partition or a logical volume). (BZ#545692)
Last Closed: 2010-06-30 11:24:49 EDT


Attachments
test xml (-164.8.1) (6.45 KB, text/xml)
2009-12-09 00:30 EST, Caspar Zhang

Description Caspar Zhang 2009-12-09 00:30:55 EST
Created attachment 377103
test xml (-164.8.1)

Description of problem:

The 64-bit dom0 (x86_64) was running kernel-xen-2.6.18-164.8.1.el5 with the SMP test. I tried to confirm that the kernel-xen-2.6.18-164.8.1.el5 PV kernels boot in a 64-bit guest, but /distribution/virt/install/install_rhel5u4_x86_64_pv always failed on this machine: dell-pe1950-01.rhts.bos.redhat.com. I tried -164.el5, -164.6.1.el5, -164.7.1.el5, and -164.8.1.el5; all of them failed on this machine, but the installation of rhel5u4_x86_64_pv_SMP passed on other Intel Core machines.

The failed job links are below:
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=104946 (-164.8.1)
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=105459 (-164.8.1)
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106400 (-164.8.1)
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106402 (-164.7.1)
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106722 (-164.6.1)
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=106835 (-164 GA)

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-164.8.1.el5
kernel-xen-2.6.18-164.7.1.el5
kernel-xen-2.6.18-164.6.1.el5
kernel-xen-2.6.18-164.el5

How reproducible:
always

Steps to Reproduce:
1. Submit the XML in the attachment.
Actual results:
  /distribution/virt/install/install_rhel5u4_x86_64_pv  	 Fail

Expected results:
  /distribution/virt/install/install_rhel5u4_x86_64_pv  	 PASS

Additional info:
Comment 1 Andrew Jones 2009-12-09 08:07:50 EST
I'm attempting to reserve the machine now, and will see what happens when I attempt to install manually.
Comment 2 Andrew Jones 2009-12-09 17:03:48 EST
The install is hanging when formatting the disk. dmesg on dom0 shows that the
blktap driver isn't happy

blktap: ring-ref 770, event-channel 7, protocol 1 (x86_64-abi)
blk_tap: invalid kernel buffer -- could not remap it
blk_tap: invalid user buffer -- could not remap it
blk_tap: invalid kernel buffer -- could not remap it
blk_tap: invalid user buffer -- could not remap it
blk_tap: invalid kernel buffer -- could not remap it
blk_tap: invalid user buffer -- could not remap it
blk_tap: invalid kernel buffer -- could not remap it
blk_tap: invalid user buffer -- could not remap it
blk_tap: Reached Fail_flush
... and more ...

and xm dmesg shows the hypervisor is having trouble with several memory pages.
xm dmesg also shows messages repeated like this

(XEN) sysctl.c:51: Allowing physinfo call with newer ABI version

I'm not sure if the ABI stuff is related or even an issue, but this is all
interesting stuff.
Comment 3 Andrew Jones 2009-12-10 10:32:12 EST
I poked around in the code and don't see how the ABI logs could be related, but there are certainly a lot of them on this machine.

I've also found this is reproducible with the latest RHEL5 (-178). Both dom0 and domU were updated. The logs shown above appeared much less frequently though.

I've also found that it's not really hung; it's just super, super slow. The RHTS workflow probably fails because some watchdog trips before the install completes. Formatting takes around 15-20 minutes, then the install takes up to 2 hours. The results are the same for both 64-bit and 32-bit guests.

I've also seen a log like this

(XEN) grant_table.c:154:d0 Increased maptrack size to 3 frames.

in xm dmesg. This looks fine, like it's doing what it should, but it's probably from the same hypervisor call that was returning the maps that blktap is complaining about in the other logs.

I'm going to get some disk I/O metrics on bare-metal, dom0, and the guest.
Comment 4 Andrew Jones 2009-12-10 14:02:45 EST
It's definitely something funny with blktap. If I ensure that I'm not using it, by setting up my image file on a loopback device and then passing that in for the installation so that I have a phy device in my config, then installs are back to normal speed. So the question is whether the problem is in the blktap driver, the hypervisor grant tables, or a combination of the two.
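A minimal sketch of that workaround, for reference. The image path, size, and loop device number here are illustrative, not taken from the original report; the point is that a `phy:` disk goes through blkback rather than blktap:

```shell
# Create a sparse image file for the guest disk (size is illustrative).
dd if=/dev/zero of=/var/lib/xen/images/guest.img bs=1M count=0 seek=8192

# Attach the image to a loop device so it can be handed to the guest
# as a real block device (requires root).
losetup /dev/loop0 /var/lib/xen/images/guest.img

# In the domU config, reference the loop device as a phy: disk instead of
# tap:aio:/var/lib/xen/images/guest.img, which would go through blktap:
#   disk = [ 'phy:/dev/loop0,xvda,w' ]
```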
Comment 5 Andrew Jones 2009-12-10 14:04:39 EST
The question is also how this particular machine exposes the problem, since we don't generally see this issue.
Comment 6 Andrew Jones 2009-12-22 09:44:51 EST
I got this machine back and started looking at this again. I installed a vm quickly by using a loop device for its disk. Then I attached another disk using blktap to the vm and compared the performance for both using fs_mark.

#### blktap #####

FSUse%        Count         Size    Files/sec     App Overhead
     8         1000        10240         37.6            14742

CREAT (Min/Avg/Max)        WRITE (Min/Avg/Max)        FSYNC (Min/Avg/Max)
16     33     7130         21       31     58         18293  26481  63295

SYNC (Min/Avg/Max)        CLOSE (Min/Avg/Max)       UNLINK (Min/Avg/Max)
0      0        0          2        3     10         15       17     52

------------------------------------------------------------------------------

#### blkback #####

FSUse%        Count         Size    Files/sec     App Overhead
    17         1000        10240       3478.1            14343

CREAT (Min/Avg/Max)        WRITE (Min/Avg/Max)        FSYNC (Min/Avg/Max)
14     26       44         21     28      51          187   214     4138

SYNC (Min/Avg/Max)        CLOSE (Min/Avg/Max)       UNLINK (Min/Avg/Max)
0      0        0          2     4       15          15     16       42


The numbers are quite similar for most measurements, but the FSYNC times are on average about 120 times longer for blktap than for blkback, which makes blktap roughly 100 times slower (files/sec) than blkback.
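The ratios follow directly from the two tables above; a quick arithmetic check, with the numbers copied from the fs_mark output:

```python
# Average FSYNC latency (microseconds) and throughput (files/sec),
# taken from the blktap and blkback fs_mark tables above.
blktap_fsync_avg = 26481
blkback_fsync_avg = 214
blktap_files_per_sec = 37.6
blkback_files_per_sec = 3478.1

fsync_ratio = blktap_fsync_avg / blkback_fsync_avg               # ~124x longer
throughput_ratio = blkback_files_per_sec / blktap_files_per_sec  # ~93x slower

print(f"fsync: {fsync_ratio:.0f}x longer, throughput: {throughput_ratio:.0f}x slower")
```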

I currently have -183 kernel-xen and -102 xen on the host. The guest is -182. I don't get any logs in dmesg or xm dmesg with these revs.
Comment 7 Andrew Jones 2010-01-12 11:12:35 EST
*** Bug 553719 has been marked as a duplicate of this bug. ***
Comment 8 Andrew Jones 2010-01-15 04:18:54 EST
This has popped up on two machines from different vendors. The common denominator is that they both have Intel SATA IDE controllers (which use the ata_piix module). To test that, I reserved a third machine from yet another vendor that has a similar controller, and sure enough it reproduced again.

So it looks like something with this controller causes a serious slowdown with blktap.
Comment 9 Paolo Bonzini 2010-02-23 12:25:16 EST
It is no surprise that blkback is faster, since it caches all the data in memory while blktap flushes it to disk.  For bare-metal dell-pe1950-01 I'm getting

[root@dell-pe1950-01 ~]# fs_mark -d test -d test2 -s 51200 -n 4096

#  fs_mark  -d  test  -d  test2  -s  51200  -n  4096 
#	Version Version 3.2, 2 thread(s) starting at Tue Feb 23 12:18:01 2010
#	Sync method: INBAND FSYNC: fsync() per file in write loop.
#	Directories:  no subdirectories used
#	File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#	Files info: size 51200 bytes, written with an IO size of 16384 bytes per write
#	App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
     7         8192        51200         39.0            73261

so the controller is slow and blkback is working around the slowness at the cost of safety.
Comment 10 Andrew Jones 2010-02-23 12:56:08 EST
What bare-metal kernel/drivers did you use? We should check the numbers with the latest drivers we can find for this controller, which would ensure the controller doesn't just look slow because the drivers don't know how to use it.

We also shouldn't lose sight of the data points we have in comment 2. I don't think blktap would or should complain this way just because the controller is slow. So at the least we may need to improve our handling of these types of controllers.

In any case, we should answer a few more questions, and this bug is as good a place as any to do it.
Comment 11 Andrew Jones 2010-03-05 08:20:05 EST
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Under certain, yet to be determined circumstances, which are likely SATA controller related, blktap will not function properly. The primary symptom is very, very slow disk I/O making it appear that the guest is hung. To work around this issue all guests should be installed using a physical disk, such as a real partition or a logical volume.
Comment 13 Ryan Lerch 2010-03-22 22:55:33 EDT
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-Under certain, yet to be determined circumstances, which are likely SATA controller related, blktap will not function properly. The primary symptom is very, very slow disk I/O making it appear that the guest is hung. To work around this issue all guests should be installed using a physical disk, such as a real partition or a logical volume.+blktap may not function as expected, resulting in slow disk I/O causing the guest to operate slowly also. To work around this issue guests should be installed using a physical disk (i.e. a real partition or a logical volume). (BZ#545692)
Comment 15 Paolo Bonzini 2010-06-30 11:24:49 EDT
I think it's just that the hardware is bad?  blktap is less than 10% slower than bare-metal (37.6 vs 39 on fs_mark).  I'm closing this as NOTABUG.
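The "less than 10% slower" figure can be checked the same way, comparing the blktap guest number from comment 6 against the bare-metal number from comment 9. It is a rough comparison, since the two fs_mark runs used different file counts and sizes:

```python
blktap = 37.6      # files/sec in the guest via blktap (comment 6)
bare_metal = 39.0  # files/sec on bare-metal dell-pe1950-01 (comment 9)

# Relative slowdown of blktap versus bare metal: about 3.6%,
# well under the 10% threshold mentioned above.
slowdown = 1 - blktap / bare_metal
print(f"blktap is {slowdown:.1%} slower than bare metal")
```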
Comment 16 Chris Lalancette 2010-07-19 09:34:27 EDT
Clearing out old flags for reporting purposes.

Chris Lalancette
Comment 17 Laszlo Ersek 2011-02-16 07:06:36 EST
*** Bug 501162 has been marked as a duplicate of this bug. ***
