Bug 1167240 - Exception occurred when install RHEV-H due to existing "Root" label partition
Summary: Exception occurred when install RHEV-H due to existing "Root" label partition
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-node
Version: 3.4.4
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 3.5.0
Assignee: Fabian Deutsch
QA Contact: Virtualization Bugs
URL:
Whiteboard: node
Duplicates: 1171892
Depends On:
Blocks: 861659 rhev35rcblocker rhev35gablocker 1178805
Reported: 2014-11-24 09:37 UTC by cshao
Modified: 2016-02-10 20:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1178805
Environment:
Last Closed: 2015-02-11 21:06:29 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:


Attachments
failed to set install bootloader.png (3.73 KB, image/png)
2014-11-24 09:37 UTC, cshao
failed log.png (48.65 KB, image/png)
2014-11-24 09:38 UTC, cshao
bootloader failed (7.58 KB, image/png)
2014-12-02 06:14 UTC, cshao
ovirt.log (85.77 KB, text/plain)
2014-12-02 06:14 UTC, cshao
ovirt-node.log (4.89 KB, text/plain)
2014-12-02 06:15 UTC, cshao
1212-failed.png (33.19 KB, image/png)
2014-12-17 06:59 UTC, cshao
1212.tar.gz (51.10 KB, application/x-gzip)
2014-12-17 07:00 UTC, cshao
1218.tar.gz (54.50 KB, application/x-gzip)
2014-12-19 03:24 UTC, cshao
1218-bootloader.png (35.81 KB, image/png)
2014-12-19 03:25 UTC, cshao


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHEA-2015:0160 | 0 | normal | SHIPPED_LIVE | ovirt-node bug fix and enhancement update | 2015-02-12 01:34:52 UTC
oVirt gerrit | 36254 | 0 | None | None | None | Never
oVirt gerrit | 36489 | 0 | master | MERGED | Avoiding mounting of leftovers in fresh install. | Never
oVirt gerrit | 36564 | 0 | ovirt-3.5 | MERGED | Avoiding mounting of leftovers in fresh install. | Never

Description cshao 2014-11-24 09:37:28 UTC
Created attachment 960630 [details]
failed to set install bootloader.png

Description of problem:
An exception occurred when installing RHEV-H: "Failed to set install bootloader"

Version-Release number of selected component (if applicable):
rhev-hypervisor6-6.6-20141119.0
ovirt-node-3.0.1-19.el6.24.noarch

How reproducible:
10%

Steps to Reproduce:
1. Boot from PXE.
2. TUI install RHEV-H.
3. Finish the installation with correct steps.
4. Focus on "Install RHEV-H" page.

Actual results:
An exception occurred when installing RHEV-H: "Failed to set install bootloader"

Expected results:
The installation succeeds without an exception.

Additional info:

Comment 1 cshao 2014-11-24 09:38:19 UTC
Created attachment 960631 [details]
failed log.png

Comment 3 cshao 2014-11-25 07:47:43 UTC
Hi fabiand, 

I can't provide more logs for you to debug at present:
1. I tried many times but still can't reproduce it.
2. I'm considering testing it again after the patches below are merged.
http://gerrit.ovirt.org/#/c/35490/
http://gerrit.ovirt.org/#/c/35491/

I will provide more info or a test environment for you ASAP if I can reproduce it again.
Thanks!

Comment 4 Ying Cui 2014-11-26 02:30:28 UTC
Chen, Ryan provided the new rhevh 6.6 3.4.z build in bug 1158044#c40. Please try this build and check whether this issue is gone. Thanks.

https://bugzilla.redhat.com/show_bug.cgi?id=1158044#c40

Comment 5 cshao 2014-11-26 06:03:00 UTC
(In reply to Ying Cui from comment #4)
> Chen, Ryan provided the new rhevh 6.6 3.4.z build in bug 1158044#c40. Please
> try this build and check whether this issue is gone. Thanks.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1158044#c40

Hi ycui,

It seems the above 2 patches are for 3.4, but this build is actually for 3.5, not 3.4, so I guess I don't need to do any testing with this build.

Test version:
Red Hat Enterprise Virtualization Hypervisor release 6.6 (20141125.0.el6ev)
ovirt-node-3.1.0-0.24.20141104git70ba2b0.el6.noarch

Thanks!

Comment 6 Ying Cui 2014-11-26 06:34:48 UTC
Chen, yes, I have updated the bug to ask Ryan to confirm that. Let's wait for Ryan's response, thanks.
https://bugzilla.redhat.com/show_bug.cgi?id=1158044#c41

Comment 11 cshao 2014-11-27 03:24:11 UTC
need info for comment 10

Comment 18 cshao 2014-12-02 06:14:11 UTC
Created attachment 963560 [details]
bootloader failed

Comment 19 cshao 2014-12-02 06:14:45 UTC
Created attachment 963561 [details]
ovirt.log

Comment 20 cshao 2014-12-02 06:15:36 UTC
Created attachment 963562 [details]
ovirt-node.log

Comment 28 Fabian Deutsch 2014-12-10 11:25:27 UTC
Two things:

1. multipath is claiming the device - even if just one path is used
2. a udev race appears when partprobe is run


[root@hp-bl465cg5-02 ~]# multipath -ll
Dec 10 11:10:31 | multipath.conf line 3, invalid keyword: find_multipath
3600508b1001036303020202020200005 dm-2 HP,LOGICAL VOLUME
size=68G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  `- 0:0:0:0 cciss!c0d0 104:0 active ready running



[root@hp-bl465cg5-02 ~]# blkid -L RootNew
/dev/mapper/3600508b1001036303020202020200005p3

Rediscover partitions:

[root@hp-bl465cg5-02 ~]# partprobe
device-mapper: remove ioctl on 3600508b1001036303020202020200005p3 failed: Device or resource busy
…
Warning: parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
…
[root@hp-bl465cg5-02 ~]# blkid -L RootNew
/dev/mapper/3600508b1001036303020202020200005p3

[root@hp-bl465cg5-02 ~]# partprobe
device-mapper: remove ioctl on 3600508b1001036303020202020200005p3 failed: Device or resource busy
…
Warning: parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
…
[root@hp-bl465cg5-02 ~]# blkid -L RootNew
/dev/cciss/c0d0p3

The device changed, raw device is now used, the p3 symlink did not appear:

[root@hp-bl465cg5-02 ~]# ls /dev/mapper/
3600508b1001036303020202020200005    3600508b1001036303020202020200005p2  HostVG-Config  HostVG-Logging  control         live-rw
3600508b1001036303020202020200005p1  3600508b1001036303020202020200005p4  HostVG-Data    HostVG-Swap     live-osimg-min

Adding udevadm settle does not help.

Easiest is to prevent multipath from claiming it. The other question is why the p3 symlink was not created.
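
The label flip seen above (the same label resolving first to a /dev/mapper node and later to the raw cciss node) can be checked for with a few lines of Python. This is only a minimal sketch, not code from ovirt-node: `resolve_label`, `node_kind` and `label_switched` are hypothetical helper names, and only the `blkid -L` call is taken from the session above.

```python
import subprocess


def resolve_label(label):
    """Return the device node that `blkid -L LABEL` reports, or None."""
    proc = subprocess.run(
        ["blkid", "-L", label],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.stdout.strip() or None


def node_kind(path):
    """Classify a device node as 'dm' (below /dev/mapper) or 'raw'."""
    return "dm" if path.startswith("/dev/mapper/") else "raw"


def label_switched(before, after):
    """True if a label moved between a device-mapper node and a raw node."""
    return node_kind(before) != node_kind(after)
```

Calling `resolve_label("RootNew")` before and after a partprobe run and passing both results to `label_switched` would flag the situation shown in this comment.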

Comment 29 Fabian Deutsch 2014-12-10 13:17:33 UTC
The file relevant for those rules is:

/lib/udev/rules.d/10-dm.rules:ENV{DM_UDEV_DISABLE_DM_RULES_FLAG}!="1", ENV{DM_NAME}=="?*", SYMLINK+="mapper/$env{DM_NAME}"

Owned by:
# rpm -qf /lib/udev/rules.d/10-dm.rules
device-mapper-1.02.90-2.el6_6.1.x86_64

Peter, can you tell us why some symlinks do not appear?

Comment 30 Peter Rajnoha 2014-12-10 14:45:01 UTC
(In reply to Fabian Deutsch from comment #28)
> Two things:
> 
> 1. multipath is claiming the device - even if just one path is used
> 2. a udev race appears when partprobe is run
> 
> 
> [root@hp-bl465cg5-02 ~]# multipath -ll
> Dec 10 11:10:31 | multipath.conf line 3, invalid keyword: find_multipath

...that should be "find_multipaths" (the "s" is missing at the end of the keyword).

> Rediscover partitions:
> 
> [root@hp-bl465cg5-02 ~]# partprobe
> device-mapper: remove ioctl on 3600508b1001036303020202020200005p3 failed:
> Device or resource busy
> …

...if you comment out all OPTIONS+="watch" rules in /lib/udev/rules.d (grep for them), does that change the situation in any way? (...I'm not sure now, but I think you need to restart the udev daemon to pick up the new rules, so also call systemctl restart systemd-udevd.service after modifying the rules).

So let's try this for starters.
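
As a rough aid for the "grep for them" step, the following Python sketch lists the rule lines that still set the watch option. It is an illustration under assumptions: `active_watch_rules` is a hypothetical helper, and it treats any line whose first non-blank character is `#` as commented out.

```python
import re

# Matches the udev option that enables inotify watching of a device node.
WATCH_RE = re.compile(r'OPTIONS\+="watch"')


def active_watch_rules(rules_text):
    """Return the uncommented rule lines that set OPTIONS+="watch"."""
    return [
        line
        for line in rules_text.splitlines()
        if WATCH_RE.search(line) and not line.lstrip().startswith("#")
    ]
```

Feeding it the contents of each file under /lib/udev/rules.d shows which rules would still need to be commented out; an empty list means no watch rule is active in that file.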

Comment 31 Fabian Deutsch 2014-12-10 15:11:08 UTC
(In reply to Peter Rajnoha from comment #30)
> (In reply to Fabian Deutsch from comment #28)
> > Two things:
> > 
> > 1. multipath is claiming the device - even if just one path is used
> > 2. a udev race appears when partprobe is run
> > 
> > 
> > [root@hp-bl465cg5-02 ~]# multipath -ll
> > Dec 10 11:10:31 | multipath.conf line 3, invalid keyword: find_multipath
> 
> ...that should be "find_multipaths" (the "s" is missing at the end of the
> keyword).

Yep I fixed right after pasting the snippet.

> > Rediscover partitions:
> > 
> > [root@hp-bl465cg5-02 ~]# partprobe
> > device-mapper: remove ioctl on 3600508b1001036303020202020200005p3 failed:
> > Device or resource busy
> > …
> 
> ...if you comment out all OPTIONS+="watch" rules in /lib/udev/rules.d (grep
> for them), does that change the situation in any way? (...I'm not sure now,
> but I think you need to restart the udev daemon to pick up the new rules, so also
> call systemctl restart systemd-udevd.service after modifying the rules).
> 
> So let's try this for starters.

No:

[root@hp-bl465cg5-02 udev]# diff -ur rules.d.orig/ rules.d
diff -ur rules.d.orig/60-persistent-storage.rules rules.d/60-persistent-storage.rules
--- rules.d.orig/60-persistent-storage.rules	2014-07-24 13:48:42.000000000 +0000
+++ rules.d/60-persistent-storage.rules	2014-12-10 15:08:37.000000000 +0000
@@ -86,9 +86,9 @@
 KERNEL=="xvd*", ENV{DEVTYPE}=="partition", IMPORT{program}="/sbin/blkid -o udev -p $tempnode"
 
 # watch for future changes
-KERNEL!="xvd*|sr*", OPTIONS+="watch"
-KERNEL=="xvd*", ENV{DEVTYPE}!="partition", ATTR{removable}!="1", OPTIONS+="watch"
-KERNEL=="xvd*", ENV{DEVTYPE}=="partition", OPTIONS+="watch"
+#KERNEL!="xvd*|sr*", OPTIONS+="watch"
+#KERNEL=="xvd*", ENV{DEVTYPE}!="partition", ATTR{removable}!="1", OPTIONS+="watch"
+#KERNEL=="xvd*", ENV{DEVTYPE}=="partition", OPTIONS+="watch"
 
 # by-label/by-uuid links (filesystem metadata)
 ENV{ID_FS_USAGE}=="filesystem|other|crypto", ENV{ID_FS_UUID_ENC}=="?*", SYMLINK+="disk/by-uuid/$env{ID_FS_UUID_ENC}"

[root@hp-bl465cg5-02 udev]# udevadm control --reload-rules

[root@hp-bl465cg5-02 udev]# partprobe 
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
Warning: parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
device-mapper: create ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy

Comment 32 Peter Rajnoha 2014-12-11 09:48:43 UTC
Ah, sorry, I've noticed the link to the test machine. So I looked at it a bit...

(In reply to Fabian Deutsch from comment #28)
> 1. multipath is claiming the device - even if just one path is used

I've noticed there's /etc/multipath/wwids file present which contains 3600508b1001036303020202020200005 as valid wwid for the path (which is the wwid of the cciss!c0d0). Once this file is present, we don't need to wait for another path to appear for the mpath to claim the device (because mpath already knows the wwid so it can compare with that).

The only question now is when and under what circumstances the wwid file got written first - was it during installation (anaconda?) or was it just copied from somewhere else or just the first multipath -c call in udev rules did not recognize this properly (and it wrote incorrect wwid file)? So we need to find out...
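
To check quickly whether a given wwid is already recorded, something like the Python sketch below could be used. It assumes that the wwids file lists one entry per line wrapped in slashes (`/WWID/`) and that lines starting with `#` are comments; that layout is an assumption here and should be confirmed against the file on the host. `listed_wwids` and `is_claimed` are hypothetical names.

```python
def listed_wwids(wwids_text):
    """Parse wwids-file text, assuming one `/WWID/` entry per line."""
    found = []
    for line in wwids_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        found.append(line.strip("/"))
    return found


def is_claimed(wwid, wwids_text):
    """True if the wwid is recorded, i.e. multipath may claim that device."""
    return wwid in listed_wwids(wwids_text)
```

Reading /etc/multipath/wwids and calling `is_claimed("3600508b1001036303020202020200005", text)` would confirm the observation above.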


> 2. a udev race appears when partprobe is run
> 
> 
> [root@hp-bl465cg5-02 ~]# multipath -ll
> Dec 10 11:10:31 | multipath.conf line 3, invalid keyword: find_multipath
> 3600508b1001036303020202020200005 dm-2 HP,LOGICAL VOLUME
> size=68G features='0' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=1 status=active
>   `- 0:0:0:0 cciss!c0d0 104:0 active ready running
> 
> 
> 
> [root@hp-bl465cg5-02 ~]# blkid -L RootNew
> /dev/mapper/3600508b1001036303020202020200005p3
> 
> Rediscover partitions:
> 
> [root@hp-bl465cg5-02 ~]# partprobe
> device-mapper: remove ioctl on 3600508b1001036303020202020200005p3 failed:
> Device or resource busy

During my test I got (just p4 partition instead of p3 compared to Fabian's log):

[root@hp-bl465cg5-02 ~]# partprobe
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
Warning: parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
device-mapper: create ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy

So 3600508b1001036303020202020200005p4 can't be removed because it's still open - it's used as a PV:

# lsblk /dev/mapper/3600508b1001036303020202020200005p4
3600508b1001036303020202020200005p4 (dm-6) 253:6    0 67.6G  0 part 
├─HostVG-Swap (dm-7)                       253:7    0  7.9G  0 lvm  
├─HostVG-Config (dm-8)                     253:8    0    8M  0 lvm  /config
├─HostVG-Logging (dm-9)                    253:9    0    2G  0 lvm  /var/log
└─HostVG-Data (dm-10)                      253:10   0  5.8G  0 lvm  /data

# pvs
  PV                                              VG     Fmt  Attr PSize  PFree 
  /dev/mapper/3600508b1001036303020202020200005p4 HostVG lvm2 a--  67.62g 51.98g

Also, parted shows 4 partitions on /dev/cciss/c0d0:

(parted) p                                                                
Model: Compaq Smart Array (cpqarray)
Disk /dev/cciss/c0d0: 143305920s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start     End         Size        File system  Name     Flags
 1      2048s     499711s     497664s                  primary  bios_grub
 2      499712s   999423s     499712s     ext2         primary  boot
 3      999424s   1499135s    499712s     ext2         primary
 4      1499136s  143304703s  141805568s               primary  lvm

Which seems to correspond with the device-mapper tables:

# dmsetup table 3600508b1001036303020202020200005p1 3600508b1001036303020202020200005p1: 0 497664 linear 253:2 2048
3600508b1001036303020202020200005p2: 0 499712 linear 253:2 499712
3600508b1001036303020202020200005p3: 0 499712 linear 253:2 999424
3600508b1001036303020202020200005p4: 0 141805568 linear 253:2 1499136

So actually partprobe doesn't need to remove these mappings and recreate them - why does it do so?
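
The comparison made by eye above can be written down as a small Python sketch. The numbers are copied from the parted and dmsetup output in this comment; `parse_dm_linear` and `mappings_match` are hypothetical helpers, and the parser only handles the simple `linear` targets shown here.

```python
BASE = "3600508b1001036303020202020200005"

# Partition number -> (start sector, size in sectors), from the parted output.
PARTED_PARTS = {
    1: (2048, 497664),
    2: (499712, 499712),
    3: (999424, 499712),
    4: (1499136, 141805568),
}

# Mapping lines as printed by `dmsetup table`.
DM_TABLE = "\n".join([
    BASE + "p1: 0 497664 linear 253:2 2048",
    BASE + "p2: 0 499712 linear 253:2 499712",
    BASE + "p3: 0 499712 linear 253:2 999424",
    BASE + "p4: 0 141805568 linear 253:2 1499136",
])


def parse_dm_linear(table_text):
    """Map mapping name -> (start, size) for each linear target line."""
    result = {}
    for line in table_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        # Fields: <logical start> <length> linear <major:minor> <offset>
        if len(fields) == 5 and fields[2] == "linear":
            result[name.strip()] = (int(fields[4]), int(fields[1]))
    return result


def mappings_match(parted_parts, dm_parts, base):
    """True if every partition N has a mapping <base>pN with equal geometry."""
    return all(
        dm_parts.get("%sp%d" % (base, num)) == geom
        for num, geom in parted_parts.items()
    )
```

With the values above, `mappings_match(PARTED_PARTS, parse_dm_linear(DM_TABLE), BASE)` is true, which is the basis for saying the mappings are already correct.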


But the original source of the problem is that the device is claimed by mpath because the wwid is written in /etc/multipath/wwids file - we need to resolve this first.

Comment 33 Peter Rajnoha 2014-12-11 10:02:37 UTC
So two things we need to answer:

  - how did the wwid get into /etc/multipath/wwids even if that wwid is not a path but just an ordinary device?

  - why does partprobe try to recreate the partition mappings if they're already correct?

Comment 34 Fabian Deutsch 2014-12-11 10:57:16 UTC
(In reply to Peter Rajnoha from comment #33)
> So two things we need to answer:
> 
>   - how did the wwid get into /etc/multipath/wwids even if that wwid is not
> a path but just an ordinary device?

My current idea is to use find_multipaths yes as a default, like we'll be doing on RHEV-H 7.0.

I believe that a multipath call wrote the wwids file, because we do not write that file ourselves. Anaconda is not used by us.

>   - why does partprobe try to recreate the partition mappings if they're
> already correct?

Yes. That is the question I'd like to discuss. AFAIU partprobe is not recreating the partitions, but rather sending udev events, udev is then responsible for updating the symlinks.

And there are two questions (at least!):
1. Should partprobe send udev events if nothing changed?
2. Should udev recreate the symlinks if nothing changed?

3. What kind of events does partprobe send? And what would the correct behavior be?

Comment 35 Peter Rajnoha 2014-12-11 12:11:29 UTC
(In reply to Fabian Deutsch from comment #34)
> Yes. That is the question I'd like to discuss. AFAIU partprobe is not
> recreating the partitions, but rather sending udev events, udev is then
> responsible for updating the symlinks.
> 
> And there are two questions (at least!):
> 1. Should partprobe send udev events if nothing changed?
> 2. Should udev recreate the symlinks if nothing changed?
> 
> 3. What kind of events does partprobe send? And what would the correct
> behavior be?

Partprobe itself does not send the events, it's the action done on dm device which triggers the event generation in kernel:
  - device creation
  - device removal
  - device rename
  - device suspend+resume
  - or artificial events based on watch rule (but we ruled them out in comment #31)

Comment 36 Peter Rajnoha 2014-12-11 12:34:22 UTC
Adding Brian to CC for the partprobe part:
  - we're seeing partprobe trying to recreate the partition mappings even if those mappings exist and they are correct already (see comment #32).

Comment 37 Peter Rajnoha 2014-12-11 13:07:31 UTC
I think we can rule out interaction with udev since if I kill udev daemon, I still get the same errors:

[root@hp-bl465cg5-02 ~]# killall udevd

[root@hp-bl465cg5-02 ~]# ps aux | grep udevd
root     28996  0.0  0.0   7908   812 pts/2    S+   13:01   0:00 grep udevd

[root@hp-bl465cg5-02 ~]# partprobe
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
Warning: parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
device-mapper: create ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy
device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy

I'd say it's OK that the remove on 3600508b1001036303020202020200005p4 fails since it's open (and used as PV), but partprobe probably shouldn't try to create this device again (create ioctl on 3600508b1001036303020202020200005p4 failed) when it already exists (and it failed to remove before).

Comment 38 Brian Lane 2014-12-11 16:57:05 UTC
This looks like bug 1136966 where dm-multipath udev rule is always running kpartx.

Comment 39 Fabian Deutsch 2014-12-11 19:05:23 UTC
(In reply to bcl from comment #38)
> This looks like bug 1136966 where dm-multipath udev rule is always running
> kpartx.

I verified several things:

According to the 6.6 z-stream of the 6.7 bug 1136966: bug 1162265,
the build device-mapper-multipath-0.4.9-80.el6_6.1.x86_64 fixes this issue.

The build on the host is:
[root@hp-bl465cg5-02 ~]# rpm -q device-mapper-multipath
device-mapper-multipath-0.4.9-80.el6_6.1.x86_64

means the patch is in.
To be on the safe side, I also manually inspected the 40-multipath.rules file, and can confirm that the rule is in.

Then the question remains: What is the problem here?

Comment 40 Brian Lane 2014-12-11 19:39:52 UTC
Is partprobe really part of the problem? Even if it fails, as long as the device nodes are there it shouldn't matter.

It looks like the real issue is that bootloader failure. It may be that whatever is causing that problem is also causing the partprobe issue. partprobe not being able to notify the kernel should not be causing the bootloader problem.

Comment 41 Fabian Deutsch 2014-12-11 20:21:22 UTC
The message comes from our custom installer.
And that one is calling partprobe behind the scenes. So here is what is happening in the background:

1. partition (several calls to partprobe)
2. write image to disk
3. install bootloader

Looking at the logs we see that it actually fails in step 1.

Comment 42 Brian Lane 2014-12-11 21:35:53 UTC
I have no idea what's going on then. Those ioctl errors look exactly like bug 1136966, parted hasn't changed much in 6.6 -- It slowed down rereading the part table (bug 1074069) and it fixed an assumption it was making about major:minor being sequential (bug 1018075).

Maybe some other udev rule is interfering?

Comment 43 Peter Rajnoha 2014-12-12 07:27:37 UTC
(In reply to bcl from comment #42)
> I have no idea what's going on then. Those ioctl errors look exactly like
> bug 1136966, parted hasn't changed much in 6.6 -- It slowed down rereading
> the part table (bug 1074069) and it fixed an assumption it was making about
> major:minor being sequential (bug 1018075).
> 
> Maybe some other udev rule is interfering?

I've switched off udev completely and it's still reproducible.

Comment 44 Peter Rajnoha 2014-12-12 07:36:11 UTC
(In reply to Peter Rajnoha from comment #43)
> I've switched off udev completely and it's still reproducible.

(I mean not the problem in bug #1136966 - that was a race with kpartx run from udev. But with udev switched off, we can rule out udev interference...)

Comment 45 Fabian Deutsch 2014-12-12 08:47:20 UTC
Brian, running strace kpartx shows:

…
ioctl(4, DM_DEV_REMOVE, 0x12823c0)      = -1 EBUSY (Device or resource busy)
write(2, "device-mapper: remove ioctl on 3"..., 98device-mapper: remove ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy) = 98
…
write(2, "Warning: ", 9Warning: )                = 9
write(2, "parted was unable to re-read the"..., 198parted was unable to re-read the partition table on /dev/mapper/3600508b1001036303020202020200005 (Device or resource busy).  This means Linux won't know anything about the modifications you made. 
) = 198


I interpret this as partprobe being the part that interacts with DM, not udev.
Do you have an idea why partprobe above tries to remove that partition?

Comment 46 Fabian Deutsch 2014-12-12 08:48:52 UTC
Also, further down it tries to create a partition:

…
semctl(29753355, 0, GETVAL, 0xffffffffffffffff) = 2
ioctl(4, DM_DEV_CREATE, 0x127e3b0)      = -1 EBUSY (Device or resource busy)
write(2, "device-mapper: create ioctl on 3"..., 98device-mapper: create ioctl on 3600508b1001036303020202020200005p4 failed: Device or resource busy) = 98
write(2, "\n", 1
…

But why? The partitioning did not change.
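
To pull the failing device-mapper calls out of a longer strace log, a filter along the following lines may help. This is a minimal Python sketch; `failed_dm_ioctls` is a hypothetical helper, and the regular expression only covers the `ioctl(fd, DM_..., 0x...) = -1 ERRNO` shape seen in the excerpts above.

```python
import re

# Matches e.g.: ioctl(4, DM_DEV_REMOVE, 0x12823c0)      = -1 EBUSY (...)
IOCTL_RE = re.compile(r"ioctl\(\d+, (DM_\w+), 0x[0-9a-f]+\)\s+= -1 (\w+)")


def failed_dm_ioctls(strace_text):
    """Return (request, errno name) pairs for DM ioctls that returned -1."""
    return IOCTL_RE.findall(strace_text)
```

Applied to the two excerpts above, it yields the remove followed by the create, both failing with EBUSY.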

Comment 47 Fabian Deutsch 2014-12-12 09:50:20 UTC
With the assumption that partprobe is calling dm directly, I dug a bit more.
It turns out that libparted/arch/linux.c is the place where the DM calls come from.

parted-2.1 with some patches is used in RHEL 6.6.

Looking at the changes done to the file above after 2.1, revealed:

commit c605b2cea04a6c7478f5ed1254c74e02d943fb58
Author: Hans de Goede <hdegoede>
Date:   Fri Apr 23 13:08:43 2010 +0200

    linux: detect dm_task_run failure
    
    We were checking for a return value of < 0 for dm_task_run errors, but
    dm_task_run returns 0 on error (and 1 on success). Thanks to Joe Jin
    for spotting this, see Red Hat bug 582907.
    
    * libparted/arch/linux.c(_dm_remove_map_name, _dm_is_part,
    _dm_remove_parts, _dm_add_partition): dm_task_run returns 0 on error.


I am not familiar with the parted code, but the patch matches the observations.
We observe: Partitions are created even if it is not necessary.
The patch fixes: Incorrect return code handling (basically negating return codes IIUC)

The following patches might also be relevant here, because they looked like follow up patches, but I am not sure if they affect our codepath:

commit 76f8e829e43773778b69915b5cfff9f643701074
Author: Jim Meyering <meyering>
Date:   Mon Apr 12 12:08:16 2010 +0200

    libparted: linux_disk_commit: don't ignore _disk_sync_part_table failure
    
    * libparted/arch/linux.c (linux_disk_commit):
    When calling _disk_sync_part_table, always return its result.

commit 81ed7fc413375a8b8ed5bd792e7385dacaf8a3e1
Author: Jim Meyering <meyering>
Date:   Mon Apr 12 12:06:30 2010 +0200

    libparted: _disk_sync_part_table: always return 0 upon failure
    
    * libparted/arch/linux.c (_disk_sync_part_table):
    Return 0 (not 1) upon failure.


Brian, what do you think?

Comment 49 Brian Lane 2014-12-12 18:23:46 UTC
Ok, I think I see what's going on now. Sorry for not realizing it sooner. I think this boils down to the fact that parted 2.1 and partprobe cannot operate on busy devices.

When you call partprobe without specifying a device it probes all the devices on the system and attempts to tell the kernel about the partitioning. For non-dm devices this is done using BLKRRPART calls. For dm it removes and recreates the mappings.

As for the patches in comment 47, they aren't related, and are from newer code than we have in 2.1. The return values in 2.1 are working correctly.

I think the real problem here is your use of partprobe. Have you changed how you call it recently? Or added steps?

In the reproducer script you made for bug 1173698, you are calling partprobe later than you should and not specifying the device that was partitioned. Is this how it is being called in RHEV, and how is this different from previous releases?

Comment 50 Fabian Deutsch 2014-12-13 07:52:08 UTC
We had calls like "partprobe /dev/mapper/*" since 2010 and "partprobe" since 2012. Not really a recent change.

And everything works well on 6.5. I know that there weren't many changes to parted since 6.5, but maybe kernel timing changes uncover the problems now.

That said, we did recently add a plain partprobe call to a function. The reason we added those calls was often that partitions did not appear after we created them.

Regarding the patches, yes, they are not part of 2.1, I just wondered if they needed to get backported.


We could try ripping out the plain partprobe calls, because we know there are improvements on the multipath and udev fronts. But it still does not explain why we are facing the problems now on 6.6.

Comment 51 Fabian Deutsch 2014-12-15 07:24:00 UTC
*** Bug 1171892 has been marked as a duplicate of this bug. ***

Comment 52 Brian Lane 2014-12-15 14:57:33 UTC
If partitions aren't showing up after creation, that may be because old metadata on the device is being activated by something else (mdadm, lvm, ?). We've been seeing problems like that with Fedora and Anaconda: if you repeat an exact layout, the metadata gets detected and something else interferes with parted notifying the kernel.

A bare partprobe shouldn't ever work if there are partitions mounted so those should be changed to specify the devices. You may also need to do more to wipe out previous metadata on the device.
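
One way to enforce "always name the device" in an installer is to build the command through a helper that refuses an empty device list. The Python sketch below is an illustration only, not the ovirt-node change; `partprobe_argv` and `reread_partitions` are hypothetical names, and the only external fact relied on is that partprobe accepts device paths as arguments.

```python
import subprocess


def partprobe_argv(devices):
    """Build a partprobe command line that names every device explicitly."""
    devices = list(devices)
    if not devices:
        # A bare `partprobe` would probe all devices, including busy ones.
        raise ValueError("refusing to run a bare partprobe")
    return ["partprobe"] + devices


def reread_partitions(devices):
    """Run partprobe on the given devices and return its exit status."""
    return subprocess.call(partprobe_argv(devices))
```

A caller would pass only the disk that was just partitioned, for example `reread_partitions(["/dev/mapper/3600508b1001036303020202020200005"])`.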

Comment 53 cshao 2014-12-16 06:35:10 UTC
Hi fabiand,

Can we request cciss machine (10.66.73.4) back for our new build(1212) testing?

Thanks!

Comment 54 Fabian Deutsch 2014-12-16 08:34:44 UTC
Yes.

Comment 56 cshao 2014-12-17 06:59:20 UTC
Created attachment 969921 [details]
1212-failed.png

Comment 57 cshao 2014-12-17 07:00:34 UTC
Created attachment 969922 [details]
1212.tar.gz

Comment 58 Ying Cui 2014-12-17 11:37:07 UTC
After reviewing all the comments above and confirming with shaochen, here is a summary of the reproduction rate per test machine on the QE side:

server:
    Dell R210 - tested about 5 times, no such issue.
    Dell pet105-01 - tested about 5 times, 2 times encountered this bug 1167240.
    HP-bl465cg5-02 - tested about 8 times, 4 times encountered this bug 1167240.

workstation:
    hp-xw4550-02 - tested about 6 times, 2 times encountered this bug 1167240.

desktop:
    Dell 9010 - tested about 10 times, 3 times encountered this bug 1167240.
    Dell 790 - tested about 10 times, 4 times encountered this bug 1167240.
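
For reference, the overall reproduction rate implied by these counts can be worked out as follows. This is a small Python sketch; the run counts are the approximate ("about N times") figures from this comment, so the result is only a rough estimate.

```python
# Machine -> (approximate number of installs, number of failures).
RUNS = {
    "Dell R210": (5, 0),
    "Dell pet105-01": (5, 2),
    "HP-bl465cg5-02": (8, 4),
    "hp-xw4550-02": (6, 2),
    "Dell 9010": (10, 3),
    "Dell 790": (10, 4),
}


def failure_rate(runs):
    """Return (failed, total, failed / total) across all machines."""
    total = sum(tested for tested, _ in runs.values())
    failed = sum(fails for _, fails in runs.values())
    return failed, total, failed / total
```

`failure_rate(RUNS)` gives 15 failures out of 44 installs, roughly 34%.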

Comment 59 Ying Cui 2014-12-17 12:00:35 UTC
The HP customer encountered the failed bootloader issue in bug 1171892 with RHEV 3.5 beta5 builds, so it needs to be highlighted here.

Comment 60 Ying Cui 2014-12-17 13:03:41 UTC
From QE side, I highlight and summarize this bug here:

1. this bug exists in rhevh 6.6 for 3.4.z build, see bug description and comment 17
2. this bug exists in rhevh 6.6 for 3.5 build, see bug comment 55.
3. this bug exists in rhevh 7.0 for 3.5 build, see bug 1171892.

Do we need to split this bug into separate 3.4.z and 3.5 bugs to track them separately?

Comment 61 Fabian Deutsch 2014-12-18 14:13:27 UTC
(In reply to Ying Cui from comment #60)
> From QE side, I highlight and summarize this bug here:
> 
> 1. this bug exists in rhevh 6.6 for 3.4.z build, see bug description and
> comment 17
> 2. this bug exists in rhevh 6.6 for 3.5 build, see bug comment 55.
> 3. this bug exists in rhevh 7.0 for 3.5 build, see bug 1171892.
> 
> Do we need to split this bug into separate 3.4.z and 3.5 bugs to track them
> separately?

We can track it here, and then do a z-stream clone to track.

The attached patch addresses this issue by reducing the number of partprobe calls, which can lead to this problem.

Comment 62 cshao 2014-12-19 03:24:13 UTC
Test version:
rhev-hypervisor6-6.6-20141218.0.el6ev
ovirt-node-3.1.0-0.37.20141218gitcf277e1.el6.noarch

I noticed that the patch http://gerrit.ovirt.org/#/c/36254/ has been merged, but I still encountered the failed bootloader issue with the above build.

Dell pet105 -01 - tested about 2 times, 1 time encountered this bug 1167240.

Please see the new attachment "1218.tar.gz" for more details.
Because there is a new build (1218) coming, I can't keep the test environment available.
All log info has been uploaded:
/var/log/*.*
/tmp/ovirt.log

Thanks for understanding.

Comment 63 cshao 2014-12-19 03:24:46 UTC
Created attachment 970989 [details]
1218.tar.gz

Comment 64 cshao 2014-12-19 03:25:38 UTC
Created attachment 970990 [details]
1218-bootloader.png

Comment 65 cshao 2014-12-19 06:13:18 UTC
Changing status to ASSIGNED according to comment 62.

Comment 66 cshao 2014-12-19 10:34:32 UTC
Hi fabiand,

I just reproduced this bug on hp-z800-02, so the bug does not appear only on pet105-1.
I still can't provide remote access because new build testing is in progress.

Thanks!

Comment 67 Fabian Deutsch 2014-12-19 11:36:14 UTC
Chen, please provide logs for every failed installation. We need more data to solve this.

Comment 68 cshao 2014-12-19 11:44:13 UTC
(In reply to Fabian Deutsch from comment #67)
> Chen, please provide logs for every failed installation. We need more data
> to solve this.

Hi fabiand,

Actually, I have uploaded all the log info as attachment "1218.tar.gz", please see comment 63.

Thanks!

Comment 69 Fabian Deutsch 2014-12-19 11:50:04 UTC
(In reply to shaochen from comment #68)
> (In reply to Fabian Deutsch from comment #67)
> > Chen, please provide logs for every failed installation. We need more data
> > to solve this.
> 
> Hi fabiand,
> 
> Actually, I have uploaded all the log info as attachment "1218.tar.gz", please
> see comment 63.

Please add the logs for the failed installation from comment 66.

We need all the data, to get a picture of what is going wrong.

Comment 70 Fabian Deutsch 2014-12-19 12:07:59 UTC
(In reply to shaochen from comment #63)
> Created attachment 970989 [details]
> 1218.tar.gz

It looks like a partition labeled Root exists twice, this should not be the case.
Let's see what the other failure log says.
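
A duplicated label like this can be spotted from blkid's default output. The Python sketch below is only an illustration: `duplicate_labels` is a hypothetical helper, it assumes the usual `DEVICE: LABEL="..." UUID="..." TYPE="..."` line format, and the device names in the usage note are made up, not taken from the attached logs.

```python
import re
from collections import defaultdict

# Captures the device node and its filesystem LABEL from one blkid line.
LINE_RE = re.compile(r'^(?P<dev>[^:]+):.*\bLABEL="(?P<label>[^"]*)"')


def duplicate_labels(blkid_text):
    """Return {label: [devices]} for every label carried by more than one device."""
    by_label = defaultdict(list)
    for line in blkid_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            by_label[match.group("label")].append(match.group("dev"))
    return {label: devs for label, devs in by_label.items() if len(devs) > 1}
```

For example, if two hypothetical nodes /dev/sda2 and /dev/sdb3 both carried LABEL="Root", the function would return a single entry mapping "Root" to both devices.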

Comment 71 cshao 2014-12-22 10:49:25 UTC
> Please add the logs for the failed installation from comment 66.
> 
> We need all the data, to get a picture of what is going wrong.

To be clear, comment 66 was only meant to confirm that the bug does not appear only on pet105-1, as we discussed on IRC, so I didn't attach the log info.

Today I tested 5 times on hp-z800-02 and didn't hit the issue, so I can't obtain logs, but I will upload the log info once I can.

Thanks!

Comment 72 Ying Cui 2014-12-24 04:52:03 UTC
Removed the bug subject prefix "[3.4_6.6]" because this bug occurs on 6.6_3.4.z/6.6_3.5/7.0_3.5, see comment 60 and comment 61.

Comment 75 cshao 2014-12-31 09:51:14 UTC
Test version:
ovirt-node-iso-3.5-0.999.201412310932.el6.iso
ovirt-node-3.1.999-0.0.master.el6.noarch

Test machines:
Dell pet105 -02 - tested about 8 times, did not encounter this bug.
dell-per210-01 - tested about 5 times, did not encounter this bug.
Dell 790  - tested about 2 times, did not encounter this bug.

Test result:
I can't reproduce this issue; the bug seems to be gone. I will verify this bug after getting the official build from the brew page. If I can reproduce the issue again, I will leave the environment in place and let you know.

Thanks!

Comment 76 cshao 2015-01-04 03:22:35 UTC
Test version:
ovirt-node-iso-3.5-0.999.201412310932.el6.iso
ovirt-node-3.1.999-0.0.master.el6.noarch

Test machines:
hp-z800-02 - tested about 5 times, did not encounter this bug.
hp-5850 - tested about 5 times, did not encounter this bug.

Comment 78 cshao 2015-01-06 07:18:27 UTC
Test version:
rhev-hypervisor6-6.6-20150105.0
ovirt-node-3.1.0-0.39.20150105gitb784105.el6.noarch

rhev-hypervisor7-7.0-20150105.0
ovirt-node-3.1.0-0.39.20150105gitb784105.el7.noarch

Test machines:
Dell pet105 -02 - tested about 5 times, did not encounter this bug.
hp-5850 - tested about 5 times, did not encounter this bug.
hp-bl465cg5-01 - tested about 5 times, did not encounter this bug.

Test result:
I can't reproduce this issue; the bug seems to be gone. I will verify this bug after the status changes to ON_QA.

Comment 84 cshao 2015-01-08 10:30:30 UTC
Test version:
rhev-hypervisor7-7.0-20150106.0
ovirt-node-3.1.0-0.40.20150105git69f34a6.el7.noarch

Test machines: 
Dell 790 - tested about 10 times, did not encounter this bug
hp-5850 -  tested about 5 times, did not encounter this bug
dell-pet105-02 - tested about 5 times, did not encounter this bug
hp-z800-02 - tested about 5 times, did not encounter this bug
dell r510 - tested about 5 times, did not encounter this bug

Comment 85 cshao 2015-01-21 03:17:35 UTC
(In reply to shaochen from comment #84)
> Test version:
> rhev-hypervisor7-7.0-20150106.0
> ovirt-node-3.1.0-0.40.20150105git69f34a6.el7.noarch
> 
> Test machines: 
> Dell 790 - tested about 10 times, did not encounter this bug
> hp-5850 -  tested about 5 times, did not encounter this bug
> dell-pet105-02 - tested about 5 times, did not encounter this bug
> hp-z800-02 - tested about 5 times, did not encounter this bug
> dell r510 - tested about 5 times, did not encounter this bug

Test version:
rhev-hypervisor7-7.0-20150114.0
ovirt-node-3.2.1-4.el7.noarch

Test machines: 
Dell 9010 - Tested 7 times, did not encounter this bug
HP-z600-03 - Tested 5 times, did not encounter this bug
HP-z800-02 - Tested 6 times, did not encounter this bug
Dell-pet105-02 - Tested 5 times, did not encounter this bug
HP-5850 - Tested 3 times, did not encounter this bug

Test result:
According to the above test results and comment 84, the bug has been fixed. Changing bug status to VERIFIED.

Comment 87 errata-xmlrpc 2015-02-11 21:06:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0160.html

