Bug 1211405

Summary: sfdisk dump/restore alters partition table on some disks (at least Fedora cloud images)
Product: [Fedora] Fedora
Reporter: Adam Williamson <awilliam>
Component: util-linux
Assignee: Karel Zak <kzak>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Priority: unspecified
Version: 26
CC: awilliam, dustymabe, jonathan, kzak
Target Milestone: ---
Keywords: Reopened
Target Release: ---
Flags: dustymabe: needinfo? (kzak)
Hardware: x86_64
OS: All
Doc Type: Bug Fix
Last Closed: 2018-05-29 11:57:42 UTC
Type: Bug

Description Adam Williamson 2015-04-13 23:02:13 UTC
On at least some disks, including the Fedora Cloud images, simply running 'sfdisk --dump /dev/vdb > dump; sfdisk /dev/vdb < dump' alters the partition table in some way. It should just dump the existing configuration and then restore it exactly as it was.

This appears to be the root cause of https://bugzilla.redhat.com/show_bug.cgi?id=1147998 , which was worked around.

To reproduce:

1. Get a Fedora Cloud image (e.g. https://dl.fedoraproject.org/pub/alt/stage/22_Beta_RC1/Cloud-Images/x86_64/Images/Fedora-Cloud-Base-22_Beta-20150407.x86_64.qcow2 )
2. Attach it to a VM (I'm assuming it shows up as '/dev/vdb' in the end; if not, adjust accordingly)
3. Attach a Fedora live image to the VM's CD/DVD drive
4. Boot to the live image
5. Install vim-common (which provides xxd)
6. As root, run 'xxd -l512 /dev/vdb > /tmp/before'
7. As root, run 'sfdisk --dump /dev/vdb > dump'
8. As root, run 'sfdisk /dev/vdb < dump --force'
9. As root, run 'xxd -l512 /dev/vdb > /tmp/after'
10. Run 'diff -u /tmp/before /tmp/after'
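Comparing xxd text dumps shows which bytes changed but not what they mean. A minimal Python sketch that labels each changed byte with the MBR field it belongs to could look like this. It assumes raw 512-byte sector captures (for example from `dd if=/dev/vdb of=before.bin bs=512 count=1`) rather than the xxd output used in the steps above; the file names and the `mbr_field`/`diff_mbr` helpers are hypothetical and not part of util-linux.

```python
import sys

BOOT_CODE_END = 0x1B8   # 440 bytes of boot code precede the disk signature
PART_TABLE = 0x1BE      # four 16-byte partition entries start here
ENTRY_SIZE = 16
SIGNATURE = 0x1FE       # the final two bytes should be 0x55 0xAA

# (offset inside an entry, length, field name) for one MBR partition entry
ENTRY_FIELDS = [
    (0, 1, "boot flag"),
    (1, 3, "start CHS"),
    (4, 1, "type"),
    (5, 3, "end CHS"),
    (8, 4, "start LBA"),
    (12, 4, "sector count"),
]


def mbr_field(offset: int) -> str:
    """Name the MBR field that contains the given byte offset (0-511)."""
    if offset < BOOT_CODE_END:
        return "boot code"
    if offset < 0x1BC:
        return "disk signature"
    if offset < PART_TABLE:
        return "reserved"
    if offset < SIGNATURE:
        index, rel = divmod(offset - PART_TABLE, ENTRY_SIZE)
        name = next(n for start, length, n in ENTRY_FIELDS
                    if start <= rel < start + length)
        return f"partition {index + 1} {name}"
    return "boot signature (0x55AA)"


def diff_mbr(before: bytes, after: bytes) -> list[tuple[int, int, int, str]]:
    """Return (offset, old byte, new byte, field) for every changed byte."""
    if len(before) < 512 or len(after) < 512:
        raise ValueError("each dump must contain a full 512-byte sector")
    return [(off, before[off], after[off], mbr_field(off))
            for off in range(512) if before[off] != after[off]]


# Only act as a command-line tool when given exactly two file arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as f_before, open(sys.argv[2], "rb") as f_after:
        changes = diff_mbr(f_before.read(512), f_after.read(512))
    for off, old, new, field in changes:
        print(f"0x{off:03x}: {old:02x} -> {new:02x}  {field}")
    if not changes:
        print("first sector unchanged")
```

Run against the bytes from the Cloud image case in comment 7, every changed offset (0x1bf, 0x1c1, 0x1c3-0x1c5) falls in the partition 1 start or end CHS fields; the start LBA and sector count fields are identical.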

Note that the two are not identical. Tested with util-linux-2.26.1-1.fc22, but per #1147998, this seems likely to have also been present in 2.25. In my test, this is the diff:

--- /tmp/before	2015-04-13 18:55:53.543764592 -0400
+++ /tmp/after	2015-04-13 18:58:12.062145749 -0400
@@ -27,6 +27,6 @@
 00001a0: b40e cd10 ac3c 0075 f4c3 0000 0000 0000  .....<.u........
 00001b0: 0000 0000 0000 0000 a765 9cb5 0000 8000  .........e......
 00001c0: 2102 830e def9 0008 0000 00a0 0f00 000e  !...............
-00001d0: dff9 8e0f ffff 00a8 0f00 0058 7002 0000  ...........Xp...
+00001d0: dff9 8e02 a28a 00a8 0f00 0058 7002 0000  ...........Xp...
 00001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
 00001f0: 0000 0000 0000 0000 0000 0000 0000 55aa  ..............U.
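The changed bytes above (0x1d3-0x1d5) belong to the second partition entry, which spans 0x1ce-0x1dd. A minimal Python sketch that decodes a 16-byte entry using the standard MBR layout (boot flag, packed start CHS, type, packed end CHS, then little-endian start LBA and sector count) makes the change readable. `decode_chs`, `decode_entry` and `PartitionEntry` are hypothetical names used only for illustration.

```python
import struct
from typing import NamedTuple


class PartitionEntry(NamedTuple):
    bootable: bool
    start_chs: tuple[int, int, int]   # (cylinder, head, sector)
    type_id: int
    end_chs: tuple[int, int, int]
    start_lba: int
    sectors: int


def decode_chs(raw: bytes) -> tuple[int, int, int]:
    """Decode the packed 3-byte CHS address used in MBR entries.

    Byte 0 is the head; byte 1 holds the sector in bits 0-5 and cylinder
    bits 8-9 in bits 6-7; byte 2 holds the low 8 bits of the cylinder.
    """
    head, sec_cyl, cyl_low = raw
    return (((sec_cyl & 0xC0) << 2) | cyl_low, head, sec_cyl & 0x3F)


def decode_entry(entry: bytes) -> PartitionEntry:
    """Decode one 16-byte MBR partition entry."""
    if len(entry) != 16:
        raise ValueError("an MBR partition entry is exactly 16 bytes")
    start_lba, sectors = struct.unpack_from("<II", entry, 8)  # little-endian
    return PartitionEntry(entry[0] == 0x80, decode_chs(entry[1:4]), entry[4],
                          decode_chs(entry[5:8]), start_lba, sectors)


old = decode_entry(bytes.fromhex("00 0e df f9 8e 0f ff ff 00 a8 0f 00 00 58 70 02"))
new = decode_entry(bytes.fromhex("00 0e df f9 8e 02 a2 8a 00 a8 0f 00 00 58 70 02"))
print(old)
print(new)
```

For both versions of the second entry, the type (0x8e), start LBA (1026048) and size (40916992 sectors) are identical; only the end CHS changes, from (1023, 15, 63) to (650, 2, 34).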

dustymabe also observed this. He tested within the cloud environment, disabling the 'growpart' script that runs on first boot so that it had not already caused the change by the time he ran the test. He found this difference in his test:

# Before sfdisk dump/restore (sfdisk -d /dev/vda > mbr.dump; sfdisk --force /dev/vda < mbr.dump)
[root@f22sfdisk ~]# file -s /dev/vda
/dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS (0x0,32,33), end-CHS (0x17e,146,20), startsector 2048, 6144000 sectors

# After sfdisk dump/restore
[root@f22sfdisk ~]# file -s /dev/vda
/dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS (0x2,0,33), end-CHS (0x3d1,4,20), startsector 2048, 6144000 sectors
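Both file(1) lines describe the same LBA range (start 2048, end 2048 + 6144000 - 1 = 6146047). The two CHS triples are consistent with that range being converted under different geometries: 63 sectors per track with 255 heads before, and 63 sectors per track with 16 heads after, where the end cylinder (6097) does not fit in the MBR's 10 cylinder bits and ends up as 977 (0x3d1). These geometries are inferred from the numbers only; the report does not say which geometry either tool assumed. A small Python worked check under that assumption (`lba_to_chs`, `wrap` and `clamp` are hypothetical helpers):

```python
MAX_CYLINDER = 1023  # the MBR stores only 10 bits of the cylinder number


def lba_to_chs(lba: int, heads: int, spt: int) -> tuple[int, int, int]:
    """Convert an LBA sector number to an unbounded (cylinder, head, sector)."""
    cylinder, rem = divmod(lba, heads * spt)
    head, sector0 = divmod(rem, spt)
    return cylinder, head, sector0 + 1  # CHS sector numbers are 1-based


def wrap(chs: tuple[int, int, int]) -> tuple[int, int, int]:
    """Keep only the low 10 bits of the cylinder (what plain truncation gives)."""
    cylinder, head, sector = chs
    return cylinder & MAX_CYLINDER, head, sector


def clamp(chs: tuple[int, int, int], heads: int, spt: int) -> tuple[int, int, int]:
    """Replace an out-of-range address with the largest encodable one."""
    cylinder, _, _ = chs
    if cylinder > MAX_CYLINDER:
        return MAX_CYLINDER, heads - 1, spt
    return chs


start, end = 2048, 2048 + 6144000 - 1
print(lba_to_chs(start, 255, 63), lba_to_chs(end, 255, 63))      # "before"
print(lba_to_chs(start, 16, 63), wrap(lba_to_chs(end, 16, 63)))  # "after"
```

The same arithmetic fits the diff in the description: with 16 heads and 63 sectors per track, the last sector of the 20 GiB disk (LBA 41943039) is cylinder 41610, and the bytes there change from the clamped (1023, 15, 63) to the wrapped (650, 2, 34). Clamping to the largest encodable address is a commonly used convention for partitions beyond the CHS limit.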

Comment 1 Adam Williamson 2015-04-13 23:43:12 UTC
The diff I pasted above is actually from a random VM's hard disk, as I ran the test on the wrong disk... but it still illustrates the problem. I confirmed that I do see a difference between 'before' and 'after' when running the same test on the right disk (a Cloud image).

Comment 2 Karel Zak 2015-04-13 23:45:56 UTC
(In reply to awilliam@redhat.com from comment #0)
> [root@f22sfdisk ~]# file -s /dev/vda
> /dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS
> (0x0,32,33), end-CHS (0x17e,146,20), startsector 2048, 6144000 sectors
> 
> # After sfdisk dump/restore
> [root@f22sfdisk ~]# file -s /dev/vda
> /dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS
> (0x2,0,33), end-CHS (0x3d1,4,20), startsector 2048, 6144000 sectors

Do you mean that the difference is in the CHS addresses? Frankly, I don't care about CHS, and the plan is to drop it in the long term.

What matters is that the start and size in sectors are the same.

Comment 3 Adam Williamson 2015-04-14 00:25:11 UTC
Karel: I've no idea, you're the expert. :) I'm just logging the differences.

The practical impact, however, in at least some cases, is serious: it stops the system booting.

Take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1147998#c0 , particularly the two xxd dumps. The first contains some bootloader code whose origin is slightly mysterious, but we think it might come from parted. That's the boot code you get after running an anaconda install to a completely clean disk with syslinux as the bootloader. Whatever it is, it is capable of booting the system, *until* an sfdisk dump/restore. Once an sfdisk dump/restore has happened, the system no longer boots.

The 'fix' for #1147998 was really more of a workaround: we added lines in the cloud kickstarts' %post sections that do this:

dd if=/usr/share/syslinux/mbr.bin of=/dev/vda

That installs some boot code provided by syslinux. That's the second dump in https://bugzilla.redhat.com/show_bug.cgi?id=1147998#c0 . That boot code is apparently more sophisticated, and continues to work after whatever sfdisk does to the partition table. But it seems reasonable to me to characterize this as a workaround for the sfdisk bug, not really a *fix*. It doesn't seem correct for a straight dump/restore with sfdisk to stop the system booting.
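The workaround only replaces boot code. A minimal sketch of its effect on the first sector, assuming `mbr.bin` is no longer than the 440-byte boot-code area that precedes the disk signature at offset 0x1B8 (an assumption about the syslinux file, not something stated in this report): everything after the boot code, including the partition table sfdisk wrote, is left as it was. `install_boot_code` is a hypothetical helper, not a syslinux or util-linux function.

```python
BOOT_CODE_MAX = 440  # bytes 0x000-0x1B7; the disk signature starts at 0x1B8


def install_boot_code(sector: bytes, code: bytes) -> bytes:
    """Return a copy of a 512-byte MBR sector with `code` written at offset 0.

    Everything from len(code) onward is kept, so the disk signature, the
    partition table at 0x1BE and the 0x55AA marker are not modified.
    """
    if len(sector) != 512:
        raise ValueError("expected exactly one 512-byte sector")
    if len(code) > BOOT_CODE_MAX:
        raise ValueError("boot code would overwrite the disk signature or partition table")
    return code + sector[len(code):]
```

This also suggests why the workaround is needed after every dump/restore-style rewrite rather than fixing the CHS values: it changes which code runs at boot, not what the partition entries contain.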

Comment 4 Karel Zak 2015-04-14 11:36:05 UTC
(In reply to awilliam@redhat.com from comment #3)
> The practical impact, however, in at least some cases, is serious: it stops
> the system booting.
> 
> Take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1147998#c0 ,

But this was already reported against util-linux as bug #1210428, and it is already fixed in util-linux-2.26.1-1.fc22.

(The update fixed two bugs: 1/ the problem with the boot flag; 2/ the problem with the erased boot sector.)

So, are you reporting something else, or is this a duplicate of #1210428? I'm lost :-)

> The 'fix' for #1147998 was really more of a workaround: we added lines in
> the cloud kickstarts' %post sections that do this:
> 
> dd if=/usr/share/syslinux/mbr.bin of=/dev/vda

I guess this workaround is unnecessary with util-linux-2.26.1-1.fc22, try it.

Comment 5 Adam Williamson 2015-04-14 12:33:42 UTC
#1210428 was never the same bug as #1147998; they just have similar *effects*. The boot sector deletion and boot flag clearing issues were new in 2.26. The partition table modification issue already existed in 2.25 (and probably earlier).

"I guess this workaround is unnecessary with util-linux-2.26.1-1.fc22, try it."

It is not. We had some trouble getting the images to build, so dgilmore took the 'dd' out of %post, and we found that when he did that, #1147998 immediately came back. There was no issue with the boot flag and the boot sector was not zeroed, but systems still did not reboot.

He then managed to get an image to build with the 'dd' line back in %post, and rebooting works.

So the cause of #1147998 has remained in util-linux all along, and we still need the dd in %post to work around it.

Comment 6 Karel Zak 2015-04-14 16:24:22 UTC
I'm still not sure the issue is in sfdisk:

# wget https://dl.fedoraproject.org/pub/alt/stage/22_Beta_RC1/Cloud-Images/x86_64/Images/Fedora-Cloud-Base-22_Beta-20150407.x86_64.qcow2

# modprobe nbd
# qemu-nbd -c /dev/nbd0 Fedora-Cloud-Base-22_Beta-20150407.x86_64.qcow2

# xxd -l512 /dev/nbd0 > /tmp/before

# sfdisk --dump /dev/nbd0 > dump
# sfdisk /dev/nbd0 --force < dump

# xxd -l512 /dev/nbd0 > /tmp/after

# md5sum /tmp/before /tmp/after
22444b10a0a07dc7dde3bddd6bb3c1e3  /tmp/before
22444b10a0a07dc7dde3bddd6bb3c1e3  /tmp/after

... so no change.

I'm able to reproduce the problem only with sfdisk 2.26 (i.e. without the fix for bug #1210428).

I can try the reproducer from comment #0, but I really doubt that sfdisk behaves differently within a VM (well, I guess the VM image does not contain the broken v2.26).

Comment 7 Dusty Mabe 2015-04-17 03:53:40 UTC
Hi Karel,

I am using https://kojipkgs.fedoraproject.org//work/tasks/648/9470648/Fedora-Cloud-Base-22_Beta-20150411.x86_64.qcow2 and booting it on openstack. I first disable growpart with the following user-data:

#cloud-config
growpart:
  mode: off
  ignore_growroot_disabled: false

and then I boot an instance, log in, obtain root, and install vim-common/file rpms. After that here are the exact commands I run and the output obtained:

[root@f22sfdisk2 ~]# rpm -q util-linux
util-linux-2.26.1-1.fc22.x86_64
[root@f22sfdisk2 ~]# 
[root@f22sfdisk2 ~]# xxd -l512 /dev/vda > before
[root@f22sfdisk2 ~]# sfdisk -d /dev/vda > mbr.dump
[root@f22sfdisk2 ~]# file -s /dev/vda
/dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS (0x0,32,33), end-CHS (0x17e,146,20), startsector 2048, 6144000 sectors
[root@f22sfdisk2 ~]# sfdisk --force /dev/vda  < mbr.dump
Checking that no-one is using this disk right now ... FAILED

This disk is currently in use - repartitioning is probably a bad idea.
Umount all file systems, and swapoff all swap partitions on this disk.
Use the --no-reread flag to suppress this check.

Disk /dev/vda: 20 GiB, 21474836480 bytes, 41943040 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x28824fd4

Old situation:

Device     Boot Start     End Sectors Size Id Type
/dev/vda1  *     2048 6146047 6144000   3G 83 Linux

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new DOS disklabel with disk identifier 0x28824fd4.
Created a new partition 1 of type 'Linux' and of size 3 GiB.
/dev/vda2: 
New situation:

Device     Boot Start     End Sectors Size Id Type
/dev/vda1  *     2048 6146047 6144000   3G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.
[root@f22sfdisk2 ~]# 
[root@f22sfdisk2 ~]# xxd -l512 /dev/vda > after
[root@f22sfdisk2 ~]# diff before after
28,29c28,29
< 00001b0: 0000 0000 0000 0000 d44f 8228 0000 8020  .........O.(... 
< 00001c0: 2100 8392 547e 0008 0000 00c0 5d00 0000  !...T~......]...
---
> 00001b0: 0000 0000 0000 0000 d44f 8228 0000 8000  .........O.(....
> 00001c0: 2102 8304 d4d1 0008 0000 00c0 5d00 0000  !...........]...
[root@f22sfdisk2 ~]# 
[root@f22sfdisk2 ~]# file -s /dev/vda
/dev/vda: DOS/MBR boot sector; partition 1 : ID=0x83, active, start-CHS (0x2,0,33), end-CHS (0x3d1,4,20), startsector 2048, 6144000 sectors

Comment 8 Dusty Mabe 2015-04-17 03:59:47 UTC
I forgot to say this before: the operation performed in the previous comment renders the instance unbootable (it won't reboot or boot normally).

Comment 9 Dusty Mabe 2015-04-23 19:27:45 UTC
kzak, have you been able to recreate the issue? Is there anything wrong with my reproducer that you can see?

Comment 10 Dusty Mabe 2015-08-06 18:17:49 UTC
Marking as needinfo.

Comment 11 Karel Zak 2015-08-12 10:58:29 UTC
Frankly, "booting it on openstack" is not too useful reproducer for me ;-)

(As low-level system developer I do not use openstack and if I need virtual machine/container I use qemu/libvirt/systemd-nspawn/etc. ...I don't plan to study and debug openstack for this one issue.)

It would be nice to have a step-by-step reproducer.

Note that all works as expected with qemu-nbd (comment #6).

Comment 12 Fedora End Of Life 2016-07-19 20:08:39 UTC
Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 13 Adam Williamson 2016-08-24 22:02:55 UTC
This still looks to be valid. Recently there has been a bug with cloud image composes that prevents %post from being run, so the workaround for this bug (which is *still* in the cloud %post) was not getting run. And, indeed, the images don't have the right MBR contents and don't reach a bootloader on boot.

Comment 14 Fedora End Of Life 2017-02-28 09:42:48 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle.
Changing version to '26'.

Comment 15 Fedora End Of Life 2018-05-03 08:15:27 UTC
This message is a reminder that Fedora 26 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 26. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '26'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 26 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 16 Fedora End Of Life 2018-05-29 11:57:42 UTC
Fedora 26 changed to end-of-life (EOL) status on 2018-05-29. Fedora 26
is no longer maintained, which means that it will not receive any
further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.