Bug 1465526 - RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/mapper/cl-root on /sysroot failed: Structure needs cleaning
Status: NEW
Product: Virtualization Tools
Classification: Community
Component: libguestfs
Hardware: x86_64 Linux
Assigned To: Richard W.M. Jones
Reported: 2017-06-27 10:50 EDT by Nadav Goldin
Modified: 2017-07-05 05:15 EDT (History)
Type: Bug
Description Nadav Goldin 2017-06-27 10:50:30 EDT
Description of problem:
Using guestfs python bindings:

   r = libguestfsmod.mount_ro(self._o, mountable, mountpoint)  

Fails with:

E       RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/mapper/cl-root on /sysroot failed: Structure needs cleaning

libguestfs version: 1.36.4 
Host is Fedora 26
The guest image is CentOS 7.3

So far I have only seen it fail once.

The full logs (with libguestfs debug mode enabled) are available at:

http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__root_nowhere_dead_/


Any ideas?
Comment 1 Richard W.M. Jones 2017-06-27 11:07:11 EDT
I think your filesystem is corrupt.  If you look in the long log
just before the error you'll see lots of lines like:

[   17.623837] XFS (dm-1): Metadata corruption detected at xfs_inode_buf_verify+0x73/0xf0 [xfs], xfs_inode block 0x509000
[   17.648406] XFS (dm-1): Unmount and run xfs_repair
[   17.659452] XFS (dm-1): First 64 bytes of corrupted metadata buffer:
[   17.674282] ffffa380007a0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.694213] ffffa380007a0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.714461] ffffa380007a0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.734386] ffffa380007a0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.754657] XFS (dm-1): metadata I/O error: block 0x509000 ("xlog_recover_do..(read#2)") error 117 numblks 32
[   17.783402] XFS (dm-1): log mount/recovery failed: error -117
[   17.797346] XFS (dm-1): log mount failed
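
A minimal sketch of how the check described above could be automated: scan the libguestfs debug log for XFS error lines from the appliance kernel before concluding that the API call itself is at fault. The function name `find_xfs_errors` and the keyword list are assumptions made for illustration, based only on the messages quoted in this comment.

```python
import re

# Kernel lines look like:
#   [   17.623837] XFS (dm-1): Metadata corruption detected at ...
XFS_LINE = re.compile(r"XFS \(([^)]+)\): (.*)")

# Hypothetical keyword list, taken from the messages quoted in this report.
KEYWORDS = ("corruption detected", "xfs_repair", "mount failed", "recovery failed")


def find_xfs_errors(log_text):
    """Return (device, message) pairs for XFS lines that indicate corruption."""
    hits = []
    for line in log_text.splitlines():
        match = XFS_LINE.search(line)
        if not match:
            continue
        device, message = match.group(1), match.group(2).strip()
        if any(keyword in message.lower() for keyword in KEYWORDS):
            hits.append((device, message))
    return hits
```

An empty result does not prove the filesystem is healthy; it only means none of these particular messages appeared in the log.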
Comment 2 Nadav Goldin 2017-06-27 12:46:51 EDT
Thanks, I missed that. I wonder how the corruption happened: the disk was freshly created at the beginning of the test, and two other tests that used the exact same method, one right before this exception was thrown and one right after it, passed (they only attempted to copy different paths). Looking at the libguestfs debug output of the tests that did pass, I don't see those XFS errors.

The host itself is a VM (if that matters).

good output before:
http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__nothing_here_dead_/
good output after:
http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__var_log_nested_nothing_dead_/
Comment 3 Richard W.M. Jones 2017-06-27 15:23:48 EDT
In general terms, provided there are no kernel or qemu bugs, libguestfs
guarantees that your changes are written to disk when you call
guestfs_shutdown on the handle, and it looks as if you are doing that.

Is it possible two handles are open read-write on the same disk?  This
would cause instant corruption.  (Or if something else not libguestfs
has the disk open for writes, eg a live VM).

I looked at the traces you supplied and I cannot see anything bad.  You're
using the API correctly as far as I can tell.
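
A minimal sketch of the shutdown-then-close sequence mentioned above, as it might look with the Python bindings. It assumes `g` is an already launched `guestfs.GuestFS` handle that was opened read-write; the helper name `finish_writes` is hypothetical.

```python
def finish_writes(g):
    """Flush pending changes with shutdown(), then release the handle.

    `g` is assumed to be a launched guestfs.GuestFS handle. shutdown() is
    the Python counterpart of guestfs_shutdown and raises RuntimeError if
    the appliance could not be shut down cleanly.
    """
    try:
        g.shutdown()
    finally:
        # Always release the handle, even if shutdown reported an error.
        g.close()
```

Calling `shutdown()` explicitly, rather than relying on `close()` alone, lets the caller see an error if the final flush failed.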
Comment 4 Nadav Goldin 2017-06-28 03:46:51 EDT
I'm not writing anything to the guest with guestfs. The code uses the 'mount_ro' call and should then attempt to copy a file from the guest to the host (though it failed on the mount_ro before that).
I don't think there are any other handles open; however, the VM is indeed live. That should be safe for 'mount_ro', no?


A few runs have passed since, and I haven't seen it happen again.
Comment 5 Pino Toscano 2017-06-28 04:06:44 EDT
The VM may not have flushed all its changes to the disk, so when the filesystem is mounted it is detected as corrupted, and the mount may change its metadata in order to mount it, even in read-only mode. If the disk is not opened in read-only mode, this means writing to the same disk used by the running VM -> big problems ahead.

What is the exact add_drive command you are using? Does it include readonly=True? Or are you using add_drive_ro, perhaps?
Comment 6 Nadav Goldin 2017-06-28 04:32:43 EDT
It is:
     add_drive_opts(disk_path, format='qcow2', readonly=1)
Comment 7 Richard W.M. Jones 2017-06-28 04:45:36 EDT
Right, as Pino says, it's safe to use mount_ro on a live VM, but
that doesn't mean you'll get consistent results.  The guest can
be in the middle of writing to the disk and you may see those
writes in any order or not at all, which can confuse the libguestfs
appliance kernel.

Also qcow2 makes this worse since you might not only be dealing
with partial guest changes, but partial qcow2 metadata changes.
Raw is a bit better.

You just have to close and retry if this happens.

If you really want a consistent view of a guest then you can get
qemu to export a point-in-time snapshot as NBD (even though the guest
is live and continues running) which libguestfs can read, but it
involves sending commands to the qemu monitor of the guest.
Comment 8 Nadav Goldin 2017-06-29 09:24:44 EDT
All right, I'm using retries for now, and that seems to work.
Thanks for the explanations.
Comment 9 Nadav Goldin 2017-07-05 03:58:49 EDT
I have set up retries, but the one time I caught the 'mount_ro' operation failing again, all the retries afterwards failed as well. This led me to suspect that I'm not retrying properly. So my question is:

Is retrying 'mount_ro' alone enough? Or do I need to shut down the client, open a new connection, and start again from 'add_drive_ro'?
Comment 10 Richard W.M. Jones 2017-07-05 05:15:33 EDT
No, it's definitely not enough.  You must close and reopen the handle.
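
A minimal sketch of a retry loop that follows this advice, assuming the Python bindings: every attempt starts from a brand-new handle instead of calling mount_ro again on the old one. The helper names `open_readonly` and `mount_ro_with_retry`, the attempt count and the delay are all hypothetical choices for illustration, not part of libguestfs.

```python
import time


def open_readonly(disk_path, fmt="qcow2"):
    """Create and launch a fresh handle with the disk attached read-only."""
    import guestfs  # imported lazily so the retry helper has no hard dependency

    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(disk_path, format=fmt, readonly=1)
    g.launch()
    return g


def mount_ro_with_retry(open_handle, device, mountpoint="/", attempts=3, delay=1.0):
    """Try mount_ro on a new handle per attempt; return the mounted handle.

    `open_handle` is a zero-argument callable that returns a launched handle.
    """
    last_error = None
    for attempt in range(attempts):
        g = open_handle()
        try:
            g.mount_ro(device, mountpoint)
            return g
        except RuntimeError as exc:
            last_error = exc
            # Discard the failed handle entirely; do not reuse it.
            g.close()
            if attempt + 1 < attempts:
                time.sleep(delay)
    raise last_error
```

A possible call, using the device from this report and a placeholder disk path, would be `mount_ro_with_retry(lambda: open_readonly(disk_path), "/dev/cl/root")`. The caller is responsible for closing the returned handle when done.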
