1465526 – RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/mapper/cl-root on /sysroot failed: Structure needs cleaning

Bug 1465526 - RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/mapper/cl-root on /sysroot failed: Structure needs cleaning

Summary: RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/...

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Virtualization Tools
Classification:	Community
Component:	libguestfs
Sub Component:
Version:	unspecified
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Richard W.M. Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-06-27 14:50 UTC by Nadav Goldin
Modified:	2019-02-19 02:39 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-02-19 02:39:36 UTC
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Nadav Goldin 2017-06-27 14:50:30 UTC

Description of problem:
Using guestfs python bindings:

   r = libguestfsmod.mount_ro(self._o, mountable, mountpoint)  

Fails with:

E       RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount 
/dev/mapper/cl-root on /sysroot failed: Structure needs cleaning

libguestfs version: 1.36.4 
Host is Fedora 26
The guest image is CentOS 7.3

For now only noticed it failing once.

The full logs(with libguestfs debug mode), are available in :

http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__root_nowhere_dead_/


Any ideas?

Comment 1 Richard W.M. Jones 2017-06-27 15:07:11 UTC

I think your filesystem is corrupt.  If you look in the long log
just before the error you'll see lots of lines like:

[   17.623837] XFS (dm-1): Metadata corruption detected at xfs_inode_buf_verify+0x73/0xf0 [xfs], xfs_inode block 0x509000
[   17.648406] XFS (dm-1): Unmount and run xfs_repair
[   17.659452] XFS (dm-1): First 64 bytes of corrupted metadata buffer:
[   17.674282] ffffa380007a0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.694213] ffffa380007a0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.714461] ffffa380007a0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.734386] ffffa380007a0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   17.754657] XFS (dm-1): metadata I/O error: block 0x509000 ("xlog_recover_do..(read#2)") error 117 numblks 32
[   17.783402] XFS (dm-1): log mount/recovery failed: error -117
[   17.797346] XFS (dm-1): log mount failed

Comment 2 Nadav Goldin 2017-06-27 16:46:51 UTC

Thanks, missed that. I Wonder how the corruption happened - the disk was freshly created in the beginning of the test, and right before this exception was thrown and right after it, two other tests that used the exact same method passed(only different paths were attempted to be copied). Looking at the libguestfs debug output on the tests that did pass, I don't see those XFS errors. 

The host itself is a VM(if that matters).

good output before:
http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__nothing_here_dead_/
good output after:
http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__var_log_nested_nothing_dead_/

Comment 3 Richard W.M. Jones 2017-06-27 19:23:48 UTC

In general terms, provided there are no kernel or qemu bugs, libguestfs
guarantees that your changes are written to disk when you call
guestfs_shutdown on the handle, and it looks as if you are doing that.

Is it possible two handles are open read-write on the same disk?  This
would cause instant corruption.  (Or if something else not libguestfs
has the disk open for writes, eg a live VM).

I looked at the traces you supplied and I cannot see anything bad.  You're
using the API correctly as far as I can tell.

Comment 4 Nadav Goldin 2017-06-28 07:46:51 UTC

I'm not writing anything to the guest with guestfs, it uses the 'mount_ro' call, and then should attempt to copy a file from the guest to the host(though it failed on the mount_ro before that).
I don't think there are any other handles open, however the VM is live indeed. That should be safe for 'mount_ro', no?


Few runs have passed and didn't see it happening again,

Comment 5 Pino Toscano 2017-06-28 08:06:44 UTC

The VM could have not flushed all the changes to the disk, so when trying to mount the filesystem it is detected as corrupted, potentially changing its metadata to be able to mount it even in read-only mode; in case the disk not in read-only mode, this means writing to the same disk used by the running VM -> big problems ahead.

What is the exact add_drive command you are using? Does it include readonly=True? Or are using add_drive_ro, perhaps?

Comment 6 Nadav Goldin 2017-06-28 08:32:43 UTC

It is : 
     add_drive_opts(disk_path, format='qcow2', readonly=1)

Comment 7 Richard W.M. Jones 2017-06-28 08:45:36 UTC

Right, as Pino says it's safe to use mount_ro on a live VM, but
that doesn't mean you'll get consistent results.  The guest can
be in the middle of writing to the disk and you may see those
writes in any order or not at all which can confuse the libguestfs
appliance kernel.

Also qcow2 makes this worse since you might not only be dealing
with partial guest changes, but partial qcow2 metadata changes.
Raw is a bit better.

You just have to close and retry if this happens.

If you really want a consistent view of a guest then you can get
qemu to export a point-in-time snapshot as NBD (even though the guest
is live and continues running) which libguestfs can read, but it
involves sending commands to the qemu monitor of the guest.

Comment 8 Nadav Goldin 2017-06-29 13:24:44 UTC

All right, using re-tries for now, seems to work.
Thanks for the explanations.

Comment 9 Nadav Goldin 2017-07-05 07:58:49 UTC

I have set retries - but at least for the one time I caught 'mount_ro' operation failing again, all retries afterwards failed as well. This led me to suspect, I'm not retrying properly. So my question is:

Is retrying over 'mount_ro' enough? Or I need to shut down the client, open a new connection, and start again from 'add_drive_ro'?

Comment 10 Richard W.M. Jones 2017-07-05 09:15:33 UTC

No it's definitely not enough.  You must close and reopen the handle.

Comment 11 Richard W.M. Jones 2019-02-19 02:39:36 UTC

Closing as this is a corrupt filesystem, not a bug in libguestfs.

Note You need to log in before you can comment on or make changes to this bug.