Description of problem: Using guestfs python bindings: r = libguestfsmod.mount_ro(self._o, mountable, mountpoint) Fails with: E RuntimeError: mount_ro: /dev/cl/root on / (options: 'ro'): mount: mount /dev/mapper/cl-root on /sysroot failed: Structure needs cleaning libguestfs version: 1.36.4 Host is Fedora 26 The guest image is CentOS 7.3 For now only noticed it failing once. The full logs(with libguestfs debug mode), are available in : http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__root_nowhere_dead_/ Any ideas?
I think your filesystem is corrupt. If you look in the long log just before the error you'll see lots of lines like: [ 17.623837] XFS (dm-1): Metadata corruption detected at xfs_inode_buf_verify+0x73/0xf0 [xfs], xfs_inode block 0x509000 [ 17.648406] XFS (dm-1): Unmount and run xfs_repair [ 17.659452] XFS (dm-1): First 64 bytes of corrupted metadata buffer: [ 17.674282] ffffa380007a0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 17.694213] ffffa380007a0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 17.714461] ffffa380007a0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 17.734386] ffffa380007a0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 17.754657] XFS (dm-1): metadata I/O error: block 0x509000 ("xlog_recover_do..(read#2)") error 117 numblks 32 [ 17.783402] XFS (dm-1): log mount/recovery failed: error -117 [ 17.797346] XFS (dm-1): log mount failed
Thanks, missed that. I Wonder how the corruption happened - the disk was freshly created in the beginning of the test, and right before this exception was thrown and right after it, two other tests that used the exact same method passed(only different paths were attempted to be copied). Looking at the libguestfs debug output on the tests that did pass, I don't see those XFS errors. The host itself is a VM(if that matters). good output before: http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__nothing_here_dead_/ good output after: http://jenkins.ovirt.org/job/lago_master_github_check-patch-fc26-x86_64/158/testReport/junit/tests.functional-sdk/test_sdk_sanity/test_extract_paths_ignore_nopath_vm_el7_3_base__var_log_nested_nothing_dead_/
In general terms, provided there are no kernel or qemu bugs, libguestfs guarantees that your changes are written to disk when you call guestfs_shutdown on the handle, and it looks as if you are doing that. Is it possible two handles are open read-write on the same disk? This would cause instant corruption. (Or if something else not libguestfs has the disk open for writes, eg a live VM). I looked at the traces you supplied and I cannot see anything bad. You're using the API correctly as far as I can tell.
I'm not writing anything to the guest with guestfs, it uses the 'mount_ro' call, and then should attempt to copy a file from the guest to the host(though it failed on the mount_ro before that). I don't think there are any other handles open, however the VM is live indeed. That should be safe for 'mount_ro', no? Few runs have passed and didn't see it happening again,
The VM could have not flushed all the changes to the disk, so when trying to mount the filesystem it is detected as corrupted, potentially changing its metadata to be able to mount it even in read-only mode; in case the disk not in read-only mode, this means writing to the same disk used by the running VM -> big problems ahead. What is the exact add_drive command you are using? Does it include readonly=True? Or are using add_drive_ro, perhaps?
It is : add_drive_opts(disk_path, format='qcow2', readonly=1)
Right, as Pino says it's safe to use mount_ro on a live VM, but that doesn't mean you'll get consistent results. The guest can be in the middle of writing to the disk and you may see those writes in any order or not at all which can confuse the libguestfs appliance kernel. Also qcow2 makes this worse since you might not only be dealing with partial guest changes, but partial qcow2 metadata changes. Raw is a bit better. You just have to close and retry if this happens. If you really want a consistent view of a guest then you can get qemu to export a point-in-time snapshot as NBD (even though the guest is live and continues running) which libguestfs can read, but it involves sending commands to the qemu monitor of the guest.
All right, using re-tries for now, seems to work. Thanks for the explanations.
I have set retries - but at least for the one time I caught 'mount_ro' operation failing again, all retries afterwards failed as well. This led me to suspect, I'm not retrying properly. So my question is: Is retrying over 'mount_ro' enough? Or I need to shut down the client, open a new connection, and start again from 'add_drive_ro'?
No it's definitely not enough. You must close and reopen the handle.
Closing as this is a corrupt filesystem, not a bug in libguestfs.