Bug 576879

Summary: libguestfs protocol loses synchronization if you 'upload' before mounting disks
Product: [Community] Virtualization Tools
Reporter: Richard W.M. Jones <rjones>
Component: libguestfs
Assignee: Richard W.M. Jones <rjones>
Status: CLOSED UPSTREAM
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: unspecified
CC: mbooth, virt-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 627835 (view as bug list) Environment:
Last Closed: 2011-07-14 19:04:46 UTC Type: ---
Bug Depends On:    
Bug Blocks: 584228, 591155, 591250, 627835    

Description Richard W.M. Jones 2010-03-25 14:49:07 UTC
[Originally reported by Seth Vidal]

Description of problem:

guestfish <<EOF
> add f12-minimal.img
> run
> upload /var/tmp/guestfish-1.0.85-1.el5.7.x86_64.rpm /home/vmbuild/guestfish-1.0.85-1.el5.7.x86_64.rpm
> EOF
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x64 from daemon, expected 0xffffeeee
 
libguestfs: error: message length (536933877) > maximum possible size (4194304)
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x1 from daemon, expected 0xffffeeee

With another version of libguestfs:

$ cat test
#!/bin/sh -

guestfish -x <<EOF
add f12.img
run
upload a_big_file /
EOF
$ sh test
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x64 from daemon, expected 0xffffeeee

libguestfs: error: message length (536933877) > maximum possible size (4194304)
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x1 from daemon, expected 0xffffeeee

Version-Release number of selected component (if applicable):

libguestfs 1.0.87

How reproducible:

Always.

The problem here is we're uploading without first mounting
any disk.  Upload is failing and probably reporting an error,
but then the protocol loses synchronization and it's game over.

Comment 2 Richard W.M. Jones 2010-04-18 09:29:19 UTC
Setting back to ASSIGNED, since we're still not quite there
with this patch.

><fs> sparse /tmp/test.img 10M
><fs> run
><fs> tar-in /tmp/foobar /blah
libguestfs: error: open: /tmp/foobar: No such file or directory
><fs> list-devices 
libguestfs: error: unexpected procedure number (69/7)
><fs> list-devices 
/dev/vda

The first error from list-devices shouldn't happen.

Comment 3 Richard W.M. Jones 2010-04-18 09:30:09 UTC
Another example:

><fs> tar-in /tmp/foobar /blah
libguestfs: error: open: /tmp/foobar: No such file or directory
><fs> ping-daemon
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x64 from daemon, expected 0xffffeeee
><fs> ping-daemon
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x2000f5f5 from daemon, expected 0xffffeeee
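The numbers in these errors are related. The bogus "message length (536933877)" from the original description is the same 32-bit value as the 0x2000f5f5 word read here in place of 0xffffeeee, which suggests the library is decoding a stray word of some other message as a length prefix. The short check below is only a sketch: the constants are copied from the error messages above, and the assumption that the length prefix is a 4-byte big-endian unsigned integer is mine, not something stated in this report.

```python
import struct

# Values copied from the error messages quoted in this bug report.
MAX_MESSAGE_SIZE = 4194304    # "maximum possible size"
reported_length = 536933877   # "message length"
stray_word = 0x2000F5F5       # word read where 0xffffeeee was expected

# Assumption: the length prefix is a 4-byte big-endian unsigned integer.
wire_bytes = struct.pack(">I", stray_word)
(decoded_length,) = struct.unpack(">I", wire_bytes)

print(hex(reported_length))               # same value, shown in hex
print(decoded_length == reported_length)  # stray word read as a length
print(decoded_length > MAX_MESSAGE_SIZE)  # hence the "maximum possible size" error
```

If the two sides were still in step, the library would never try to interpret that word as a length at all.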

Comment 4 Richard W.M. Jones 2010-05-12 18:06:12 UTC
Here's a one line reproducer for the latest libguestfs:

$ ./fish/guestfish -N disk -- -tar-in /dev/nofile /blah : ping-daemon
libguestfs: error: open: /dev/nofile: No such file or directory
libguestfs: error: unexpected procedure number (69/92)
libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x18 from daemon, expected 0xffffeeee

libguestfs: error: check_for_daemon_cancellation_or_eof: read 0x2000f5f5 from daemon, expected 0xffffeeee

Comment 5 Richard W.M. Jones 2010-05-12 18:22:26 UTC
OK I understand what's going on here.  Both ends simultaneously send
cancel messages:

library                 daemon
  |
  V
 sends RPC message -------+
  |                       |
  |                 receives RPC message
  |                       |
  V                       V
 opens file,        filesystem not mounted!
 error: not found!        |
  |                       |
  V                       V
 sends cancel       sends cancel
  +------->      <--------+
            !!!!
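
A toy model of the diagram above, to show why an unread message from the daemon poisons the next command. This is a sketch under stated assumptions, not libguestfs code: ToyConnection, its inbox and the drain_on_local_error switch are hypothetical names, the procedure numbers 69 and 7 are taken from the log in comment 2 (mapping them to tar-in and list-devices is an inference), and draining is only one plausible remedy, not necessarily what the upstream patch does.

```python
from collections import deque

# Procedure numbers as they appear in the logs; which command each one
# belongs to is inferred from the commands that were run.
TAR_IN = 69
LIST_DEVICES = 7


class ToyConnection:
    """Hypothetical stand-in for the library/daemon socket."""

    def __init__(self, drain_on_local_error):
        self.inbox = deque()  # messages sent by the daemon, not yet read
        self.drain_on_local_error = drain_on_local_error

    def _daemon(self, proc):
        # The toy daemon always answers the request it received.
        self.inbox.append(proc)

    def call(self, proc, local_error=False):
        self._daemon(proc)
        if local_error:
            # The library fails on its own side (e.g. cannot open the file).
            if self.drain_on_local_error:
                self.inbox.clear()  # consume whatever the daemon sent
            return "error: open failed"
        got = self.inbox.popleft()
        if got != proc:
            return f"error: unexpected procedure number ({got}/{proc})"
        return "ok"


def session(drain):
    conn = ToyConnection(drain)
    return [
        conn.call(TAR_IN, local_error=True),
        conn.call(LIST_DEVICES),
        conn.call(LIST_DEVICES),
    ]


buggy = session(drain=False)
fixed = session(drain=True)
print(buggy)
print(fixed)
```

Without draining, the second command reads the leftover answer to the first one and fails, and the third command succeeds, which is the same shape as the session in comment 2. With draining, both list-devices calls succeed.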

Comment 6 Richard W.M. Jones 2010-05-12 19:00:36 UTC
Patch posted upstream to fix the issue described in comment 5:
https://www.redhat.com/archives/libguestfs/2010-May/msg00061.html

Comment 7 Richard W.M. Jones 2010-10-27 12:23:10 UTC
Setting back to ASSIGNED, since the regression test for
this bug has started to hang.

See also:
http://git.annexia.org/?p=libguestfs.git;a=commitdiff;h=fb998000e60b32219c2bf839044cff59f499dff1

Comment 8 Richard W.M. Jones 2011-03-18 13:34:05 UTC
[Copy of a note sent to the mailing list]

I just pushed a commit which reenables two tests for [this bug]:

http://git.annexia.org/?p=libguestfs.git;a=commitdiff;h=dc8e4b057ecd3984d7c27c8ece54048b6a06d662

This is a really long-standing bug which we thought we'd fixed, but
which then turned up again.  It currently is *not* failing on my
machine.  I added some clearer debug messages to the code paths involved.

It seems to be highly timing related and I doubt that it is fully
squashed, so it is quite probable that these tests will fail for
somebody somewhere.  If you can get it to fail with LIBGUESTFS_DEBUG=1
then please post the full log into the bug report.

Comment 9 Richard W.M. Jones 2011-03-18 16:41:55 UTC
I think I've nailed this one finally.  Posted a patch
here, still testing it:

https://www.redhat.com/archives/libguestfs/2011-March/msg00090.html

Comment 11 Richard W.M. Jones 2011-03-18 20:03:57 UTC
Fixes included in 1.9.12.

Comment 12 Richard W.M. Jones 2011-04-12 17:35:30 UTC
*** Bug 624035 has been marked as a duplicate of this bug. ***

Comment 13 Richard W.M. Jones 2011-07-14 19:04:46 UTC
Haven't seen this for quite a while.  FIXED!