Bug 1158034 - Failure to save a guest to a pre-created save file located on a root_squash NFS server
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.1
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Assignee: John Ferlan
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1158036
 
Reported: 2014-10-28 11:40 UTC by zhenfeng wang
Modified: 2015-11-19 05:54 UTC

Fixed In Version: libvirt-1.2.13-1.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1158036 (view as bug list)
Environment:
Last Closed: 2015-11-19 05:54:42 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:2202 0 normal SHIPPED_LIVE libvirt bug fix and enhancement update 2015-11-19 08:17:58 UTC

Description zhenfeng wang 2014-10-28 11:40:54 UTC
Description of problem:
Saving a guest to a pre-created save file located on a root_squash NFS server fails.

Version-Release number of selected component (if applicable):
libvirt-1.2.8-5.el7.x86_64
qemu-kvm-rhev-2.1.2-3.el7.x86_64
kernel-3.10.0-191.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare an NFS server exported with root_squash
# cat /etc/exports
/var/tmp/nfs *(rw,root_squash)

# ll /var/tmp/nfs -d
drwxrwxrwx. 2 root root 4096 Sep 29 16:08 /var/tmp/nfs

2. Pre-create the guest's save file and place a guest image on the NFS server
# ll /var/tmp/nfs/
total 5077000
-rw-r--r--. 1 root root 4984995840 Oct 28 12:23 rhel7.img
-rw-r--r--. 1 root root  213851864 Oct 28 12:23 rhel7.save

3. Change the group of the guest image and the pre-created save file to qemu, so that
users in the qemu group can read and write both files
#chmod 777 /var/tmp/nfs/rhel7.img
#chmod 777 /var/tmp/nfs/rhel7.save

#chown root:qemu /var/tmp/nfs/rhel7.img
#chown root:qemu /var/tmp/nfs/rhel7.save

# ll /var/tmp/nfs/
total 5077000
-rwxrwxrwx. 1 root qemu 4984995840 Oct 28 12:23 rhel7.img
-rwxrwxrwx. 1 root qemu  213851864 Oct 28 12:23 rhel7.save

4. Keep the default qemu.conf configuration on the NFS client
#vim /etc/libvirt/qemu.conf
user = "qemu"
group= "qemu"
dynamic_ownership = 1

5. Create a netfs pool on the NFS client
#cat nfs.xml
<pool type='netfs'>
  <name>nfs</name>
  <uuid>08380649-cf25-4af7-b816-6f8494003f69</uuid>
  <capacity unit='bytes'>206423719936</capacity>
  <allocation unit='bytes'>121639010304</allocation>
  <available unit='bytes'>84784709632</available>
  <source>
    <host name='$nfs_server_addr'/>
    <dir path='/var/tmp/nfs'/>
    <format type='nfs'/>
  </source>
  <target>
    <path>/tmp/pl</path>
    <permissions>
      <mode>0700</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>

#mkdir /tmp/pl
#virsh pool-define nfs.xml
#virsh pool-start nfs

5. Start a normal guest whose image is located on the NFS server
#virsh dumpxml rhel7
--
<disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/tmp/pl/rhel7.img'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
--

#virsh start rhel7
Domain rhel7 started

6. Saving the guest to the pre-created save file fails with the following error:
# virsh save rhel7 /tmp/pl/rhel7.save 
error: Failed to save domain rhel7 to /tmp/pl/rhel7.save
error: Error from child process creating '/tmp/pl/rhel7.save': Transport endpoint is not connected

7. The same issue is also hit on a RHEL 6.6 host

Actual results:
Saving the guest fails when both the guest image and the pre-created save file are located in the NFS shared directory.

Expected results:
The guest should be saved successfully.

Additional info:

Comment 3 John Ferlan 2015-01-23 00:48:35 UTC
Just to be sure we are "on the same page"...  After creating the /var/tmp/nfs, modifying /etc/exports, and 'service nfs restart', I have:

# exportfs -v
/home/nfs_pool/nfs-export
		localhost(rw,wdelay,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
/var/tmp/nfs  	<world>(rw,wdelay,root_squash,no_subtree_check,sec=sys,rw,secure,root_squash,no_all_squash)
#

is that what you expect?

Step 2 takes a leap of faith... 

What did you do to "pre-create" the guest's save file?  Does that mean you saved it somewhere else first and then copied it over to that NFS directory?

Step 3 - why root:qemu and not qemu:qemu from qemu.conf?

Step 5 #1 - what does your /tmp/pl look like protection wise when you create/start the pool? It seems it would be <mode>0700</mode>. Your owner/group are 0 (root) which doesn't quite match up with the root:qemu for the NFS server directory.

Step 5 #2 - is there really such thing as a normal guest?  It would have been nicer to have the guest XML included. I can probably just take one of my guests and use it, but I also want to make sure there's nothing else odd...

So when you get here it would seem (without me trying it yet), that there's a protection disconnect between your nfs client (eg, pool, /tmp/pl) and server (eg /var/tmp/nfs).

Comment 4 John Ferlan 2015-01-23 20:55:49 UTC
Because I don't have enough space on /var/tmp - I used a different location... and get slightly different results, but still a failure...  After some extra debugging - I still think there may be a configuration issue w/r/t qemu:qemu and/or root:qemu for the pl directory, but I have a bit more debugging to do before being positive.  Additionally, I can see/get an error of EACCES when trying to open() the save file; however, that error isn't propagated back properly so I'd like to get the correct error returned. Ironically the error that is returned is ENOTCONN which is 107 and happens to be the uid/gid of qemu. Anyway, the following are my steps...

Started one of my guests (rhel70) - it's currently using /home/vm-images/rhel70 for the guest image... It also has a second data disk (for a different test). Then 'virsh save rhel70 /home/vm-images/rhel70.save'.

# pwd
/home/bz1158034
# mkdir nfs
# chmod 777 nfs
# cp /home/vm-images/rhel70* nfs/
# chown root:qemu nfs/rhel*
# ll nfs
total 26578732
-rwxrwxrwx. 1 root qemu 26843545600 Jan 23 11:48 rhel70
-rwxrwxrwx. 1 root qemu    52428800 Jan 23 11:48 rhel70-data-50M
-rwxrwxrwx. 1 root qemu   320635821 Jan 23 11:48 rhel70.save
# mkdir pl
# ls -al
total 20
drwxr-xr-x.  4 root root 4096 Jan 23 11:42 .
drwxr-xr-x. 25 root root 4096 Jan 23 12:14 ..
drwxrwxrwx.  2 root root 4096 Jan 23 11:48 nfs
-rw-r--r--.  1 root root  328 Jan 23 11:42 nfs.xml
drwxr-xr-x.  2 root root 4096 Jan 23 11:41 pl
#

vim /etc/exports to add /home/bz1158034/nfs *(rw,root_squash)

# cat /etc/exports
/home/nfs_pool/nfs-export localhost(rw,no_root_squash)
/home/bz1158034/nfs *(rw,root_squash)
# service nfs restart

# exportfs -v
/home/nfs_pool/nfs-export
		localhost(rw,wdelay,no_root_squash,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
/home/bz1158034/nfs
		<world>(rw,wdelay,root_squash,no_subtree_check,sec=sys,rw,secure,root_squash,no_all_squash)

Modify my default/saved qemu.conf:

# diff /etc/libvirt/qemu.conf /etc/libvirt/qemu.conf.save
230d229
< user = "qemu"
235d233
< group = "qemu"
241d238
< dynamic_ownership = 1

Restart libvirtd

# service libvirtd restart
Restarting libvirtd (via systemctl):                       [  OK  ]
#

Modified the nfs.xml file to use my directories

# cat nfs.xml
<pool type='netfs'>
  <name>nfs</name>
  <source>
    <host name='localhost'/>
    <dir path='/home/bz1158034/nfs'/>
    <format type='nfs'/>
  </source>
  <target>
    <path>/home/bz1158034/pl</path>
    <permissions>
      <mode>0700</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>

# virsh pool-define nfs.xml
Pool nfs defined from nfs.xml

[root@localhost bz1158034]# virsh pool-start nfs
Pool nfs started

[root@localhost bz1158034]# virsh vol-list nfs
 Name                 Path                                    
------------------------------------------------------------------------------
 rhel70               /home/bz1158034/pl/rhel70               
 rhel70-data-50M      /home/bz1158034/pl/rhel70-data-50M      
 rhel70.save          /home/bz1158034/pl/rhel70.save          

# ll nfs
total 26578732
-rwxrwxrwx. 1 root qemu 26843545600 Jan 23 11:48 rhel70
-rwxrwxrwx. 1 root qemu    52428800 Jan 23 11:48 rhel70-data-50M
-rwxrwxrwx. 1 root qemu   320635821 Jan 23 11:48 rhel70.save
# ll pl
total 26578732
-rwxrwxrwx. 1 root qemu 26843545600 Jan 23 11:48 rhel70
-rwxrwxrwx. 1 root qemu    52428800 Jan 23 11:48 rhel70-data-50M
-rwxrwxrwx. 1 root qemu   320635821 Jan 23 11:48 rhel70.save
#

Edit my guest config to use the nfs client area:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/bz1158034/pl/rhel70'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/bz1158034/pl/rhel70-data-50M'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>


On a whim:
# virsh restore /home/bz1158034/pl/rhel70.save
Domain restored from /home/bz1158034/pl/rhel70.save

# virsh destroy rhel70
Domain rhel70 destroyed

# virsh start rhel70
Domain rhel70 started

# virsh save rhel70 /home/bz1158034/pl/rhel70.save
error: Failed to save domain rhel70 to /home/bz1158034/pl/rhel70.save
error: internal error: unexpected async job 3

# virsh list
 Id    Name                           State
----------------------------------------------------
 11    rhel70                         paused

# virsh save rhel70 /home/bz1158034/pl/rhel70.save2
error: Failed to save domain rhel70 to /home/bz1158034/pl/rhel70.save2
error: Error from child process creating '/home/bz1158034/pl/rhel70.save2': Transport endpoint is not connected


Destroying rhel70, restarting it, and retrying the same save2 command gets the same error as the initial save attempt.

.....

Going with debug mode and looking at the logs finds (grep rhel70.save libvirtd.log):

2015-01-23 18:03:50.043+0000: 3159: debug : virStorageFileGetMetadataInternal:778 : path=/home/vm-images/rhel70.save, buf=0x7f936824d820, len=33280, meta->format=-1
2015-01-23 18:03:50.043+0000: 3159: debug : virStorageFileProbeFormatFromBuf:697 : path=/home/vm-images/rhel70.save, buf=0x7f936824d820, buflen=33280
2015-01-23 18:03:50.142+0000: 3159: debug : virStorageFileGetMetadataInternal:778 : path=/home/bz1158034/pl/rhel70.save, buf=0x7f936825ea20, len=33280, meta->format=-1
2015-01-23 18:03:50.142+0000: 3159: debug : virStorageFileProbeFormatFromBuf:697 : path=/home/bz1158034/pl/rhel70.save, buf=0x7f936825ea20, buflen=33280
2015-01-23 18:04:16.522+0000: 3153: debug : virDomainSave:820 : dom=0x7f935c001590, (VM: name=rhel70, uuid=d83e3364-6520-4c40-bfff-5add2d010107), to=/home/bz1158034/pl/rhel70.save
2015-01-23 18:04:16.702+0000: 3153: debug : virFileIsSharedFSType:2949 : Check if path /home/bz1158034/pl/rhel70.save with FS magic 26985 is shared
2015-01-23 18:04:16.853+0000: 3153: error : virFileOpenForceOwnerMode:2010 : cannot chown '/home/bz1158034/pl/rhel70.save' to (0, 0): Operation not permitted
2015-01-23 18:04:16.860+0000: 3153: error : qemuOpenFileAs:2963 : Error from child process creating '/home/bz1158034/pl/rhel70.save': Transport endpoint is not connected
2015-01-23 18:06:41.611+0000: 3152: debug : virDomainSave:820 : dom=0x7f9358000a00, (VM: name=rhel70, uuid=d83e3364-6520-4c40-bfff-5add2d010107), to=/home/bz1158034/pl/rhel70.save2
2015-01-23 18:06:41.645+0000: 3152: debug : virFileIsSharedFSType:2949 : Check if path /home/bz1158034/pl/rhel70.save2 with FS magic 26985 is shared
2015-01-23 18:06:41.745+0000: 3152: error : virFileOpenForceOwnerMode:2010 : cannot chown '/home/bz1158034/pl/rhel70.save2' to (0, 0): Operation not permitted
2015-01-23 18:06:41.750+0000: 3152: error : qemuOpenFileAs:2963 : Error from child process creating '/home/bz1158034/pl/rhel70.save2': Transport endpoint is not connected

Comment 5 zhenfeng wang 2015-01-26 08:57:38 UTC
(In reply to John Ferlan from comment #3)
> Just to be sure we are "on the same page"...  After creating the
> /var/tmp/nfs, modifying /etc/exports, and 'service nfs restart', I have:
> 
> # exportfs -v
> /home/nfs_pool/nfs-export
> 	
> localhost(rw,wdelay,no_root_squash,no_subtree_check,sec=sys,rw,secure,
> no_root_squash,no_all_squash)
> /var/tmp/nfs  
> <world>(rw,wdelay,root_squash,no_subtree_check,sec=sys,rw,secure,root_squash,
> no_all_squash)
> #
> 
> is that what you expect?
Yes, your NFS configuration is the same as mine.
> 
> Step 2 takes a leap of faith... 
> 
> What did you do to "pre-create" the guest's save file?  Does that mean you
> saved it somewhere else first and then copied it over to that NFS directory?
No, the save file is just an ordinary file; you can even create it with the touch command:
#touch rhel7.save

> 
> Step 3 - why root:qemu and not qemu:qemu from qemu.conf?
Because I wanted to check whether group permissions work for the guest, I set an owner different from the save file's user. BTW, qemu group permissions should be enough to operate on the guest, since I could start the guest successfully with that permission.



> 
> Step 5 #1 - what does your /tmp/pl look like protection wise when you
> create/start the pool? It seems it would be <mode>0700</mode>. Your
> owner/group are 0 (root) which doesn't quite match up with the root:qemu for
> the NFS server directory.
The pool starts successfully:
1.Before start the pool
# ll /tmp/pl/ -d
drwxr-xr-x. 2 root root 6 Jan 26 15:05 /tmp/pl/

2.After start the pool
# virsh pool-start nfs
Pool nfs started

# ll /tmp/pl/ -d
drwxrwxrwx. 2 root root 4096 Jan 26 15:56 /tmp/pl/


> 
> Step 5 #2 - is there really such thing as a normal guest?  It would have
> been nicer to have the guest XML included. I can probably just take one of
> my guests and use it, but I also want to make sure there's nothing else
> odd...
> 
> So when you get here it would seem (without me trying it yet), that there's
> a protection disconnect between your nfs client (eg, pool, /tmp/pl) and
> server (eg /var/tmp/nfs).

Comment 6 John Ferlan 2015-01-26 17:25:28 UTC
I'm still not quite convinced your assertion about having an NFS server with "root:qemu" files and expecting things to just work when you attempt to create a "qemu:qemu" file will hold up for "all" instances.

First off, a bit of history, since results differ depending on what's running. There have been many changes since the RHEL6 version of libvirt (0.10.2-*) that make the output from the RHEL7 version of libvirt (1.2.8-*) appear different, but I believe the underlying issue is the same. You will note in bug 1158036, which was cloned from this case, that the save command on a RHEL6 system results in the following error:

# virsh save rhel7 /tmp/pl/rhel7.save 
error: Failed to save domain rhel7 to /tmp/pl/rhel7.save
error: Error from child process creating '/tmp/pl/rhel7.save': Operation not permitted

#

Whereas this case run on a RHEL7 system has the error:

# virsh save rhel7 /tmp/pl/rhel7.save 
error: Failed to save domain rhel7 to /tmp/pl/rhel7.save
error: Error from child process creating '/tmp/pl/rhel7.save': Transport endpoint is not connected

#

and my upstream testing using F21 has the error:

# virsh save rhel70 /home/bz1158034/pl/rhel70.save
error: Failed to save domain rhel70 to /home/bz1158034/pl/rhel70.save
error: internal error: unexpected async job 3

#

For RHEL6, the 'EPERM' error is returned, while for RHEL7 an ENOTCONN error occurs during the save operation. For my upstream testing, the error is different I believe due to changes from commit id '540c339a2', but I didn't dig into the difference too much beyond that.

Although the user displayed error is slightly different, the underlying cause is the same - it's a failure to open the save file. Why RHEL7 reports ENOTCONN instead perhaps has to do with changes within the calling stack with the gnulib code for a 'recvfd' call during virFileOpenForked. I haven't run a debug test on a RHEL6 system yet, but hopefully will to determine if my theory is right. The upstream error is also ENOTCONN, but it's obscured by a subsequent unsuccessful restart of CPU's and using the error from that failed restart rather than the original error. Adding code to save/restore the error in my testing proves that out - something that I will update upstream, but is unnecessary downstream for RHEL6.

In any case, in some ways this issue still feels partially like a configuration-expectation error, especially with root_squash (although I don't claim to have much in-depth knowledge of expectations when configured this way). However, I can also make the argument that how libvirt goes about saving off the file using a specific user/group in the NFS pool needs a slight algorithm update.

What libvirt does during the save operation is attempt to save the file as root:root and then change the user/group of the file to what is configured in qemu.conf. This is "known to fail" for an NFS-served, root-squashed, access-restricted target location. So libvirt will then call virFileOpenForked in order to "... fork, then the child does setuid+setgid to given uid:gid and attempts to open the file, while the parent just calls recvfd to get the open fd back from the child...".  In our initial attempt to save, open, and change owner/group on the file, the code creates a root:root file, but does not remove it prior to calling the second path, which on open (write/create) as "qemu:qemu" will fail EPERM because the file already exists as root:root. For upstream, the failure to open causes the child to fail and close its pipe back to the parent. When the parent calls recvfd(), it gets ENOTCONN and falls into a failure path which doesn't attempt to get the actual error from the child (why that is, I'm not quite sure yet). I have not tried this on RHEL6, but for now I am going to assume RHEL6 would get an EPERM from that recvfd() and hence would skip the check where RHEL7 returns without getting the child's status. During debugging using upstream/f21 I was able to ascertain that the child failure is on an open() with EPERM as the failure reason. So that's why I feel the underlying problem is the same.
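The fork-then-pass-the-fd-back dance described above can be sketched in a few dozen lines of C. This is a simplified, hypothetical stand-in for libvirt's virFileOpenForked (the helper names send_fd, recv_fd, and open_as_child are mine, not libvirt's): the child optionally drops privileges, opens the file, and hands the open fd back over a UNIX socketpair with SCM_RIGHTS. If the child's open() fails and it exits without sending anything, the parent's recvmsg() just sees a dead socket rather than the child's errno, which is why the real error has to be recovered from the child's exit status (the step the buggy path skipped).

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>

/* Send one open fd over a UNIX socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof(fd));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd; returns -1 if the peer died before sending one. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;          /* child exited first: open() must have failed */
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (!cm || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof(fd));
    return fd;
}

/* Fork; the child (dropping privileges when run as root) creates the file
 * and passes the fd to the parent. Returns the open fd, or -errno taken
 * from the child's exit status - NOT from the socket error. */
static int open_as_child(const char *path, uid_t uid, gid_t gid)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -errno;
    pid_t pid = fork();
    if (pid < 0)
        return -errno;
    if (pid == 0) {
        close(sv[0]);
        /* Only real root can switch identity; best effort in this demo. */
        if (getuid() == 0 && (setgid(gid) < 0 || setuid(uid) < 0))
            _exit(errno);
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC,
                      S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
        if (fd < 0 || send_fd(sv[1], fd) < 0)
            _exit(errno ? errno : EIO);   /* exit status carries the errno */
        _exit(0);
    }
    close(sv[1]);
    int fd = recv_fd(sv[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    close(sv[0]);
    if (fd < 0)
        return WIFEXITED(status) ? -WEXITSTATUS(status) : -EIO;
    return fd;
}
```

When run unprivileged the child cannot actually setuid, but the fd-passing mechanics and the error-recovery path are the same ones at issue here.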

So taking this a step further - I added some code to handle the condition where the file exists after the first failed create and save. The change will unlink/delete the initially created save file and then when going through the virFileOpenForked path will successfully create the save file, but there's an issue.  Immediately after the successful save, the pool has the following files:

# ll /home/bz1158034/pl/
total 26475540
-rwxrwxrwx. 1 root qemu 26843545600 Jan 26 07:00 rhel70
-rwxrwxrwx. 1 root qemu    52428800 Jan 23 11:48 rhel70-data-50M
-rw-r-----. 1 qemu qemu   214968556 Jan 26 07:00 rhel70.save
#

So it appears things work, right?  Well, not so fast. A subsequent libvirtd restart failed to start the nfs pool... Hmm... So, rewinding a bit to reset the test, saving the file, and then just doing a pool-refresh yields the following error:

# virsh pool-refresh nfs
error: Failed to refresh pool nfs
error: cannot open volume '/home/bz1158034/pl/rhel70.save': Permission denied

#

And the pool is destroyed/stopped... A subsequent restart attempt yields my now expected failure:

# virsh pool-start nfs
error: Failed to start pool nfs
error: cannot open volume '/home/bz1158034/pl/rhel70.save': Permission denied

#

The following does work (although perhaps expected):

# virsh restore /home/bz1158034/pl/rhel70.save
Domain restored from /home/bz1158034/pl/rhel70.save

#

If I modify the permissions for the rhel70.save file, I can start the pool again:

# chmod 644 /home/bz1158034/nfs/rhel70.save
# ll /home/bz1158034/nfs
total 26273460
-rwxrwxrwx. 1 root qemu 26843545600 Jan 26 10:02 rhel70
-rwxrwxrwx. 1 root qemu    52428800 Jan 23 11:48 rhel70-data-50M
-rw-r--r--. 1 qemu qemu     8043956 Jan 26 10:09 rhel70.save
# virsh pool-start nfs
Pool nfs started

#

Looking a bit more into how the file is created and how initial protections are set, it seems some defaults are being taken, and for some reason the code path doesn't "decide" to also force a chmod on the file, even though it passes "S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP" as the expected mode in the open-as-user-and-set-mode path, as opposed to just "S_IRUSR | S_IWUSR" in the open-as-root path.  If I add "S_IROTH" and force the path to set the mode bits, then everything works as expected. However, whether that's "expected" for this path I'm not 100% sure, but will find out.
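The mode-bits behavior observed here is standard open(2) semantics and is easy to demonstrate: the mode argument to open() is applied only when the file is actually created, so reopening a pre-existing file with O_CREAT leaves the old permission bits untouched, and only an explicit chmod()/fchmod() can force them. A minimal sketch (the function name mode_after_recreate is illustrative, not from libvirt):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create path with mode 0600, then reopen it with O_CREAT and mode 0644.
 * Returns the file's final permission bits; the second mode argument is
 * ignored by the kernel because the file already exists. */
static int mode_after_recreate(const char *path)
{
    mode_t old = umask(022);   /* make the first create deterministic */
    int result = -1;
    struct stat st;
    int fd;

    unlink(path);
    fd = open(path, O_WRONLY | O_CREAT, 0600);   /* mode applied: file is new */
    if (fd >= 0) {
        close(fd);
        fd = open(path, O_WRONLY | O_CREAT, 0644);  /* mode IGNORED: exists */
        if (fd >= 0) {
            if (fstat(fd, &st) == 0)
                result = (int)(st.st_mode & 0777);
            /* an explicit fchmod(fd, 0644) would be needed to widen it */
            close(fd);
        }
    }
    umask(old);
    return result;
}
```

The file stays 0600 after the second open, which is exactly why a pre-existing save file can end up unreadable to the pool-refresh code unless the save path forces the mode itself.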

Comment 7 John Ferlan 2015-01-28 22:26:45 UTC
After some more digging and debugging, I've finalized and sent patches upstream to address this issue. The initial patches are upstream starting at:

  http://www.redhat.com/archives/libvir-list/2015-January/msg01029.html

So it's not lost to time - I did investigate some of the differences and what follows are some findings as well as some more thoughts about the general protection mismatch.

The differences in errors that I saw ("unexpected async job 3" vs. "transport endpoint not connected") were the result of a recent upstream change not yet in RHEL7 or RHEL6; I've included a patch for that in my series.

The difference between errors in RHEL7 and RHEL6 ("transport endpoint not connected" vs. "Operation not permitted") was due to a change in the underlying gnulib that libvirt uses to return ENOTCONN in certain circumstances instead of EACCES, while the libvirt code wasn't checking for ENOTCONN. That'll be fixed too.

That leaves the base bug/issue where the file could not be saved, which will be fixed with one caveat. If the 'save' file already exists in the NFS pool but cannot be opened, written to, chowned, and chmodded due to some file ownership or protection issue, then the attempt to save will fail with:

# virsh save rhel70 /home/bz1158034/pl/rhel70.save
error: Failed to save domain rhel70 to /home/bz1158034/pl/rhel70.save
error: Error from child process creating '/home/bz1158034/pl/rhel70.save': Permission denied

#

For example, from the description example step 3:

> 3.Change the guest's image and pre-create save file's group to qemu, so that
> qemu group user can read&write the image and pre-create save file
> #chmod 777 /var/tmp/nfs/rhel7.img
> #chmod 777 /var/tmp/nfs/rhel7.save
>
> #chown root:qemu /var/tmp/nfs/rhel7.img
> #chown root:qemu /var/tmp/nfs/rhel7.save
>
> # ll /var/tmp/nfs/
> total 5077000
> -rwxrwxrwx. 1 root qemu 4984995840 Oct 28 12:23 rhel7.img
> -rwxrwxrwx. 1 root qemu  213851864 Oct 28 12:23 rhel7.save
>

If 'rhel7.save' *already exists* as root:qemu, then when libvirt goes to chown it to qemu:qemu, that will fail. If the file had already existed as "qemu:qemu", then things would work. The chown is part of the processing of saving the file.
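The failing chown step can be illustrated in isolation. An unprivileged (or root-squashed) process may well be able to open and write a group-writable file, yet chown(2) to a different owner still fails with EPERM, since changing a file's ownership requires real, unsquashed root privilege. A small sketch with the hypothetical helper try_chown_root; when run as real root on a local filesystem the chown succeeds instead:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Create (or reuse) path, then try to chown it to root:root.
 * Returns 0 on success (only possible with real root privilege) or the
 * errno from chown() - EPERM for an unprivileged caller, which is the
 * situation a squashed root user is in on the NFS export. */
static int try_chown_root(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0660);
    if (fd < 0)
        return errno;       /* could not even open/create the file */
    close(fd);
    if (chown(path, 0, 0) < 0)
        return errno;       /* open succeeded, but ownership change did not */
    return 0;
}
```

This is the same EPERM that virFileOpenForceOwnerMode logs in the debug output in comment 4 ("cannot chown ... to (0, 0): Operation not permitted").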

In the "existing" code, if any part of the save failed *and* the save file path already existed, the file was removed, resulting in the "inability" to check what protections may have caused the issue. With these patches, if the file already exists, libvirt will not attempt to delete the file (if of course that patch is accepted).

The one "unresolved" issue is this usage of the NFS pool in order to create the save file and the "potential" mismatch of file mode bits which I'm not quite sure the best way to resolve.  "Theory" says once you successfully create and start a pool, then you should use libvirt volume API's in order to create/delete files within the pool. Using 'virsh save...' to the same client directory as the pool "goes behind the back" of the libvirt's storage volume since that save code has no way of knowing that the file is in the pool. There's no 'virsh save $domain $file --pool nfs' that would then use the pool code to manage the file... Thus if the creation of that file causes issues with pool refresh or libvirtd restart/reload, because the permissions aren't right, then that's a "configuration error" that needs to be resolved.

Comment 8 John Ferlan 2015-01-29 21:38:02 UTC
The important patches have been submitted into upstream

$ git describe 7879d031974417e767c2b6e198493289abffabdf
v1.2.12-47-g7879d03
$

Moving to POST - when libvirt is rebased this will be pulled in.

Comment 11 zhenfeng wang 2015-05-22 09:42:38 UTC
Reproduced this issue with libvirt-1.2.8-5.el7.x86_64 and verified this bug with libvirt-1.2.15-2.el7.x86_64. Verification steps as follows:
1. Prepare an NFS server exported with root_squash
# cat /etc/exports
/var/tmp/nfs *(rw,root_squash)

# ll /var/tmp/nfs/ -d
drwxrwxrwx. 2 root root 4096 May 22 17:22 /var/tmp/nfs/

2. Pre-create the guest's save file and place a guest image on the NFS server
# touch rhel71.save
# ll
total 9438852
-rwxrwxrwx. 1 root qemu          0 May 22 17:25 rhel71.save
-rwxrwxrwx. 1 root qemu 9665380352 May 22 17:22 rhel7.img

3. Keep the default qemu.conf configuration on the NFS client
#vim /etc/libvirt/qemu.conf
user = "qemu"
group= "qemu"
dynamic_ownership = 1

4. Create a netfs pool on the NFS client, and start the pool
#cat nfs.xml
<pool type='netfs'>
  <name>nfs</name>
  <uuid>08380649-cf25-4af7-b816-6f8494003f69</uuid>
  <capacity unit='bytes'>206423719936</capacity>
  <allocation unit='bytes'>121639010304</allocation>
  <available unit='bytes'>84784709632</available>
  <source>
    <host name='$nfs_server_addr'/>
    <dir path='/var/tmp/nfs'/>
    <format type='nfs'/>
  </source>
  <target>
    <path>/tmp/pl</path>
    <permissions>
      <mode>0700</mode>
      <owner>0</owner>
      <group>0</group>
    </permissions>
  </target>
</pool>

#mkdir /tmp/pl
#virsh pool-define nfs.xml
#virsh pool-start nfs

5. Start a normal guest whose image is located on the NFS server
# virsh list
 Id    Name                           State
----------------------------------------------------
 6     7.2                         running

6. Try to save the guest to the NFS server using the pre-created save file; it fails:

# virsh save 7.2 /tmp/pl/rhel71.save 
error: Failed to save domain 7.2 to /tmp/pl/rhel71.save
error: Error from child process creating '/tmp/pl/rhel71.save': Permission denied

7. Check that the pre-created save file still exists; libvirt does not delete it even though saving the guest failed with the above error.

# ll /tmp/pl/
total 9438852
-rwxrwxrwx. 1 root qemu          0 May 22 17:25 rhel71.save
-rwxrwxrwx. 1 root qemu 9665380352 May 22 17:36 rhel7.img

8. Try to save the guest to the NFS server directory as a non-existing file; this succeeds:

# virsh save 7.2 /tmp/pl/7.2.save

Domain 7.2 saved to /tmp/pl/7.2.save

# ll /tmp/pl/
total 10072852
-rw-rw----. 1 qemu qemu  649208581 May 22 17:37 7.2.save
-rwxrwxrwx. 1 root qemu          0 May 22 17:25 rhel71.save
-rwxrwxrwx. 1 root qemu 9665380352 May 22 17:36 rhel7.img

9. Restore the guest from this save file; this also succeeds and the guest works well:
# virsh restore /tmp/pl/7.2.save 
Domain restored from /tmp/pl/7.2.save

According to the above steps, marking this bug verified.

Comment 13 errata-xmlrpc 2015-11-19 05:54:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html

