Bug 1242558

Summary: journal: cannot remove config /etc/libvirt/qemu/tux1.xml: Operation not permitted
Product: Red Hat Enterprise Linux 7 Reporter: Robert Scheck <redhat-bugzilla>
Component: resource-agents Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA QA Contact: cluster-qe <cluster-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.1 CC: agk, cfeist, cluster-maint, djansa, dyuan, jtomko, lmiksik, mnovacek, mzhan, obockows, ppostler, rbalakri, robert.scheck, royoung
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: resource-agents-3.9.5-61.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1283877 (view as bug list) Environment:
Last Closed: 2016-11-03 23:57:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1283877    
Attachments:
Description Flags
Diff between VirtualDomain from RHEL 6 and 7 none

Description Robert Scheck 2015-07-13 15:06:00 UTC
Description of problem:
Either journal or libvirt (?) tries to delete the XML configuration files
for virtual machines from time to time, but I do not get why this happens.
If we 'chattr +i' the files, we get these errors logged:

Jul  8 09:14:35 intranet1 journal: libvirt version: 1.2.8, package: 16.el7_1.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-04-02-12:11:36, x86-024.build.eng.bos.redhat.com)
Jul  8 09:14:35 intranet1 journal: cannot remove config /etc/libvirt/qemu/tux1.xml: Operation not permitted
Jul  8 09:14:35 intranet1 journal: cannot remove config /etc/libvirt/qemu/tux2.xml: Operation not permitted
Jul  8 09:14:35 intranet1 journal: cannot remove config /etc/libvirt/qemu/tux3.xml: Operation not permitted

Without 'chattr +i' the file is gone and you (as a sysadmin) are doomed.
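
To see which configuration files are affected, the log lines above can be reduced to a list of paths. This is a minimal sketch: failed_removals is a hypothetical helper, and it only matches the exact message format quoted in this report.

```shell
# Hypothetical helper: read syslog-style lines on stdin and print the unique
# config paths that could not be removed.
failed_removals() {
    sed -n 's/.*cannot remove config \(.*\): Operation not permitted$/\1/p' | sort -u
}

# Example with two of the lines quoted above. On a live host the input could
# come from the system log instead (its location depends on the setup).
failed_removals <<'EOF'
Jul  8 09:14:35 intranet1 journal: cannot remove config /etc/libvirt/qemu/tux1.xml: Operation not permitted
Jul  8 09:14:35 intranet1 journal: cannot remove config /etc/libvirt/qemu/tux2.xml: Operation not permitted
EOF
```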

Version-Release number of selected component (if applicable):
libvirt-1.2.8-16.el7_1.3.x86_64
systemd-208-20.el7_1.5.x86_64

How reproducible:
I don't know exactly how to reproduce this; sometimes shutting down the VM is
enough. Note that /etc/libvirt is symlinked to a DRBD partition, but that
shouldn't be the cause, given that the same scenario works fine with RHEL 6.

Actual results:
The XML configuration file for the virtual machine is deleted.

Expected results:
No daemon should simply start deleting configuration files under /etc.

Comment 2 Robert Scheck 2015-07-13 15:42:42 UTC
Cross-filed case #01476059 on the Red Hat customer portal.

Comment 3 Alexandros Gkesos 2015-07-27 11:45:21 UTC
Hello,

The customer logs are attached to case #01476059, but they are around 116 MB (I can't attach them here).

Do you need any specific files?

Release    :  Red Hat Enterprise Linux Server release 7.1 (Maipo)
Kernel     :  3.10.0-229.7.2.el7.x86_64

vdsm	    : Not Installed          	 libvirt     : 1.2.8-16.el7_1.3      
qemu-img   : 1.5.3-86.el7_1.2       	 qemu-kvm    : 1.5.3-86.el7_1.2

Comment 4 Ján Tomko 2015-07-27 14:21:42 UTC
The only code paths where libvirt can log that error are the virDomainUndefineFlags API and the migration APIs with the use of the VIR_MIGRATE_UNDEFINE_SOURCE flag.

So from libvirt's point of view everything is fine: somebody must have asked libvirt to undefine the machine.

I do not see any logs related to libvirt in the customer case, but this is probably done by a different application. Is there anything running that could ask libvirtd to undefine a machine?

Comment 5 Robert Scheck 2015-07-27 14:27:55 UTC
Ján, could that be caused by /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
somehow? We added the virtual machine to pacemaker like this:

mkdir -p /var/lib/libvirt/qemu/pacemaker
chown qemu:qemu /var/lib/libvirt/qemu/pacemaker
pcs cluster cib vm_cfg
pcs -f vm_cfg \
  resource create vm ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/vm.xml \
    snapshot=/var/lib/libvirt/qemu/pacemaker
pcs -f vm_cfg \
  resource op remove vm monitor interval=10 timeout=30   # See: RHBZ#1031141
pcs -f vm_cfg \
  resource op remove vm start interval=0s timeout=90000  # See: RHBZ#1031141
pcs -f vm_cfg \
  resource op remove vm stop interval=0s timeout=90000   # See: RHBZ#1031141
pcs -f vm_cfg \
  resource op add vm monitor interval=60s timeout=30s
pcs -f vm_cfg \
  resource op add vm start interval=0 timeout=120s
pcs -f vm_cfg \
  resource op add vm stop interval=0 timeout=120s
pcs -f vm_cfg \
  constraint colocation add vm libvirtd INFINITY
pcs -f vm_cfg \
  constraint order libvirtd then vm
pcs cluster cib-push vm_cfg

And VirtualDomain_Start() calls verify_undefined(), which itself runs the
"virsh undefine <domain>" command.

Comment 6 Robert Scheck 2015-07-27 14:30:24 UTC
Created attachment 1056605
Diff between VirtualDomain from RHEL 6 and 7

The "virsh undefine" stuff was newly added with RHEL 7 - and as it works
on RHEL 6, this could be the cause...maybe?
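
One way to check whether an installed agent issues undefine calls is to search the script for them. This is a minimal sketch: find_undefine_calls is a hypothetical helper, and the pattern is only a rough match for "virsh ... undefine" on a single line.

```shell
# Hypothetical helper: print (with line numbers) the lines of a script that
# invoke virsh with the "undefine" subcommand.
find_undefine_calls() {
    grep -nE 'virsh.*[[:space:]]undefine([[:space:]]|$)' "$1"
}

# Example, using the agent path mentioned in comment 5:
#   find_undefine_calls /usr/lib/ocf/resource.d/heartbeat/VirtualDomain
```

An empty result with a non-zero exit status means no such call was found in the file.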

Comment 7 Ján Tomko 2015-07-27 15:00:50 UTC
The undefine call was added for bug 1016140:
https://github.com/ClusterLabs/resource-agents/commit/f00dcaf19

The following upstream change copies the file back into the original location if it disappeared during the undefine:
https://github.com/ClusterLabs/resource-agents/commit/897c03a3
which effectively makes the undefine last only until the next libvirtd restart.
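
The copy-back idea can be sketched as follows. This is an illustration under stated assumptions, not the agent's actual code: undefine_keep_config is a hypothetical name, and the undefine command is passed in as arguments so the logic can be tried without libvirt.

```shell
# Hypothetical helper: run an undefine command and restore the config file
# if the command removed it.
#   $1      path to the domain XML
#   $2...   command to run (for example: virsh undefine vm)
undefine_keep_config() {
    config=$1
    shift
    tmpf=$(mktemp) || return 1
    # Save a copy before undefining; give up if the copy fails.
    cp -p "$config" "$tmpf" || { rm -f "$tmpf"; return 1; }
    "$@" >/dev/null 2>&1
    # Put the file back only if it disappeared.
    [ -f "$config" ] || cp -p "$tmpf" "$config"
    rm -f "$tmpf"
}

# Example (assumes a domain named "vm"):
#   undefine_keep_config /etc/libvirt/qemu/vm.xml virsh undefine vm
```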

If the domain needs to be undefined for the VirtualDomain agent, I think the config should be stored somewhere else, not in libvirt's /etc/libvirt/qemu - all the machines there are marked as defined when the libvirt daemon starts up.

Comment 8 Robert Scheck 2015-07-27 15:17:29 UTC
I am not sure if 897c03a3 is really a good fix for this issue, given that it
is messing around dynamically in /etc/libvirt and /tmp. Note that e.g. in our
environment, /etc/libvirt, /var/lib/libvirt and /var/log/libvirt reside on a
DRBD device and we currently also seem to lose VMs during pacemaker takeover.

Comment 13 Ján Tomko 2015-10-07 14:43:15 UTC
That patch just seems to copy the config back if it disappears during undefine.
If the config file disappears during undefine, it should stay deleted - otherwise what's the point of undefining it?

Comment 14 Oyvind Albrigtsen 2015-10-08 10:48:36 UTC
This is to keep the configuration when doing an undefine: virsh removes the file if it is in the libvirt directories, but not if it is stored outside of them. The patch prevents users who upgrade from earlier versions from losing their configuration.

This is how it is done in upstream, and you can read the discussion/reasoning here: https://github.com/ClusterLabs/resource-agents/issues/487.
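
Whether a given config is at risk therefore depends on where it lives. A minimal sketch of such a check, assuming /etc/libvirt/qemu (the path used in this report) is the directory in question; in_libvirt_dir is a hypothetical helper, and readlink -f is the GNU coreutils option for resolving symlinks.

```shell
# Hypothetical check: does the config file live under libvirt's qemu directory?
#   $1  path to the domain XML
#   $2  directory to compare against (default: /etc/libvirt/qemu)
# Returns 0 if inside, 1 if outside, 2 if a path cannot be resolved.
in_libvirt_dir() {
    resolved=$(readlink -f -- "$1") || return 2
    base=$(readlink -f -- "${2:-/etc/libvirt/qemu}") || return 2
    case "$resolved" in
        "$base"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   in_libvirt_dir /etc/libvirt/qemu/vm.xml && echo "inside libvirt's directory"
```

Both paths are resolved first because, as in the reporter's setup, /etc/libvirt may itself be a symlink.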

Comment 19 Oyvind Albrigtsen 2016-03-01 11:10:42 UTC
Step-by-step reproduction:
Keep a backup of vm.xml, so it doesn't get lost in the Before test.

Before:
# rpm -q resource-agents
resource-agents-3.9.5-54.el7_2.6.x86_64
# pcs resource disable vm
# virsh define /etc/libvirt/qemu/vm.xml
# pcs resource enable vm
# ls /etc/libvirt/qemu/vm.xml
ls: cannot access /etc/libvirt/qemu/vm.xml: No such file or directory

After:
# rpm -q resource-agents
resource-agents-3.9.5-61.el7.x86_64
# pcs resource disable vm
# virsh define /etc/libvirt/qemu/vm.xml
# pcs resource enable vm
# ls /etc/libvirt/qemu/vm.xml 
/etc/libvirt/qemu/vm.xml
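
A quick way to tell which of the two cases applies is to compare the installed version-release against the fixed one (3.9.5-61.el7). This is a minimal sketch: has_fix is a hypothetical helper, and sort -V (GNU coreutils) only approximates RPM's own version comparison.

```shell
# Hypothetical helper: succeed if version-release $1 is at least $2.
# The default for $2 is the "Fixed In Version" of this bug.
has_fix() {
    fixed=${2:-3.9.5-61.el7}
    # The lower of the two versions sorts first; if that is the fixed
    # version, then $1 is equal to it or newer.
    lowest=$(printf '%s\n%s\n' "$1" "$fixed" | sort -V | head -n 1)
    [ "$lowest" = "$fixed" ]
}

# Example on an installed system:
#   has_fix "$(rpm -q --qf '%{VERSION}-%{RELEASE}' resource-agents)" && echo fixed
```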

Comment 20 michal novacek 2016-07-26 09:37:02 UTC
I have verified that the XML file defining the virtual machine does not
disappear from /etc/libvirt/qemu/ after the machine is started with
resource-agents-3.9.5-79.el7.x86_64.

----

common environment:
# pcs resource create vm ocf:heartbeat:VirtualDomain config=/etc/libvirt/qemu/rhel-7.xm
# pcs resource show vm
 Resource: vm (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/rhel-7.xml
  Utilization: cpu=2 hv_memory=1024
  Operations: start interval=0s timeout=90 (vm-start-interval-0s)
              stop interval=0s timeout=90 (vm-stop-interval-0s)
              monitor interval=10 timeout=30 (vm-monitor-interval-10)
# pcs resource disable vm
# ls -l /etc/libvirt/qemu/rhel-7.xml
-rw-------. 1 root root 4004 26. čec 10.59 /etc/libvirt/qemu/rhel-7.xml
# virsh define /etc/libvirt/qemu/rhel-7.xml

before the fix (resource-agents-3.9.5-54.el7.x86_64)
----------------------------------------------------
# pcs resource enable vm
# ls -l /etc/libvirt/qemu/*.xml
ls: cannot access /etc/libvirt/qemu/rhel-7.xml: No such file or directory


after the fix (resource-agents-3.9.5-79.el7.x86_64)
---------------------------------------------------
# pcs resource enable vm
# ls -l /etc/libvirt/qemu/*.xml
-rw-------. 1 root root 4004 26. čec 10.59 /etc/libvirt/qemu/rhel-7.xml

# pcs resource  | grep vm
 vm     (ocf::heartbeat:VirtualDomain): Started kiff-01.cluster-qe.lab.eng.brq.redhat.com

Comment 22 errata-xmlrpc 2016-11-03 23:57:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html