Description of problem:
CTDB cluster doesn't come to a healthy state when multiple nodes are rebooted one after the other while I/O is running from a Windows client.

1st time:
Out of a 4-node CTDB cluster, when two nodes were rebooted one after the other, each node came back and remained in UNHEALTHY state, and the two other nodes went to BANNED state.

2nd time:
Out of a 4-node CTDB cluster, when two nodes were rebooted one after the other, each node came back and remained in UNHEALTHY state, and the two other nodes went to DISCONNECTED state.

It happens even without running I/O.

Version-Release number of selected component (if applicable):
ctdb2.5-2.5.5-2.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a CTDB setup
2. Mount the volume using the VIP
3. Start I/O from a Windows client
4. Reboot node 1; check ctdb status
5. Reboot node 3; check ctdb status
6. Wait for both nodes to come up; check ctdb status
7. ctdb status shows the nodes in UNHEALTHY/DISCONNECTED state
8. In one scenario a node goes to BANNED state

Actual results:
CTDB cluster is UNHEALTHY. Nodes go to BANNED/DISCONNECTED state.

Expected results:
Once all the nodes come up, the cluster should be up and all nodes should be in OK state.

Additional info:
When the test was run in SELinux enforcing mode, there were AVCs related to ctdb and iptables:

type=AVC msg=audit(06/30/2015 01:25:33.897:367) : avc: denied { read } for pid=4431 comm=iptables path=/var/lib/ctdb/iptables-ctdb.flock dev="dm-0" ino=67681652 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:ctdbd_var_lib_t:s0 tclass=file

Switched SELinux to permissive mode; the cluster still does not come to a healthy state. Will provide the sosreports.
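To make the denial easier to read, here is a minimal sketch (plain POSIX shell, no SELinux tooling required) that pulls the source context, target context, and object class out of the AVC record above. The AVC text is copied verbatim from this report; the variable names are illustrative only:

```shell
# The AVC record from the audit log, verbatim.
avc='type=AVC msg=audit(06/30/2015 01:25:33.897:367) : avc: denied { read } for pid=4431 comm=iptables path=/var/lib/ctdb/iptables-ctdb.flock dev="dm-0" ino=67681652 scontext=system_u:system_r:iptables_t:s0 tcontext=system_u:object_r:ctdbd_var_lib_t:s0 tclass=file'

# Extract the fields that identify the denial: who (scontext),
# on what label (tcontext), and what kind of object (tclass).
src=$(printf '%s' "$avc" | grep -o 'scontext=[^ ]*')
tgt=$(printf '%s' "$avc" | grep -o 'tcontext=[^ ]*')
cls=$(printf '%s' "$avc" | grep -o 'tclass=[^ ]*')

# iptables (running in the iptables_t domain) was denied read on a
# file labeled ctdbd_var_lib_t -- the flock file under /var/lib/ctdb.
echo "$src"
echo "$tgt"
echo "$cls"
```

In other words, the policy shipped at the time had no rule allowing iptables_t to read files labeled ctdbd_var_lib_t, which is exactly the gap tracked in the SELinux BZ referenced below.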
Even with the new build, CTDB 2.5.5-3, the nodes are not coming to a healthy state after reboot. Seeing the following AVC when a system is rebooted and tries to fail back:

type=AVC msg=audit(07/03/2015 01:30:25.839:154) : avc: denied { block_suspend } for pid=31332 comm=smbd capability=block_suspend scontext=system_u:system_r:smbd_t:s0 tcontext=system_u:system_r:smbd_t:s0 tclass=capability2
Worked with smb-dev and the SELinux team to root-cause this; it looks like an SELinux issue. The fix has to come in the next SELinux build for RHEL 7.1. The SELinux BZ for RHEL 7.1 is https://bugzilla.redhat.com/show_bug.cgi?id=1224879
With the policy provided in #C9, all nodes come to the OK state even with multiple reboots of nodes. No AVCs related to iptables, winbind, or ctdb are seen. Please include these policies in the RHEL 7.1 SELinux policy build.
With #C25 in BZ https://bugzilla.redhat.com/show_bug.cgi?id=1224879, all the AVCs are fixed now. Need a RHEL 7 SELinux policy build to verify the bug.
With the SELinux policy build:

selinux-policy-targeted-3.13.1-32.el7.noarch
selinux-policy-3.13.1-32.el7.noarch

I am seeing AVCs that were not seen in the earlier build. Worked with Milos on this and found that the rule

allow ctdbd_t systemd_systemctl_exec_t : file { ioctl read getattr lock execute execute_no_trans open };

is present in the -31.el7 build but is missing from the -32.el7 build. Updated the RHEL policy BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1224879
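Until the rule is restored in the packaged policy, a local policy module carrying just that rule could be built and loaded as a stopgap. This is a sketch only; the module name ctdb_systemctl_local is hypothetical, while the allow rule itself is the one quoted above:

```
# ctdb_systemctl_local.te -- hypothetical local module restoring the
# rule that dropped out between -31.el7 and -32.el7.
policy_module(ctdb_systemctl_local, 1.0)

gen_require(`
	type ctdbd_t, systemd_systemctl_exec_t;
')

allow ctdbd_t systemd_systemctl_exec_t : file { ioctl read getattr lock execute execute_no_trans open };
```

Such a module would typically be compiled with `make -f /usr/share/selinux/devel/Makefile ctdb_systemctl_local.pp` and loaded with `semodule -i ctdb_systemctl_local.pp`; it can be removed with `semodule -r` once a fixed policy build lands.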
That is strange. Lukas, can you check it?
This is very strange. Actually, I'm working on this issue.
commit ce652d6c62c6d38d1dab05b862cecc863075d28c
Author: Lukas Vrabec <lvrabec>
Date:   Wed Jul 15 14:01:16 2015 +0200

    Allow ctdbd_t send signull to samba_unconfined_net_t.

commit 4aea5f1b161c8e711f593cf123de3b155ba71229
Author: Lukas Vrabec <lvrabec>
Date:   Wed Jul 15 14:00:39 2015 +0200

    Add samba_signull_unconfined_net()

commit 645b04ea4006f4f25f606662cdf9b526df7226e5
Author: Lukas Vrabec <lvrabec>
Date:   Wed Jul 15 13:44:41 2015 +0200

    Add samba_signull_winbind()
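The commit messages reference new refpolicy interfaces. As a rough sketch of what such an interface looks like in refpolicy style (the body below is an assumption inferred from the interface name, not the actual commit contents):

```
# Hypothetical sketch of samba_signull_winbind(): allow the caller's
# domain ($1) to send the null signal (signull, used for liveness
# checks) to winbind processes.
interface(`samba_signull_winbind',`
	gen_require(`
		type winbind_t;
	')

	allow $1 winbind_t:process signull;
')
```

ctdbd uses signull to check whether managed daemons are still alive, which is why ctdbd_t needs these interfaces called on its behalf.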
I made a new selinux-policy build with the fixes.
We need a RHEL 7.1.z build for the BZ to be moved to ON_QA. The fix is to be tested with the new selinux-policy-3.13.1-33.el7 build.
With the builds:

selinux-policy-3.13.1-33.el7.noarch
selinux-policy-targeted-3.13.1-33.el7.noarch

no AVCs are seen and all CTDB nodes come to the OK state after rebooting multiple nodes. We still need a 7.1.z build for this bug. Moving it to VERIFIED with this build, which is for 7.2.
Since the problem described in this bug report should be resolved by a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-1495.html