Bug 1466093 - Decommissioning domain controller role fails if system is rebooted after deployment (due to firewalld bug)
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: firewalld
Version: 28
Hardware: All Linux
Priority: unspecified
Severity: medium
Assigned To: Eric Garver
QA Contact: Fedora Extras Quality Assurance
Whiteboard: AcceptedFreezeException
Depends On:
Blocks: BetaFreezeException/F28BetaFreezeException
Reported: 2017-06-28 22:15 EDT by Adam Williamson
Modified: 2018-03-26 18:30 EDT (History)
Fixed In Version: firewalld-0.5.2-2.fc28
Last Closed: 2018-03-26 18:30:30 EDT
Type: Bug

Attachments
/var/log from an affected case (5.93 MB, application/x-gzip)
2017-06-28 22:16 EDT, Adam Williamson

Description Adam Williamson 2017-06-28 22:15:57 EDT
I recently implemented an openQA test which does the following:

* Starting from a clean Fedora 25 Server install, deploy the domain controller role
* On another system, starting from a clean Fedora 25 Server install, enrol as a client in the domain (using realmd)
* Once the client is enrolled, upgrade the Server system to Fedora 26, then upgrade the client system to Fedora 26
* Run through the usual server and client tests on Fedora 26

The server part of this test always fails right at the end, when the role is decommissioned with `rolectl decommission domaincontroller/domain.local`. In contrast, when a similar test is run entirely on Fedora 26 (and, indeed, entirely on Fedora 25) the decommissioning works successfully. It's only when an upgrade is involved that the decommissioning fails.

The failure seems to be related to something done to the firewall configuration during decommissioning, as the system journal contains these lines at the relevant time:

Jun 25 11:49:32 ipa001.domain.local firewalld[654]: WARNING: '/usr/sbin/iptables-restore --wait=2 -n' failed:
Jun 25 11:49:32 ipa001.domain.local firewalld[654]: WARNING: '/usr/sbin/ip6tables-restore --wait=2 -n' failed:
Jun 25 11:49:32 ipa001.domain.local firewalld[654]: ERROR: COMMAND_FAILED

/var/log/rolekit doesn't provide anything useful, though - the last message in it is:

2017-06-25 14:49:31 ERROR: b'Client uninstall complete.'

which I believe is passed along from the FreeIPA client uninstallation process.

I will attach a tarball containing the complete contents of /var/log from the server to this report. You can use 'journalctl --file' to read the journal files under /var/log/journal.
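For anyone reading the attached logs, here is a minimal sketch of pulling just the firewalld failures out of the unpacked journal. The directory layout inside the tarball is an assumption (JOURNAL_GLOB is a placeholder to adjust), and both function names are hypothetical helpers, not part of any tool.

```shell
#!/bin/sh
# JOURNAL_GLOB is a placeholder: point it at wherever the tarball was unpacked.
JOURNAL_GLOB="${JOURNAL_GLOB:-./var/log/journal/*/*.journal}"

# Print the entries logged by firewalld from the unpacked journal files.
# journalctl expands the glob itself, so it is passed quoted.
read_firewalld_journal() {
    journalctl --file "$JOURNAL_GLOB" --identifier firewalld --no-pager
}

# Keep only the WARNING and ERROR lines emitted by firewalld (reads stdin).
firewalld_failures() {
    grep -E 'firewalld\[[0-9]+\]: (WARNING|ERROR):'
}
```

Something like `read_firewalld_journal | firewalld_failures` should then show lines such as the three quoted above.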
Comment 1 Adam Williamson 2017-06-28 22:16 EDT
Created attachment 1292752 [details]
/var/log from an affected case
Comment 2 Adam Williamson 2018-03-15 13:30:25 EDT
This has been happening ever since. It'd be really nice if this test would actually run successfully. Did you ever get to look at it, sgallagh?
Comment 3 Adam Williamson 2018-03-21 14:10:01 EDT
So, this is indeed still happening, now that we've solved various other issues that got in the way:

https://openqa.stg.fedoraproject.org/tests/261752

shows the very same error:

https://openqa.stg.fedoraproject.org/tests/261752#step/role_deploy_domain_controller_check/23
Comment 4 Adam Williamson 2018-03-21 18:28:38 EDT
sgallagh has said on IRC that this can be reproduced simply across a reboot (i.e. deploy, reboot, attempt to decommission -> fail) and seems to be an issue in firewalld.
Comment 5 Adam Williamson 2018-03-21 18:35:05 EDT
Proposing as a Beta freeze exception issue. It appears to me the criteria do not in fact cover decommissioning roles, though this may possibly be an oversight.
Comment 6 Stephen Gallagher 2018-03-21 20:05:25 EDT
OK, I've finally tracked down the failure. The issue is definitely occurring within firewalld. It can be reproduced with the following steps that do not require rolekit:

1) Install a system with firewalld enabled
2) `firewall-cmd --add-service freeipa-ldap --permanent`
3) `firewall-cmd --add-service freeipa-ldaps --permanent`
4) Reboot the system
5) Verify that both services are enabled with `firewall-cmd --list-all`
6) `firewall-cmd --remove-service freeipa-ldaps` (Succeeds)
7) `firewall-cmd --remove-service freeipa-ldap` (Returns "Error: COMMAND_FAILED")
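The steps above can be sketched as a two-phase script, one phase to run before the reboot and one after. This is only a rough sketch for a disposable test machine, run as root; the FWCMD override and the function names are hypothetical conveniences, and steps 1 and 4 (install and reboot) remain manual.

```shell
#!/bin/sh
# FWCMD is a hypothetical override hook so the command can be stubbed for dry runs.
FWCMD="${FWCMD:-firewall-cmd}"

before_reboot() {
    # Steps 2-3: add both services to the permanent configuration.
    "$FWCMD" --add-service freeipa-ldap --permanent &&
    "$FWCMD" --add-service freeipa-ldaps --permanent
}

after_reboot() {
    # Step 5: show the active configuration so both services can be verified.
    "$FWCMD" --list-all
    # Step 6: the first removal is expected to succeed.
    "$FWCMD" --remove-service freeipa-ldaps || return 2
    # Step 7: on affected versions this returns "Error: COMMAND_FAILED".
    if "$FWCMD" --remove-service freeipa-ldap; then
        echo "second removal succeeded: not affected"
    else
        echo "second removal failed: affected"
        return 1
    fi
}

# Usage: script.sh before   (then reboot)   script.sh after
case "${1:-}" in
    before) before_reboot ;;
    after)  after_reboot ;;
esac
```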

It appears that firewalld doesn't properly handle removing the second of two permanent services whose entries overlap. The freeipa-ldap and freeipa-ldaps services are almost identical, each providing numerous ports. They differ only in the LDAP TCP port, which is 389 for freeipa-ldap and 636 for freeipa-ldaps.

So it appears that after removing one of the two services, firewalld cannot properly handle removing the other one. This is what causes the FreeIPA decommissioning to fail. You can reverse steps 6 and 7 above; whichever removal comes second will always fail.
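To make the overlap concrete, here is a small sketch that compares two port lists. Only the 389/tcp and 636/tcp entries come from the description above; the other ports in the sample lists are illustrative placeholders, not the real service definitions, and the helper names are hypothetical.

```shell
#!/bin/sh
# Print each port from list $1 that also appears in list $2.
shared_ports() {
    for p in $1; do
        case " $2 " in *" $p "*) echo "$p" ;; esac
    done
}

# Print each port from list $1 that is absent from list $2.
only_in_first() {
    for p in $1; do
        case " $2 " in *" $p "*) ;; *) echo "$p" ;; esac
    done
}

# Placeholder lists: only 389/tcp and 636/tcp are taken from the report.
ldap_ports="80/tcp 443/tcp 389/tcp"
ldaps_ports="80/tcp 443/tcp 636/tcp"

echo "shared:";             shared_ports  "$ldap_ports"  "$ldaps_ports"
echo "only freeipa-ldap:";  only_in_first "$ldap_ports"  "$ldaps_ports"   # 389/tcp
echo "only freeipa-ldaps:"; only_in_first "$ldaps_ports" "$ldap_ports"    # 636/tcp
```

On a live system the real lists could presumably be read from `firewall-cmd --info-service=freeipa-ldap` and the equivalent for freeipa-ldaps instead of the placeholders.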

(I was clued in that it might be related to the freeipa-ldap/s interaction because the postgresql role did not exhibit the same behavior.)
Comment 7 Eric Garver 2018-03-22 08:41:48 EDT
What version of firewalld? I think this was fixed in version firewalld-0.5.2-1.
Comment 8 Stephen Gallagher 2018-03-22 08:53:31 EDT
I can reproduce the issue with firewalld-0.5.1-2.fc28.noarch

I can indeed confirm that firewalld-0.5.2-1.fc28.noarch resolves this issue. Please get it submitted in Bodhi ASAP.
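A quick way to tell whether a given machine already has the fixed build is to compare the installed firewalld version against 0.5.2. The sketch below assumes an RPM-based system and GNU `sort -V`; the function names are hypothetical.

```shell
#!/bin/sh
# True if version $1 is greater than or equal to version $2.
# Relies on version-aware sorting (sort -V from GNU coreutils).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Query the installed firewalld version (without the release suffix).
installed_firewalld_version() {
    rpm -q --queryformat '%{VERSION}\n' firewalld
}

# Report whether the installed firewalld is at least 0.5.2,
# the version confirmed above to resolve the issue.
firewalld_has_fix() {
    version_at_least "$(installed_firewalld_version)" "0.5.2"
}
```

For example, `firewalld_has_fix && echo fixed || echo "still affected"` on the test machine.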
Comment 9 Fedora Update System 2018-03-22 09:27:56 EDT
firewalld-0.5.2-1.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-fa59de3ded
Comment 10 Fedora Update System 2018-03-22 11:06:49 EDT
firewalld-0.5.2-1.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-fa59de3ded
Comment 11 Fedora Update System 2018-03-22 12:06:12 EDT
firewalld-0.5.2-2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-980d3f6ad7
Comment 12 František Zatloukal 2018-03-22 14:46:35 EDT
Discussed during blocker review [1]:

AcceptedFreezeException (Beta) - decommissioning isn't actually part of the release criteria, but is a significant function of server roles, and pulling this in will improve openQA test coverage

[1] https://meetbot-raw.fedoraproject.org/fedora-meeting-1/2018-03-22/
Comment 13 Fedora Update System 2018-03-23 10:44:00 EDT
firewalld-0.5.2-2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-980d3f6ad7
Comment 14 Adam Williamson 2018-03-23 11:48:16 EDT
openQA testing confirmed the fix for this:

https://openqa.stg.fedoraproject.org/tests/263230
Comment 15 Fedora Update System 2018-03-26 18:30:30 EDT
firewalld-0.5.2-2.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.
