Bug 1452866

Summary: FreeIPA upgrade script requires network to be up, but network is not up during upgrade when using dnf system-upgrade
Product: [Fedora] Fedora Reporter: Adam Williamson <awilliam>
Component: freeipaAssignee: IPA Maintainers <ipa-maint>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified    
Version: 26CC: abokovoy, edewata, ignatenko, ipa-maint, jhrozek, jmracek, jpazdziora, lkrispen, mreynolds, nhosoi, nkinder, pschindl, pvoborni, rcritten, rmeggins, robatino, samuel-rhbugs, slaznick, ssorce, thaller, wibrown, zbyszek
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard: AcceptedFreezeException
Fixed In Version: freeipa-4.6.3-2.fc27 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-20 17:12:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1349187, 1403352, 1456038, 1503321    
Attachments:
Description Flags
/var/log (in a tar.gz) from the server side
none
/var/log (in a tar.gz) from the client side none

Description Adam Williamson 2017-05-19 19:00:49 UTC
I've been adjusting openQA to test FreeIPA upgrades recently. I *think* I've now got it to the point where the test is valid, but it's failing because of a genuine bug (but this is all new stuff, so it's possible I got something wrong).

Anyway, the test does this:

* A server VM boots from a clean Fedora 25 image and deploys itself as a FreeIPA server (using rolectl)
* A client VM boots from the same image and enrols into the domain (using realmd)
* The server upgrades itself to Fedora 26 (using dnf system-upgrade), then reboots and sits there until the client is done
* The client waits for the server to be done upgrading, then also upgrades to Fedora 26
* The client runs some client-side tests to see if things are working correctly
* Once the client is done, the server tries to cleanly decommission itself, then the test ends

What I see happening is that one of the client side tests fails - when it runs `kinit -k host/client003.domain.local`, expecting success, it fails with the error "kinit: Cannot contact any KDC for realm 'DOMAIN.LOCAL' while getting initial credentials". Some earlier tests pass, but I don't think any of those actually *contact* the server, they're just checking local configuration on the client side.

When I look at the server logs, I see several failures from ns-slapd, including one which recurs over and over:

May 19 11:26:43 ipa001.domain.local ns-slapd[1205]: [19/May/2017:14:26:43.033212238 -0400] - ERR - cos-plugin - cos_cache_query_attr - cos attribute krbPwdPolicyReference failed schema check on dn: cn=dns,dc=domain,dc=local

there are various other errors, and ipa-dnskeysyncd.service keeps trying to start up and failing, with these messages:

May 19 11:28:12 ipa001.domain.local audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=ipa-dnskeysyncd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]: ipa         : INFO     LDAP bind...
May 19 11:28:14 ipa001.domain.local python2[1823]: GSSAPI client step 1
May 19 11:28:14 ipa001.domain.local python2[1823]: GSSAPI client step 1
May 19 11:28:14 ipa001.domain.local python2[1823]: GSSAPI client step 1
May 19 11:28:14 ipa001.domain.local python2[1823]: GSSAPI client step 2
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]: ipa         : INFO     Commencing sync process
May 19 11:28:14 ipa001.domain.local ns-slapd[1205]: [19/May/2017:14:28:14.803715502 -0400] - ERR - cos-plugin - cos_cache_query_attr - cos attribute krbPwdPolicyReference failed schema check on dn: idnsname=domain.local.,cn=dns,dc=domain,dc=local
May 19 11:28:14 ipa001.domain.local ns-slapd[1205]: [19/May/2017:14:28:14.806331783 -0400] - ERR - cos-plugin - cos_cache_query_attr - cos attribute krbPwdPolicyReference failed schema check on dn: ipk11UniqueID=562d1e30-3cbd-11e7-b9ff-525400120019,cn=keys,cn=sec,cn=dns,dc=domain,dc=local
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]: ipa.ipapython.dnssec.keysyncer.KeySyncer: INFO     Initial LDAP dump is done, sychronizing with ODS and BIND
May 19 11:28:14 ipa001.domain.local audit[1829]: AVC avc:  denied  { execute_no_trans } for  pid=1829 comm="ipa-dnskeysyncd" path="/usr/libexec/ipa/ipa-dnskeysync-replica" dev="dm-0" ino=812860 scontext=system_u:system_r:ipa_dnskey_t:s0 tcontext=system_u:object_r:ipa_dnskey_exec_t:s0 tclass=file permissive=0
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]: Traceback (most recent call last):
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/libexec/ipa/ipa-dnskeysyncd", line 110, in <module>
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     while ldap_connection.syncrepl_poll(all=1, msgid=ldap_search):
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib64/python2.7/site-packages/ldap/syncrepl.py", line 409, in syncrepl_poll
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     self.syncrepl_refreshdone()
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib/python2.7/site-packages/ipapython/dnssec/keysyncer.py", line 115, in syncrepl_refreshdone
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     self.hsm_replica_sync()
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib/python2.7/site-packages/ipapython/dnssec/keysyncer.py", line 181, in hsm_replica_sync
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     ipautil.run([paths.IPA_DNSKEYSYNCD_REPLICA])
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib/python2.7/site-packages/ipapython/ipautil.py", line 459, in run
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     preexec_fn=preexec_fn)
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib64/python2.7/subprocess.py", line 390, in __init__
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     errread, errwrite)
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:   File "/usr/lib64/python2.7/subprocess.py", line 1024, in _execute_child
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]:     raise child_exception
May 19 11:28:14 ipa001.domain.local ipa-dnskeysyncd[1823]: OSError: [Errno 13] Permission denied
May 19 11:28:14 ipa001.domain.local audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=ipa-dnskeysyncd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
May 19 11:28:14 ipa001.domain.local systemd[1]: ipa-dnskeysyncd.service: Main process exited, code=exited, status=1/FAILURE
May 19 11:28:14 ipa001.domain.local systemd[1]: ipa-dnskeysyncd.service: Unit entered failed state.
May 19 11:28:14 ipa001.domain.local systemd[1]: ipa-dnskeysyncd.service: Failed with result 'exit-code'.

I will attach tarballs containing the whole contents of /var/log from both server and client ends; there are rather a lot of errors, and I'm not sure which are significant.

Proposing as an F26 Beta blocker, as a violation of "For each one of the release-blocking package sets, it must be possible to successfully complete a direct upgrade from fully updated installations of the last two stable Fedora releases with that package set installed. ...  The upgraded system must meet all release criteria." - an upgraded Server system deployed as a FreeIPA server doesn't meet the Alpha criterion "Release-blocking roles and the supported role configuration interfaces must meet the core functional Role Definition Requirements to the extent that supported roles can be successfully deployed, started, stopped, brought to a working configuration, and queried.", as the domain controller role does not meet its core requirements. Now I read the criteria again, it's arguable whether this applies (I don't think FreeIPA is actually part of the *default* Server install), but I think it's worth considering at least.

Comment 1 Adam Williamson 2017-05-19 19:03:06 UTC
Created attachment 1280508 [details]
/var/log (in a tar.gz) from the server side

Comment 2 Adam Williamson 2017-05-19 19:06:51 UTC
Created attachment 1280509 [details]
/var/log (in a tar.gz) from the client side

Comment 3 Adam Williamson 2017-05-19 19:24:26 UTC
Something that might be useful when reading the logs: the client side attempted the 'kinit' command that failed at ~18:35 UTC - you can view the journals from the log tarballs with UTC timestamps using 'journalctl --file var/log/journal/(tab)/system.journal --utc' .

Comment 4 Adam Williamson 2017-05-19 21:08:19 UTC
So I think possibly the errors on post-upgrade boot may be caused by the FreeIPA upgrade process failing during the system upgrade:

2017-05-19T18:19:50Z DEBUG The ipa-server-upgrade command failed, exception: ScriptError: CA did not start in 300.0s
2017-05-19T18:19:50Z ERROR CA did not start in 300.0s
2017-05-19T18:19:50Z ERROR The ipa-server-upgrade command failed. See /var/log/ipaupgrade.log for more information

This seems to be the cause of the failure:

2017-05-19T18:14:47Z DEBUG Failed to check CA status: cannot connect to 'http://ipa001.domain.local:8080/ca/admin/ca/getStatus': [Errno 101] Network is unreachable
2017-05-19T18:14:49Z DEBUG The CA status is: check interrupted due to error: cannot connect to 'http://ipa001.domain.local:8080/ca/admin/ca/getStatus': [Errno 101] Network is unreachable

that repeats over and over until the upgrade process gives up because it thinks the CA isn't up.

AFAICT, the network is never brought up during system upgrade. I am not sure whether or not this is intentional. ipa001.domain.local is, of course, the server itself, so network connectivity shouldn't actually be required for this to work; there should be an entry for ipa001.domain.local in /etc/hosts , though it may list the system's public IP address, not 127.0.0.1.

So, *something* there is clearly wrong, but I'm not sure what to call the bug: is the fact that the network isn't brought up during system upgrade a bug? Is the fact that the system can't 'reach' itself with the network down a bug? Is the fact that the IPA upgrade process tries to use the FQDN of the server instead of just 'localhost' or '127.0.0.1' or something a bug? Really not sure.

When I get a minute, I'll try deploying a 24 server and upgrading to 25 or deploying a 23 server and upgrading to 24, and compare the results...

Comment 5 Adam Williamson 2017-05-19 21:16:30 UTC
Adding DNF folks and Zbigniew for the system-upgrade question; do we know if it's intentional that network is not brought up during system upgrade (at least, from the logs it looks that way)? It seems like network.target is activated, but network-online.target is not; network.target doesn't actually bring up the network, by the looks of it.

Comment 6 Zbigniew Jędrzejewski-Szmek 2017-05-19 21:48:01 UTC
Yes, network is not started by system-update.target. The idea is that the network should not be needed for upgrades, in fact it is expected that the upgrade will go through just fine w/o network.

> there should be an entry for ipa001.domain.local in /etc/hosts

Alternatively nss-myhostname could be in /etc/nsswitch.conf. That's a much more robust solution ;)

I think that using ipa001.domain.local sounds reasonable, if that is the configured controller name. I think we should make sure that this name is resolvable, even w/o functional network.

Comment 7 Adam Williamson 2017-05-19 23:33:26 UTC
Zbigniew: just to be clear - was that the same in earlier Fedoras (i.e. network did not come up during upgrade from 23 to 24, or 24 to 25)? Or has it changed?

Assuming it's always been that way, I think there *have* been some changes to nsswitch.conf / authconfig this cycle, and it's possible that may be causing different behaviour here. I'll experiment with some upgrade scenarios on Tuesday and see; I'm on vacation from now till Monday. If anyone else can play with it before then, that'd be great.

Comment 8 Zbigniew Jędrzejewski-Szmek 2017-05-20 02:44:15 UTC
Things changed. Things were documented to be this way (in the original offline updates wiki page, and then in the man page), but dnf-system-upgrade.service and packagekit-offline-update.service didn't have DefaultDependencies=no, so they pulled in a bunch of stuff.

As a short-term fix, if ipa server needs network to be up, then either that service could have Requires=network-online.target, After=network-online.target, or the upgrade script could start network-online.target.

Comment 9 Adam Williamson 2017-05-20 04:25:45 UTC
Aha, that might explain things. It would have to be the upgrade script that brought the network up, I guess - the service isn't actually *enabled*, it's just the upgrade script running as a scriptlet when the package is updated.

Comment 10 Adam Williamson 2017-05-20 04:27:08 UTC
Er, I mean, the service is enabled usually, but it doesn't start as part of system-update.target .

Comment 11 Simo Sorce 2017-05-20 12:53:05 UTC
Seem like a regression of system-update.
What is the reasoning behind not needing network ?

Even if the name can be resolved, we will probably have issues if multiple machines are upgraded at once and network is not available, as they may end up creating identical entries in LDAP, but without network they will not replicate out to other machines. When they are all done they will start replicating and we'll have replication conflicts.

Comment 12 Zbigniew Jędrzejewski-Szmek 2017-05-20 23:10:31 UTC
The idea is that the minimum amount of "stuff" is started. The sysinit.target decides what this minimum is, but essentially filesystems and basic linux API.

If an upgrade script requires some daemon to be started, it has to take care of this on its own. The service in question requires functional network, some other service might require a functional web server, etc. We explicitly do not want to start anything above that minimum by default, to avoid interference and problems while files in the filesystem are being swapped.

That said, I think it'd be much better to make the upgrade not require a functional network connection. Networks are finicky, and such a requirement makes the whole upgrade procedure less robust.

Comment 13 Simo Sorce 2017-05-21 14:57:30 UTC
I'll think about how to deal with network access and upgrades, but for F26 I think the simplest solution is to turn on the network.

Comment 14 Simo Sorce 2017-05-21 15:18:38 UTC
opened this upstream: https://pagure.io/freeipa/issue/6968

Comment 15 Adam Williamson 2017-05-21 18:22:53 UTC
So I did notice something openQA-specific here also: it's not Fedora or anaconda or FreeIPA or anything writing the /etc/hosts line for the server's own hostname in this test, I don't think, but openQA. Currently it writes a line using the system's external IP address, like this:

192.168.1.100 ipa001.domain.local

I've tweaked it to write this instead:

127.0.0.1 ipa001.domain.local

and we'll see how that works. I can't recall why I set it up to use the public IP initially; I *may* have tried with 127.0.0.1 first and run into some trouble. But we'll see how this goes, I guess.

Comment 16 Petr Schindler 2017-05-22 16:25:20 UTC
Discussed at 2017-05-22 blocker review meeting: [1]. 

This bug was rejected as beta blocker and accepted as Freeze Exception: This bug doesn't strictly violate the upgrade criterion. However, it's a nasty bug that we'd consider a FE for fixing. Please repropose for Final, as if this doesn't get fixed before then we'd want to re-evaulate.

[1] https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2017-05-22

Comment 17 Adam Williamson 2017-05-22 19:09:10 UTC
So I tested with two variants of a loopback address in /etc/hosts:

127.0.0.1 ipa001.domain.local

127.0.1.1 ipa001.domain.local

and initial server deployment fails in both cases (don't recall the error in the first case, the error in the latter case is "2017-05-22T19:03:31Z WARNING Invalid IP address 127.0.1.1 for ipa001.domain.local: cannot use loopback IP address 127.0.1.1").

So it seems pretty clear that we have to use a non-loopback address in the /etc/hosts line for the server's FQDN, but if we use the server's external network address, then we'll run into this problem if the network isn't up during the upgrade.

So to summarize: I think this is a real issue, not an error in the way we set up the networking configuration in the openQA test, unless anyone else can see a way to set things up such that the upgrade process would run correctly without the network being up.

Comment 18 Simo Sorce 2017-05-22 20:26:38 UTC
We need the network up for now, I opened an upstream b ug to change things so we can avoid it in futer, but for F26 I do not see any other option.

Comment 19 Adam Williamson 2017-05-23 01:18:45 UTC
I *think* we don't need anything more complex than the RPM scriptlet (or the upgrade command itself, possibly) running 'systemctl start network-online.target' to achieve that, though I haven't tried it yet.

Comment 20 Simo Sorce 2017-05-23 15:40:02 UTC
Well we need to start the network and wait until it is ready ... not just start it

Comment 21 Adam Williamson 2017-05-23 15:53:37 UTC
that's what network-online.target *is*: it's a target that's only 'complete' when the network is up.

Comment 22 Adam Williamson 2017-05-25 21:19:20 UTC
Sadly...this isn't so simple.

In a *default* Fedora setup, network-online.target Wants just one service: NetworkManager-wait-online.service . Unhappily, this service has:

Requisite: NetworkManager.service

what that means is that if NetworkManager.service has not already been started by some other mechanism or dependency chain, this service will *fail* (with a dependency error). The difference between Requires: and Requisite: is that 'Requires: foo.service' will cause systemd to attempt to start foo.service if it's not already started, whereas 'Requisite: foo.service' means "if foo.service hasn't already been started by other means, fail".

Of course we can just do 'systemctl start NetworkManager.service' first, but then what about people who use network.service or systemd-networkd.service ?

So basically right now I can't see a great, network-manager-agnostic way to tell systemd 'just bring up the network now please', beyond doing something icky like trying to figure out which network manager is enabled on multi-user.target or something (or just going ahead and starting the whole of default.target, which would entirely defeat the point of the minimal system update target). Zbigniew, is there one that I'm missing?

Comment 23 Adam Williamson 2017-05-25 21:21:17 UTC
sorry, just to make the problem explicit: all of the above means if you simply do 'systemctl start network-online.target' from the 'system upgrade' environment, it doesn't actually bring up the network, you just get these errors instead:

May 25 13:42:15 ipa001.domain.local systemd[1]: Network Manager is not active.
May 25 13:42:15 ipa001.domain.local systemd[1]: Dependency failed for Network Manager Wait Online.
May 25 13:42:15 ipa001.domain.local systemd[1]: NetworkManager-wait-online.service: Job NetworkManager-wait-online.service/start failed with result 'dependency'.
May 25 13:42:15 ipa001.domain.local systemd[1]: Reached target Network is Online.

Comment 24 Adam Williamson 2017-05-26 19:38:42 UTC
So I've been doing some further testing and it seems like there are still several issues in the upgrade process even with the network running, and I've not yet had a successful test even with various workarounds applied (network forced up during upgrade, SELinux in permissive mode).

So I'm going to rejig things a bit. As we've mostly been discussing the 'no network during upgrade' part of the problem here, let's make this the bug for that. I'm currently filing several other bugs for other issues I've noted, and I'll create a new tracker bug and have all the bugs block that.

Comment 25 Petr Vobornik 2017-05-29 14:32:20 UTC
I like Simo's proposal - run upgrade before IPA service start and not in RPM update https://pagure.io/freeipa/issue/6968 

That way we can get more predictable environment and is easier to test.

Question is what to do in f26.

But 6968 proposal is:
1. remove running of ipa-server-upgrade from %posttrans server
2. add running of ipa-server-upgrade  in ipa.service or `ipaclt start` Note: ipa.service calls ipactl start

I'm not sure how safe is #2.

A possibility is also not to run upgrader automatically. RPM transaction can just write that it needs to be run. If somebody then run `ipactl start` without running the upgrader first, ipactl will tell that upgrader needs to be run (this is already implemented).

Comment 26 Ludwig 2017-05-29 14:39:32 UTC
isn't the sevice stopped and started in rpm update ? if so, then moving the upgrade to service start will not get it completely out of rpm update

Comment 27 Simo Sorce 2017-05-30 08:47:27 UTC
Ludwig,
during system upgrade IPA is not running, so RPM will not restart it (unless there is a mistake in our scriplets.

And this i the case we want to deal with.
If you upgrade when the system is running and IPA is up, the restart will cause it to run the upgrade of course, but that is ok.

Petr,
we cannot leave systems "broken" and requiring manual intervention just because an rpm upgrade was run so we need to accept that upgrade can still run during rpm transactions.

A more complex setup would allow us to *not* restart services until the rpm transaction is completed and run the restart at the end to make sure all rpms have been properly updated before that happens. And that is probably already possibile within rpm, just need to make sure it works that way with our packages.

Comment 28 Zbigniew Jędrzejewski-Szmek 2017-05-30 19:00:59 UTC
I removed the dependency on #1455704. My bad, network-online.target works fine.

Comment 29 Adam Williamson 2017-05-30 19:45:19 UTC
Uh? No, I filed the bug from experience. I actually *tested* it,  with freeipa scratch builds adjusting the %posttrans scriptlet. First I adjusted it to just do:

systemctl start network-online.target

and it *didn't work*. That did not actually bring the network up - it failed with a dependency failure error message, and all the subsequent logs confirm the network really was not up; I got all the same 'network is unavailable' errors whenever it tried to reach anything but a loopback address.

Then I did a second scratch build which does:

systemctl start NetworkManager.service
systemctl start network-online.target

and that one *does* work, it actually brings the network up - there are no errors from systemd, and no errors indicating lack of network access.

So no, so far as my testing indicates, 1455704 is not some 'cosmetic' bug and it *does* block this. Restoring the dependency.

Comment 30 Adam Williamson 2017-05-30 19:47:45 UTC
Ah. Actually, wait - now I see. There really *is* a systemd problem, but it's not 1455704, that doesn't describe the bug properly. Removing the dependency again, but we do need a bug for 'there is no reliable way to say "just bring the network up now" because `systemctl start network-online.target` doesn't achieve it' - Zbigniew, is there an existing bug for that, or should I file one?

Comment 31 Zbigniew Jędrzejewski-Szmek 2017-05-30 20:07:21 UTC
Indeed. 'systemctl start network-online.target' pulls in neither NetworkManager.service nor systemd-networkd.service, because both *-wait-online.service have Requisite=. So we don't have a network-implementation agnostic way to pull in the network-configuration services.

If we do the change I proposed in #1455704, and then change Requisite= to Requires= in the *-wait-online.service, then things would work as expected. This would be a change as to how the *-wait-online.service's are used, but it seems to be a reasonable change.

Comment 32 Zbigniew Jędrzejewski-Szmek 2017-06-01 03:10:18 UTC
https://github.com/systemd/systemd/pull/6065
— if that is accepted, 'systemctl start network-online.target' would do the right thing (and synchronously configure the network).

Comment 33 Adam Williamson 2017-06-02 20:59:20 UTC
So here's an interesting thing: after fixing the network config in the openQA tests, I tried running them with all workarounds disabled - no hack to make the network work during the upgrade process, and no SELinux disabling...and the apparent test results are the same as with the workarounds in place. The client test passes, and the server test fails during decommissioning (I haven't got around to filing that bug yet, but it does the same with the workarounds active).

So even though the upgrade process clearly doesn't run properly at all, the end result is a deployment that at least still works to some degree (the client tests involve checking kinit and getent, logging in as a domain user, and checking that trying to log in as a domain user who's not allowed to log in fails correctly; https://openqa.stg.fedoraproject.org/tests/119514/modules/freeipa_client/steps/1/src is the test code, you can probably see what it's doing).

I did check from the server logs that ipa-server-upgrade definitely failed as described in this report - /var/log/ipaupgrade.log ends with:

2017-06-01T20:26:57Z DEBUG The ipa-server-upgrade command failed, exception: ScriptError: CA did not start in 300.0s
2017-06-01T20:26:57Z ERROR CA did not start in 300.0s
2017-06-01T20:26:57Z ERROR The ipa-server-upgrade command failed. See /var/log/ipaupgrade.log for more information

and I also checked that ipa-server-upgrade isn't re-run on the boot after the system upgrade (I thought it might be, but it doesn't seem to be the case). Nevertheless, things seem to work to at least some degree. I'd be surprised if *some* stuff wasn't broken by the upgrade process failing, though.

Comment 34 Simo Sorce 2017-06-05 10:29:12 UTC
New (or existing but changed) functions that depends on the upgrade running will probably be broken, the rest will keep working.
For example logging into the webui or even the whole REST API may be broken as we changed it quite dramatically. The code laid down expect an upgraded set of config files and LDAP records. So depending on how/when the upgrade breaks all or part of this may be broken until the upgrade is completed.

Comment 35 Adam Williamson 2017-06-05 15:23:48 UTC
Ah, okay. I didn't try running the test that exercises the webUI in this upgrade scenario yet, maybe I'll see if I can get that working this week (having multiple client tests in the upgrade case might be a bit hard depending on exactly the openQA mechanism for making tests wait on each other works in that scenario, I'm gonna have to dig into it).

Comment 36 Adam Williamson 2017-06-05 18:44:37 UTC
Discussed at 2017-06-05 blocker review meeting: https://meetbot-raw.fedoraproject.org/fedora-blocker-review/2017-06-05/f26-blocker-review.2017-06-05-16.01.html . As we're not entirely clear what the practical impact of this bug is yet, we can't yet determine its blocker status.

Comment 37 Adam Williamson 2017-06-09 22:03:49 UTC
So here's a funny thing: I just edited the upgrade test on openQA to run the web UI and password change modules, and...they passed:

https://openqa.stg.fedoraproject.org/tests/121421

So either I messed up something in the test, or the bits of the web UI (and user management) that those test modules exercise still work OK after the upgrade (you can see what the tests do from the video, or from https://openqa.stg.fedoraproject.org/tests/121421/modules/freeipa_webui/steps/1/src and https://openqa.stg.fedoraproject.org/tests/121421/modules/freeipa_password_change/steps/1/src ). It's Friday afternoon so I don't quite have time to dig into all the logs and stuff, will do next week if I can.

Comment 38 Standa Laznicka 2017-06-14 13:51:48 UTC
(In reply to Simo Sorce from comment #34)
> New (or existing but changed) functions that depends on the upgrade running
> will probably be broken, the rest will keep working.
> For example logging into the webui or even the whole REST API may be broken
> as we changed it quite dramatically. The code laid down expect an upgraded
> set of config files and LDAP records. So depending on how/when the upgrade
> breaks all or part of this may be broken until the upgrade is completed.

Simo, Adam, we're keeping FreeIPA 4.4 in Fedora so none of the changes that landed in 4.5 apply here. That's why your Web UI logging tests succeed. I am actually not quite sure whether there have been any changes that could break anything when the upgrade fails.

Adam, if I understand it correctly, in your comment https://bugzilla.redhat.com/show_bug.cgi?id=1452866#c33 you describe a situation when you were able to create such an environment where the network is online during the upgrade but the upgrade would still fail. Would it be possible to have the logs from this scenario? I would be interested to know why CA failed to start.

That way or another, if we need to resolve this problem quickly it might be best to store a flag that IPA should be upgraded somewhere on the disk and run the ipa-server-upgrade script when IPA starts for the first time after the system upgrade.

Comment 39 Adam Williamson 2017-06-14 16:01:05 UTC
Stanislav: sorry, I must've not been clear in my comment. #c33 describes a run where I disabled all workarounds (so networking would not have been up during the upgrade, and it would've run into several SELinux denials, which are reported in other bugs). When I use workarounds to ensure networking is available during the upgrade process (and SELinux is in permissive mode, to avoid the denials), the upgrade script succeeds.

Comment 40 Adam Williamson 2017-06-14 16:03:07 UTC
Given the 4.4 vs. 4.5 thing, I guess it will be interesting to see how this test behaves on Rawhide. I'll look at that...

Comment 41 Standa Laznicka 2017-06-15 06:17:08 UTC
If you do that, you really want to do that in permissive mode, otherwise I'm afraid you won't get very far. We have a RHEL Bugzilla for AVC denials there and scrolling through that may cause severe damage to your finger: https://bugzilla.redhat.com/show_bug.cgi?id=1436689

I am not sure whether we've got a Fedora BZ for that, I opened one months ago but that one did not by far include all the denials you would probably get running FreeIPA on rawhide.

Anyway - given that the upgrade passes with the network on, would it be enough to just wait for https://bugzilla.redhat.com/show_bug.cgi?id=1455704 to make it to Fedora and maybe only perform
`systemctl start network-online.target`
during the upgrade process? I would like to keep it as simple as possible.

Comment 42 Adam Williamson 2017-06-19 16:14:29 UTC
I'm withdrawing the blocker nomination at least until we're sure what the consequences of the failed upgrade script are. It's hard to decide if this is a blocker without that.

Comment 43 Tomas Krizek 2017-07-11 08:31:10 UTC
Fixed upstream
master:
https://pagure.io/freeipa/c/7ef08aa8cb6e504bc38850a6bf49cb2766b3c1df
ipa-4-5:
https://pagure.io/freeipa/c/3fd92a9c677f0486823c5218d087fa3703db5eab

This is fixed on IPA side, but in order for it to work, it also requires this systemd patch:
https://github.com/systemd/systemd/commit/9db307820e6f545665fc87f255af737228b7183c

Comment 44 Zbigniew Jędrzejewski-Szmek 2017-07-11 12:54:23 UTC
> https://github.com/systemd/systemd/commit/9db307820e6f545665fc87f255af737228b7183c

Oh, I thought I had included that patch in F26 and rawhide, but not (it was in rawhide temporarily, but then that build got unlinked). I'll do it.

I think we also need the corresponding NM change (#1455704). Not for this particular bug, but in general.

Comment 45 Tomas Krizek 2017-07-13 14:50:40 UTC
Please ignore the commits in comment #43, I accidentally pushed the code to a mirror repo instead of upstream. Sorry about that.

Comment 47 Adam Williamson 2017-10-18 16:44:34 UTC
Looks like this may be causing more serious issues in F26->F27 upgrades: https://bugzilla.redhat.com/show_bug.cgi?id=1503321

Comment 48 Adam Williamson 2017-10-30 22:42:54 UTC
So the attempt to fix this doesn't really work. Here's what gets logged when the scriptlet is run during the upgrade process:

Oct 30 12:40:40 ipa001.domain.local systemd[1]: Network Manager is not active.
Oct 30 12:40:40 ipa001.domain.local systemd[1]: Dependency failed for Network Manager Wait Online.
Oct 30 12:40:40 ipa001.domain.local systemd[1]: NetworkManager-wait-online.service: Job NetworkManager-wait-online.service/start failed with result 'dependency'.
Oct 30 12:40:40 ipa001.domain.local systemd[1]: Reached target Network is Online.

as a reminder, the attempted fix was adding these lines to the %posttrans in the package:

    # This is necessary for Fedora system upgrades which by default
    # work with the network being offline
    /bin/systemctl start network-online.target

right before the run of ipa-server-upgrade. So that command gets run, alright, but it does not actually bring the network up.

This is, I think, the same problem we were discussing back around comment #29. The upstream pull request - https://github.com/systemd/systemd/pull/6065 - seems to have been merged to master, but that's only relevant to networkd; the NetworkManager-wait-online service still has Requisite= , not Requires= .

So...what should we do about this? Change the NetworkManager-wait-online service to use Requires and see if that shifts it? How about folks using network.service (or, rather, /etc/init.d/network )? Zbigniew, how does network-online.target work with /etc/init.d/network , when that's being used rather than NetworkManager? Does it work at all?

Comment 49 Adam Williamson 2017-10-30 22:43:35 UTC
CC thaller as this may require a change to NetworkManager.

Comment 50 Thomas Haller 2017-11-02 16:13:24 UTC
(In reply to Adam Williamson from comment #49)
> CC thaller as this may require a change to NetworkManager.

The "Requires=NetworkManager.service" should be fixed for f27 and rawhide with NetworkManager-1.8.4-6.fc27.

Please test. See https://bodhi.fedoraproject.org/updates/NetworkManager-1.8.4-6.fc27

Comment 51 Petr Vobornik 2017-11-21 15:36:51 UTC
Upstream ticket:
https://pagure.io/freeipa/issue/6968

Comment 52 Petr Vobornik 2018-01-15 17:43:10 UTC
Both upstream tickets are fixed.

ipa-4-6:
    b2d3b56 Run server upgrade in ipactl start/restart

master (4-6):
    a36f2ae Ensure network is online prior to an upgrade
ipa-4-5:
    6ca6942 Ensure network is online prior to an upgrade

Comment 53 Fedora Update System 2018-01-31 23:24:26 UTC
freeipa-4.6.3-1.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-787ba41890

Comment 54 Fedora Update System 2018-02-01 19:30:41 UTC
freeipa-4.6.3-1.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-787ba41890

Comment 55 Fedora Update System 2018-02-09 00:09:25 UTC
freeipa-4.6.3-2.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-787ba41890

Comment 56 Fedora Update System 2018-02-09 18:46:40 UTC
freeipa-4.6.3-2.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-787ba41890

Comment 57 Fedora Update System 2018-02-20 17:12:04 UTC
freeipa-4.6.3-2.fc27 has been pushed to the Fedora 27 stable repository. If problems still persist, please make note of it in this bug report.