Bug 1396904
| Field | Value |
|---|---|
| Summary | rhel-osp-director: minor update: os-net-config package update restarts the network when corosync is already up and running |
| Product | Red Hat OpenStack |
| Component | openstack-tripleo-heat-templates |
| Reporter | Alexander Chuzhoy \<sasha\> |
| Assignee | Brent Eagles \<beagles\> |
| QA Contact | Alexander Chuzhoy \<sasha\> |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | high |
| Version | 10.0 (Newton) |
| Target Milestone | rc |
| Target Release | 10.0 (Newton) |
| Keywords | Triaged |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | openstack-tripleo-heat-templates-5.1.0-6.el7ost |
| Type | Bug |
| Last Closed | 2016-12-14 16:34:15 UTC |
| CC | dbecker, fdinitto, jcoufal, jschluet, jslagle, mandreou, mbayer, mburns, mcornea, michele, morazi, rhel-osp-director-maint, rscarazz, sasha |
Description (Alexander Chuzhoy, 2016-11-21 06:24:58 UTC)
Created attachment 1222314 [details]: tail -10 of /var/log/yum.log on the overcloud nodes

Created attachment 1222317 [details]: list_nodes_status output is clean

Created attachment 1222318 [details]: grep for the error from the description
(In reply to Alexander Chuzhoy from comment #0)
> rhel-osp-director: OSP10 minor update fails: ERROR galera sync timed out
>
> 2. Try to minor update.

To be clear, this is a fresh OSP10 deployment, with IPv6, 3 controllers in HA and also 3 service nodes (composable roles). Some clarifications please, sasha; something doesn't add up here:

> Result:
> The minor update fails.
>
> Heat shows:
> Galera cluster node is not synced.
> ERROR galera sync timed out
> ", "deploy_stderr": "", "update_managed_packages": "false",
> "deploy_status_code": 1 }, "creation_time": "2016-11-21T01:16:40Z",
> "updated_time": "2016-11-21T03:12:32Z", "input_values": {
> "update_identifier": "1479694135" }, "action": "UPDATE", "status_reason":
> "deploy_status_code : Deployment exited with non-zero status code: 1", "id":
> "cccd70cd-cb16-4006-80e3-2539ca8a20ae" }
>
> Logs on controller show:
> INFO: Galera instance wsrep_cluster_status=non-Primary
> Nov 21 02:26:57 overcloud-controller-0.localdomain galera(galera)[272198]:
> ERROR: local node <overcloud-controller-0> is started, but not in primary
> mode. Unknown state

Which controller did you see this error on? I see this error exactly once, only on controller 0 (https://bugzilla.redhat.com/attachment.cgi?id=1222318, but it has a different timestamp to what is reported above):

```
19420:Nov 20 21:26:57 host-192-168-0-18 galera(galera)[272133]: ERROR: local node <overcloud-controller-0> is started, but not in primary mode. Unknown state.
```

Poking at the undercloud in the env you sent via email, I see:

```
-rw-r--r--. 1 stack stack 389 Nov 21 02:04 overcloud_install.log
-rwxr-xr-x. 1 stack stack  55 Nov 21 02:07 update_command
```

I wonder if this is a case of not giving the cluster enough time to settle and for services to start after the deployment (it looks like you started the update about 3 minutes after the deploy?). But I am still confused by that galera error then; did it happen during the initial deploy?
Can you give any more details about where the update failed? I mean, which nodes, if any, were already successfully updated? I *think* the controllers were updated (based on tail -10 /var/log/yum.log, https://bugzilla.redhat.com/attachment.cgi?id=1222314), but none of the other nodes. The cluster (pcs status) looks green to me (no errors/failed services) and the output of list_nodes_status shows nothing (i.e. no heat resources deployed with status other than 0, https://bugzilla.redhat.com/attachment.cgi?id=1222317). So, at the moment there is nothing to debug here; sasha, please clarify. Does this reproduce? thanks, marios

Reproduced. The job waited for 10 mins before starting the update. It seems not all nodes have their RPMs updated:

```
[stack@undercloud-0 ~]$ for i in `nova list|awk '/ACTIVE/ {print $(NF-1)}'|awk -F"=" '{print $NF}'`; do ssh heat-admin@$i "hostname; sudo tail /var/log/yum.log"; done
overcloud-cephstorage-0.localdomain
Nov 21 16:31:15 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:31:29 Installed: wget-1.14-13.el7.x86_64
overcloud-compute-0.localdomain
Nov 21 16:31:32 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:31:46 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:46:16 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:46:16 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
overcloud-controller-0.localdomain
Nov 21 16:55:28 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch
Nov 21 16:55:28 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch
Nov 21 16:55:28 Updated: nss-tools-3.21.3-2.el7_3.x86_64
Nov 21 16:55:28 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch
Nov 21 16:55:28 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch
Nov 21 16:55:28 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch
Nov 21 16:55:28 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 16:55:28 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 16:55:28 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:55:28 Updated: os-net-config-5.0.0-5.el7ost.noarch
overcloud-controller-1.localdomain
Nov 21 17:02:22 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch
Nov 21 17:02:22 Updated: nss-tools-3.21.3-2.el7_3.x86_64
Nov 21 17:02:22 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 17:02:22 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 17:02:23 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 17:02:23 Updated: os-net-config-5.0.0-5.el7ost.noarch
overcloud-controller-2.localdomain
Nov 21 16:59:26 Updated: 1:openstack-manila-3.0.0-5.el7ost.noarch
Nov 21 16:59:26 Updated: 1:openstack-manila-share-3.0.0-5.el7ost.noarch
Nov 21 16:59:27 Updated: nss-tools-3.21.3-2.el7_3.x86_64
Nov 21 16:59:27 Updated: 1:openstack-heat-api-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-engine-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-api-cloudwatch-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: 1:openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
Nov 21 16:59:27 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
Nov 21 16:59:27 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:59:27 Updated: os-net-config-5.0.0-5.el7ost.noarch
overcloud-objectstorage-0.localdomain
Nov 21 16:32:48 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:03 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:43:26 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:43:26 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
overcloud-serviceapi-0.localdomain
Nov 21 16:33:06 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:23 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:49:29 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:49:29 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
overcloud-serviceapi-1.localdomain
Nov 21 16:33:26 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:33:43 Installed: wget-1.14-13.el7.x86_64
Nov 21 16:52:19 Updated: puppet-neutron-9.4.0-4.el7ost.noarch
Nov 21 16:52:19 Updated: puppet-tripleo-5.4.0-2.el7ost.noarch
overcloud-serviceapi-2.localdomain
Nov 21 16:33:47 Installed: rhos-release-1.1.8-1.noarch
Nov 21 16:34:01 Installed: wget-1.14-13.el7.x86_64
```

Created attachment 1222689 [details]: some of the controller journal from the sos reports in comment #5

From the timestamps, this looks like it is happening before the minor update is attempted? Are we sure the deployment was in a good state before starting the update?

Created attachment 1222705 [details]: some of the controller journal from the sos reports in comment #5

The behaviour you are seeing here may be related to BZ 1391671, but at this point the sos reports don't strongly support this theory. I do see a restart of the firewall as per bandini's comment (https://bugzilla.redhat.com/show_bug.cgi?id=1391671#c9), but it happens far earlier than the update, at

```
Nov 20 20:23:15 host-192-168-0-18 systemd: Starting IPv4 firewall with iptables...
```

and only once, so it is not clear yet. I do see the following a lot in the controller logs, however, and it explicitly mentions the firewall:

```
Nov 21 01:34:49 host-192-168-0-18 corosync[266094]: [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
```

In trying to trace when this first started, I think it is as attached in comment #10, which seems to be before the minor update?
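As a side note on the triage above, per-node yum.log tails of the shape produced by the ssh loop can be checked mechanically for the os-net-config update. The following is a minimal sketch (not part of the bug's tooling) run against an abbreviated two-node excerpt of the output quoted above:

```shell
#!/bin/bash
# Sketch: given "hostname + yum.log tail" output per node (as produced by the
# ssh loop above), flag which nodes did NOT receive the os-net-config update.
# The sample input is an abbreviated excerpt of the real output in this bug.
out='overcloud-cephstorage-0.localdomain
Nov 21 16:31:29 Installed: wget-1.14-13.el7.x86_64
overcloud-controller-0.localdomain
Nov 21 16:55:28 Updated: os-net-config-5.0.0-5.el7ost.noarch'

node=""; updated=""
while IFS= read -r line; do
  case "$line" in
    *.localdomain)         # a hostname line starts a new node section
      [ -n "$node" ] && echo "$node: ${updated:-os-net-config NOT updated}"
      node="$line"; updated="" ;;
    *os-net-config*)       # this node's tail shows the package was updated
      updated="os-net-config updated" ;;
  esac
done <<EOF
$out
EOF
[ -n "$node" ] && echo "$node: ${updated:-os-net-config NOT updated}"
```

Against the excerpt this reports the ceph node as not updated and controller-0 as updated, matching the full output above, where only the controllers received os-net-config-5.0.0-5.el7ost.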
Sasha, can you please verify this recreates on a completely clean/new environment, and be sure to keep the env around so we can debug further. I think we may ultimately escalate to PIDONE.

So we looked at this in more detail, and here is what happens (taking overcloud-controller-0 as a reference, but the same problem happens on all the controllers):

1) The cluster on the node gets stopped
2) yum update upgrades packages (among which os-net-config)
3) The cluster on the node gets started
4) The new os-net-config triggers an ifdown and ifup of the interface

Step 4) is the problematic one. Corosync does not cope with interfaces being deleted and then re-added (handling for this is planned for corosync 3), and we get the logs fully spammed with:

```
Nov 21 02:28:48 overcloud-controller-0.localdomain corosync[266094]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of thi...
```

We *must* guarantee that any network restart happens before corosync gets started. Maybe by calling os-net-config manually within yum_update.sh? Or by somehow making os-net-config smarter about which interfaces it can restart and which ones it cannot. Marios, how can we make sure that either a) os-net-config runs while the cluster is down, or b) os-net-config is smarter about what it can restart?

(In reply to Michele Baldessari from comment #14)
> 1) Cluster on the node gets stopped
> 2) yum update upgrade packages (amongst which os-net-config)
> 3) Cluster on the node gets started
> 4) The new os-net-config triggers and ifdown and ifup of the interface
> [...]

How did you determine that 4 (the os-net-config restart) is happening? Would just like to know where to start looking.

Got the info on IRC:

```
15:01 < bandini> slagle: https://review.openstack.org/#/c/397405/ see also giulio's comment on patchset 3. So I think that due to the fact that os-net-config gets updated we have this ifdown/ifup sequence
15:01 < supybot> Title: Gerrit Code Review (at review.openstack.org)
15:01 < bandini> slagle: makes sense?
15:02 < slagle> bandini: yea, i think so
15:02 < slagle> os-net-config detects that there is a change in the interface config, so it has to bounce it
```

beagles said he would look into this one, initially at least.

Patch currently proposed upstream on master: https://review.openstack.org/#/c/400960/ The snippet works in isolation. I'm trying to get an HA env set up so I can test it in an HA context, but it's taking a while. If anybody has an existing setup that they can pull the patch into to give it a try, it would be much appreciated.

Waiting for CI to land master; in the meantime I cherry-picked to newton, so updating trackers: https://review.openstack.org/#/c/401987/ (stable/newton).

Description of the "fast version" of how this patch was tested:
On an HA setup:

- make a copy of yum_update.sh
- remove the preamble specific to being driven by heat
- copy yum_update.sh to the controller nodes
- on one node, leave the interface files alone and run it. What we expect to see:
  - pcs commands to stop cluster services
  - yum update of regular packages
  - yum update for openvswitch
  - the os-net-config command runs successfully but, with no interface changes, doesn't bounce the interfaces
  - pcs commands to start cluster services
  - the cluster settles

This worked fine. os-net-config checked all of the interfaces and decided there were no changes necessary.

- on another node, edit ifcfg-br-ex, change it in some way (I removed the standalone thing), and run it. We expect to see:
  - pcs commands to stop cluster services
  - yum update of regular packages
  - yum update for openvswitch
  - the os-net-config command runs successfully, changing the modified interface back to the configured values and bouncing the interfaces
  - pcs commands to start cluster services
  - the cluster settles

This also works as expected.

I copied it to a compute node and ran the same test, but we don't expect to see anything relevant there because the yum_update.sh script exits before it gets to the os-net-config call. I don't think this is an issue. os-refresh-config will run os-net-config and update the OVS bridges that are configured on that node, but this would only impact compute nodes that were not multi-nic or DVR, and any impact should be short-lived.

Verified.
Environment: openstack-tripleo-heat-templates-5.1.0-6.el7ost.noarch
The reported issue doesn't reproduce; the update completed successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html
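For reference, the sequence the test above exercises (and the ordering constraint behind the fix: any interface bounce must happen while corosync is down) can be sketched as follows. This is only an illustration, not the actual upstream yum_update.sh patch; the external commands are stubbed as shell functions so the flow runs standalone, and the config path is the conventional os-net-config location, shown here as an assumption.

```shell
#!/bin/bash
# Sketch of the safe ordering: os-net-config (which may ifdown/ifup interfaces
# after the package update) runs while the cluster is still stopped, so
# corosync never observes an interface disappearing and reappearing.
# pcs, yum and os-net-config are recorded rather than executed (stubs).
steps=()
run() { steps+=("$1"); echo "-> $1"; }

run "pcs cluster stop"                                 # stop corosync/pacemaker on this node
run "yum -y update"                                    # update packages, incl. os-net-config
run "os-net-config -c /etc/os-net-config/config.json"  # apply net config; any bounce happens now, cluster is down
run "pcs cluster start"                                # interfaces have settled; corosync can form the cluster
```

With this ordering, the "Totem is unable to form a cluster" spam seen in the controller logs cannot be triggered by the update's interface bounce, because corosync is simply not running when the bounce occurs.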