Bug 1424945 - Director minor update fails using openstack.
Summary: Director minor update fails using openstack.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z3
Target Release: 10.0 (Newton)
Assignee: mathieu bultel
QA Contact: Eyal Dannon
URL:
Whiteboard:
Depends On:
Blocks: 1408224
Reported: 2017-02-20 07:04 UTC by Eyal Dannon
Modified: 2017-06-28 14:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-28 14:44:12 UTC
Target Upstream Version:
Embargoed:


Attachments
openstack software config (17.59 KB, text/plain)
2017-02-20 07:04 UTC, Eyal Dannon
sosreport compute node (9.91 MB, application/x-xz)
2017-02-20 07:06 UTC, Eyal Dannon


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1668266 0 None None None 2017-03-02 05:38:09 UTC
Red Hat Product Errata RHBA-2017:1585 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 director Bug Fix Advisory 2017-06-28 18:42:51 UTC

Description Eyal Dannon 2017-02-20 07:04:52 UTC
Created attachment 1255593 [details]
openstack software config

Description of problem:
Overcloud update fails when the nodes include a repo that contains openvswitch-2.6.

Version-Release number of selected component (if applicable):
OSPd -2017-02-15.1
openvswitch-2.5.0-14
openstack-tripleo-heat-templates-5.2.0-3.el7ost.noarch
openstack-heat-templates-0-0.11.1e6015dgit.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install OSPd + DPDK.
2. Create a local repo that contains openvswitch-2.6.1 on the nodes.
3. Run "openstack overcloud update stack -i fff".

Actual results:
Update fails

Expected results:
Update should succeed.

Additional info:

*Setup:
| 5eadda2c-7699-47bf-9ff8-cfadd1909352 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.0.40.7  |
| 05b0cb21-de7b-44f5-8708-4a309a3cf134 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.0.40.10 |


$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
IN_PROGRESS
WAITING
not_started: [u'controller-0']
on_breakpoint: [u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 0e29f18d-4d58-43b8-b2b6-fdf8d9dbfc4b), no=cancel update, C-c=quit interactive mode: 
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.


$ openstack stack failures  list  overcloud
overcloud.Controller.0:
  resource_type: OS::TripleO::Controller
  physical_resource_id: 83bd4fce-fd34-4d0b-9efb-2799e7966a1b
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
overcloud.Compute.0.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 81639f86-e4f2-4678-b6f8-aa6c92c7a777
  status: CREATE_FAILED
  status_reason: |
    Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3
  deploy_stdout: |
    Started yum_update.sh on server 5eadda2c-7699-47bf-9ff8-cfadd1909352 at Mon Feb 20 05:38:40 UTC 2017
  deploy_stderr: |

*"openstack software config" result is attached.
* sosreports are attached.

Comment 1 Eyal Dannon 2017-02-20 07:06:10 UTC
Created attachment 1255594 [details]
sosreport compute node

Comment 2 Saravanan KR 2017-02-22 14:50:16 UTC
Is there any guest VM running during the update? If yes, can you remove all the guest VMs and validate it?

Comment 3 Sanjay Upadhyay 2017-02-22 15:05:13 UTC
Eyal,

It's failing right after yum_update.sh, and the rhel-registration step shows errors in /var/log at that time.
I am seeing this in the logs for rhel-registration:
/var/log/messages -
 4288 Feb 19 15:33:31 compute-0 os-collect-config: dib-run-parts Sun Feb 19 15:33:31 EST 2017 Running /usr/libexec/os-refresh-config/pre-configure.d/06-rhel-registration
 4461 Feb 19 15:33:41 compute-0 os-collect-config: WARNING: Support for registering with a username and password is deprecated.
 4462 Feb 19 15:33:41 compute-0 os-collect-config: Please use activation keys instead.  See the README for more information.
 4463 Feb 19 15:33:41 compute-0 os-collect-config: WARNING: only 'portal', 'satellite', and 'disable' are valid values for REG_METHOD.
 4464 Feb 19 15:33:41 compute-0 os-collect-config: dib-run-parts Sun Feb 19 15:33:41 EST 2017 06-rhel-registration completed
...
20038 Feb 20 00:20:41 compute-0 os-collect-config: dib-run-parts Mon Feb 20 05:20:41 UTC 2017 Running /usr/libexec/os-refresh-config/pre-configure.d/06-rhel-registration
20039 Feb 20 00:20:45 compute-0 os-collect-config: WARNING: Support for registering with a username and password is deprecated.
20040 Feb 20 00:20:45 compute-0 os-collect-config: Please use activation keys instead.  See the README for more information.
20041 Feb 20 00:20:45 compute-0 os-collect-config: WARNING: only 'portal', 'satellite', and 'disable' are valid values for REG_METHOD.
               

*Since you may have tried the update many times, the above messages are repeated many times.*
This correlates with /var/log/rhsm/rhsmcertd.log:

0 Sun Feb 19 15:35:26 2017 [WARN] (Auto-attach) Update failed (255), retry will occur on next run.
 11 Sun Feb 19 15:35:27 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.
 12 Mon Feb 20 00:33:27 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.
 13 Mon Feb 20 04:33:28 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.


Can you check whether your rhel-registration.yaml complies with the documentation?

See https://access.redhat.com/errata/RHBA-2016:2978

Comment 4 Eyal Dannon 2017-02-26 12:18:29 UTC
Hi Sanjay,

I'm moving the repos from the undercloud to the overcloud nodes using our automation, plus creating a local repo with the relevant packages, then updating the environment following the regular guidelines.
In this case, is using rhel-registration.yaml a must?

Secondly, I tried to re-run the update after removing all the repos, using the following parameters, which work with manual registration:
parameter_defaults:
  rhel_reg_activation_key: ""
  rhel_reg_auto_attach: "true"
  rhel_reg_base_url: "cdn.stage.redhat.com"
  rhel_reg_environment: ""
  rhel_reg_force: "true"
  rhel_reg_machine_name: ""
  rhel_reg_org: ""
  rhel_reg_password: "*******"
  rhel_reg_pool_id: ""
  rhel_reg_release: ""
  rhel_reg_repos: ""
  rhel_reg_sat_url: ""
  rhel_reg_server_url: "subscription.rhn.stage.redhat.com:443/subscription"
  rhel_reg_service_level: ""
  rhel_reg_user: "edannon"
  rhel_reg_type: ""
  rhel_reg_method: ""
  rhel_reg_sat_repo: ""

# openstack overcloud deploy --debug --update-plan-only --templates --environment-file "$HOME/extra_env.yaml" --libvirt-type kvm --ntp-server clock.redhat.com -e /home/stack/ospd-10-multiple-nic-vlans-ovs-dpdk-single-port/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dpdk.yaml -e /home/stack/templates/rhel-registration/environment-rhel-registration.yaml

# openstack overcloud update stack -i overcloud

Any additional input?

Thanks!

Comment 5 Saravanan KR 2017-03-02 05:40:41 UTC
Even though this issue can be fixed in code for future deployments, the minor update from OSP10.z2 to OSP10.z3 will still fail, because the software config for yum_update.sh has already been created and will not be updated by updating the package. I am currently working on a solution for this issue.

Comment 6 Saravanan KR 2017-03-02 05:42:30 UTC
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1428017 tracks the fix for the script issue downstream.

This bug tracks the solution for the minor update from OSP10.z2 to OSP10.z3 by updating the existing software config.

Comment 7 Saravanan KR 2017-03-03 05:33:06 UTC
While discussing with the team, we figured out that in order to update the new template files before the minor update, we need to execute the same deploy command with the additional argument "--update-plan-only". After doing this, the minor update command should be executed. We are testing this.
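
The two-step order described in this comment can be sketched as follows. This is only an illustration under stated assumptions: the helper name `minor_update_commands` is hypothetical, and the "same deploy command" must carry whatever arguments the original deployment used (only `--templates` and `-e` files are modelled here). The function builds the argument lists and does not run anything.

```python
import shlex


def minor_update_commands(stack, env_files):
    """Return the two CLI invocations, in order, as argument lists.

    Hypothetical helper: a real deployment must repeat every argument
    of its original deploy command, not only the ones modelled here.
    """
    # Step 1: refresh the stored plan with the new templates, without deploying.
    refresh_plan = ["openstack", "overcloud", "deploy",
                    "--update-plan-only", "--templates"]
    for env in env_files:
        refresh_plan += ["-e", env]
    # Step 2: run the interactive minor update against the refreshed plan.
    run_update = ["openstack", "overcloud", "update", "stack", "-i", stack]
    return [refresh_plan, run_update]


if __name__ == "__main__":
    envs = ["/home/stack/templates/rhel-registration/environment-rhel-registration.yaml"]
    for cmd in minor_update_commands("overcloud", envs):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the plan refresh as a separate first step reflects the point above: the existing plan is not changed by a package update alone.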

Comment 8 Saravanan KR 2017-03-13 12:45:43 UTC
OvS restarts when the package is updated from 2.5.0-14 to 2.6.1-10. The restart of OvS will fail the deployment if the control plane is on an OVS bridge. For use cases where the control plane is on an interface, the minor update with the fix https://bugzilla.redhat.com/show_bug.cgi?id=1428017 works fine. You may need a restart after the minor update. 

@Eyal, Can you check this path?

Comment 9 Flavio Leitner 2017-03-14 16:15:06 UTC
If you're upgrading from 2.5.0-14 to 2.6.1-3 or older, the OVS service will probably be left in a broken state needing manual intervention.

We fixed this in 2.6.1-4 and newer, so upgrading from 2.5.0-14 to 2.6.1-10 should leave you with a working OVS service which requires to restart the service, so that means other services might be impacted.

If you don't want to restart the services, then you need to pass at least the options to rpm --nopostun --notriggerun during openvswitch package upgrade, but that leaves the systemd out-of-sync with regards to the service changes. So, it will cause issues if one tries to manage OVS services manually after the upgrade without a reboot/daemon-reload.
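
A minimal sketch of the no-restart path described in this comment, assuming the openvswitch RPM files are already available locally. The helper name and the package file name are hypothetical; `--nopostun` and `--notriggerun` are the options named above, and `systemctl daemon-reload` is one way to resync systemd when no reboot is planned. The code only builds the commands and does not execute them.

```python
import shlex


def ovs_rpm_upgrade_command(rpm_files):
    """Build an rpm upgrade command that skips the %postun and %triggerun scriptlets."""
    if not rpm_files:
        raise ValueError("at least one openvswitch RPM file is required")
    # -U upgrades the package; the two --no* options are the ones named in the comment.
    return ["rpm", "-U", "--nopostun", "--notriggerun"] + list(rpm_files)


# Follow-up step so systemd picks up the changed unit files without a reboot.
DAEMON_RELOAD = ["systemctl", "daemon-reload"]


if __name__ == "__main__":
    # Hypothetical package file name, for illustration only.
    cmd = ovs_rpm_upgrade_command(["openvswitch-2.6.1-10.x86_64.rpm"])
    print(" ".join(shlex.quote(arg) for arg in cmd))
    print(" ".join(DAEMON_RELOAD))
```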

Comment 10 Saravanan KR 2017-03-15 06:18:27 UTC
(In reply to Flavio Leitner from comment #9)
> If you're upgrading from 2.5.0-14 to 2.6.1-3 or older, the OVS service will
> probably be left in a broken state needing manual intervention.
> 
> We fixed this in 2.6.1-4 and newer, so upgrading from 2.5.0-14 to 2.6.1-10
> should leave you with a working OVS service which requires to restart the
> service, so that means other services might be impacted.
> 
Impact of restarting openvswitch:
---------------------------------
For DPDK on a tenant network with VxLAN, the tenant traffic must be on a tenant-only bridge that has only the DPDK port attached to it, and the IP is assigned directly to the bridge. When openvswitch restarts during a minor update on this setup, the IP on the bridge is lost. To bring the IP back, we need to restart network.service or trigger the ifup scripts. Because openvswitch is restarted, the update fails.

> If you don't want to restart the services, then you need to pass at least
> the options to rpm --nopostun --notriggerun during openvswitch package
> upgrade, but that leaves the systemd out-of-sync with regards to the service
> changes. So, it will cause issues if one tries to manage OVS services
> manually after the upgrade without a reboot/daemon-reload.
What exactly do you mean by "manually" here? Assume that we avoid the openvswitch restart and the deployment continues to run puppet (I don't know exactly which OVS calls are made during the puppet run). Is there any issue in that case?

If the issue arises only after the update is completed, and we restart openvswitch before the operator executes any command, then I assume it is fine (Franck agreed to it).

Comment 11 Saravanan KR 2017-03-15 06:26:53 UTC
I am capturing a very useful mail from Flavio in the BZ:
------------------------------------------------------
Yes, it (restart) is expected because 2.5.0-14 did not include the fix to not
restart OVS during upgrade.  We fixed it in 2.5.0-15 and onwards.

Therefore, when moving to either 2.5.0-23 or 2.6.1-10 from that
version, there will be one last service restart but regardless of
the restart, the OVS service should be up and running after the update.

This is a summary of upgrade and what happens:

From:         To:            Result:
2.4           2.5.0-14       service restart, operational
2.4           2.5.0-22       service restart, broken service
2.4           2.5.0-23       service restart, operational
2.4           2.6.1-10       service restart, operational

2.5.0-14      2.5.0-22       service restart, broken service
2.5.0-14      2.5.0-23       service restart, operational
2.5.0-14      2.6.1-10       service restart, operational

from here on, the table is the same:
2.5.0-22      2.5.0-23       no restart, service remains
2.5.0-22      2.6.1-10       no restart, service remains
2.5.0-23      2.6.1-10       no restart, service remains
2.6.1-10      any new        no restart, service remains


* operational means that OVS service should be operating normally.
* broken service means that the OVS service is not operating
  and systemd OVS services are unstable/broken.
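
For scripting around the table above, it can be encoded as a lookup. This is a sketch: the names `UPGRADE_RESULTS` and `expected_result` are hypothetical, only the rows listed in the mail are covered, and the "any new" row is modelled by assuming that any target different from 2.6.1-10 is a newer build.

```python
# (from_version, to_version) -> (service_restarts, state_after_update)
UPGRADE_RESULTS = {
    ("2.4", "2.5.0-14"): (True, "operational"),
    ("2.4", "2.5.0-22"): (True, "broken service"),
    ("2.4", "2.5.0-23"): (True, "operational"),
    ("2.4", "2.6.1-10"): (True, "operational"),
    ("2.5.0-14", "2.5.0-22"): (True, "broken service"),
    ("2.5.0-14", "2.5.0-23"): (True, "operational"),
    ("2.5.0-14", "2.6.1-10"): (True, "operational"),
    ("2.5.0-22", "2.5.0-23"): (False, "service remains"),
    ("2.5.0-22", "2.6.1-10"): (False, "service remains"),
    ("2.5.0-23", "2.6.1-10"): (False, "service remains"),
}


def expected_result(from_version, to_version):
    """Return (restarts, state) for a listed upgrade path, or None if unlisted."""
    key = (from_version, to_version)
    if key in UPGRADE_RESULTS:
        return UPGRADE_RESULTS[key]
    # "2.6.1-10 -> any new": assumption that a different target is a newer build.
    if from_version == "2.6.1-10" and to_version != from_version:
        return (False, "service remains")
    return None


if __name__ == "__main__":
    print(expected_result("2.5.0-14", "2.6.1-10"))
```

Returning None for unlisted paths keeps the sketch from guessing about combinations the mail does not mention.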

Comment 12 Saravanan KR 2017-03-15 14:26:40 UTC
We are testing with the "--notriggerun" option. Three different types of setup are used for testing. I have completed mine; status below.

Note: For DPDK update, we need to use the new post-install scripts for plan update. Scripts can be found at https://github.com/krsacme/tht-dpdk/blob/master/post-install-update.yaml.


1) DPDK on Provider network (Saravanan) 
    - Updated successfully without ovs restart
    - Restarted all nodes
    - Created VMs on DPDK provider network, Ping working

2) DPDK on Tenant network with VxLAN (Karthik)

3) ControlPlane on a bridge and DPDK on provider network (Eyal)

Comment 13 Karthik Sundaravel 2017-03-15 14:32:22 UTC
(In reply to Saravanan KR from comment #12)
> 2) DPDK on Tenant network with VxLAN (Karthik)
    - Updated successfully without ovs restart
    - Restarted all nodes
    - Created VMs on tenant network (Vxlan), Ping working

Comment 14 Eyal Dannon 2017-03-16 15:36:43 UTC
(In reply to Saravanan KR from comment #12)
> 3) ControlPlane on a bridge and DPDK on provider network (Eyal)
    - Updated successfully without ovs restart
    - Restarted all nodes
    - Created VMs on DPDK provider network, Ping working

Changes made:
- using the given post-install.yaml during the update.
- using "openstack-tripleo-heat-templates-5.2.0-5.el7ost.noarch" which includes the patch for non-ha environment.
- Adding "--notriggerun" as mentioned above.

Comment 16 Marios Andreou 2017-03-16 16:48:07 UTC
Hi all, thanks for the update skramaja. I'm now clearer on what the request/issue is here but I am still not 100% clear on why it is needed. 

I now understand that in your testing you need to have a special case yum update for openvswitch, specifically when starting at openvswitch-2.5.0-14. The workaround is what we had previously [1] but with the addition of '--notriggerun'. The result of doing the update this way is that openvswitch will not be restarted.

What I'm not clear about is why openvswitch being restarted is a problem. Is it something specific to your deployment? In the dev/qe environments that DFG:Upgrades is using we are indeed going from 2.5.0-14 to 2.6.1-8. We are doing that via yum update and apparently it doesn't break the upgrade anymore (it used to). In fact we had to *remove* the special case workaround logic with [2] because this time round the workaround itself was breaking us [3].

So, yes, we *could* add a manual rpm update with the right flags back into the minor update/major upgrade workflow and yes we *could* (horribly) detect that we were starting from openvswitch-2.5.0-14 and only execute the special case then (and otherwise fall back to yum update). However, do we really need to? I have to check that actually doing "rpm -U --replacepkgs --nopostun --notriggerun" works for us, because at least without the --notriggerun it does not, as we found out and had to land [2] for.

thanks, marios

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/pacemaker_common_functions.sh#L314
[2] https://review.openstack.org/#/q/59e5f9597eb37f69045e470eb457b878728477d7,n,z 
[3] https://bugs.launchpad.net/tripleo/+bug/1669714
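
The detection idea floated in this comment could look roughly like the sketch below. It is not the logic in the tripleo-heat-templates scripts: the function names are hypothetical, the installed package string is assumed to come from something like `rpm -q openvswitch`, and the plain `yum -y update` fallback is a simplification.

```python
# Starting version that would trigger the special case, as named in this thread.
SPECIAL_CASE_START = "openvswitch-2.5.0-14"


def needs_special_case(installed_pkg):
    """True if the installed package is exactly release 2.5.0-14 (any dist suffix)."""
    # Require a "." after the release so that e.g. "-140" does not match.
    return (installed_pkg == SPECIAL_CASE_START
            or installed_pkg.startswith(SPECIAL_CASE_START + "."))


def update_command(installed_pkg, rpm_files):
    """Pick the manual rpm path for the special case, otherwise plain yum update."""
    if needs_special_case(installed_pkg):
        return ["rpm", "-U", "--replacepkgs", "--nopostun", "--notriggerun"] + list(rpm_files)
    return ["yum", "-y", "update"]


if __name__ == "__main__":
    # Hypothetical RPM file name, for illustration only.
    print(update_command("openvswitch-2.5.0-14.git20160727.el7fdp.x86_64",
                         ["openvswitch-2.6.1-10.x86_64.rpm"]))
```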

Comment 18 Saravanan KR 2017-03-16 17:30:54 UTC
(In reply to marios from comment #16)
> What I'm not clear about is why openvswitch being restarted is a problem. Is
> it something specific to your deployment? 

Yes, it is specific to a deployment in which the IP is assigned directly to an OvS bridge. For DPDK deployments, in order to use a tenant network with DPDK on VxLAN, the TenantIP should be set on the ovs_user_bridge. 

> In the dev/qe environments that
> DFG:Upgrades is using we are indeed going from 2.5.0-14 to 2.6.1-8. We are
> doing that via yum update and apparently it doesn't break the upgrade
> anymore (it used to). 

Is it possible to verify the multiple-nic templates in the general deployment (without DPDK), where I think the update failure might happen? Let us know if this has already been verified and is working. One of our deployments is similar to this.


> In fact we had to *remove* the special case workaround
> logic with [2] because this time round the workaround itself was breaking us
> [3].

This is somewhat worrying.



Case 1) 2.4 -> 2.5 
TripleO yum update scripts apply a special condition so as not to restart openvswitch, and the product documentation recommends a manual reboot after the update.


Case 2) 2.6 -> Further (2.7)
As per Flavio's comment on ovs restart table:
2.6.1-10      any new        no restart, service remains
So openvswitch will not restart on yum update from 2.6 onwards, which essentially means that we need to reboot (restart of service) after the update.

Case 3) 2.5.0-14 -> 2.6.1-4+
We want to keep the same behavior.

Comment 19 Saravanan KR 2017-03-16 17:32:15 UTC
(In reply to Saravanan KR from comment #18)
> (In reply to marios from comment #16)
> > In the dev/qe environments that
> > DFG:Upgrades is using we are indeed going from 2.5.0-14 to 2.6.1-8. We are
> > doing that via yum update and apparently it doesn't break the upgrade
> > anymore (it used to). 
> 
> Is it possible to verify multiple-nic templates in the general deployment
> (without DPDK), where I think, the update failure might happen. Let us know
> if this has been verified already and working. One of our deployment is
> similar to this.
>

Multiple nic template from tripleo repo - https://github.com/openstack/tripleo-heat-templates/blob/master/network/config/multiple-nics/compute.yaml#L121

Comment 20 Marios Andreou 2017-03-17 18:32:46 UTC
o/ skramaja I didn't test the templates you suggested but I did check to see if
the --notriggerun works in my dev/env... with controllers starting @
openvswitch-2.5.0-14.git20160727.el7fdp.x86_64 ... on controller-1 I manually 
ran it like at [1] and confirmed that the node hangs and I couldn't get back
onto the box until I nova reboot it. On control-0 I added the --notriggerun
and it updated without any problems.

fyi, but let's pick up next week?  thanks

[1]https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh#L301

Comment 21 Saravanan KR 2017-03-20 06:15:56 UTC
(In reply to marios from comment #20)
> o/ skramaja I didn't test the templates you suggested but I did check to see
> if
> the --notriggerun works in my dev/env... with controllers starting @
> openvswitch-2.5.0-14.git20160727.el7fdp.x86_64 ... on controller-1 I
> manually 
> ran it like at [1] and confirmed that the node hangs and I couldn't get back
> onto the box until I nova reboot it. On control-0 I added the --notriggerun
> and it updated without any problems.
> 
> fyi, but lets pickup next week?  thanks
> 
> [1]https://github.com/openstack/tripleo-heat-templates/blob/master/
> extraconfig/tasks/pacemaker_common_functions.sh#L301

All my tests were based on 1 controller. Probably I will check if I can get 3 controllers in my lab to test it. The templates which I use in the lab are: https://github.com/krsacme/tht-dpdk

Comment 22 Saravanan KR 2017-03-20 10:54:28 UTC
I have tested with a 3-controller setup. The minor update is successful with the --notriggerun option, and I am able to create VMs after rebooting all the nodes. All my tests are based on baremetal overcloud nodes.

Marios, Can you share your environment configs? Probably, I might miss something in the HA configuration.

Comment 26 Assaf Muller 2017-03-23 17:36:05 UTC
https://review.openstack.org/#/c/434346/ was sent by Mathieu, should he be the owner of this bug? Also, do we expect that patch to resolve this RHBZ entirely?

Comment 28 Assaf Muller 2017-04-03 14:12:27 UTC
Is this bug a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1431115? If not, can you please point to two distinct patches, one to resolve this bug, and one to resolve the other bug?

Comment 31 Eyal Dannon 2017-05-21 10:56:07 UTC
I've verified the minor update following this Google doc:

https://docs.google.com/document/d/1PUdFw3L_9J49jTjzkabfSaNOCxH8fPDrGkmYmCrmgbQ/edit

Thanks,

Comment 33 errata-xmlrpc 2017-06-28 14:44:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1585

