Created attachment 1255593 [details]
openstack software config

Description of problem:
Overcloud update fails when the included repo contains openvswitch-2.6.

Version-Release number of selected component (if applicable):
OSPd 2017-02-15.1
openvswitch-2.5.0-14
openstack-tripleo-heat-templates-5.2.0-3.el7ost.noarch
openstack-heat-templates-0-0.11.1e6015dgit.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install OSPd + DPDK.
2. Create a local repo containing openvswitch-2.6.1 on the nodes.
3. Run "openstack overcloud update stack -i fff".

Actual results:
Update fails.

Expected results:
Update should succeed.

Additional info:

*Setup:
| 5eadda2c-7699-47bf-9ff8-cfadd1909352 | compute-0    | ACTIVE | - | Running | ctlplane=192.0.40.7  |
| 05b0cb21-de7b-44f5-8708-4a309a3cf134 | controller-0 | ACTIVE | - | Running | ctlplane=192.0.40.10 |

$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
IN_PROGRESS
WAITING
not_started: [u'controller-0'] on_breakpoint: [u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 0e29f18d-4d58-43b8-b2b6-fdf8d9dbfc4b), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.

$ openstack stack failures list overcloud
overcloud.Controller.0:
  resource_type: OS::TripleO::Controller
  physical_resource_id: 83bd4fce-fd34-4d0b-9efb-2799e7966a1b
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
overcloud.Compute.0.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 81639f86-e4f2-4678-b6f8-aa6c92c7a777
  status: CREATE_FAILED
  status_reason: |
    Error: resources.UpdateDeployment: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 3
  deploy_stdout: |
    Started yum_update.sh on server 5eadda2c-7699-47bf-9ff8-cfadd1909352 at Mon Feb 20 05:38:40 UTC 2017
  deploy_stderr: |

* "openstack software config" result is attached.
* sosreports are attached.
Created attachment 1255594 [details] sosreport compute node
Is there any guest VM running during the update? If yes, can you remove all the guest VMs and validate again?
Eyal, it's failing right after yum_update.sh and the rhel-registration step; the timestamps show the error in /var/log. I am seeing this in the logs for rhel-registration:

/var/log/messages -
4288 Feb 19 15:33:31 compute-0 os-collect-config: dib-run-parts Sun Feb 19 15:33:31 EST 2017 Running /usr/libexec/os-refresh-config/pre-configure.d/06-rhel-registration
4461 Feb 19 15:33:41 compute-0 os-collect-config: WARNING: Support for registering with a username and password is deprecated.
4462 Feb 19 15:33:41 compute-0 os-collect-config: Please use activation keys instead. See the README for more information.
4463 Feb 19 15:33:41 compute-0 os-collect-config: WARNING: only 'portal', 'satellite', and 'disable' are valid values for REG_METHOD.
4464 Feb 19 15:33:41 compute-0 os-collect-config: dib-run-parts Sun Feb 19 15:33:41 EST 2017 06-rhel-registration completed
...
20038 Feb 20 00:20:41 compute-0 os-collect-config: dib-run-parts Mon Feb 20 05:20:41 UTC 2017 Running /usr/libexec/os-refresh-config/pre-configure.d/06-rhel-registration
20039 Feb 20 00:20:45 compute-0 os-collect-config: WARNING: Support for registering with a username and password is deprecated.
20040 Feb 20 00:20:45 compute-0 os-collect-config: Please use activation keys instead. See the README for more information.
20041 Feb 20 00:20:45 compute-0 os-collect-config: WARNING: only 'portal', 'satellite', and 'disable' are valid values for REG_METHOD.

*Since you might have tried the update many times, the above info repeats many times.*

It correlates to /var/log/rhsm/rhsmcertd.log:
0 Sun Feb 19 15:35:26 2017 [WARN] (Auto-attach) Update failed (255), retry will occur on next run.
11 Sun Feb 19 15:35:27 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.
12 Mon Feb 20 00:33:27 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.
13 Mon Feb 20 04:33:28 2017 [WARN] (Cert Check) Update failed (255), retry will occur on next run.
Can you check whether your rhel-registration.yaml complies with the documentation? Refer to https://access.redhat.com/errata/RHBA-2016:2978
Hi Sanjay,

I'm moving the repos from the undercloud to the overcloud nodes using our automation, plus creating a local repo with the relevant packages, then updating the environment following the regular guidelines. In this case, is using rhel-registration.yaml a must?

Secondly, I have tried to re-update the environment after removing all the repos and using the following parameters, which work with manual registration:

parameter_defaults:
  rhel_reg_activation_key: ""
  rhel_reg_auto_attach: "true"
  rhel_reg_base_url: "cdn.stage.redhat.com"
  rhel_reg_environment: ""
  rhel_reg_force: "true"
  rhel_reg_machine_name: ""
  rhel_reg_org: ""
  rhel_reg_password: "*******"
  rhel_reg_pool_id: ""
  rhel_reg_release: ""
  rhel_reg_repos: ""
  rhel_reg_sat_url: ""
  rhel_reg_server_url: "subscription.rhn.stage.redhat.com:443/subscription"
  rhel_reg_service_level: ""
  rhel_reg_user: "edannon"
  rhel_reg_type: ""
  rhel_reg_method: ""
  rhel_reg_sat_repo: ""

# openstack overcloud deploy --debug --update-plan-only --templates --environment-file "$HOME/extra_env.yaml" --libvirt-type kvm --ntp-server clock.redhat.com -e /home/stack/ospd-10-multiple-nic-vlans-ovs-dpdk-single-port/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dpdk.yaml -e /home/stack/templates/rhel-registration/environment-rhel-registration.yaml
# openstack overcloud update stack -i overcloud

Any additional input? Thanks!
Even though this issue can be fixed in code for future deployments, it will still fail when updating from OSP10.z2 to OSP10.z3 (minor update), because the software config for yum_update.sh has already been created and will not be updated by updating the package. Currently working on figuring out a solution for this issue.
Bug https://bugzilla.redhat.com/show_bug.cgi?id=1428017 tracks fixing the script issue downstream. This bug is used to track the solution for the OSP10.z2 to OSP10.z3 minor update by updating the existing software config.
While discussing with the team, we figured out that in order to update to the new template files before the minor update, we need to execute the same deploy command with the additional argument "--update-plan-only". After doing this, the minor update command should be executed. We are testing this.
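The two-step flow above can be sketched as a shell fragment. This is a minimal sketch, not the exact commands from our test runs: the environment-file path is an example from this report, and DRY_RUN is a hypothetical guard so the sketch only prints the commands instead of running them against a real undercloud.

```shell
#!/bin/sh
set -eu

# Hypothetical guard: leave DRY_RUN as "echo" to print the commands,
# set DRY_RUN="" to actually execute them on an undercloud.
DRY_RUN=${DRY_RUN:-echo}

# Step 1: re-run the original deploy command with --update-plan-only so
# the stored plan (and therefore yum_update.sh) is refreshed without
# touching the overcloud nodes.
$DRY_RUN openstack overcloud deploy --update-plan-only --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-ovs-dpdk.yaml

# Step 2: only after the plan refresh, run the interactive minor update.
$DRY_RUN openstack overcloud update stack -i overcloud
```

The key point is the ordering: the plan refresh must complete before `update stack` runs, otherwise the update executes the stale yum_update.sh from the old software config.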
OvS restarts when the package is updated from 2.5.0-14 to 2.6.1-10. A restart of OvS will fail the deployment if the control plane is on an OvS bridge. For use cases where the control plane is on an interface, the minor update with the fix https://bugzilla.redhat.com/show_bug.cgi?id=1428017 is working fine. You may need a restart after the minor update. @Eyal, can you check this path?
If you're upgrading from 2.5.0-14 to 2.6.1-3 or older, the OVS service will probably be left in a broken state needing manual intervention. We fixed this in 2.6.1-4 and newer, so upgrading from 2.5.0-14 to 2.6.1-10 should leave you with a working OVS service, but it requires restarting the service, which means other services might be impacted.

If you don't want to restart the services, then you need to pass at least the options --nopostun --notriggerun to rpm during the openvswitch package upgrade, but that leaves systemd out of sync with regard to the service changes. So it will cause issues if one tries to manage OVS services manually after the upgrade without a reboot/daemon-reload.
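The resulting decision can be sketched as a small shell gate on the installed openvswitch version. The helper name needs_ovs_special_case is hypothetical; the rpm flags (-U --replacepkgs --nopostun --notriggerun) are the ones named in this thread, and the sketch only prints the command it would choose.

```shell
#!/bin/sh

# Hypothetical helper: returns success when the installed openvswitch
# build predates the upgrade fixes and therefore needs the manual rpm
# invocation instead of a plain yum update.
needs_ovs_special_case() {
  case "$1" in
    2.4.*|2.5.0-14*) return 0 ;;  # pre-fix builds: plain yum update restarts OVS
    *) return 1 ;;                # later builds include the fix
  esac
}

installed="2.5.0-14.git20160727"  # example version string from this thread
if needs_ovs_special_case "$installed"; then
  # Avoids the service restart, at the cost of systemd being out of sync
  # until a reboot/daemon-reload (see the caveat above).
  echo "rpm -U --replacepkgs --nopostun --notriggerun openvswitch-*.rpm"
else
  echo "yum -y update openvswitch"
fi
```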
(In reply to Flavio Leitner from comment #9)
> If you're upgrading from 2.5.0-14 to 2.6.1-3 or older, the OVS service will
> probably be left in a broken state needing manual intervention.
>
> We fixed this in 2.6.1-4 and newer, so upgrading from 2.5.0-14 to 2.6.1-10
> should leave you with a working OVS service which requires to restart the
> service, so that means other services might be impacted.

Impact of restarting openvswitch:
---------------------------------
For DPDK on a tenant network with VxLAN, we need to ensure the tenant traffic is on a bridge (a tenant-only bridge) which has only the dpdk port attached to it. The IP is assigned directly to the bridge. When we try to do a minor update on this setup and openvswitch restarts, the IP on the bridge is lost. To bring the IP back again, we need to restart network.service or trigger the ifup scripts. Because of the openvswitch restart, the update will fail.

> If you don't want to restart the services, then you need to pass at least
> the options to rpm --nopostun --notriggerun during openvswitch package
> upgrade, but that leaves the systemd out-of-sync with regards to the service
> changes. So, it will cause issues if one tries to manage OVS services
> manually after the upgrade without a reboot/daemon-reload.

What exactly do you mean by "manually" here? Assume that we avoid the openvswitch restart and the deployment continues to run puppet (I don't know what exact calls to ovs are made in the puppet run); is there any issue here? If the issue arises only after the update is completed and we need to restart openvswitch before the operator executes any command, then I assume it is fine (Franck agreed to it).
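The recovery step described above (the bridge loses its IP after an openvswitch restart) can be sketched as follows. needs_ifup and the bridge name br-link are illustrative names, not taken from the templates in this report, and the corrective commands are only printed to keep the sketch safe.

```shell
#!/bin/sh

# Hypothetical check: $1 is the output of `ip addr show dev <bridge>`.
# An openvswitch restart drops the address assigned directly to the
# bridge, so a missing "inet " line means ifup must be re-run.
needs_ifup() {
  ! printf '%s\n' "$1" | grep -q 'inet '
}

bridge=br-link  # example tenant-only bridge carrying the dpdk port
state=$(ip addr show dev "$bridge" 2>/dev/null || true)
if needs_ifup "$state"; then
  # Either of these brings the IP back; shown as echo rather than executed.
  echo "ifup $bridge"
  echo "systemctl restart network"
fi
```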
I am capturing a very useful mail from Flavio to the BZ:
------------------------------------------------------
Yes, it (the restart) is expected because 2.5.0-14 did not include the fix to not restart OVS during upgrade. We fixed it in 2.5.0-15 and onwards. Therefore, when moving to either 2.5.0-23 or 2.6.1-10 from that version, there will be one last service restart, but regardless of the restart, the OVS service should be up and running after the update.

This is a summary of upgrades and what happens:

From:        To:          Result:
2.4          2.5.0-14     service restart, operational
2.4          2.5.0-22     service restart, broken service
2.4          2.5.0-23     service restart, operational
2.4          2.6.1-10     service restart, operational
2.5.0-14     2.5.0-22     service restart, broken service
2.5.0-14     2.5.0-23     service restart, operational
2.5.0-14     2.6.1-10     service restart, operational

from here on, the table is the same:
2.5.0-22     2.5.0-23     no restart, service remains
2.5.0-22     2.6.1-10     no restart, service remains
2.5.0-23     2.6.1-10     no restart, service remains
2.6.1-10     any new      no restart, service remains

* operational means that the OVS service should be operating normally.
* broken service means that the OVS service is not operating and the systemd OVS services are unstable/broken.
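Flavio's table above can be encoded as a small lookup for scripting. ovs_upgrade_result is a hypothetical helper; the outcomes are copied verbatim from the table and cover only the version pairs it lists.

```shell
#!/bin/sh

# Returns the expected outcome of an openvswitch package upgrade,
# taken directly from the summary table above.
ovs_upgrade_result() {
  from=$1; to=$2
  case "$from:$to" in
    2.4:2.5.0-22|2.5.0-14:2.5.0-22)
      echo "service restart, broken service" ;;  # target lacks the 2.5.0-23/2.6.1-4 fix
    2.4:*|2.5.0-14:*)
      echo "service restart, operational" ;;     # one last restart from pre-fix source
    *)
      echo "no restart, service remains" ;;      # both ends carry the fix
  esac
}

ovs_upgrade_result 2.5.0-14 2.6.1-10
```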
We are testing with the "--notriggerun" option. 3 different types of setup are used for testing. I have completed mine; status below.

Note: For a DPDK update, we need to use the new post-install scripts for the plan update. Scripts can be found at https://github.com/krsacme/tht-dpdk/blob/master/post-install-update.yaml.

1) DPDK on provider network (Saravanan)
   - Updated successfully without ovs restart
   - Restarted all nodes
   - Created VMs on the DPDK provider network, ping working
2) DPDK on tenant network with VxLAN (Karthik)
3) ControlPlane on a bridge and DPDK on provider network (Eyal)
(In reply to Saravanan KR from comment #12)
> 2) DPDK on Tenant network with VxLAN (Karthik)
- Updated successfully without ovs restart
- Restarted all nodes
- Created VMs on the tenant network (VxLAN), ping working
(In reply to Saravanan KR from comment #12)
> 3) ControlPlane on a bridge and DPDK on provider network (Eyal)
- Updated successfully without ovs restart
- Restarted all nodes
- Created VMs on the DPDK provider network, ping working

Changes that have been done:
- Using the given post-install.yaml during the update.
- Using "openstack-tripleo-heat-templates-5.2.0-5.el7ost.noarch", which includes the patch for non-HA environments.
- Adding "--notriggerun" as mentioned above.
Hi all, thanks for the update skramaja. I'm now clearer on what the request/issue is here, but I am still not 100% clear on why it is needed.

I now understand that in your testing you need a special-case yum update for openvswitch, specifically when starting at openvswitch-2.5.0-14. The workaround is what we had previously [1], but with the addition of '--notriggerun'. The result of doing the update this way is that openvswitch will not be restarted.

What I'm not clear about is why openvswitch being restarted is a problem. Is it something specific to your deployment? In the dev/qe environments that DFG:Upgrades is using, we are indeed going from 2.5.0-14 to 2.6.1-8. We are doing that via yum update and apparently it doesn't break the upgrade anymore (it used to). In fact, we had to *remove* the special-case workaround logic with [2], because this time round the workaround itself was breaking us [3].

So, yes, we *could* add a manual rpm update with the right flags back into the minor update/major upgrade workflow, and yes we *could* (horribly) detect that we were starting from openvswitch-2.5.0-14 and only execute the special case then (and otherwise fall back to yum update). However, do we really need and have to? I have to check that actually doing "rpm -U --replacepkgs --nopostun --notriggerun" works for us, because at least without the --notriggerun it does not, as we found out and had to land [2] for.

thanks, marios

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/pacemaker_common_functions.sh#L314
[2] https://review.openstack.org/#/q/59e5f9597eb37f69045e470eb457b878728477d7,n,z
[3] https://bugs.launchpad.net/tripleo/+bug/1669714
(In reply to marios from comment #16)
> What I'm not clear about is why openvswitch being restarted is a problem. Is
> it something specific to your deployment?

Yes, it is specific to a deployment in which an IP is directly assigned on an OvS bridge. For DPDK deployments, in order to use a tenant network with DPDK on VxLAN, the TenantIP should be set on the ovs_user_bridge.

> In the dev/qe environments that
> DFG:Upgrades is using we are indeed going from 2.5.0-14 to 2.6.1-8. We are
> doing that via yum update and apparently it doesn't break the upgrade
> anymore (it used to).

Is it possible to verify multiple-nic templates in a general deployment (without DPDK), where, I think, the update failure might happen? Let us know if this has been verified already and is working. One of our deployments is similar to this.

> In fact we had to *remove* the special case workaround
> logic with [2] because this time round the workaround itself was breaking us
> [3].

This is something that bothers me.

Case 1) 2.4 -> 2.5
TripleO yum update scripts apply a special condition not to restart openvswitch, and the product documentation recommends a manual reboot after the update.

Case 2) 2.6 -> further (2.7)
As per Flavio's comment on the ovs restart table:
2.6.1-10     any new      no restart, service remains
So openvswitch will not restart on yum update from 2.6 onwards, which essentially means that we need a reboot (restart of the service) after the update.

Case 3) 2.5.0-14 -> 2.6.1-4+
We want to keep the same behavior.
(In reply to Saravanan KR from comment #18)
> (In reply to marios from comment #16)
> > In the dev/qe environments that
> > DFG:Upgrades is using we are indeed going from 2.5.0-14 to 2.6.1-8. We are
> > doing that via yum update and apparently it doesn't break the upgrade
> > anymore (it used to).
>
> Is it possible to verify multiple-nic templates in the general deployment
> (without DPDK), where I think, the update failure might happen. Let us know
> if this has been verified already and working. One of our deployment is
> similar to this.

Multiple-nic template from the tripleo repo: https://github.com/openstack/tripleo-heat-templates/blob/master/network/config/multiple-nics/compute.yaml#L121
o/ skramaja, I didn't test the templates you suggested, but I did check whether --notriggerun works in my dev/env, with controllers starting at openvswitch-2.5.0-14.git20160727.el7fdp.x86_64. On controller-1 I manually ran it like at [1] and confirmed that the node hangs and I couldn't get back onto the box until I nova rebooted it. On control-0 I added the --notriggerun and it updated without any problems.

FYI, but let's pick up next week? Thanks.

[1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh#L301
(In reply to marios from comment #20)
> o/ skramaja I didn't test the templates you suggested but I did check to see
> if the --notriggerun works in my dev/env... with controllers starting @
> openvswitch-2.5.0-14.git20160727.el7fdp.x86_64 ... on controller-1 I
> manually ran it like at [1] and confirmed that the node hangs and I couldn't
> get back onto the box until I nova reboot it. On control-0 I added the
> --notriggerun and it updated without any problems.
>
> fyi, but lets pickup next week? thanks
>
> [1] https://github.com/openstack/tripleo-heat-templates/blob/master/extraconfig/tasks/pacemaker_common_functions.sh#L301

All my tests were based on one controller. I will probably check if I can get 3 controllers in my lab to test it. The templates which I use in the lab are: https://github.com/krsacme/tht-dpdk
I have tested with a 3-controller setup. The minor update is successful with the --notriggerun option. I am able to create VMs after rebooting all the nodes. All my tests are based on baremetal overcloud nodes. Marios, can you share your environment configs? I might be missing something in the HA configuration.
https://review.openstack.org/#/c/434346/ was sent by Mathieu; should he be the owner of this bug? Also, do we expect that patch to resolve this RHBZ entirely?
Is this bug a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1431115? If not, can you please point to two distinct patches, one to resolve this bug, and one to resolve the other bug?
I've verified the minor update following the google doc: https://docs.google.com/document/d/1PUdFw3L_9J49jTjzkabfSaNOCxH8fPDrGkmYmCrmgbQ/edit

Thanks,
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1585