Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1585770

Summary: [RFE] safer and more transparent minor OSP updates
Product: Red Hat OpenStack
Reporter: Aviv Guetta <aguetta>
Component: RFEs
Assignee: Lukas Bezdicka <lbezdick>
Status: CLOSED WORKSFORME
QA Contact: Raviv Bar-Tal <rbartal>
Severity: urgent
Priority: urgent
Version: unspecified
CC: augol, ccamacho, cfields, cjanisze, cswanson, dciabrin, dgurtner, djuran, eelena, fbaudin, fherrman, gfidente, gkeegan, jpretori, jzaher, lbezdick, marjones, markmc, mbracho, mbultel, mburns, mtenheuv, rlondhe, srevivo, yprokule
Target Milestone: zstream
Keywords: FutureFeature, Triaged, ZStream
Target Release: ---
Hardware: All
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-12-23 11:48:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1646332, 1647438
Bug Blocks:

Description Aviv Guetta 2018-06-04 16:43:16 UTC
Hi,
This case is opened following a [rhos-tech] mailing list discussion ('Minor Updates of Compute nodes always happens all at once (and not "one at a time") - is this intended?')

We'd like more detailed information, clarifications, and answers from engineering about the list of steps below, so that:

1) We can make the update process clearer and more transparent to its users.
2) If the update isn't really 'one by one', we can treat this request as a bug.
3) We can document the update process in detail.


Also, I'm not sure whether this is intended or a bug (I expect intended, because the code that handles it this way is deliberate).

Can you clarify whether that is in fact how the process is designed, or whether this should go into a BZ?


This follows the Field Interlock presentation from back in January 2018 [1][2].

On slide 9 [3], at around the 5:20 mark, it's mentioned that it's possible to run an update on Compute nodes "one at a time", so it's possible to migrate workloads if needed.

From recent experience with OSP10, the system behaves differently.
Instead, the update works as follows:

1. Update OSP-director (undercloud)

2. Controllers.
   Updated "one at a time" with breakpoints ('-i'). It's also possible to reboot them.[4][5]

3. Computes "pre-update".
   In fact, the only things that get updated via yum are the Puppet modules. Migrating workloads away or rebooting the computes after this step is pointless.

4. Potentially Ceph (and other storage nodes) update[6]

5. After all nodes have been "updated" (step 3), a stack update runs, which goes through the normal Puppet steps. During these steps, a Puppet collector updates all packages. It is here that the packages on all compute nodes get updated at the same time.[7]

6. After Puppet finishes, all the configuration is again in sync on the Computes. It's at this point that it makes sense to migrate workloads and reboot compute nodes.
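For reference, the operator-facing commands behind the flow above can be sketched as follows. This is a hedged sketch based on the behaviour described in this report: the exact flags come from the OSP10-era tripleoclient as I understand it, so verify against the official OSP10 minor-update documentation. The commands are printed rather than executed, since they only make sense against a live undercloud:

```shell
# Print (not run) the OSP10 minor-update command sequence matching the
# steps above. The exact flags are an assumption based on this report.
update_cmds='# 1. Update the undercloud itself
sudo yum -y update
# 2-5. Interactive overcloud stack update; -i enables the breakpoints
#      that walk the controllers one at a time
openstack overcloud update stack -i overcloud'
echo "$update_cmds"
```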


To show that the above is actually how the process behaves, take a look:
1. The SoftwareDeployment "yum_update.sh" script on a Compute node, run at 22:35:
~~~
[root@lab-compute01 ~]# ls -altr /var/lib/heat-config/heat-config-script/81698e1c-ac23-4f10-8ce2-bcafc8d9e5ae 
-rwx------. 1 root root 20863 Feb 11 22:35 /var/lib/heat-config/heat-config-script/81698e1c-ac23-4f10-8ce2-bcafc8d9e5ae
[root@lab-compute01 ~]# tail -n 3  /var/lib/heat-config/heat-config-script/81698e1c-ac23-4f10-8ce2-bcafc8d9e5ae
echo "Finished yum_update.sh on server $deploy_server_id at `date`"
exit $return_code
~~~

2. The 'yum history' output on this Compute node. At 22:40 only the Puppet modules get updated[8]:
~~~
[root@lab-compute01 ~]# yum history | grep -E '^\s*([8,9]|10)'
    10 | System <unset>           | 2018-02-12 10:28 | I, O, U        |  682 EE
     9 | System <unset>           | 2018-02-11 23:29 | Update         |    2   
     8 | System <unset>           | 2018-02-11 22:40 | Update         |   32  <
~~~
~~~
[root@lab-compute01 ~]# yum history info 8 | grep "Command Line"
Command Line   : -q -y update puppet puppet-aodh puppet-apache puppet-barbican puppet-cassandra puppet-ceilometer puppet-ceph puppet-certmonger puppet-cinder puppet-collectd puppet-concat puppet-contrail puppet-corosync puppet-datacat puppet-elasticsearch puppet-firewall puppet-fluentd puppet-git puppet-glance puppet-gnocchi puppet-haproxy puppet-heat puppet-horizon puppet-inifile puppet-ironic puppet-java puppet-kafka puppet-keepalived puppet-keystone puppet-kibana3 puppet-kmod puppet-manila puppet-memcached puppet-midonet puppet-mistral puppet-module-data puppet-mongodb puppet-mysql puppet-n1k-vsm puppet-neutron puppet-nova puppet-nssdb puppet-ntp puppet-opendaylight puppet-openstack_extras puppet-openstacklib puppet-oslo puppet-ovn puppet-pacemaker puppet-rabbitmq puppet-redis puppet-remote puppet-rsync puppet-sahara puppet-sensu puppet-snmp puppet-ssh puppet-staging puppet-stdlib puppet-swift puppet-sysctl puppet-tempest puppet-timezone puppet-tomcat puppet-tripleo puppet-trove puppet-uchiwa puppet-vcsrepo puppet-vlan puppet-vswitch puppet-xinetd puppet-zaqar puppet-zookeeper
~~~
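Given a "Command Line" field like the one above, it's easy to confirm that a transaction touched only Puppet modules. A small sketch (the command line below is a shortened sample of the one shown, not the full list):

```shell
# Shortened sample of the yum "Command Line" field shown above
cmdline='-q -y update puppet puppet-aodh puppet-nova puppet-tripleo puppet-vswitch'

# Split into tokens, drop the yum flags and the 'update' verb, then count
# any remaining argument that does not start with "puppet"
non_puppet=$(echo "$cmdline" | tr ' ' '\n' \
  | grep -v '^-' | grep -v '^update$' \
  | grep -vc '^puppet' || true)

echo "$non_puppet"   # 0 => the transaction updated only puppet packages
```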

3. After the Compute node finished, the update moved to the next breakpoint (a Ceph node), which then ran "yum_update.sh" on that system at 22:46:
~~~
[root@lab-ceph01 ~]# ls -altr /var/lib/heat-config/heat-config-script/8448d216-cc7b-466e-a435-171c492146ad 
-rwx------. 1 root root 20863 Feb 11 22:46 /var/lib/heat-config/heat-config-script/8448d216-cc7b-466e-a435-171c492146ad
[root@lab-ceph01 ~]# tail -n 3 /var/lib/heat-config/heat-config-script/8448d216-cc7b-466e-a435-171c492146ad 
echo "Finished yum_update.sh on server $deploy_server_id at `date`"
exit $return_code
~~~

4. That likewise only runs yum on the Puppet modules, at around 22:50:
~~~
[root@lab-ceph01 ~]# yum history | grep -E '^\s*[6,7]'
     7 | System <unset>           | 2018-02-11 23:28 | I, O, U        |  711 EE
     6 | System <unset>           | 2018-02-11 22:50 | Update         |   32  <
~~~

5. After all breakpoints were done, Puppet started running through the steps.
   For example, on the Compute node from above (lab-compute01) at 23:23, here's the Puppet manifest which ultimately triggers the "update", via the "Package <| |>" collector:
~~~
[root@lab-compute01 ~]# ls -altr /var/lib/heat-config/heat-config-puppet/903348f2-70fd-423b-9299-2870b0e41cc5.pp 
-rwx------. 1 root root 1840 Feb 11 23:23 /var/lib/heat-config/heat-config-puppet/903348f2-70fd-423b-9299-2870b0e41cc5.pp
[root@lab-compute01 ~]# grep "packages" /var/lib/heat-config/heat-config-puppet/903348f2-70fd-423b-9299-2870b0e41cc5.pp 
$package_manifest_name = join(['/var/lib/tripleo/installed-packages/overcloud_compute', hiera('step')])
include ::tripleo::packages
~~~
~~~
[root@lab-compute01 ~]# grep latest -B1 /etc/puppet/modules/tripleo/manifests/packages.pp 
# [*enable_upgrade*]
#  Upgrades all puppet managed packages to latest.
--
  if $enable_upgrade {
    Package <| |> { ensure => 'latest' }
~~~
6. Similarly, on the Ceph node at around the same time (23:22):
~~~
[root@lab-ceph01 ~]# ls -altr /var/lib/heat-config/heat-config-puppet/22e48e59-d36f-4616-9fdc-0bc81a420c0e.pp 
-rwx------. 1 root root 1397 Feb 11 23:22 /var/lib/heat-config/heat-config-puppet/22e48e59-d36f-4616-9fdc-0bc81a420c0e.pp
[root@lab-ceph01 ~]# grep package /var/lib/heat-config/heat-config-puppet/22e48e59-d36f-4616-9fdc-0bc81a420c0e.pp
$package_manifest_name = join(['/var/lib/tripleo/installed-packages/overcloud_cephstorage', hiera('step')])
package_manifest{$package_manifest_name: ensure => present}
include ::tripleo::packages
~~~

7. And, as we already saw above, the actual yum update of all packages on the Ceph node was triggered at 23:28:
~~~
[root@lab-ceph01 ~]# yum history | grep -E '^\s*7'
     7 | System <unset>           | 2018-02-11 23:28 | I, O, U        |  711 EE
~~~





[1] https://bluejeans.com/playback/s/54XXIE7ZPZbdvZbXNFlFArkQcdYF9zjd0nSWPeXAzF4MdfguifoHu6AldDjGu4HI
[2] https://docs.google.com/presentation/d/1M1p0FGEfOLRW1GoZm4IPgE8-CLxuY_cYfsj2bgiYOxo/edit#slide=id.g1ea7293052_0_302
[3] https://docs.google.com/presentation/d/1M1p0FGEfOLRW1GoZm4IPgE8-CLxuY_cYfsj2bgiYOxo/edit#slide=id.g1d9bee5d40_0_3338
[4] One caveat: once you are at the next breakpoint, the services are already managed by pacemaker again, so this is already not great.
[5] At this point no service has been restarted either, which is a good thing: if new packages require changed configuration, that only happens later when Puppet runs. So rebooting at this point might actually break things.
[6] Ceph/storage nodes suffer from the same "pre-update" issue as Compute nodes.
[7] Nothing critical gets restarted, so VMs might actually continue running normally. It is then possible to migrate workloads away one node at a time and reboot the nodes, once the update process is completely finished.
[8] If you wonder why the big update of 682 packages only ran the next morning: Puppet failed and I had to rerun the update the next morning. This does not invalidate the timeline, though.




Version-Release number of selected component (if applicable):
RH-OSP 10

Comment 1 Lukas Bezdicka 2018-06-06 12:21:05 UTC
The slides are incorrect and what you observed is the real workflow:

Breakpoints are only useful for controllers, where yum_update.sh puts the controller services into unmanaged mode, updates all the packages, and then manages the services back. This happens controller by controller. Non-controller nodes (Ceph, Compute, ...) at this point only get workarounds for networking and an updated Puppet. After the last node, the Puppet steps start on all nodes, and in those steps yum update runs on all non-controllers at once.

No outage is expected, which means there is no need to migrate workloads away. The need for workload migration comes later, if a reboot is required. In later versions this got refactored, and one can override the Ansible playbooks to migrate workloads away and even reboot, but that is OSP12+.

Comment 24 Lukas Bezdicka 2019-02-14 16:34:15 UTC
For OSP13, the process is to first prepare the Heat outputs, which does not touch the overcloud:
    openstack overcloud update prepare <templates>
This usually takes around 15 minutes. Remember that after this you should not scale up before you finish, as you need the converge run.
The next step is to update the controllers one by one, or any role that you simply want to update and that is not computes:
    openstack overcloud update run --nodes <role>
If you then want to update specific computes or nodes manually, just change the role to a node name:
    openstack overcloud update run --nodes my-precious-compute-2
After you have done all the nodes, run converge. The reason for not scaling up earlier is that converge reruns the deployment steps on all the nodes; since you already ran the update runs, this is a no-op (it just takes time).
    openstack overcloud update converge <templates>

So, as you can see, you can control the scope with the --nodes option, as it translates directly to ansible's --limit option. This lets you limit to whatever your inventory allows; the inventory is generated by tripleo-ansible-inventory.
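A toy illustration of the --limit semantics that --nodes maps onto (the host names below are made up, not from a real inventory): the limit pattern simply selects a subset of the generated inventory, and everything outside the selection is left untouched.

```shell
# Flat stand-in for an inventory generated by tripleo-ansible-inventory
# (host names are invented for illustration)
hosts='controller-0
controller-1
my-precious-compute-2
my-precious-compute-3'

# "--nodes my-precious-compute-2" becomes ansible "--limit my-precious-compute-2":
# select exactly the matching host(s) from the inventory
limit='my-precious-compute-2'
selected=$(echo "$hosts" | grep -x "$limit")
echo "$selected"   # my-precious-compute-2
```

(The real --limit option also accepts group names and glob patterns; this sketch only shows the exact-name case.)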

Comment 29 Kamal Bhaskar 2020-01-23 09:36:59 UTC
Hi Damien

Glad to hear that.
Can you please also advise whether the workaround script shared in comment#4 is supported by Red Hat, in case the customer decides to go ahead with it in a production environment.

Thanks & Regards
Kamal Bhaskar

Comment 33 Lukas Bezdicka 2020-12-23 11:48:09 UTC
There will be no more changes to the OSP10 update procedure; all the possible paths were explored. If there are any issues, please open new BZs. OSP13 and OSP16 have quite different procedures and, in my opinion, solve all the issues, so please upgrade, as many others already have.