Bug 1383441

Summary: Shutdown and Start up procedure request for OSP Director based setups
Product: Red Hat OpenStack Reporter: Aviv Guetta <aguetta>
Component: documentation Assignee: Dan Macpherson <dmacpher>
Status: CLOSED CURRENTRELEASE QA Contact: RHOS Documentation Team <rhos-docs>
Severity: low Docs Contact:
Priority: medium    
Version: 8.0 (Liberty) CC: aguetta, brault, dbecker, dmacpher, jcoufal, jefbrown, jliberma, jslagle, lbopf, mburns, morazi, mschuppe, radoslaw.smigielski, rhel-osp-director-maint, rhos-docs, scohen, srevivo
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-05-15 13:57:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Aviv Guetta 2016-10-10 15:40:37 UTC
Description of problem:

We need a procedure for a graceful shutdown and startup of an OpenStack cloud (Undercloud + Overcloud, HA) that uses Red Hat OpenStack Director.

Version-Release number of selected component (if applicable):
Any.


Actual results:
Currently there is no

Comment 10 Dan Macpherson 2017-03-13 05:17:55 UTC
Content reviewed by Don Domingo and is now live.

Here's the link to the general reboot procedures in the Director guide:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/director_installation_and_usage/sect-rebooting_the_overcloud

And I've added the same reboot procedures at the end of each relevant upgrade procedure:

    Director - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Updating_Director_Packages
    Object Storage - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Swift
    Controller - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Controller
    Ceph Storage - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Ceph
    Compute - https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html/upgrading_red_hat_openstack_platform/chap-upgrading_the_environment#sect-Major-Upgrading_the_Overcloud-Compute

For each upgrade procedure, I have this text before the reboot:

"Check the /var/log/yum.log file on the [ROLE] node you have upgraded to see if either the kernel or openvswitch packages have updated their major or minor versions. If so, perform a reboot of each node:"

@Aviv -- How does the content look to you? Did you have any suggestions for improvements?
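
The yum.log check quoted above could be sketched roughly as follows. This is only an illustration, not text from the Director guide: the function name `reboot_candidates` is hypothetical, and the pattern assumes the usual yum.log line shape ("<Mon> <day> <time> Updated: <package>"). It only lists the relevant entries; comparing the major and minor versions is still left to the operator.

```shell
# Hypothetical helper (not from the Director guide): list yum.log entries that
# record an install or update of the kernel or openvswitch packages.
# Assumes lines shaped like "<Mon> <day> <time> Updated: <name>-<version>...".
reboot_candidates() {
    log_file="${1:-/var/log/yum.log}"
    # The optional "N:" allows for an epoch prefix before the package name.
    # Requiring a digit after the dash skips subpackages such as kernel-tools.
    grep -E '(Installed|Updated): ([0-9]+:)?(kernel|openvswitch)-[0-9]' "$log_file"
}

# Example (run on the upgraded node), then compare the versions by hand:
# reboot_candidates /var/log/yum.log
```

If the function prints nothing, the log holds no kernel or openvswitch entries and, by the quoted rule, no reboot would be needed for that reason.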

Comment 11 Aviv Guetta 2017-03-13 13:47:12 UTC
Hi Dan,
* [9.4] there should be a separator between the overall description and the steps to reboot the compute node.
* [9.4] there should be a sanity check after the compute node is up.
* [all] I don't understand why the procedure always states "Select the next node to reboot." If nodes should be rebooted one by one, this should be stated explicitly.
* [all] In order to avoid issues (like rebooting a faulty node, which will not start afterwards), there should also be a sanity check before the reboot.
* [all] Another use for the procedure is to shut down the compute node (and start it up later), so adding a 'poweroff' option should be considered as well.



Aviv

Comment 13 Dan Macpherson 2017-03-13 15:19:33 UTC
Hi Aviv,

Thanks for the feedback. I might need some more information on these requests. Responses inline...

(In reply to Aviv Guetta from comment #11)
> Hi Dan,
> * [9.4] there should be a separator between the overall description and the
> steps to reboot the compute node.

I'm not sure what you mean by a separator. Can you elaborate on this?

> * [9.4] there should be a sanity check after the compute node is up.

Sure, have you got a recommendation for a Compute sanity check?

> * [all] I don't understand why the procedure always states "Select the next
> node to reboot." If nodes should be rebooted one by one, this should be
> stated explicitly.

I think you answered already that this one is done.

> * [all] In order to avoid issues (like rebooting a faulty node, which will
> not start afterwards), there should also be a sanity check before the reboot.

What sort of pre-reboot sanity check were you thinking of? I only ask because one of the reasons you could be rebooting a node is that there's a fault with the node. What did you want to check for?

> * [all] Another use for the procedure is to shut down the compute node (and
> start it up later), so adding a 'poweroff' option should be considered as
> well.

I think the same processes can also apply to powering off nodes. Just instead of rebooting, you would just power off. So instead of a whole new procedure for power off, maybe a note to say the same procedures can be used but instead of rebooting to just power the node off. Would that make sense or is there more to it?

Comment 14 Aviv Guetta 2017-04-02 12:33:54 UTC
(In reply to Dan Macpherson from comment #13)
Hi Dan,

> > * [9.4] there should be a separator between the overall description and the
> > steps to reboot the compute node.
> 
> I'm not sure what you mean by a separator. Can you elaborate on this?

There are 3 steps that describe the process, followed by the practical part ('list compute nodes'). There should be a separator between them.


> > * [9.4] there should be a sanity check after the compute node is up.
> 
> Sure, have you got a recommendation for a Compute sanity check?


> > * [all] I don't understand why the procedure always states "Select the next
> > node to reboot." If nodes should be rebooted one by one, this should be
> > stated explicitly.
> 
> I think you answered already that this one is done.
Ack


> > * [all] In order to avoid issues (like rebooting a faulty node, which will
> > not start afterwards), there should also be a sanity check before the reboot.
> 
> What sort of pre-reboot sanity check were you thinking? I only ask because
> one of the reasons you could be rebooting a node is because there's a fault
> with the node. What did you want to check for?

It can be a shutdown as well. Also, an operator may reboot a node for one reason and still want to avoid other issues. The check should also give the operator a good view of the current status of the node and the environment before taking such an action.
At first glance, I'd suggest:
- [undercloud] Check that the status of all compute nodes is OK:
# [root@undercloud-0 ~]# openstack server list  

- [overcloud] Examine the OpenStack services on the rebooted compute node:
# [heat-admin@compute-1 ~]$ sudo systemctl list-units "openstack*" "neutron*" "openvswitch*" 
  ## this command can be changed according to the customer environment (in case of additional services).
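
As a rough sketch of how the undercloud check above could be made less manual: the helper below reads "<name> <status>" pairs and reports any node that is not ACTIVE. The function name `report_inactive_nodes` is hypothetical. The `-f value -c Name -c Status` output options and the `--state=failed` filter are assumptions about the installed client and systemd versions, so verify them in your environment first.

```shell
# Hypothetical helper: read "<name> <status>" pairs on stdin and print every
# node whose status is not ACTIVE. Exits non-zero if any such node is found.
# Assumes node names contain no spaces.
report_inactive_nodes() {
    awk 'NF >= 2 && $2 != "ACTIVE" { print; bad = 1 } END { exit bad + 0 }'
}

# Assumed usage on the undercloud, with the undercloud credentials loaded:
# openstack server list -f value -c Name -c Status | report_inactive_nodes
#
# Assumed follow-up on the compute node, listing only failed units:
# sudo systemctl list-units --state=failed "openstack*" "neutron*" "openvswitch*"
```

Running the same check before and after the reboot gives a simple comparison of the environment's state around the operation.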

> > * [all] Another use for the procedure is to shut down the compute node (and
> > start it up later), so adding a 'poweroff' option should be considered as
> > well.
> 
> I think the same processes can also apply to powering off nodes. Just
> instead of rebooting, you would just power off. So instead of a whole new
> procedure for power off, maybe a note to say the same procedures can be used
> but instead of rebooting to just power the node off. Would that make sense
> or is there more to it?

It should just be mentioned, alongside the reboot command step.

Comment 15 Aviv Guetta 2017-04-04 08:03:48 UTC
Hi Dan,
Additionally, the documentation should separate the Overcloud from the Undercloud: Director is the Undercloud, while the controllers and computes are the Overcloud.
Currently there is one (wrong) title:
"CHAPTER 9. REBOOTING THE OVERCLOUD"

Comment 16 Aviv Guetta 2017-05-15 11:19:31 UTC
Hi Dan,
I didn't receive any comments from the customer. As we have already provided the documentation, I think we can close this Bugzilla.
Thanks,

Aviv

Comment 17 Dan Macpherson 2017-05-15 13:56:13 UTC
No prob, Aviv. I just switched back to ASSIGNED to take care of the feedback from comment #15.