Description of problem:

I've been running through the procedure for replacing a controller (OSP-8) and I have a few corrections/suggestions. Some more important, some less. But I'm bundling them all up here anyway...

First off, why is the procedure for replacing a controller hidden in chapter 9, which deals with scaling the Overcloud? When I have a failed controller, scaling is not the first thing that comes to mind...

In https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks
Step 2: There are still references to ironic-discoverd, which was replaced by ironic-inspector after OSP 7...

Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks
Step 9, "Make sure all Undercloud services are running": from the command line given, it's not very clear which services belong to the Undercloud and which ones _should_ be running. If this step is relevant, I would suggest we document the expected result better.

In https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Node_Replacement
a bit into this section, the following command line is given:

sudo sed -i "s/resource\.0/resource.1/g" ~/templates/my-overcloud/overcloud.yaml

The "sudo" is not required (and also confusing), since the operation should be done on a copy of the overcloud.yaml. I therefore suggest we drop the sudo.

Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Node_Replacement
there is a sentence, in a box labeled "Important", that reads: "However, note that the -e ~/templates/remove-controller.yaml is only required once in this instance." I haven't a clue what this instruction/warning refers to. In my opinion, we should clarify...

Instructions for dealing with nodes that are registered to a Red Hat Satellite are missing. If a failed node is registered but is no longer operational, the RHELUnregistrationDeployment resource will hang for a long time (4 hours?). To get around this, one needs to manually signal the resource by running

heat resource-signal <SUBSTACK> RHELUnregistrationDeployment

Obviously, manually freeing up the entitlement in the Satellite afterwards is also recommended.

Section https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention starts with: "During the ControllerNodesPostDeployment stage, the Overcloud stack update halts with an UPDATE_FAILED error at ControllerLoadBalancerDeployment_Step1." What it doesn't say is that it will take ControllerLoadBalancerDeployment_Step1 more than an hour to time out (due to puppet trying for an hour to create a corosync cluster before finally giving up?). In my opinion, we should document that one needs to have patience here, or possibly document how we could make puppet fail a bit faster...

Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention
Step 5, "Delete the failed node from the RabbitMQ cluster": I believe this step is redundant. When I reached this step, the node had already been removed from the RabbitMQ cluster. Possibly by pacemaker...
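For reference, this is roughly how I checked it, on one of the surviving controllers (the node name is from my lab, adjust as needed):

sudo rabbitmqctl cluster_status
sudo rabbitmqctl forget_cluster_node rabbit@overcloud-controller-1

In my case the first command already listed only the two surviving controllers, so the forget_cluster_node wasn't necessary.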
Continuing with https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention
Step 15, "Wait until the Keystone service starts on all nodes": maybe I was doing something wrong, but haproxy was failing on me, complaining about horizon. And since keystone depends on haproxy, this step wouldn't work. Eventually I admitted defeat and just commented out the horizon section from the haproxy config. This allowed me to proceed. And in the subsequent Overcloud refresh, horizon got put back into the haproxy config, so there was no need to uncomment it again.

Finally, we should also add a section saying that one needs to add a fencing resource for the new controller node, and delete the resource for the failed node. I've sketched an example below.

Version-Release number of selected component (if applicable):
OSP8
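To expand on the fencing point above, what I have in mind is something along these lines (the resource names and IPMI details are just examples from my lab):

pcs stonith delete my-ipmilan-for-controller-1
pcs stonith create my-ipmilan-for-controller-3 fence_ipmilan pcmk_host_list=overcloud-controller-3 ipaddr=192.168.0.103 login=admin passwd=SomePassword lanplus=1 op monitor interval=60s

i.e. drop the stonith resource for the failed node and create one for its replacement.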
(In reply to David Juran from comment #0)
> Description of problem:
>
> I've been running through the procedure for replacing a controller (OSP-8)
> and I have a few corrections/suggestions. Some more important, some less.
> But I'm bundling them all up here anyway...
>
> First off, why is the procedure for replacing a controller hidden in
> chapter 9, which deals with scaling the Overcloud? When I have a failed
> controller, scaling is not the first thing that comes to mind...

You've got a good point here, though I'm not sure what to call this section. "Scaling and Replacing Nodes" maybe? I'm open to suggestions.

> In https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks
> Step 2: There are still references to ironic-discoverd, which was replaced
> by ironic-inspector after OSP 7...

Correcting for OSP 8, 9, and 10.

> Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks
> Step 9, "Make sure all Undercloud services are running": from the command
> line given, it's not very clear which services belong to the Undercloud and
> which ones _should_ be running. If this step is relevant, I would suggest
> we document the expected result better.

Sure, I'll replace this command with the following:

$ sudo systemctl list-units httpd* mariadb* neutron* openstack* openvswitch* rabbitmq*

That should cover all the main components AFAIK.

> In https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Node_Replacement
> a bit into this section, the following command line is given:
>
> sudo sed -i "s/resource\.0/resource.1/g" ~/templates/my-overcloud/overcloud.yaml
>
> The "sudo" is not required (and also confusing), since the operation should
> be done on a copy of the overcloud.yaml. I therefore suggest we drop the
> sudo.

Very true. Dropping it.

> Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Node_Replacement
> there is a sentence, in a box labeled "Important", that reads:
> "However, note that the -e ~/templates/remove-controller.yaml is only
> required once in this instance."
> I haven't a clue what this instruction/warning refers to. In my opinion, we
> should clarify...

I've clarified this. Basically, you only include the remove-controller.yaml file once, because you're only running the controller removal process on that particular "openstack overcloud deploy" run and not on any subsequent runs.

> Instructions for dealing with nodes that are registered to a Red Hat
> Satellite are missing. If a failed node is registered but is no longer
> operational, the RHELUnregistrationDeployment resource will hang for a long
> time (4 hours?). To get around this, one needs to manually signal the
> resource by running
>
> heat resource-signal <SUBSTACK> RHELUnregistrationDeployment
>
> Obviously, manually freeing up the entitlement in the Satellite afterwards
> is also recommended.

I can add a note about this. Do you have an idea of the substack name pattern that heat uses with this resource?
> Section https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention starts with:
> "During the ControllerNodesPostDeployment stage, the Overcloud stack update
> halts with an UPDATE_FAILED error at ControllerLoadBalancerDeployment_Step1."
> What it doesn't say is that it will take ControllerLoadBalancerDeployment_Step1
> more than an hour to time out (due to puppet trying for an hour to create a
> corosync cluster before finally giving up?). In my opinion, we should
> document that one needs to have patience here, or possibly document how we
> could make puppet fail a bit faster...

Yeah, it might be an idea to add a timeout. The only question is what the timeout should be. Maybe 2 hrs instead of 4?

> Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention
> Step 5, "Delete the failed node from the RabbitMQ cluster": I believe this
> step is redundant. When I reached this step, the node had already been
> removed from the RabbitMQ cluster. Possibly by pacemaker...

I don't think this is redundant. I've had to remove the node from the RabbitMQ cluster in my tests. Likewise, OSP QA has verified this step too.

> Continuing with https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention
> Step 15, "Wait until the Keystone service starts on all nodes": maybe I was
> doing something wrong, but haproxy was failing on me, complaining about
> horizon. And since keystone depends on haproxy, this step wouldn't work.
> Eventually I admitted defeat and just commented out the horizon section from
> the haproxy config. This allowed me to proceed. And in the subsequent
> Overcloud refresh, horizon got put back into the haproxy config, so there
> was no need to uncomment it again.

I'm not able to reproduce this, and I'm also not sure what the documentation requirement is here. It seems like a workaround for a problem you've faced with your specific environment. It might be worth opening an engineering BZ for this one.

> Finally, we should also add a section saying that one needs to add a fencing
> resource for the new controller node, and delete the resource for the failed
> node.

True. Adding this note.

> Version-Release number of selected component (if applicable):
> OSP8
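By the way, to make the remove-controller.yaml point concrete, the clarified note will show something along these lines (all other deploy options elided):

$ openstack overcloud deploy --templates -e ~/templates/remove-controller.yaml [OTHER OPTIONS]

for the run that performs the replacement, and then simply

$ openstack overcloud deploy --templates [OTHER OPTIONS]

for any subsequent update of the Overcloud.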
> You've got a good point here, though I'm not sure what to call this section.
> "Scaling and Replacing Nodes" maybe? I'm open to suggestions.

I like that idea, go for it (-:

>> Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Preliminary_Checks
>> Step 9, "Make sure all Undercloud services are running": from the command
>> line given, it's not very clear which services belong to the Undercloud and
>> which ones _should_ be running. If this step is relevant, I would suggest
>> we document the expected result better.
>
> Sure, I'll replace this command with the following:
>
> $ sudo systemctl list-units httpd* mariadb* neutron* openstack* openvswitch* rabbitmq*
>
> That should cover all the main components AFAIK.

I think that would be good. But don't forget to back-slash those asterisks, or they could get globbed by the shell into anything matching in the CWD...

>> Also in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Node_Replacement
>> there is a sentence, in a box labeled "Important", that reads:
>> "However, note that the -e ~/templates/remove-controller.yaml is only
>> required once in this instance."
>> I haven't a clue what this instruction/warning refers to. In my opinion, we
>> should clarify...
>
> I've clarified this. Basically, you only include the remove-controller.yaml
> file once, because you're only running the controller removal process on that
> particular "openstack overcloud deploy" run and not on any subsequent runs.

Ah, that makes sense. That explains it much better!

>> Instructions for dealing with nodes that are registered to a Red Hat
>> Satellite are missing. If a failed node is registered but is no longer
>> operational, the RHELUnregistrationDeployment resource will hang for a long
>> time (4 hours?). To get around this, one needs to manually signal the
>> resource by running
>>
>> heat resource-signal <SUBSTACK> RHELUnregistrationDeployment
>>
>> Obviously, manually freeing up the entitlement in the Satellite afterwards
>> is also recommended.
>
> I can add a note about this. Do you have an idea of the substack name pattern
> that heat uses with this resource?

It's a random string )-: But you can find it by running

heat resource-list -n 5 overcloud

The last column will tell you the name of the substack.

>> Section https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention starts with:
>> "During the ControllerNodesPostDeployment stage, the Overcloud stack update
>> halts with an UPDATE_FAILED error at ControllerLoadBalancerDeployment_Step1."
>> What it doesn't say is that it will take ControllerLoadBalancerDeployment_Step1
>> more than an hour to time out (due to puppet trying for an hour to create a
>> corosync cluster before finally giving up?). In my opinion, we should
>> document that one needs to have patience here, or possibly document how we
>> could make puppet fail a bit faster...
>
> Yeah, it might be an idea to add a timeout. The only question is what the
> timeout should be. Maybe 2 hrs instead of 4?

So it's not the heat update that needs to time out (no idea what would happen if it did), but rather puppet that needs to give up trying to contact the cluster. If I'm not mistaken, it's the call to pacemaker::corosync we do.
And that is using the default time-out, which I think is an hour. So I'm not sure if there is a way to gracefully interrupt this. Killing puppet maybe?? Or signalling the resource? I didn't try. But if nothing else, we should tell the operator not to give up hope and start trying silly things, like I did... (-:

>> Continuing with https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention
>> Step 15, "Wait until the Keystone service starts on all nodes": maybe I was
>> doing something wrong, but haproxy was failing on me, complaining about
>> horizon. And since keystone depends on haproxy, this step wouldn't work.
>> Eventually I admitted defeat and just commented out the horizon section from
>> the haproxy config. This allowed me to proceed. And in the subsequent
>> Overcloud refresh, horizon got put back into the haproxy config, so there
>> was no need to uncomment it again.
>
> I'm not able to reproduce this, and I'm also not sure what the documentation
> requirement is here. It seems like a workaround for a problem you've faced
> with your specific environment. It might be worth opening an engineering BZ
> for this one.

OK. If I can reproduce this issue again, I'll file a separate BZ.
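For completeness, the horizon workaround I mentioned was roughly the following (just what unblocked my lab, not something I'm suggesting we document): on each controller I edited /etc/haproxy/haproxy.cfg, commented out the horizon listen section, and then, if I remember correctly, had pacemaker pick the change up with something like

sudo pcs resource cleanup haproxy-clone

After the subsequent Overcloud stack update the horizon section was regenerated, so nothing needed to be reverted by hand.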
Hi David,

I think I've got the main facets of your request integrated with the OSP 8 guide. Have a look and let me know if there's anything I've missed:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes

If everything is okay, I'll merge these changes into the OSP 9 and the eventual OSP 10 guide.
One small thing I found: I'd like to add that at https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/single/director-installation-and-usage/#sect-Replacing_Controller_Nodes-Manual_Intervention one needs to have patience. A lot of it, since it takes around 90 minutes for things to time out. Even better would be if we found a procedure for speeding up the time-out.

Otherwise, I think the procedure looks good.
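In the meantime, the operator can at least watch what's going on instead of wondering whether anything is happening, e.g. with something like

heat resource-list -n 5 overcloud | grep -vi complete

on the Undercloud, or by following os-collect-config on the replacement controller:

sudo journalctl -f -u os-collect-config

(These are just suggestions from my own runs; the exact stack and resource names will of course differ per deployment.)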
This BZ seems to be resolved. Closing it, but feel free to reopen if further updates are required.