| Summary: | rhel-osp-director: Replacing Controller Node on 8.0: all pcs resources appear unmanaged, unable to start openstack-keystone |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | documentation |
| Version: | unspecified |
| Status: | CLOSED INSUFFICIENT_DATA |
| Severity: | high |
| Priority: | high |
| Reporter: | Alexander Chuzhoy <sasha> |
| Assignee: | RHOS Documentation Team <rhos-docs> |
| QA Contact: | RHOS Documentation Team <rhos-docs> |
| CC: | dbecker, mandreou, mburns, michele, morazi, rhel-osp-director-maint, srevivo |
| Keywords: | Documentation |
| Target Milestone: | ga |
| Target Release: | 8.0 (Liberty) |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Regression: | --- |
| Last Closed: | 2018-07-20 07:53:57 UTC |
**Description** (Alexander Chuzhoy, 2016-04-15 18:13:45 UTC)
**Comment 3** (marios)

Hey Sasha, I am a little confused about what happened first - I think the stack update failed and then you couldn't get the cluster back up. What/how did the stack fail on? The description says 'resources unmanaged', but then you have `pcs resource enable` for keystone-clone specifically for controller-3. Was it running on the others? Was keystone failing to be enabled at all? I am still waiting on the logs to download from ctrl-0, but some of the info below may or may not be of interest.

The fact that the cluster was unmanaged tells me it failed somewhere, either on application of the controller_pacemaker.pp puppet manifest (applying config and initialising the things, and pacemaker constraints):
https://github.com/openstack/tripleo-heat-templates/blob/4afed8617e56b1d9648955b971d5c2e4cd3cd7f8/puppet/manifests/overcloud_controller_pacemaker.pp

The sequence is *pre_puppet_pacemaker.yaml*:
https://github.com/openstack/tripleo-heat-templates/blob/4afed8617e56b1d9648955b971d5c2e4cd3cd7f8/extraconfig/tasks/pre_puppet_pacemaker.yaml
which puts the cluster into maintenance mode (with https://github.com/openstack/tripleo-heat-templates/blob/4afed8617e56b1d9648955b971d5c2e4cd3cd7f8/extraconfig/tasks/pacemaker_maintenance_mode.sh). Then the *puppet_pacemaker.pp* manifest is applied. Once that completes we get *post_puppet_pacemaker*:
https://github.com/openstack/tripleo-heat-templates/blob/4afed8617e56b1d9648955b971d5c2e4cd3cd7f8/extraconfig/tasks/post_puppet_pacemaker.yaml
which takes the cluster out of maintenance mode (so we didn't get here in this environment) and optionally disables/enables and restarts things, like:
https://github.com/openstack/tripleo-heat-templates/blob/4afed8617e56b1d9648955b971d5c2e4cd3cd7f8/extraconfig/tasks/pacemaker_resource_restart.sh

I am still waiting on the controller logs - I am grabbing ctrl-0 for now, but it is a big file, so I'll have a closer look once they land. The fact that you couldn't restart one of the resources could be because another resource that it depends on is still disabled/stopped. Really, though, I'd like to understand what went wrong in the first place for the stack update to fail.

thanks, marios

**Comment** (marios)

(In reply to marios from comment #3)
> The fact the cluster was unmanaged tells me it failed somewhere either on
> application of the controller_pacemaker.pp puppet manifest (applying config
> and initialising the things, and pacemaker constraints)

sorry, the 'either' there was going to be 'or on the restart of the services', but since the cluster is still in maintenance mode when the stack update fails, we didn't get there. So it must be at some point during the application of the config to the given controller.
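The maintenance-mode sequence described above suggests a quick first check when every resource shows as unmanaged. The following is a minimal sketch, not part of the original report; the commands are standard pcs/pacemaker CLI, and the keystone clone resource name is assumed from this environment:

```bash
# Run on any controller. If maintenance-mode is true, pcs reports every
# resource as unmanaged -- the symptom described in this bug.
sudo pcs property show maintenance-mode

# Overall cluster state, including which clones are stopped or failed.
sudo pcs status

# Clearing the property is what post_puppet_pacemaker would have done
# had the stack update reached that step:
sudo pcs property set maintenance-mode=false

# A resource that refuses to start may be waiting on a dependency;
# the ordering/colocation constraints show what must come up first.
sudo pcs constraint show

# Then retry the specific resource, e.g. the keystone clone:
sudo pcs resource enable openstack-keystone-clone
```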
Created attachment 1148164 [details]
corosync log from control0
Created attachment 1148165 [details]
journalctl from ctrl 0
Created attachment 1148166 [details]
control 0 neutron-server log
Created attachment 1148167 [details]
keystone log from control 0
Created attachment 1148169 [details]
corosync log from control0 in tgz instead of raw
Comment on attachment 1148164 [details] (corosync log from control0): replaced with a tgz version of this file since it is 16 MB: https://bugzilla.redhat.com/attachment.cgi?id=1148169

**Comment** (marios)

Hey Sasha, I poked a bit at the logs. I see there are multiple errors (keystone, neutron-server, journalctl, corosync). I also took a closer look at the docs you pointed at in "8.10.4. Replacing Controller Nodes" at https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/810-scaling-the-overcloud - there is quite a lot of manual intervention required here, and there are multiple stack updates going on, it seems. If I've understood correctly from the description, this environment has had at least 2 stack updates (and I assume you've done all the required manual steps in between). The first overcloud deploy where you include the 'remove-node.yaml' is expected to fail; I wonder, however, whether the cluster was in maintenance mode when/after that happened and before you re-ran the next update. It could be we need to update the docs for this process.

Are you sure you've removed the node correctly? From the corosync logs I see reference to 4 controllers:

```
Apr 15 18:06:16 [38453] overcloud-controller-0.localdomain attrd: info: crm_dump_peer_hash: crm_find_peer: Node 0/overcloud-controller-3 = 0xa4cbc0 - c81f76a4-9a3a-4125-9631-80714c271eec
Apr 15 18:06:16 [38453] overcloud-controller-0.localdomain attrd: info: crm_dump_peer_hash: crm_find_peer: Node 3/overcloud-controller-2 = 0xa1e140 - fb395c66-d723-4109-a9c2-f8d0dbc63f8d
Apr 15 18:06:16 [38453] overcloud-controller-0.localdomain attrd: info: crm_dump_peer_hash: crm_find_peer: Node 1/overcloud-controller-0 = 0xa133e0 - d440d218-d2df-4755-a908-83045c882417
Apr 15 18:06:16 [38453] overcloud-controller-0.localdomain attrd: info: crm_dump_peer_hash: crm_find_peer: Node 2/overcloud-controller-1 = 0xa25d10 - e775a918-2c31-4423-9876-90b9a187088c
Apr 15 18:06:16 [38453] overcloud-controller-0.localdomain attrd: error: crm_abort: crm_find_peer: Forked child 2222 to record non-fatal assert at membership.c:435 : member weirdness
```

and a lot of things failing:

```
Apr 15 18:09:41 [38455] overcloud-controller-0.localdomain crmd: warning: status_from_rc: Action 70 (galera_monitor_30000) on overcloud-controller-3 failed (target: 0 vs. rc: 7): Error
Apr 15 18:09:45 [38455] overcloud-controller-0.localdomain crmd: warning: status_from_rc: Action 239 (httpd_start_0) on overcloud-controller-3 failed (target: 0 vs. rc: 7): Error
Apr 15 16:20:07 [38455] overcloud-controller-0.localdomain crmd: warning: status_from_rc: Action 8 (haproxy_monitor_0) on overcloud-controller-2 failed (target: 7 vs. rc: 0): Error
Apr 15 16:20:07 [38455] overcloud-controller-0.localdomain crmd: warning: status_from_rc: Action 14 (mongod_monitor_0) on overcloud-controller-2 failed (target: 7 vs. rc: 0): Error
```

From the keystone log:

```
2016-04-14 22:39:14.811 38999 WARNING keystone.common.wsgi [req-9bc66d4b-dd58-47fd-9c4e-700241de5e67 - - - - -] Authorization failed. The request you have made requires authentication. from 10.19.94.12
2016-04-15 14:45:29.523 38992 ERROR oslo.messaging._drivers.impl_rabbit [req-2f2141b0-71e4-4c7b-b970-fcb37e6f7c37 - - - - -] AMQP server on 10.19.94.12:5672 is unreachable: timed out. Trying again in 1 seconds.
```

The neutron-server log has:

```
2016-04-15 17:54:17.051 37082 WARNING keystonemiddleware.auth_token [-] Identity response: {"error": {"message": "Could not find token: 616e7396cef642a1808d1fac4ec49ce0", "code": 404, "title": "Not Found"}}
```

and from journalctl:

```
Apr 14 22:33:40 overcloud-controller-0.localdomain neutron-openvswitch-agent[39131]: 2016-04-14 22:33:40.682 39131 ERROR neutron.agent.ovsdb.impl_vsctl RuntimeError:
Apr 15 13:54:11 overcloud-controller-0.localdomain neutron-server[36740]: 2016-04-15 13:54:11.098 37102 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: timed out
Apr 15 13:54:11 overcloud-controller-0.localdomain proxy-server[41541]: STDERR: ERROR:root:Timeout talking to memcached: 10.19.94.12:11211 (txn: txde9eee59de6441239ac09-005710f281)
Apr 15 13:54:11 overcloud-controller-0.localdomain proxy-server[41541]: ERROR with Account server 192.168.200.13:6002/d1 re: Trying to HEAD /v1/AUTH_3ccd8431cabf467fb3db44bdb22ec7dc: ConnectionTimeout (0.5s) (txn: txde9eee59de6441239ac09-005710f281)
Apr 15 13:54:21 overcloud-controller-0.localdomain neutron-server[36740]: 2016-04-15 13:54:21.735 37110 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: timed out
```

I've attached these logs for convenience from the controller-0 sos report. I'll try and find out more once I get access to the environment, but I am concerned about whether the node was removed correctly and whether all the manual steps were followed and/or are documented correctly. I will try and understand more once I can get onto the environment later.

thanks, marios
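Checking whether the old node was really removed from the cluster is straightforward; the following is a sketch, not part of the original report, using standard pacemaker/corosync tooling with the node name assumed from this environment:

```bash
# List the nodes corosync knows about; a replaced controller should
# no longer appear here.
sudo pcs status corosync

# Pacemaker's membership cache can lag behind corosync; crm_node
# shows what the cluster layer still believes is a member.
sudo crm_node -l

# If the old controller (overcloud-controller-1 in this report) still
# lingers in the configuration, removing it explicitly clears the kind
# of stale membership behind the "member weirdness" assert above:
sudo pcs cluster node remove overcloud-controller-1
```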
**Comment** (Alexander Chuzhoy)

Re-built the setup from scratch and retried the procedure: same result.

Nova list before replacement:

```
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 164f995b-589f-445b-a19b-4c4e9876cd9e | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| 1df8cc6d-a475-4858-9f3f-ceb08a6c32d6 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| 13d56ecf-b1ca-4e43-a1ed-c155e9f21226 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
| 97d26fd5-0e15-4571-9bbc-33f89f6d5872 | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
| b32211f9-39db-49dc-95e3-0e6a969431e2 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
| cf9505cf-af7e-4186-ae52-02ab7e0a6104 | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
| ff128ea2-a904-4b5b-9674-6e4f60a3415a | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
```

Nova list after replacement (that fails):

```
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 164f995b-589f-445b-a19b-4c4e9876cd9e | overcloud-cephstorage-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| 1df8cc6d-a475-4858-9f3f-ceb08a6c32d6 | overcloud-cephstorage-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| 13d56ecf-b1ca-4e43-a1ed-c155e9f21226 | overcloud-compute-0     | ACTIVE | -          | Running     | ctlplane=192.168.0.11 |
| 97d26fd5-0e15-4571-9bbc-33f89f6d5872 | overcloud-compute-1     | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
| b32211f9-39db-49dc-95e3-0e6a969431e2 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.0.13 |
| ff128ea2-a904-4b5b-9674-6e4f60a3415a | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.0.12 |
| 2b5d763b-b9bb-4f02-8925-78057fc83d82 | overcloud-controller-3  | ACTIVE | -          | Running     | ctlplane=192.168.0.14 |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
```

The logs were taken and are available on the controllers.

Deployment command:

```bash
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
  -e $THT/environments/storage-environment.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e /home/stack/network-environment.yaml \
  --control-scale 3 \
  --compute-scale 2 \
  --ceph-storage-scale 2 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --neutron-disable-tunneling \
  --neutron-network-type vlan \
  --neutron-network-vlan-ranges tenantvlan:18:43 \
  --neutron-bridge-mappings datacentre:br-ex,tenantvlan:br-nic4 \
  --rhel-reg --reg-method satellite --reg-sat-url https://openstack-006.usersys.redhat.com --reg-org 1 --reg-activation-key 1-rhos80 --reg-force \
  --ntp-server clock.redhat.com \
  --timeout 180
```

The first replacement command (which is supposed to fail according to the guide) is the same, plus the remove-node environment file:

```bash
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates $THT \
  -e $THT/environments/storage-environment.yaml \
  -e $THT/environments/network-isolation.yaml \
  -e /home/stack/network-environment.yaml \
  --control-scale 3 \
  --compute-scale 2 \
  --ceph-storage-scale 2 \
  --compute-flavor compute --control-flavor control --ceph-storage-flavor ceph-storage \
  --neutron-disable-tunneling \
  --neutron-network-type vlan \
  --neutron-network-vlan-ranges tenantvlan:18:43 \
  --neutron-bridge-mappings datacentre:br-ex,tenantvlan:br-nic4 \
  --rhel-reg --reg-method satellite --reg-sat-url https://openstack-006.usersys.redhat.com --reg-org 1 --reg-activation-key 1-rhos80 --reg-force \
  --ntp-server clock.redhat.com \
  -e /home/stack/remove-node.yaml \
  --timeout 180
```

Then I re-ran the original deployment command - it also failed, unable to start the openstack-keystone resource.

**Comment** (marios)

Hey Sasha, I tried and failed to revive your environment. I eventually discovered that httpd flat out can't start on control 3 because it isn't configured at all from our side. At some point one of the earlier stack updates must have failed (probably the original one that added this controller, since that's when it should have gotten this config), leaving it unconfigured. Since httpd can't start, nothing else will. I will try and reproduce on my upgraded 8 environment tomorrow, but at this point you should try and make sure controller 3 is configured correctly. I'd try and get the rest of the cluster into shape (probably all the things will be down on control3, but should be ok on control2 and 0) and re-run the deploy so that controller 3 gets configured correctly. For example, to check the httpd conf on control 3, compare /etc/httpd/conf/httpd.conf on ctrl 3 vs 0, or /etc/httpd/conf.d/15-default.conf (currently missing on ctrl3); see the sketch after this comment.

hope it helps for now, thanks, marios
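A minimal way to run the comparison suggested above; this is not from the original report: the ctlplane IPs are taken from the nova list earlier, and heat-admin is the standard user director creates on overcloud nodes:

```bash
# From the undercloud, pull the httpd config from a healthy controller
# (ctrl0) and from the replacement node (ctrl3), then diff them.
ssh heat-admin@192.168.0.13 'cat /etc/httpd/conf/httpd.conf' > httpd.conf.ctrl0
ssh heat-admin@192.168.0.14 'cat /etc/httpd/conf/httpd.conf' > httpd.conf.ctrl3
diff -u httpd.conf.ctrl0 httpd.conf.ctrl3

# The vhost config director generates should also exist on the new
# node; in this report it was missing entirely on ctrl3.
ssh heat-admin@192.168.0.14 'ls -l /etc/httpd/conf.d/15-default.conf'
```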
**Comment** (Alexander Chuzhoy)

Thanks Marios. So I redeployed the overcloud and attempted to replace the controller again on the same setup. This time I copied /etc/httpd from another controller and changed the IPs in the conf file to reflect the new controller. The same issue reproduced. Feel free to poke around.

**Comment 16** (marios)

Hi Sasha, I had a go at the process in https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/director-installation-and-usage/94-replacing-controller-nodes on an environment I had just upgraded to 8. I think those docs need some attention. I did the first stack update to remove controller-1 and add controller-3. The update failed as documented (I think we should be clear on exactly what this should fail on, like a trace for example, to make sure there isn't some other issue), and controller-1 was gone, replaced by controller-3 in nova. Following the instructions, I couldn't add the new node to corosync or get corosync to start on the new node. I first had to manually add an /etc/hosts entry for the new controller and then also add the new node to corosync.conf. Even then I couldn't add the new node to the cluster because it wasn't authenticated (these manual steps are sketched at the end of this report). I don't think it's worth trying to understand more about why the stack update failed here until we have a reliable process documented - albeit a manual one. @sasha did you also have to do those things in your environment - or did corosync just start without issue for you?

thanks, marios

**Comment** (marios)

fyi/fwiw, today I sanity-checked the "replace a controller" process against a fresh 7.3 deployment and it got further. The initial deployment to remove a controller ended with:

```
| resource_status_reason | resources.ControllerNodesPostDeployment: resources.ControllerLoadBalancerDeployment_Step1: Error: resources[3]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6 |
```

and on controller 0:

```
4366:Apr 22 12:15:15 overcloud-controller-3.localdomain os-collect-config[5396]: Notice: Finished catalog run in 3782.73 seconds
deploy_stderr:
Device "br_ex" does not exist.
Device "ovs_system" does not exist.
Error: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /usr/sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]
Warning: /Stage[main]/Pacemaker::Corosync/Notify[pacemaker settled]: Skipping because of failed dependencies
Warning: /Stage[main]/Pacemaker::Stonith/Exec[Disable STONITH]: Skipping because of failed dependencies
deploy_status_code: 6
```

The /etc/hosts *was* updated on all three nodes (including the new one), and the addition of the new node to the cluster was more straightforward and as documented. So it seems to be the 8 envs in particular for which the docs need attention (not saying the 7.3 replacement worked 100%, but it was just not as bad for 7.3 envs as my experience in comment #16).

thanks, marios
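The manual intervention described in comment 16 (hosts entry, corosync membership, node authentication) would look roughly like the following. This is a sketch only, not from the original report: the node name and IP are assumed from this environment, and the hacluster password is environment-specific.

```bash
# On each existing controller: make the new node resolvable, since the
# failed stack update did not add it to /etc/hosts. (IP assumed from
# the ctlplane address in the nova list above.)
echo "192.168.0.14 overcloud-controller-3.localdomain overcloud-controller-3" | sudo tee -a /etc/hosts

# On an existing cluster member: authenticate against the new node,
# then add it. pcs cluster node add also pushes the updated
# corosync.conf to the existing members.
sudo pcs cluster auth overcloud-controller-3 -u hacluster -p <hacluster-password>
sudo pcs cluster node add overcloud-controller-3

# On the new node: start cluster services and confirm it joins.
sudo pcs cluster start
sudo pcs status
```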