Description of problem:

Attempting to add a compute node to my OSP8 GA environment keeps failing with a generic error:

ERROR: Authentication failed: Authentication required

Command I'm running:

$ openstack overcloud deploy --compute-scale 4 --templates ~/templates/my-overcloud/ -e ~/templates/my-overcloud/environments/network-isolation.yaml -e ~/templates/network-environment.yaml -e ~/templates/my-overcloud/environments/storage-environment.yaml

What I know:

- nova list DOES list the new compute node I'm attempting to add
- source overcloudrc ; nova service-list DOES NOT list the compute node I'm trying to add
- heat-engine.log and heat-api.log do not report any errors on my undercloud node (cmd: journalctl -u openstack-heat-engine.service | grep -e ERROR -e TRACE -e FAIL)
- os-collect-config does not report any errors on the node being added nor on my overcloud nodes
- heat resource-list reports the problem happening on the Controller nodes. I checked os-collect-config on the controller nodes and no errors are reported there either (cmd: journalctl -u os-collect-config | grep -e TRACE -e FAIL -e ERROR)
- Last thing seen on the screen prior to the authentication error:

  UPDATE_COMPLETE Stack UPDATE completed successfully
  2016-07-07 15:07:56 [overcloud-Controller-7yvdmu2gcnw5-0-zak5poab6id7]: UPDATE_COMPLETE Stack UPDATE completed successfully
  2016-07-07 15:07:58 [0]: UPDATE_COMPLETE state changed

When I check the heat resource list via cmd: heat resource-list -n5 overcloud | grep -v COMPLETE (http://pastebin.test.redhat.com/390430), I do see some Controller and SoftwareDeployment errors with a status of UPDATE_FAILED. When I attempt to look at those update failures using heat deployment-show, the status shows IN_PROGRESS, which contradicts what heat resource-list displays (http://pastebin.test.redhat.com/390431).

At this point, I don't know what else to check. Anyone have any ideas?
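(For reference, the drill-down used above looks like this; the deployment ID is a placeholder, taken from the physical_resource_id column of a failed SoftwareDeployment resource in the resource list:)

$ heat resource-list -n 5 overcloud | grep -v COMPLETE
$ heat deployment-show <deployment-id>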
It sounds like the operation took over 4 hours, so you saw the authentication error because the token expired. The UPDATE_FAILED deployment resources timed out (the stack has a 4-hour timeout) and the underlying deployments remain IN_PROGRESS (this is a known issue with an upstream bug raised). I suspect one or more of the overcloud nodes has os-collect-config stuck on a previous run (or has stopped polling for some other reason).

Could you please do the following (see the command sketch after this list):

- identify the servers which have UPDATE_FAILED deployments
- ssh in and collect the following information for this bug:
  - systemctl status os-collect-config
  - pstree of the os-collect-config pid
- if os-collect-config is not running, start it
- if it appears to be stalled in a child process, kill the *os-refresh-config* process, then observe whether os-collect-config starts polling again by watching its journalctl logging
- once os-collect-config appears to be behaving normally on all affected servers, rerun the same overcloud deploy command to finish the scaling operation

We're trying to collect enough data to diagnose bug #1306140, so your help would be appreciated.
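A minimal sketch of those steps, assuming <failed-node-ip> stands in for whichever servers show UPDATE_FAILED deployments:

$ ssh heat-admin@<failed-node-ip>
$ sudo systemctl status os-collect-config
$ sudo pstree -ap <PID>                       # PID from the "Main PID" line of the status output
# if the service is not running:
$ sudo systemctl start os-collect-config
# if it is stalled in a child process:
$ sudo pkill -f os-refresh-config
$ sudo journalctl -u os-collect-config -f     # confirm it resumes polling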
Decided just to go ahead and provide you all the info from the entire environment, as I'm seeing warnings and what seem like issues on all the nodes when I run systemctl status os-collect-config. It can be found here: http://pastebin.test.redhat.com/390534

When I run pstree on the PID of os-collect-config on all my nodes, I get:

# pstree <PID>
os-collect-conf

I gathered the PID from the systemctl status os-collect-config command. If there are particular options you want me to run pstree with, let me know.

Could you clarify this step:
"if it appears to be stalled in a child process, kill the *os-refresh-config* process, then observe whether os-collect-config starts polling again by watching its journalctl logging"

I don't see an os-refresh-config process when I do "ps -ef | grep os-refresh-config" on any of my nodes.

My guess from all this is that I probably just need to restart os-collect-config and then attempt the deploy again. But before I do that, I want to capture everything you need to assist in your diagnosis of bug #1306140.
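(A note on the pstree output: plain pstree prints the kernel's comm name, which is capped at 15 characters, hence "os-collect-conf". An invocation that also shows full command lines and the PIDs of any children would be:)

# pstree -ap <PID>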
Decided to go ahead and essentially reboot each system one by one. Basically the biggest issue I'm seeing is on controller-0. systemctl status os-collect-config for each node is here: http://pastebin.test.redhat.com/390767
Interesting, it looks like 7 of the 9 nodes stopped polling at around 2016-07-05 13:47. The only line which indicated an issue was line 78:

Jul 05 13:47:24 overcloud-compute-0.localdomain os-collect-config[6017]: 2016-07-05 13:47:24.548 6017 WARNING os_collect_config.ec2 [-] ('Connection aborted.', BadStatus...("''",))

So it could be that there was a temporary connectivity issue to the nova metadata server which os-collect-config doesn't recover from. I'll attempt to replicate this in a test case. In the meantime you can continue to re-attempt the overcloud deploy now that the nodes have been rebooted.
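(One way to spot the same symptom on a given node; the grep pattern is just a guess at the relevant message:)

$ sudo journalctl -u os-collect-config -n 3 --no-pager              # timestamp of the most recent activity
$ sudo journalctl -u os-collect-config | grep "Connection aborted" | tail -n 5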
FYI https://bugzilla.redhat.com/show_bug.cgi?id=1306140#c16
Steve,

This still fails to add a compute node, and I'm pretty sure it's because my controller0 is showing "No local metadata found". All other nodes seem to be OK. Any ideas how to fix this os-collect-config issue on this controller?
Can you please attach the journalctl log for os-collect-config for controller0?
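(If it helps, one way to capture and package that log; the file names are illustrative:)

$ sudo journalctl -u os-collect-config --no-pager > /tmp/controller0-os-collect-config.log
$ tar czf controller0-os-collect-config.tar.gz -C /tmp controller0-os-collect-config.log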
Created attachment 1178562 [details] Controller0 os-collect-config tar
Steve,

Anything in the logs that you see that might fix the issue? Also, if there is no easy resolution, I'm thinking of using "nova rebuild" to recreate the controller node. I have some steps on how to do this, of course not tested, but I'm not sure I have many other options at this point.
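(For the record, the untested rebuild idea would be roughly the following; the image name is a placeholder for whatever image the node was originally deployed with:)

$ source ~/stackrc
$ nova rebuild overcloud-controller-0 overcloud-full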
The last 100 lines are all that is needed. Assuming Jul 11 22:02:44 is close to when the log was captured, os-collect-config is still polling fine.

If Jul 11 22:02:44 is old, then restart os-collect-config, confirm in journalctl that it is logging something every 30 seconds, then re-attempt the deploy. If the deploy fails or times out, please attach the os-collect-config log for the failing node, and if the deployment resource went to UPDATE_FAILED for reasons other than a timeout, please attach the heat deployment-output-show for deploy_stderr and deploy_stdout.

The timezones aren't good for me to help you live; maybe you could ask shardy or bnemec for a live diagnosis session.
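A sketch of that sequence; the deployment ID is a placeholder taken from the failed resource's physical_resource_id:

$ sudo systemctl restart os-collect-config
$ sudo journalctl -u os-collect-config -f     # expect an entry roughly every 30 seconds

# then, from the undercloud, if a deployment fails for a non-timeout reason:
$ heat deployment-output-show <deployment-id> deploy_stdout
$ heat deployment-output-show <deployment-id> deploy_stderr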
Doing a nova rebuild seems excessive; I'd recommend diagnosing the specific problem and fixing just that.
Steve,

So after we spoke yesterday, with the help of bandini and kgaillot, I was able to get to the bottom of my galera issues. It turns out that, for some reason, I couldn't get galera to properly come up because pacemaker was set into maintenance mode. No idea how this happened. However, this was confirmed with the cmd:

pcs property

This essentially showed that all my services were unmanaged. Once I changed maintenance-mode from true to false using the cmd:

pcs property set maintenance-mode=false

galera properly started up and I was able to get all the other services online.

Now with that, my controller0 still shows:

Jul 13 20:48:50 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:48:50.748 26239 WARNING os_collect_config.local [-] No local metadata found ([...-data'])
Jul 13 20:48:50 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:48:50.749 26239 WARNING os_collect_config.zaqar [-] No auth_url configured.
Jul 13 20:49:22 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:22.793 26239 WARNING os-collect-config [-] Source [request] Unavailable.

After this I spoke to bnemec. He suggested taking a look at the undercloud heat-api and heat-api-cfn logs. In there I found:

ERROR oslo.messaging._drivers.impl_rabbit [req-62ef7688-6607-40ec-8d1d-85dca917e66d abf77f06859f46ef8cc17a8a5bf0adcd fc6ee92e077e4a32acc328af5a00e480] The broker has blocked the connection: low on disk
echo "ERROR failed to apply new pacemaker config"\n exit 1\n fi\n\n echo "Pacemaker running, stopping cluster node

He mentioned that since I had some low-disk-space issues, rabbitmq might have caused that error. Once I freed up some additional space, I restarted the services with openstack-service restart. I then ran the deploy command again to add a new compute node. It failed once more, but via heat resource-list I was able to gather that the reason it did not finish was that it was looking for repos in order to do the update. Once I logged into the new compute node I wanted to add and added the repos, I reran the deploy and it finally worked!

With that, controller0 still shows:

# systemctl status os-collect-config
● os-collect-config.service - Collect metadata and run hook commands.
   Loaded: loaded (/usr/lib/systemd/system/os-collect-config.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2016-07-13 19:25:43 UTC; 1h 24min ago
 Main PID: 26239 (os-collect-conf)
   CGroup: /system.slice/os-collect-config.service
           └─26239 /usr/bin/python2 /usr/bin/os-collect-config

Jul 13 20:48:50 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:48:50.748 26239 WARNING os_collect_config.local [-] No local metadata found ([...-data'])
Jul 13 20:48:50 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:48:50.749 26239 WARNING os_collect_config.zaqar [-] No auth_url configured.
Jul 13 20:49:22 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:22.793 26239 WARNING os-collect-config [-] Source [request] Unavailable.
Jul 13 20:49:22 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:22.797 26239 WARNING os_collect_config.local [-] /var/lib/os-collect-config...Skipping
Jul 13 20:49:22 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:22.797 26239 WARNING os_collect_config.local [-] No local metadata found ([...-data'])
Jul 13 20:49:22 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:22.798 26239 WARNING os_collect_config.zaqar [-] No auth_url configured.
Jul 13 20:49:54 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:54.616 26239 WARNING os-collect-config [-] Source [request] Unavailable.
Jul 13 20:49:54 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:54.620 26239 WARNING os_collect_config.local [-] /var/lib/os-collect-config...Skipping
Jul 13 20:49:54 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:54.620 26239 WARNING os_collect_config.local [-] No local metadata found ([...-data'])
Jul 13 20:49:54 overcloud-controller-0.localdomain os-collect-config[26239]: 2016-07-13 20:49:54.621 26239 WARNING os_collect_config.zaqar [-] No auth_url configured.
Hint: Some lines were ellipsized, use -l to show in full.

Not sure if I should worry about that at all, but pointing it out.

Also, a TIP: if you are adding a compute node (or any overcloud node, for that matter), the scale-out wants to update the packages first, so you need some sort of registration yaml file to be part of the deploy cmd, otherwise it will fail. If you don't have a registration yaml file, you can log into the overcloud compute node after the failure, add the repos manually, and rerun the deploy cmd to add the compute node to the stack.
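(For anyone hitting the same combination of symptoms, a quick verification pass after fixes like the above; a sketch, since exact output formats vary by pcs and rabbitmq version:)

$ sudo pcs property                            # maintenance-mode should be false or unset
$ sudo pcs status                              # all resources managed and running
$ sudo rabbitmqctl status | grep -A1 alarms    # a disk alarm appears here while the broker blocks connections
$ df -h /var                                   # the partition rabbitmq writes to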
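(On the registration TIP: the director templates ship a rhel-registration environment that can be passed to the deploy command, and the manual fallback is plain subscription-manager on the new node. Both sketches below use placeholders for the template path, credentials, and pool ID; the repo names are the ones I'd expect from the OSP 8 docs:)

$ openstack overcloud deploy --templates ~/templates/my-overcloud/ \
    -e ~/templates/my-overcloud/extraconfig/pre_deploy/rhel-registration/environment-rhel-registration.yaml \
    -e ~/templates/my-overcloud/extraconfig/pre_deploy/rhel-registration/rhel-registration-resource-registry.yaml \
    ...

# or, manually on the new node after a failed run:
$ sudo subscription-manager register --username=<user> --password=<pass>
$ sudo subscription-manager attach --pool=<pool-id>
$ sudo subscription-manager repos --enable=rhel-7-server-rpms --enable=rhel-7-server-openstack-8-rpms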