Bug 1418010

Summary: Overcloud upgrade to RHEL 7.3 is failing
Product: Red Hat OpenStack
Component: openstack-tripleo
Version: 8.0 (Liberty)
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: high
Target Milestone: async
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Eduard Barrera <ebarrera>
Assignee: Marios Andreou <mandreou>
QA Contact: Arik Chernetsky <achernet>
CC: apetrich, aschultz, augol, bfournie, djuran, hjensas, jslagle, mandreou, mburns, mcornea, mschuppe, rhel-osp-director-maint, sathlang, therve
Last Closed: 2017-03-08 10:38:21 UTC
Type: Bug

Description Eduard Barrera 2017-01-31 15:28:20 UTC
Description of problem:

We are updating the overcloud nodes to OSP 8 on RHEL 7.3; Satellite can now see 7.3 packages instead of 7.2, and OpenStack is already at the latest version on the overcloud.

The initial update failed with an HTTP 504 error, but the stack remained in UPDATE_IN_PROGRESS; after 17 hours the stack was interrupted by restarting heat-engine.

To update the overcloud we run the following, and after a while it returns to the prompt:


```
cd /home/stack
openstack overcloud update stack overcloud -i -vv \
  --templates \
  -e ~/templates/bb1-overcloud-storage.yaml \
  -e ~/templates/bb1-overcloud-network.yaml \
  -e ~/templates/bb1-overcloud-hostnames.yaml \
  -e ~/templates/bb1-overcloud-parameters.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e ~/templates/firstboot/user-data.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e ~/templates/rhel-registration/environment-rhel-registration.yaml
```

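While the update runs, progress can be watched from a second shell on the undercloud. A minimal sketch using the Liberty-era heat CLI (stackrc is the standard undercloud credentials file):

```
source ~/stackrc
# Overall stack status; expect UPDATE_IN_PROGRESS while the update runs
heat stack-list
# Per-resource view across nested stacks; also shows resources paused on hooks
heat resource-list -n 5 overcloud
```

The failing client call captured by the -vv output was: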

curl -g -i -X GET -H 'User-Agent: python-heatclient' -H 'Content-Type: application/json' -H 'X-Auth-Url: https://172.X.X.X:13000/v2.0' -H 'Accept: application/json' -H 'X-Auth-Token: {SHA1}aea53137ffb5d42275762b13e1c4cfdb3c9bb7c6' https://172.16.X.X:13004/v1/1ec8f1ff4c114b1b93f0015a77b52865/stacks/overcloud/d901cde4-bd16-4850-bb98-01c067eda427/resources?nested_depth=5


```
HTTP/1.0 504 Gateway Time-out
connection: close
content-type: text/html
cache-control: no-cache

<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

clean_up UpdateOvercloud: ERROR: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>

END return value: 1
```
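The "server didn't respond in time" body is HAProxy's 504 page: on the undercloud the SSL-terminated endpoints (the 13xxx ports, as in the URL above) sit behind HAProxy, so when heat-api takes longer than the proxy's server timeout the client gets a 504 and gives up while the update keeps running server-side. A hedged sketch of where that timeout lives (values and section layout are illustrative and vary by deployment):

```
# /etc/haproxy/haproxy.cfg on the undercloud -- illustrative values only
defaults
  timeout client 2m
  timeout server 2m   # raising this gives slow heat-api calls time to finish

# apply the change:
#   sudo systemctl restart haproxy
```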


The heat logs were truncated before launching the update again; there are no relevant ERRORs, but:


2017-01-31 13:55:32.422 10577 INFO heat.engine.stack [-] Stack UPDATE FAILED (overcloud): Engine went down during stack UPDATE
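This is heat marking the stack FAILED after the engine restart, which at least unblocks it. A minimal sketch of that recovery step (service name as on an OSP 8 undercloud):

```
# Restart the engine that owned the stuck stack, then confirm its new state
sudo systemctl restart openstack-heat-engine
heat stack-list   # the stack should now show UPDATE_FAILED instead of IN_PROGRESS
```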


It is also worth checking the attached file show-nested.txt (91 KB); various nested stacks are in a failed-to-delete state.

Latest heat logs: heat-logs-2017-01-31_152418_CET.tgz

A full sosreport is being uploaded; it is big and the upload is slow.

Version-Release number of selected component (if applicable):

OSP 8

How reproducible:
always

Steps to Reproduce:
1. Execute the overcloud update.

Actual results:
Gateway timeout (HTTP 504); the stack remains in UPDATE_IN_PROGRESS.

Expected results:
The overcloud update completes successfully.

Additional info:

Comment 4 Bob Fournier 2017-02-01 14:20:49 UTC
Eduard - can you add the contents of /var/log/messages from the node that is having problems (controller0)? Thanks.

Comment 7 Bob Fournier 2017-02-01 20:51:38 UTC
Thanks Eduard. It would also be useful to see the status of the interfaces via
'ip a' from when you logged in and "found out controller0 with all the interfaces
down". I wasn't able to correlate that with the sosreports.

Also, can you run "neutron agent-list" on the undercloud? It would be good to
run it both after sourcing 'stackrc' and again after sourcing 'overcloudrc'
once the deployment finishes.
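For reference, a sketch of the requested checks, assuming the standard credential files in the stack user's home directory:

```
# Agents registered with the undercloud's neutron
source ~/stackrc
neutron agent-list

# Agents registered with the overcloud's neutron, once the update finishes
source ~/overcloudrc
neutron agent-list
```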

Some thoughts upon looking through the logs...

1. The debug output shows these error messages in deploy_stderr:

Error: unable to start corosync
Error: cluster is not currently running on this node
ERROR bb01-ctrl0 failed to join cluster in 600 seconds

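When a controller fails to rejoin the cluster like this, the usual first checks on the node itself are (a sketch; pcs is the standard Pacemaker CLI on RHEL 7):

```
# Cluster membership and resource state as Pacemaker sees it
sudo pcs status
# Is corosync itself running?
sudo systemctl status corosync
# Try to (re)start cluster services on this node only
sudo pcs cluster start
```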
2. I'm seeing a lot of neutron connectivity-related issues in /var/log/messages
on the controller. I'm not sure if this is just due to the upgrade, as other
neutron agents may be down periodically, but there are many messages of this
type:

14:10:26 bb01-ctrl0 ceilometer-polling: 2017-02-01 14:10:26.508 17552 ERROR ceilometer.neutron_client [-] internalURL endpoint for network service not found (these eventually went away...)

Feb  1 15:16:02 bb01-ctrl0 glance-api: 2017-02-01 15:16:02.795 19026 ERROR glance.registry.client.v1.client [req-a9091770-5ac6-4408-aadc-f5a809e7b985 13fe9767454c4ca6a2f618a1e61c878a d41ceb78a14d46b79ab4140a731d75ec - - -] Registry client request GET /images/1fd81867-561f-43dd-96e3-89fd8487c67b raised NotFound

Feb  1 15:33:12 bb01-ctrl0 neutron-lbaasv2-agent: 2017-02-01 15:33:12.741 2868 ERROR neutron.common.rpc [-] Timeout in RPC method get_ready_devices. Waiting for 47 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough

Feb  1 15:33:59 bb01-ctrl0 neutron-lbaasv2-agent: 2017-02-01 15:33:59.861 2868 ERROR neutron_lbaas.agent.agent_manager [-] Unable to retrieve ready devices
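As the RPC timeout message itself suggests, that timeout is tunable. A hedged sketch of the setting it refers to (file path as on OSP controllers; the value is illustrative):

```
# /etc/neutron/neutron.conf -- raise if neutron-server is slow under load
[DEFAULT]
# seconds to wait for an RPC response before retrying (default 60)
rpc_response_timeout = 120
```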

3. I don't really see a connection between this issue and the initscripts bug
https://bugzilla.redhat.com/show_bug.cgi?id=1367580, mainly because I don't
see NetworkManager bringing down any interfaces. In that bug we have
interface-down messages:

Aug  9 20:44:44 overcloud-controller-1 nm-dispatcher: Dispatching action 'down' for br-ex

but I don't see interfaces being brought down in these logs.

4. The supplied journal for NetworkManager on controller doesn't appear
to have captured any problems.

5. The update stayed at "In Progress" until it was terminated, and all of the
resources had this message:
 UPDATE paused until Hook pre-update is cleared
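That hook message is expected with the -i flag used here: the update sets a pre-update breakpoint on each node and pauses until the breakpoint is cleared, normally via the interactive prompt. If one has to be cleared by hand, heat's hook mechanism accepts an unset_hook signal (a sketch; the resource name is a placeholder):

```
# Clear a pre-update breakpoint on one paused resource of the overcloud stack
heat resource-signal overcloud <RESOURCE_NAME> -D '{"unset_hook": "pre-update"}'
```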

Comment 32 Amit Ugol 2017-03-08 10:38:21 UTC
The status of the bug is still NEW, though from the comments I see that the issue was resolved. If there is a need to track something here, please raise a new bug.