Bug 1315467

Summary: rhel-osp-director: 7.3->8.0 upgrade fails with ERROR: Timed out waiting for a reply to message ID 84a44ca3ed724eda991ba689cc364852.
Product: Red Hat OpenStack
Reporter: Alexander Chuzhoy <sasha>
Component: rhosp-director
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Alexander Chuzhoy <sasha>
Severity: unspecified
Priority: unspecified
Version: 8.0 (Liberty)
CC: adahms, dbecker, jcoufal, jslagle, mburns, morazi, rhel-osp-director-maint
Target Milestone: ga
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-tripleoclient-0.3.4-1.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the undercloud upgrade did not restart the openstack-nova-api service. As a result, subsequent overcloud upgrades failed with a timeout and reported the error "ERROR: Timed out waiting for a reply to message ID 84a44ca3ed724eda991ba689cc364852". The undercloud upgrade process now restarts the openstack-nova-api service, so the overcloud upgrade proceeds without hitting this timeout.
Last Closed: 2016-04-07 21:48:58 UTC
Type: Bug
Attachments:
nova-api.log (flags: none)

Description Alexander Chuzhoy 2016-03-07 20:04:05 UTC
rhel-osp-director: 7.3->8.0 upgrade fails with ERROR: Timed out waiting for a reply to message ID 84a44ca3ed724eda991ba689cc364852.


Environment:
openstack-tripleo-heat-templates-kilo-0.8.9-1.el7ost.noarch
instack-undercloud-2.2.4-1.el7ost.noarch
openstack-puppet-modules-7.0.12-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.9-1.el7ost.noarch

Steps to reproduce:
1. Deploy 7.3 (3 controllers + 2 computes) with network isolation.
Deployment command: openstack overcloud deploy --templates --control-scale 3 --compute-scale 2 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml

2. Upgrade the undercloud to 8.0.
3. Attempt to update the overcloud with:
openstack overcloud deploy --templates tripleo-heat-templates -e tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e tripleo-heat-templates/environments/puppet-pacemaker.yaml -e tripleo-heat-templates/environments/network-isolation.yaml -e tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network-environment.yaml -e tripleo-heat-templates/environments/major-upgrade-script-delivery.yaml


Result:
2016-03-07 15:31:50 [NodeTLSData]: UPDATE_COMPLETE  state changed                       
2016-03-07 15:31:51 [ControllerConfig]: UPDATE_IN_PROGRESS  state changed               
2016-03-07 15:31:52 [NetworkConfig]: UPDATE_COMPLETE  state changed                     
2016-03-07 15:31:52 [NodeTLSCAData]: UPDATE_IN_PROGRESS  state changed                  
2016-03-07 15:31:53 [ControllerConfig]: CREATE_IN_PROGRESS  state changed               
2016-03-07 15:31:54 [ControllerConfig]: CREATE_COMPLETE  state changed                  
2016-03-07 15:31:55 [NodeTLSCAData]: UPDATE_COMPLETE  state changed                     
2016-03-07 15:31:55 [NodeTLSData]: UPDATE_IN_PROGRESS  state changed                    
2016-03-07 15:31:55 [ControllerDeployment]: UPDATE_IN_PROGRESS  state changed           
2016-03-07 15:31:57 [NodeTLSData]: UPDATE_COMPLETE  state changed                       
2016-03-07 15:31:57 [ControllerConfig]: UPDATE_IN_PROGRESS  state changed               
2016-03-07 15:31:58 [ControllerConfig]: CREATE_IN_PROGRESS  state changed               
2016-03-07 15:31:59 [ControllerConfig]: CREATE_COMPLETE  state changed                  
2016-03-07 15:32:19 [UpdateDeployment]: SIGNAL_IN_PROGRESS  Signal: deployment succeeded
2016-03-07 15:32:19 [UpdateDeployment]: UPDATE_COMPLETE  state changed                  
2016-03-07 15:32:20 [ControllerDeployment]: UPDATE_IN_PROGRESS  state changed       

Broadcast message from systemd-journald (Mon 2016-03-07 12:48:12 EST):                                           
haproxy[27435]: proxy ironic has no server available!                                                                                                                                                               
ERROR: Timed out waiting for a reply to message ID 84a44ca3ed724eda991ba689cc364852
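
To see where the event stream stops, it can help to reduce it to the last status reported per resource. This is a rough sketch, assuming event lines shaped like the ones above ("timestamp [Resource]: STATUS reason"). The function names are hypothetical and not part of any OpenStack tool, and a resource whose last status is IN_PROGRESS only shows where the output ended, not necessarily the root cause.

```python
import re

# Assumed line shape, taken from the output above:
#   2016-03-07 15:32:20 [ControllerDeployment]: UPDATE_IN_PROGRESS  state changed
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<resource>[^\]]+)\]: (?P<status>[A-Z_]+)\s"
)


def last_status(lines):
    """Map each resource name to the last status seen for it (hypothetical helper)."""
    latest = {}
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match:
            # Later events overwrite earlier ones for the same resource.
            latest[match.group("resource")] = match.group("status")
    return latest


def still_in_progress(lines):
    """Return the resources whose final reported status is *_IN_PROGRESS."""
    return sorted(
        name
        for name, status in last_status(lines).items()
        if status.endswith("_IN_PROGRESS")
    )
```

Run against the events in this report, the only resource left in an IN_PROGRESS state is ControllerDeployment, which is where the update was waiting when the timeout was reported.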



Checking the os-collect-config log on the controller for errors shows these messages repeating:
Mar 07 19:04:10 overcloud-controller-0.localdomain os-collect-config[3829]: 2016-03-07 19:04:10.710 3829 WARNING os_collect_config.ec2 [-] 500 Server Error: Internal Server Error
Mar 07 19:04:41 overcloud-controller-0.localdomain os-collect-config[3829]: 2016-03-07 19:04:41.352 3829 WARNING os_collect_config.ec2 [-] 500 Server Error: Internal Server Error
Mar 07 19:05:12 overcloud-controller-0.localdomain os-collect-config[3829]: 2016-03-07 19:05:12.036 3829 WARNING os_collect_config.ec2 [-] 500 Server Error: Internal Server Error
Mar 07 19:05:42 overcloud-controller-0.localdomain os-collect-config[3829]: 2016-03-07 19:05:42.642 3829 WARNING os_collect_config.ec2 [-] 500 Server Error: Internal Server Error
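
The repetition is easier to judge when the warnings are pulled out and the gaps between them measured. A minimal sketch, assuming journal lines in the format shown above; the function names are hypothetical and the pattern only matches the "500 Server Error" warning from os_collect_config.ec2.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the journal excerpt above.
WARNING_RE = re.compile(
    r"os-collect-config\[\d+\]: "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ "
    r"\d+ WARNING os_collect_config\.ec2 \[-\] 500 Server Error"
)


def warning_times(lines):
    """Return the timestamps (whole seconds) of each 500 warning (hypothetical helper)."""
    times = []
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between consecutive warnings."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
```

For the four lines quoted here, the warnings arrive roughly every 30 seconds, which suggests a regular polling retry rather than a one-off failure.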




Expected result:
Successful update of the overcloud.

Comment 2 Alexander Chuzhoy 2016-03-07 21:11:13 UTC
Created attachment 1133904 [details]
nova-api.log

Comment 3 Marios Andreou 2016-03-17 11:32:56 UTC
So I can confirm that I've hit this many times when testing upgrades in a virt environment. The fix discussed on IRC yesterday, restarting openstack-nova-api after upgrading the undercloud, seems to fix it for me. I've added the restart to the tripleoclient undercloud upgrade at https://review.openstack.org/#/c/293960/
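
For anyone applying the workaround by hand before the fixed package is available, the step amounts to restarting the service on the undercloud once the undercloud upgrade finishes. The following is only a sketch under assumptions: it assumes a systemd-managed undercloud with the unit named openstack-nova-api (the service name used in this report), the helper name and the injectable runner are hypothetical, and it is not the code from the review linked above.

```python
import subprocess


def restart_nova_api(run=subprocess.check_call, unit="openstack-nova-api"):
    """Restart the nova-api unit via systemctl and return the command used.

    `run` is injectable so the command can be inspected without executing it;
    by default it calls subprocess.check_call, which raises on a non-zero exit.
    """
    cmd = ["sudo", "systemctl", "restart", unit]
    run(cmd)
    return cmd
```

Calling restart_nova_api() on the undercloud after the upgrade and before rerunning the overcloud deploy mirrors the manual workaround; passing run=print shows the command without running it.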

Comment 5 Alexander Chuzhoy 2016-04-04 20:15:42 UTC
Verified:

Environment:
python-tripleoclient-0.3.4-2.el7ost.noarch

Was able to upgrade the overcloud (OC) from 7.3 to 8.0.

Comment 7 errata-xmlrpc 2016-04-07 21:48:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html