rhel-osp-director: update 7.0->7.2 fails with "ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1 Authentication required"

Environment:
instack-undercloud-2.1.2-36.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

Steps to reproduce:
1. Deploy HA overcloud 7.0 with network isolation.
2. Attempt to update to 7.2.

Result:
The hosts get updated pretty quickly, then after a long time:

  IN_PROGRESS
  IN_PROGRESS
  IN_PROGRESS
  IN_PROGRESS
  ERROR: openstack ERROR: Authentication failed. Please try again with option --include-password or export HEAT_INCLUDE_PASSWORD=1 Authentication required

Expected result:
Successful update.

Notes:
Seems like puppet issues:

  heat resource-list -n 5 overcloud | grep -v COMPLETE
  | ControllerNodesPostDeployment               | 7d89fd52-d552-46c9-9e4d-feeb8a100827 | OS::TripleO::ControllerPostDeployment | UPDATE_FAILED      | 2015-12-11T22:26:48Z |                                             |
  | ControllerOvercloudServicesDeployment_Step4 | 9e1059b6-34ed-467b-a90e-46d1771ee31c | OS::Heat::StructuredDeployments       | UPDATE_IN_PROGRESS | 2015-12-11T22:32:07Z | ControllerNodesPostDeployment               |
  | 2                                           | 82be7957-7b81-4ff0-a006-0364cdf65aa7 | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:24Z | ControllerOvercloudServicesDeployment_Step4 |
  | 0                                           | ddd5489d-efff-4184-b727-428b37188032 | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:28Z | ControllerOvercloudServicesDeployment_Step4 |
  | 1                                           | dfd7ed81-18cd-4997-8e44-035075b4c30d | OS::Heat::StructuredDeployment        | UPDATE_IN_PROGRESS | 2015-12-11T22:32:29Z | ControllerOvercloudServicesDeployment_Step4 |

From pcs status:

  ip-192.0.2.12 (ocf::heartbeat:IPaddr2): Stopped (unmanaged)
Created attachment 1104930 [details] messages from one controller.
The logs are too big, so I attached just the messages file.
I think this is likely to have the same root cause and workaround as bug 1290950 https://bugzilla.redhat.com/show_bug.cgi?id=1290949#c8
Doc/workaround works for me.
*** This bug has been marked as a duplicate of bug 1290949 ***
The error reproduced on a VM setup with 2 vCPUs.
Doing some basic diagnosis on a failed stack: the resource UpdateDeployment on controller 0 failed due to a timeout. SSHing into controller 0 and running:

  journalctl -l -u os-collect-config
  systemctl status os-collect-config

shows that the puppet apply is still running, and apparently stalled on a call to nova-manage db_sync:

  CGroup: /system.slice/os-collect-config.service
          ├─ 2417 /usr/bin/systemctl restart openstack-nova-scheduler
          ├─12684 /sbin/dhclient -H overcloud-controller-0 -1 -q -lf /var/lib/dhc...
          ├─14016 /usr/bin/python /usr/bin/os-collect-config
          ├─19509 /usr/bin/python /usr/bin/os-refresh-config
          ├─19644 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-...
          ├─22133 python /usr/libexec/os-refresh-config/configure.d/55-heat-confi...
          ├─22139 python /var/lib/heat-config/hooks/puppet
          ├─22141 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/l...
          └─31139 /usr/bin/python /usr/bin/nova-manage db sync

ControllerOvercloudServicesDeployment_Step4 on all 3 controllers is also in a failed state. Looking at os-collect-config on controller 1 shows that the puppet apply is wedged on restarting openstack-nova-scheduler:

  CGroup: /system.slice/os-collect-config.service
          ├─ 7366 /usr/bin/python /usr/bin/os-refresh-config
          ├─ 7504 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-...
          ├─10047 python /usr/libexec/os-refresh-config/configure.d/55-heat-confi...
          ├─10052 python /var/lib/heat-config/hooks/puppet
          ├─10054 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/l...
          ├─12693 /sbin/dhclient -H overcloud-controller-1 -1 -q -lf /var/lib/dhc...
          ├─13680 /usr/bin/systemctl restart openstack-nova-scheduler
          └─15004 /usr/bin/python /usr/bin/os-collect-config

Controller 2 is also wedged while running puppet, but there doesn't seem to be a long-running subprocess causing it, so I'm not sure what the cause of this wedge is:

  CGroup: /system.slice/os-collect-config.service
          ├─  811 /usr/bin/systemctl restart openstack-nova-scheduler
          ├─ 2973 /usr/bin/python /usr/bin/os-collect-config
          ├─12685 /sbin/dhclient -H overcloud-controller-2 -1 -q -lf /var/lib/dhclient/dhclient--br-ex.lease -pf /var/run/dhclient-br-ex.pid br-ex
          ├─27005 /usr/bin/python /usr/bin/os-refresh-config
          ├─27155 /bin/bash /usr/local/bin/dib-run-parts /usr/libexec/os-refresh-config/configure.d
          ├─29697 python /usr/libexec/os-refresh-config/configure.d/55-heat-config
          ├─29708 python /var/lib/heat-config/hooks/puppet
          └─29710 /usr/bin/ruby /usr/bin/puppet apply --detailed-exitcodes /var/lib/heat-config/heat-config-puppet/8827922e-444e-4ec3-acd8-53ece6c6127e.pp

Any heat stack-update is going to time out on unrelated resources while these puppet apply runs are wedged. I think the "Authentication failed." message is actually the token timing out, so apart from the misleading error message, heat is working correctly in this instance. I'm unassigning myself from this bug so someone else can diagnose these puppet issues.
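For anyone triaging a similar wedge, here is a minimal sketch (assuming the standard overcloud-controller-N hostnames and key-based SSH as heat-admin; adjust both for your environment) to check all controllers for a stuck puppet apply under os-collect-config:

  #!/bin/bash
  # Check each controller for a puppet apply wedged under os-collect-config.
  # Assumes three controllers reachable by name with key-based SSH as heat-admin.
  for n in 0 1 2; do
    host="overcloud-controller-${n}"
    echo "=== ${host} ==="
    # A long-lived 'puppet apply' or 'nova-manage db sync' child in this
    # cgroup tree means the deployment step is stuck.
    ssh "heat-admin@${host}" \
      "sudo systemctl status os-collect-config --no-pager | grep -A 20 CGroup"
  done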
This looks like a false alarm to me. I suspect an inadvertent change in network isolation settings between the initial deployment and the update.

On controller-0 there's a puppet run stuck; it's stuck on restarting nova-scheduler and doing nova db-sync. In nova-scheduler.log I see it cannot connect to the database:

  2015-12-16 04:46:08.457 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10696 attempts left.
  2015-12-16 04:46:18.467 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10697 attempts left.
  2015-12-16 04:46:28.474 2426 WARNING oslo_db.sqlalchemy.session [req-a2dc7d9a-1ec5-4cba-941e-0ec36dd878fc - - - - -] SQL connection failed. -10698 attempts left.

The `connection` option in nova.conf points to 192.0.2.6:

  connection=mysql://nova:<password-removed-for-BZ>@192.0.2.6/nova

The 192.0.2.6 address is pingable and assigned to controller-0:

  [root@overcloud-controller-0 ~]# ip a | grep 192.0.2.6
      inet 192.0.2.6/32 brd 192.0.2.255 scope global br-ex

The reason MariaDB isn't reachable on this IP is that haproxy listens on a different one. The mysql section from haproxy.cfg:

  listen mysql
    bind 192.168.100.11:3306
    option httpchk
    stick on dst
    stick-table type ip size 1000
    timeout client 0
    timeout server 0
    server overcloud-controller-0 192.168.100.15:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
    server overcloud-controller-1 192.168.100.12:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2
    server overcloud-controller-2 192.168.100.14:3306 backup check fall 5 inter 2000 on-marked-down shutdown-sessions port 9200 rise 2

On the undercloud machine I don't see what command was used to deploy, but I see two network-environment files: network-environment.yaml.org and network-environment.yaml. By the look of it, I guess network-environment.yaml.org was used in the initial deployment, and network-environment.yaml for the update. There are quite a lot of differences between the two files; e.g. in network-environment.yaml.org I see:

  InternalApiNetCidr: 192.168.100.0/24

which corresponds to the haproxy.cfg setting above. In network-environment.yaml I don't see any InternalApiNetCidr line though. Also in the .org variant of the file I see OS::TripleO::Network::InternalApi assigned in the resource registry, but in the one used for the update I don't see it.

This probably caused "un-isolation" of the InternalApi network and a fallback to ctlplane, causing the desync between nova.conf and haproxy.cfg (haproxy didn't get restarted by puppet, as it's managed by pacemaker). I think unchanged environment files should be passed to the update CLI call, except for cases where the update doc says that some modifications are needed.
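As a sanity check before running an update, a sketch along these lines would catch this class of environment drift (file names as found on this undercloud; treat them as placeholders elsewhere):

  #!/bin/bash
  # Compare the environment file used for the initial deployment against the
  # one about to be passed to 'openstack overcloud update stack'.
  old=network-environment.yaml.org
  new=network-environment.yaml

  # Any diff here means the update may render configs against a different
  # network layout than the running overcloud is using.
  diff -u "$old" "$new"

  # Spot-check the settings implicated above: the InternalApi CIDR and the
  # resource registry entry that enables InternalApi isolation.
  grep -H 'InternalApiNetCidr' "$old" "$new"
  grep -H 'OS::TripleO::Network::InternalApi' "$old" "$new"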
I've attached the upstream bugs for the cosmetic aspects of this issue.
As explained in https://bugzilla.redhat.com/show_bug.cgi?id=1290950#c19, the problem with the investigated environment was network misconfiguration in the environment files. If the same symptoms appear for a different reason, we should probably open a new BZ.
Reproduced this issue on a 7.2->7.3 update. Deployed 7.2 GA with:

  openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server x.x.x.x --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml

Update command:

  openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip.yaml -e network-environment.yaml
The environment files were modified.
Adding blocker? for 7.3 GA. This bug was initially opened when we tested an update attempt from 7.0 to 7.2; now it has reproduced when attempting an update of a clean 7.2 deployment to the latest 7.3 bits.

Environment:
--------------
openstack-heat-api-2015.1.2-7.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-112.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-templates-0-0.8.20150605git.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.2-7.el7ost.noarch
openstack-heat-engine-2015.1.2-7.el7ost.noarch
And reproduced on a 7.0 -> 7.3 update.

Deployment command:

  openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 --neutron-network-type gre --neutron-tunnel-types gre --ntp-server x.x.x.x --timeout 90 -e network-environment.yaml

Update command:

  openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-keystone-admin-internal-api.yaml -e network-environment.yaml
The root cause on the system I got to investigate looks like something related to SSL support, possibly the OS::TripleO::NodeTLSCAData missing from the resource registry.

  [stack@instack ~]$ heat resource-list overcloud -n5 | grep -vi complete
  +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
  | resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time         | parent_resource |
  +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+
  | Controller    | e3e5d008-201a-4a0e-9196-338186d78233 | OS::Heat::ResourceGroup | UPDATE_FAILED   | 2016-02-01T21:17:26Z |                 |
  | 2             | b7ac5241-4e77-4b38-bb6d-5bd82908dabe | OS::TripleO::Controller | UPDATE_FAILED   | 2016-02-01T21:17:49Z | Controller      |
  | 1             | a5c81a21-2113-4d56-aeeb-12c0f8449359 | OS::TripleO::Controller | UPDATE_FAILED   | 2016-02-01T21:19:51Z | Controller      |
  +---------------+--------------------------------------+-------------------------+-----------------+----------------------+-----------------+

  [stack@instack ~]$ heat stack-show a5c81a21-2113-4d56-aeeb-12c0f8449359 | grep reason
  | stack_status_reason | Unknown resource Type : OS::TripleO::NodeTLSCAData |
  [stack@instack ~]$ heat stack-show b7ac5241-4e77-4b38-bb6d-5bd82908dabe | grep reason
  | stack_status_reason | Unknown resource Type : OS::TripleO::NodeTLSCAData |
  [stack@instack ~]$ heat stack-show e3e5d008-201a-4a0e-9196-338186d78233 | grep reason
  | stack_status_reason | Timed out |
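To generalize the triage above, a small sketch (assumes python-heatclient and a sourced stackrc on the undercloud, and the table layout shown above) that pulls the stack_status_reason for every failed nested stack:

  #!/bin/bash
  # List every non-COMPLETE nested resource under the overcloud stack, then
  # fetch the failure reason for each one whose physical ID is itself a stack.
  heat resource-list overcloud -n5 | grep -i FAILED |
    awk -F'|' '{gsub(/ /,"",$3); print $3}' |
  while read -r id; do
    echo "=== ${id} ==="
    # Plain (non-stack) resources will just produce an error we ignore.
    heat stack-show "${id}" 2>/dev/null | grep stack_status_reason
  done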
Pasting from IRC -- the update command was missing the resource registry:

  openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --ceph-storage-scale 1 --neutron-network-type vxlan --neutron-tunnel-types vxlan --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network
Sorry, that's ^ the deploy cmd; the update cmd was:

  openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/updates/update-from-vip
[stack@instack openstack-tripleo-heat-templates]$ grep -ri OS::TripleO::NodeTLSCAData
overcloud-resource-registry-puppet.yaml:  OS::TripleO::NodeTLSCAData: puppet/extraconfig/tls/no-ca.yaml
puppet/controller-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/swift-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/ceph-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/cinder-storage-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
puppet/compute-puppet.yaml:    type: OS::TripleO::NodeTLSCAData
environments/inject-trust-anchor.yaml:  OS::TripleO::NodeTLSCAData: ../puppet/extraconfig/tls/ca-inject.yaml

I see OS::TripleO::NodeTLSCAData mapped to no-ca.yaml in the resource registry ^^ so I wonder if "Unknown resource Type : OS::TripleO::NodeTLSCAData", even though we do pass the resource registry, could be a symptom of something else, possibly a Heat bug of some sort.
This looks similar to the error in bug 1298589 (which is for 8.0)
Here is the stack trace, very similar to bug 1298589:

2016-02-01 16:03:11.033 7460 ERROR heat.engine.service [-] Unhandled error in asynchronous task
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service Traceback (most recent call last):
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 123, in log_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     gt.wait()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return self._exit_event.wait()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 125, in wait
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     current.throw(*self._exc)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     result = function(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 112, in _start_with_trace
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return func(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return f(*args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 76, in handle_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     {'func': func.__name__, 'msg': errmsg})
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     six.reraise(self.type_, self.value, self.tb)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 71, in handle_exceptions
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     return func(stack, *args, **kwargs)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 893, in update
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     updater()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 174, in __call__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.start(timeout=timeout)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.step()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     next(self._runner)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     subtask = next(parent)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 946, in update_task
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     updater.start(timeout=self.timeout_secs())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 200, in start
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.step()
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 223, in step
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     next(self._runner)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 289, in wrapper
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     subtask = next(parent)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/update.py", line 55, in __call__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.previous_stack.dependencies,
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 260, in dependencies
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.resources.itervalues())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 223, in resources
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     self.t.resource_definitions(self).items())
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 222, in <genexpr>
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     for (name, data) in
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 143, in __new__
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     resource_name=name)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service   File "/usr/lib/python2.7/site-packages/heat/engine/environment.py", line 439, in get_class_to_instantiate
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service     raise exception.StackValidationFailed(message=msg)
2016-02-01 16:03:11.033 7460 TRACE heat.engine.service StackValidationFailed: Unknown resource Type : OS::TripleO::NodeTLSCAData

Crucially, the error appears to be in getting the dependencies for the *previous* stack.
The log also shows that heat-engine was restarted with a stack IN_PROGRESS prior to this failure occurring, and also that the stack experienced a failure due to message timeout (bug 1290949) about 20 minutes prior to the error above. So it's likely that we have some partially-created resource that is then being resolved against the old environment, which hasn't defined a type for it - i.e. the same problem that should have been fixed as bug 1278975.
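To confirm that sequence on an affected undercloud, one could correlate heat-engine restarts with the stack update window (a sketch; openstack-heat-engine is the unit name used in this release, and it assumes journald logging plus a sourced stackrc):

  # Look for heat-engine starts/stops during the update window.
  sudo journalctl -u openstack-heat-engine | grep -iE 'start|stop'
  # Compare against the stack's status and last update time.
  heat stack-show overcloud | grep -E 'stack_status|updated_time'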
Current best guess is that the fix for bug 1278975 doesn't get the chance to run when the stack operation ends abnormally (e.g. by restarting heat-engine). It's not inside an exception handler, so even assuming things proceed in a somewhat orderly manner (i.e. not kill -9) it doesn't stand a chance. This can probably be worked around by not restarting heat-engine while it's still working on a stack. A likely fix would be along the lines of storing the updated environment in advance, rather than waiting until after something has failed.
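Until something like that lands, the workaround can be made mechanical; a minimal guard sketch, assuming a sourced stackrc and python-heatclient on the undercloud:

  #!/bin/bash
  # Guard: refuse to restart heat-engine while any stack, including nested
  # stacks, is still IN_PROGRESS, since ending a stack operation abnormally
  # can leave resources that later resolve against the old environment.
  if heat stack-list --show-nested | grep -q IN_PROGRESS; then
    echo "A stack operation is still in progress; not restarting heat-engine." >&2
    exit 1
  fi
  sudo systemctl restart openstack-heat-engine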
Per Zane, this bug does not have a fix yet. Most of the issues that were considered blocking in this bug are really due to bug 1304878 which is already fixed. This bug only shows up if heat-engine is restarted mid-update, I'd propose dropping blocker.
*** Bug 1306502 has been marked as a duplicate of this bug. ***
I agree with Mike in Comment 37. This isn't a release blocker. It's a consequence of restarting the heat-engine in mid-update.
(In reply to Angus Thomas from comment #39)
> I agree with Mike in Comment 37. This isn't a release blocker. It's a
> consequence of restarting the heat-engine in mid-update.

It is, but people do that all the time, and we've been recommending it as our workaround to avoid waiting 4 hours for the timeout on a child stack if something fails. On the happy path you never hit it, but that doesn't mean you won't be hitting it regularly. And it's really hard to recover from, because it makes the data about the *existing* state inconsistent. We have a patch and can avoid a lot of needless suffering.
I haven't been able to reproduce the original issue, not that it's an easy thing to do.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-0266.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days