Description of problem:

During an overcloud minor update, at the software configuration step, pcs turns off the vlan interfaces on the controller nodes and does not bring them back up. This causes the minor update to fail and stops the cluster altogether. The issue is only noticed for interfaces named 'vlanxx'.

Also, when running nova hypervisor-list on the undercloud, the compute nodes appear down. Logging in to these nodes manually and checking the nova service status shows that the services are up.

$ nova hypervisor-list
+----+-------------------------------+-------+---------+
| ID | Hypervisor hostname           | State | Status  |
+----+-------------------------------+-------+---------+
| 2  | 798012-compute003.localdomain | down  | enabled |
| 5  | 798013-compute004.localdomain | down  | enabled |
| 8  | 798015-compute006.localdomain | down  | enabled |
| 11 | 798014-compute005.localdomain | down  | enabled |
| 14 | 798011-compute002.localdomain | down  | enabled |
| 17 | 798010-compute001.localdomain | down  | enabled |

Here are the stack deployment status outputs:

$ openstack stack resource list iad-4679686 | grep FAILED
| Controller | 148dfb9e-b535-4866-a6b5-edd1ded6f6a9 | OS::Heat::ResourceGroup | UPDATE_FAILED | 2016-12-31T23:58:50 |
| Compute    | f417b699-c08a-4a63-84ac-640ff70f41cc | OS::Heat::ResourceGroup | UPDATE_FAILED | 2017-01-01T02:10:14 |

$ openstack stack resource list 148dfb9e-b535-4866-a6b5-edd1ded6f6a9
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time        |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| 1             | 5b5448c2-7f0e-4c3c-9531-2efcf4afbbdd | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T23:58:53 |
| 0             | 51d37914-f023-44be-b27f-8407bff236c1 | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:59:09 |
| 2             | 0485c1da-9ecc-4d61-94e2-23e266500f4a | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:58:58 |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+

[stack@798006-director01 161220-11524]$ openstack stack resource list 51d37914-f023-44be-b27f-8407bff236c1
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status | updated_time        |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| NodeUserData             | 27b9fd0a-75fd-4904-b28c-165a87f7f950 | OS::TripleO::NodeUserData                       | UPDATE_COMPLETE | 2016-12-31T22:59:24 |
| ControllerExtraConfigPre | 1a7ea770-ef39-4958-a63e-916f638d722a | OS::TripleO::ControllerExtraConfigPre           | UPDATE_COMPLETE | 2016-09-22T00:20:09 |
| NodeTLSCAData            | 9d9bb7e2-b741-4fbb-94b3-ceaed078ba4a | OS::TripleO::NodeTLSCAData                      | UPDATE_COMPLETE | 2016-12-31T22:59:41 |
| ExternalPort             | 627fffe3-0177-45ac-b31a-56e0e89a64f2 | OS::TripleO::Controller::Ports::ExternalPort    | UPDATE_COMPLETE | 2016-12-31T22:59:31 |
| StorageMgmtPort          | 6e35726c-09a5-42ba-a3a9-e3f752f29132 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
| InternalApiPort          | 613b398a-8a6a-420c-8680-8e67af9e11d1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-12-31T22:59:30 |
| NodeTLSData              | 702077fc-0655-4e78-acc2-be58eb7116d2 | OS::TripleO::NodeTLSData                        | UPDATE_COMPLETE | 2016-12-31T22:59:45 |
| StoragePort              | 69574f4d-d390-429c-bd95-6e7a5a88b230 | OS::TripleO::Controller::Ports::StoragePort     | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| Controller               | 9763fd7e-a921-447b-8655-7ecc3109d94f | OS::Nova::Server                                | UPDATE_COMPLETE | 2016-12-31T22:59:29 |
| NodeAdminUserData        | 2d4e819d-8aed-46da-b546-d3598804e6b9 | OS::TripleO::NodeAdminUserData                  | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| NetIpMap                 | 477ddd95-e35c-4ffb-85e7-b58cb49b58c3 | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-12-31T22:59:39 |
| NodeExtraConfig          | 53299b6f-4c98-4a05-b225-7f4cf5fb6e87 | OS::TripleO::NodeExtraConfig                    | UPDATE_COMPLETE | 2016-09-22T00:20:13 |
| TenantPort               | 87152981-c16a-4f02-8ff2-5e98e262e2c4 | OS::TripleO::Controller::Ports::TenantPort      | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| NetworkConfig            | 9d6588c2-6b07-4d8d-8e46-c20693e70961 | OS::TripleO::Controller::Net::SoftwareConfig    | UPDATE_COMPLETE | 2016-12-31T22:59:37 |
| UpdateConfig             | 9e7021a9-396d-42c3-a255-6d88d909ae49 | OS::TripleO::Tasks::PackageUpdate               | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| ControllerDeployment     | 6320ec81-044a-4bc9-b65b-a66db1f8b9db | OS::TripleO::SoftwareDeployment                 | UPDATE_COMPLETE | 2016-09-21T23:56:31 |
| ControllerConfig         | 0d30a522-f72d-4bb3-b80d-5d6c00da8f6f | OS::Heat::StructuredConfig                      | CREATE_COMPLETE | 2016-09-21T18:47:33 |
| NetIpSubnetMap           | d21a19ef-ef0b-4832-9ee2-5d5b800ff07b | OS::TripleO::Network::Ports::NetIpSubnetMap     | UPDATE_COMPLETE | 2016-12-31T22:59:38 |
| UpdateDeployment         | 2d959540-f337-4e13-b4ac-2f159ac4c914 | OS::Heat::SoftwareDeployment                    | UPDATE_FAILED   | 2016-12-31T23:00:54 |
| UserData                 | cfb8358e-6441-4697-98e2-49845b439ff5 | OS::Heat::MultipartMime                         | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| NetworkDeployment        | ed034810-b845-4152-bdb9-67cf6543cbfd | OS::TripleO::SoftwareDeployment                 | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| ManagementPort           | 56e921c6-d785-4e9b-b3fc-0387eaf65eb5 | OS::TripleO::Controller::Ports::ManagementPort  | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+

$ openstack software deployment output show 2d959540-f337-4e13-b4ac-2f159ac4c914 --all
output_values:
  deploy_stdout: |
    ...
    warning: /var/lib/logrotate.status saved as /var/lib/logrotate.status.rpmsave
    Created symlink from /etc/systemd/system/sockets.target.wants/virtlogd.socket to /usr/lib/systemd/system/virtlogd.socket.
    2670 blocks
    yum return code: 0
    Starting cluster node
    Starting Cluster...
    Redirecting to /bin/systemctl start corosync.service
    Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.
    ERROR 798007-controller01 failed to join cluster in 600 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    (truncated, view all with --long)
  update_managed_packages: false
  deploy_status_code: 1

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-41.el7ost.noarch
openstack-heat-engine-6.0.0-12.el7ost.noarch
openstack-heat-templates-0-0.3.96a0b0bgit.el7ost.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
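To confirm which 'vlanxx' interfaces are down on a node, something like the following might help. This is only a sketch: the function name down_vlans is made up for illustration, and it assumes the usual one-line-per-interface format printed by `ip -o link show`.

```shell
#!/usr/bin/env bash
# Sketch: list interfaces named vlan<N> that do not carry the UP flag.
# Reads `ip -o link show` output on stdin (format assumed, see note above).
# down_vlans is a hypothetical helper name, not part of any package.
down_vlans() {
  awk -F': ' '
    # $2 is the interface name, e.g. "vlan204" or "vlan204@bond0"
    $2 ~ /^vlan[0-9]+(@|$)/ && $0 !~ /[<,]UP[,>]/ {
      sub(/@.*/, "", $2)   # drop the "@parent" suffix if present
      print $2
    }'
}
```

On a controller this could be run as `ip -o link show | down_vlans`.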
A workaround was applied to continue and complete the stack update:

**** Regarding the controllers.

1. Make sure the vlans are up:
# for i in 204 206 207; do ifup vlan$i; done

2. Restart the openvswitch daemons:
# systemctl restart openvswitch.service openvswitch-nonetwork.service neutron-openvswitch-agent.service

3. Remove the stale files:
# mount | grep xxx.xx.xx.xx
# umount -f /var/lib....

4. Restart the cluster daemons:
# systemctl restart corosync.service pcsd.service

5. Start the cluster:
# pcs cluster start

Then we reran the openstack overcloud update. As soon as the update turned off the ifcfg-vlan* interfaces again, we brought them back up manually, and after that the update continued.

Another thing I found on both nodes is that /var/lib/glance/images has the wrong owner and permissions after the update:
# chown glance:glance /var/lib/glance/images
# chmod 755 /var/lib/glance/images

**** Regarding the computes.

I only needed to bring the vlans back up:
# for i in 203 204 206 207; do ifup vlan$i; done
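The controller steps above could be wrapped in a small script roughly like this. It is a sketch, not a tested fix: the helper names (run, restore_vlans, recover_controller) and the DRY_RUN variable are made up for illustration, the VLAN IDs are the ones from this environment, and the umount step is left out because the path is not given in full.

```shell
#!/usr/bin/env bash
# Sketch of the controller workaround. DRY_RUN defaults to 1, so the
# commands are only printed; set DRY_RUN=0 to actually run them as root.
# All function names here are hypothetical helpers.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Bring up each vlan<ID> interface passed as an argument.
restore_vlans() {
  for id in "$@"; do
    run ifup "vlan${id}"
  done
}

recover_controller() {
  restore_vlans 204 206 207
  run systemctl restart openvswitch.service openvswitch-nonetwork.service neutron-openvswitch-agent.service
  # Step 3 (unmounting the stale mount) is omitted: the path is elided above.
  run systemctl restart corosync.service pcsd.service
  run pcs cluster start
}
```

Calling `recover_controller` with the default DRY_RUN=1 prints the commands in order, which makes it easy to review them before running anything on a controller. On the computes, `restore_vlans 203 204 206 207` would cover the only step that was needed there.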
I don't think it's possible to root cause this without the logs of the event, which are missing per comment 2. I'm closing this as cannot reproduce.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days