Bug 1413896 - During stack update vlan interfaces are turned off and does not turn back on.
Summary: During stack update vlan interfaces are turned off and does not turn back on.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openvswitch
Version: 9.0 (Mitaka)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 9.0 (Mitaka)
Assignee: Timothy Redaelli
QA Contact: Ofer Blaut
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-01-17 09:29 UTC by Navneet Krishnan
Modified: 2023-09-14 03:37 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-06 16:12:44 UTC
Target Upstream Version:
Embargoed:



Description Navneet Krishnan 2017-01-17 09:29:37 UTC
Description of problem:

During an overcloud minor update, at the software configuration step, pcs turns off the VLAN interfaces on the controller nodes and does not bring them back up. This causes the minor update to fail and stops the cluster altogether.

The issue is only seen on interfaces named 'vlanxx'.

Also, when running nova hypervisor-list on the undercloud, the compute nodes appear down.
But logging in to these nodes manually and checking the nova service statuses shows that the services are up.

$ nova hypervisor-list
+----+-------------------------------+-------+---------+
| ID | Hypervisor hostname           | State | Status  |
+----+-------------------------------+-------+---------+
| 2  | 798012-compute003.localdomain | down  | enabled |
| 5  | 798013-compute004.localdomain | down  | enabled |
| 8  | 798015-compute006.localdomain | down  | enabled |
| 11 | 798014-compute005.localdomain | down  | enabled |
| 14 | 798011-compute002.localdomain | down  | enabled |
| 17 | 798010-compute001.localdomain | down  | enabled |
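
A small filter can pull just the affected hostnames out of that table, which is handy when the same check has to be repeated after each workaround attempt. This is only a sketch: the function name down_hypervisors is made up for illustration, and it assumes the four-column table layout shown above.

```shell
# Read `nova hypervisor-list` table output on stdin and print the
# hostname of every row whose State column is "down".
# Border lines contain no '|' and the header row has State == "State",
# so neither is printed.
down_hypervisors() {
    awk -F'|' 'NF >= 5 {
        gsub(/ /, "", $3)   # hostname column
        gsub(/ /, "", $4)   # state column
        if ($4 == "down") print $3
    }'
}
```

Example use: nova hypervisor-list | down_hypervisors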


Here are the stack deployment status outputs:


$ openstack stack resource list iad-4679686 | grep  FAILED
| Controller                                | 148dfb9e-b535-4866-a6b5-edd1ded6f6a9            | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-12-31T23:58:50 |
| Compute                                   | f417b699-c08a-4a63-84ac-640ff70f41cc            | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2017-01-01T02:10:14 |
$ openstack stack resource list 148dfb9e-b535-4866-a6b5-edd1ded6f6a9
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time        |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| 1             | 5b5448c2-7f0e-4c3c-9531-2efcf4afbbdd | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T23:58:53 |
| 0             | 51d37914-f023-44be-b27f-8407bff236c1 | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:59:09 |
| 2             | 0485c1da-9ecc-4d61-94e2-23e266500f4a | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:58:58 |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
[stack@798006-director01 161220-11524]$ openstack stack resource list 51d37914-f023-44be-b27f-8407bff236c1
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status | updated_time        |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| NodeUserData             | 27b9fd0a-75fd-4904-b28c-165a87f7f950 | OS::TripleO::NodeUserData                       | UPDATE_COMPLETE | 2016-12-31T22:59:24 |
| ControllerExtraConfigPre | 1a7ea770-ef39-4958-a63e-916f638d722a | OS::TripleO::ControllerExtraConfigPre           | UPDATE_COMPLETE | 2016-09-22T00:20:09 |
| NodeTLSCAData            | 9d9bb7e2-b741-4fbb-94b3-ceaed078ba4a | OS::TripleO::NodeTLSCAData                      | UPDATE_COMPLETE | 2016-12-31T22:59:41 |
| ExternalPort             | 627fffe3-0177-45ac-b31a-56e0e89a64f2 | OS::TripleO::Controller::Ports::ExternalPort    | UPDATE_COMPLETE | 2016-12-31T22:59:31 |
| StorageMgmtPort          | 6e35726c-09a5-42ba-a3a9-e3f752f29132 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
| InternalApiPort          | 613b398a-8a6a-420c-8680-8e67af9e11d1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-12-31T22:59:30 |
| NodeTLSData              | 702077fc-0655-4e78-acc2-be58eb7116d2 | OS::TripleO::NodeTLSData                        | UPDATE_COMPLETE | 2016-12-31T22:59:45 |
| StoragePort              | 69574f4d-d390-429c-bd95-6e7a5a88b230 | OS::TripleO::Controller::Ports::StoragePort     | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| Controller               | 9763fd7e-a921-447b-8655-7ecc3109d94f | OS::Nova::Server                                | UPDATE_COMPLETE | 2016-12-31T22:59:29 |
| NodeAdminUserData        | 2d4e819d-8aed-46da-b546-d3598804e6b9 | OS::TripleO::NodeAdminUserData                  | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| NetIpMap                 | 477ddd95-e35c-4ffb-85e7-b58cb49b58c3 | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-12-31T22:59:39 |
| NodeExtraConfig          | 53299b6f-4c98-4a05-b225-7f4cf5fb6e87 | OS::TripleO::NodeExtraConfig                    | UPDATE_COMPLETE | 2016-09-22T00:20:13 |
| TenantPort               | 87152981-c16a-4f02-8ff2-5e98e262e2c4 | OS::TripleO::Controller::Ports::TenantPort      | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| NetworkConfig            | 9d6588c2-6b07-4d8d-8e46-c20693e70961 | OS::TripleO::Controller::Net::SoftwareConfig    | UPDATE_COMPLETE | 2016-12-31T22:59:37 |
| UpdateConfig             | 9e7021a9-396d-42c3-a255-6d88d909ae49 | OS::TripleO::Tasks::PackageUpdate               | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| ControllerDeployment     | 6320ec81-044a-4bc9-b65b-a66db1f8b9db | OS::TripleO::SoftwareDeployment                 | UPDATE_COMPLETE | 2016-09-21T23:56:31 |
| ControllerConfig         | 0d30a522-f72d-4bb3-b80d-5d6c00da8f6f | OS::Heat::StructuredConfig                      | CREATE_COMPLETE | 2016-09-21T18:47:33 |
| NetIpSubnetMap           | d21a19ef-ef0b-4832-9ee2-5d5b800ff07b | OS::TripleO::Network::Ports::NetIpSubnetMap     | UPDATE_COMPLETE | 2016-12-31T22:59:38 |
| UpdateDeployment         | 2d959540-f337-4e13-b4ac-2f159ac4c914 | OS::Heat::SoftwareDeployment                    | UPDATE_FAILED   | 2016-12-31T23:00:54 |
| UserData                 | cfb8358e-6441-4697-98e2-49845b439ff5 | OS::Heat::MultipartMime                         | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| NetworkDeployment        | ed034810-b845-4152-bdb9-67cf6543cbfd | OS::TripleO::SoftwareDeployment                 | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| ManagementPort           | 56e921c6-d785-4e9b-b3fc-0387eaf65eb5 | OS::TripleO::Controller::Ports::ManagementPort  | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+

$ openstack software deployment output show 2d959540-f337-4e13-b4ac-2f159ac4c914 --all
output_values:

  deploy_stdout: |
    ...
    warning: /var/lib/logrotate.status saved as /var/lib/logrotate.status.rpmsave
    Created symlink from /etc/systemd/system/sockets.target.wants/virtlogd.socket to /usr/lib/systemd/system/virtlogd.socket.
    2670 blocks
    yum return code: 0
    Starting cluster node
    Starting Cluster...
    Redirecting to /bin/systemctl start  corosync.service
    Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.
    
    ERROR 798007-controller01 failed to join cluster in 600 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    (truncated, view all with --long)
  update_managed_packages: false
  deploy_status_code: 1

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-41.el7ost.noarch
openstack-heat-engine-6.0.0-12.el7ost.noarch
openstack-heat-templates-0-0.3.96a0b0bgit.el7ost.noarch



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Navneet Krishnan 2017-01-17 09:32:40 UTC
A workaround was applied to continue and complete the stack update:

**** Regarding the controllers.

1. Make sure the VLAN interfaces are up:

# for i in 204 206 207; do ifup vlan$i; done

2. Restart the Open vSwitch daemons:

# systemctl restart openvswitch.service openvswitch-nonetwork.service neutron-openvswitch-agent.service

3. Remove the stale files:

# mount | grep xxx.xx.xx.xx 
# umount -f /var/lib....

4. Restart the cluster daemons:

# systemctl restart corosync.service pcsd.service

5. Start the cluster:

# pcs cluster start
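
After step 5 it can help to wait until the node has actually joined before rerunning the update, since the failed deployment above gave up after 600 seconds. The helper below is a generic retry loop, not part of pcs or TripleO; the name wait_for and its arguments are hypothetical. It relies only on the command returning non-zero while the cluster is not running, as pcs does in the deploy_stderr output above.

```shell
# Retry a command until it succeeds or TIMEOUT seconds have elapsed.
# Usage: wait_for TIMEOUT INTERVAL COMMAND [ARGS...]
wait_for() {
    timeout=$1 interval=$2
    shift 2
    waited=0
    # Command output is discarded; only the exit status matters.
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: '$*' did not succeed in $timeout seconds" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

Example use on a controller: wait_for 600 10 pcs status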

We then reran the openstack overcloud update. When the update turned off the ifcfg-vlan* interfaces again, we quickly brought them back up manually, and after that the update continued.

Another thing I found on both nodes is that /var/lib/glance/images has the wrong owner and permissions after the update.

# chown glance:glance /var/lib/glance/images
# chmod 755 /var/lib/glance/images
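
To confirm whether a node is affected before changing anything, the owner and mode can be compared against the expected values. This is a sketch that assumes GNU stat (as shipped on RHEL); the function name check_owner_mode is invented for illustration.

```shell
# Succeed only if PATH has the expected "user:group" owner and octal mode.
# Returns 1 on a mismatch and 2 if stat itself fails (e.g. missing path).
check_owner_mode() {
    path=$1 want_owner=$2 want_mode=$3
    actual=$(stat -c '%U:%G %a' "$path") || return 2
    if [ "$actual" = "$want_owner $want_mode" ]; then
        echo "OK $path ($actual)"
    else
        echo "MISMATCH $path: have '$actual', want '$want_owner $want_mode'"
        return 1
    fi
}
```

Example use: check_owner_mode /var/lib/glance/images glance:glance 755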

**** Regarding the computes.

I only needed to bring the VLAN interfaces back up:

# for i in 203 204 206 207; do ifup vlan$i; done
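
The hard-coded VLAN IDs in the loops above differ between controllers and computes, so a variant that discovers the interfaces from their ifcfg files may be easier to reuse. The following is a sketch under assumptions: the function names and the CFG_DIR, SYS_NET and APPLY variables are made up here, the ifcfg files are assumed to live in the standard network-scripts directory, and an interface is treated as down when its operstate is anything other than "up". By default it only prints the interfaces that need attention.

```shell
# Both paths can be overridden to try the logic on a scratch directory tree.
CFG_DIR="${CFG_DIR:-/etc/sysconfig/network-scripts}"
SYS_NET="${SYS_NET:-/sys/class/net}"

# Print the interface name for every ifcfg-vlan* file in CFG_DIR.
list_vlan_ifaces() {
    for f in "$CFG_DIR"/ifcfg-vlan*; do
        [ -e "$f" ] || continue                 # glob matched nothing
        case "$f" in *.bak|*.rpmsave|*~) continue ;; esac   # skip backups
        basename "$f" | sed 's/^ifcfg-//'
    done
}

# Succeed if the interface is missing or its operstate is not "up".
iface_is_down() {
    state_file="$SYS_NET/$1/operstate"
    [ -r "$state_file" ] || return 0
    [ "$(cat "$state_file")" != "up" ]
}

# Print each VLAN interface that is down; with APPLY=1, also run ifup on it.
restore_vlans() {
    for i in $(list_vlan_ifaces); do
        if iface_is_down "$i"; then
            echo "$i"
            if [ "${APPLY:-0}" = 1 ]; then
                ifup "$i"
            fi
        fi
    done
    return 0
}
```

Running restore_vlans lists the down interfaces; APPLY=1 restore_vlans also brings them up with ifup, as in the manual loop.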

Comment 5 Assaf Muller 2017-04-28 21:17:17 UTC
I don't think it's possible to root cause this without the logs of the event, which are missing per comment 2. I'm closing this as cannot reproduce.

Comment 17 Red Hat Bugzilla 2023-09-14 03:37:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

