Bug 1413896

Summary: During stack update, VLAN interfaces are turned off and do not turn back on.
Product: Red Hat OpenStack Reporter: Navneet Krishnan <nkrishna>
Component: openvswitch Assignee: Timothy Redaelli <tredaelli>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Ofer Blaut <oblaut>
Severity: medium Docs Contact:
Priority: high    
Version: 9.0 (Mitaka) CC: amuller, apevec, aschultz, athomas, bschmaus, chrisw, dbecker, fdinitto, fleitner, jraju, mburns, mcornea, morazi, nkrishna, rhel-osp-director-maint, rhos-maint, rkhan, skaplons, srevivo, tredaelli
Target Milestone: --- Keywords: Reopened, Triaged, ZStream
Target Release: 9.0 (Mitaka)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-12-06 16:12:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Navneet Krishnan 2017-01-17 09:29:37 UTC
Description of problem:

During an overcloud minor update, at the software configuration step, pcs turns off the VLAN interfaces on the controller nodes and does not bring them back up. This causes the minor update to fail and stops the cluster altogether.

The issue is seen only on interfaces named 'vlanxx'.

Also, when running nova hypervisor-list on the undercloud, the compute nodes appear down.
However, logging in to these nodes manually and checking the nova service statuses shows that the services are up.

$ nova hypervisor-list
+----+-------------------------------+-------+---------+
| ID | Hypervisor hostname           | State | Status  |
+----+-------------------------------+-------+---------+
| 2  | 798012-compute003.localdomain | down  | enabled |
| 5  | 798013-compute004.localdomain | down  | enabled |
| 8  | 798015-compute006.localdomain | down  | enabled |
| 11 | 798014-compute005.localdomain | down  | enabled |
| 14 | 798011-compute002.localdomain | down  | enabled |
| 17 | 798010-compute001.localdomain | down  | enabled |


Here are the stack deployment status outputs:


$ openstack stack resource list iad-4679686 | grep  FAILED
| Controller                                | 148dfb9e-b535-4866-a6b5-edd1ded6f6a9            | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2016-12-31T23:58:50 |
| Compute                                   | f417b699-c08a-4a63-84ac-640ff70f41cc            | OS::Heat::ResourceGroup                           | UPDATE_FAILED   | 2017-01-01T02:10:14 |
$ openstack stack resource list 148dfb9e-b535-4866-a6b5-edd1ded6f6a9
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| resource_name | physical_resource_id                 | resource_type           | resource_status | updated_time        |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
| 1             | 5b5448c2-7f0e-4c3c-9531-2efcf4afbbdd | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T23:58:53 |
| 0             | 51d37914-f023-44be-b27f-8407bff236c1 | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:59:09 |
| 2             | 0485c1da-9ecc-4d61-94e2-23e266500f4a | OS::TripleO::Controller | UPDATE_FAILED   | 2016-12-31T22:58:58 |
+---------------+--------------------------------------+-------------------------+-----------------+---------------------+
[stack@798006-director01 161220-11524]$ openstack stack resource list 51d37914-f023-44be-b27f-8407bff236c1
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| resource_name            | physical_resource_id                 | resource_type                                   | resource_status | updated_time        |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+
| NodeUserData             | 27b9fd0a-75fd-4904-b28c-165a87f7f950 | OS::TripleO::NodeUserData                       | UPDATE_COMPLETE | 2016-12-31T22:59:24 |
| ControllerExtraConfigPre | 1a7ea770-ef39-4958-a63e-916f638d722a | OS::TripleO::ControllerExtraConfigPre           | UPDATE_COMPLETE | 2016-09-22T00:20:09 |
| NodeTLSCAData            | 9d9bb7e2-b741-4fbb-94b3-ceaed078ba4a | OS::TripleO::NodeTLSCAData                      | UPDATE_COMPLETE | 2016-12-31T22:59:41 |
| ExternalPort             | 627fffe3-0177-45ac-b31a-56e0e89a64f2 | OS::TripleO::Controller::Ports::ExternalPort    | UPDATE_COMPLETE | 2016-12-31T22:59:31 |
| StorageMgmtPort          | 6e35726c-09a5-42ba-a3a9-e3f752f29132 | OS::TripleO::Controller::Ports::StorageMgmtPort | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
| InternalApiPort          | 613b398a-8a6a-420c-8680-8e67af9e11d1 | OS::TripleO::Controller::Ports::InternalApiPort | UPDATE_COMPLETE | 2016-12-31T22:59:30 |
| NodeTLSData              | 702077fc-0655-4e78-acc2-be58eb7116d2 | OS::TripleO::NodeTLSData                        | UPDATE_COMPLETE | 2016-12-31T22:59:45 |
| StoragePort              | 69574f4d-d390-429c-bd95-6e7a5a88b230 | OS::TripleO::Controller::Ports::StoragePort     | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| Controller               | 9763fd7e-a921-447b-8655-7ecc3109d94f | OS::Nova::Server                                | UPDATE_COMPLETE | 2016-12-31T22:59:29 |
| NodeAdminUserData        | 2d4e819d-8aed-46da-b546-d3598804e6b9 | OS::TripleO::NodeAdminUserData                  | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| NetIpMap                 | 477ddd95-e35c-4ffb-85e7-b58cb49b58c3 | OS::TripleO::Network::Ports::NetIpMap           | UPDATE_COMPLETE | 2016-12-31T22:59:39 |
| NodeExtraConfig          | 53299b6f-4c98-4a05-b225-7f4cf5fb6e87 | OS::TripleO::NodeExtraConfig                    | UPDATE_COMPLETE | 2016-09-22T00:20:13 |
| TenantPort               | 87152981-c16a-4f02-8ff2-5e98e262e2c4 | OS::TripleO::Controller::Ports::TenantPort      | UPDATE_COMPLETE | 2016-12-31T22:59:32 |
| NetworkConfig            | 9d6588c2-6b07-4d8d-8e46-c20693e70961 | OS::TripleO::Controller::Net::SoftwareConfig    | UPDATE_COMPLETE | 2016-12-31T22:59:37 |
| UpdateConfig             | 9e7021a9-396d-42c3-a255-6d88d909ae49 | OS::TripleO::Tasks::PackageUpdate               | UPDATE_COMPLETE | 2016-12-31T22:59:23 |
| ControllerDeployment     | 6320ec81-044a-4bc9-b65b-a66db1f8b9db | OS::TripleO::SoftwareDeployment                 | UPDATE_COMPLETE | 2016-09-21T23:56:31 |
| ControllerConfig         | 0d30a522-f72d-4bb3-b80d-5d6c00da8f6f | OS::Heat::StructuredConfig                      | CREATE_COMPLETE | 2016-09-21T18:47:33 |
| NetIpSubnetMap           | d21a19ef-ef0b-4832-9ee2-5d5b800ff07b | OS::TripleO::Network::Ports::NetIpSubnetMap     | UPDATE_COMPLETE | 2016-12-31T22:59:38 |
| UpdateDeployment         | 2d959540-f337-4e13-b4ac-2f159ac4c914 | OS::Heat::SoftwareDeployment                    | UPDATE_FAILED   | 2016-12-31T23:00:54 |
| UserData                 | cfb8358e-6441-4697-98e2-49845b439ff5 | OS::Heat::MultipartMime                         | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| NetworkDeployment        | ed034810-b845-4152-bdb9-67cf6543cbfd | OS::TripleO::SoftwareDeployment                 | CREATE_COMPLETE | 2016-08-10T21:33:54 |
| ManagementPort           | 56e921c6-d785-4e9b-b3fc-0387eaf65eb5 | OS::TripleO::Controller::Ports::ManagementPort  | UPDATE_COMPLETE | 2016-12-31T22:59:33 |
+--------------------------+--------------------------------------+-------------------------------------------------+-----------------+---------------------+

$ openstack software deployment output show 2d959540-f337-4e13-b4ac-2f159ac4c914 --all
output_values:

  deploy_stdout: |
    ...
    warning: /var/lib/logrotate.status saved as /var/lib/logrotate.status.rpmsave
    Created symlink from /etc/systemd/system/sockets.target.wants/virtlogd.socket to /usr/lib/systemd/system/virtlogd.socket.
    2670 blocks
    yum return code: 0
    Starting cluster node
    Starting Cluster...
    Redirecting to /bin/systemctl start  corosync.service
    Job for corosync.service failed because the control process exited with error code. See "systemctl status corosync.service" and "journalctl -xe" for details.
    
    ERROR 798007-controller01 failed to join cluster in 600 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    Error: cluster is not currently running on this node
    (truncated, view all with --long)
  update_managed_packages: false
  deploy_status_code: 1
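
The drill-down above (stack, then nested resource group, then the failed deployment) can be sketched as a small filter. This is a minimal sketch and not part of the original report: `failed_resources` is a hypothetical helper name, and it assumes the five-column table layout printed by `openstack stack resource list` in the output above.

```shell
# Hypothetical helper: read an `openstack stack resource list` table on stdin
# and print "<resource_name> <physical_resource_id>" for every row whose
# resource_status column ends in FAILED.
failed_resources() {
    awk -F'|' '
        # Data rows split into at least 6 fields; field 5 is resource_status.
        # Border lines contain no "|" and the header row never matches FAILED.
        NF >= 6 && $5 ~ /FAILED[ \t]*$/ {
            gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim resource_name
            gsub(/^[ \t]+|[ \t]+$/, "", $3)   # trim physical_resource_id
            print $2, $3
        }
    '
}
```

For example, `openstack stack resource list iad-4679686 | failed_resources` would print one name/ID pair per failed resource, and each ID can then be passed to the next `openstack stack resource list` call.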

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-2.0.0-41.el7ost.noarch
openstack-heat-engine-6.0.0-12.el7ost.noarch
openstack-heat-templates-0-0.3.96a0b0bgit.el7ost.noarch




Comment 1 Navneet Krishnan 2017-01-17 09:32:40 UTC
A workaround was applied to continue and complete the stack update:

**** Regarding the controllers.

1. Make sure the VLANs are up:

# for i in 204 206 207; do ifup vlan$i; done

2. Restart the openvswitch daemons:

# systemctl restart openvswitch.service openvswitch-nonetwork.service neutron-openvswitch-agent.service

3. Remove the stale files:

# mount | grep xxx.xx.xx.xx 
# umount -f /var/lib....

4. Restart the cluster daemons:

# systemctl restart corosync.service pcsd.service

5. Start the cluster:

# pcs cluster start

We then reran the openstack overcloud update. As soon as the update turned off the ifcfg-vlan* interfaces, we brought them back up manually, and the update then continued.
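
The manual step of turning the interfaces back on during the update could be approximated with a polling loop. This is only a sketch under stated assumptions: `down_ifaces` and `watch_vlans` are hypothetical helper names, `SYSFS_NET` and `POLL_SECONDS` are assumed overrides that exist only for testing and tuning, and the check looks at the administrative IFF_UP bit (0x1) in the sysfs `flags` file rather than at link state.

```shell
# Hypothetical helper: print each named interface that is missing or
# administratively down (IFF_UP, bit 0x1 of the sysfs "flags" file, is clear).
down_ifaces() {
    for ifc in "$@"; do
        flags="$(cat "${SYSFS_NET:-/sys/class/net}/$ifc/flags" 2>/dev/null)"
        # An unreadable flags file (interface absent) is treated as down.
        if [ -z "$flags" ] || [ $(( $flags & 1 )) -eq 0 ]; then
            echo "$ifc"
        fi
    done
}

# Hypothetical helper: poll forever and run ifup on any interface found down.
# The 5-second default interval is an assumption, not a value from the report.
watch_vlans() {
    while :; do
        for down in $(down_ifaces "$@"); do
            echo "re-enabling $down"
            ifup "$down"
        done
        sleep "${POLL_SECONDS:-5}"
    done
}
```

It would be run as root on a controller while the update is in progress, for example `watch_vlans vlan204 vlan206 vlan207`, and stopped with Ctrl-C once the update completes.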

I also found that, on both nodes, /var/lib/glance/images has the wrong owner and permissions after the update:

# chown glance:glance /var/lib/glance/images
# chmod 755 /var/lib/glance/images
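
Before applying the chown/chmod fix, the directory state can be checked first. The following is a minimal sketch: `check_owner_mode` is a hypothetical helper name, and it assumes GNU stat (the `-c` format option), as shipped on RHEL.

```shell
# Hypothetical helper: succeed only if a path has the expected owner:group
# and octal mode. Relies on GNU stat: %U owner, %G group, %a octal mode.
#   $1 = path, $2 = expected "owner:group", $3 = expected octal mode
check_owner_mode() {
    actual="$(stat -c '%U:%G %a' "$1")" || return 2   # 2: path could not be read
    if [ "$actual" = "$2 $3" ]; then
        return 0
    fi
    echo "$1: expected $2 $3, found $actual" >&2
    return 1                                          # 1: owner or mode differs
}
```

A call such as `check_owner_mode /var/lib/glance/images glance:glance 755` returns non-zero and reports the mismatch when the fix above is still needed.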

**** Regarding the computes.

I only needed to bring the VLANs back up:

# for i in 203 204 206 207; do ifup vlan$i; done

Comment 5 Assaf Muller 2017-04-28 21:17:17 UTC
I don't think it's possible to root cause this without the logs of the event, which are missing per comment 2. I'm closing this as cannot reproduce.

Comment 17 Red Hat Bugzilla 2023-09-14 03:37:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days