Bug 1833757

| Summary: | [cinder backup] VolumesBackupsTest failures during update OSP16 Z1 composable | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ronnie Rasouli <rrasouli> |
| Component: | openstack-tempest | Assignee: | Chandan Kumar <chkumar> |
| Status: | CLOSED NOTABUG | QA Contact: | Martin Kopec <mkopec> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 16.0 (Train) | CC: | abishop, apevec, lhh, ltoscano, mbollo, sathlang, slinaber, udesale |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-15 16:14:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description

Ronnie Rasouli 2020-05-10 10:02:17 UTC
Is this an update test to 16.0.1, or an update from 16.0.1?

Also, if it fails intermittently, is it possible to see *recent* builds where they didn't fail?

It is worth noting that the reported failed job uses IPv6. Are the tests passing on a composable non-IPv6 environment?

Hi,

(In reply to Luigi Toscano from comment #2)
> Is this an update test to 16.0.1, or an update from 16.0.1?

It's an update from 16.0z1 to the phase1 puddle, RHOS_TRUNK-16.0-RHEL-8-20200506.n.2.

> Also, if it fails intermittently, is it possible to see *recent* builds
> where they didn't fail?

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-ga-composable-ipv6/

It's from 16.0 GA to the same puddle.

> It is worth noting that the reported failed job uses IPv6. Are the tests
> passing on a composable non-IPv6 environment?

See above, where it passes on an IPv6 env.

Another thing to note here is that the tempest failure happens after *deployment*. After the update and reboot, everything is fine.

In case this helps... I looked at the logs from the link in comment #1 for the request to create volume 505e0f32-25a9-4f48-84a2-4cb7cc527497. The request was serviced by the cinder-scheduler running on controller-2, and the logs show the RBD backend was down at that time.

1. The RBD backend was up at 08:39:54.458

   2020-05-08 08:39:54.458 7 DEBUG cinder.scheduler.host_manager [req-8be67f6c-0a35-4ccd-85f5-3620d8ee517d - - - - -] Received volume service update from hostgroup@tripleo_ceph: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 42.29, 'free_capacity_gb': 42.29, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:17cb5504-90fb-11ea-b0bd-525400dae9aa:openstack:volumes', 'backend_state': 'up', 'volume_backend_name': 'tripleo_ceph', 'replication_enabled': False, 'provisioned_capacity_gb': 0, 'allocated_capacity_gb': 1, 'filter_function': None, 'goodness_function': None} update_service_capabilities /usr/lib/python3.6/site-packages/cinder/scheduler/host_manager.py:557

2. The RBD backend was marked down due to no updates from the cinder-volume service

   2020-05-08 08:42:51.797 7 WARNING cinder.scheduler.host_manager [req-1b0aa3c9-ce0f-403a-91f8-d88938140f6a c9b667db7211404c8507cfc727ae9291 14f5bd5bfb3a4b4895d73c5b4e06cbe1 - default default] volume service is down. (host: hostgroup@tripleo_ceph)
   2020-05-08 08:50:43.083 7 WARNING cinder.scheduler.host_manager [req-00fe0179-febc-4ef7-bea7-a79d6c81ca7e 566530ebc36349d18b416444302fed42 bc912765a476400e945b5cfde79cc213 - default default] volume service is down. (host: hostgroup@tripleo_ceph)

3. Volume creation failed due to no available backend (RBD is down)

   2020-05-08 08:50:43.708 7 DEBUG cinder.volume.flows.common [req-bbe36aac-2a8c-4c94-8f57-08d15555a010 5bbf50a84b264973b9ed7a1345692b47 465b6aaa9faf42c2bc0e48677d4057ad - default default] Setting Volume 505e0f32-25a9-4f48-84a2-4cb7cc527497 to error due to: No valid backend was found. No weighed backends available error_out /usr/lib/python3.6/site-packages/cinder/volume/flows/common.py:83

4. The RBD backend comes back at 08:55:24.142

   2020-05-08 08:55:24.142 7 DEBUG cinder.scheduler.host_manager [req-c5a27980-a6d6-4aff-a124-91f463fcfd84 - - - - -] Received volume service update from hostgroup@tripleo_ceph: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 42.34, 'free_capacity_gb': 42.34, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:17cb5504-90fb-11ea-b0bd-525400dae9aa:openstack:volumes', 'backend_state': 'up', 'volume_backend_name': 'tripleo_ceph', 'replication_enabled': False, 'provisioned_capacity_gb': 1, 'allocated_capacity_gb': 2, 'filter_function': None, 'goodness_function': None} update_service_capabilities /usr/lib/python3.6/site-packages/cinder/scheduler/host_manager.py:557
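For context on items 2 and 3 above: the cinder-scheduler only keeps a backend in its candidate list while the backing cinder-volume service keeps checking in within the configured service_down_time (60 seconds by default); once the updates stop, the backend is treated as down and the filter scheduler is left with no valid backends. A minimal sketch of that kind of check, not the actual cinder code:

```python
# Simplified illustration (not the cinder implementation) of the check behind
# the "volume service is down" warnings: a backend counts as up only if its
# cinder-volume service has checked in within the last service_down_time
# seconds. A backend whose service is down is filtered out of scheduling,
# which is why the scheduler then reports "No valid backend was found".
from datetime import datetime, timedelta

SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_heartbeat, now=None):
    """Return True if the service checked in recently enough to count as up."""
    now = now or datetime.utcnow()
    return (now - last_heartbeat) <= SERVICE_DOWN_TIME


# In this bug, cinder-volume last reported at 08:39:54 and then went silent,
# so by the 08:42:51 warning the backend was already treated as down.
last_report = datetime(2020, 5, 8, 8, 39, 54)
print(service_is_up(last_report, now=datetime(2020, 5, 8, 8, 42, 51)))  # False
```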
Pacemaker was running the cinder-volume service on controller-2. The logs confirm a "blackout" (gap) in the logs during that period.

1. The periodic update for the RBD backend occurred at 08:39:54.449

   2020-05-08 08:39:54.449 71 DEBUG cinder.manager [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /usr/lib/python3.6/site-packages/cinder/manager.py:194

2. The next scheduled periodic update starts 1 minute later

   2020-05-08 08:40:54.349 71 DEBUG oslo_service.periodic_task [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217
   2020-05-08 08:40:54.406 71 DEBUG cinder.volume.drivers.rbd [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] connecting to openstack@ceph (conf=/etc/ceph/ceph.conf, timeout=-1). _do_conn /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:411

3. Something (network?) causes the process to hang, because the next messages are 15 minutes later

   2020-05-08 08:55:24.053 71 ERROR oslo.messaging._drivers.impl_rabbit [-] [c070ce5f-2599-4cd6-93c9-00a59cc635ed] AMQP server on messaging-2.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
   2020-05-08 08:55:24.055 71 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
   2020-05-08 08:55:24.067 71 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
   2020-05-08 08:55:24.155 71 WARNING oslo.service.loopingcall [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Function 'cinder.service.Service.periodic_tasks' run outlasted interval by 809.81 sec

I couldn't see any other logs indicating what might be happening (or not happening) during that time period, but things seemed to recover.
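A rough, self-contained sketch (not the oslo.service or cinder code) of why one hung iteration has this effect: the capability report runs as a fixed-interval periodic task, so when a single run blocks, for example on a dead AMQP connection, no further reports go out until it returns, and the looping call then warns that the run outlasted its interval.

```python
# Rough model of a fixed-interval periodic loop like the one that runs
# VolumeManager.publish_service_capabilities every 60 seconds. If a single
# iteration blocks (here apparently ~15 minutes), nothing else in the loop
# runs: no capability updates reach the scheduler, the scheduler marks the
# backend down, and on return the loop warns that it outlasted its interval.
import time

INTERVAL = 60  # seconds between capability reports (the logs show 1-minute spacing)


def publish_service_capabilities():
    # Placeholder for the real work: gather backend stats and send them to
    # the schedulers over RPC.
    print("publishing capabilities at", time.strftime("%H:%M:%S"))


def periodic_loop(task, interval=INTERVAL, iterations=3):
    for _ in range(iterations):
        start = time.monotonic()
        task()  # a hang here stalls every later iteration
        overrun = (time.monotonic() - start) - interval
        if overrun > 0:
            # Corresponds to "run outlasted interval by 809.81 sec" above.
            print("run outlasted interval by %.2f sec" % overrun)
        time.sleep(max(interval - (time.monotonic() - start), 0))


# Quick demo with a short interval:
# periodic_loop(publish_service_capabilities, interval=2)
```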