Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1833757

Summary: [cinder backup] VolumesBackupsTest failures during update OSP16 Z1 composable
Product: Red Hat OpenStack
Component: openstack-tempest
Version: 16.0 (Train)
Status: CLOSED NOTABUG
Severity: medium
Priority: medium
Reporter: Ronnie Rasouli <rrasouli>
Assignee: Chandan Kumar <chkumar>
QA Contact: Martin Kopec <mkopec>
Docs Contact:
CC: abishop, apevec, lhh, ltoscano, mbollo, sathlang, slinaber, udesale
Keywords: Triaged
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-15 16:14:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
  tempest config file from this run (flags: none)

Description Ronnie Rasouli 2020-05-10 10:02:17 UTC
Created attachment 1686972 [details]
tempest config file from this run

Description of problem:

There are 2 tests that are intermittently failing in the cinder backup tempest suite:

Run Tempest Tests / test_backup_create_attached_volume
Run Tempest Tests / test_bootable_volume_backup_and_restore

Version-Release number of selected component (if applicable):
core_puddle: RHOS_TRUNK-16.0-RHEL-8-20200226.n.1


How reproducible:
most likely

Steps to Reproduce:
1. Deploy OSP16 composable 3: networker, ceph, DB, controller, API, plus 2 computes
2. Update to the latest RHOS16 Z1 puddle
3. Perform a tempest run that includes the cinder_backup suite (a sketch of one way to invoke it follows)
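
For illustration, one way to trigger just these tests from an existing tempest workspace (the regex and workspace path are assumptions, not what the CI job actually uses):

import subprocess

# Illustrative only: run the cinder backup tests via the tempest CLI.
subprocess.run(
    ["tempest", "run", "--regex",
     "tempest.api.volume.test_volumes_backup.VolumesBackupsTest"],
    cwd="/home/stack/tempest-workspace",  # assumed pre-initialized tempest workspace
    check=True,
)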

Actual results:
2 tests are failing 

Expected results:
pass

Additional info:

    Response - Headers: {'date': 'Fri, 08 May 2020 08:50:48 GMT', 'server': 'Apache', 'content-length': '942', 'x-compute-request-id': 'req-a9577083-c391-474b-be7e-3595b266b517', 'openstack-api-version': 'volume 3.0', 'vary': 'OpenStack-API-Version,Accept-Encoding', 'x-openstack-request-id': 'req-a9577083-c391-474b-be7e-3595b266b517', 'connection': 'close', 'content-type': 'application/json', 'status': '200', 'content-location': 'https://[2620:52:0:13b8:5054:ff:fe3e:1]:13776/v3/465b6aaa9faf42c2bc0e48677d4057ad/volumes/f883f104-d8cd-434b-a54c-229171acba50'}
        Body: b'{"volume": {"id": "f883f104-d8cd-434b-a54c-229171acba50", "status": "error", "size": 1, "availability_zone": "nova", "created_at": "2020-05-08T08:50:47.000000", "updated_at": "2020-05-08T08:50:47.000000", "attachments": [], "name": "tempest-VolumesBackupsTest-Volume-192006871", "description": null, "volume_type": "tripleo", "snapshot_id": null, "source_volid": null, "metadata": {}, "links": [{"rel": "self", "href": "https://[2620:52:0:13b8:5054:ff:fe3e:1]:13776/v3/465b6aaa9faf42c2bc0e48677d4057ad/volumes/f883f104-d8cd-434b-a54c-229171acba50"}, {"rel": "bookmark", "href": "https://[2620:52:0:13b8:5054:ff:fe3e:1]:13776/465b6aaa9faf42c2bc0e48677d4057ad/volumes/f883f104-d8cd-434b-a54c-229171acba50"}], "user_id": "5bbf50a84b264973b9ed7a1345692b47", "bootable": "false", "encrypted": false, "replication_status": null, "consistencygroup_id": null, "multiattach": false, "os-vol-tenant-attr:tenant_id": "465b6aaa9faf42c2bc0e48677d4057ad"}}'
}}}

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
    return f(*func_args, **func_kwargs)
  File "/usr/lib/python3.6/site-packages/tempest/api/volume/test_volumes_backup.py", line 129, in test_bootable_volume_backup_and_restore
    volume = self.create_volume(imageRef=img_uuid)
  File "/usr/lib/python3.6/site-packages/tempest/api/volume/base.py", line 136, in create_volume
    volume['id'], wait_until)
  File "/usr/lib/python3.6/site-packages/tempest/common/waiters.py", line 204, in wait_for_volume_resource_status
    resource_name=resource_name, resource_id=resource_id)
tempest.exceptions.VolumeResourceBuildErrorException: volume f883f104-d8cd-434b-a54c-229171acba50 failed to build and is in ERROR status
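
For context, the exception above comes from tempest's volume waiter, which gives up as soon as the volume reports "error" rather than waiting out the timeout. A minimal sketch of that polling pattern (simplified and illustrative, not the actual code in tempest/common/waiters.py; show_volume follows the tempest volumes client naming):

import time

class VolumeResourceBuildErrorException(Exception):
    """Raised when the volume lands in 'error' instead of the expected status."""

def wait_for_volume_status(client, volume_id, expected='available',
                           timeout=300, interval=5):
    # Poll until the volume reaches `expected`, goes to 'error', or times out.
    start = time.time()
    while time.time() - start < timeout:
        volume = client.show_volume(volume_id)['volume']
        if volume['status'] == expected:
            return volume
        if volume['status'] == 'error':
            # The path taken here: the volume went straight to ERROR.
            raise VolumeResourceBuildErrorException(
                'volume %s failed to build and is in ERROR status' % volume_id)
        time.sleep(interval)
    raise TimeoutError('volume %s never reached status %s' % (volume_id, expected))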

Comment 2 Luigi Toscano 2020-05-11 08:18:58 UTC
Is this an update test to 16.0.1, or an update from 16.0.1?

Also, if it fails intermittently, is it possible to see *recent* builds where it didn't fail?

It is worth noting that the reported failed job uses IPv6. Are the tests passing on a composable non-IPv6 environment?

Comment 3 Sofer Athlan-Guyot 2020-05-11 09:05:47 UTC
Hi,

(In reply to Luigi Toscano from comment #2)
> Is this an update test to 16.0.1, or an update from 16.0.1?

It's an update from 16.0z1 to the phase1 puddle, RHOS_TRUNK-16.0-RHEL-8-20200506.n.2.

> 
> Also, if it fails intermittently, it is possible to see *recent* builds
> where they didn't fail?

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-ga-composable-ipv6/

It's from 16.0 GA to the same puddle.

> 
> It is worth noting that the reported failed job uses IPv6. Are the test
> passing on a composable non-IPv6 environment?

See the job above, where it passes in an IPv6 env.

Another thing to note here is that the failed tempest run happens after *deployment*. After the update and reboot, everything is fine.

Comment 5 Alan Bishop 2020-05-11 14:28:14 UTC
In case this helps...

I looked at the logs from the link in comment #1 for the request to create volume 505e0f32-25a9-4f48-84a2-4cb7cc527497. The request was serviced by the cinder-scheduler running on controller-2, and the logs show the RBD backend was down at that time.

1. The RBD backend was up at 08:39:54.458

2020-05-08 08:39:54.458 7 DEBUG cinder.scheduler.host_manager [req-8be67f6c-0a35-4ccd-85f5-3620d8ee517d - - - - -] Received volume service update from  hostgroup@tripleo_ceph: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 42.29, 'free_capacity_gb': 42.29, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:17cb5504-90fb-11ea-b0bd-525400dae9aa:openstack:volumes', 'backend_state': 'up', 'volume_backend_name': 'tripleo_ceph', 'replication_enabled': False, 'provisioned_capacity_gb': 0, 'allocated_capacity_gb': 1, 'filter_function': None, 'goodness_function': None} update_service_capabilities /usr/lib/python3.6/site-packages/cinder/scheduler/host_manager.py:557

2. The RBD backend was marked down due to no updates from the cinder-volume service

2020-05-08 08:42:51.797 7 WARNING cinder.scheduler.host_manager [req-1b0aa3c9-ce0f-403a-91f8-d88938140f6a c9b667db7211404c8507cfc727ae9291 14f5bd5bfb3a4b4895d73c5b4e06cbe1 - default default] volume service is down. (host: hostgroup@tripleo_ceph)

2020-05-08 08:50:43.083 7 WARNING cinder.scheduler.host_manager [req-00fe0179-febc-4ef7-bea7-a79d6c81ca7e 566530ebc36349d18b416444302fed42 bc912765a476400e945b5cfde79cc213 - default default] volume service is down. (host: hostgroup@tripleo_ceph)

3. Volume creation failed due to no available backend (the RBD backend was down)

2020-05-08 08:50:43.708 7 DEBUG cinder.volume.flows.common [req-bbe36aac-2a8c-4c94-8f57-08d15555a010 5bbf50a84b264973b9ed7a1345692b47 465b6aaa9faf42c2bc0e48677d4057ad - default default] Setting Volume 505e0f32-25a9-4f48-84a2-4cb7cc527497 to error due to: No valid backend was found. No weighed backends available error_out /usr/lib/python3.6/site-packages/cinder/volume/flows/common.py:83

4. The RBD backend came back up at 08:55:24.142

2020-05-08 08:55:24.142 7 DEBUG cinder.scheduler.host_manager [req-c5a27980-a6d6-4aff-a124-91f463fcfd84 - - - - -] Received volume service update from  hostgroup@tripleo_ceph: {'vendor_name': 'Open Source', 'driver_version': '1.2.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 42.34, 'free_capacity_gb': 42.34, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:17cb5504-90fb-11ea-b0bd-525400dae9aa:openstack:volumes', 'backend_state': 'up', 'volume_backend_name': 'tripleo_ceph', 'replication_enabled': False, 'provisioned_capacity_gb': 1, 'allocated_capacity_gb': 2, 'filter_function': None, 'goodness_function': None} update_service_capabilities /usr/lib/python3.6/site-packages/cinder/scheduler/host_manager.py:557
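
Items 2 and 3 above come down to the scheduler treating a backend as unusable once its capability reports go stale. A minimal illustration of that staleness check (not cinder's actual host_manager code; the 60-second grace period mirrors cinder's service_down_time default and is an assumption here):

import time

SERVICE_DOWN_TIME = 60  # assumed grace period, analogous to cinder's service_down_time option

class BackendState:
    def __init__(self, host):
        self.host = host
        self.capabilities = {}
        self.last_update = None

    def update_capabilities(self, capabilities):
        # Called whenever the cinder-volume service reports in (items 1 and 4 above).
        self.capabilities = capabilities
        self.last_update = time.time()

    def is_up(self):
        # A backend with no recent report is treated as down (item 2 above).
        return (self.last_update is not None and
                time.time() - self.last_update < SERVICE_DOWN_TIME)

def pick_backend(backends):
    # With every backend stale, scheduling fails the way item 3 above shows.
    available = [b for b in backends if b.is_up()]
    if not available:
        raise RuntimeError('No valid backend was found. No weighed backends available')
    return available[0]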

Pacemaker was running the cinder-volume service on controller-2, and its logs confirm a "blackout" (a gap in log output) during that period.

1. Periodic update for RBD backend occurred at 08:39:54.449

2020-05-08 08:39:54.449 71 DEBUG cinder.manager [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /usr/lib/python3.6/site-packages/cinder/manager.py:194

2. Next scheduled periodic update starts 1 minute later

2020-05-08 08:40:54.349 71 DEBUG oslo_service.periodic_task [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:217

2020-05-08 08:40:54.406 71 DEBUG cinder.volume.drivers.rbd [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] connecting to openstack@ceph (conf=/etc/ceph/ceph.conf, timeout=-1). _do_conn /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:411

3. Something (network?) caused the process to hang, because the next messages are 15 minutes later

2020-05-08 08:55:24.053 71 ERROR oslo.messaging._drivers.impl_rabbit [-] [c070ce5f-2599-4cd6-93c9-00a59cc635ed] AMQP server on messaging-2.internalapi.redhat.local:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>

2020-05-08 08:55:24.055 71 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer

2020-05-08 08:55:24.067 71 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection

2020-05-08 08:55:24.155 71 WARNING oslo.service.loopingcall [req-9d1b119e-b2dd-40d7-afff-9eeb8342bd22 - - - - -] Function 'cinder.service.Service.periodic_tasks' run outlasted interval by 809.81 sec
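
The "outlasted interval" warning above is what a fixed-interval periodic loop emits when a single iteration blocks far longer than its interval. A rough sketch of that behaviour (illustrative, not oslo.service's actual loopingcall code):

import logging
import time

LOG = logging.getLogger(__name__)

def run_periodic(task, interval=60):
    # Simplified fixed-interval loop: if the task (here, the capability report
    # that got stuck on the RBD/AMQP call) runs longer than the interval, log
    # how far it overran instead of trying to catch up.
    while True:
        start = time.time()
        task()
        elapsed = time.time() - start
        delay = interval - elapsed
        if delay <= 0:
            LOG.warning("Function %r run outlasted interval by %.2f sec",
                        task, -delay)
            delay = 0
        time.sleep(delay)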

I couldn't see any other logs indicating what might be happening (or not happening) during that time period, but things seemed to recover.