Bug 1572607

Summary: Post FFWD upgrade, cinder volume creates using a ceph back end were failing.
Product: Red Hat OpenStack Reporter: Darin Sorrentino <dsorrent>
Component: openstack-cinderAssignee: Cinder Bugs List <cinder-bugs>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Avi Avraham <aavraham>
Severity: unspecified Docs Contact: Kim Nylander <knylande>
Priority: unspecified    
Version: 13.0 (Queens)CC: abishop, dsorrent, geguileo, rrubins, scohen, srevivo, tshefi
Target Milestone: ---Flags: dsorrent: needinfo-
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-05-30 13:56:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Cinder logs controller-0
none
Cinder logs controller-1
none
Cinder logs controller-2 none

Description Darin Sorrentino 2018-04-27 12:02:20 UTC
Description of problem:

Cinder volume creates were failing after upgrading Openstack to OSP 13 and Ceph to Ceph 3.

2018-04-27 07:31:43.517 1 WARNING cinder.scheduler.host_manager [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] volume service is down. (host: hostgroup@tripleo_iscsi)
2018-04-27 07:31:43.518 1 WARNING cinder.scheduler.host_manager [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] volume service is down. (host: hostgroup@tripleo_ceph)
2018-04-27 07:31:43.519 1 INFO cinder.scheduler.base_filter [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] Filtering removed all hosts for the request with volume ID '2f1f8b01-a4cf-4300-95c2-dcd96ffe66a3'. Filter results: AvailabilityZoneFilter: (start: 0, end: 0), CapacityFilter: (start: 0, end: 0), CapabilitiesFilter: (start: 0, end: 0)
2018-04-27 07:31:43.519 1 WARNING cinder.scheduler.filter_scheduler [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] No weighed backend found for volume with properties: None
2018-04-27 07:31:43.520 1 INFO cinder.message.api [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] Creating message record for request_id = req-c98fedaf-4153-497d-9359-27b95c88be74
2018-04-27 07:31:43.575 1 ERROR oslo.messaging._drivers.impl_rabbit [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] [69c757f8-8810-48f3-9e95-b16eddd6cea8] AMQP server on overcloud-controller-2.internalapi.localdomain:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds. Client port: 44646: error: [Errno 32] Broken pipe
2018-04-27 07:31:44.670 1 INFO oslo.messaging._drivers.impl_rabbit [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] [69c757f8-8810-48f3-9e95-b16eddd6cea8] Reconnected to AMQP server on overcloud-controller-2.internalapi.localdomain:5672 via [amqp] client with port 57358.
2018-04-27 07:31:44.712 1 ERROR cinder.scheduler.flows.create_volume [req-c98fedaf-4153-497d-9359-27b95c88be74 fb2a3f50f24a4b3f9c97aa584f5eb275 fa3a35b560b94b7dba5dd507ba4f6058 - - -] Failed to run task cinder.scheduler.flows.create_volume.ScheduleCreateVolumeTask;volume:create: No valid backend was found. No weighed backends available: NoValidBackend: No valid backend was found. No weighed backends available


(overcloud) [stack@undercloud ~]$ openstack volume service list
+------------------+-------------------------+------+---------+-------+----------------------------+
| Binary           | Host                    | Zone | Status  | State | Updated At                 |
+------------------+-------------------------+------+---------+-------+----------------------------+
| cinder-scheduler | overcloud-controller-0  | nova | enabled | up    | 2018-04-27T11:39:40.000000 |
| cinder-scheduler | overcloud-controller-2  | nova | enabled | up    | 2018-04-27T11:39:32.000000 |
| cinder-scheduler | overcloud-controller-1  | nova | enabled | up    | 2018-04-27T11:39:32.000000 |
| cinder-volume    | hostgroup@tripleo_iscsi | nova | enabled | down  | 2018-04-26T11:38:27.000000 |
| cinder-volume    | hostgroup@tripleo_ceph  | nova | enabled | down  | 2018-04-26T11:38:28.000000 |
+------------------+-------------------------+------+---------+-------+----------------------------+

I was able to resolve this by:

1) Synchronizing NTP across the environment.
2) Restart Cinder Services on the Controller Nodes

(overcloud) [stack@undercloud ~]$ openstack volume service list
+------------------+-------------------------+------+---------+-------+----------------------------+
| Binary           | Host                    | Zone | Status  | State | Updated At                 |
+------------------+-------------------------+------+---------+-------+----------------------------+
| cinder-scheduler | overcloud-controller-0  | nova | enabled | up    | 2018-04-27T12:01:04.000000 |
| cinder-scheduler | overcloud-controller-2  | nova | enabled | up    | 2018-04-27T12:01:04.000000 |
| cinder-scheduler | overcloud-controller-1  | nova | enabled | up    | 2018-04-27T12:01:10.000000 |
| cinder-volume    | hostgroup@tripleo_iscsi | nova | enabled | down  | 2018-04-27T11:47:43.000000 |
| cinder-volume    | hostgroup@tripleo_ceph  | nova | enabled | up    | 2018-04-27T12:01:06.000000 |
+------------------+-------------------------+------+---------+-------+----------------------------+
(overcloud) [stack@undercloud ~]$ 


Volume creates were successful.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Alan Bishop 2018-04-27 12:08:07 UTC
Please provide the cinder logs (especially cinder-volume), with DEBUG enabled.

Comment 2 Gorka Eguileor 2018-04-27 12:48:01 UTC
This could very well a problem with the time on the controller nodes, they must be more or less in sync, otherwise the schedulers can't tell when a service is up.

As Alan mentioned logs will allow us to tell if the volume service was actually running correctly or not.  If it was running correctly the issue is NTP related and has nothing to do with Cinder.

Comment 3 Gorka Eguileor 2018-04-27 13:07:13 UTC
Ignore my comment on the time, from Cinder's perspective the issue has nothing to do with it, because we can see that when the service is down there is only an 8 seconds difference between the controller nodes.

Since this could be a Cinder or Ceph issue we really need the logs.

Comment 4 Darin Sorrentino 2018-05-02 14:06:52 UTC
Created attachment 1430071 [details]
Cinder logs controller-0

Comment 5 Darin Sorrentino 2018-05-02 14:07:22 UTC
Created attachment 1430072 [details]
Cinder logs controller-1

Comment 6 Darin Sorrentino 2018-05-02 14:07:48 UTC
Created attachment 1430073 [details]
Cinder logs controller-2

Comment 7 Darin Sorrentino 2018-05-02 14:09:56 UTC
(In reply to Alan Bishop from comment #1)
> Please provide the cinder logs (especially cinder-volume), with DEBUG
> enabled.

The logs I have provided have not had DEBUG enabled.  After time synchronization and restarting the Cinder Services, I do not have the issue.  Do you still want me to enable DEBUG, restart services and create volumes even though it appears to be working?

Comment 8 Randy Rubins 2018-05-02 19:20:18 UTC
I encountered a similar issue to Darin's, also FFU (OSP10 -> OSP13) upgrade related. I have a 6 node overcloud ( 3 Controller + 3 ComputeHCI nodes).  Prior to upgrade, I had 3 boot-from volume instances running. During ceph-upgrade portion of FFU, I lost connectivity to those vms.  I was able to recover them by adding ceph_mgr iptables rules and rebooting nodes. At this time I have a containerized ceph 3 and osp13 on my overcloud nodes. 'openstack volume service list' shows cinder-volume hostgroup@tripleo_ceph service as enabled and up.  I can create ceph volumes using 'rbd -p volumes create ...' just fine.  But any attempt creating cinder volumes get stuck what appears to be at cinder-scheduler.

Comment 9 Randy Rubins 2018-05-02 19:21:53 UTC
Also, I have debug enabled, so I can provide any logs you wish to see.  But here's a snippet from cinder-scheduler.log:  http://pastebin.test.redhat.com/585345

Comment 10 Alan Bishop 2018-05-02 20:26:40 UTC
Gorka and I appreciate the efforts being made to provide the debug data we need to troubleshoot the problem. What we need is a full set of cinder logs (all cinder services), with DEBUG enabled during the time that the problem occurs. This means DEBUG will need to be enabled prior to performing the FFU.

The reason we need this is there are several possible explanations for the symptoms you are seeing.

- The cinder-volume service is down, possibly because RBD driver thinks the Ceph cluster is unavailable/down
- The cinder-volume service appears to be down, but only because of clock skew between the cinder services
- The cinder-scheduler service is misbehaving (see bug #1560919 for another example)

Comment 11 Alan Bishop 2018-05-09 12:54:09 UTC
Need logs per my previous comment to make progress on this.

Comment 12 Sean Cohen 2018-05-30 13:56:54 UTC
Closing as insufficient data, feel free to re-open if problem still occours with fresh logs.
Sean