Bug 1275032

Summary: Cinder create volume from image hangs and fails
Product: Red Hat OpenStack
Reporter: James Biao <jbiao>
Component: openstack-cinder
Assignee: Flavio Percoco <fpercoco>
Status: CLOSED ERRATA
QA Contact: nlevinki <nlevinki>
Severity: urgent
Priority: unspecified
Version: 5.0 (RHEL 7)
CC: dciabrin, dhill, eharney, fahmed, fdinitto, fpercoco, jraju, kgaillot, mbayer, plemenko, scohen, sgotliv, skinjo, slee, vstinner, yeylon
Target Milestone: ---
Keywords: ZStream
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2015-12-22 14:26:41 UTC
Type: Bug

Description James Biao 2015-10-25 04:45:16 UTC
Description of problem:


cinder create --display-name test9-vda 50 --volume_type e48e8086-c05d-4a48-a272-579f50c40dc8  --image-id 86005b37-05be-4c07-9ec3-f82cecf41ce7
ERROR: Unable to establish connection: ('Connection aborted.', BadStatusLine("''",))

A successful run of the same command finishes in 5 seconds.

Version-Release number of selected component (if applicable):
RHEL OSP 5 on 6

How reproducible:

Intermittently


Steps to Reproduce:
1.
2.
3.

Actual results: Cinder volume creation fails


Expected results: Cinder volume creation succeeds


Additional info:

Comment 2 James Biao 2015-10-26 00:07:49 UTC
Restarting openstack-nova-api-clone, openstack-cinder-api-clone, openstack-cinder-volume
and openstack-glance-api-clone with pcs, followed by a pcs resource cleanup on the
cinder-volume service, fixed the Cinder latency, but the customer is now experiencing the same issue again.
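
For reference, the restart/cleanup sequence applied was roughly the following (a sketch; on older pcs releases a disable/enable cycle may be needed instead of "pcs resource restart"):

    # bounce the API and volume services managed by the cluster
    pcs resource restart openstack-nova-api-clone
    pcs resource restart openstack-cinder-api-clone
    pcs resource restart openstack-cinder-volume
    pcs resource restart openstack-glance-api-clone
    # clear the failed-action history for the cinder-volume resource
    pcs resource cleanup openstack-cinder-volume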

Comment 3 Flavio Percoco 2015-10-26 07:27:50 UTC
Most of the errors come from RabbitMQ not being available, which is a separate issue. However, this one seems to be related to a problem with MySQL and a lock not being released, hence the BadStatusLine:

sosreport-lpdcldiuctl01.01529004-20151024201519/var/log/cinder/scheduler.log:2015-10-23 07:31:54.579 29355 TRACE cinder.service OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE services SET updated_at=%s, report_count=%s WHERE services.id = %s' (datetime.datetime(2015, 10, 23, 14, 31, 3, 634683), 7946361L, 3L)
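
If this reproduces, one way to see which transaction is blocking the services UPDATE (a diagnostic sketch; run it against the node currently taking the database writes) would be:

    # show waiting vs. blocking transactions at the moment of the lock wait
    mysql -e "SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query,
                     b.trx_id AS blocking_trx, b.trx_query AS blocking_query
              FROM information_schema.innodb_lock_waits w
              JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
              JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;"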

I'll dig more into this and try to find the reason why it happened.

Comment 4 Shinobu KINJO 2015-10-26 07:48:26 UTC
/*
 *   The following is the message I wrote
 *   for the customer.
 *                                         Shinobu
 */

The first thing I would like to tell you is that:


  lpdcldiuctl01 is _unhealthy_


Please check the network on that host.

This is because there are a lot of communication errors with
RabbitMQ, such as:


    2015-10-09 16:11:26.662 13900 ERROR oslo.messaging._drivers.impl_rabbit [-] 
    AMQP server on 172.17.10.23:5672 is unreachable: [Errno 111] ECONNREFUSED...


Those communication errors eventually resulted in:


    [Errno 32] Broken pipe


After that, Cinder kept trying to reconnect to RabbitMQ,
and there is now a huge log file:


    $ ls -lhart api.log-20151012
    -rw-r--r--. 1 shinobu shinobu 1.9G Oct 12 19:07 api.log-20151012


This is completely abnormal; it could end up filling the partition:


    /dev/mapper/vg_root-lv_root 290280976 204746708  85534268  71% /


Please check the utilization of the "/" root partition now,
and keep monitoring that log file.

This is the currently active log file, which had already grown to 1.4G at that point:


    -rw-r--r--.  1 shinobu shinobu 1.4G Oct 25 12:26 api.log
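
A quick periodic check on each controller would be something like this (just a sketch; the error string is taken from the excerpt above):

    df -h /
    ls -lh /var/log/cinder/api.log
    # count how many connection errors have piled up so far
    grep -c 'ECONNREFUSED' /var/log/cinder/api.log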


Some of the same error messages appear on the other controllers,
lpdcldiuctl02 and lpdcldiuctl03, but they are nowhere near as bad as on
lpdcldiuctl01. Please monitor this host especially closely while this case is open.

I will update you if there is anything more you _must_ know.

Comment 6 Fabio Massimo Di Nitto 2015-10-26 10:56:17 UTC
Shinobu-san,

it's not clear from the previous message in comment #4 whether the disk was full or not.

Did the disk fill up at any point in time? How was it freed? After freeing the disk space, were services restarted?

Comment 7 Peter Lemenkov 2015-10-26 12:05:18 UTC
The RabbitMQ log on machine lpdcldiuctl03 contains a gap starting at 23-Oct-2015::02:50; the next log record was written roughly 5 hours 30 minutes later. During this window, the RabbitMQ instances on the other machines started experiencing issues replicating their internal state across the cluster.

Comment 8 Damien Ciabrini 2015-10-26 12:30:25 UTC
Regarding the MySQL log from comment #3,

Those "Lock wait timeout" messages do not indicate locking issue in Galera replication, they indicate that some queries took too long to execute. There are only a handfull of them and they also happened before the incident.

The machines don't look overloaded (plenty of RAM), so MySQL is certainly not experiencing a global slowdown during the incident.

The MySQL config sets max_connections to 2048. Looking at the MySQL logs, I see no evidence of socket exhaustion that could force haproxy to retry connections and make clients experience slow response times.

Note that the maximum number of connections was reached several weeks before the incident, so if the setting has not been changed since then, it might be worth increasing the limit.
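
To check whether the limit is being approached again, something like this can be run on each controller (a sketch; the new value is only an example):

    mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
              SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
    # raise the limit at runtime; persist the change in the my.cnf used by Galera
    mysql -e "SET GLOBAL max_connections = 4096;"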

Comment 9 Peter Lemenkov 2015-10-26 12:43:11 UTC
OK, bad news: RabbitMQ died on node lpdcldiuctl03 at 23-Oct-2015::03:30 and was only restarted at 23-Oct-2015::09:15. That is why there is a gap in the logs.

So that's what we know about lpdcldiuctl03:

* 23-Oct-2015::02:50 some unidentified issue(s) started hitting RabbitMQ. I suppose the other nodes (and the clients connected to them) started experiencing connectivity issues at that point.

* 23-Oct-2015::03:40 RabbitMQ finally died while trying to allocate 127 GB of memory.

------------------ cut ---------------------------------
Oct 23 03:25:39 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[2040]: Crash dump was written to: erl_crash.dump
Oct 23 03:25:39 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[2040]: eheap_alloc: Cannot allocate 125757763272 bytes of memory (of type "heap").
Oct 23 03:37:13 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: rabbitmq-server.service: main process exited, code=killed, status=6/ABRT
Oct 23 03:37:33 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: Stopping RabbitMQ broker...
Oct 23 03:37:53 lpdcldiuctl03.p1devqa.phx.aexp.com lrmd[5252]: notice: Giving up on rabbitmq-server stop (rc=196): timeout (elapsed=19986ms, remaining=14ms)
Oct 23 03:37:53 lpdcldiuctl03.p1devqa.phx.aexp.com crmd[5255]: error: Operation rabbitmq-server_stop_0: Timed Out (node=lpdcldiuctl03.p1devqa.phx.aexp.com, call=623, timeout=20000ms)
Oct 23 03:38:10 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: Stopping and halting node rabbit@lpdcldiuctl03 ...
Oct 23 03:38:12 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: Error: unable to connect to node rabbit@lpdcldiuctl03: nodedown
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: DIAGNOSTICS
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: ===========
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: attempted to contact: [rabbit@lpdcldiuctl03]
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: rabbit@lpdcldiuctl03:
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: * connected to epmd (port 4369) on lpdcldiuctl03
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: * epmd reports: node 'rabbit' not running at all
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: no other nodes on lpdcldiuctl03
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: * suggestion: start the node
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: current node details:
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: - node name: rabbitmqctl16903@lpdcldiuctl03
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: - home dir: /var/lib/rabbitmq
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[16903]: - cookie hash: soeIWU2jk2YNseTyDSlsEA==
Oct 23 03:38:14 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2
Oct 23 03:38:15 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: Stopped RabbitMQ broker.
Oct 23 03:38:15 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: Unit rabbitmq-server.service entered failed state.
------------------ cut ---------------------------------

* 23-Oct-2015::09:15 RabbitMQ was restarted again.

------------------ cut ---------------------------------
Oct 23 09:11:51 lpdcldiuctl03.p1devqa.phx.aexp.com crmd[5255]: notice: Operation rabbitmq-server_monitor_0: not running (node=lpdcldiuctl03.p1devqa.phx.aexp.com, call=726, rc=7, cib-update=359, confirmed=true)
Oct 23 09:12:01 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: Starting RabbitMQ broker...
Oct 23 09:12:02 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[1403]: Waiting for rabbit@lpdcldiuctl03 ...
Oct 23 09:12:02 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[1403]: pid is 1399 ...
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: RabbitMQ 3.3.5. Copyright (C) 2007-2014 GoPivotal, Inc.
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: ##  ##
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: ##########  Logs: /var/log/rabbitmq/rabbit
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: ######  ##        /var/log/rabbitmq/rabbit
Oct 23 09:12:03 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: ##########
Oct 23 09:12:05 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmq-server[1399]: Starting broker... completed with 6 plugins.
Oct 23 09:12:05 lpdcldiuctl03.p1devqa.phx.aexp.com rabbitmqctl[1403]: ...done.
Oct 23 09:12:05 lpdcldiuctl03.p1devqa.phx.aexp.com systemd[1]: Started RabbitMQ broker.
Oct 23 09:12:05 lpdcldiuctl03.p1devqa.phx.aexp.com crmd[5255]: notice: Operation rabbitmq-server_start_0: ok (node=lpdcldiuctl03.p1devqa.phx.aexp.com, call=960, rc=0, cib-update=430, confirmed=true)
Oct 23 09:13:24 lpdcldiuctl03.p1devqa.phx.aexp.com crmd[5255]: notice: Operation rabbitmq-server_monitor_0: ok (node=lpdcldiuctl03.p1devqa.phx.aexp.com, call=1065, rc=0, cib-update=528, confirmed=true)
Oct 23 09:25:23 lpdcldiuctl03.p1devqa.phx.aexp.com crmd[5255]: notice: Operation rabbitmq-server_monitor_0: ok (node=lpdcldiuctl03.p1devqa.phx.aexp.com, call=1403, rc=0, cib-update=696, confirmed=true)
------------------ cut ---------------------------------

There were a couple more restarts after that - I suppose those are the restarts from when the operator tried to upgrade/roll back some packages and restart RabbitMQ.

So far I have one question - why is there a 6-hour gap between restarts? I thought Pacemaker was supposed to restart the service if it's not running.

I'm also really suspicious about that 127 GB allocation. I know this machine has more memory than that; still, I believe something is wrong with RabbitMQ itself. Queues full of unacked/unread messages? I'm going to find out.
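
To check the unacked/queue-depth theory, something along these lines should show where the memory went (a sketch; the column list works with the rabbitmqctl shipped with 3.3.x):

    # largest queues first: message counts, unacked messages, memory and consumer count
    rabbitmqctl list_queues name messages messages_unacknowledged memory consumers | sort -k2 -nr | head -20
    # how many AMQP connections are currently open
    rabbitmqctl list_connections name channels | wc -l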

Comment 10 Michael Bayer 2015-10-26 21:30:47 UTC
Hi Flavio -

Can you link the errors creating volumes on the command line to the "lock wait timeout" errors we see in the logs?  When you say "However, this one seems to be related to a problem with MySQL and a lock not being released, hence the BadStatusLine", how are you able to correlate those two events?  I don't know enough about Cinder to spot any event in the logs that definitely corresponds to the command line returning this error; in fact, grepping through every .log file in every /var/log/cinder directory, I don't see anything at all.

We have noted that the customer appears to have commented out the haproxy configuration as well as the database configuration in all the OpenStack .conf files, meaning all database APIs are defaulting to talking to the localhost MariaDB instance. With Galera this is a very bad idea, as it means we are doing multi-master writes, which are known to cause lots of problems. Our current HAProxy config is intended to send all database traffic to only one node at a time.

Comment 11 Michael Bayer 2015-10-26 21:35:42 UTC
Correction: it looks like sosreport is filtering out the database URLs, so we can't see which database URL is in the OpenStack .conf files... argh. We can see that all the Galera sections are commented out in all the haproxy configs, though.

Comment 12 Fabio Massimo Di Nitto 2015-10-26 21:36:16 UTC
(In reply to Peter Lemenkov from comment #9)

> 
> There were a couple more restarts after that - I suppose those are the
> restarts from when the operator tried to upgrade/roll back some packages
> and restart RabbitMQ.
> 
> So far I have one question - why is there a 6-hour gap between
> restarts? I thought Pacemaker was supposed to restart the service if it's not running.
> 

Ken, can you take a look at this? See the comment above for details.

> I'm also really suspicious about that 127 GB allocation. I know this
> machine has more memory than that; still, I believe something is wrong
> with RabbitMQ itself. Queues full of unacked/unread messages? I'm
> going to find out.

Comment 13 Michael Bayer 2015-10-26 21:39:09 UTC
The ratio between how many gigabytes of information sosreports give you and the absence of the four or five little things you actually want to know always amazes me. I would like to know: 1. how are they connecting to Galera here if all the haproxy confs are commented out - what database URLs did they use? and 2. *when* did these timeout incidents actually occur, timestamp-wise, so we can correlate events in the logs?

Comment 14 Damien Ciabrini 2015-10-26 21:57:50 UTC
Note for comment #12: the provided sosreport lacks the output of crm_report.

We've asked the customer to regenerate it manually and send it over for further analysis around 13:00 EDT.

Comment 24 Sean Lee 2015-10-27 18:20:34 UTC
osapi_volume_workers is currently unset in cinder.conf, which defaults to only one worker serving requests on each controller. When there are lots of requests hitting cinder-api, that single process ends up running at 100% CPU on all 3 controllers.

Is it advisable to increase osapi_volume_workers to match the number of cores (32) on the system?
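
If we do go that route, the change itself is small - something like this (a sketch using openstack-config from openstack-utils; the value is only an example and the resource name is taken from comment #2):

    openstack-config --set /etc/cinder/cinder.conf DEFAULT osapi_volume_workers 32
    # pick up the new worker count on the cluster-managed API service
    pcs resource restart openstack-cinder-api-clone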

Comment 25 Sean Lee 2015-10-27 18:38:27 UTC
In reply to comment #21: I was told the UCS fence agent wasn't available when the cluster was first set up more than a year ago. I'll look into adding stonith to the environment.

You are correct that some PCS resources were unmanaged due to maintenance/troubleshooting. After everything was brought back up late yesterday, all resources are managed again.

Comment 26 Peter Lemenkov 2015-10-27 21:19:40 UTC
From the RabbitMQ side we have the following:

* RabbitMQ actually works fine - it receives and delivers messages.

* Unfortunately one of the major consumers had gone offline so one of the queues started growing indefinitely.

* Once RabbitMQ memory consumption reached some point it died spectacularly while trying to allocate 127 GBytes more.

* Pacemaker didn't care about this and didn't restart anything.

So we have several different issues here.

1. A receiver shouldn't go offline / disconnect from the queue / fail to connect to an already existing queue. The latter might happen due to a possible race condition in python-oslo-messaging.
2. If a receiver goes offline / disconnects / fails to connect to the existing queue, and the corresponding queue is deleted (it is marked as auto-delete), RabbitMQ shouldn't keep accepting additional messages intended for that receiver. Perhaps it should also clean up the existing ones. This is a potential issue in python-oslo-messaging and/or RabbitMQ.
3. If memory consumption grows, RabbitMQ shouldn't simply die. This might be an issue somewhere in the lower levels of RabbitMQ or in the Erlang VM itself (see the sketch below).
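
On point 3, as a stopgap (not a fix for the consumer going away), it may be worth checking where the memory alarm is currently set, and possibly lowering it so the broker blocks publishers well before the Erlang VM attempts an allocation that large - a sketch:

    # current alarm threshold and the absolute limit it translates to
    rabbitmqctl status | grep -E 'vm_memory_high_watermark|vm_memory_limit'
    # lower the threshold at runtime (0.3 is only an example; the upstream default is 0.4)
    rabbitmqctl set_vm_memory_high_watermark 0.3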

Comment 27 Ken Gaillot 2015-10-27 21:27:19 UTC
(In reply to Peter Lemenkov from comment #26)
> From the RabbitMQ side we have the following:
> 
> * RabbitMQ actually works fine - it receives and delivers messages.
> 
> * Unfortunately one of the major consumers had gone offline so one of the
> queues started growing indefinitely.
> 
> * Once RabbitMQ memory consumption reached some point it died spectacularly
> while trying to allocate 127 GBytes more.
> 
> * Pacemaker didn't care about this and didn't restart anything.

To address that last point: Pacemaker tried to start rabbitmq but hit a timeout (20s). It considers this fatal and doesn't try again.

So another question is why rabbitmq couldn't start within that time. If it was starting properly but just taking a long time because of the earlier problems, then increasing the timeout may be justified.

It is also possible to configure Pacemaker to react differently: setting a failure-timeout causes the failure condition to be cleared automatically after that time, allowing Pacemaker to try starting the resource again. There is also start-failure-is-fatal, which defaults to true and may be worth setting to false in this case, if start failures are more likely to be transient than definitive.
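
For reference, all of these knobs can be set with pcs - a sketch only (the resource name is taken from the logs above and may actually be the clone; the values are examples, not recommendations):

    # give the start operation more time
    pcs resource update rabbitmq-server op start timeout=120s
    # clear failures automatically after 5 minutes so Pacemaker retries
    pcs resource meta rabbitmq-server failure-timeout=300s
    # or make start failures non-fatal cluster-wide
    pcs property set start-failure-is-fatal=false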

Comment 28 Michael Bayer 2015-10-27 22:31:12 UTC
OK, now that I've been pointed to more relevant SOS reports I have another potential theory about cinder.

What this "services" table is used for in Cinder is that a running loop is applied to update this table every N seconds as a Cinder command runs - it runs a separate transaction asynchronously against whatever the main thing going on is (in Icehouse we see this in cinder/service.py -> Service.start using a loopingcall.LoopingCall(self.report_state) call).

The table itself is not SELECTed with any FOR UPDATE, nor does it have foreign keys or indexes referring to it, so seeing it lock up like this is unusual. My theory is that if multiple Cinder commands are running at the same time - for example a bunch of them hanging for a long time due to other issues such as rabbitmq or network problems - these commands, which are all asynchronously emitting UPDATE statements against this table while they wait, could start fighting with each other.

To confirm this theory, we'd want to see whether the error correlates with multiple cinder commands or processes running at the same time - the table is touched by both the "cinder-scheduler" and "cinder-backup" components.
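
One cheap way to look for that correlation (a sketch; assumes the default database name and the Icehouse column names) is to sample the table while slow cinder commands are in flight and compare against the lock-wait timestamps:

    # watch which service rows are being updated and how fast report_count advances
    watch -n 5 'mysql cinder -e "SELECT id, host, topic, report_count, updated_at FROM services"'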

pinging eharney to see what he thinks of this theory.

Comment 29 Fabio Massimo Di Nitto 2015-10-27 22:52:02 UTC
(In reply to Sean Lee from comment #25)
> In reply to comment #21, I was told UCS fence agent wasn't available when
> the cluster was first setup more than a year ago. I'll look into adding
> stonith to the environment.

The agent has been available since RHEL 7.0 and even earlier; fence_ipmilan would have worked as well.

> 
> You are correct about some PCS resources being unmanaged due to
> maintenance/troubleshooting. After everything was brought back late
> yesterday, everything is managed now.

Perhaps we should better understand why you are switching resources between managed and unmanaged here. I have the feeling there are some procedural issues in how you are using Pacemaker, and addressing them could avoid some headaches.

Comment 32 Flavio Percoco 2015-10-31 05:30:37 UTC
I've backported the patches for the bugs Gorka mentioned in his last email on the AMEX thread.

There were some conflicts during the backport but those were resolved. Going to test them better in a bit (gotta go offline now).

https://code.engineering.redhat.com/gerrit/#/c/60546/
https://code.engineering.redhat.com/gerrit/#/c/60545/
https://code.engineering.redhat.com/gerrit/#/c/60544/

Comment 33 Flavio Percoco 2015-11-02 17:31:51 UTC
I've built two packages for one of the patches above [0]. The second patch ended up being more invasive than expected, and the third one depends on the second, so I've skipped both.

[0] https://code.engineering.redhat.com/gerrit/#/c/60544/

OSP5 (RHEL6) https://brewweb.devel.redhat.com/taskinfo?taskID=10049162
OSP5 (RHEL7) https://brewweb.devel.redhat.com/taskinfo?taskID=10049135

Thanks eck for helping with the reviews!

Comment 34 David Hill 2015-11-07 21:04:21 UTC
Hello Flavio,

    I tried finding the package for this build and I can't seem to find it ...
The latest build of python-oslo-messaging is this one [1]. When will we have a package we can provide to the customer? They called us today because the issue re-occurred.

Thank you very much,

David Hill

[1] https://brewweb.devel.redhat.com/buildinfo?buildID=455589

Comment 37 errata-xmlrpc 2015-12-22 14:26:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2686.html

Comment 38 Red Hat Bugzilla 2023-09-14 03:11:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days