Bug 1267013

Summary: Instances stuck in build and deleting state
Product: Red Hat OpenStack
Component: openstack-nova
Version: 6.0 (Juno)
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: urgent
Priority: unspecified
Keywords: Unconfirmed, ZStream
Type: Bug
Doc Type: Bug Fix
Reporter: Jeremy <jmelvin>
Assignee: Sylvain Bauza <sbauza>
QA Contact: nlevinki <nlevinki>
CC: abeekhof, berrange, dasmith, eglynn, fdinitto, jeckersb, jmelvin, jthomas, jwaterwo, kchamart, mflusche, ndipanov, pbrady, sbauza, sferdjao, sgordon, vromanso, yeylon
Last Closed: 2015-10-27 20:40:33 UTC

Description Jeremy 2015-09-28 19:47:01 UTC
Description of problem:
Instances stuck in build and deleting state


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
 [root@osp1-controller01 ~(openstack_admin)]# nova show a1b32b1b-7cff-43c9-9ace-9070e815f6b6                                                    
+--------------------------------------+--------------------------------------------------------------+
| Property                             | Value                                                        |
+--------------------------------------+--------------------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                         |
| OS-EXT-AZ:availability_zone          | nova                                                         |
| OS-EXT-SRV-ATTR:host                 | osp1-compute04.osp.poc                                       |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | osp1-compute04.osp.poc                                       |
| OS-EXT-SRV-ATTR:instance_name        | instance-000063fc                                            |
| OS-EXT-STS:power_state               | 1                                                            |
| OS-EXT-STS:task_state                | deleting                                                     |
| OS-EXT-STS:vm_state                  | deleted                                                      |
| OS-SRV-USG:launched_at               | 2015-09-28T12:34:10.000000                                   |
| OS-SRV-USG:terminated_at             | 2015-09-28T15:34:49.000000                                   |
| accessIPv4                           |                                                              |
| accessIPv6                           |                                                              |
| config_drive                         |                                                              |
| created                              | 2015-09-28T12:33:39Z                                         |
| flavor                               | m1.small (2)                                                 |
| heanet-default network               | 192.168.0.215                                                |
| hostId                               | 2bbb88190c64b3f52ab02591cd9c7ac7cf589412dc6f4f00382f85a4     |
| id                                   | a1b32b1b-7cff-43c9-9ace-9070e815f6b6                         |
| image                                | au-file-download-test (d9369a32-4eb9-4758-8379-7b42ff27391a) |
| key_name                             | au-default                                                   |
| metadata                             | {}                                                           |
| name                                 | au-kernel-test-a1b32b1b-7cff-43c9-9ace-9070e815f6b6          |
| os-extended-volumes:volumes_attached | []                                                           |
| status                               | DELETED                                                      |
| tenant_id                            | dbc9c4efbd364687bb6e8ab4fe7b2fd2                             |
| updated                              | 2015-09-28T15:35:35Z                                         |
| user_id                              | 1a25f8911ce048c1afc855a15c7cdcf5                             |
+--------------------------------------+--------------------------------------------------------------+

Expected results:


Additional info:
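If it helps narrow things down, other instances wedged the same way can be listed straight from the nova database; a minimal query, assuming the standard instances table schema (illustrative only):

# mysql nova -e "SELECT uuid, vm_state, task_state, updated_at FROM instances WHERE deleted = 0 AND (vm_state = 'building' OR task_state = 'deleting');"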

Comment 9 John Eckersberg 2015-10-01 14:22:04 UTC
Check the output from...

# cat /proc/sys/net/ipv4/tcp_keepalive_*

and

# grep keepalive /etc/rabbitmq/rabbitmq.config

I'm assuming TCP keepalives are not enabled and the connections are just timing out due to inactivity.
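For comparison, stock hosts normally show the kernel defaults (7200s idle time, 75s probe interval, 9 probes). If keepalives are indeed off, a minimal way to enable them, assuming the classic rabbitmq.config format and illustrative timer values, would be:

# /etc/rabbitmq/rabbitmq.config
[{rabbit, [{tcp_listen_options, [{keepalive, true}]}]}].

# sysctl -w net.ipv4.tcp_keepalive_time=5
# sysctl -w net.ipv4.tcp_keepalive_intvl=1
# sysctl -w net.ipv4.tcp_keepalive_probes=5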

Comment 19 Andrew Beekhof 2015-10-05 00:07:35 UTC
Pacemaker's not doing jack.

The new logs start at Sep 29 17:18:59 and we don't do anything until one of the galera instances fails almost a day later (which is well after the connection errors discussed in comment #17):

Sep 30 16:39:12 osp1-controller01 pengine[5416]: warning: unpack_rsc_op_failure: Processing failed op monitor for galera:1 on osp1-controller01: not running (7)
Sep 30 16:39:12 osp1-controller01 pengine[5416]: notice: LogActions: Recover galera:1	(Master osp1-controller01)

What makes you think pacemaker is bouncing resources around?
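A quick way to double-check what pacemaker actually did on the controllers, assuming the default syslog destination (illustrative commands only):

# grep -e 'unpack_rsc_op_failure' -e 'LogActions' /var/log/messages
# pcs status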

Comment 27 Dave Maley 2015-10-27 20:40:33 UTC
As we have not heard anything back from the customer for nearly 3 weeks, I'm closing this bug. Please re-open if the customer comes back with the requested information.