Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1081278

Summary: scheduler is running, but not scheduling
Product: Red Hat OpenStack
Component: openstack-nova
Version: 4.0
Target Release: 5.0 (RHEL 7)
Reporter: Ian Wienand <iwienand>
Assignee: Russell Bryant <rbryant>
QA Contact: Ami Jeain <ajeain>
CC: aortega, gdubreui, ndipanov, sgordon, vpopovic, wfoster, yeylon
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-05-27 15:43:33 UTC
Attachments:
  strace of nova-scheduler
  lsof output for scheduler

Description Ian Wienand 2014-03-27 00:27:14 UTC
Description of problem:

Something has happened to the scheduler: it is running, but seemingly not consuming RPC messages.  The problem is that it gave no notification of being in this state.

---
[root@host02 nova]# date
Thu Mar 27 00:03:34 UTC 2014

[root@host02 nova]# tail -n4 scheduler.log
2014-03-26 17:05:51.662 30160 INFO nova.scheduler.filter_scheduler [req-13acb4eb-0ffa-4d3e-82b0-d079bf1d75cc 54b9f83693c84bf2b72286e9609eee36 210a99a1e68f43218f4cab705c908d45] Choosing host WeighedHost [host: host07.oslab.priv, weight: 9105.0] for instance df44492f-df4c-4de2-ae90-ea9e7857e2bc
2014-03-26 17:06:53.067 30160 INFO nova.scheduler.filter_scheduler [req-94e244f6-10b7-4836-8f6d-0697f3d9b237 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Attempting to build 1 instance(s) uuids: [u'4f07e0ad-1c4a-47c0-9fa9-d67888aacf19']
2014-03-26 17:06:53.556 30160 INFO nova.scheduler.filter_scheduler [req-94e244f6-10b7-4836-8f6d-0697f3d9b237 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Choosing host WeighedHost [host: host12.oslab.priv, weight: 8593.0] for instance 4f07e0ad-1c4a-47c0-9fa9-d67888aacf19
2014-03-26 20:02:23.952 30160 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 5.290742 sec
---

This is in a fairly busy lab, yet the last entry is from several hours ago.  VMs that are started get stuck in the "scheduling" state.
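As an aside, this failure mode could be caught earlier by alarming on the age of the last log line.  A minimal sketch (a hypothetical check, assuming the timestamp format shown in scheduler.log above; the 30-minute threshold is arbitrary):

```python
from datetime import datetime, timedelta

def log_age(log_line, now):
    """Age of a nova log line, parsed from its leading
    'YYYY-MM-DD HH:MM:SS.mmm' timestamp."""
    date_part, time_part = log_line.split()[:2]
    ts = datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H:%M:%S.%f")
    return now - ts

# Last line of scheduler.log vs. the `date` output above:
last = ("2014-03-26 20:02:23.952 30160 WARNING nova.openstack.common."
        "loopingcall [-] task run outlasted interval by 5.290742 sec")
now = datetime(2014, 3, 27, 0, 3, 34)

print(log_age(last, now) > timedelta(minutes=30))  # True: silent for ~4 hours
```

On the numbers above, the scheduler had been silent for roughly four hours on a busy system, which is what first made it look wedged.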

Seeing as nothing is coming out in the logs, it's hard to tell why it's stuck.  I checked the qpid queues for what I believe are the scheduler entries:

---
Object Summary:
    ID    Created   Destroyed  Index
    ============================================================================
    ...
    1556  23:29:36  -          2020.scheduler
    1557  23:29:36  -          2020.scheduler.host02.oslab.priv
    1558  00:04:36  -          2020.topic-host02.oslab.priv.30661.1
    ...

qpid: 


EOF     agents  call    help    id      list    quit    schema  set     show    
qpid: show 1556
Object of type: org.apache.qpid.broker:queue:_data(245bf196-58a4-e2cd-7b3b-5fea66198436)
    Attribute              1556
    ==================================
    vhostRef               2020
    name                   scheduler
    durable                False
    autoDelete             False
    exclusive              False
    arguments              {}
    altExchange            absent
    msgTotalEnqueues       11675
    msgTotalDequeues       11675
    msgTxnEnqueues         0
    msgTxnDequeues         0
    msgPersistEnqueues     0
    msgPersistDequeues     0
    msgDepth               0
    byteDepth              0
    byteTotalEnqueues      134424837
    byteTotalDequeues      134424837
    byteTxnEnqueues        0
    byteTxnDequeues        0
    bytePersistEnqueues    0
    bytePersistDequeues    0
    consumerCount          1
    consumerCountHigh      1
    consumerCountLow       1
    bindingCount           2
    bindingCountHigh       2
    bindingCountLow        2
    unackedMessages        0
    unackedMessagesHigh    0
    unackedMessagesLow     0
    messageLatencySamples  0s
    messageLatencyMin      0s
    messageLatencyMax      0s
    messageLatencyAverage  0s
    flowStopped            False
    flowStoppedCount       0
qpid: show 1557
Object of type: org.apache.qpid.broker:queue:_data(245bf196-58a4-e2cd-7b3b-5fea66198436)
    Attribute              1557
    ====================================================
    vhostRef               2020
    name                   scheduler.host02.oslab.priv
    durable                False
    autoDelete             False
    exclusive              False
    arguments              {}
    altExchange            absent
    msgTotalEnqueues       0
    msgTotalDequeues       0
    msgTxnEnqueues         0
    msgTxnDequeues         0
    msgPersistEnqueues     0
    msgPersistDequeues     0
    msgDepth               0
    byteDepth              0
    byteTotalEnqueues      0
    byteTotalDequeues      0
    byteTxnEnqueues        0
    byteTxnDequeues        0
    bytePersistEnqueues    0
    bytePersistDequeues    0
    consumerCount          1
    consumerCountHigh      1
    consumerCountLow       1
    bindingCount           2
    bindingCountHigh       2
    bindingCountLow        2
    unackedMessages        0
    unackedMessagesHigh    0
    unackedMessagesLow     0
    messageLatencySamples  0s
    messageLatencyMin      0s
    messageLatencyMax      0s
    messageLatencyAverage  0s
    flowStopped            False
    flowStoppedCount       0
qpid: 
---
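For reference, a stall on the broker side would normally show up as a growing msgDepth with a consumer still attached.  A minimal sketch of that check (a hypothetical helper, not part of nova or qpid, that parses a `qpid-tool show` dump like the ones above):

```python
def parse_qpid_show(dump):
    """Parse the attribute block of a `qpid-tool show` dump into a dict,
    coercing integer values where possible."""
    attrs = {}
    for line in dump.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip headers, separators, blank lines
        key, value = parts
        try:
            attrs[key] = int(value)
        except ValueError:
            attrs[key] = value.strip()
    return attrs

def looks_stalled(attrs):
    """Messages piling up while a consumer is attached suggests a wedged
    consumer; depth 0 with enqueues == dequeues looks healthy."""
    return attrs.get("consumerCount", 0) > 0 and attrs.get("msgDepth", 0) > 0

# The `scheduler` queue (object 1556) from the dump above:
queue_1556 = """\
    msgTotalEnqueues       11675
    msgTotalDequeues       11675
    msgDepth               0
    consumerCount          1
"""
print(looks_stalled(parse_qpid_show(queue_1556)))  # False
```

By this check the broker side looks healthy (everything enqueued was dequeued), which points at the wedge being inside the scheduler process rather than in the queue.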

I also ran an strace of the scheduler for a little while; it's attached, as is the lsof output.  Nothing in particular stands out; the only AMQP communication I can see is on this fd

---
nova-sche 30160 nova    9u  IPv4 77079819      0t0      TCP localhost:38355->localhost:amqp (ESTABLISHED)
---

which consists of

---
recvfrom(9, "\17\0\0\20\0\0\0\0\0\0\0\0\1\n\0\0", 65536, 0, NULL, NULL) = 16
---
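For what it's worth, those 16 bytes decode cleanly as a single AMQP 0-10 frame.  A hedged sketch of the decode (the layout here is my reading of the 0-10 framing: 1-byte flags, 1-byte segment type, 16-bit big-endian size that includes the 12-byte header, payload after it; class 0x01 / code 0x0a would be connection.heartbeat):

```python
import struct

# The bytes from the recvfrom() above, rewritten in hex
# (\17 \0 \0 \20 ... \1 \n \0 \0 in the octal escapes shown).
raw = b"\x0f\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x01\x0a\x00\x00"

flags, seg_type, size = struct.unpack(">BBH", raw[:4])
payload = raw[12:size]                    # 12-byte frame header assumed
class_code, code = payload[0], payload[1]

# control segment (type 0), class 1 / code 10 = connection.heartbeat
print(seg_type, size, class_code, code)   # 0 16 1 10
```

If that reading is right, the only traffic on the socket is connection keepalives, consistent with a consumer that has stopped fetching messages.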

Comment 1 Ian Wienand 2014-03-27 00:27:38 UTC
Created attachment 879251 [details]
strace of nova-scheduler

Comment 2 Ian Wienand 2014-03-27 00:28:23 UTC
Created attachment 879252 [details]
lsof output for scheduler

Comment 4 Russell Bryant 2014-04-07 14:09:53 UTC
The log looks normal, and the qmf output shows all messages being consumed.

What makes you think the scheduler is broken here?  What's the failure observed?

Comment 5 Ian Wienand 2014-04-14 05:39:59 UTC
(In reply to Russell Bryant from comment #4)
> What makes you think the scheduler is broken here?  What's the failure
> observed?

Sorry, the original comment did bury the lede a little.  The problem was that VMs never got out of the "scheduling" state.  The scheduler was not logging that it was seeing the requests for the new VMs come in; it wasn't logging anything at all.  As you say, it seems to have consumed all the RPC messages it was sent.

I don't expect this to get fixed from this bug unless it happens to jog somebody's memory of a similar issue; it's mostly for tracking.  This is a semi-production system, so we had to restart things.