Bug 1081278
| Summary: | scheduler is running, but not scheduling |
|---|---|
| Product: | Red Hat OpenStack |
| Component: | openstack-nova |
| Reporter: | Ian Wienand <iwienand> |
| Assignee: | Russell Bryant <rbryant> |
| Status: | CLOSED INSUFFICIENT_DATA |
| QA Contact: | Ami Jeain <ajeain> |
| Severity: | unspecified |
| Docs Contact: | |
| Priority: | unspecified |
| Version: | 4.0 |
| CC: | aortega, gdubreui, ndipanov, sgordon, vpopovic, wfoster, yeylon |
| Target Milestone: | --- |
| Target Release: | 5.0 (RHEL 7) |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2014-05-27 15:43:33 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |

Attachments:
- Created attachment 879251 [details]: strace of nova-scheduler
- Created attachment 879252 [details]: lsof output for scheduler
---

The log looks normal, and the qmf output shows all messages being consumed. What makes you think the scheduler is broken here? What's the failure observed?

---

(In reply to Russell Bryant from comment #4)

> What makes you think the scheduler is broken here? What's the failure observed?

Sorry, the original comment did bury the lede a little. The problem was that VMs never got out of the "scheduling" state. The scheduler was not logging that it was seeing the requests for the new VMs come in; it wasn't logging anything at all. As you say, it seems to have consumed all the RPC messages it was sent.

I don't expect this to get fixed from this bug unless it happens to trigger somebody's memory of a similar issue; it's mostly for tracking. This is a semi-production system, so we had to restart things.
Description of problem:

Something has happened to the scheduler: it is running, but seemingly not consuming RPC messages. The problem is that it has given no notification of being in this state.

```
[root@host02 nova]# date
Thu Mar 27 00:03:34 UTC 2014
[root@host02 nova]# tail -n4 scheduler.log
2014-03-26 17:05:51.662 30160 INFO nova.scheduler.filter_scheduler [req-13acb4eb-0ffa-4d3e-82b0-d079bf1d75cc 54b9f83693c84bf2b72286e9609eee36 210a99a1e68f43218f4cab705c908d45] Choosing host WeighedHost [host: host07.oslab.priv, weight: 9105.0] for instance df44492f-df4c-4de2-ae90-ea9e7857e2bc
2014-03-26 17:06:53.067 30160 INFO nova.scheduler.filter_scheduler [req-94e244f6-10b7-4836-8f6d-0697f3d9b237 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Attempting to build 1 instance(s) uuids: [u'4f07e0ad-1c4a-47c0-9fa9-d67888aacf19']
2014-03-26 17:06:53.556 30160 INFO nova.scheduler.filter_scheduler [req-94e244f6-10b7-4836-8f6d-0697f3d9b237 6c84f979672f44039145387a0e0fb864 bd6159c7f8aa4baba9a33f0b2725e6d1] Choosing host WeighedHost [host: host12.oslab.priv, weight: 8593.0] for instance 4f07e0ad-1c4a-47c0-9fa9-d67888aacf19
2014-03-26 20:02:23.952 30160 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 5.290742 sec
```

This is a fairly busy lab, yet the last log entry was several hours ago. VMs that are started are stuck in the "scheduling" state. Since nothing is coming out in the logs, it's hard to tell why it's stuck. I checked the qpid queues for what I believe are the scheduler entries:

```
Object Summary:
  ID    Created   Destroyed  Index
  ============================================================================
  ...
  1556  23:29:36  -          2020.scheduler
  1557  23:29:36  -          2020.scheduler.host02.oslab.priv
  1558  00:04:36  -          2020.topic-host02.oslab.priv.30661.1
  ...
```
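The core symptom here, a service that is still running but whose log went quiet hours ago, can be caught mechanically rather than by eyeballing `tail`. A minimal sketch (the helper names are hypothetical, not part of nova; it only assumes the timestamp format visible in the log lines above):

```python
from datetime import datetime, timedelta

# Timestamp format at the start of each nova log line,
# e.g. "2014-03-26 17:05:51.662 30160 INFO ..."
NOVA_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def last_log_timestamp(lines):
    """Return the timestamp of the newest parseable log line, or None."""
    for line in reversed(lines):
        try:
            # The timestamp is the first two whitespace-separated fields.
            ts = " ".join(line.split(" ", 2)[:2])
            return datetime.strptime(ts, NOVA_TS_FORMAT)
        except (ValueError, IndexError):
            continue
    return None

def log_is_stale(lines, now, max_age=timedelta(hours=1)):
    """True if the newest log entry is older than max_age (or absent)."""
    last = last_log_timestamp(lines)
    return last is None or (now - last) > max_age
```

Fed the `tail` output above with `now` set to the `date` output (Thu Mar 27 00:03:34 UTC), this flags the scheduler log as roughly four hours stale.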
```
qpid: EOF  agents  call  help  id  list  quit  schema  set  show
qpid: show 1556
Object of type: org.apache.qpid.broker:queue:_data(245bf196-58a4-e2cd-7b3b-5fea66198436)
Attribute                1556
==================================
vhostRef                 2020
name                     scheduler
durable                  False
autoDelete               False
exclusive                False
arguments                {}
altExchange              absent
msgTotalEnqueues         11675
msgTotalDequeues         11675
msgTxnEnqueues           0
msgTxnDequeues           0
msgPersistEnqueues       0
msgPersistDequeues       0
msgDepth                 0
byteDepth                0
byteTotalEnqueues        134424837
byteTotalDequeues        134424837
byteTxnEnqueues          0
byteTxnDequeues          0
bytePersistEnqueues      0
bytePersistDequeues      0
consumerCount            1
consumerCountHigh        1
consumerCountLow         1
bindingCount             2
bindingCountHigh         2
bindingCountLow          2
unackedMessages          0
unackedMessagesHigh      0
unackedMessagesLow       0
messageLatencySamples    0s
messageLatencyMin        0s
messageLatencyMax        0s
messageLatencyAverage    0s
flowStopped              False
flowStoppedCount         0
qpid: show 1557
Object of type: org.apache.qpid.broker:queue:_data(245bf196-58a4-e2cd-7b3b-5fea66198436)
Attribute                1557
====================================================
vhostRef                 2020
name                     scheduler.host02.oslab.priv
durable                  False
autoDelete               False
exclusive                False
arguments                {}
altExchange              absent
msgTotalEnqueues         0
msgTotalDequeues         0
msgTxnEnqueues           0
msgTxnDequeues           0
msgPersistEnqueues       0
msgPersistDequeues       0
msgDepth                 0
byteDepth                0
byteTotalEnqueues        0
byteTotalDequeues        0
byteTxnEnqueues          0
byteTxnDequeues          0
bytePersistEnqueues      0
bytePersistDequeues      0
consumerCount            1
consumerCountHigh        1
consumerCountLow         1
bindingCount             2
bindingCountHigh         2
bindingCountLow          2
unackedMessages          0
unackedMessagesHigh      0
unackedMessagesLow       0
messageLatencySamples    0s
messageLatencyMin        0s
messageLatencyMax        0s
messageLatencyAverage    0s
flowStopped              False
flowStoppedCount         0
qpid:
```

I also ran an strace of the scheduler for a little while; it's attached. Also attached is the lsof output.
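For reference, the way to read the counters above: `msgTotalEnqueues == msgTotalDequeues`, `msgDepth 0`, `unackedMessages 0`, and `consumerCount 1` together mean the broker believes every message was delivered to an attached consumer and acknowledged. A small sketch of that interpretation (the function name is hypothetical; the dict keys are the QMF attributes shown above):

```python
def queue_looks_healthy(stats):
    """Interpret QMF queue stats: broker-side delivery looks fine when
    every enqueued message was dequeued, nothing is backed up or
    unacked, at least one consumer is attached, and flow control has
    not kicked in."""
    return (stats["msgTotalEnqueues"] == stats["msgTotalDequeues"]
            and stats["msgDepth"] == 0
            and stats["unackedMessages"] == 0
            and stats["consumerCount"] >= 1
            and not stats["flowStopped"])

# Values qpid-tool reported for queue 1556 ("scheduler") above.
scheduler_queue = {
    "msgTotalEnqueues": 11675,
    "msgTotalDequeues": 11675,
    "msgDepth": 0,
    "unackedMessages": 0,
    "consumerCount": 1,
    "flowStopped": False,
}
```

The catch, which is what makes this bug hard, is that healthy broker-side counters only prove delivery and acknowledgement; they say nothing about whether the consumer actually did anything with the messages afterwards.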
Nothing in particular stands out. The only AMQP communication I can see is on this fd:

```
nova-sche 30160 nova  9u  IPv4 77079819  0t0  TCP localhost:38355->localhost:amqp (ESTABLISHED)
```

which consists of:

```
recvfrom(9, "\17\0\0\20\0\0\0\0\0\0\0\0\1\n\0\0", 65536, 0, NULL, NULL) = 16
```
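For what it's worth, those 16-byte reads look consistent with AMQP 0-10 framing (the protocol qpid speaks): a 12-byte frame header whose size field covers the whole frame, followed here by a 4-byte control payload. A sketch of decoding the captured bytes under that assumption; the field layout is from my reading of the AMQP 0-10 spec, not verified against this trace, and `parse_frame_header` is a hypothetical helper, not part of nova or qpid:

```python
import struct

def parse_frame_header(data):
    """Decode an AMQP 0-10 frame header: flags (1 byte), segment type
    (1), size (2, big-endian, includes the 12-byte header), reserved
    (1), track (1), channel (2), then 4 reserved bytes."""
    flags, ftype, size, _rsv, track, channel = struct.unpack_from(">BBHBBH", data)
    return {"flags": flags, "type": ftype, "size": size,
            "track": track, "channel": channel,
            "payload": data[12:size]}

# The 16 bytes recvfrom() returned in the strace excerpt above
# ("\17\0\0\20\0\0\0\0\0\0\0\0\1\n\0\0" in octal escapes).
frame = bytes([0x0f, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00,
               0x00, 0x00, 0x00, 0x00, 0x01, 0x0a, 0x00, 0x00])
hdr = parse_frame_header(frame)
# size == 16 matches the recvfrom() return value; type 0 is a control
# segment, and payload bytes 0x01 0x0a would be connection.heartbeat
# if I'm recalling the spec's class/control codes correctly.
```

If that reading is right, the connection was alive but idle: the scheduler was exchanging heartbeats with the broker while doing no actual work, which matches the queue counters above.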