Bug 918530
| Field | Value |
|---|---|
| Summary | dnsmasq "no address available" after a system with 8 VMs is idle for a few days |
| Product | Red Hat OpenStack |
| Component | openstack-nova |
| Status | CLOSED DEFERRED |
| Severity | urgent |
| Priority | high |
| Version | 2.0 (Folsom) |
| Target Milestone | --- |
| Target Release | 3.0 |
| Reporter | Ofer Blaut \<oblaut\> |
| Assignee | Brent Eagles \<beagles\> |
| QA Contact | Ofer Blaut \<oblaut\> |
| CC | beagles, dallan, ndipanov |
| Keywords | Reopened |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2013-04-29 20:06:28 UTC |
Do you still have an environment that demonstrates this problem? I'm wondering if this patch fixes it:
https://bugs.launchpad.net/nova/+bug/1103260
https://review.openstack.org/#/c/21057/
If so, can we try applying this patch to the environment to see if it fixes it?
So far I've been unable to reproduce this in a passive fashion as implied by the report. The upstream issue that Russell points out in comment #2 is interesting, as it lends clues on how to synthesize the same or a similar issue. I'll give it a shot.

Actually, I was just looking more closely at the logs and focused on the data included in the stack traces in the network.log file. It shows an error suggesting that the network with id "1" does not exist, yet the nova-manage network list output above shows an id of "2". Value equivalence isn't sufficient, unfortunately; records have to match on key values. Is it possible that someone deleted and recreated the network?

The issue is not reproduced on openstack-nova-network-2012.2.3-7.el6ost.noarch. The setup was running for 7 days; that test was running without VLANs.

Added a test with VLAN tags:

```
[root@puma04 ~(keystone_admin)]$ nova-manage network list
id  IPv4            IPv6  start address  DNS1  DNS2  VlanID  project                           uuid
6   10.35.175.0/24  None  10.35.175.3    None  None  190     83198303726c4f989d35d251407ef93e  656e5991-8b7a-401c-be6c-4495060efd0f
```

The issue happened again. (The setup will remain up; the VM is active at 10.35.175.18.)

```
nova list --host puma04
+--------------------------------------+----------+--------+---------------------------+
| ID                                   | Name     | Status | Networks                  |
+--------------------------------------+----------+--------+---------------------------+
| 063e7aa2-5449-4e16-99eb-807e521ddfc4 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.10 |
| 1032947a-78bc-4e3f-9466-d6d1c56bf973 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.18 |
| 194f1fdb-7e17-4a9d-b3da-554f4770cd89 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.12 |
| 5006f1a2-9d0c-4466-8033-8c78f56cbe8c | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.16 |
| a99dcacf-d5f1-4694-8be4-5a291cbe69fe | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.6  |
| d7468870-de04-494c-a3d6-9ef6b5e4d807 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.14 |
| 86138fe1-5bf5-4311-a053-6957edf8bca9 | perf     | ACTIVE | net_vlan_190=10.35.175.4  |
+--------------------------------------+----------+--------+---------------------------+
```

Since there was no ping, I restarted the network on the VM, and errors are seen on the compute node in /var/log/messages:

```
grep fa:16:3e:69:3a:28 /var/log/messages
Apr 15 10:29:51 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:29:56 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:04 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:13 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:21 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:35 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
```

Logs from compute.log:

```
2013-04-14 04:05:33 7454 INFO nova.compute.manager [-] [instance: 1032947a-78bc-4e3f-9466-d6d1c56bf973] During sync_power_state the instance has a pending task. Skip.
```
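When triaging a log like the one above, it helps to tally the repeated "no address available" failures per MAC address. A minimal sketch (`count_failures` is a hypothetical helper, not part of nova; the regex follows the dnsmasq-dhcp line format shown above, where the requested IP is sometimes absent):

```python
import re
from collections import Counter

# Matches dnsmasq-dhcp DHCPDISCOVER failures; the IP before the MAC is optional.
PATTERN = re.compile(
    r"dnsmasq-dhcp\[\d+\]: DHCPDISCOVER\(\S+\) (?:\S+ )?"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) no address available"
)

def count_failures(lines):
    """Count 'no address available' DHCPDISCOVER events per MAC address."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[m.group("mac")] += 1
    return counts
```

Feeding it the /var/log/messages excerpt above would show six failures for fa:16:3e:69:3a:28, which makes it easy to see which guests are stuck retrying DHCP.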
Data from the fixed_ips DB (all rows created 2013-04-02 07:13:14; deleted_at is NULL and host is NULL throughout):

```
id   | address      | updated_at          | deleted | network_id | allocated | leased | reserved | virtual_interface_id | instance_uuid
1537 | 10.35.175.0  | NULL                | 0       | 6          | 0         | 0      | 1        | NULL                 | NULL
1538 | 10.35.175.1  | NULL                | 0       | 6          | 0         | 0      | 1        | NULL                 | NULL
1539 | 10.35.175.2  | NULL                | 0       | 6          | 0         | 0      | 1        | NULL                 | NULL
1540 | 10.35.175.3  | 2013-04-10 07:04:28 | 0       | 6          | 1         | 1      | 0        | 269                  | d36014c4-b711-4c8d-bbe4-f88b30bee72f
1541 | 10.35.175.4  | 2013-04-10 07:04:44 | 0       | 6          | 1         | 1      | 0        | 270                  | 86138fe1-5bf5-4311-a053-6957edf8bca9
1542 | 10.35.175.5  | 2013-04-04 06:55:58 | 0       | 6          | 1         | 0      | 0        | 248                  | fad10580-ff08-4455-88a1-f119857ba013
1543 | 10.35.175.6  | 2013-04-04 07:11:00 | 0       | 6          | 1         | 1      | 0        | 249                  | a99dcacf-d5f1-4694-8be4-5a291cbe69fe
1544 | 10.35.175.7  | 2013-04-04 06:55:58 | 0       | 6          | 1         | 0      | 0        | 267                  | 684818e6-0eef-42a0-8095-77c784a1bd05
1545 | 10.35.175.8  | 2013-04-04 08:24:07 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1546 | 10.35.175.9  | 2013-04-04 06:55:58 | 0       | 6          | 1         | 0      | 0        | 252                  | 96fcd4e9-1060-4c50-88b2-5ac7f14a7d4e
1547 | 10.35.175.10 | 2013-04-03 06:43:52 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1548 | 10.35.175.11 | 2013-04-03 06:42:42 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1549 | 10.35.175.12 | 2013-04-04 07:02:36 | 0       | 6          | 1         | 0      | 0        | 255                  | 194f1fdb-7e17-4a9d-b3da-554f4770cd89
1550 | 10.35.175.13 | 2013-04-03 06:38:25 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1551 | 10.35.175.14 | 2013-04-04 07:10:16 | 0       | 6          | 1         | 1      | 0        | 257                  | d7468870-de04-494c-a3d6-9ef6b5e4d807
1552 | 10.35.175.15 | 2013-04-04 06:56:10 | 0       | 6          | 1         | 0      | 0        | 258                  | a200ce2d-cb71-481e-a61a-07bb41a17a9c
1553 | 10.35.175.16 | 2013-04-02 12:42:53 | 0       | 6          | 1         | 1      | 0        | 259                  | 5006f1a2-9d0c-4466-8033-8c78f56cbe8c
1554 | 10.35.175.17 | 2013-04-03 06:43:13 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1555 | 10.35.175.18 | 2013-04-03 06:51:01 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
1556 | 10.35.175.19 | 2013-04-03 06:51:01 | 0       | 6          | 0         | 0      | 0        | NULL                 | NULL
```

IMHO it seems related to the power state:

```
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 063e7aa2-5449-4e16-99eb-807e521ddfc4] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 194f1fdb-7e17-4a9d-b3da-554f4770cd89] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 5006f1a2-9d0c-4466-8033-8c78f56cbe8c] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 1032947a-78bc-4e3f-9466-d6d1c56bf973] During sync_power_state the instance has a pending task. Skip.
```
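Rows like the ones in the fixed_ips dump above can be cross-checked for leaks: addresses still marked allocated or leased for instances that are known to be deleted. A sketch of such a consistency check, using an in-memory sqlite3 stand-in for the real MySQL fixed_ips table (`find_leaked_ips` and the trimmed four-column schema are illustrative, not nova code):

```python
import sqlite3

def find_leaked_ips(conn, deleted_instance_uuids):
    """Return addresses still allocated or leased to instances known to be deleted."""
    placeholders = ",".join("?" * len(deleted_instance_uuids))
    rows = conn.execute(
        "SELECT address FROM fixed_ips "
        "WHERE (allocated = 1 OR leased = 1) "
        "AND instance_uuid IN (%s)" % placeholders,
        deleted_instance_uuids,
    ).fetchall()
    return [r[0] for r in rows]

# Toy data mirroring two rows from the dump above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fixed_ips "
    "(address TEXT, allocated INT, leased INT, instance_uuid TEXT)"
)
conn.executemany(
    "INSERT INTO fixed_ips VALUES (?, ?, ?, ?)",
    [
        ("10.35.175.16", 1, 1, "5006f1a2-9d0c-4466-8033-8c78f56cbe8c"),
        ("10.35.175.4", 1, 1, "86138fe1-5bf5-4311-a053-6957edf8bca9"),
    ],
)
leaked = find_leaked_ips(conn, ["5006f1a2-9d0c-4466-8033-8c78f56cbe8c"])
```

With 5006f1a2 in the deleted list, the query flags 10.35.175.16 as an address that should have been released but wasn't, while 10.35.175.4 (the live "perf" VM) is left alone.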
Other VMs, like 10.35.175.10 [instance: 063e7aa2-5449-4e16-99eb-807e521ddfc4], have no ping and no IP either, as do the rest of the instances with sync_power_state messages.

openstack-nova-network-2012.2.3-9.el6ost.noarch. This test was running with VLANs.

Something very interesting is going on here, but I'm not sure exactly what it is yet. I compiled a list of the results of 'nova show' on each VM as well as a 'virsh list' on both compute nodes. The addresses are indeed gone... but in some cases, so are the VMs! Also, several VMs have a task state of "deleting", and the value in the related "instance_name" attribute shows up in the shutoff list in virsh list. The "deleting" task is likely the cause of the "pending task" message indicated in comment 9.

Okay, after grepping all of the nova logs for all of the instance ids, at least the four mentioned in comment 9 were all deleted. On April 3, around 9:41 server time, delete requests for the following instance ids went across the nova API:

a) 1032947a-78bc-4e3f-9466-d6d1c56bf973
b) 5006f1a2-9d0c-4466-8033-8c78f56cbe8c
c) a200ce2d-cb71-481e-a61a-07bb41a17a9c
d) ca281c48-3ede-4e0a-baba-26fbb696c1fe
e) 063e7aa2-5449-4e16-99eb-807e521ddfc4
f) 194f1fdb-7e17-4a9d-b3da-554f4770cd89
g) 96fcd4e9-1060-4c50-88b2-5ac7f14a7d4e

In fact, there were two delete requests each for a), b), and c), not a minute apart. Maybe a race condition there somewhere? In the "show" results for each of these instances, the task state is "deleting". Some or all of them have also been shut down as far as "virsh list --all" is concerned. I say "some" because I didn't build a full UUID -> instance name -> virsh list table, but a few spot checks of the list show that the VMs themselves are indeed down.

So the problem would not be dnsmasq: the problem appears to be that deleting a VM can sometimes fail to complete, leaving bad data around. In at least one instance, there was a timeout exception while processing the instance deletion. A likely culprit!
By rights, the caller that set the task state to "deleting" should have set it back to None when the exception occurred, allowing other processes to do something about repairing the inconsistency.

The plot thickens: keeping the "deleting" state around in the event of an error is deliberate, and part of a fix for incorrect quota manipulation:

```python
# Note(eglynn): we do not decorate this action with reverts_task_state
# because a failure during termination should leave the task state as
# DELETING, as a signal to the API layer that a subsequent deletion
# attempt should not result in a further decrement of the quota_usages
# in_use count (see bug 1046236).
```

It would seem, then, that the deleting task is somewhat of a special case. I'm going through the code to see if there is anything capable of dealing with a pending task that will never complete. The VM isn't put into an error state either... which it probably should be, since an operation has been initiated that we cannot intelligently recover from.

As this behavior isn't a dnsmasq or network issue specifically, I've summarized the result of the analysis and the actual behavior in https://bugzilla.redhat.com/show_bug.cgi?id=957888. I'll add a reference back to this issue and close it. While the information on the test systems was instrumental in determining what was actually happening, it won't stick around forever, and much of the information in the comments implies other "behavior" entirely.

Removing external tracker bug with the id 'https://bugs.launchpad.net/nova/+bug/958593' as it is not valid for this tracker.
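The trade-off described in the quoted note can be illustrated with a minimal sketch (hypothetical names, not nova's actual implementation): a `reverts_task_state`-style decorator resets the task state to None on failure so other workers can retry, while the terminate path deliberately omits it, so a failed delete stays stuck in "deleting":

```python
import functools

def reverts_task_state(fn):
    """On failure, reset the instance's task_state so the operation can be retried."""
    @functools.wraps(fn)
    def wrapper(instance, *args, **kwargs):
        try:
            return fn(instance, *args, **kwargs)
        except Exception:
            instance["task_state"] = None  # undo the in-progress marker
            raise
    return wrapper

@reverts_task_state
def reboot_instance(instance):
    instance["task_state"] = "rebooting"
    raise RuntimeError("hypervisor timeout")  # simulated failure

def terminate_instance(instance):
    # Deliberately NOT decorated: a failed termination leaves the task
    # state as "deleting" so a retried delete does not decrement quota twice.
    instance["task_state"] = "deleting"
    raise RuntimeError("hypervisor timeout")  # simulated failure
```

This is exactly the behavior observed on the test systems: the undecorated delete path failed, the instance stayed in "deleting" forever, and sync_power_state kept skipping it.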
Created attachment 705925 [details]: Nova Logs

Description of problem:
I have an installation of an OpenStack nova controller and a compute node (snap 3). The setup was idle for a few days; when trying to access the VMs, they had no IP address and no IP is leased. The nova-network log files are attached (tar).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a network.
2. Launch a few VMs and wait until you see "no address available" in /var/log/messages:
   Mar 6 14:35:01 puma04 dnsmasq-dhcp[4244]: DHCPDISCOVER(br190) fa:16:3e:0f:cc:34 no address available
3. See the attached nova log files.

The network does exist:

```
[root@puma04 ~(keystone_admin)]$ nova-manage network list
id  IPv4            IPv6  start address  DNS1  DNS2  VlanID  project                           uuid
2   10.35.175.0/24  None  10.35.175.3    None  None  190     4da88bfcd411417bb98eebee000c8952  8e9f19f9-029e-4b76-ac4c-9f485d8be036
```

Actual results:

Expected results:

Additional info: