Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 918530

Summary: dnsmasq: no address available after a system with 8 VMs is idle for a few days
Product: Red Hat OpenStack
Component: openstack-nova
Version: 2.0 (Folsom)
Target Release: 3.0
Reporter: Ofer Blaut <oblaut>
Assignee: Brent Eagles <beagles>
QA Contact: Ofer Blaut <oblaut>
CC: beagles, dallan, ndipanov
Status: CLOSED DEFERRED
Severity: urgent
Priority: high
Keywords: Reopened
Hardware: Unspecified
OS: Unspecified
Type: Bug
Doc Type: Bug Fix
Last Closed: 2013-04-29 20:06:28 UTC
Attachments: Nova Logs

Description Ofer Blaut 2013-03-06 13:15:31 UTC
Created attachment 705925 [details]
Nova Logs

Description of problem:

I have an installation of OpenStack Nova and a compute node (snap 3).

My setup was idle for a few days.

When trying to access the VMs, they had no IP address and no IP was leased.

Attached are the nova-network log files (tar).

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Create a network.
2. Launch a few VMs and wait until you see "no address available" in /var/log/messages (see the sketch after these steps):

Mar  6 14:35:01 puma04 dnsmasq-dhcp[4244]: DHCPDISCOVER(br190) fa:16:3e:0f:cc:34 no address available
3. See the attached Nova log files.
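
For step 2, a minimal watcher along these lines can be used to detect the failure. The log path and match string come from this report; everything else (poll interval, structure) is illustrative, not part of the original reproducer:

import time

def wait_for_failure(log_path="/var/log/messages",
                     pattern="no address available", poll=60):
    # Block until dnsmasq logs a "no address available" line, then
    # return the matching lines.
    while True:
        with open(log_path) as f:
            matches = [line.rstrip() for line in f if pattern in line]
        if matches:
            return matches
        time.sleep(poll)

if __name__ == "__main__":
    for line in wait_for_failure():
        print(line)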
  

The network does exist:
[root@puma04 ~(keystone_admin)]$ nova-manage network list
id   	IPv4              	IPv6           	start address  	DNS1           	DNS2           	VlanID         	project        	uuid           
2    	10.35.175.0/24    	None           	10.35.175.3    	None           	None           	190            	4da88bfcd411417bb98eebee000c8952	8e9f19f9-029e-4b76-ac4c-9f485d8be036


Actual results:


Expected results:


Additional info:

Comment 2 Russell Bryant 2013-03-19 16:11:49 UTC
Do you still have an environment that demonstrates this problem?  I'm wondering if this patch fixes it:

    https://bugs.launchpad.net/nova/+bug/1103260
    https://review.openstack.org/#/c/21057/

If so, can we try applying this patch to the environment to see if it fixes it?

Comment 4 Brent Eagles 2013-03-20 20:23:51 UTC
So far I've been unable to reproduce this in a passive fashion as implied by the report. The upstream issue that Russell points out in Comment #2 is interesting as it lends clues on how to synthesize the same or similar issue. I'll give it a shot.

Comment 5 Brent Eagles 2013-03-21 17:26:26 UTC
Actually, I was just looking more closely at the logs and focused a bit on the data included in the stack traces present in the network.log log file. It shows an error suggesting that the network with id "1" does not exist, yet the nova-manage network list output above shows an id of "2". Unfortunately, value equivalence isn't sufficient; records have to match on key values. Is it possible that someone deleted and recreated the network?
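
A toy illustration of the point about key values, using plain dicts as stand-ins for the networks table (this is not Nova code):

networks = {
    2: {"cidr": "10.35.175.0/24", "vlan": 190},  # the recreated row, id=2
}

def get_network(network_id):
    # Lookup is by primary key; a row with an equal CIDR under a
    # different id does not help.
    try:
        return networks[network_id]
    except KeyError:
        raise LookupError("Network with id %s does not exist" % network_id)

get_network(2)        # fine
try:
    get_network(1)    # stale reference to the deleted row: fails
except LookupError as e:
    print(e)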

Comment 7 Ofer Blaut 2013-04-02 08:32:36 UTC
The issue is not reproduced on openstack-nova-network-2012.2.3-7.el6ost.noarch.
The setup was running for 7 days; this test was run without VLANs.

Added a test with VLAN tags:

[root@puma04 ~(keystone_admin)]$ nova-manage network list
id   	IPv4              	IPv6           	start address  	DNS1           	DNS2           	VlanID         	project        	uuid           
6    	10.35.175.0/24    	None           	10.35.175.3    	None           	None           	190            	83198303726c4f989d35d251407ef93e	656e5991-8b7a-401c-be6c-4495060efd0f

Comment 8 Ofer Blaut 2013-04-15 07:48:02 UTC
The issue happened again (the setup will remain up for investigation).

The VM is active (10.35.175.18):

nova list --host puma04
+--------------------------------------+----------+--------+---------------------------+
| ID                                   | Name     | Status | Networks                  |
+--------------------------------------+----------+--------+---------------------------+
| 063e7aa2-5449-4e16-99eb-807e521ddfc4 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.10 |
| 1032947a-78bc-4e3f-9466-d6d1c56bf973 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.18 |
| 194f1fdb-7e17-4a9d-b3da-554f4770cd89 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.12 |
| 5006f1a2-9d0c-4466-8033-8c78f56cbe8c | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.16 |
| a99dcacf-d5f1-4694-8be4-5a291cbe69fe | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.6  |
| d7468870-de04-494c-a3d6-9ef6b5e4d807 | VM-FED17 | ACTIVE | net_vlan_190=10.35.175.14 |
| 86138fe1-5bf5-4311-a053-6957edf8bca9 | perf     | ACTIVE | net_vlan_190=10.35.175.4  |
+--------------------------------------+----------+--------+---------------------------+

Since there was no ping, I restarted the network on the VM, and these errors
were seen on the compute node in /var/log/messages:

grep  fa:16:3e:69:3a:28 /var/log/messages
Apr 15 10:29:51 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:29:56 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:04 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:13 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:21 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
Apr 15 10:30:35 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) 10.35.175.18 fa:16:3e:69:3a:28 no address available
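
To correlate these failures with instances, the syslog lines can be parsed into structured records. A sketch, with the regex written against the exact format shown above (adjust for other syslog layouts):

import re

LINE_RE = re.compile(
    r"(?P<ts>\w+\s+\d+ [\d:]+) \S+ dnsmasq-dhcp\[(?P<pid>\d+)\]: "
    r"DHCPDISCOVER\((?P<bridge>\S+)\) (?:\S+ )?"
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) no address available")

def parse_failures(lines):
    # Extract (timestamp, pid, bridge, mac) from each failing DHCPDISCOVER.
    return [m.groupdict() for m in map(LINE_RE.search, lines) if m]

sample = ("Apr 15 10:29:51 puma04 dnsmasq-dhcp[8998]: DHCPDISCOVER(br190) "
          "10.35.175.18 fa:16:3e:69:3a:28 no address available")
print(parse_failures([sample]))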



Logs from compute.log:

2013-04-14 04:05:33 7454 INFO nova.compute.manager [-] [instance: 1032947a-78bc-4e3f-9466-d6d1c56bf973] During sync_power_state the instance has a pending task. Skip.

Data from the fixed_ips DB table:


+---------------------+---------------------+---------------------+---------+------+----------------+------------+-----------+--------+----------+----------------------+------+--------------------------------------+
| created_at          | updated_at          | deleted_at          | deleted | id   | address        | network_id | allocated | leased | reserved | virtual_interface_id | host | instance_uuid                        |
+---------------------+---------------------+---------------------+---------+------+----------------+------------+-----------+--------+----------+----------------------+------+--------------------------------------+
| 2013-04-02 07:13:14 | NULL                | NULL                |       0 | 1537 | 10.35.175.0    |          6 |         0 |      0 |        1 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | NULL                | NULL                |       0 | 1538 | 10.35.175.1    |          6 |         0 |      0 |        1 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | NULL                | NULL                |       0 | 1539 | 10.35.175.2    |          6 |         0 |      0 |        1 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-10 07:04:28 | NULL                |       0 | 1540 | 10.35.175.3    |          6 |         1 |      1 |        0 |                  269 | NULL | d36014c4-b711-4c8d-bbe4-f88b30bee72f |
| 2013-04-02 07:13:14 | 2013-04-10 07:04:44 | NULL                |       0 | 1541 | 10.35.175.4    |          6 |         1 |      1 |        0 |                  270 | NULL | 86138fe1-5bf5-4311-a053-6957edf8bca9 |
| 2013-04-02 07:13:14 | 2013-04-04 06:55:58 | NULL                |       0 | 1542 | 10.35.175.5    |          6 |         1 |      0 |        0 |                  248 | NULL | fad10580-ff08-4455-88a1-f119857ba013 |
| 2013-04-02 07:13:14 | 2013-04-04 07:11:00 | NULL                |       0 | 1543 | 10.35.175.6    |          6 |         1 |      1 |        0 |                  249 | NULL | a99dcacf-d5f1-4694-8be4-5a291cbe69fe |
| 2013-04-02 07:13:14 | 2013-04-04 06:55:58 | NULL                |       0 | 1544 | 10.35.175.7    |          6 |         1 |      0 |        0 |                  267 | NULL | 684818e6-0eef-42a0-8095-77c784a1bd05 |
| 2013-04-02 07:13:14 | 2013-04-04 08:24:07 | NULL                |       0 | 1545 | 10.35.175.8    |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-04 06:55:58 | NULL                |       0 | 1546 | 10.35.175.9    |          6 |         1 |      0 |        0 |                  252 | NULL | 96fcd4e9-1060-4c50-88b2-5ac7f14a7d4e |
| 2013-04-02 07:13:14 | 2013-04-03 06:43:52 | NULL                |       0 | 1547 | 10.35.175.10   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-03 06:42:42 | NULL                |       0 | 1548 | 10.35.175.11   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-04 07:02:36 | NULL                |       0 | 1549 | 10.35.175.12   |          6 |         1 |      0 |        0 |                  255 | NULL | 194f1fdb-7e17-4a9d-b3da-554f4770cd89 |
| 2013-04-02 07:13:14 | 2013-04-03 06:38:25 | NULL                |       0 | 1550 | 10.35.175.13   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-04 07:10:16 | NULL                |       0 | 1551 | 10.35.175.14   |          6 |         1 |      1 |        0 |                  257 | NULL | d7468870-de04-494c-a3d6-9ef6b5e4d807 |
| 2013-04-02 07:13:14 | 2013-04-04 06:56:10 | NULL                |       0 | 1552 | 10.35.175.15   |          6 |         1 |      0 |        0 |                  258 | NULL | a200ce2d-cb71-481e-a61a-07bb41a17a9c |
| 2013-04-02 07:13:14 | 2013-04-02 12:42:53 | NULL                |       0 | 1553 | 10.35.175.16   |          6 |         1 |      1 |        0 |                  259 | NULL | 5006f1a2-9d0c-4466-8033-8c78f56cbe8c |
| 2013-04-02 07:13:14 | 2013-04-03 06:43:13 | NULL                |       0 | 1554 | 10.35.175.17   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-03 06:51:01 | NULL                |       0 | 1555 | 10.35.175.18   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
| 2013-04-02 07:13:14 | 2013-04-03 06:51:01 | NULL                |       0 | 1556 | 10.35.175.19   |          6 |         0 |      0 |        0 |                 NULL | NULL | NULL                                 |
+---------------------+---------------------+---------------------+---------+------+----------------+------------+-----------+--------+----------+----------------------+------+--------------------------------------+
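
The inconsistency can be checked mechanically by cross-referencing nova list against fixed_ips. A sketch, with the two dicts below standing in for the command and query output above (no live DB access):

active = {  # from `nova list`: address -> instance uuid
    "10.35.175.18": "1032947a-78bc-4e3f-9466-d6d1c56bf973",
    "10.35.175.4":  "86138fe1-5bf5-4311-a053-6957edf8bca9",
}
fixed_ips = {  # from the fixed_ips table: address -> (allocated, instance_uuid)
    "10.35.175.18": (0, None),
    "10.35.175.4":  (1, "86138fe1-5bf5-4311-a053-6957edf8bca9"),
}

for addr, uuid in sorted(active.items()):
    allocated, owner = fixed_ips.get(addr, (0, None))
    if not allocated or owner != uuid:
        # 10.35.175.18 trips this: ACTIVE in nova list, unallocated in the DB.
        print("INCONSISTENT: %s is ACTIVE on %s but fixed_ips has "
              "allocated=%s, instance_uuid=%s" % (addr, uuid, allocated, owner))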

Comment 9 Ofer Blaut 2013-04-15 07:53:08 UTC
IMHO it seems related to the power state:

2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 063e7aa2-5449-4e16-99eb-807e521ddfc4] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 194f1fdb-7e17-4a9d-b3da-554f4770cd89] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 5006f1a2-9d0c-4466-8033-8c78f56cbe8c] During sync_power_state the instance has a pending task. Skip.
2013-04-15 10:42:01 7454 INFO nova.compute.manager [-] [instance: 1032947a-78bc-4e3f-9466-d6d1c56bf973] During sync_power_state the instance has a pending task. Skip.

Other VMs, like 10.35.175.10 [instance: 063e7aa2-5449-4e16-99eb-807e521ddfc4],
have no ping and no IP, as do the rest of the VMs with sync_power_state messages.

openstack-nova-network-2012.2.3-9.el6ost.noarch

The test was running with VLANs

Comment 10 Brent Eagles 2013-04-17 15:22:45 UTC
Something very interesting is going on here, but I'm not sure exactly what it is yet. I compiled a list of the results of 'nova show' on each VM as well as 'virsh list' on both compute nodes. The addresses are indeed gone... but in some cases, so are the VMs! Also, several VMs have a task state of "deleting", and the values in the related "instance_name" attribute show up in the shutoff list from 'virsh list'. The "deleting" task is likely the cause of the "pending task" message indicated in comment 9.

Comment 11 Brent Eagles 2013-04-17 16:56:15 UTC
Okay, after grepping all of the nova logs for all of the instance ids, at least the four mentioned in comment 9 were deleted. In fact, on April 3, around 9:41 server time, delete requests for the following instance ids went across the nova API:

a.) 1032947a-78bc-4e3f-9466-d6d1c56bf973
b.) 5006f1a2-9d0c-4466-8033-8c78f56cbe8c
c.) a200ce2d-cb71-481e-a61a-07bb41a17a9c
d.) ca281c48-3ede-4e0a-baba-26fbb696c1fe
e.) 063e7aa2-5449-4e16-99eb-807e521ddfc4
f.) 194f1fdb-7e17-4a9d-b3da-554f4770cd89
g.) 96fcd4e9-1060-4c50-88b2-5ac7f14a7d4e

Actually, there were two delete requests each for a.), b.), and c.), less than a minute apart. Maybe a race condition there somewhere? In the "show" results for each of these instances, the task state is "deleting". Some or all of them have also been shut down as far as "virsh list --all" is concerned. I say "some" because I didn't make a UUID -> instance name -> virsh list table, but a few spot checks of the list show that the VMs themselves are indeed down. So the problem would not be dnsmasq... the problem appears to be that deleting a VM can sometimes fail to complete, leaving bad data around.
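
A toy detector for the racing deletes mentioned above; the timestamps are hypothetical (the real input would be parsed from the nova-api log):

from datetime import datetime, timedelta

delete_requests = [  # (instance id, request time) -- hypothetical values
    ("1032947a-78bc-4e3f-9466-d6d1c56bf973", datetime(2013, 4, 3, 9, 41, 2)),
    ("1032947a-78bc-4e3f-9466-d6d1c56bf973", datetime(2013, 4, 3, 9, 41, 40)),
    ("063e7aa2-5449-4e16-99eb-807e521ddfc4", datetime(2013, 4, 3, 9, 41, 10)),
]

window = timedelta(minutes=1)
last_seen = {}
for inst, ts in sorted(delete_requests, key=lambda r: r[1]):
    prev = last_seen.get(inst)
    if prev is not None and ts - prev < window:
        print("possible racing deletes for %s (%s then %s)" % (inst, prev, ts))
    last_seen[inst] = ts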

Comment 12 Brent Eagles 2013-04-17 18:57:25 UTC
In at least one instance, there was a timeout exception when processing the instance deletion. A likely culprit! By rights, the caller that set the task state to deleting should have set it back to None when the exception occurred, allowing other processes to do something about repairing the inconsistency.
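
A sketch of the pattern being described, i.e. reverting the task state when the operation raises. This mirrors the idea behind Nova's reverts_task_state decorator (quoted in the next comment), not its actual implementation:

import functools

def reverts_task_state(fn):
    # On failure, put task_state back to None so other processes can see
    # that no operation is pending and repair the inconsistency.
    @functools.wraps(fn)
    def wrapper(self, instance, *args, **kwargs):
        try:
            return fn(self, instance, *args, **kwargs)
        except Exception:
            instance["task_state"] = None
            raise
    return wrapper

class Manager(object):
    @reverts_task_state
    def do_something(self, instance):
        instance["task_state"] = "working"
        raise RuntimeError("timeout talking to the hypervisor")

inst = {"task_state": None}
try:
    Manager().do_something(inst)
except RuntimeError:
    pass
print(inst["task_state"])  # None: the pending task marker was reverted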

Comment 13 Brent Eagles 2013-04-17 19:18:27 UTC
The plot thickens: keeping the deleting state around in the event of an error is deliberate, and part of a fix for incorrect quota manipulation.

# Note(eglynn): we do not decorate this action with reverts_task_state
# because a failure during termination should leave the task state as
# DELETING, as a signal to the API layer that a subsequent deletion
# attempt should not result in a further decrement of the quota_usages
# in_use count (see bug 1046236).

It would seem then that the deleting task is somewhat of a special case. I'm going through the code to see if there is anything capable of dealing with a pending task that will never complete. The VM isn't put into an error state either... which it probably should be, since an operation has been initiated that we cannot intelligently recover from.
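
A sketch of the kind of check being suggested: find instances that have sat in DELETING past some threshold and surface them (flag ERROR, or retry the delete). This is purely illustrative; the threshold and record layout are assumptions, not Nova code:

from datetime import datetime, timedelta

STUCK_AFTER = timedelta(hours=1)  # assumed threshold

def find_stuck_deletes(instances, now):
    # An instance whose deleting task has made no progress for longer
    # than STUCK_AFTER will never complete on its own.
    return [i for i in instances
            if i["task_state"] == "deleting"
            and now - i["updated_at"] > STUCK_AFTER]

instances = [
    {"uuid": "1032947a-78bc-4e3f-9466-d6d1c56bf973",
     "task_state": "deleting", "updated_at": datetime(2013, 4, 3, 9, 41)},
    {"uuid": "86138fe1-5bf5-4311-a053-6957edf8bca9",
     "task_state": None, "updated_at": datetime(2013, 4, 10, 7, 4)},
]
for inst in find_stuck_deletes(instances, now=datetime(2013, 4, 15, 10, 42)):
    print("stuck in DELETING, candidate for ERROR state:", inst["uuid"])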

Comment 14 Brent Eagles 2013-04-29 20:06:28 UTC
As this behavior isn't a dnsmasq or network issue specifically, I've summarized the result of the analysis and actual behavior in:

https://bugzilla.redhat.com/show_bug.cgi?id=957888

I'll add a reference back to this issue and close it. While the information on the test systems was instrumental in determining what was actually happening, it won't stick around forever, and much of the information in the comments implies other "behavior" entirely.

Comment 15 Red Hat Bugzilla 2013-10-04 00:17:41 UTC
Removing external tracker bug with the id 'https://bugs.launchpad.net/nova/+bug/958593' as it is not valid for this tracker