Bug 1372257

Summary: neutron router was active on two controller nodes
Product: Red Hat OpenStack
Reporter: VIKRANT <vaggarwa>
Component: openstack-neutron
Assignee: John Schwarz <jschwarz>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Toni Freger <tfreger>
Severity: high
Priority: high
Version: 7.0 (Kilo)
CC: adhingra, amuller, chrisw, jlibosva, jschwarz, majopela, nyechiel, srevivo, vaggarwa
Target Milestone: async
Keywords: ZStream
Target Release: 7.0 (Kilo)
Flags: vaggarwa: needinfo-
Hardware: All
OS: Linux
Last Closed: 2016-12-08 11:32:26 UTC
Type: Bug

Description VIKRANT 2016-09-01 09:20:37 UTC
Description of problem:

The OpenStack neutron router was active on two controller nodes simultaneously. The customer removed the ports from the router; after that it ran on only one controller node.

Output at time of issue : 

# neutron l3-agent-list-hosting-router PUSH_ROUTER
+--------------------------------------+-----------+----------------+-------+----------+
| id                                   | host      | admin_state_up | alive | ha_state |
+--------------------------------------+-----------+----------------+-------+----------+
| f42f49d9-8651-4963-8722-66c8aafa39b3 | ospctrl01 | True           | :-)   | active   |
| e58a950d-e348-4804-a000-9b8ca163fced | ospctrl02 | True           | :-)   | active   |
| 0f55da82-46f1-4354-bece-aadcec1f1924 | ospctrl03 | True           | :-)   | standby  |
+--------------------------------------+-----------+----------------+-------+----------+

Output after issue fixed:

# neutron l3-agent-list-hosting-router PUSH_ROUTER
+--------------------------------------+-----------+----------------+-------+----------+
| id                                   | host      | admin_state_up | alive | ha_state |
+--------------------------------------+-----------+----------------+-------+----------+
| f42f49d9-8651-4963-8722-66c8aafa39b3 | ospctrl01 | True           | :-)   | active   |
| e58a950d-e348-4804-a000-9b8ca163fced | ospctrl02 | True           | :-)   | standby  |
| 0f55da82-46f1-4354-bece-aadcec1f1924 | ospctrl03 | True           | :-)   | standby  |
+--------------------------------------+-----------+----------------+-------+----------+
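For future triage, a quick sketch (not part of the original report) that counts the `active` rows in saved `neutron l3-agent-list-hosting-router` output to flag this dual-master condition; the file path is illustrative and the sample data mirrors the table above:

```shell
# Count how many l3-agents report ha_state 'active' for one router.
# Sample data mirrors the tables above; /tmp path is illustrative.
cat > /tmp/agents.txt <<'EOF'
| f42f49d9-8651-4963-8722-66c8aafa39b3 | ospctrl01 | True | :-) | active  |
| e58a950d-e348-4804-a000-9b8ca163fced | ospctrl02 | True | :-) | active  |
| 0f55da82-46f1-4354-bece-aadcec1f1924 | ospctrl03 | True | :-) | standby |
EOF
# Field 6 (split on '|') holds ha_state; a healthy HA router shows 1.
active=$(awk -F'|' '$6 ~ /active/ {n++} END {print n+0}' /tmp/agents.txt)
echo "agents reporting active: $active"
```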

Version-Release number of selected component (if applicable):
RHEL OSP 7

How reproducible:
Issue seen for the first time; not yet reproduced.

Steps to Reproduce:
1.
2.
3.

Actual results:
The neutron router was active on two controller nodes simultaneously, due to which instances were not reachable via their floating IPs.

Expected results:
The neutron HA router should be active on exactly one controller node at any time.

Additional info:

Adding more info in next private comment.

Comment 6 John Schwarz 2016-09-25 13:54:58 UTC
Also, this might or might not be related to an upstream bug currently in flight: https://bugs.launchpad.net/neutron/+bug/1580648. I'm posting this here for future reference.

Comment 7 John Schwarz 2016-09-26 16:26:32 UTC
Lastly, this sounds a bit like https://bugzilla.redhat.com/show_bug.cgi?id=1181592. Miguel, can you take a look at the logs please?

Comment 9 Miguel Angel Ajo 2016-09-27 08:29:18 UTC
Hey, I need /var/log/messages, /etc/hosts and some other details to confirm jschwarz's theory in comment 7, which sounds very reasonable. @vikrant, check that specific bz.

Could you post the full sosreport logs for confirmation, please? 

Extra details: 

This happens because keepalived in the qrouter namespace does not have access to the DNS servers defined in the host's /etc/resolv.conf. keepalived tries to resolve the current host's IP address via DNS and blocks for 60 seconds (stalling VRRP advertisements, so another host transitions to MASTER).

You can set a workaround in place with instructions in: 

https://bugzilla.redhat.com/show_bug.cgi?id=1181592#c12

With that in place, when keepalived tries to resolve the hostname it is found in /etc/hosts, and the DNS query is avoided.
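A minimal sketch of that workaround's effect, demonstrated on a scratch copy of the hosts file (the IP address and /tmp path are placeholders; on the real controllers the target is /etc/hosts, per the bz linked above):

```shell
# Pin the host's own name so keepalived resolves it locally instead of
# querying DNS (which the qrouter namespace cannot reach).
# Demonstrated on a scratch file; on the controllers use /etc/hosts.
hosts=/tmp/hosts.demo
: > "$hosts"
name=$(hostname)
ip=192.0.2.11                       # placeholder for the node's real IP
# Append the entry only if the name is not already present.
grep -q "$name" "$hosts" || echo "$ip $name" >> "$hosts"
grep "$name" "$hosts"               # lookup now succeeds without DNS
```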

Best regards.

Comment 10 John Schwarz 2016-09-27 08:36:16 UTC
I would also like to add that the flip-flop transitions don't occur sporadically (i.e. throughout the entire day), but during specific times of the day (mostly 14:00 - 01:00, which can be considered normal depending on the time zone). This supports the idea that some user action on the setup rewrites keepalived.conf, causing the process to reload the configuration file and then hit the DNS issue.

It would also be very helpful to ask what the user was doing at the times the issue was encountered (comment #2), e.g. whether new VMs were being added.

Comment 12 Anil Dhingra 2016-11-18 06:07:52 UTC
Any update? The user is hitting the same issue again and again.

Comment 13 John Schwarz 2016-11-18 12:43:51 UTC
Apologies; for some reason I didn't receive email notifications about this Bugzilla.

From a brief look at the logs, the flip-flop pattern occurs consistently once every 37-40 seconds, which implies there might indeed be an issue with the DNS. Miguel, please have a look at the logs and let me know what you think.
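To quantify that pattern, the state-transition timestamps can be pulled out of the logs. A hedged sketch: the log lines below are typical keepalived messages standing in for /var/log/messages, not taken from this report's attachments, so the grep pattern may need adjusting:

```shell
# List keepalived VRRP state transitions with their timestamps so the
# flip-flop interval can be eyeballed. Sample stands in for
# /var/log/messages; the message format is assumed, not confirmed.
cat > /tmp/sample_messages <<'EOF'
Sep 25 13:00:00 ospctrl02 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering MASTER STATE
Sep 25 13:00:38 ospctrl02 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering BACKUP STATE
Sep 25 13:01:16 ospctrl02 Keepalived_vrrp[1234]: VRRP_Instance(VR_1) Entering MASTER STATE
EOF
grep 'Entering .* STATE' /tmp/sample_messages |
  awk '{print $3, $(NF-1)}'         # timestamp and new VRRP state
```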

Also, Anil, can we ask the user to run the command specified in comment #9 on each of the network nodes (specifically also on ospctrl02 and ospctrl03):

dig A $(hostname) \
  | grep -A1 "ANSWER SEC" \
  | tail -n 1 \
  | awk '{print $NF " " $1}' \
  | sed -e 's/.$//g' >> /etc/hosts
grep $(hostname) /etc/hosts || echo "Failure setting up the hostname entry"

We'll make sure to follow up on this.