Bug 1372257 - neutron router was active on two controller nodes
Summary: neutron router was active on two controller nodes
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: async
Target Release: 7.0 (Kilo)
Assignee: John Schwarz
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-01 09:20 UTC by VIKRANT
Modified: 2020-01-17 15:54 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-08 11:32:26 UTC
Target Upstream Version:
Flags: vaggarwa: needinfo-



Description VIKRANT 2016-09-01 09:20:37 UTC
Description of problem:

The OpenStack neutron router was active on two controller nodes simultaneously. The customer removed the ports from the router, after which it ran active on only one controller node.

Output at the time of the issue:

# neutron l3-agent-list-hosting-router PUSH_ROUTER
+--------------------------------------+-----------+----------------+-------+----------+
| id                                   | host      | admin_state_up | alive | ha_state |
+--------------------------------------+-----------+----------------+-------+----------+
| f42f49d9-8651-4963-8722-66c8aafa39b3 | ospctrl01 | True           | :-)   | active   |
| e58a950d-e348-4804-a000-9b8ca163fced | ospctrl02 | True           | :-)   | active   |
| 0f55da82-46f1-4354-bece-aadcec1f1924 | ospctrl03 | True           | :-)   | standby  |
+--------------------------------------+-----------+----------------+-------+----------+

Output after the issue was fixed:

# neutron l3-agent-list-hosting-router PUSH_ROUTER
+--------------------------------------+-----------+----------------+-------+----------+
| id                                   | host      | admin_state_up | alive | ha_state |
+--------------------------------------+-----------+----------------+-------+----------+
| f42f49d9-8651-4963-8722-66c8aafa39b3 | ospctrl01 | True           | :-)   | active   |
| e58a950d-e348-4804-a000-9b8ca163fced | ospctrl02 | True           | :-)   | standby  |
| 0f55da82-46f1-4354-bece-aadcec1f1924 | ospctrl03 | True           | :-)   | standby  |
+--------------------------------------+-----------+----------------+-------+----------+

Version-Release number of selected component (if applicable):
RHEL OSP 7

How reproducible:
This is the first time the issue has been seen.

Steps to Reproduce:
1.
2.
3.

Actual results:
The neutron router was active on two controller nodes, due to which instances were not reachable via their floating IPs.

Expected results:
The neutron HA router should be active on only a single controller node at any given time.

Additional info:

Adding more info in next private comment.

Comment 6 John Schwarz 2016-09-25 13:54:58 UTC
Also, this might or might not be related to an upstream bug currently in flight: https://bugs.launchpad.net/neutron/+bug/1580648. I'm posting this here for future reference.

Comment 7 John Schwarz 2016-09-26 16:26:32 UTC
Lastly, this sounds a bit like https://bugzilla.redhat.com/show_bug.cgi?id=1181592. Miguel, can you take a look at the logs please?

Comment 9 Miguel Angel Ajo 2016-09-27 08:29:18 UTC
Hey, I need /var/log/messages, /etc/hosts and some other details to confirm jschwarz's theory in comment 7, which sounds very reasonable. @vikrant, please check that specific BZ.

Could you post the full sosreport logs for confirmation, please? 

Extra details: 

This happens because keepalived in the qrouter namespace does not have access to the host-defined DNS servers in /etc/resolv.conf: keepalived tries to resolve the current host's IP address via DNS and blocks for 60 seconds, stopping VRRP advertisements, so the other host transitions in as MASTER.
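
To confirm this on an affected node, the same lookup can be reproduced from inside the router's namespace (the router UUID below is a placeholder; the dig timeout flags just keep the test short):

# ip netns exec qrouter-<router-uuid> dig +time=2 +tries=1 A $(hostname)

If the query times out instead of returning immediately, keepalived's own resolver calls will block in the same way.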

You can set a workaround in place with instructions in: 

https://bugzilla.redhat.com/show_bug.cgi?id=1181592#c12

With that in place, when keepalived tries to resolve the hostname, it is found in /etc/hosts and the DNS query is avoided.
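
For illustration only (the address and names below are made up), the resulting /etc/hosts entry on each controller would look like:

192.0.2.11   ospctrl01.example.com ospctrl01

so the hostname lookup is answered locally, without any network round trip.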

Best regards.

Comment 10 John Schwarz 2016-09-27 08:36:16 UTC
I would also like to add that the flip-flop transitions don't occur sporadically (i.e. throughout the entire day), but during specific times of the day (mostly 14:00 - 01:00, which can be considered normal depending on the time zone). This supports the idea that some user action on the setup triggers a rewrite of keepalived.conf, causing the process to reload the configuration file and then hit the DNS issue.
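
One way to check this theory is to compare the keepalived.conf modification times against the transition windows (the path below is the default ha_confs location under neutron's state_path; adjust if it is overridden):

# stat -c '%y %n' /var/lib/neutron/ha_confs/*/keepalived.conf

If the timestamps cluster around 14:00 - 01:00, a configuration rewrite and reload is the likely trigger.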

If you could also ask the user what he was doing during the times when the issue was encountered (comment #2), e.g. whether he was adding new VMs - that would also be very helpful.

Comment 12 Anil Dhingra 2016-11-18 06:07:52 UTC
Any update? The user is hitting the same issue again and again.

Comment 13 John Schwarz 2016-11-18 12:43:51 UTC
Apologies; for some reason I didn't receive email notifications about this Bugzilla.

From a brief look at the logs, the flip-flop pattern occurs consistently once every 37-40 seconds, which implies there might indeed be a DNS issue. Miguel, please have a look at the logs and let me know what you think.
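
For reference, those transitions should show up in /var/log/messages with entries like the following (the pattern is assumed from stock keepalived syslog output):

# grep -E 'Keepalived_vrrp.*Entering (MASTER|BACKUP) STATE' /var/log/messages | tail

A state change every 37-40 seconds there would match the pattern described above.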

Also, Anil, can we ask the user to apply the workaround from comment #9 by running the following command on each of the network nodes (specifically also on ospctrl02 and ospctrl03):

dig A $(hostname) | grep -A1 "ANSWER SEC" | tail -n 1 | awk '{print $NF " " $1}' | sed -e 's/.$//g'  >>/etc/hosts ;   grep $(hostname) /etc/hosts || echo "Failure setting up the hostname entry"
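
(For clarity: this one-liner resolves the node's own hostname via DNS once, appends the resulting "IP hostname" pair to /etc/hosts, and then verifies the entry is present. After that, keepalived's lookups are answered from /etc/hosts instead of going out to DNS.)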

We'll make sure to follow up on this.

