Bug 1576398
Summary: IP failover doesn't react on router's pod being scaled down

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | Vladislav Walek <vwalek> |
| Component | Networking | Assignee | Ivan Chavero <ichavero> |
| Networking sub component | router | QA Contact | zhaozhanqi <zzhao> |
| Status | CLOSED ERRATA | Severity | high |
| Priority | high | Keywords | NeedsTestCase, Reopened |
| Version | 3.7.0 | Target Release | 3.10.0 |
| Hardware | Unspecified | OS | Unspecified |
| Clones | 1607538 (view as bug list) | Bug Blocks | 1607538 |
| Last Closed | 2018-07-30 19:14:54 UTC | Type | Bug |
| CC | andcosta, aos-bugs, asolanas, bbennett, bmeng, dmoessne, hongli, ichavero, marc.popp, mmasters, seferovic, vwalek, weliang | | |
Description
Vladislav Walek
2018-05-09 11:14:20 UTC
Looking into this. When the router scales down, haproxy remains until current sessions complete; the same happens when router reloads occur. As long as haproxy accepts connections, ipfailover thinks it is still alive.

Continuing the line of reasoning in comment 1: can you run the following commands on the host where the router has been scaled down, substituting the router's address for `$router_addr` and the router service's address for `$service_addr`?

```
</dev/tcp/$router_addr/80
echo $?
</dev/tcp/$service_addr/80
echo $?
```

If the first command prints 0, the router is still accepting connections. If the first command does not print 0 but the second does, it may be that we need to change the router check to use the router's address instead of the service's.

I also saw that the case associated with this Bugzilla report states that failover is not happening "when the check script should fail (verified by manual test)". Are you not seeing failover even when OPENSHIFT_HA_CHECK_SCRIPT is set to a command that exits non-zero?

@Phil, I saw the same problem in the latest 3.10 code: the virtual IP address stays on the node even after the router pod is removed from that node. Another finding is that without deploying router pods to any nodes, ipfailover pods can still be deployed on those nodes, and virtual IP addresses are assigned to them. Running `oc log` on the ipfailover pod shows an "Unable to access script" message:

```
[root@qe-weliang-3master-etcd-nfs-1 keepalived]# oc log ipf-har-1-w89dk
log is DEPRECATED and will be removed in a future version. Use logs instead.
  - Loading ip_vs module ...
  - Checking if ip_vs module is available ...
ip_vs                 141432  0
  - Module ip_vs is loaded.
  - check for iptables rule for keepalived multicast (224.0.0.18) ...
  - Generating and writing config to /etc/keepalived/keepalived.conf
  - Starting failover services ...
Starting Keepalived v1.3.5 (03/19,2017), git commit v1.3.5-6-g6fa32f2
Opening file '/etc/keepalived/keepalived.conf'.
Starting Healthcheck child process, pid=96
Initializing ipvs
Opening file '/etc/keepalived/keepalived.conf'.
Starting VRRP child process, pid=97
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/etc/keepalived/keepalived.conf'.
WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Unable to access script `</dev/tcp/172.17.0.4/80`
Disabling track script chk_ipf_har since not found
```

The key error is:

```
Unable to access script `</dev/tcp/172.17.0.4/80`
```

That kills the monitoring and disables the script. This was fixed in keepalived with https://github.com/acassen/keepalived/commit/5cd5fff78de11178c51ca245ff5de61a86b85049. The question is when the security checks were added and whether we can work out an alternative, or whether we should make a check script that does the same thing and takes the IP and port as arguments (if that works).

*** Bug 1517723 has been marked as a duplicate of this bug. ***

Verified this bug on v3.10.0-0.58.0. Steps:

1. Create two routers.
2. Create ipfailover pods: `oc adm ipfailover --create --replicas=2 -w 80 --virtual-ips=10.10.10.10-11`
3. Check the logs. When the two pods become running, no logs like "Unable to access script `</dev/tcp/172.17.0.4/80`" are found.
4. Stop one router pod by setting replicas=1.
5. Check that the VIP has been removed and switched to another node.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

Hi, would it be possible to implement this change in the next 3.9 release as well? Thank you!

Kind regards,
E.
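One of the options raised above is a standalone check script that performs the same TCP probe as the inline `</dev/tcp/...` redirection but takes the IP and port as arguments, so keepalived's script-security checks see an ordinary executable rather than shell redirection syntax. A minimal sketch of such a script (the function name `tcp_check` and the 2-second timeout are illustrative assumptions, not part of the actual fix):

```shell
#!/bin/bash
# Hypothetical standalone replacement for the inline `</dev/tcp/<ip>/<port>`
# check, suitable for pointing OPENSHIFT_HA_CHECK_SCRIPT at.
# Usage: tcp-check.sh <ip> <port>
# Exits 0 if a TCP connection to <ip>:<port> succeeds, non-zero otherwise.

tcp_check() {
    local ip="$1" port="$2"
    # bash treats /dev/tcp/<host>/<port> as a TCP connection; wrapping it
    # in `timeout` keeps the health check from hanging on a filtered port.
    timeout 2 bash -c "exec 3<>/dev/tcp/${ip}/${port}" 2>/dev/null
}

# Run the check only when invoked with both arguments.
if [[ $# -eq 2 ]]; then
    tcp_check "$@"
fi
```

Because keepalived invokes the script by path with plain arguments, this sidesteps the "Unable to access script" failure while preserving the original semantics of the `</dev/tcp/$router_addr/80` probe.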