Description of problem:
When trying to create an IP failover pod on an OpenStack IPI profile, the pod crashes. The logs say that keepalived_script does not exist.

OpenShift release version:
4.11.0-0.ci-2022-03-08-161954

Cluster Platform:
OpenStack IPI

How reproducible:

Steps to Reproduce (in detail):

    melvinjoseph@mjoseph-mac Downloads % oc create sa ipfailover
    serviceaccount/ipfailover created
    melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user priviledged -z ipfailover
    clusterrole.rbac.authorization.k8s.io/system:openshift:scc:priviledged added: "ipfailover"
    melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user hostnetwork -z ipfailover
    clusterrole.rbac.authorization.k8s.io/system:openshift:scc:hostnetwork added: "ipfailover"
    melvinjoseph@mjoseph-mac Downloads % oc create -f https://github.com/jechen0648/ipfailover/blob/main/deploy-ipfailover.yaml
    deployment.apps/ipfailover created
    melvinjoseph@mjoseph-mac Downloads % oc create -f https://github.com/jechen0648/ipfailover/blob/main/web-server-rc.yaml
    replicationcontroller/web-server-rc created
    melvinjoseph@mjoseph-mac Downloads % oc get all -owide
    NAME                              READY   STATUS             RESTARTS      AGE   IP              NODE                                 NOMINATED NODE   READINESS GATES
    pod/ipfailover-788b595477-5wp4g   0/1     CrashLoopBackOff   1 (9s ago)    14s   192.168.2.97    mjoseph-bug15-gg58b-worker-0-ntcs4   <none>           <none>
    pod/ipfailover-788b595477-8xccw   0/1     CrashLoopBackOff   1 (10s ago)   14s   192.168.3.147   mjoseph-bug15-gg58b-worker-0-fq8qc   <none>           <none>
    pod/web-server-rc-26v44           1/1     Running            0             10s   192.168.2.97    mjoseph-bug15-gg58b-worker-0-ntcs4   <none>           <none>
    pod/web-server-rc-w8v42           1/1     Running            0             10s   192.168.3.147   mjoseph-bug15-gg58b-worker-0-fq8qc   <none>           <none>

    NAME                                  DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES                                                                                                            SELECTOR
    replicationcontroller/web-server-rc   2         2         2       10s   nginx        quay.io/openshifttest/nginx-alpine@sha256:5d3f3372288b8a93fc9fc7747925df2328c24db41e4b4226126c3af293c5ad88   name=web-server-rc

    NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                            PORT(S)   AGE    SELECTOR
    service/kubernetes   ClusterIP      172.30.0.1   <none>                                 443/TCP   6h5m   <none>
    service/openshift    ExternalName   <none>       kubernetes.default.svc.cluster.local   <none>    6h1m   <none>

    NAME                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
    deployment.apps/ipfailover   0/2     2            0           15s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover

    NAME                                    DESIRED   CURRENT   READY   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
    replicaset.apps/ipfailover-788b595477   2         2         0       16s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover,pod-template-hash=788b595477

Actual results:
IP failover pod is crashing.

Expected results:
Pod should come up.

Impact of the problem:

Additional info:

    melvinjoseph@mjoseph-mac Downloads % oc logs ipfailover-788b595477-5wp4g
    - Loading ip_vs module ...
    - Checking if ip_vs module is available ...
    ip_vs 172032 0
    - Module ip_vs is loaded.
    - check for iptables rule for keepalived multicast (224.0.0.18) ...
    chroot: cannot change root directory to '/host': No such file or directory
    - adding iptables rule to INPUT to access 224.0.0.18.
    chroot: cannot change root directory to '/host': No such file or directory
    - Generating and writing config to /etc/keepalived/keepalived.conf
    - Starting failover services ...
    Wed Mar 9 08:24:52 2022: Starting Keepalived v2.1.5 (07/13,2020)
    Wed Mar 9 08:24:52 2022: Running on Linux 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 (built for Linux 4.18.0)
    Wed Mar 9 08:24:52 2022: Command line: '/usr/sbin/keepalived' '-D' '-n' '--log-console'
    Wed Mar 9 08:24:52 2022: Opening file '/etc/keepalived/keepalived.conf'.
    Wed Mar 9 08:24:52 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
    Wed Mar 9 08:24:52 2022: Starting VRRP child process, pid=69
    Wed Mar 9 08:24:52 2022: Registering Kernel netlink reflector
    Wed Mar 9 08:24:52 2022: Registering Kernel netlink command channel
    Wed Mar 9 08:24:52 2022: Opening file '/etc/keepalived/keepalived.conf'.
    Wed Mar 9 08:24:52 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
    Wed Mar 9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 21) WARNING - interface enp0s3 for vrrp_instance ipfailover_VIP_1 doesn't exist
    Wed Mar 9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 29) Truncating auth_pass to 8 characters
    Wed Mar 9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 39) WARNING - interface enp0s3 for ip address 172.31.248.200 doesn't exist
    Wed Mar 9 08:24:52 2022: Non-existent interface specified in configuration
    Wed Mar 9 08:24:52 2022: Stopped - used 0.002067 user time, 0.001038 system time
    Wed Mar 9 08:24:52 2022: pid 69 exited with permanent error CONFIG. Terminating
    Wed Mar 9 08:24:52 2022: CPU usage (self/children) user: 0.004887/0.003075 system: 0.004880/0.001067
    Wed Mar 9 08:24:52 2022: Stopped Keepalived v2.1.5 (07/13,2020)
    monitor.sh: OpenShift IP Failover service terminated.
Setting blocker- as this appears to be a configuration issue. The "default user 'keepalived_script' for script execution does not exist" message is only a warning. According to <https://www.keepalived.org/manpage.html>, keepalived defaults to the user "keepalived_script" if it exists and otherwise defaults to the user that keepalived is already running as. The reason why keepalived exits is probably because the "enp0s3" interface doesn't exist. Does the host have an "enp0s3" interface? If not, try setting OPENSHIFT_HA_NETWORK_INTERFACE in the ipfailover deployment to the correct interface name.
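To separate the fatal condition from the harmless keepalived_script warning, the log can be filtered for interface names that keepalived reports as missing. The following is a minimal sketch, assuming the warning wording stays as in the log quoted in this bug; `missing_interfaces` is a hypothetical helper name, not part of keepalived or the ipfailover image.

```shell
#!/bin/sh
# Hypothetical helper: read a keepalived log on stdin and print each
# interface name that appears in a "WARNING - interface X ... doesn't exist"
# line, once. Other warnings (such as the keepalived_script one) are ignored.
missing_interfaces() {
  sed -n "s/.*WARNING - interface \([^ ]*\) for .* doesn't exist.*/\1/p" | sort -u
}

# Demo with two lines copied from the log in this report.
printf '%s\n' \
  "Wed Mar 9 08:24:52 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create." \
  "Wed Mar 9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 21) WARNING - interface enp0s3 for vrrp_instance ipfailover_VIP_1 doesn't exist" \
  | missing_interfaces
# prints: enp0s3
```

In practice the input would come from `oc logs` on the crashing pod. Any name it prints is a candidate for correction through OPENSHIFT_HA_NETWORK_INTERFACE.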
    melvinjoseph@mjoseph-mac Downloads % oc set env deploy/ipfailover OPENSHIFT_HA_NETWORK_INTERFACE=ens3
    deployment.apps/ipfailover updated
    melvinjoseph@mjoseph-mac Downloads % oc get po
    NAME                          READY   STATUS    RESTARTS   AGE
    ipfailover-796d85684d-6dlb4   1/1     Running   0          20s
    ipfailover-796d85684d-7dtx2   1/1     Running   0          20s

When I changed the interface to ens3 in the ipfailover configuration, the pods came up. I have not hit this issue with this deploy script in the past, and I was verifying an ipfailover bug, so I thought the fix had broken the feature.
The naming difference is related to the underlying hardware. The naming system that systemd/udev uses to name network interfaces is defined at <https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/>:

> The following different naming schemes for network interfaces are now supported by udev natively:
>
> 1. Names incorporating Firmware/BIOS provided index numbers for on-board devices (example: eno1)
> 2. Names incorporating Firmware/BIOS provided PCI Express hotplug slot index numbers (example: ens1)
> 3. Names incorporating physical/geographical location of the connector of the hardware (example: enp2s0)
> 4. Names incorporating the interfaces's MAC address (example: enx78e7d1ea46da)
> 5. Classic, unpredictable kernel-native ethX naming (example: eth0)

So apparently previous clusters on which you have tested keepalived-ipfailover used scheme 3 (resulting in the name "enp0s3"), but for more recent tests, you have a cluster with PCIe hotplug slots, which are named using scheme 2 (resulting in the name "ens3").

The keepalived-ipfailover configuration script guesses a few names if the user doesn't configure one:

> VBOX_INTERFACES="enp0s3 enp0s8 eth1"

Source: <https://github.com/openshift/images/blob/86494446733fc171ee757e8166191e32d5931eb9/ipfailover/keepalived/lib/utils.sh#L6>.

> function get_network_device() {
>   for dev in $1 ${VBOX_INTERFACES}; do
>     if ip addr show dev "$dev" &> /dev/null; then
>       echo "$dev"
>       return
>     fi
>   done
>
>   ip route get 8.8.8.8 | awk '/dev/ { f=NR }; f && (NR-1 == f)' RS=" "
> }

Source: <https://github.com/openshift/images/blob/86494446733fc171ee757e8166191e32d5931eb9/ipfailover/keepalived/lib/utils.sh#L246-L255>.

We could add "ens3" to the end of VBOX_INTERFACES to autodetect this name as well. Is that desirable?
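For discussion, the proposal can be sketched as below. This is not the code in utils.sh: it is a simplified illustration under stated assumptions. `first_existing_device`, `device_from_route`, and the `SYSFS_NET` override are hypothetical names, and the existence check looks in /sys/class/net instead of calling `ip addr show` so that the lookup can be exercised without real devices. Only the candidate list (with "ens3" appended) and the awk expression are taken from the quoted source.

```shell
#!/bin/sh
# Candidate names from the quoted utils.sh, with the proposed "ens3" appended.
VBOX_INTERFACES="enp0s3 enp0s8 eth1 ens3"

# Print the first candidate that exists; $1 is an optional user-supplied name
# that is tried before the guesses. SYSFS_NET is a hypothetical override for
# the directory that lists network devices.
first_existing_device() {
  for dev in $1 ${VBOX_INTERFACES}; do
    if [ -e "${SYSFS_NET:-/sys/class/net}/${dev}" ]; then
      echo "$dev"
      return 0
    fi
  done
  return 1
}

# The fallback from the quoted function, reading `ip route get` output on
# stdin. With RS=" " each space-separated word is its own record, so the
# expression prints the word that immediately follows "dev".
device_from_route() {
  awk '/dev/ { f=NR }; f && (NR-1 == f)' RS=" "
}

# Demo with a made-up route line (the gateway address is an assumption).
echo '8.8.8.8 via 10.0.0.1 dev ens3 src 10.0.3.139 uid 0' | device_from_route
# prints: ens3
```

Because "ens3" sits at the end of the list, hosts that do have enp0s3, enp0s8, or eth1 would keep their current behavior. On the cluster in this report none of the original guesses exists, so the route-based fallback is what would have to supply the name today.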
    melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
    NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.10.0-0.ci.test-2022-03-12-022506-ci-ln-rttk652-latest   True        False         70m     Cluster version is 4.10.0-0.ci.test-2022-03-12-022506-ci-ln-rttk652-latest
    melvinjoseph@mjoseph-mac Downloads % oc create sa ipfailover
    serviceaccount/ipfailover created
    melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user priviledged -z ipfailover
    clusterrole.rbac.authorization.k8s.io/system:openshift:scc:priviledged added: "ipfailover"
    melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user hostnetwork -z ipfailover
    clusterrole.rbac.authorization.k8s.io/system:openshift:scc:hostnetwork added: "ipfailover"
    melvinjoseph@mjoseph-mac Downloads % oc create configmap keepalived-checkscript --from-file=mycheckscript.sh
    configmap/keepalived-checkscript created
    melvinjoseph@mjoseph-mac Downloads % oc create -f https://github.com/jechen0648/ipfailover/blob/main/deploy-ipfailover.yaml
    deployment.apps/ipfailover created
    melvinjoseph@mjoseph-mac Downloads % oc get all -owide
    NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                 NOMINATED NODE   READINESS GATES
    pod/ipfailover-788b595477-cd672   1/1     Running   0          18s   10.0.3.139   rttk652-b5564-vnk62-worker-0-t4v78   <none>           <none>
    pod/ipfailover-788b595477-xblc5   1/1     Running   0          18s   10.0.1.135   rttk652-b5564-vnk62-worker-0-m77tv   <none>           <none>

    NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                            PORT(S)   AGE   SELECTOR
    service/kubernetes   ClusterIP      172.30.0.1   <none>                                 443/TCP   93m   <none>
    service/openshift    ExternalName   <none>       kubernetes.default.svc.cluster.local   <none>    89m   <none>

    NAME                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
    deployment.apps/ipfailover   2/2     2            2           20s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover

    NAME                                    DESIRED   CURRENT   READY   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
    replicaset.apps/ipfailover-788b595477   2         2         2       19s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover,pod-template-hash=788b595477
    melvinjoseph@mjoseph-mac Downloads % oc logs ipfailover-788b595477-cd672
    - Loading ip_vs module ...
    - Checking if ip_vs module is available ...
    ip_vs 172032 0
    - Module ip_vs is loaded.
    - check for iptables rule for keepalived multicast (224.0.0.18) ...
    chroot: cannot change root directory to '/host': No such file or directory
    - adding iptables rule to INPUT to access 224.0.0.18.
    chroot: cannot change root directory to '/host': No such file or directory
    - Generating and writing config to /etc/keepalived/keepalived.conf
    - Starting failover services ...
    Sat Mar 12 04:25:56 2022: Starting Keepalived v2.1.5 (07/13,2020)
    Sat Mar 12 04:25:56 2022: Running on Linux 4.18.0-305.40.2.el8_4.x86_64 #1 SMP Tue Mar 8 14:29:54 EST 2022 (built for Linux 4.18.0)
    Sat Mar 12 04:25:56 2022: Command line: '/usr/sbin/keepalived' '-D' '-n' '--log-console'
    Sat Mar 12 04:25:56 2022: Opening file '/etc/keepalived/keepalived.conf'.
    Sat Mar 12 04:25:56 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
    Sat Mar 12 04:25:56 2022: Starting VRRP child process, pid=74
    Sat Mar 12 04:25:56 2022: Registering Kernel netlink reflector
    Sat Mar 12 04:25:56 2022: Registering Kernel netlink command channel
    Sat Mar 12 04:25:56 2022: Opening file '/etc/keepalived/keepalived.conf'.
    Sat Mar 12 04:25:56 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
    Sat Mar 12 04:25:56 2022: (/etc/keepalived/keepalived.conf: Line 29) Truncating auth_pass to 8 characters
    Sat Mar 12 04:25:56 2022: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
    Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) Warning - nopreempt will not work with initial state MASTER - clearing
    Sat Mar 12 04:25:56 2022: Assigned address 10.0.3.139 for interface ens3
    Sat Mar 12 04:25:56 2022: Assigned address fe80::f7c0:638d:a465:aba7 for interface ens3
    Sat Mar 12 04:25:56 2022: Registering gratuitous ARP shared channel
    Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) removing VIPs.
    Sat Mar 12 04:25:56 2022: VRRP sockpool: [ifindex( 2), family(IPv4), proto(112), fd(10,11)]
    Sat Mar 12 04:25:56 2022: Script `chk_ipfailover` now returning 1
    Sat Mar 12 04:25:56 2022: VRRP_Script(chk_ipfailover) failed (exited with status 1)
    Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) Entering FAULT STATE
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069