Bug 2062126 - IPfailover pod is crashing during creation showing keepalived_script doesn't exist
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Assignee: Miciah Dashiel Butler Masters
QA Contact: Melvin Joseph
Reported: 2022-03-09 08:39 UTC by Melvin Joseph
Modified: 2022-08-10 10:52 UTC (History)
Doc Type: Bug Fix
Cause: If the user does not specify a network interface name, keepalived-ipfailover tries a few default names. This list of default names is hard-coded and previously included only the following values: "enp0s3", "enp0s8", and "eth1". If the host uses predictable network interface names as assigned by systemd/udev and the host has a network interface in a PCI Express hotplug slot, then the network interface is given a name of the form "ens3", in which case none of these default names matches it.
Consequence: keepalived-ipfailover failed to start if the user didn't specify a network interface name and the network interface was in a PCI Express hotplug slot.
Fix: The list of default names in keepalived-ipfailover was augmented by adding "ens3" to the end of the list.
Result: keepalived-ipfailover now checks for an "ens3" network interface, increasing the likelihood that keepalived-ipfailover finds a network interface in a PCI Express hotplug slot when the user does not specify which network interface name to use. Because this fix only changes the defaulting logic and adds to the end of the list of default names, it does not affect the behavior when the user specifies a name or when one of the other default names matches a network interface.
Last Closed: 2022-08-10 10:52:44 UTC
System ID Private Priority Status Summary Last Updated
Github openshift images pull 111 0 None Merged Bug 2062126: ipfailover: Autodetect the "ens3" NIC 2022-08-08 12:00:03 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:52:54 UTC

Description Melvin Joseph 2022-03-09 08:39:56 UTC
Description of problem:
When trying to create an IP failover pod in the OpenStack IPI profile, the pod crashes. The logs say that keepalived_script does not exist.

OpenShift release version:

Cluster Platform:
OpenStack IPI

How reproducible:
Steps to Reproduce (in detail):
melvinjoseph@mjoseph-mac Downloads % oc create sa ipfailover
serviceaccount/ipfailover created
melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user priviledged -z ipfailover
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:priviledged added: "ipfailover"
melvinjoseph@mjoseph-mac Downloads %  oc adm policy add-scc-to-user hostnetwork -z ipfailover
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:hostnetwork added: "ipfailover"
melvinjoseph@mjoseph-mac Downloads % oc create -f   https://github.com/jechen0648/ipfailover/blob/main/deploy-ipfailover.yaml
deployment.apps/ipfailover created
melvinjoseph@mjoseph-mac Downloads % oc create -f  https://github.com/jechen0648/ipfailover/blob/main/web-server-rc.yaml 
replicationcontroller/web-server-rc created
melvinjoseph@mjoseph-mac Downloads % 
melvinjoseph@mjoseph-mac Downloads % 
melvinjoseph@mjoseph-mac Downloads %  oc get all -owide
NAME                              READY   STATUS             RESTARTS      AGE   IP              NODE                                 NOMINATED NODE   READINESS GATES
pod/ipfailover-788b595477-5wp4g   0/1     CrashLoopBackOff   1 (9s ago)    14s    mjoseph-bug15-gg58b-worker-0-ntcs4   <none>           <none>
pod/ipfailover-788b595477-8xccw   0/1     CrashLoopBackOff   1 (10s ago)   14s   mjoseph-bug15-gg58b-worker-0-fq8qc   <none>           <none>
pod/web-server-rc-26v44           1/1     Running            0             10s    mjoseph-bug15-gg58b-worker-0-ntcs4   <none>           <none>
pod/web-server-rc-w8v42           1/1     Running            0             10s   mjoseph-bug15-gg58b-worker-0-fq8qc   <none>           <none>

NAME                                  DESIRED   CURRENT   READY   AGE   CONTAINERS   IMAGES                                                                                                       SELECTOR
replicationcontroller/web-server-rc   2         2         2       10s   nginx        quay.io/openshifttest/nginx-alpine@sha256:5d3f3372288b8a93fc9fc7747925df2328c24db41e4b4226126c3af293c5ad88   name=web-server-rc

NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                            PORT(S)   AGE    SELECTOR
service/kubernetes   ClusterIP   <none>                                 443/TCP   6h5m   <none>
service/openshift    ExternalName   <none>       kubernetes.default.svc.cluster.local   <none>    6h1m   <none>

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
deployment.apps/ipfailover   0/2     2            0           15s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover

NAME                                    DESIRED   CURRENT   READY   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
replicaset.apps/ipfailover-788b595477   2         2         0       16s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover,pod-template-hash=788b595477

Actual results:
IP failover pod is crashing.

Expected results:
Pod should come up

Impact of the problem:

Additional info:
melvinjoseph@mjoseph-mac Downloads % oc logs ipfailover-788b595477-5wp4g
  - Loading ip_vs module ...
  - Checking if ip_vs module is available ...
ip_vs                 172032  0
  - Module ip_vs is loaded.
  - check for iptables rule for keepalived multicast ( ...
chroot: cannot change root directory to '/host': No such file or directory
  - adding iptables rule to INPUT to access
chroot: cannot change root directory to '/host': No such file or directory
  - Generating and writing config to /etc/keepalived/keepalived.conf
  - Starting failover services ...
Wed Mar  9 08:24:52 2022: Starting Keepalived v2.1.5 (07/13,2020)
Wed Mar  9 08:24:52 2022: Running on Linux 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Mon Jan 17 07:06:06 EST 2022 (built for Linux 4.18.0)
Wed Mar  9 08:24:52 2022: Command line: '/usr/sbin/keepalived' '-D' '-n' '--log-console'
Wed Mar  9 08:24:52 2022: Opening file '/etc/keepalived/keepalived.conf'.
Wed Mar  9 08:24:52 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Wed Mar  9 08:24:52 2022: Starting VRRP child process, pid=69
Wed Mar  9 08:24:52 2022: Registering Kernel netlink reflector
Wed Mar  9 08:24:52 2022: Registering Kernel netlink command channel
Wed Mar  9 08:24:52 2022: Opening file '/etc/keepalived/keepalived.conf'.
Wed Mar  9 08:24:52 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Wed Mar  9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 21) WARNING - interface enp0s3 for vrrp_instance ipfailover_VIP_1 doesn't exist
Wed Mar  9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 29) Truncating auth_pass to 8 characters
Wed Mar  9 08:24:52 2022: (/etc/keepalived/keepalived.conf: Line 39) WARNING - interface enp0s3 for ip address doesn't exist
Wed Mar  9 08:24:52 2022: Non-existent interface specified in configuration
Wed Mar  9 08:24:52 2022: Stopped - used 0.002067 user time, 0.001038 system time
Wed Mar  9 08:24:52 2022: pid 69 exited with permanent error CONFIG. Terminating
Wed Mar  9 08:24:52 2022: CPU usage (self/children) user: 0.004887/0.003075 system: 0.004880/0.001067
Wed Mar  9 08:24:52 2022: Stopped Keepalived v2.1.5 (07/13,2020)
monitor.sh: OpenShift IP Failover service terminated.


Comment 1 Miciah Dashiel Butler Masters 2022-03-09 20:08:34 UTC
Setting blocker- as this appears to be a configuration issue.  

The "default user 'keepalived_script' for script execution does not exist" message is only a warning.  According to <https://www.keepalived.org/manpage.html>, keepalived defaults to the user "keepalived_script" if it exists and otherwise defaults to the user that keepalived is already running as.  Keepalived probably exits because the "enp0s3" interface doesn't exist.  Does the host have an "enp0s3" interface?  If not, try setting OPENSHIFT_HA_NETWORK_INTERFACE in the ipfailover deployment to the correct interface name.
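
As a rough sketch of how to find the correct name, the helper below extracts interface names from `ip -o link show` output. The function name `list_ifnames` is hypothetical and not part of keepalived-ipfailover, and the parsing assumes the usual one-line format of the form `2: ens3: <FLAGS> ...`.

```shell
# list_ifnames (hypothetical helper): read `ip -o link show` output on stdin
# and print one interface name per line.
list_ifnames() {
  # Fields are separated by ": "; the second field is the interface name.
  # Virtual devices can appear as "name@peer", so drop everything from "@".
  awk -F': ' '{ sub(/@.*/, "", $2); print $2 }'
}

# Possible usage on the node that runs the ipfailover pod:
#   ip -o link show | list_ifnames
# Then set the variable on the deployment, replacing <name> with the result:
#   oc set env deploy/ipfailover OPENSHIFT_HA_NETWORK_INTERFACE=<name>
```

If "enp0s3" is not in the resulting list, that would be consistent with the "Non-existent interface specified in configuration" error in the log.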

Comment 2 Melvin Joseph 2022-03-10 12:33:24 UTC
melvinjoseph@mjoseph-mac Downloads % oc set env deploy/ipfailover OPENSHIFT_HA_NETWORK_INTERFACE=ens3
deployment.apps/ipfailover updated
melvinjoseph@mjoseph-mac Downloads % oc get po                                                       
NAME                          READY   STATUS    RESTARTS   AGE
ipfailover-796d85684d-6dlb4   1/1     Running   0          20s
ipfailover-796d85684d-7dtx2   1/1     Running   0          20s

When I changed the interface to ens3 in the ipfailover configuration, the pods came up.
I had not seen this issue with this deploy script in the past, and because I was verifying an ipfailover bug, I thought the fix had broken the feature.

Comment 3 Miciah Dashiel Butler Masters 2022-03-10 22:44:52 UTC
The naming difference is related to the underlying hardware.  The naming system that systemd/udev uses to name network interfaces is defined at <https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/>:

> The following different naming schemes for network interfaces are now supported by udev natively:
> 1. Names incorporating Firmware/BIOS provided index numbers for on-board devices (example: eno1)
> 2. Names incorporating Firmware/BIOS provided PCI Express hotplug slot index numbers (example: ens1)
> 3. Names incorporating physical/geographical location of the connector of the hardware (example: enp2s0)
> 4. Names incorporating the interfaces's MAC address (example: enx78e7d1ea46da)
> 5. Classic, unpredictable kernel-native ethX naming (example: eth0)

So apparently previous clusters on which you have tested keepalived-ipfailover used scheme 3 (resulting in the name "enp0s3"), but for more recent tests, you have a cluster with PCIe hotplug slots, which are named using scheme 2 (resulting in the name "ens3").  
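
To illustrate the difference between the schemes, here is a minimal sketch that guesses the scheme from the shape of the name. The function `classify_ifname` and its labels are made up for this example, and the patterns are a simplification: real udev names have more variants than these prefixes cover.

```shell
# classify_ifname (hypothetical helper): guess which naming scheme produced
# an interface name, based only on the examples quoted above.
classify_ifname() {
  case "$1" in
    eno[0-9]*)        echo "onboard" ;;       # scheme 1, e.g. eno1
    ens[0-9]*)        echo "hotplug-slot" ;;  # scheme 2, e.g. ens1, ens3
    enp[0-9]*s[0-9]*) echo "pci-location" ;;  # scheme 3, e.g. enp2s0, enp0s3
    enx*)             echo "mac" ;;           # scheme 4, e.g. enx78e7d1ea46da
    eth[0-9]*)        echo "kernel" ;;        # scheme 5, e.g. eth0
    *)                echo "unknown" ;;
  esac
}

classify_ifname enp0s3
classify_ifname ens3
```

The two calls print "pci-location" and "hotplug-slot" respectively, which matches the difference between the earlier clusters and the current one.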

The keepalived-ipfailover configuration script guesses a few names if the user doesn't configure one:

> VBOX_INTERFACES="enp0s3 enp0s8 eth1"

Source: <https://github.com/openshift/images/blob/86494446733fc171ee757e8166191e32d5931eb9/ipfailover/keepalived/lib/utils.sh#L6>.

> function get_network_device() {
>   for dev in $1 ${VBOX_INTERFACES}; do
>     if ip addr show dev "$dev" &> /dev/null; then
>       echo "$dev"
>       return
>     fi
>   done
>   ip route get | awk '/dev/ { f=NR }; f && (NR-1 == f)' RS=" "
> }

Source: <https://github.com/openshift/images/blob/86494446733fc171ee757e8166191e32d5931eb9/ipfailover/keepalived/lib/utils.sh#L246-L255>.

We could add "ens3" to the end of VBOX_INTERFACES to autodetect this name as well.  Is that desirable?
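
A minimal sketch of what that change might look like, not the actual patch. The `dev_exists` wrapper is a hypothetical addition so the probe can be swapped out when experimenting, and the routing-table fallback from the original function is left out here.

```shell
# Default names to try when the user does not configure an interface.
# "ens3" is appended at the end so the existing defaults keep their priority.
VBOX_INTERFACES="enp0s3 enp0s8 eth1 ens3"

# dev_exists (hypothetical wrapper): succeeds if the named device exists.
dev_exists() {
  ip addr show dev "$1" > /dev/null 2>&1
}

# Print the first existing device, trying the user-supplied name ($1) first.
get_network_device() {
  for dev in $1 ${VBOX_INTERFACES}; do
    if dev_exists "$dev"; then
      echo "$dev"
      return
    fi
  done
  # The original function falls back to the routing table at this point;
  # this sketch just reports failure instead.
  return 1
}
```

Because the user-supplied name is still tried first and "ens3" is tried last, a host that already matches a configured name or one of the older defaults should get the same result as before.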

Comment 4 Melvin Joseph 2022-03-12 04:33:55 UTC
melvinjoseph@mjoseph-mac Downloads % oc get clusterversion
NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.ci.test-2022-03-12-022506-ci-ln-rttk652-latest   True        False         70m     Cluster version is 4.10.0-0.ci.test-2022-03-12-022506-ci-ln-rttk652-latest
melvinjoseph@mjoseph-mac Downloads % oc create sa ipfailover
serviceaccount/ipfailover created
melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user priviledged -z ipfailover
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:priviledged added: "ipfailover"
melvinjoseph@mjoseph-mac Downloads % oc adm policy add-scc-to-user hostnetwork -z ipfailover
clusterrole.rbac.authorization.k8s.io/system:openshift:scc:hostnetwork added: "ipfailover"
melvinjoseph@mjoseph-mac Downloads % oc create configmap keepalived-checkscript --from-file=mycheckscript.sh
configmap/keepalived-checkscript created
melvinjoseph@mjoseph-mac Downloads % oc create -f  https://github.com/jechen0648/ipfailover/blob/main/deploy-ipfailover.yaml
deployment.apps/ipfailover created
melvinjoseph@mjoseph-mac Downloads %  oc get all -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE                                 NOMINATED NODE   READINESS GATES
pod/ipfailover-788b595477-cd672   1/1     Running   0          18s   rttk652-b5564-vnk62-worker-0-t4v78   <none>           <none>
pod/ipfailover-788b595477-xblc5   1/1     Running   0          18s   rttk652-b5564-vnk62-worker-0-m77tv   <none>           <none>

NAME                 TYPE           CLUSTER-IP   EXTERNAL-IP                            PORT(S)   AGE   SELECTOR
service/kubernetes   ClusterIP   <none>                                 443/TCP   93m   <none>
service/openshift    ExternalName   <none>       kubernetes.default.svc.cluster.local   <none>    89m   <none>

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
deployment.apps/ipfailover   2/2     2            2           20s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover

NAME                                    DESIRED   CURRENT   READY   AGE   CONTAINERS              IMAGES                                                                                                                   SELECTOR
replicaset.apps/ipfailover-788b595477   2         2         2       19s   ipfailover-keepalived   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:644bf2d63cc24035ec82a39e0b14e6d61e3ca4ba39181b409590132f59bfc2cf   ipfailover=ipfailover,pod-template-hash=788b595477
melvinjoseph@mjoseph-mac Downloads % oc logs ipfailover-788b595477-cd672
  - Loading ip_vs module ...
  - Checking if ip_vs module is available ...
ip_vs                 172032  0
  - Module ip_vs is loaded.
  - check for iptables rule for keepalived multicast ( ...
chroot: cannot change root directory to '/host': No such file or directory
  - adding iptables rule to INPUT to access
chroot: cannot change root directory to '/host': No such file or directory
  - Generating and writing config to /etc/keepalived/keepalived.conf
  - Starting failover services ...
Sat Mar 12 04:25:56 2022: Starting Keepalived v2.1.5 (07/13,2020)
Sat Mar 12 04:25:56 2022: Running on Linux 4.18.0-305.40.2.el8_4.x86_64 #1 SMP Tue Mar 8 14:29:54 EST 2022 (built for Linux 4.18.0)
Sat Mar 12 04:25:56 2022: Command line: '/usr/sbin/keepalived' '-D' '-n' '--log-console'
Sat Mar 12 04:25:56 2022: Opening file '/etc/keepalived/keepalived.conf'.
Sat Mar 12 04:25:56 2022: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Sat Mar 12 04:25:56 2022: Starting VRRP child process, pid=74
Sat Mar 12 04:25:56 2022: Registering Kernel netlink reflector
Sat Mar 12 04:25:56 2022: Registering Kernel netlink command channel
Sat Mar 12 04:25:56 2022: Opening file '/etc/keepalived/keepalived.conf'.
Sat Mar 12 04:25:56 2022: WARNING - default user 'keepalived_script' for script execution does not exist - please create.
Sat Mar 12 04:25:56 2022: (/etc/keepalived/keepalived.conf: Line 29) Truncating auth_pass to 8 characters
Sat Mar 12 04:25:56 2022: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) Warning - nopreempt will not work with initial state MASTER - clearing
Sat Mar 12 04:25:56 2022: Assigned address for interface ens3
Sat Mar 12 04:25:56 2022: Assigned address fe80::f7c0:638d:a465:aba7 for interface ens3
Sat Mar 12 04:25:56 2022: Registering gratuitous ARP shared channel
Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) removing VIPs.
Sat Mar 12 04:25:56 2022: VRRP sockpool: [ifindex(  2), family(IPv4), proto(112), fd(10,11)]
Sat Mar 12 04:25:56 2022: Script `chk_ipfailover` now returning 1
Sat Mar 12 04:25:56 2022: VRRP_Script(chk_ipfailover) failed (exited with status 1)
Sat Mar 12 04:25:56 2022: (ipfailover_VIP_1) Entering FAULT STATE

Comment 8 errata-xmlrpc 2022-08-10 10:52:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.


