Bug 1822945 - Egress Router pod is stuck in Init:CrashLoopBackOff
Summary: Egress Router pod is stuck in Init:CrashLoopBackOff
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Dan Winship
QA Contact: Weibin Liang
URL:
Whiteboard: SDN-CI-IMPACT,SDN-STALE
Depends On:
Blocks: 1855894
 
Reported: 2020-04-10 15:29 UTC by Weibin Liang
Modified: 2020-10-27 15:58 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The legacy iptables kernel modules were not loaded by default on RHCOS.
Consequence: Egress router pods could not be run in OCP 4.x.
Fix: A modification was made to the RHCOS image to allow containers to use legacy iptables binaries in their own network namespace.
Result: Egress router pods can be run in OCP 4.x.
Clone Of:
Cloned As: 1855894
Environment:
Last Closed: 2020-10-27 15:57:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Testing log (25.64 KB, text/plain)
2020-04-13 20:44 UTC, Weibin Liang


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1834 0 None closed Bug 1822945: templates: add a file to load legacy iptables kernel modules 2021-02-05 01:34:25 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:58:14 UTC

Description Weibin Liang 2020-04-10 15:29:56 UTC
Description of problem:
Followed https://docs.openshift.com/container-platform/3.11/admin_guide/managing_networking.html#admin-guide-deploying-an-egress-router-pod to deploy an egress router pod in v4.5, but the egress router pod is stuck in the Init:CrashLoopBackOff state.

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-04-08-134816

How reproducible:
Always

Steps to Reproduce:
[weliang@weliang FILE]$ cat egressrouterpod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: egress-redirect-pod
  labels:
    name: egress-redirect-pod
  annotations:
    pod.network.openshift.io/assign-macvlan: "true"
spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router 
    imagePullPolicy:  IfNotPresent
    securityContext:
      privileged: true
    env:
    - name: EGRESS_SOURCE
      value: 139.178.76.12
    - name: EGRESS_GATEWAY
      value: 139.178.76.1
    - name: EGRESS_DESTINATION 
      value:  172.217.7.206
    - name: EGRESS_ROUTER_MODE
      value: init
  containers:
  - name: egressrouter-redirect
    image: registry.redhat.io/openshift4/ose-egress-router
    imagePullPolicy:  IfNotPresent
[weliang@weliang FILE]$  oc create -f egressrouterpod.yaml
[weliang@weliang FILE]$ oc get pod
NAME                  READY   STATUS                  RESTARTS   AGE
egress-redirect-pod   0/1     Init:CrashLoopBackOff   6          7m41s
test-pod-1            1/1     Running                 0          18m
test-pod-2            1/1     Running                 0          18m
test-pod-3            1/1     Running                 0          18m

[weliang@weliang FILE]$ oc logs egress-redirect-pod
Error from server (BadRequest): container "egressrouter-redirect" in pod "egress-redirect-pod" is waiting to start: PodInitializing
[weliang@weliang FILE]$ oc describe pods egress-redirect-pod
Name:         egress-redirect-pod
Namespace:    test
Priority:     0
Node:         compute-0/139.178.76.11
Start Time:   Fri, 10 Apr 2020 10:52:19 -0400
Labels:       name=egress-redirect-pod
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.0.41"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "openshift-sdn",
                    "interface": "eth0",
                    "ips": [
                        "10.131.0.41"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: node-exporter
              pod.network.openshift.io/assign-macvlan: true
Status:       Pending
IP:           10.131.0.41
IPs:
  IP:  10.131.0.41
Init Containers:
  egress-router:
    Container ID:   cri-o://3934dcea39c81f2e0b70d7aca0909bbdbb911ee8a172a86e8fb2283bbff2c728
    Image:          registry.redhat.io/openshift4/ose-egress-router
    Image ID:       registry.redhat.io/openshift4/ose-egress-router@sha256:1e7abd047edcd20034f1abd4526240956564cb252f941de8806e40c59a953eb6
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    3
      Started:      Fri, 10 Apr 2020 10:53:56 -0400
      Finished:     Fri, 10 Apr 2020 10:53:56 -0400
    Ready:          False
    Restart Count:  4
    Environment:
      EGRESS_SOURCE:       139.178.76.12
      EGRESS_GATEWAY:      139.178.76.1
      EGRESS_DESTINATION:  172.217.7.206
      EGRESS_ROUTER_MODE:  init
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c9hpg (ro)
Containers:
  egressrouter-redirect:
    Container ID:   
    Image:          registry.redhat.io/openshift4/ose-egress-router
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-c9hpg (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  default-token-c9hpg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-c9hpg
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                 From                Message
  ----     ------          ----                ----                -------
  Normal   Scheduled       <unknown>           default-scheduler   Successfully assigned test/egress-redirect-pod to compute-0
  Normal   AddedInterface  113s                multus              Add eth0 [10.131.0.41/23]
  Normal   Pulling         113s                kubelet, compute-0  Pulling image "registry.redhat.io/openshift4/ose-egress-router"
  Normal   Pulled          108s                kubelet, compute-0  Successfully pulled image "registry.redhat.io/openshift4/ose-egress-router"
  Normal   Created         18s (x5 over 108s)  kubelet, compute-0  Created container egress-router
  Normal   Started         18s (x5 over 108s)  kubelet, compute-0  Started container egress-router
  Normal   Pulled          18s (x4 over 107s)  kubelet, compute-0  Container image "registry.redhat.io/openshift4/ose-egress-router" already present on machine
  Warning  BackOff         18s (x9 over 106s)  kubelet, compute-0  Back-off restarting failed container
[weliang@weliang FILE]$ 

Actual results:
egress-redirect-pod   0/1     Init:CrashLoopBackOff

Expected results:
egress-redirect-pod   1/1     Running

Additional info:

Comment 1 Dan Winship 2020-04-13 16:09:53 UTC
> [weliang@weliang FILE]$ oc logs egress-redirect-pod
> Error from server (BadRequest): container "egressrouter-redirect" in pod "egress-redirect-pod" is waiting to start: PodInitializing

hm... does it work if you do "oc logs -c egress-router egress-redirect-pod" ? Or failing that, try modifying the egress-router initContainer's definition to include "terminationMessagePolicy: FallbackToLogsOnError" so that the logs will be captured into the pod status.
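For reference, the change suggested in the comment would look roughly like this in the pod spec from the description (only the `terminationMessagePolicy` line is new; the rest is copied from the reproducer):

```yaml
spec:
  initContainers:
  - name: egress-router
    image: registry.redhat.io/openshift4/ose-egress-router
    imagePullPolicy: IfNotPresent
    # Capture the container's log output into the pod status if it
    # exits with an error, so "oc describe pod" shows the failure reason
    terminationMessagePolicy: FallbackToLogsOnError
    securityContext:
      privileged: true
```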

Comment 2 Weibin Liang 2020-04-13 20:44:53 UTC
Created attachment 1678542 [details]
Testing log

Comment 3 Dan Winship 2020-04-13 22:55:34 UTC
> [weliang@weliang FILE]$ oc logs -c egress-router egress-redirect-pod
> iptables v1.4.21: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)

Ah. We are not loading the iptables legacy kernel modules by default, so pods that try to use legacy iptables in their own network namespace will fail.

I think we had agreed that we wanted this to work, right? So we should fix RHCOS (or something) to ensure that the legacy iptables modules are loaded no matter what?
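As a diagnostic/workaround sketch (not part of the eventual fix), one could check from a debug shell on the affected node whether the legacy iptables modules are present, and load them manually:

```shell
# oc debug node/compute-0, then chroot /host, then:

# Check whether the legacy iptables kernel modules are loaded;
# no output here means the "nat" table cannot be initialized
lsmod | grep -E '^(ip_tables|iptable_nat|iptable_filter)'

# Load them manually; after this the egress-router init container
# should be able to create its NAT rules in its own netns
modprobe -a ip_tables iptable_filter iptable_nat
```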

Comment 4 Casey Callendrello 2020-04-14 11:36:19 UTC
Yes, that was what we decided: privileged pods (et al.) can use legacy iptables in their network namespace. Among other things, istio makes use of this.

Of course, we have no way of enforcing that they don't insert rules in to the root network namespace. Perhaps we should write an alert for that.

Comment 5 zhaozhanqi 2020-04-16 03:51:15 UTC
This blocks the egress router feature on all 4.x versions (4.1/4.2/4.3/4.4) as well.

Comment 6 Dan Winship 2020-05-07 15:24:40 UTC
MCO folks say it would make sense to add a file to the default template for this (e.g., something in /etc/modules-load.d/ to get systemd to load the modules). It will need some testing though (e.g., to confirm that it doesn't break the logic of containers using https://github.com/kubernetes-sigs/iptables-wrappers).
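A minimal sketch of what such a file could contain (the filename is hypothetical; the actual file landed via the machine-config-operator PR linked above). systemd-modules-load reads one module name per line from files in /etc/modules-load.d/ at boot:

```
# /etc/modules-load.d/iptables.conf (hypothetical name)
# Load the legacy iptables kernel modules so containers can use
# legacy iptables binaries in their own network namespace
ip_tables
iptable_filter
iptable_nat
```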

Comment 7 Ben Bennett 2020-05-27 13:25:19 UTC
Dan: Was that something that the MCO team was going to do, or that they wanted the CNO to do?  It seems a little weird for the CNO to do it if we expect the platform to be able to use iptables.

Comment 8 Dan Winship 2020-05-27 13:50:11 UTC
Network team was going to submit a patch to MCO (not CNO), after testing that it doesn't break various scenarios (comment 6)

Comment 9 Ben Bennett 2020-05-27 13:53:27 UTC
Pushing to 4.6 since this has been in every 4.y release.

Comment 12 Weibin Liang 2020-06-30 18:49:56 UTC
Tested and verified in 4.6.0-0.nightly-2020-06-30-112422

oc logs -c egress-router egress-redirect-pod no longer shows the following error:
iptables v1.4.21: can't initialize iptables table `nat': Table does not exist (do you need to insmod?)

Comment 13 Patrick Strick 2020-07-06 14:36:17 UTC
Ben, can this be backported to 4.4, please? We have at least one OSD customer who has hit this and we do not have a workaround for them.

Comment 16 errata-xmlrpc 2020-10-27 15:57:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

