Bug 1909260 - Pods stuck in ContainerCreating on OVN in 4.7.0-0.nightly-2020-12-18-031435: timed out waiting for OVS flows
Summary: Pods stuck in ContainerCreating on OVN in 4.7.0-0.nightly-2020-12-18-031435: ...
Keywords:
Status: CLOSED DUPLICATE of bug 1865743
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-18 18:31 UTC by Mike Fiedler
Modified: 2021-01-04 22:36 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-04 22:36:32 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
journal from compute node with pods stuck in ContainerCreating (18.42 MB, application/gzip)
2021-01-04 21:44 UTC, Mike Fiedler

Description Mike Fiedler 2020-12-18 18:31:47 UTC
Description of problem:

Running a test that creates 500 pods/node gets stuck on the first batch of 50 pods on 4.7.0-0.nightly-2020-12-18-031435. All pods are stuck in ContainerCreating with the following event:

  Warning  FailedCreatePodSandBox  30s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pausepods0_clusterproject0_6f4f7e3e-5d97-44fe-857e-a11038cd53da_0(6826918aab70596f063b57f53ed1d3d03970f771af2fbdb24b91c12fcecedc25): [clusterproject0/pausepods0:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[clusterproject0/pausepods0 6826918aab70596f063b57f53ed1d3d03970f771af2fbdb24b91c12fcecedc25] [clusterproject0/pausepods0 6826918aab70596f063b57f53ed1d3d03970f771af2fbdb24b91c12fcecedc25] failed to configure pod interface: error while waiting on flows for pod: timed out waiting for OVS flows
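
One way to confirm on the node whether ovn-kubernetes ever programmed flows for a stuck pod is to look for its port on br-int. A minimal diagnostic sketch, assuming access to the worker via oc debug; the sandbox ID is the one from the event above, and the port name is the first 15 characters of that sandbox ID (the same convention visible in the ovs-vsctl errors quoted later in this bug):

# On the affected worker node (oc debug node/<node>; chroot /host)
SANDBOX=6826918aab70596f063b57f53ed1d3d03970f771af2fbdb24b91c12fcecedc25
PORT=${SANDBOX:0:15}

# Is the pod's port attached to br-int at all?
ovs-vsctl --timeout=10 list-ports br-int | grep "$PORT"

# If it is, count the flows that reference its ofport; zero flows here
# would match the "timed out waiting for OVS flows" error.
OFPORT=$(ovs-vsctl --timeout=10 get Interface "$PORT" ofport)
ovs-ofctl dump-flows br-int | grep -c "in_port=$OFPORT"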

Status of ovs-vswitchd and ovs-configuration on the nodes:

● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           └─10-ovs-vswitchd-restart.conf
   Active: active (running) since Fri 2020-12-18 17:10:14 UTC; 1h 16min ago
  Process: 1298 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVS_USER_OPT} start $OPTIONS (code=exited, status=0/SUCCESS)
  Process: 1296 ExecStartPre=/usr/bin/chmod 0775 /dev/hugepages (code=exited, status=0/SUCCESS)
  Process: 1294 ExecStartPre=/bin/sh -c /usr/bin/chown :$${OVS_USER_ID##*:} /dev/hugepages (code=exited, status=0/SUCCESS)
 Main PID: 1342 (ovs-vswitchd)
    Tasks: 11 (limit: 202241)
   Memory: 409.0M
      CPU: 4min 5.916s
   CGroup: /system.slice/ovs-vswitchd.service
           └─1342 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Dec 18 18:25:49 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3237339|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:25:55 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3241722|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:02 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3246103|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:08 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3250483|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:14 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3254864|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:21 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3259302|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:27 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3263682|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:34 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3268059|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:41 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3272438|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 18 18:26:47 ip-10-0-141-138 ovs-vswitchd[1342]: ovs|3276814|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)

ovs-configuration:

● ovs-configuration.service - Configures OVS with proper host networking configuration
   Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2020-12-18 17:10:15 UTC; 1h 17min ago
  Process: 11414 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=0/SUCCESS)
 Main PID: 11414 (code=exited, status=0/SUCCESS)
      CPU: 230ms

Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: + echo 'Driver name is'
Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: Driver name is
Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: + '[' '' = vmxnet3 ']'
Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: + echo 'Networking already configured and up for br-ex!'
Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: Networking already configured and up for br-ex!
Dec 18 17:10:14 ip-10-0-141-138 configure-ovs.sh[11414]: + ovs-vsctl --timeout=30 --if-exists del-br br0
Dec 18 17:10:15 ip-10-0-141-138 configure-ovs.sh[11414]: + exit 0
Dec 18 17:10:15 ip-10-0-141-138 systemd[1]: ovs-configuration.service: Succeeded.
Dec 18 17:10:15 ip-10-0-141-138 systemd[1]: Started Configures OVS with proper host networking configuration.
Dec 18 17:10:15 ip-10-0-141-138 systemd[1]: ovs-configuration.service: Consumed 230ms CPU time


Will attach must-gather and the gather_network_ovn_trace must-gather output.


Version-Release number of selected component (if applicable): 4.7.0-0.nightly-2020-12-18-031435


How reproducible: 1 cluster so far. Will try creating another.


Steps to Reproduce:
1. Install a cluster on AWS with 3 masters and m5.2xlarge nodes
2. Create a KubeletConfig to allow 500 pods/node and apply it to the machineconfigpool (see the sketch after this list)
3. Run cluster-loader to create 1 project with 1440 pause pods across the 3 worker nodes
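
For reference, a sketch of step 2. The label key/value custom-kubelet=max-pods is illustrative, not taken from the actual test; the KubeletConfig fields are the standard OpenShift ones:

# Label the worker pool so the selector below matches it (label name is illustrative)
oc label machineconfigpool worker custom-kubelet=max-pods

cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: max-pods
  kubeletConfig:
    maxPods: 500
EOF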

Actual results:

All 50 pods in the first batch are stuck in ContainerCreating with the event above.


Additional info:

Will attach must-gather and the gather_network_ovn_trace must-gather output.

Comment 3 Tim Rozet 2021-01-04 21:04:21 UTC
Looks like there is some issue talking to OVSDB on the worker nodes, and with OVS opening the interfaces:

79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01] [clusterproject0/pausepods255 1cf8c0d238a1cf79aa79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int 1cf8c0d238a1cf7 -- set interface 1cf8c0d238a1cf7 external_ids:attached_mac=0a:58:0a:80:02:6e external_ids:iface-id=clusterproject0_pausepods255 external_ids:ip_addresses=10.128.2.110/23 external_ids:sandbox=1cf8c0d238a1cf79aa79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01': signal: alarm clock


1218-15:04:39.226560 ip-10-0-161-203 crio[5487]: time="2020-12-18 15:04:39.226499505Z" level=error msg="Error adding network: [clusterproject0/pausepods805:ovn-kubernetes]: error adding container to network \"ovn-kubernetes\": CNI request failed with status 400: '[clusterproject0/pausepods805 2ae0252f9bfc11d4528278f4c533f3b78dfa89ecddf8a9d9e0951c32a7249903] [clusterproject0/pausepods805 2ae0252f9bfc11d4528278f4c533f3b78dfa89ecddf8a9d9e0951c32a7249903] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int 2ae0252f9bfc11d -- set interface 2ae0252f9bfc11d external_ids:attached_mac=0a:58:0a:83:01:39 external_ids:iface-id=clusterproject0_pausepods805 external_ids:ip_addresses=10.131.1.57/23 external_ids:sandbox=2ae0252f9bfc11d4528278f4c533f3b78dfa89ecddf8a9d9e0951c32a7249903': exit status 1\n  \"ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\\n\"\n  \"\"\n'"

OVS:
2020-12-18T18:13:43.691658109Z 2020-12-18T18:13:43.691Z|2637735|bridge|WARN|could not open network device c57403853a28213 (No such device)
2020-12-18T18:13:43.692853965Z 2020-12-18T18:13:43.692Z|2637736|bridge|WARN|could not open network device 0777a6727038243 (No such device)
2020-12-18T18:13:43.694306342Z 2020-12-18T18:13:43.694Z|2637737|bridge|WARN|could not open network device b95c1af4b2dcebc (No such device)
2020-12-18T18:13:43.695931434Z 2020-12-18T18:13:43.695Z|2637738|bridge|WARN|could not open network device 22ae777af509c32 (No such device)
2020-12-18T18:13:43.697175778Z 2020-12-18T18:13:43.697Z|2637739|bridge|WARN|could not open network device 8ae95bb1c10ee8d (No such device)
2020-12-18T18:13:43.698184648Z 2020-12-18T18:13:43.698Z|01685|ovs_rcu(urcu5)|WARN|blocked 1000 ms waiting for main to quiesce

Wondering if something is wrong with the folder mounts into ovnkube, or some permissions problem. Mike is going to try to reproduce it so we can get more information.
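
If it reproduces, a sketch of checks that could confirm or rule those two theories out, assuming the CNI-handling container is named ovnkube-node and the socket directory is /var/run/openvswitch (both are assumptions, not taken from the must-gather):

# On the affected node: is ovsdb-server itself healthy, and does a simple
# transaction against db.sock fail the same way the CNI's ovs-vsctl does?
systemctl status ovsdb-server ovs-vswitchd --no-pager
ovs-vsctl --timeout=5 show

# Socket ownership/permissions, numeric, as seen from the host
ls -ln /var/run/openvswitch/db.sock

# Which host paths are mounted into the ovnkube-node container, and what does
# the same socket look like from inside it? (exact JSON field names in the
# inspect output vary by crictl version)
CID=$(crictl ps -q --name ovnkube-node)
crictl inspect "$CID" | grep -i openvswitch
crictl exec "$CID" ls -ln /var/run/openvswitch/db.sock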

Comment 4 Mike Fiedler 2021-01-04 21:44:44 UTC
Created attachment 1744439 [details]
journal from compute node with pods stuck in ContainerCreating

Re-ran the scenario with Tim watching. Symptoms were a bit different:

- 1440 pods initially went Running
- Then the pods devolved into being stuck in ContainerCreating and Terminating
- Pods in ContainerCreating showed multiples of this event:

 Warning  FailedCreatePodSandBox  2m56s (x4 over 7m8s)  kubelet, ip-10-0-196-96.us-east-2.compute.internal  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_pausepods998_proj0_ed720bf9-046e-4e96-ad47-32801575cf26_0(ed48087c86bba874a4bb2dda7c9fc473c3808fae8ae265131bfdb0df697c7bf1): [proj0/pausepods998:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[proj0/pausepods998 ed48087c86bba874a4bb2dda7c9fc473c3808fae8ae265131bfdb0df697c7bf1] [proj0/pausepods998 ed48087c86bba874a4bb2dda7c9fc473c3808fae8ae265131bfdb0df697c7bf1] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int ed48087c86bba87 -- set interface ed48087c86bba87 external_ids:attached_mac=0a:58:0a:80:05:8f external_ids:iface-id=proj0_pausepods998 external_ids:ip_addresses=10.128.5.143/23 external_ids:sandbox=ed48087c86bba874a4bb2dda7c9fc473c3808fae8ae265131bfdb0df697c7bf1': exit status 1
  "ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (Protocol error)\n"

Tim also noted these errors:

2020-12-18T18:13:43.694306342Z 2020-12-18T18:13:43.694Z|2637737|bridge|WARN|could not open network device b95c1af4b2dcebc (No such device)
2020-12-18T18:13:43.695931434Z 2020-12-18T18:13:43.695Z|2637738|bridge|WARN|could not open network device 22ae777af509c32 (No such device)
2020-12-18T18:13:43.697175778Z 2020-12-18T18:13:43.697Z|2637739|bridge|WARN|could not open network device 8ae95bb1c10ee8d (No such device)

and

79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01] [clusterproject0/pausepods255 1cf8c0d238a1cf79aa79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01] failed to configure pod interface: failure in plugging pod interface: failed to run 'ovs-vsctl --timeout=30 add-port br-int 1cf8c0d238a1cf7 -- set interface 1cf8c0d238a1cf7 external_ids:attached_mac=0a:58:0a:80:02:6e external_ids:iface-id=clusterproject0_pausepods255 external_ids:ip_addresses=10.128.2.110/23 external_ids:sandbox=1cf8c0d238a1cf79aa79b532f5dd44d80cbd0eb8af8ef764cb5b27a82e57cb01': signal: alarm clock
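
One way to tie these errors back to the state of the host OVS daemons at the same timestamps is to pull both unit journals for that window. A sketch, assuming the standard openvswitch unit names on RHCOS; the time window below matches the messages quoted above:

journalctl -u ovsdb-server -u ovs-vswitchd \
    --since '2020-12-18 18:10:00' --until '2020-12-18 18:20:00' --no-pager \
    | grep -Ei 'error|warn|segfault|restart'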

Comment 5 Tim Rozet 2021-01-04 22:36:32 UTC
This looks like the same bug as 1865743. On the host:
sh-4.4# id -u openvswitch
800
sh-4.4# id -g openvswitch
800

sh-4.4# ls -lrt
total 8
srwxr-x---. 1 openvswitch openvswitch 0 Jan  4 21:10 ovs-vswitchd.1331.ctl
-rw-r--r--. 1 openvswitch hugetlbfs   7 Jan  4 21:16 ovsdb-server.pid
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 db.sock
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 ovsdb-server.189473.ctl
-rw-r--r--. 1 openvswitch hugetlbfs   7 Jan  4 21:17 ovs-vswitchd.pid
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 ovs-vswitchd.195529.ctl
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-int.snoop
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-int.mgmt
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-ex.mgmt
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-local.snoop
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-local.mgmt
srwxr-x---. 1 openvswitch hugetlbfs   0 Jan  4 21:17 br-ex.snoop

sh-4.4# ls -lrtn
total 8
srwxr-x---. 1 800 800 0 Jan  4 21:10 ovs-vswitchd.1331.ctl
-rw-r--r--. 1 800 801 7 Jan  4 21:16 ovsdb-server.pid
srwxr-x---. 1 800 801 0 Jan  4 21:17 db.sock
srwxr-x---. 1 800 801 0 Jan  4 21:17 ovsdb-server.189473.ctl
-rw-r--r--. 1 800 801 7 Jan  4 21:17 ovs-vswitchd.pid
srwxr-x---. 1 800 801 0 Jan  4 21:17 ovs-vswitchd.195529.ctl
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-int.snoop
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-int.mgmt
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-ex.mgmt
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-local.snoop
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-local.mgmt
srwxr-x---. 1 800 801 0 Jan  4 21:17 br-ex.snoop

My first instinct here is that the ovs-daemons container tried to start ovs-vswitchd and changed the ownership to itself, but that is not the case, because 801 does not match the ovs-daemons container's IDs:
sh-4.4# crictl exec -it 8de0773604506 /bin/bash
[root@ip-10-0-196-96 ~]# id
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:spc_t:s0
[root@ip-10-0-196-96 ~]# id -g openvswitch
995
[root@ip-10-0-196-96 ~]# id -u openvswitch
997

Eventually ovs-vswitchd crashes with a segfault, but there are no cores present on the host and no backtrace.
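
If it crashes again, a sketch of where a core, or at least the kernel's segfault record, could be recovered from, assuming systemd-coredump is the handler (check core_pattern first; if it points elsewhere, look there instead):

# Where do cores go on this node?
sysctl kernel.core_pattern

# If systemd-coredump handles them, ovs-vswitchd crashes are listed here
# and can be pulled with 'coredumpctl dump'
coredumpctl list ovs-vswitchd

# The segfault line itself lands in the kernel log even without a core
journalctl -k --no-pager | grep -i 'ovs-vswitchd.*segfault'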

Seeing as the files' group ownership is being changed to hugetlbfs, which is the same as in bug 1865743, I think this is the same issue. Marking it as a duplicate; if it turns out to be different, please reopen.

*** This bug has been marked as a duplicate of bug 1865743 ***

