Bug 1872098 - [OVN] Multiple pods stuck in ContainerCreating
Summary: [OVN] Multiple pods stuck in ContainerCreating
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Alexander Constantinescu
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-25 04:10 UTC by huirwang
Modified: 2020-09-14 14:26 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-14 14:26:17 UTC
Target Upstream Version:
Embargoed:



Description huirwang 2020-08-25 04:10:04 UTC
Description of Problem:
Multiple pods are stuck in ContainerCreating. It happened on a cluster with multiple block devices and etcd encryption on.

Test matrix: IPI_AWS_Connected_No Proxy_RHCOS 4.6_Normal Kernel_Disk Encryption off_FIPS off_OVN_IPv4_Etcd Encryption on_Multiple Block Devices for Machinesets Resources

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-23-185640 


How reproducible:
Always

Steps to Reproduce:
1. Launch a cluster with multiple block devices configured for the machinesets
2. Create a project and pods in it (a sketch of the commands follows below)
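A minimal sketch of step 2 (the image and resource names here are placeholders, not necessarily the exact ones used in the run below):

# create a test project and a simple pod in it
oc new-project test
oc run hello-pod --image=quay.io/openshifttest/hello-openshift --restart=Never
# watch whether the pod ever leaves ContainerCreating
oc get pods -n test -w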

Actual Result:
oc get pods -n test
NAME            READY   STATUS              RESTARTS   AGE
hello-pod       0/1     ContainerCreating   0          6h40m
test-rc-dpjzs   0/1     ContainerCreating   0          86s
test-rc-fgxr2   0/1     ContainerCreating   0          28s

 Warning  FailedCreatePodSandBox  21s  kubelet, ip-10-0-174-60.us-east-2.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-dpjzs_test_1531237d-a3d4-49cd-907e-631e8ed6edb1_0(b9aec112e21ba389a686fcc511708fa52e8c2154fd1e5c59bb3ed4f54eee2acb): [test/test-rc-dpjzs:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[test/test-rc-dpjzs] failed to configure pod interface: timed out waiting for pod flows for pod: test-rc-dpjzs, error: timed out waiting for the condition
 
Recreating the pods does not help; new pods hit the same error.
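The error above means the ovnkube CNI gave up waiting for the pod's OpenFlow flows to appear on br-int. A rough way to check the node side (a sketch only; it assumes ovs-ofctl is available inside the ovn-controller container, and <ovnkube-node-pod>/<pod-ip> are placeholders):

# find the ovnkube-node pod running on the affected node
oc get pods -n openshift-ovn-kubernetes -o wide | grep ip-10-0-174-60
# dump the flows on br-int and look for the stuck pod's IP (the dump can be large)
oc exec -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller -- \
  ovs-ofctl -O OpenFlow13 dump-flows br-int | grep <pod-ip>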
 



oc get pods --all-namespaces -o wide | egrep Contai       
openshift-authentication                           oauth-openshift-69dfc87d5c-lhzml                                      0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-cloud-credential-operator                pod-identity-webhook-7f99757f4c-7q4qw                                 0/1     ContainerCreating   0          9h      <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-cluster-storage-operator                 csi-snapshot-controller-7c5944b484-rjbcs                              0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-cluster-storage-operator                 csi-snapshot-controller-operator-8694cdd5f5-xrl8w                     0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-console-operator                         console-operator-7879c8b586-vjssd                                     0/1     ContainerCreating   0          9h      <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-console                                  console-6c7fb9bc64-m22dq                                              0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-console                                  downloads-fdfd8f85c-hzqcg                                             0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-etcd                                     revision-pruner-3-ip-10-0-147-80.us-east-2.compute.internal           0/1     ContainerCreating   0          9h      <none>         ip-10-0-147-80.us-east-2.compute.internal    <none>           <none>
openshift-image-registry                           cluster-image-registry-operator-8ff5f5b56-p8677                       0/1     ContainerCreating   0          9h      <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-image-registry                           image-pruner-1598313600-ppmmf                                         0/1     ContainerCreating   0          4h8m    <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-image-registry                           image-registry-75c79757f6-n46d6                                       0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-ingress                                  router-default-69f6895456-8g259                                       0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-insights                                 insights-operator-668bb94fb8-9g7bz                                    0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-kube-apiserver                           installer-7-ip-10-0-167-104.us-east-2.compute.internal                0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-kube-apiserver                           revision-pruner-6-ip-10-0-147-80.us-east-2.compute.internal           0/1     ContainerCreating   0          9h      <none>         ip-10-0-147-80.us-east-2.compute.internal    <none>           <none>
openshift-kube-controller-manager-operator         kube-controller-manager-operator-86d8f47-p5gmf                        0/1     ContainerCreating   0          9h      <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-kube-controller-manager                  revision-pruner-8-ip-10-0-167-104.us-east-2.compute.internal          0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-kube-scheduler-operator                  openshift-kube-scheduler-operator-69bd6d7c8-s99zq                     0/1     ContainerCreating   0          9h      <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-kube-storage-version-migrator            migrator-8465f64b7f-r8rvq                                             0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-machine-api                              machine-api-controllers-74b577f656-nhdd9                              0/7     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-marketplace                              certified-operators-rrqpp                                             0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              certified-operators-z2gfq                                             0/1     ContainerCreating   0          3m25s   <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              community-operators-4vnpp                                             0/1     ContainerCreating   0          3m9s    <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              community-operators-9f4qx                                             0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              marketplace-operator-b67687ddb-6kspv                                  0/1     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-marketplace                              qe-app-registry-bscpp                                                 0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              qe-app-registry-x22cq                                                 0/1     ContainerCreating   0          14m     <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              redhat-marketplace-dbhvn                                              0/1     ContainerCreating   0          8h      <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              redhat-marketplace-hgb6z                                              0/1     ContainerCreating   0          7m50s   <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              redhat-operators-4p9lq                                                0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-marketplace                              redhat-operators-tvsgx                                                0/1     ContainerCreating   0          2s      <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               alertmanager-main-1                                                   0/5     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               openshift-state-metrics-5b6f99854d-n9lgz                              0/3     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               prometheus-adapter-5984b99488-nbp2d                                   0/1     ContainerCreating   0          3h50m   <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               prometheus-adapter-b5764f4b8-97v68                                    0/1     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               prometheus-k8s-1                                                      0/7     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-monitoring                               prometheus-operator-7857d6cf98-k5kqq                                  0/2     ContainerCreating   0          9h      <none>         ip-10-0-167-104.us-east-2.compute.internal   <none>           <none>
openshift-monitoring                               thanos-querier-7bc7c4c6cf-wxb72                                       0/5     ContainerCreating   0          9h      <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
openshift-operator-lifecycle-manager               packageserver-595494f96c-gv8d8                                        0/1     ContainerCreating   0          42s     <none>         ip-10-0-198-152.us-east-2.compute.internal   <none>           <none>
openshift-operator-lifecycle-manager               packageserver-67947c5698-8nk8k                                        0/1     ContainerCreating   0          43s     <none>         ip-10-0-147-80.us-east-2.compute.internal    <none>           <none>
test                                               hello-pod                                                             0/1     ContainerCreating   0          8h      <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>
test                                               test-rc-dpjzs                                                         0/1     ContainerCreating   0          124m    <none>         ip-10-0-174-60.us-east-2.compute.internal    <none>           <none>
test                                               test-rc-fgxr2                                                         0/1     ContainerCreating   0          123m    <none>         ip-10-0-141-13.us-east-2.compute.internal    <none>           <none>

 
Expected Result:
The newly created pods start running successfully.

Comment 5 Anurag saxena 2020-08-27 19:38:24 UTC
This seems fairly reproducible on plain OVN clusters as well, without any special matrix config. Changing the title to reflect that.

Comment 8 Tim Rozet 2020-08-31 14:14:48 UTC
ovn-controller is segfaulting:

2020-08-27T03:36:02.997630937Z 2020-08-27T03:36:02Z|01161|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-08-27T03:36:49.137070093Z 2020-08-27T03:36:49Z|01162|pinctrl|WARN|Dropped 227 log messages in last 70 seconds (most recently, 19 seconds ago) due to excessive rate
2020-08-27T03:36:49.137070093Z 2020-08-27T03:36:49Z|01163|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-27T03:37:53.827015405Z 2020-08-27T03:37:53Z|01164|pinctrl|WARN|Dropped 173 log messages in last 65 seconds (most recently, 16 seconds ago) due to excessive rate
2020-08-27T03:37:53.827015405Z 2020-08-27T03:37:53Z|01165|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-27T03:38:35.130790057Z 2020-08-27T03:38:35Z|00001|fatal_signal(stopwatch2)|WARN|terminating with signal 11 (Segmentation fault)

We need the core dumps, collected using must-gather:

https://github.com/openshift/must-gather/pull/173
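In the meantime, the default gather plus the container logs can be grabbed along these lines (a sketch; <ovnkube-node-pod> is a placeholder for the pod on the node that logged the segfault):

# collect the default must-gather data from the cluster
oc adm must-gather --dest-dir=./must-gather
# pull the ovn-controller logs directly, including the previous instance if the container restarted
oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovn-controller --previous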

Comment 9 Anurag saxena 2020-08-31 15:42:05 UTC
The above clusters are no longer accessible; they may simply be too unhealthy by now, I believe. So I tried to bring up 3 OVN clusters on the latest nightly:

- 2 clusters on AWS
- 1 on GCP

Out of the 3 clusters, 2 came up fine (1 AWS and 1 GCP), but the remaining AWS cluster again shows some system pods stuck with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_console-757d94d5d-pflz4_openshift-console_10be6d7e-1a82-4fea-90f5-c62ab6ba5f9b_0(d06ea467ef1fdea810f8fa2eff55449a9d0d4c90461c5397d49effd402b78fc1): [openshift-console/console-757d94d5d-pflz4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-console/console-757d94d5d-pflz4] failed to configure pod interface: timed out waiting for pod flows for pod: console-757d94d5d-pflz4, error: timed out waiting for the condition

Cluster Kubeconfig: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109339/artifact/workdir/install-dir/auth/kubeconfig

I don't see an ovn-controller segfault on this cluster, though.

Next steps
----------

Bring up a cluster on last week's build to see whether the ovn-controller segfault is observed there.
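A quick way to check every node for the segfault message (a sketch; the app=ovnkube-node label selector is an assumption about how the daemonset pods are labelled):

# grep each ovn-controller container, including the previous instance where the container restarted
for p in $(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -o name); do
  echo "== $p"
  oc logs -n openshift-ovn-kubernetes "$p" -c ovn-controller --previous 2>/dev/null | grep -i 'signal 11'
  oc logs -n openshift-ovn-kubernetes "$p" -c ovn-controller | grep -i 'signal 11'
done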

Comment 10 Anurag saxena 2020-09-01 00:09:37 UTC
Tim and I investigated this today; the initial conclusions are:

1) Initial investigation suggests that it is connected to https://github.com/openshift/cluster-network-operator/pull/769 and https://bugzilla.redhat.com/show_bug.cgi?id=1867185.

2) I brought up 3 clusters via cluster-bot today with the above PR reverted, and they came up fine. However, a cluster-bot cluster is only available for a couple of hours, so it is unclear whether the revert also prevents problems later (>2 hours).

3) The reverted PR does not explain the segmentation fault that occurred in ovn-controller, but that is apparently no longer reproducible, so we need to keep an eye on it.

4) In one of the cases, ovn-controller got stuck after the segfault because it thinks the database is inconsistent. This is where https://bugzilla.redhat.com/show_bug.cgi?id=1867185 might help.

Comment 11 zhaozhanqi 2020-09-03 05:43:31 UTC
Met a similar issue on an IPv6 single-stack cluster with version 4.5.7.

It causes the pods on the node to be unable to communicate with pods on other nodes:

#oc exec -n default hello-4w4cq -- curl --connect-timeout 10 [fd01::3:28dc:5eff:fe00:18]:8080 https://[fd01::3:28dc:5eff:fe00:18]:8443 -k
Pod hello-4w4cq cannot reach fd01::3:28dc:5eff:fe00:18 on port 8080 or 8443.


#oc logs ovnkube-node-scw9l -n openshift-ovn-kubernetes -c ovn-controller
2020-09-03T02:55:27Z|00001|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-03T02:55:27Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-03T02:55:27Z|00003|main|INFO|OVS IDL reconnected, force recompute.
2020-09-03T02:55:27Z|00004|reconnect|INFO|ssl:[2620:52:0:1386::92]:9642: connecting...
2020-09-03T02:55:27Z|00005|main|INFO|OVNSB IDL reconnected, force recompute.
2020-09-03T02:55:27Z|00006|reconnect|INFO|ssl:[2620:52:0:1386::92]:9642: connected
2020-09-03T02:55:27Z|00007|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-03T02:55:27Z|00008|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-03T02:55:27Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-03T02:55:27Z|00010|binding|INFO|Claiming lport openshift-logging_fluentd-wmt6s for this chassis.
2020-09-03T02:55:27Z|00011|binding|INFO|openshift-logging_fluentd-wmt6s: Claiming dynamic
2020-09-03T02:55:27Z|00012|binding|INFO|Claiming lport default_hello-k4h26 for this chassis.
2020-09-03T02:55:27Z|00013|binding|INFO|default_hello-k4h26: Claiming dynamic
2020-09-03T02:55:27Z|00014|binding|INFO|Claiming lport openshift-apiserver_apiserver-66fd7d5854-7hsrn for this chassis.
2020-09-03T02:55:27Z|00015|binding|INFO|openshift-apiserver_apiserver-66fd7d5854-7hsrn: Claiming dynamic
2020-09-03T02:55:27Z|00016|lflow|WARN|error parsing actions "trigger_event(event = "empty_lb_backends", meter = "", vip = "fd02::3a5a:50051", protocol = "tcp", load_balancer = "52e4aa4a-456b-4ee6-81a2-90b67bcd8b55");": Invalid load balancer VIP 'fd02::3a5a:50051'
2020-09-03T02:55:27Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-03T02:55:27Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-03T02:55:27Z|00017|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T02:55:27Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-03T03:18:55Z|00018|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:19:56Z|00019|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:22:20Z|00020|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:23:20Z|00021|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:30:20Z|00022|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:31:20Z|00023|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:41:02Z|00024|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}


The OVN version:
sh-4.2# rpm -qa | grep ovn
ovn2.13-2.13.0-39.el7fdp.x86_64
ovn2.13-central-2.13.0-39.el7fdp.x86_64
ovn2.13-host-2.13.0-39.el7fdp.x86_64
ovn2.13-vtep-2.13.0-39.el7fdp.x86_64
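Given the repeated "inconsistent data" transaction errors, it may also be worth checking the southbound DB raft status from one of the ovnkube-master pods. A sketch only: the sbdb container name and the ctl socket path are assumptions and may differ between releases:

# show the raft cluster status of the OVN southbound database
oc exec -n openshift-ovn-kubernetes <ovnkube-master-pod> -c sbdb -- \
  ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound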

Comment 12 Alexander Constantinescu 2020-09-03 11:12:41 UTC
I think we're mixing too many things in this bug. I've had a look at #comment 11 and nothing in the OVN database seems to indicate issues. ovn-controller, however, logs two types of messages indicating that things are not working properly, as mentioned in #comment 11:

1) 2020-09-03T04:21:56Z|00058|lflow|WARN|error parsing actions "trigger_event(event = "empty_lb_backends", meter = "", vip = "fd02::3a5a:50051", protocol = "tcp", load_balancer = "52e4aa4a-456b-4ee6-81a2-90b67bcd8b55");": Invalid load balancer VIP 'fd02::3a5a:50051'

For which I've opened: https://bugzilla.redhat.com/show_bug.cgi?id=1875337 

2) 2020-09-03T05:09:30Z|00113|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

For which we will need the fix: https://bugzilla.redhat.com/show_bug.cgi?id=1867185
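For 1) above, the offending load balancer row can be inspected directly in the NB DB using the UUID from the log line (a sketch; it assumes ovn-nbctl inside the nbdb container can reach the local NB socket by default):

# dump the load balancer record from the error to see how its IPv6 VIP is stored
oc exec -n openshift-ovn-kubernetes <ovnkube-master-pod> -c nbdb -- \
  ovn-nbctl list load_balancer 52e4aa4a-456b-4ee6-81a2-90b67bcd8b55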

@zhao: please open a new BZ for that so that we can track the integration of these two eventual fixes into OpenShift.

Comment 13 Alexander Constantinescu 2020-09-03 15:55:37 UTC
Anurag, given the revert of the PR mentioned in #comment 10, could you re-test this bug and let me know the outcome?

Also, you never mentioned that all the pods are stuck in ContainerCreating, as #comment 0 indicates. So if you are able to reproduce the errors *you* mentioned, please open a separate bug.

/Alex

Comment 14 Alexander Constantinescu 2020-09-03 15:58:02 UTC
Hi Huiran

Could you reproduce the issue again? The kubeconfig you provided in #comment 3 is no longer valid.

And only attach something if all the pods are stuck in ContainerCreating, as you mentioned in #comment 0; if that is not the case, open a new bug or change the title of this one.

/Alex

Comment 21 zhaozhanqi 2020-09-08 04:12:41 UTC
(In reply to Alexander Constantinescu from comment #12)
> I think we're mixing too many things in this bug, I've had a look at
> #comment 11 and nothing in the OVN database seems to indicate issues.
> ovn-controller however logs two types of messages which indicate that things
> are not working properly, as mentioned in #comment 11
> 
> 1) 2020-09-03T04:21:56Z|00058|lflow|WARN|error parsing actions
> "trigger_event(event = "empty_lb_backends", meter = "", vip =
> "fd02::3a5a:50051", protocol = "tcp", load_balancer =
> "52e4aa4a-456b-4ee6-81a2-90b67bcd8b55");": Invalid load balancer VIP
> 'fd02::3a5a:50051'
> 
> For which I've opened: https://bugzilla.redhat.com/show_bug.cgi?id=1875337 
> 
> 2) 2020-09-03T05:09:30Z|00113|ovsdb_idl|WARN|transaction error:
> {"details":"inconsistent data","error":"ovsdb error"}
> 
> For which we will need the fix:
> https://bugzilla.redhat.com/show_bug.cgi?id=1867185
> 
> @zhao: please open a new BZ for that so that we can track the integration of
> these two eventual fixes into OpenShift.

OK, thanks for the information, @Alexander.

Comment 22 Anurag saxena 2020-09-08 16:57:27 UTC
This bug doesn't seem to be a test blocker on recent builds, based on the following observations:

1) I brought up 5 OVN clusters on multiple cloud providers and they seem to be working fine so far (12 hours under observation and counting).
2) The original issue of pods stuck in ContainerCreating is no longer reproducible. It is possible that it was taken care of by Tim's timer revert fix.
3) Merging https://bugzilla.redhat.com/show_bug.cgi?id=1867185 might solve some intermittent DB issues.

Based on the above, I am removing the TestBlocker keyword and reducing the severity.

I will keep posting updates as cluster longevity increases. As @Alex mentioned, we can file separate bugs for anything that doesn't seem connected to this one.

