Description of problem:
Multiple pods are stuck in ContainerCreating. This happened on a cluster with multiple block devices and etcd encryption enabled.

Test matrix: IPI_AWS_Connected_No Proxy_RHCOS 4.6_Normal Kernel_Disk Encryption off_FIPS off_OVN_IPv4_Etcd Encryption on_Multiple Block Devices for Machinesets Resources

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-23-185640

How reproducible:
Always

Steps to Reproduce:
1. Launch a cluster with multiple block devices
2. Create a project and pods in it (see the reproduction sketch after the pod listing below)

Actual results:

oc get pods -n test
NAME            READY   STATUS              RESTARTS   AGE
hello-pod       0/1     ContainerCreating   0          6h40m
test-rc-dpjzs   0/1     ContainerCreating   0          86s
test-rc-fgxr2   0/1     ContainerCreating   0          28s

The pod events show:

Warning FailedCreatePodSandBox 21s kubelet, ip-10-0-174-60.us-east-2.compute.internal Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-dpjzs_test_1531237d-a3d4-49cd-907e-631e8ed6edb1_0(b9aec112e21ba389a686fcc511708fa52e8c2154fd1e5c59bb3ed4f54eee2acb): [test/test-rc-dpjzs:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[test/test-rc-dpjzs] failed to configure pod interface: timed out waiting for pod flows for pod: test-rc-dpjzs, error: timed out waiting for the condition

Recreating the pods does not help.

oc get pods --all-namespaces -o wide | egrep Contai
openshift-authentication oauth-openshift-69dfc87d5c-lhzml 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-cloud-credential-operator pod-identity-webhook-7f99757f4c-7q4qw 0/1 ContainerCreating 0 9h <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-cluster-storage-operator csi-snapshot-controller-7c5944b484-rjbcs 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-cluster-storage-operator csi-snapshot-controller-operator-8694cdd5f5-xrl8w 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-console-operator console-operator-7879c8b586-vjssd 0/1 ContainerCreating 0 9h <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-console console-6c7fb9bc64-m22dq 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-console downloads-fdfd8f85c-hzqcg 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-etcd revision-pruner-3-ip-10-0-147-80.us-east-2.compute.internal 0/1 ContainerCreating 0 9h <none> ip-10-0-147-80.us-east-2.compute.internal <none> <none>
openshift-image-registry cluster-image-registry-operator-8ff5f5b56-p8677 0/1 ContainerCreating 0 9h <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-image-registry image-pruner-1598313600-ppmmf 0/1 ContainerCreating 0 4h8m <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-image-registry image-registry-75c79757f6-n46d6 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-ingress router-default-69f6895456-8g259 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-insights insights-operator-668bb94fb8-9g7bz 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-kube-apiserver installer-7-ip-10-0-167-104.us-east-2.compute.internal 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-kube-apiserver revision-pruner-6-ip-10-0-147-80.us-east-2.compute.internal 0/1 ContainerCreating 0 9h <none> ip-10-0-147-80.us-east-2.compute.internal <none> <none>
openshift-kube-controller-manager-operator kube-controller-manager-operator-86d8f47-p5gmf 0/1 ContainerCreating 0 9h <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-kube-controller-manager revision-pruner-8-ip-10-0-167-104.us-east-2.compute.internal 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-kube-scheduler-operator openshift-kube-scheduler-operator-69bd6d7c8-s99zq 0/1 ContainerCreating 0 9h <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-kube-storage-version-migrator migrator-8465f64b7f-r8rvq 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-machine-api machine-api-controllers-74b577f656-nhdd9 0/7 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-marketplace certified-operators-rrqpp 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-marketplace certified-operators-z2gfq 0/1 ContainerCreating 0 3m25s <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-marketplace community-operators-4vnpp 0/1 ContainerCreating 0 3m9s <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-marketplace community-operators-9f4qx 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-marketplace marketplace-operator-b67687ddb-6kspv 0/1 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-marketplace qe-app-registry-bscpp 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-marketplace qe-app-registry-x22cq 0/1 ContainerCreating 0 14m <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-marketplace redhat-marketplace-dbhvn 0/1 ContainerCreating 0 8h <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-marketplace redhat-marketplace-hgb6z 0/1 ContainerCreating 0 7m50s <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-marketplace redhat-operators-4p9lq 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-marketplace redhat-operators-tvsgx 0/1 ContainerCreating 0 2s <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-monitoring alertmanager-main-1 0/5 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-monitoring openshift-state-metrics-5b6f99854d-n9lgz 0/3 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-adapter-5984b99488-nbp2d 0/1 ContainerCreating 0 3h50m <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-adapter-b5764f4b8-97v68 0/1 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-k8s-1 0/7 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-monitoring prometheus-operator-7857d6cf98-k5kqq 0/2 ContainerCreating 0 9h <none> ip-10-0-167-104.us-east-2.compute.internal <none> <none>
openshift-monitoring thanos-querier-7bc7c4c6cf-wxb72 0/5 ContainerCreating 0 9h <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
openshift-operator-lifecycle-manager packageserver-595494f96c-gv8d8 0/1 ContainerCreating 0 42s <none> ip-10-0-198-152.us-east-2.compute.internal <none> <none>
openshift-operator-lifecycle-manager packageserver-67947c5698-8nk8k 0/1 ContainerCreating 0 43s <none> ip-10-0-147-80.us-east-2.compute.internal <none> <none>
test hello-pod 0/1 ContainerCreating 0 8h <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>
test test-rc-dpjzs 0/1 ContainerCreating 0 124m <none> ip-10-0-174-60.us-east-2.compute.internal <none> <none>
test test-rc-fgxr2 0/1 ContainerCreating 0 123m <none> ip-10-0-141-13.us-east-2.compute.internal <none> <none>

Expected results:
The newly created pods start successfully.
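For reference, a minimal sketch of the reproduction steps mentioned above (the project name matches the output, but the image and the exact pod definition are placeholders, not necessarily what was used):

# (hypothetical) create the test project and a simple pod
oc new-project test
oc run hello-pod --image=openshift/hello-openshift --restart=Never
# watch the pod; in the failing case it stays in ContainerCreating
oc get pods -n test -w
# inspect the sandbox-creation events for the stuck pod
oc describe pod hello-pod -n test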
This seems fairly reproducible on plain OVN clusters as well, without any special matrix config. Changing the title to reflect that.
ovn-controller is segfaulting:

2020-08-27T03:36:02.997630937Z 2020-08-27T03:36:02Z|01161|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-08-27T03:36:49.137070093Z 2020-08-27T03:36:49Z|01162|pinctrl|WARN|Dropped 227 log messages in last 70 seconds (most recently, 19 seconds ago) due to excessive rate
2020-08-27T03:36:49.137070093Z 2020-08-27T03:36:49Z|01163|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-27T03:37:53.827015405Z 2020-08-27T03:37:53Z|01164|pinctrl|WARN|Dropped 173 log messages in last 65 seconds (most recently, 16 seconds ago) due to excessive rate
2020-08-27T03:37:53.827015405Z 2020-08-27T03:37:53Z|01165|pinctrl|WARN|MLD Querier enabled with invalid IPv6 src address
2020-08-27T03:38:35.130790057Z 2020-08-27T03:38:35Z|00001|fatal_signal(stopwatch2)|WARN|terminating with signal 11 (Segmentation fault)

We need the core dumps, collected using must-gather: https://github.com/openshift/must-gather/pull/173
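A hedged sketch of how those logs and core dumps could be collected; the ovnkube-node pod name and the must-gather image are placeholders/assumptions, not something confirmed in this bug:

# (hypothetical) grab the previous ovn-controller log from the crashing pod
oc -n openshift-ovn-kubernetes logs ovnkube-node-<id> -c ovn-controller --previous
# (assumption) once the linked must-gather PR merges, a must-gather run should also pick up the core dumps
oc adm must-gather --image=quay.io/openshift/origin-must-gather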
The above clusters are no longer accessible; I believe they are too sick by now. So I tried to bring up 3 OVN clusters on the latest nightly: 2 on AWS and 1 on GCP. Out of the 3 clusters, 2 came up fine (1 AWS and 1 GCP), but the third one, on AWS, again shows some system pods stuck with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_console-757d94d5d-pflz4_openshift-console_10be6d7e-1a82-4fea-90f5-c62ab6ba5f9b_0(d06ea467ef1fdea810f8fa2eff55449a9d0d4c90461c5397d49effd402b78fc1): [openshift-console/console-757d94d5d-pflz4:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-console/console-757d94d5d-pflz4] failed to configure pod interface: timed out waiting for pod flows for pod: console-757d94d5d-pflz4, error: timed out waiting for the condition

(A hedged sketch for checking the missing pod flows on a node follows at the end of this comment.)

Cluster kubeconfig: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/109339/artifact/workdir/install-dir/auth/kubeconfig

I don't see the ovn-controller segfault on this cluster, though.

Next steps
----------
Bringing up a cluster on last week's build to see if the ovn-controller segfault is observed there.
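"Timed out waiting for pod flows" appears to mean the CNI never saw the OpenFlow flows for the pod's interface show up on br-int. A hedged sketch of how that could be checked on an affected node; the node name is a placeholder, and this assumes OVS is running on the host (on builds where OVS still runs in the ovs-node pods, the same commands could be run inside that pod instead):

# (hypothetical) list ports on the integration bridge and look for the pod's interface
oc debug node/<node> -- chroot /host ovs-vsctl show
# (hypothetical) dump the flows on br-int; if nothing references the pod's port, ovn-controller never programmed it
oc debug node/<node> -- chroot /host ovs-ofctl -O OpenFlow13 dump-flows br-int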
Tim and I investigated this today. Initial conclusions:

1) The initial investigation suggests that it is connected to https://github.com/openshift/cluster-network-operator/pull/769 and https://bugzilla.redhat.com/show_bug.cgi?id=1867185.

2) I brought up 3 clusters via cluster-bot today with the above PR reverted and they came up fine. However, the limitation with cluster-bot is that a cluster is only available for a couple of hours, so it is unclear whether the revert would also prevent problems later (>2 hours).

3) The reverted PR does not explain the segmentation fault seen in ovn-controller, but that is apparently no longer reproducible, so we need to keep an eye on it.

4) In one of the cases ovn-controller got stuck due to the segfault because it thinks the database is inconsistent. This is where https://bugzilla.redhat.com/show_bug.cgi?id=1867185 might help (see the sketch below).
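For point 4, a hedged sketch of how the southbound database state could be inspected when ovn-controller keeps reporting "inconsistent data". The ovnkube-master pod name, the sbdb container name, and the control-socket path are assumptions and may differ between builds:

# (hypothetical) dump the SB_Global row as a basic sanity check
oc -n openshift-ovn-kubernetes exec ovnkube-master-<id> -c sbdb -- ovn-sbctl --no-leader-only list SB_Global
# (hypothetical) check RAFT cluster health of the southbound DB; the ctl path is an assumption
oc -n openshift-ovn-kubernetes exec ovnkube-master-<id> -c sbdb -- ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound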
Met a similar issue on an IPv6 single-stack cluster with version 4.5.7. It caused the pods on the node to be unable to communicate with pods on other nodes.

# oc exec -n default hello-4w4cq -- curl --connect-timeout 10 [fd01::3:28dc:5eff:fe00:18]:8080 https://[fd01::3:28dc:5eff:fe00:18]:8443 -k

Pod hello-4w4cq cannot communicate with fd01::3:28dc:5eff:fe00:18 on port 8080 or 8443.

# oc logs ovnkube-node-scw9l -n openshift-ovn-kubernetes -c ovn-controller
2020-09-03T02:55:27Z|00001|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2020-09-03T02:55:27Z|00002|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2020-09-03T02:55:27Z|00003|main|INFO|OVS IDL reconnected, force recompute.
2020-09-03T02:55:27Z|00004|reconnect|INFO|ssl:[2620:52:0:1386::92]:9642: connecting...
2020-09-03T02:55:27Z|00005|main|INFO|OVNSB IDL reconnected, force recompute.
2020-09-03T02:55:27Z|00006|reconnect|INFO|ssl:[2620:52:0:1386::92]:9642: connected
2020-09-03T02:55:27Z|00007|ofctrl|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-03T02:55:27Z|00008|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-03T02:55:27Z|00009|rconn|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-03T02:55:27Z|00010|binding|INFO|Claiming lport openshift-logging_fluentd-wmt6s for this chassis.
2020-09-03T02:55:27Z|00011|binding|INFO|openshift-logging_fluentd-wmt6s: Claiming dynamic
2020-09-03T02:55:27Z|00012|binding|INFO|Claiming lport default_hello-k4h26 for this chassis.
2020-09-03T02:55:27Z|00013|binding|INFO|default_hello-k4h26: Claiming dynamic
2020-09-03T02:55:27Z|00014|binding|INFO|Claiming lport openshift-apiserver_apiserver-66fd7d5854-7hsrn for this chassis.
2020-09-03T02:55:27Z|00015|binding|INFO|openshift-apiserver_apiserver-66fd7d5854-7hsrn: Claiming dynamic
2020-09-03T02:55:27Z|00016|lflow|WARN|error parsing actions "trigger_event(event = "empty_lb_backends", meter = "", vip = "fd02::3a5a:50051", protocol = "tcp", load_balancer = "52e4aa4a-456b-4ee6-81a2-90b67bcd8b55");": Invalid load balancer VIP 'fd02::3a5a:50051'
2020-09-03T02:55:27Z|00001|pinctrl(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting to switch
2020-09-03T02:55:27Z|00002|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connecting...
2020-09-03T02:55:27Z|00017|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T02:55:27Z|00003|rconn(ovn_pinctrl0)|INFO|unix:/var/run/openvswitch/br-int.mgmt: connected
2020-09-03T03:18:55Z|00018|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:19:56Z|00019|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:22:20Z|00020|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:23:20Z|00021|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:30:20Z|00022|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:31:20Z|00023|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}
2020-09-03T03:41:02Z|00024|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

The OVN version:
sh-4.2# rpm -qa | grep ovn
ovn2.13-2.13.0-39.el7fdp.x86_64
ovn2.13-central-2.13.0-39.el7fdp.x86_64
ovn2.13-host-2.13.0-39.el7fdp.x86_64
ovn2.13-vtep-2.13.0-39.el7fdp.x86_64
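The "Invalid load balancer VIP 'fd02::3a5a:50051'" warning in the log above looks like an IPv6 VIP written without brackets. A hedged sketch of how the offending load-balancer row could be located in the northbound DB; the ovnkube-master pod name and the nbdb container name are assumptions:

# (hypothetical) list load balancers in the NB DB and look for the malformed VIP
oc -n openshift-ovn-kubernetes exec ovnkube-master-<id> -c nbdb -- ovn-nbctl list load_balancer | grep -B 5 '50051'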
I think we're mixing too many things in this bug. I've had a look at #comment 11 and nothing in the OVN database seems to indicate issues. ovn-controller, however, logs two types of messages which indicate that things are not working properly, as mentioned in #comment 11:

1) 2020-09-03T04:21:56Z|00058|lflow|WARN|error parsing actions "trigger_event(event = "empty_lb_backends", meter = "", vip = "fd02::3a5a:50051", protocol = "tcp", load_balancer = "52e4aa4a-456b-4ee6-81a2-90b67bcd8b55");": Invalid load balancer VIP 'fd02::3a5a:50051'

For which I've opened: https://bugzilla.redhat.com/show_bug.cgi?id=1875337

2) 2020-09-03T05:09:30Z|00113|ovsdb_idl|WARN|transaction error: {"details":"inconsistent data","error":"ovsdb error"}

For which we will need the fix: https://bugzilla.redhat.com/show_bug.cgi?id=1867185

@zhao: please open a new BZ for that so that we can track the integration of these two eventual fixes into OpenShift.
Anurag, given the revert of the PR mentioned in #comment 10, could you re-test this bug and let me know the outcome?

Also, you never mentioned that all the pods are stuck in ContainerCreating as #comment 0 indicates. So if you are able to reproduce the errors *you* mentioned, please open a separate bug.

/Alex
Hi Huiran,

Could you reproduce the issue again? The kubeconfig you provided in #comment 3 is no longer valid.

Also, please only attach something here if all the pods are stuck in ContainerCreating as you mentioned in #comment 0; if that is not the case, open a new bug, or change the title of this one.

/Alex
(In reply to Alexander Constantinescu from comment #12)
> @zhao: please open a new BZ for that so that we can track the integration of
> these two eventual fixes into OpenShift.

OK, thanks for the information, @Alexander.
This bug doesn't seem to be a testblocker on recent builds, based on the following observations:

1) I brought up 5 OVN clusters on multiple cloud providers and they seem to be working fine so far (12 hours under observation, and ongoing).

2) The original issue of pods stuck in ContainerCreating is no longer reproducible. It is possible that it was taken care of by Tim's timer revert fix.

3) The merge of https://bugzilla.redhat.com/show_bug.cgi?id=1867185 might solve some intermittent DB issues.

Based on the above, I am removing the TestBlocker keyword and reducing the severity. I will keep posting updates as cluster longevity increases (a rough monitoring sketch follows below). As @Alex mentioned, we can file separate bugs for anything that doesn't seem connected to this one.
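A minimal sketch of how the ongoing observation could be automated, assuming a shell with access to the cluster kubeconfig (the 5-minute interval is an arbitrary choice, not something specified in this bug):

# (hypothetical) periodically count pods stuck in ContainerCreating across all namespaces
while true; do
  date
  oc get pods -A --no-headers | awk '$4 == "ContainerCreating"' | wc -l
  sleep 300
done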