Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1882845

Summary: proxy CI: pod "connectivity-test" failed with status... containers with unready status... connectivity-test-container
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Version: 4.6
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Keywords: Reopened
Status: CLOSED NOTABUG
Reporter: W. Trevor King <wking>
Assignee: Victor Pickard <vpickard>
QA Contact: zhaozhanqi <zzhao>
CC: aconstan
Type: Bug
Last Closed: 2020-11-02 15:56:52 UTC

Description W. Trevor King 2020-09-25 22:14:05 UTC
Test:
[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]

is failing frequently in proxy CI:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=pod.*connectivity-test.*failed%20with%20status.*containers%20with%20unready%20status.*connectivity-test-container' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ocp-4.4-e2e-aws-proxy - 1 runs, 100% failed, 100% of failures match
periodic-ci-openshift-release-master-ocp-4.5-e2e-aws-proxy - 1 runs, 100% failed, 100% of failures match
periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy - 19 runs, 100% failed, 89% of failures match
rehearse-12228-pull-ci-openshift-installer-master-e2e-aws-proxy - 3 runs, 100% failed, 33% of failures match

Example job [1] had:

fail [k8s.io/kubernetes.0/test/e2e/network/networking.go:110]: Unexpected error:
    <*errors.errorString | 0xc0020d6670>: {
        s: "pod \"connectivity-test\" failed with status: {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:17 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:52 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [connectivity-test-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:52 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [connectivity-test-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:17 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.57.79 PodIP:10.128.3.224 PodIPs:[{IP:10.128.3.224}] StartTime:2020-09-25 18:37:17 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:connectivity-test-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2020-09-25 18:37:21 +0000 UTC,FinishedAt:2020-09-25 18:37:51 +0000 UTC,ContainerID:cri-o://0340bf5e8875e1f194a1eb98354c72ede3c55ee1beecd812285ca302efef6d38,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:k8s.gcr.io/e2e-test-images/agnhost:2.20 ImageID:k8s.gcr.io/e2e-test-images/agnhost@sha256:17e61a0b9e498b6c73ed97670906be3d5a3ae394739c1bd5b619e1a004885cf0 ContainerID:cri-o://0340bf5e8875e1f194a1eb98354c72ede3c55ee1beecd812285ca302efef6d38 Started:0xc001ef650f}] QOSClass:BestEffort EphemeralContainerStatuses:[]}",
    }
    pod "connectivity-test" failed with status: {Phase:Failed Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:17 +0000 UTC Reason: Message:} {Type:Ready Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:52 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [connectivity-test-container]} {Type:ContainersReady Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:52 +0000 UTC Reason:ContainersNotReady Message:containers with unready status: [connectivity-test-container]} {Type:PodScheduled Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-09-25 18:37:17 +0000 UTC Reason: Message:}] Message: Reason: NominatedNodeName: HostIP:10.0.57.79 PodIP:10.128.3.224 PodIPs:[{IP:10.128.3.224}] StartTime:2020-09-25 18:37:17 +0000 UTC InitContainerStatuses:[] ContainerStatuses:[{Name:connectivity-test-container State:{Waiting:nil Running:nil Terminated:&ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2020-09-25 18:37:21 +0000 UTC,FinishedAt:2020-09-25 18:37:51 +0000 UTC,ContainerID:cri-o://0340bf5e8875e1f194a1eb98354c72ede3c55ee1beecd812285ca302efef6d38,}} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:k8s.gcr.io/e2e-test-images/agnhost:2.20 ImageID:k8s.gcr.io/e2e-test-images/agnhost@sha256:17e61a0b9e498b6c73ed97670906be3d5a3ae394739c1bd5b619e1a004885cf0 ContainerID:cri-o://0340bf5e8875e1f194a1eb98354c72ede3c55ee1beecd812285ca302efef6d38 Started:0xc001ef650f}] QOSClass:BestEffort EphemeralContainerStatuses:[]}
occurred

I don't know why connectivity-test-container never goes ready in the proxy environment.  Perhaps the pod needs access to the outside world and should have the HTTP(S)_PROXY environment variables set and trusted CAs mounted, as in bug 1882486?
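For illustration, the sort of change speculated about above might look roughly like this on the test pod. This is a sketch only: the proxy URL and configmap name are hypothetical, the nc command is approximated from the failure logs, and a real fix would have to land in the upstream e2e framework rather than a hand-written manifest.

oc create -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: connectivity-test
spec:
  restartPolicy: Never
  containers:
  - name: connectivity-test-container
    image: k8s.gcr.io/e2e-test-images/agnhost:2.20
    # approximated from the nc timeout seen in the pod logs
    command: ["/bin/sh", "-c", "nc -vz -w 30 8.8.8.8 53"]
    env:                          # hypothetical: point the pod at the cluster proxy
    - name: HTTP_PROXY
      value: http://<proxy-url>:3128
    - name: HTTPS_PROXY
      value: http://<proxy-url>:3128
    volumeMounts:                 # hypothetical: trust the proxy's CA, as in bug 1882486
    - name: trusted-ca
      mountPath: /etc/pki/ca-trust/extracted/pem
      readOnly: true
  volumes:
  - name: trusted-ca
    configMap:
      name: trusted-ca-bundle     # hypothetical; populated via the config.openshift.io/inject-trusted-cabundle label
EOF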

Once we get this bug sorted, we should drop this skip [2].

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.6-e2e-aws-proxy/1309544921245421568
[2]: https://github.com/openshift/release/blob/4c9dd40104656afb73e609e3c3d39c0c86bc57b4/ci-operator/step-registry/openshift/e2e/aws/proxy/openshift-e2e-aws-proxy-workflow.yaml#L17

Comment 1 Victor Pickard 2020-10-21 19:00:10 UTC
I am not able to reproduce this failure. I spun up two separate AWS clusters on cluster-bot with the proxy option as follows:

launch 4.6.0-0.nightly aws,proxy

Then, I ran this specific test (4 times), and it passed each time:

[vpickard@rippleRider$][~/go/src/github.com/openshift/origin] (master +$%>)$ KUBE_SSH_USER=core KUBE_SSH_KEY_PATH=~/openshift/.ssh/id_rsa.pub ./openshift-tests run-test "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"
I1021 14:54:34.476129  242803 test_context.go:429] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Oct 21 14:54:34.514: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Oct 21 14:54:34.560: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Oct 21 14:54:34.709: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Oct 21 14:54:34.709: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Oct 21 14:54:34.709: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Oct 21 14:54:34.759: INFO: e2e test version: v0.0.0-master+$Format:%h$
Oct 21 14:54:34.792: INFO: kube-apiserver version: v1.19.0+d59ce34
Oct 21 14:54:34.836: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1429
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1429
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:59
[BeforeEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename nettest
Oct 21 14:54:34.990: INFO: About to run a Kube e2e test, ensuring namespace is privileged
Oct 21 14:54:35.418: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/network/networking.go:94
STEP: Executing a successful http request from the external internet
[It] should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
  k8s.io/kubernetes.0/test/e2e/network/networking.go:108
STEP: Running container which tries to connect to 8.8.8.8
Oct 21 14:54:35.794: INFO: Waiting up to 5m0s for pod "connectivity-test" in namespace "e2e-nettest-4122" to be "Succeeded or Failed"
Oct 21 14:54:35.832: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 38.812047ms
Oct 21 14:54:37.879: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 2.085627985s
Oct 21 14:54:39.924: INFO: Pod "connectivity-test": Phase="Succeeded", Reason="", readiness=false. Elapsed: 4.130012739s
STEP: Saw pod success
Oct 21 14:54:39.924: INFO: Pod "connectivity-test" satisfied condition "Succeeded or Failed"
[AfterEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/framework/framework.go:175
Oct 21 14:54:39.924: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-nettest-4122" for this suite.
Oct 21 14:54:40.099: INFO: Running AfterSuite actions on all nodes
Oct 21 14:54:40.099: INFO: Running AfterSuite actions on node 1


Finally, I also ran the entire conformance suite and verified that this test passed there as well:

started: (5/962/2431) "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"
passed: (16.6s) 2020-10-21T14:48:21 "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"


I'm going to close this as works for me. Please reopen if you see this issue again. Thanks.

Comment 2 Victor Pickard 2020-10-22 13:50:26 UTC
Re-opening this bz to re-enable the skipped test.

Comment 3 Victor Pickard 2020-10-28 15:18:40 UTC
I see the job failed again on the PR:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/12956/rehearse-12956-pull-ci-openshift-installer-master-e2e-aws-proxy/1321119347678121984

[It] should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
  k8s.io/kubernetes.0/test/e2e/network/networking.go:108
STEP: Running container which tries to connect to 8.8.8.8
Oct 27 17:31:50.431: INFO: Waiting up to 5m0s for pod "connectivity-test" in namespace "e2e-nettest-352" to be "Succeeded or Failed"
Oct 27 17:31:50.516: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 84.491459ms
Oct 27 17:31:52.602: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 2.170169559s
Oct 27 17:31:54.687: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 4.255197172s
Oct 27 17:31:56.785: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 6.353546045s
Oct 27 17:31:58.870: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 8.438446377s
Oct 27 17:32:00.956: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 10.524378568s
Oct 27 17:32:03.043: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 12.61148537s
Oct 27 17:32:05.128: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 14.696920292s
Oct 27 17:32:07.214: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 16.782603616s
Oct 27 17:32:09.299: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 18.867396968s
Oct 27 17:32:11.384: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 20.952635126s
Oct 27 17:32:13.469: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 23.037589928s
Oct 27 17:32:15.554: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 25.122667677s
Oct 27 17:32:17.639: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 27.207716518s
Oct 27 17:32:19.724: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 29.292483653s
Oct 27 17:32:21.810: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 31.378323027s
Oct 27 17:32:23.919: INFO: Pod "connectivity-test": Phase="Running", Reason="", readiness=true. Elapsed: 33.487076444s
Oct 27 17:32:26.005: INFO: Pod "connectivity-test": Phase="Failed", Reason="", readiness=false. Elapsed: 35.573639804s
Oct 27 17:32:26.103: INFO: pod e2e-nettest-352/connectivity-test logs:
nc: connect to 8.8.8.8 port 53 (tcp) timed out: Operation in progress
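For anyone comparing environments by hand, the probe can be reproduced outside the suite with something like this (a sketch; the image and the nc invocation are taken from the logs above):

oc run connectivity-test --restart=Never \
  --image=k8s.gcr.io/e2e-test-images/agnhost:2.20 \
  --command -- /bin/sh -c 'nc -vz -w 30 8.8.8.8 53'
oc logs connectivity-test   # in the failing CI environment this should show the same nc timeout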



I spun up another AWS cluster with the proxy option and ran the test both by itself and as part of the entire suite. Both passed. So what is different between the cluster config in the CI job where the PR is failing and an AWS cluster launched with the proxy option, as below? Trevor, any ideas on what the difference between the two clusters may be?

Launch cluster
===============
launch 4.7.0-0.ci aws,proxy


Test passing when running entire suite
=======================================
KUBE_SSH_USER=core KUBE_SSH_KEY_PATH=~/openshift/.ssh/id_rsa.pub ./openshift-tests run --junit-dir $HOME/openshift-test-runs -o test.out --include-success openshift/network/stress


[It] should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
  k8s.io/kubernetes.0/test/e2e/network/networking.go:108
STEP: Running container which tries to connect to 8.8.8.8
Oct 28 10:47:42.602: INFO: Waiting up to 5m0s for pod "connectivity-test" in namespace "e2e-nettest-7906" to be "Succeeded or Failed"
Oct 28 10:47:42.651: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 49.053086ms
Oct 28 10:47:44.856: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 2.254526735s
Oct 28 10:47:46.907: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 4.305084648s
Oct 28 10:47:48.976: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 6.374075053s
Oct 28 10:47:51.038: INFO: Pod "connectivity-test": Phase="Succeeded", Reason="", readiness=false. Elapsed: 8.435686744s
STEP: Saw pod success
Oct 28 10:47:51.038: INFO: Pod "connectivity-test" satisfied condition "Succeeded or Failed"
[AfterEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/framework/framework.go:175
Oct 28 10:47:51.038: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-nettest-7906" for this suite.
Oct 28 10:47:51.311: INFO: Running AfterSuite actions on all nodes
Oct 28 10:47:51.311: INFO: Running AfterSuite actions on node 1

passed: (12.9s) 2020-10-28T14:47:51 "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"


Run just the one test
=====================
KUBE_SSH_USER=core KUBE_SSH_KEY_PATH=~/openshift/.ssh/id_rsa.pub ./openshift-tests run-test "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"


KUBE_SSH_USER=core KUBE_SSH_KEY_PATH=~/openshift/.ssh/id_rsa.pub ./openshift-tests run-test "[sig-network] Networking should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]"
I1028 11:14:32.989870 1596097 test_context.go:429] Tolerating taints "node-role.kubernetes.io/master" when considering if nodes are ready
Oct 28 11:14:33.039: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable
Oct 28 11:14:33.139: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Oct 28 11:14:33.299: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Oct 28 11:14:33.299: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Oct 28 11:14:33.299: INFO: Waiting up to 5m0s for all daemonsets in namespace 'kube-system' to start
Oct 28 11:14:33.356: INFO: e2e test version: v0.0.0-master+$Format:%h$
Oct 28 11:14:33.408: INFO: kube-apiserver version: v1.19.0-rc.2.1099+e67f5dcb92ff67-dirty
Oct 28 11:14:33.456: INFO: Cluster IP family: ipv4
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1429
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/framework.go:1429
[BeforeEach] [Top Level]
  github.com/openshift/origin/test/extended/util/test.go:59
[BeforeEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/framework/framework.go:174
STEP: Creating a kubernetes client
STEP: Building a namespace api object, basename nettest
Oct 28 11:14:33.625: INFO: About to run a Kube e2e test, ensuring namespace is privileged
Oct 28 11:14:34.590: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/network/networking.go:94
STEP: Executing a successful http request from the external internet
[It] should provide Internet connection for containers [Feature:Networking-IPv4] [Skipped:azure] [Suite:openshift/conformance/parallel] [Suite:k8s]
  k8s.io/kubernetes.0/test/e2e/network/networking.go:108
STEP: Running container which tries to connect to 8.8.8.8
Oct 28 11:14:35.085: INFO: Waiting up to 5m0s for pod "connectivity-test" in namespace "e2e-nettest-4395" to be "Succeeded or Failed"
Oct 28 11:14:35.132: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 46.998249ms
Oct 28 11:14:37.189: INFO: Pod "connectivity-test": Phase="Pending", Reason="", readiness=false. Elapsed: 2.104139622s
Oct 28 11:14:39.248: INFO: Pod "connectivity-test": Phase="Succeeded", Reason="", readiness=false. Elapsed: 4.163717026s
STEP: Saw pod success
Oct 28 11:14:39.249: INFO: Pod "connectivity-test" satisfied condition "Succeeded or Failed"
[AfterEach] [sig-network] Networking
  k8s.io/kubernetes.0/test/e2e/framework/framework.go:175
Oct 28 11:14:39.249: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-nettest-4395" for this suite.
Oct 28 11:14:39.467: INFO: Running AfterSuite actions on all nodes
Oct 28 11:14:39.469: INFO: Running AfterSuite actions on node 1

Comment 4 W. Trevor King 2020-10-28 17:05:38 UTC
> launch 4.7.0-0.ci aws,proxy

cluster-bot does not create clusters with the blackholed private subnets we get in the CI proxy job [1].  That means proxy requests in cluster-bot clusters are currently "we would like you to go through this proxy, but going directly will work too".  The CI jobs, on the other hand, are "the only way off the cluster is through the proxy; you will fail if you try to go directly".

[1]: https://github.com/openshift/release/tree/92ab42c725b305355dd0c250cf59608914f95712/ci-operator/step-registry/ipi/conf/aws/blackholenetwork
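One way to see that difference from inside a cluster (a sketch: the UBI image choice is arbitrary, and <proxy-url> is hypothetical; take the real value from `oc get proxy cluster -o yaml`):

# direct egress: open from cluster-bot clusters, blackholed in the CI proxy job
oc run egress-direct --restart=Never --image=registry.access.redhat.com/ubi8/ubi \
  --command -- curl -sS --max-time 10 https://mirror.openshift.com/
# proxied egress: should succeed in both environments
oc run egress-proxied --restart=Never --image=registry.access.redhat.com/ubi8/ubi \
  --env=HTTPS_PROXY=http://<proxy-url>:3128 \
  --command -- curl -sS --max-time 10 https://mirror.openshift.com/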

Comment 6 Victor Pickard 2020-11-02 15:56:52 UTC
The failing test is part of the Kubernetes e2e network suite, and that suite is not expected to consume the cluster proxy settings.
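
Worth noting for anyone who lands here later: the probe is a raw TCP connect (see the nc output in comment 3), and nc does not read the HTTP(S)_PROXY environment variables, so even if the framework exported them the check would still attempt a direct connection. A minimal illustration (hypothetical proxy URL):

# the proxy variables are ignored by nc; this still dials 8.8.8.8 directly
HTTPS_PROXY=http://<proxy-url>:3128 nc -vz -w 5 8.8.8.8 53   # times out when direct egress is blackholed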