Customer upgraded the control plane to 3.11.248 and the master API is now repeatedly exiting with:

logging error output: "[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/image.openshift.io-apiserver-caches ok\n[+]poststarthook/authorization.openshift.io-bootstrapclusterroles ok\n[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok\n[+]poststarthook/project.openshift.io-projectcache ok\n[+]poststarthook/project.openshift.io-projectauthorizationcache ok\n[-]poststarthook/security.openshift.io-bootstrapscc failed: reason withheld\n[+]poststarthook/openshift.io-startinformers ok\n[+]poststarthook/openshift.io-restmapperupdater ok\n[+]poststarthook/quota.openshift.io-clusterquotamapping ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/scheduling/bootstrap-system-priority-classes failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\n[+]poststarthook/oauth.openshift.io-startoauthclientsbootstrapping ok\n[+]poststarthook/openshift.io-webconsolepublicurl ok\nhealthz check failed\n" [[openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0] 10.29.144.133:47124]

goroutine 40200 [running]:
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).recordStatus(0xc4367d7a40, 0x1f4) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:207 +0xd2
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog.(*respLogger).WriteHeader(0xc4367d7a40, 0x1f4) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/httplog/httplog.go:186 +0x35
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).WriteHeader(0xc44366ea00, 0x1f4) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:206 +0xac
net/http.Error(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc43c786000, 0x5ba, 0x1f4) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1959 +0xda
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz.handleRootHealthz.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/healthz/healthz.go:161 +0x4c4
net/http.HandlerFunc.ServeHTTP(0xc4443695a0, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux.(*pathHandler).ServeHTTP(0xc444daf680, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:241 +0x514
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux.(*PathRecorderMux).ServeHTTP(0xc4407999d0, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/mux/pathrecorder.go:234 +0x8a
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server.director.ServeHTTP(0x51e3b8e, 0xf, 0xc440a81ef0, 0xc4407999d0, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/handler.go:154 +0x6a8
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthorization.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:64 +0x4bd
net/http.HandlerFunc.ServeHTTP(0xc43c3e4480, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:160 +0x3fc
net/http.HandlerFunc.ServeHTTP(0xc43051fe90, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithImpersonation.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:49 +0x1ce4
net/http.HandlerFunc.ServeHTTP(0xc43c3e44c0, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithAuthentication.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3900) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:75 +0x2b3
net/http.HandlerFunc.ServeHTTP(0xc43c3e4500, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3700) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.WithCORS.func1(0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3700) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/cors.go:75 +0x178
net/http.HandlerFunc.ServeHTTP(0xc43781d3e0, 0x7f1c8858dda8, 0xc43bfa3aa8, 0xc42c2a3700) /opt/rh/go-toolset-1.10/root/usr/lib/go-toolset-1.10-golang/src/net/http/server.go:1947 +0x44
github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1(0xc451e0ea80, 0xc42abe45a0, 0x78581c0, 0xc43bfa3aa8, 0xc42c2a3700) /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:109 +0xb3
created by github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP /builddir/build/BUILD/atomic-openshift-git-0.92ee8ac/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:98 +0x1b8

logging error output: "[+]ping ok\n[+]log ok\n[+]etcd ok\n[+]poststarthook/generic-apiserver-start-informers ok\n[+]poststarthook/start-apiextensions-informers ok\n[+]poststarthook/start-apiextensions-controllers ok\n[+]poststarthook/image.openshift.io-apiserver-caches ok\n[+]poststarthook/authorization.openshift.io-bootstrapclusterroles ok\n[+]poststarthook/authorization.openshift.io-ensureopenshift-infra ok\n[+]poststarthook/project.openshift.io-projectcache ok\n[+]poststarthook/project.openshift.io-projectauthorizationcache ok\n[-]poststarthook/security.openshift.io-bootstrapscc failed: reason withheld\n[+]poststarthook/openshift.io-startinformers ok\n[+]poststarthook/openshift.io-restmapperupdater ok\n[+]poststarthook/quota.openshift.io-clusterquotamapping ok\n[-]poststarthook/bootstrap-controller failed: reason withheld\n[-]poststarthook/rbac/bootstrap-roles failed: reason withheld\n[-]poststarthook/scheduling/bootstrap-system-priority-classes failed: reason withheld\n[-]poststarthook/ca-registration failed: reason withheld\n[+]poststarthook/start-kube-aggregator-informers ok\n[+]poststarthook/apiservice-registration-controller ok\n[+]poststarthook/apiservice-status-available-controller ok\n[+]poststarthook/apiservice-openapi-controller ok\n[+]poststarthook/kube-apiserver-autoregistration ok\n[-]autoregister-completion failed: reason withheld\n[+]poststarthook/oauth.openshift.io-startoauthclientsbootstrapping ok\n[+]poststarthook/openshift.io-webconsolepublicurl ok\nhealthz check failed\n" [[openshift/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0] 10.29.144.133:47124]

F0804 21:39:11.247169 1 start_api.go:68] dial tcp4 0.0.0.0:443: i/o timeout
Sending to openshift-apiserver component (but this is OCP 3, not 4, sorry if this isn't the right component).
The following post-start hooks failed:
- poststarthook/security.openshift.io-bootstrapscc failed: reason withheld
- poststarthook/bootstrap-controller failed
- poststarthook/scheduling/bootstrap-system-priority-classes failed
- poststarthook/ca-registration failed
- autoregister-completion failed

The above post-start hooks failed, but we also see the following connectivity error:

F0804 21:39:11.247169 1 start_api.go:68] dial tcp4 0.0.0.0:443: i/o timeout

I traced the source of the error to https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/server/start/start_api.go#L68

I don't see the following message in the log:

glog.Infof("Starting master on %s (%s)", m.config.ServingInfo.BindAddress, version.Get().String())

from https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/server/start/start_master.go#L484. That means the master Start is failing before that point. Before that we check etcd connectivity and set up the informer cache. One hypothesis is that the etcd connectivity check is failing: https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/server/start/start_master.go#L458. Or could it be the local loopback? Can we check the master configuration file?

sttts, do you think the above analysis is on the right track?
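To help rule the etcd hypothesis in or out quickly, here is a minimal standalone probe (a sketch, not origin code; the endpoint below is a placeholder, the real client URLs should come from etcdClientInfo in master-config.yaml) that checks raw TCP reachability of etcd the way a pre-start connectivity check would:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder endpoint; replace with the etcd client URLs from master-config.yaml.
	endpoints := []string{"127.0.0.1:2379"}

	for _, ep := range endpoints {
		conn, err := net.DialTimeout("tcp", ep, 2*time.Second)
		if err != nil {
			fmt.Printf("etcd endpoint %s unreachable: %v\n", ep, err)
			continue
		}
		conn.Close()
		fmt.Printf("etcd endpoint %s reachable\n", ep)
	}
}

Note this only checks TCP reachability, not TLS or etcd health itself.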
Checked both kubeconfig files, these look okay to me. I am going to consult with the team and continue investigation.
The /healthz issue happens on master startup. Can you attach a complete log, from the beginning of startup until the issue occurs, rather than one that starts well into the issue (like the one attached)?
I see this in the other log:

E0804 20:15:19.834612 1 helpers.go:134] Encountered config error json: unknown field "masterCount" in object *config.MasterConfig, raw JSON:

I suspect the config is malformed. Taking a look at it.
Disregard my previous comment; we see the same config error on a working cluster.
By following the log and the code path, I think I have tracked down the source of the error to https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/server/origin/master.go#L239

// Attempt to verify the server came up for 20 seconds (100 tries * 100ms, 100ms timeout per try)
return cmdutil.WaitForSuccessfulDial(true, c.Options.ServingInfo.BindNetwork, c.Options.ServingInfo.BindAddress, 100*time.Millisecond, 100*time.Millisecond, 100)

F0804 21:39:11.247169 1 start_api.go:68] dial tcp4 0.0.0.0:443: i/o timeout

Focusing on the above error: if we ssh into the master node and run netcat against 0.0.0.0:443, we see the timeout error even though the apiserver listens on port 443 for a while.

# while true; do nc -vz --ssl 0.0.0.0 443; echo ; sleep 1; done
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection timed out.

But if we run the same test on a functioning 3.11 cluster, we see success:

# hostname
master-0.sharedocp311cns.lab.upshift.rdu2.redhat.com
# oc version
oc v3.11.248
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://openshift.internal.sharedocp311cns.lab.upshift.rdu2.redhat.com:443
openshift v3.11.248
kubernetes v1.11.0+d4cacc0
# while true; do nc -vz --ssl 0.0.0.0 443; echo ; sleep 1; done
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: SSL connection to 0.0.0.0:443.
Ncat: SHA-1 fingerprint: 0DFD CDA0 F3BB CFF0 45FE AB33 EF2C 6049 FB9D F674
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds.
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: SSL connection to 0.0.0.0:443.
Ncat: SHA-1 fingerprint: 0DFD CDA0 F3BB CFF0 45FE AB33 EF2C 6049 FB9D F674
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

This leads me to believe that it might be a network issue; we should get someone with network expertise to look at this and validate the hypothesis.
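For reference, a rough standalone equivalent (a sketch, not the origin helper itself) of that WaitForSuccessfulDial call: 100 tries, 100ms timeout per try, 100ms sleep between tries, TLS because the first argument is true. It can be run on a master to reproduce the ~20 second window before the process gives up; the address matches the BindAddress in the log and should be adjusted if the master binds elsewhere.

package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	const addr = "0.0.0.0:443" // BindAddress from the log
	var lastErr error
	for i := 0; i < 100; i++ { // 100 tries, as in the origin call
		dialer := &net.Dialer{Timeout: 100 * time.Millisecond}
		conn, err := tls.DialWithDialer(dialer, "tcp4", addr, &tls.Config{InsecureSkipVerify: true})
		if err == nil {
			conn.Close()
			fmt.Printf("dial succeeded on attempt %d\n", i+1)
			return
		}
		lastErr = err
		time.Sleep(100 * time.Millisecond) // 100ms sleep between tries
	}
	fmt.Fprintf(os.Stderr, "dial %s: %v\n", addr, lastErr)
	os.Exit(1)
}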
It looks like the API server didn't start correctly overall. The dial timeout at https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/server/origin/master.go#L240 indicates that the connection to the server failed - that is more of a symptom than the problem. Isn't the /healthz check failure likewise a post-start-hook symptom of the master pods not starting correctly?
Hi anbhat,

From the apiserver log we have the following:

I0804 20:15:52.211470 1 serve.go:116] Serving securely on 0.0.0.0:443

We can trace it back to https://github.com/openshift/origin/blob/release-3.11/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/server/serve.go#L116. After this point "RunServer" is called. If we dig down further, "server.Serve(listener)" runs in its own goroutine, and if "Serve" fails we should see an error message or a panic: https://github.com/openshift/origin/blob/release-3.11/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/server/serve.go#L142-L160

We don't see any error message in the log indicating a failure to configure the net listener or to bind to 0.0.0.0:443. This indicates the apiserver successfully binds and starts serving. Also, we ran netstat on the node and saw that port 443 was actively being listened on for some duration.

Another line of thought is whether cri-o or the kubelet is killing the container. If that were the case, we would not see the "err" from https://github.com/openshift/origin/blob/release-3.11/pkg/cmd/util/net.go#L96 every time. "WaitForSuccessfulDial" tries to connect and waits for at most 20 seconds (100ms sleep, 100ms timeout, and at most 100 tries). The log has evidence of that:

I0804 20:15:52.211470 1 serve.go:116] Serving securely on 0.0.0.0:443
.
.
.
F0804 20:16:12.614633 1 start_api.go:68] dial tcp4 0.0.0.0:443: i/o timeout

Around 20:15:52 both "Run" and "WaitForSuccessfulDial" are called in their own goroutines. Serving starts immediately, and after 20 seconds "WaitForSuccessfulDial" times out.
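To illustrate the pattern being described (a toy model, not the apiserver code): bind and serve in one goroutine, then self-dial with a short timeout. The bind succeeding and the self-dial succeeding are independent steps, which is why "Serving securely on 0.0.0.0:443" can appear even though the dial later times out. Port 8443 is used here only to avoid needing root.

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Bind; a failure here would be logged, which is not what we see in the apiserver log.
	ln, err := net.Listen("tcp4", "0.0.0.0:8443")
	if err != nil {
		panic(err)
	}
	fmt.Println("serving on", ln.Addr()) // analogue of "Serving securely on 0.0.0.0:443"

	// Serve in its own goroutine, as serve.go does.
	go http.Serve(ln, http.NotFoundHandler())

	// Analogue of WaitForSuccessfulDial: binding above does not guarantee this succeeds
	// if something (for example a firewall rule) is filtering loopback traffic.
	conn, err := net.DialTimeout("tcp4", "0.0.0.0:8443", 100*time.Millisecond)
	if err != nil {
		fmt.Println("self-dial failed:", err)
		return
	}
	conn.Close()
	fmt.Println("self-dial succeeded")
}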
$ cat ~/Downloads/command-1-output.log | head -n 1
E0807 14:42:28.381555 12923 helpers.go:134] Encountered config error json: unknown field "masterCount" in object *config.MasterConfig, raw JSON: src/github.com
$ cat ~/Downloads/command-1-output.log | tail -n 1
I0807 14:43:59.686379 12923 handler.go:153] kube-aggregator: GET "/api/v1/namespaces/default/serviceaccounts/router" satisfied by nonGoRestful

From 14:42:28 to 14:43:59 => the process ran for ~90s.

$ cat ~/Downloads/command-1-output.log | grep 'Serving securely'
I0807 14:43:00.938218 12923 serve.go:116] Serving securely on 0.0.0.0:443

But the apiserver only started listening/serving at 14:43:00.

The apiserver can handle requests from external entities:

I0807 14:43:59.685638 12923 round_trippers.go:438] GET https://paasmaster-144-133.preprod.ohc-us-east-2.redhat.com:443/api/v1/namespaces?resourceVersion=417065977&timeoutSeconds=335&watch=true 200 OK in 45054 milliseconds
I0807 14:43:59.685647 12923 round_trippers.go:438] GET https://paasmaster-144-133.preprod.ohc-us-east-2.redhat.com:443/apis/admissionregistration.k8s.io/v1beta1/validatingwebhookconfigurations?resourceVersion=417066021&timeoutSeconds=565&watch=true 200 OK in 45054 milliseconds

But in the log I see plenty of errors where requests to "https://127.0.0.1:443/" get an i/o timeout:

E0807 14:43:51.654129 12923 reflector.go:136] k8s.io/kube-aggregator/pkg/client/informers/internalversion/factory.go:117: Failed to list *apiregistration.APIService: Get https://127.0.0.1:443/apis/apiregistration.k8s.io/v1/apiservices?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: i/o timeout
E0807 14:43:51.654158 12923 reflector.go:136] github.com/openshift/origin/pkg/image/apiserver/registry/imagestream/etcd/image.go:58: Failed to list *etcd.ImageLayers: Get https://127.0.0.1:443/apis/image.openshift.io/v1/images?limit=500&resourceVersion=0: dial tcp 127.0.0.1:443: i/o timeout
We were able to identify that SYN packets are being dropped when running curl to localhost on the master. This is what is preventing the Master API from staying up, since it needs to send HTTP requests to itself for health checks.

This was our test case:

# ssh master
# mv /etc/origin/node/pods/apiserver.yaml /etc/origin/node/pods-stopped/apiserver.yaml
# python -m SimpleHTTPServer 443

In another terminal on the same master:

# curl -v http://127.0.0.0:443

The curl command times out during the TCP handshake. However, curl to the hostname or to the eth0 IP address of the master works fine. This explains why `oc get` commands work occasionally during Master API pod restarts.

We noticed the packet count of the following iptables rule increasing at the same time as running the curl command to localhost:

Chain KUBE-FIREWALL (2 references)
 pkts bytes target prot opt in out source        destination
    0     0 DROP   all  --  *  *   0.0.0.0/0     0.0.0.0/0    /* kubernetes firewall for dropping marked packets */ mark match 0x8000/0x8000
 7438 463125 DROP  all  --  *  *   !127.0.0.0/8  127.0.0.0/8  /* block incoming localnet connections */ ! ctstate RELATED,ESTABLISHED,DNAT

The rule is added by the SDN starting in 3.11.248, which is the version the customer upgraded to before immediately seeing issues: https://github.com/openshift/origin/commit/2f3469c8f75d028aec2a3f8836ccabf2f5417479
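As a small illustration of the symptom this rule produces (a sketch; the node-address default and the port are placeholders taken from this bug's logs): dial the loopback address and the node's primary address and compare. With the KUBE-FIREWALL drop rule in place, the loopback dial times out while the node-address dial succeeds, matching the curl behaviour above.

package main

import (
	"flag"
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholders: adjust to the master being tested.
	nodeIP := flag.String("node-ip", "10.29.144.133", "primary (eth0) address of the master")
	port := flag.String("port", "443", "port the master API (or the SimpleHTTPServer stand-in) listens on")
	flag.Parse()

	for _, host := range []string{"127.0.0.1", *nodeIP} {
		addr := net.JoinHostPort(host, *port)
		conn, err := net.DialTimeout("tcp4", addr, 2*time.Second)
		if err != nil {
			fmt.Printf("%s: FAILED (%v)\n", addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s: ok\n", addr)
	}
}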
I realize now the kubelet is probably the one adding the IPTables rule, not the SDN.
*** This bug has been marked as a duplicate of bug 1866495 ***
*** Bug 1866495 has been marked as a duplicate of this bug. ***
Verified on cri-o-1.11.16-0.11.dev.rhaos3.11.gitd6a416d.el7.x86_64.rpm (https://errata.devel.redhat.com/advisory/58206)

Reproduced issue:
- 3.11.252 with cri-o 1.11.16-0.10.dev.rhaos3.11.git1eee681.el7
- created pods with HostPort on all nodes:

kind: Pod
apiVersion: v1
metadata:
  name: hostport
spec:
  containers:
    - name: hostport
      image: 'openshift/hello-openshift'
      ports:
        - containerPort: 8080
          hostPort: 9988
          protocol: TCP

- verified iptables-save had the bad rule:
  -A POSTROUTING -m comment --comment "SNAT for localhost access to hostports" -o lo -s 127.0.0.0/8 -j MASQUERADE
- verified I could not curl -k https://127.0.0.0:443 on the master - it timed out

Verified issue:
- rpm -Uvh cri-o-1.11.16-0.11.dev.rhaos3.11.gitd6a416d.el7.x86_64.rpm on each node
- reboot each node
- repeat the scenario above
- bad rule not present
- curl works
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 3.11.248 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3416