Description of problem:

Performing an OCP 4.6 installation in a restricted network on zVM fails.

Version-Release number of selected component (if applicable):

RHCOS 4.6.0-0.nightly-s390x-2020-09-10-112115
OCP 4.6.0-0.nightly-s390x-2020-09-22-223822

How reproducible:

Consistently

Steps to Reproduce:
1. Follow the steps to configure the mirror host on the bastion: https://docs.openshift.com/container-platform/4.5/installing/install_config/installing-restricted-networks-preparations.html
2. Install the cluster using the restricted network steps: https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installing-restricted-networks-bare-metal
3. IPL the bootstrap and cluster nodes.

Actual results:

The bootstrap, master, and worker nodes all start. However, the master nodes never become Ready, which prevents the worker nodes from starting:

[root@OSPAMGR2 ~]# oc get nodes
NAME                                   STATUS     ROLES    AGE     VERSION
master-0.ospamgr2-sep22.zvmocp.notld   NotReady   master   4h1m    v1.19.0+8a39924
master-1.ospamgr2-sep22.zvmocp.notld   NotReady   master   3h56m   v1.19.0+8a39924
master-2.ospamgr2-sep22.zvmocp.notld   NotReady   master   3h48m   v1.19.0+8a39924
The bootkube.service reports this:

Sep 23 23:02:41 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:41.319432       1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get "https://localhost:6443/api/v1/pods?watch=true": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:42 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:42.325119       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:43 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:43.327963       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:44 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:44.332599       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused

Expected results:

Master and worker nodes start successfully.

Additional info:
I believe there are differences between the bare metal and Z installations on a restricted network. These are the instructions for Z on a restricted network: https://docs.openshift.com/container-platform/4.5/installing/installing_ibm_z/installing-restricted-networks-ibm-z.html

Can you confirm you did everything necessary according to the set of instructions for Z? Also, can you do an 'oc adm must-gather' and provide the entire bootkube.log and logs for the masters?
Hi Carvel,

My mistake, let me clarify. We do follow the specific instructions for a Z restricted network install. These are the same installation steps that we've performed previously for releases such as OCP 4.4 and 4.5.

I cannot run 'oc adm must-gather' because the cluster is not up:

[root@OSPAMGR2 ~]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-9fq6f created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-cdvrq created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-f9gkr] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-cdvrq deleted
[must-gather      ] OUT namespace/openshift-must-gather-9fq6f deleted
error: gather did not start for pod must-gather-f9gkr: timed out waiting for the condition

I gathered and will attach the logs for bootkube and the masters. Note, the master logs are very large and will be a problem with the upload limits. I broke them into 10,000-line tail segments. Please let me know if you need more.

Thank you,
-Phil
Created attachment 1716365 [details] bootkube.log
Created attachment 1716366 [details] master-0.kubelet.service.log
Created attachment 1716367 [details] master-1.kubelet.service.log
Created attachment 1716368 [details] master-2.kubelet.service.log
I see this happening over and over in the bootstrap:

Sep 23 22:20:57 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: https://localhost:2379 is healthy: successfully committed proposal: took = 21.833623ms
Sep 23 22:20:57 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: Starting cluster-bootstrap...
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: Starting temporary bootstrap control plane...
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: E0923 22:21:00.084126       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#1] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#2] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#3] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#4] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#5] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

What's interesting is that localhost:2379 is reachable, but localhost:6443 (the apiserver) is not. The temporary control plane is never able to come up.
My first thought is that this is a network configuration issue (firewall on the bootstrap?), but it is not clear from the bootkube.log. Can you take a look at the bootstrap machine and see if there are any other indicators of why 6443 is getting refused?
Adding Needinfo for Phil per Carvel's Comment 7 as the original needinfo may not have triggered a notification
Hi,

We've taken a closer look at the bootstrap node, and unfortunately we could not determine why the connections are refused. However, through netstat we do see that port 6443 is up and connections have been established:

[root@OSPAMGR2 ~]# netstat -an | grep 6443
tcp        0      0 10.20.116.2:6443        0.0.0.0:*               LISTEN
tcp        0      0 9.12.23.25:6443         0.0.0.0:*               LISTEN
tcp        0      0 10.20.116.2:6443        10.20.116.11:40606      ESTABLISHED
tcp        0      0 10.20.116.2:52100       10.20.116.10:6443       ESTABLISHED
tcp        0      0 10.20.116.2:6443        10.20.116.12:39364      ESTABLISHED
tcp        0      0 10.20.116.2:6443        10.20.116.13:57002      ESTABLISHED
tcp        0      0 10.20.116.2:52096       10.20.116.10:6443       ESTABLISHED
tcp        0      0 10.20.116.2:52098       10.20.116.10:6443       ESTABLISHED

Also, we successfully installed OCP 4.5.11 and 4.5.12 disconnected installs on this same cluster (same z/VM hosted bastion, master, and worker nodes) within the last week and a half. Using the same disconnected install procedure, there seems to be an issue specific to OCP 4.6. Note that we have also seen the same error with an OCP 4.6 disconnected install on z/KVM.

If there are any other logs or traces that we can generate and provide, please let me know.

Thank you,
-Phil
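For what it's worth, the netstat above was captured on the bastion, while the refusals in bootkube.log are against [::1]:6443 on the bootstrap itself. A minimal diagnostic sketch to run on the bootstrap node (assuming the standard RHCOS tooling of ss, curl, and crictl is available; this is a suggestion, not a step from the install docs):

```shell
# Is anything listening on 6443 on this host?
ss -ltn | grep ':6443' || echo "nothing listening on 6443"

# Does the temporary apiserver answer locally?
curl -ks --max-time 5 https://localhost:6443/healthz || echo "no local apiserver response"

# Did the bootstrap control-plane containers start at all?
crictl ps -a 2>/dev/null | grep -i apiserver || echo "no apiserver container found"
```

If nothing is listening at all, the interesting question becomes why the static pod for the bootstrap apiserver never started, which is what the kubelet and crio journals would show.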
I recently saw the same error in a cluster that did not have enough CPUs/memory. Could this be related to the increased resource usage of some operators in https://bugzilla.redhat.com/show_bug.cgi?id=1878770 ?
Could we get a list of memory and cpu configured for each machine?
Hi,

For the previous disconnected installs on zVM and zKVM, we had 8 CPUs and 32GB of memory for zVM, and 4 CPUs and 32GB of memory for zKVM. This was defined for the cluster nodes: bootstrap, masters, and workers. Today, we increased all the cluster guests for both platforms to the following:

zVM (18 CPUs and 64GB memory) for bootstrap, masters, and workers:

$ cat /proc/sysinfo
...
VM00 CPUs Total:      18
VM00 CPUs Configured: 18
VM00 CPUs Standby:    0
VM00 CPUs Reserved:   0

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi       571Mi        62Gi       1.0Mi       371Mi        61Gi
Swap:            0B          0B          0B

zKVM (8 CPUs and 64GB memory) for bootstrap, masters, and workers:

$ cat /proc/sysinfo
...
VM00 CPUs Total:      8
VM00 CPUs Configured: 8
VM00 CPUs Standby:    0
VM00 CPUs Reserved:   0
VM00 Extended Name:   bootstrap-0
VM00 UUID:            d42fea3a-6919-46a5-8976-104ecd023fb0

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           62Gi       1.6Gi        59Gi       4.0Mi       2.2Gi        60Gi
Swap:            0B          0B          0B

We re-tested the disconnected install on both platforms and there is no change; the worker nodes are still unable to connect.

-Phil
Hi Carvel,

I'm attaching additional logs from the bootstrap node as requested. We attempted a disconnected install earlier today using the CI nightly build 4.6.0-0.nightly-s390x-2020-09-30-122156. Please let me know if there are any additional logs you may need.

Thank you,
-Phil
Created attachment 1717972 [details] bootkube.service.journalctl.log
Created attachment 1717973 [details] cluster-policy-controller.log
Created attachment 1717974 [details] kube-apiserver.log
Created attachment 1717976 [details] kube-controller-manager.log
Created attachment 1717977 [details] kube-scheduler.log
In the master kubelet logs:

x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: Error reading manifest sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"

Looks like this is a known issue being tracked. There are several places requiring fixes, and this is one of them: https://github.com/openshift/installer/pull/4210
In this particular case it is coming from the network-operator and is seen only on master-0:

Sep 23 22:48:34 master-0.ospamgr2-sep22.zvmocp.notld hyperkube[1521]: E0923 22:48:34.954286    1521 kuberuntime_manager.go:730] createPodSandbox for pod "network-operator-77547d9b84-jj2f5_openshift-network-operator(9902f8f3-5fb0-43a3-8cd8-554b7f933e20)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_network-operator-77547d9b84-jj2f5_openshift-network-operator_9902f8f3-5fb0-43a3-8cd8-554b7f933e20_0": Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: (Mirrors also failed: [bastion:5000/ocp4/openshift4@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: error pinging docker registry bastion:5000: Get "https://bastion:5000/v2/": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: Error reading manifest sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
Hi, As requested, I will upload the 3 master logs that coincide with the bootstrap logs that were uploaded the day before from 9/30/2020. Please let me know if you need any additional logs and info. Thank you, -Phil
Created attachment 1718200 [details] master-0.kubelet.service.log.tgz from zKVM 9/30/2020
Created attachment 1718201 [details] master-1.kubelet.service.log.tgz from zKVM 9/30/2020
Created attachment 1718202 [details] master-2.kubelet.service.log.tgz from zKVM 9/30/2020
Yes, I see those logs in all the master nodes.

Phil, which version of the installer and which release image are you using? Can you use the latest, and also check that the installer you are using is in sync with the release image? Also, please remember to use the latest RHCOS image; the image you are using seems a bit old.

Thanks
Prashanth,

Thank you for all your assistance with this issue.

1. My colleague Phil Chan and I have also been regularly testing OCP 4.6 disconnected installs on a second zVM hosted cluster, unfortunately with the same unsuccessful results. This includes recent OCP 4.6 and RHCOS builds.

2. For comparison and debug purposes, we have also been testing OCP 4.5 disconnected installs with the latest OCP 4.5 builds on this same cluster (all successfully). This includes OCP 4.5.11, 4.5.12, 4.5.13, and 4.5.14.

3. Here is a summary of our initial OCP 4.5 and 4.6 connected and disconnected install related tests for today, all on this same zVM hosted cluster:

   OCP level         RHCOS level             Installation Type   Status         Comments
   ===============   ======================  =================   ============   ====================================================================
1. OCP 4.5.14        45.82.202009261457-0    connected           successful
2. OCP 4.5.14        45.82.202009261457-0    disconnected        successful
3. OCP 4.6.0-fc.9    46.82.202010010439-0    connected           successful
4. OCP 4.6.0-fc.9    46.82.202010010439-0    disconnected        unsuccessful   Master nodes do not successfully install; they encounter the x509 certificate issue mentioned in comments 19 and 20

4. Here is this zVM hosted cluster's bootstrap, master, and worker nodes' vCPU and real memory configuration. This zVM cluster is hosted on an IBM z15 server:

   Node          vCPU   Real Memory (GB)
   ===========   ====   ================
1. bootstrap-0   8      64
2. master-0      8      32
3. master-1      8      32
4. master-2      8      32
5. worker-0      8      32
6. worker-1      8      32

Thank you,
Kyle
A suggestion of something to look at, given the x509 errors, is something in the self-generated certs is causing the latest go TLS verification to fail. I think what is mentioned here might be worth focusing in on w/r/t the cert itself: "https://bastion:5000/v2/": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0])
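To make that concrete, a self-signed registry certificate carrying SANs can be produced in one step with OpenSSL 1.1.1 or later (the -addext option does not exist in older releases). This is an illustrative sketch, not a command from the install docs; "bastion" and "bastion.example.com" are placeholder hostnames for this environment:

```shell
# Generate a self-signed cert whose identity lives in the subjectAltName
# extension rather than only in the legacy CN field (requires OpenSSL 1.1.1+).
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout domain.key -out domain.crt \
  -subj "/CN=bastion" \
  -addext "subjectAltName=DNS:bastion,DNS:bastion.example.com"

# Confirm the SAN extension is actually present in the result.
openssl x509 -in domain.crt -noout -ext subjectAltName
```

A cert generated this way should satisfy the Go 1.15 verifier without needing GODEBUG=x509ignoreCN=0.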
Carvel and Prashanth,

Thank you for your assistance and the information on the x509 certificate SAN configuration for OCP 4.6 disconnected installs. Using certificates configured with SANs, Phil and I have successfully tested OCP 4.6 disconnected installs on both zKVM and zVM hosted clusters.

The following OCP 4.6 builds and associated RHCOS builds have been successfully tested for disconnected installs on zVM hosted clusters:

   OCP level      RHCOS level             Installation Type   Status       Comments
   ============   ======================  =================   ==========   ====================================================================
1. 4.6.0-fc.6     46.82.202009130739-0    disconnected        successful   RHCOS 46.82.202010010439-0 used to install bootstrap node
2. 4.6.0-fc.7     46.82.202009170439-0    disconnected        successful   RHCOS 46.82.202010010439-0 used to install bootstrap node
3. 4.6.0-fc.8     46.82.202009241338-0    disconnected        successful   RHCOS 46.82.202010010439-0 used to install bootstrap node
4. 4.6.0-fc.9     46.82.202010010439-0    disconnected        successful   RHCOS 46.82.202010010439-0 used to install bootstrap node

Thank you,
Kyle
Kyle,

Nice! Thanks for the confirmation. Now I am wondering whether this bug should be a generic doc bug with the OpenShift component for 4.6 disconnected installs.

Prashanth
Moving this to the docs team. This is not an IBM Z specific issue; rather, it needs to be addressed as a whole for disconnected installs across arches. In 4.6, because of the move to Go 1.15, and Go 1.15 requiring SANs in certificates, the certificates for the local image registry need to have SANs rather than CNs. This procedure needs to be documented as part of the disconnected install documentation.
Can you please test 4.6.0-fc.9 with certificates that lack a SAN? Our intent is to document this deprecation in 4.6 but defer enforcement until a later release given how late it was that we discovered this change. If we find that we break on certs that lack a SAN in 4.6.0-fc.9 we need to dig deeper and fix those components which still fail. If we find that things work for you then we can use this bug to track the documentation necessary to announce deprecation.
You will see in comment #27 that FC9 was tested with certs without SANs and it does break. This breakage was caused by this when we switched to Go 1.15: https://tip.golang.org/doc/go1.15#commonname

Not sure there is something to "fix" per se, except the documentation, unless I misunderstood your intent.

(In reply to Scott Dodson from comment #33)
> Can you please test 4.6.0-fc.9 with certificates that lack a SAN? Our intent
> is to document this deprecation in 4.6 but defer enforcement until a later
> release given how late it was that we discovered this change.
>
> If we find that we break on certs that lack a SAN in 4.6.0-fc.9 we need to
> dig deeper and fix those components which still fail.
>
> If we find that things work for you then we can use this bug to track the
> documentation necessary to announce deprecation.
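A quick way to tell in advance whether a mirror-registry certificate will trip this Go 1.15 behavior is to check it for a subjectAltName extension. A small sketch (the cert path is a placeholder for wherever the registry cert lives in a given environment; the -ext option requires OpenSSL 1.1.1+):

```shell
# Placeholder path to the mirror registry certificate; adjust as needed.
CERT=/opt/registry/certs/domain.crt

if openssl x509 -in "$CERT" -noout -ext subjectAltName 2>/dev/null | grep -q 'DNS:'; then
  echo "cert has SANs: accepted by Go 1.15 clients"
else
  echo "cert is CN-only: Go 1.15 clients reject it unless GODEBUG=x509ignoreCN=0"
fi
```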
Thanks, we need to ensure that host level services have the environment variable set. I'm working on coordinating that, moving to installer component for now.
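As an illustration of the mechanism (not the actual installer/MCO change, whose details are not shown in this bug), one way host-level services can all inherit an environment variable is a systemd manager-level DefaultEnvironment drop-in:

```shell
# Hypothetical sketch: a manager-level drop-in so every systemd service
# inherits GODEBUG=x509ignoreCN=0 (re-enabling legacy CN matching).
mkdir -p /etc/systemd/system.conf.d
cat > /etc/systemd/system.conf.d/10-default-env-godebug.conf << 'EOF'
[Manager]
DefaultEnvironment=GODEBUG=x509ignoreCN=0
EOF
systemctl daemon-reexec       # re-execute the manager so the default applies
systemctl show-environment    # GODEBUG=x509ignoreCN=0 should now be listed
```

The transcript in the next comment shows exactly this kind of verification on a 4.7 nightly: show-environment reports the variable, and a throwaway unit echoes it back.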
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-26-124513   True        False         5h4m    Cluster version is 4.7.0-0.nightly-2020-10-26-124513

$ oc debug node/ip-10-0-136-111.us-west-2.compute.internal
Starting pod/ip-10-0-136-111us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl show-environment
GODEBUG=x509ignoreCN=0
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
sh-4.4# cat << EOF > /etc/systemd/system/test.service
> [Service]
> ExecStart=echo $GODEBUG
> EOF
sh-4.4# systemctl start test.service
sh-4.4# journalctl -lu test.service
-- Logs begin at Mon 2020-10-26 19:34:50 UTC, end at Tue 2020-10-27 01:08:08 UTC. --
Oct 27 01:06:07 ip-10-0-136-111 systemd[1]: /etc/systemd/system/test.service:1: Assignment outside of section. Ignoring.
Oct 27 01:06:07 ip-10-0-136-111 systemd[1]: test.service: Service lacks both ExecStart= and ExecStop= setting. Refusing.
Oct 27 01:06:55 ip-10-0-136-111 systemd[1]: Started test.service.
Oct 27 01:06:55 ip-10-0-136-111 echo[580628]: x509ignoreCN=0
Oct 27 01:06:55 ip-10-0-136-111 systemd[1]: test.service: Consumed 1ms CPU time
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days