Bug 1882191
| Summary: | Installation fails against external resources which lack DNS Subject Alternative Name | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Philip Chan <chanphil> | |
| Component: | Machine Config Operator | Assignee: | Scott Dodson <sdodson> | |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.6 | CC: | alklein, aos-bugs, bschmaus, cbaus, christian.lapolt, clnperez, danili, Holger.Wolf, jokerman, krmoser, nbziouec, psundara, sdodson, vigoyal, wking | |
| Target Milestone: | --- | |||
| Target Release: | 4.7.0 | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | No Doc Update | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1885737 (view as bug list) | Environment: | ||
| Last Closed: | 2021-02-24 15:19:20 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1881153, 1885737 | |||
| Attachments: | ||||
Description
Philip Chan
2020-09-24 02:31:44 UTC
I believe there are differences between bare metal and Z installation on a restricted network. These are the instructions for Z on a restricted network: https://docs.openshift.com/container-platform/4.5/installing/installing_ibm_z/installing-restricted-networks-ibm-z.html

Can you confirm you did everything necessary according to the set of instructions for Z? Also, can you run an 'oc adm must-gather' and provide the entire bootkube.log and logs for the masters?

Hi Carvel,

My mistake, let me clarify. We do follow the specific instructions for a Z restricted network install. These are the same installation steps that we've performed previously for releases such as OCP 4.4 and 4.5.

I cannot run 'oc adm must-gather' because the cluster is not up:

[root@OSPAMGR2 ~]# oc adm must-gather
[must-gather      ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather      ] OUT
[must-gather      ] OUT Using must-gather plug-in image: quay.io/openshift/origin-must-gather:latest
[must-gather      ] OUT namespace/openshift-must-gather-9fq6f created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-cdvrq created
[must-gather      ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-f9gkr] OUT gather did not start: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-cdvrq deleted
[must-gather      ] OUT namespace/openshift-must-gather-9fq6f deleted
error: gather did not start for pod must-gather-f9gkr: timed out waiting for the condition

I gathered and will attach the logs for bootkube and masters. Note, the master logs are very large and will be a problem with the upload limits. I broke them into 10,000-line tail segments. Please let me know if you need more.

Thank you,
-Phil

Created attachment 1716365 [details]
bootkube.log
Created attachment 1716366 [details]
master-0.kubelet.service.log
Created attachment 1716367 [details]
master-1.kubelet.service.log
Created attachment 1716368 [details]
master-2.kubelet.service.log
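Since 'oc adm must-gather' needs a working API server, the equivalent data for a cluster that never comes up can be pulled with the installer's gather subcommand. A minimal sketch, assuming SSH access from the bastion; the IPs and install dir below are placeholders, not values from this report:

$ # Collects the bootkube.service and kubelet journals from the bootstrap and master nodes
$ openshift-install gather bootstrap --dir=<install-dir> \
      --bootstrap <bootstrap-ip> \
      --master <master-0-ip> --master <master-1-ip> --master <master-2-ip>
$ # Produces a log-bundle-<timestamp>.tar.gz in the install dir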
I see this happening over and over in the bootstrap:

Sep 23 22:20:57 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: https://localhost:2379 is healthy: successfully committed proposal: took = 21.833623ms
Sep 23 22:20:57 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: Starting cluster-bootstrap...
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: Starting temporary bootstrap control plane...
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: E0923 22:21:00.084126 1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#1] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#2] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#3] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#4] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused
Sep 23 22:21:00 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[2333]: [#5] failed to fetch discovery: Get "https://localhost:6443/api?timeout=32s": dial tcp [::1]:6443: connect: connection refused

What's interesting is that localhost:2379 is reachable, but localhost:6443 (the apiserver) is not. The temporary control plane is never able to come up. My first thought is that this is network configuration (firewall on the bootstrap??), but it is not clear from the bootkube.log. Can you take a look at the bootstrap machine and see if there are any other indicators why 6443 is getting refused?

Adding Needinfo for Phil per Carvel's Comment 7 as the original needinfo may not have triggered a notification.

Hi,

We've taken a closer look at the bootstrap node, and unfortunately we could not determine why the connections are refused. However, we do see through netstat that port 6443 is up and connections have been established:

[root@OSPAMGR2 ~]# netstat -an | grep 6443
tcp 0 0 10.20.116.2:6443 0.0.0.0:* LISTEN
tcp 0 0 9.12.23.25:6443 0.0.0.0:* LISTEN
tcp 0 0 10.20.116.2:6443 10.20.116.11:40606 ESTABLISHED
tcp 0 0 10.20.116.2:52100 10.20.116.10:6443 ESTABLISHED
tcp 0 0 10.20.116.2:6443 10.20.116.12:39364 ESTABLISHED
tcp 0 0 10.20.116.2:6443 10.20.116.13:57002 ESTABLISHED
tcp 0 0 10.20.116.2:52096 10.20.116.10:6443 ESTABLISHED
tcp 0 0 10.20.116.2:52098 10.20.116.10:6443 ESTABLISHED

Also, we successfully installed OCP 4.5.11 and 4.5.12 disconnected installs on this same cluster (same z/VM hosted bastion, master, and worker nodes) within the last week and a half. Using the same disconnected install procedure, there seems to be an issue specific to OCP 4.6. Note that we have also seen the same error with an OCP 4.6 disconnected install on z/KVM. If there are any other logs or traces that we can generate and provide, please let me know.

Thank you,
-Phil

I did recently see the same error in a cluster that did not have enough CPUs/memory. Could this be related to the increase in resources used by some operators in https://bugzilla.redhat.com/show_bug.cgi?id=1878770 ? Could we get a list of the memory and CPU configured for each machine?

Hi,

Under the previous disconnected installs on zVM and zKVM, we had 8 CPUs and 32GB of memory for zVM and 4 CPUs and 32GB of memory for zKVM. This was defined for the cluster nodes: bootstrap, masters, and workers. Today, we increased all the cluster guests on both platforms to the following:
zVM (18 CPU and 64GB Memory) for bootstrap, masters, and workers:
$ cat /proc/sysinfo
...
VM00 CPUs Total: 18
VM00 CPUs Configured: 18
VM00 CPUs Standby: 0
VM00 CPUs Reserved: 0
$ free -h
total used free shared buff/cache available
Mem: 62Gi 571Mi 62Gi 1.0Mi 371Mi 61Gi
Swap: 0B 0B 0B
zKVM (8 CPU and 64GB Memory) for bootstrap, masters, and workers:
$ cat /proc/sysinfo
...
VM00 CPUs Total: 8
VM00 CPUs Configured: 8
VM00 CPUs Standby: 0
VM00 CPUs Reserved: 0
VM00 Extended Name: bootstrap-0
VM00 UUID: d42fea3a-6919-46a5-8976-104ecd023fb0
$ free -h
total used free shared buff/cache available
Mem: 62Gi 1.6Gi 59Gi 4.0Mi 2.2Gi 60Gi
Swap: 0B 0B 0B
We re-tested the disconnected install on both platforms and there is no change. The worker nodes are still unable to connect.
-Phil
Hi Carvel,

I'm attaching additional logs from the bootstrap node as requested. We attempted a disconnected install earlier today using the CI nightly build 4.6.0-0.nightly-s390x-2020-09-30-122156. Please let me know if there are any additional logs you may need.

Thank you,
-Phil

Created attachment 1717972 [details]
bootkube.service.journalctl.log
Created attachment 1717973 [details]
cluster-policy-controller.log
Created attachment 1717974 [details]
kube-apiserver.log
Created attachment 1717976 [details]
kube-controller-manager.log
Created attachment 1717977 [details]
kube-scheduler.log
In the master kubelet logs:

x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: Error reading manifest sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized

Looks like this is a known issue being tracked. There are several places requiring fixes, and this is one of them: https://github.com/openshift/installer/pull/4210

In this particular case it is coming from the network-operator and is seen only on master-0:

Sep 23 22:48:34 master-0.ospamgr2-sep22.zvmocp.notld hyperkube[1521]: E0923 22:48:34.954286 1521 kuberuntime_manager.go:730] createPodSandbox for pod "network-operator-77547d9b84-jj2f5_openshift-network-operator(9902f8f3-5fb0-43a3-8cd8-554b7f933e20)" failed: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_network-operator-77547d9b84-jj2f5_openshift-network-operator_9902f8f3-5fb0-43a3-8cd8-554b7f933e20_0": Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: (Mirrors also failed: [bastion:5000/ocp4/openshift4@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: error pinging docker registry bastion:5000: Get "https://bastion:5000/v2/": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0]): quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b: Error reading manifest sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized

Hi,

As requested, I will upload the 3 master logs that coincide with the bootstrap logs that were uploaded the day before, from 9/30/2020. Please let me know if you need any additional logs and info.

Thank you,
-Phil

Created attachment 1718200 [details]
master-0.kubelet.service.log.tgz from zKVM 9/30/2020
Created attachment 1718201 [details]
master-1.kubelet.service.log.tgz from zKVM 9/30/2020
Created attachment 1718202 [details]
master-2.kubelet.service.log.tgz from zKVM 9/30/2020
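The pull failure quoted in the kubelet log should be reproducible directly from a node with a Go-based image tool such as skopeo. A sketch only: the pull-secret path is an assumption about the usual RHCOS location, the digest is the one from the log, and the CN rejection only surfaces if the tool itself is built against Go 1.15 (curl would not reproduce it):

sh-4.4# # Ping the mirror through skopeo; with a CN-only registry cert this should fail with
sh-4.4# # "x509: certificate relies on legacy Common Name field ..."
sh-4.4# skopeo inspect --authfile /var/lib/kubelet/config.json \
            docker://bastion:5000/ocp4/openshift4@sha256:3ecd439bba69ca2a6ed21df05911963baef721e9265f64095a31cd5f6c41d32b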
Yes, I see those logs in all the master nodes.

Phil,

Which version of the installer and which release image version are you using? Can you use the latest, and also check that the installer you are using is in sync with the release image? Also, please remember to use the latest RHCOS image; the image you are using seems a bit old.

Thanks

Prashanth,
Thank you for all your assistance with this issue.
1. My colleague Phil Chan and I have also been regularly testing OCP 4.6 disconnected installs on a second zVM hosted cluster, unfortunately with the same unsuccessful results. This includes recent OCP 4.6 and RHCOS builds.
2. For comparison and debug purposes, we have also been testing OCP 4.5 disconnected installs with the latest OCP 4.5 builds on this same cluster (all successfully). This includes OCP 4.5.11, 4.5.12, 4.5.13, and 4.5.14.
3. Here is a summary of our initial OCP 4.5 and 4.6 connected and disconnected install-related tests for today, all on this same zVM hosted cluster.
   OCP level        RHCOS level             Installation Type  Status        Comments
   ===============  ======================  =================  ============  ====================================================================
1. OCP 4.5.14       45.82.202009261457-0    connected          successful
2. OCP 4.5.14       45.82.202009261457-0    disconnected       successful
3. OCP 4.6.0-fc.9   46.82.202010010439-0    connected          successful
4. OCP 4.6.0-fc.9   46.82.202010010439-0    disconnected       unsuccessful  Master nodes do not successfully install; they encounter the x509 certificate issue mentioned in comments 19 and 20
4. Here is this zVM hosted cluster's bootstrap, master, and worker nodes' vCPU and Real Memory configuration. This zVM cluster is hosted on an IBM z15 server.
   Node         vCPU  Real Memory (GB)
   ===========  ====  ================
1. bootstrap-0  8     64
2. master-0     8     32
3. master-1     8     32
4. master-2     8     32
5. worker-0     8     32
6. worker-1     8     32
Thank you,
Kyle
A suggestion of something to look at, given the x509 errors: something in the self-generated certs may be causing the latest Go TLS verification to fail. I think the message below is worth focusing on w/r/t the cert itself:

"https://bastion:5000/v2/": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0])
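A quick way to check that suggestion is to dump the certificate the mirror registry actually serves and look for a SAN extension. A minimal sketch, run from the bastion, with the host and port taken from the error above:

$ echo | openssl s_client -connect bastion:5000 -servername bastion 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
$ # No output means the cert is CN-only, which Go 1.15 clients reject by default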
Carvel and Prashanth,
Thank you for your assistance and the information on the x509 certificate SAN configuration for OCP 4.6 disconnected installs. Using certificates configured with SANs, Phil and I have successfully tested OCP 4.6 disconnected installs on both zKVM and zVM hosted clusters.
The following OCP 4.6 builds and associated RHCOS builds have been successfully tested for disconnected installs on zVM hosted clusters.
   OCP level     RHCOS level             Installation Type  Status      Comments
   ============  ======================  =================  ==========  ==========================================================
1. 4.6.0-fc.6    46.82.202009130739-0    disconnected       successful  RHCOS 46.82.202010010439-0 used to install bootstrap node
2. 4.6.0-fc.7    46.82.202009170439-0    disconnected       successful  RHCOS 46.82.202010010439-0 used to install bootstrap node
3. 4.6.0-fc.8    46.82.202009241338-0    disconnected       successful  RHCOS 46.82.202010010439-0 used to install bootstrap node
4. 4.6.0-fc.9    46.82.202010010439-0    disconnected       successful  RHCOS 46.82.202010010439-0 used to install bootstrap node
Thank you,
Kyle
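For reference, a minimal sketch of cutting a mirror-registry certificate that carries a SAN. The file names and hostnames are placeholders, not the exact commands Kyle and Phil used, and the -addext flag needs OpenSSL 1.1.1 or later:

$ # Self-signed cert whose SAN covers the mirror registry hostname(s)
$ openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
      -keyout registry.key -out registry.crt \
      -subj "/CN=bastion" \
      -addext "subjectAltName=DNS:bastion,DNS:bastion.example.com"
$ # The new registry.crt also has to land in the hosts' trust store and in the
$ # additionalTrustBundle field of install-config.yaml so the nodes trust the mirror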
Kyle,

Nice! Thanks for the confirmation. Now I am wondering whether this bug should be a generic doc bug with the OpenShift component for 4.6 disconnected installs.

Prashanth

Moving this to the docs team. This is not an IBM Z-specific issue; rather, it needs to be addressed as a whole for disconnected installs across arches. In 4.6, because of the move to Go 1.15 and Go 1.15 requiring SANs in certificates, the certificates for the local image registry need to have SANs rather than just CNs. This procedure needs to be documented as part of the disconnected install documentation.

Can you please test 4.6.0-fc.9 with certificates that lack a SAN? Our intent is to document this deprecation in 4.6 but defer enforcement until a later release, given how late it was that we discovered this change.

If we find that we break on certs that lack a SAN in 4.6.0-fc.9, we need to dig deeper and fix those components which still fail.

If we find that things work for you, then we can use this bug to track the documentation necessary to announce deprecation.

You will see in comment #27 that FC9 was tested with certs without SANs and it does break. This breakage was caused by this when we switched to Go 1.15: https://tip.golang.org/doc/go1.15#commonname

Not sure there is something to "fix" per se, except the documentation, unless I misunderstood your intent.

(In reply to Scott Dodson from comment #33)
> Can you please test 4.6.0-fc.9 with certificates that lack a SAN? Our intent
> is to document this deprecation in 4.6 but defer enforcement until a later
> release given how late it was that we discovered this change.
>
> If we find that we break on certs that lack a SAN in 4.6.0-fc.9 we need to
> dig deeper and fix those components which still fail.
>
> If we find that things work for you then we can use this bug to track the
> documentation necessary to announce deprecation.

Thanks, we need to ensure that host-level services have the environment variable set. I'm working on coordinating that; moving to the installer component for now.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-10-26-124513   True        False         5h4m    Cluster version is 4.7.0-0.nightly-2020-10-26-124513
$ oc debug node/ip-10-0-136-111.us-west-2.compute.internal
Starting pod/ip-10-0-136-111us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# systemctl show-environment
GODEBUG=x509ignoreCN=0
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
sh-4.4# cat << EOF > /etc/systemd/system/test.service
> [Service]
> ExecStart=echo $GODEBUG
> EOF
sh-4.4# systemctl start test.service
sh-4.4# journalctl -lu test.service
-- Logs begin at Mon 2020-10-26 19:34:50 UTC, end at Tue 2020-10-27 01:08:08 UTC. --
Oct 27 01:06:07 ip-10-0-136-111 systemd[1]: /etc/systemd/system/test.service:1: Assignment outside of section. Ignoring.
Oct 27 01:06:07 ip-10-0-136-111 systemd[1]: test.service: Service lacks both ExecStart= and ExecStop= setting. Refusing.
Oct 27 01:06:55 ip-10-0-136-111 systemd[1]: Started test.service.
Oct 27 01:06:55 ip-10-0-136-111 echo[580628]: x509ignoreCN=0
Oct 27 01:06:55 ip-10-0-136-111 systemd[1]: test.service: Consumed 1ms CPU time
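For anyone reproducing this verification on their own nodes: the GODEBUG value shown by `systemctl show-environment` above is inherited by host-level services through the systemd manager environment. A sketch of a drop-in that would have that effect; the file name is illustrative, not necessarily what the shipped fix uses:

sh-4.4# cat /etc/systemd/system.conf.d/10-default-env-godebug.conf
[Manager]
DefaultEnvironment=GODEBUG=x509ignoreCN=0
sh-4.4# systemctl daemon-reexec      # re-execute the manager so the new default environment takes effect
sh-4.4# systemctl show-environment | grep GODEBUG
GODEBUG=x509ignoreCN=0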
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days