Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1710012

Summary: Provide users with information on making sure the registry cluster object exists before applying the patch.
Product: OpenShift Container Platform
Reporter: Eric Rich <erich>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: adahiya, aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-22 20:24:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Install Logs none

Description Eric Rich 2019-05-14 17:53:05 UTC
Description of problem: If you follow the bare-metal installation docs at https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-installing-bare-metal_installing-bare-metal you can't apply the registry storage configuration. The patch command fails with the following error:

> $ oc --config test_cluster/auth/kubeconfig patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
> Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

Version-Release number of selected component (if applicable): beta4

How reproducible: Rare (only seen this once)

Steps to Reproduce:
1. Follow the Docs 

Actual results: See above

Expected results: I have seen this patch command work in the past. 

Additional info:

$ oc --config test_cluster/auth/kubeconfig get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.0   False       True          45m     Unable to apply 4.1.0-rc.0: an unknown error has occurred

$ oc --config test_cluster/auth/kubeconfig get clusteroperators
NAME                                 VERSION      AVAILABLE   PROGRESSING   FAILING   SINCE
cloud-credential                     4.1.0-rc.0   True        False         False     43m
cluster-autoscaler                   4.1.0-rc.0   True        False         False     43m
dns                                  4.1.0-rc.0   False       False         False     40m
kube-apiserver                       4.1.0-rc.0   True        True                    42m
kube-controller-manager              4.1.0-rc.0   True        False                   40m
kube-scheduler                       4.1.0-rc.0   True        False                   40m
machine-api                          4.1.0-rc.0   True        False         False     43m
machine-config                       4.1.0-rc.0   False       False         True      30m
network                              4.1.0-rc.0   True        True                    44m
openshift-apiserver                  4.1.0-rc.0   False       False                   40m
openshift-controller-manager         4.1.0-rc.0   False       False                   33m
operator-lifecycle-manager           4.1.0-rc.0   True        False         False     42m
operator-lifecycle-manager-catalog   4.1.0-rc.0   True        False         False     42m
service-ca                           4.1.0-rc.0   True        True          False     43m

$ oc --config test_cluster/auth/kubeconfig -n openshift-cluster-version logs cluster-version-operator-864544f74f-9hfw4
Error from server: Get https://master-0:10250/containerLogs/openshift-cluster-version/cluster-version-operator-864544f74f-9hfw4/cluster-version-operator: dial tcp 192.168.100.10:10250: connect: connection refused

However, if you are on master-0 (where the pod is deployed), you can connect to this port by hostname, but not by IP address:

$ oc --config test_cluster/auth/kubeconfig -n openshift-cluster-version get pod cluster-version-operator-864544f74f-9hfw4 -o jsonpath='{.spec.nodeName}{"\n"}'

[core@master-0 ~]$ echo >/dev/tcp/master-0.thoran.dwarf.mine/10250
[core@master-0 ~]$ echo $?
0
[core@master-0 ~]$ echo >/dev/tcp/master-0/10250
[core@master-0 ~]$ echo $?
0
[core@master-0 ~]$ echo >/dev/tcp/192.168.100.10/10250
-bash: connect: Connection refused
-bash: /dev/tcp/192.168.100.10/10250: Connection refused

However, the connection-refused error eventually goes away, and the same commands then produce a TLS error instead:

$ oc --config test_cluster/auth/kubeconfig -n openshift-cluster-version logs cluster-version-operator-864544f74f-9hfw4
Error from server: Get https://master-0:10250/containerLogs/openshift-cluster-version/cluster-version-operator-864544f74f-9hfw4/cluster-version-operator: remote error: tls: internal error

$ curl https://master-0.thoran.dwarf.mine:10250/containerLogs/openshift-cluster-version/cluster-version-operator-864544f74f-9hfw4/cluster-version-operator?follow=true:
curl: (35) error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error

Comment 1 Abhinav Dahiya 2019-05-14 18:02:56 UTC
(In reply to Eric Rich from comment #0)
> Description of problem: If you follow:
> https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-installing-bare-metal_installing-bare-metal
> you can't apply registry storage. You are hit with the following error:
> 
> > $ oc --config test_cluster/auth/kubeconfig patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
> > Error from server (NotFound): configs.imageregistry.operator.openshift.io "cluster" not found

You cannot patch the object if the API or the object doesn't exist. If users need that hand-holding, the documentation can be updated so that users confirm the object exists before they patch it. Also, to be fair, the mentioned docs present this command only as an example, not-for-production way to configure the registry.
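
The "confirm the object exists, then patch" idea can be sketched as a small polling wrapper around the same `oc` commands shown in the description. This is only an illustration, not text from the docs: the function name `wait_and_patch_registry`, the retry count, and the sleep interval are hypothetical choices, and the sketch assumes `oc get` exits non-zero while the object is still missing.

```shell
#!/usr/bin/env bash
# Sketch only: poll until the image registry config object exists, then patch it.
# The function name, default retry count (30), and default delay (10s) are
# hypothetical; they are not taken from the OpenShift documentation.
wait_and_patch_registry() {
  local kubeconfig="$1" attempts="${2:-30}" delay="${3:-10}"
  local resource="configs.imageregistry.operator.openshift.io"
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # 'oc get' fails with NotFound until the operator has created the object.
    if oc --config "$kubeconfig" get "$resource" cluster >/dev/null 2>&1; then
      # Same merge patch as in the description: emptyDir storage (example only).
      oc --config "$kubeconfig" patch "$resource" cluster \
        --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
      return $?
    fi
    echo "attempt $i/$attempts: $resource \"cluster\" not found yet" >&2
    sleep "$delay"
  done
  echo "gave up waiting for $resource \"cluster\"" >&2
  return 1
}
```

It could be called as `wait_and_patch_registry test_cluster/auth/kubeconfig`. If the install itself is stuck, as it appears to be in this report, the loop simply times out and returns 1 rather than fixing anything.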

Comment 2 Eric Rich 2019-05-14 19:56:27 UTC
Created attachment 1568635
Install Logs

I think the question we need to focus on is why the install never completes, or why that resource is not yet available at the point in the docs where the user is told to run the command.

Comment 4 W. Trevor King 2019-05-14 20:12:55 UTC
I've filed [1] to get more specifics in the CVO error instead of the above 'Unable to apply 4.1.0-rc.0: an unknown error has occurred'.

[1]: https://github.com/openshift/cluster-version-operator/pull/185

Comment 5 Eric Rich 2019-05-14 20:51:37 UTC
This may be connected to https://bugzilla.redhat.com/show_bug.cgi?id=1684049

Comment 6 Eric Rich 2019-05-14 22:04:08 UTC
This cluster's failed install seems to be caused by an expired kubelet client certificate:

$ sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            08:89:6e:e5:3c:da:dd:28:b4:16:37:ad:75:bb:e0:9a:70:f6:5f:56
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU = openshift, CN = kubelet-signer
        Validity
            Not Before: May 14 16:34:00 2019 GMT
            Not After : May 14 16:55:30 2019 GMT
        Subject: O = system:nodes, CN = system:node:master-0
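
As a rough check of the Validity fields above, the arithmetic can be done with GNU `date`. This is a sketch: it assumes GNU coreutils `date` (for `-d`), and it takes the timestamp of the original description (2019-05-14 17:53:05 UTC) as the reference point. The variable and function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: compute the certificate lifetime and how long it had been expired
# when the bug was filed. Assumes GNU date; names here are illustrative.
not_before="May 14 16:34:00 2019 GMT"   # from the openssl output
not_after="May 14 16:55:30 2019 GMT"    # from the openssl output
reported_at="2019-05-14 17:53:05 UTC"   # timestamp of the bug description

# Convert a date string to seconds since the epoch (UTC).
to_epoch() { date -u -d "$1" +%s; }

lifetime=$(( $(to_epoch "$not_after") - $(to_epoch "$not_before") ))
expired_for=$(( $(to_epoch "$reported_at") - $(to_epoch "$not_after") ))

echo "lifetime_seconds=$lifetime"                  # 1290 (21m30s)
echo "expired_seconds_at_report=$expired_for"      # 3455 (57m35s)
```

So the certificate was valid for only 21 minutes 30 seconds and had already been expired for about 57 minutes when the bug was filed. On a live node, `openssl x509 -checkend 0 -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem` should give the same yes/no answer through its exit status.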