
Bug 1752358

Summary: authentication and console fail to start on Azure
Product: OpenShift Container Platform
Component: Networking
Networking sub component: openshift-sdn
Reporter: sheng.lao <shlao>
Assignee: Ricardo Carrillo Cruz <ricarril>
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Severity: medium
Priority: high
CC: aos-bugs, bbennett, dhansen, ewolinet, mfojtik, ricarril
Version: 4.2.0
Target Release: 4.4.0
Type: Bug
Last Closed: 2020-02-05 14:31:35 UTC
Bug Depends On: 1765280

Description sheng.lao 2019-09-16 07:25:46 UTC
Description of problem:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-4.2/280

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console"
2019/09/16 06:26:09 Container setup in pod e2e-azure failed, exit code 1, reason Error
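
For anyone triaging a failure like this, a minimal first step is to ask the cluster which operators are stuck and why (illustrative commands, not taken from the job artifacts; they assume a kubeconfig for the failed cluster is still usable):

oc get clusteroperators authentication console
oc describe clusteroperator authentication

The Degraded/Progressing conditions in the describe output usually name the component the operator is waiting on.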



Comment 3 Dan Mace 2019-09-18 21:48:36 UTC
On the receiving end (ingress controllers), I haven't found anything obviously wrong yet. Router pods are reporting ready, endpoints are present, iptables looks consistent, the SDN reports opening the health check port (although I can't see the state of the health check).

One thing I don't have, but would like to have in this case, is a load balancer state dump from Azure. Right now I have no idea whether the nodes were considered healthy LB pool targets at the time of diagnostic collection, or whether the LB even existed.
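
For the record, the Azure side could be inspected along these lines (a sketch, assuming az CLI access and the installer-created resource group; the names are placeholders):

az network lb list -g <resource-group> -o table
az network lb show -g <resource-group> -n <lb-name>
az network lb probe list -g <resource-group> --lb-name <lb-name> -o table

These show whether the LB exists and how its health probes are configured; the actual per-backend probe health would likely have to come from the load balancer's metrics in Azure Monitor.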

Comment 6 Daneyon Hansen 2019-09-24 16:13:31 UTC
@sheng see if the ingress controller pods are scheduled to master nodes. I ran into a similar issue using UPI for installation: masters were being provisioned with both master and worker roles, causing the scheduler to place ingress controller pods on masters. If this is the case, you can set mastersSchedulable to false in schedulers.config.openshift.io/cluster and delete the ingress pods; they should then be rescheduled onto worker nodes (see the sketch below).
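
A minimal sketch of that fix (illustrative commands, assuming the default ingress controller; the pod label may vary by release):

oc patch schedulers.config.openshift.io cluster --type=merge -p '{"spec":{"mastersSchedulable":false}}'
oc -n openshift-ingress delete pods -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default

Once the old pods are gone, the scheduler should place the replacements on nodes that still carry the worker role.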

Comment 7 Dan Mace 2019-10-08 21:56:27 UTC
The iptables rules on the node hosting the authentication operator (ci-op-ij3tf3sf-282fe-zthp4-master-2) seem to have stale entries associated with the default ingress controller load balancer IP.

Here are the stale rules from nodes/ci-op-ij3tf3sf-282fe-zthp4-master-2:

[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:https has no endpoints" -m tcp --dport 443 -j REJECT --reject-with icmp-port-unreachable

And here are the rules from another node captured at the same time:

[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HEVFQXAKPPGAL4BV

If you look at the endpoints from the artifacts, you can find the openshift-ingress/router-default endpoint addresses.

It appears that the stale iptables rule is the reason the auth operator can't reach its route host over the ingress LB IP.
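
For reference, a quick way to compare the two views on a live cluster (illustrative commands, assuming node and cluster access; not taken from the job artifacts):

# on the suspect node: dump the rules kube-proxy programmed for the router service
iptables-save | grep 'openshift-ingress/router-default'
# from the cluster: the endpoints those rules should reflect
oc -n openshift-ingress get endpoints router-default -o yaml

A node whose REJECT rules say "has no endpoints" while the Endpoints object lists ready addresses is exactly the stale state described above.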

Reassigning to networking for further diagnosis.

Comment 8 Casey Callendrello 2019-10-09 14:33:13 UTC
Took a quick look at the logs. It seems the informers missed another update. On a good node, I see:

I0916 05:58:30.880580    3820 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.129.2.4:80]
I0916 05:58:31.004146    3820 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.6:80 10.129.2.4:80]

whereas on the bad node, I see:
I0916 05:58:52.697200    4105 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.6:80 10.129.2.4:80]

despite the node having been declared "sunk" at 05:49.
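
If the per-node proxy logs are still retrievable, the discrepancy can be confirmed directly (hypothetical commands, assuming the openshift-sdn pods carry the proxy logs):

oc -n openshift-sdn get pods -o wide
oc -n openshift-sdn logs <sdn-pod-on-bad-node> | grep 'openshift-ingress/router-default'

Comparing the last "Setting endpoints" line across nodes shows whether one proxy missed a later update removing a dead endpoint.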

Comment 10 Ricardo Carrillo Cruz 2020-02-05 14:31:35 UTC
Closing as a dupe of 1765280

*** This bug has been marked as a duplicate of bug 1765280 ***