Bug 1752358
| Summary: | authentication and console fail to start on Azure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | sheng.lao <shlao> |
| Component: | Networking | Assignee: | Ricardo Carrillo Cruz <ricarril> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | medium | | |
| Priority: | high | CC: | aos-bugs, bbennett, dhansen, ewolinet, mfojtik, ricarril |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-02-05 14:31:35 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1765280 | | |
| Bug Blocks: | | | |
Description (sheng.lao, 2019-09-16 07:25:46 UTC)
On the receiving end (ingress controllers), I haven't found anything obviously wrong yet. Router pods are reporting ready, endpoints are present, iptables looks consistent, and the SDN reports opening the health check port (although I can't see the state of the health check). One thing I don't have, but would like to have in this case, is a load balancer state dump from Azure. Right now I have no idea whether the nodes were considered healthy LB pool targets at the time of diagnostic collection, or whether the LB even existed.

@sheng, see if the ingress controller pods are scheduled to master nodes. I ran into a similar issue using UPI for installation: masters were being provisioned with both the master and worker roles, causing the scheduler to place ingress controller pods on masters. If this is the case, you can set mastersSchedulable to false in schedulers.config.openshift.io/cluster and delete the ingress pods; they should then get scheduled to worker nodes (see the sketch at the end of this section).

The iptables rules on the node hosting the authentication operator (ci-op-ij3tf3sf-282fe-zthp4-master-2) seem to have stale entries associated with the default ingress controller load balancer IP. Here are the stale rules from nodes/ci-op-ij3tf3sf-282fe-zthp4-master-2:

```
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:http has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:https has no endpoints" -m tcp --dport 443 -j REJECT --reject-with icmp-port-unreachable
```

And here are the rules from another node, captured at the same time:

```
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-MBAZS3WDHL45BPIZ
[0:0] -A KUBE-SERVICES -d 52.141.221.165/32 -p tcp -m comment --comment "openshift-ingress/router-default:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-HEVFQXAKPPGAL4BV
```

If you look at the endpoints in the artifacts, you can find the openshift-ingress/router-default endpoint addresses. It appears that the stale iptables rules are the reason the auth operator can't reach its route host over the ingress LB IP. Reassigning to networking for further diagnosis.

Took a quick look at the logs. It seems the informers missed another update. On a good node, I see:

```
I0916 05:58:30.880580 3820 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.129.2.4:80]
I0916 05:58:31.004146 3820 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.6:80 10.129.2.4:80]
```

whereas on the bad node, I see:

```
I0916 05:58:52.697200 4105 roundrobin.go:310] LoadBalancerRR: Setting endpoints for openshift-ingress/router-default:http to [10.128.2.6:80 10.129.2.4:80]
```

despite the node being declared "sunk" at 05:49.

Closing as a dupe of 1765280.

*** This bug has been marked as a duplicate of bug 1765280 ***
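A minimal diagnostic sketch for the two suggestions above (checking router pod placement and comparing per-node iptables state), assuming cluster-admin access with `oc`. None of these commands come from the original report; the node name is the one called out in the comment above, and the `mastersSchedulable` patch only applies if the router pods really were scheduled to masters:

```sh
# Check whether the default router pods landed on master nodes.
oc -n openshift-ingress get pods -o wide

# If they did, mark masters unschedulable via the cluster Scheduler config,
# then delete the router pods so they get rescheduled onto worker nodes.
oc patch schedulers.config.openshift.io cluster --type=merge \
  -p '{"spec":{"mastersSchedulable":false}}'
oc -n openshift-ingress delete pods --all

# Compare the router-default service endpoints with the iptables state on the
# suspect node. A node carrying stale "has no endpoints" REJECT rules for the
# LB IP (as quoted above) drops traffic that other nodes forward through the
# KUBE-FW-* chains.
oc -n openshift-ingress get endpoints router-default -o wide
oc debug node/ci-op-ij3tf3sf-282fe-zthp4-master-2 -- \
  chroot /host sh -c 'iptables-save | grep router-default'
```

On a live cluster, substitute whichever node shows the stale REJECT rules; the output should match the KUBE-FW-* form seen on the healthy node once the endpoints update is processed.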