Bug 1889584 - "authentication" operator is failing
Summary: "authentication" operator is failing
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: apiserver-auth
Version: 4.6
Hardware: s390x
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Standa Laznicka
QA Contact: pmali
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-20 05:12 UTC by Nishant Chauhan
Modified: 2020-11-18 09:33 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-18 09:22:11 UTC
Target Upstream Version:


Attachments
cluster-01 & cluster-02 "oc get co" (236.30 KB, application/zip)
2020-10-20 05:12 UTC, Nishant Chauhan
cluster-01 "oc get co" (109.96 KB, image/png)
2020-10-20 08:11 UTC, Nishant Chauhan

Description Nishant Chauhan 2020-10-20 05:12:40 UTC
Created attachment 1722762 [details]
cluster-01 & cluster-02 "oc get co"

Description of problem:

After a fresh OCP setup, the authentication operator starts failing every 20-50 hours, without any external workload having been created.
I tested this on a two-cluster setup.

Version-Release number of selected component (if applicable):

Cluster-1:
Client Version: 4.6.0-0.nightly-s390x-2020-09-05-222506
Server Version: 4.6.0-0.nightly-s390x-2020-10-03-051120
Kubernetes Version: v1.19.0+db1fc96

Cluster-2:
Client Version: 4.6.0-rc.4
Server Version: 4.6.0-0.nightly-s390x-2020-10-10-041058
Kubernetes Version: v1.19.0+d59ce34

How reproducible:
Fresh OCP setup


Steps to Reproduce:
1. Fresh OCP setup


Actual results:


Expected results:


Additional info:

I have collected logs from both clusters; please let me know if they are needed, or whether this is a known bug or a configuration issue.

Comment 1 Standa Laznicka 2020-10-20 07:45:28 UTC
This bug is being closed immediately, and I'm reporting it as suspicious to the company's security department. I don't believe anyone with non-malicious intentions would ever post a .docx attachment where a text file would be enough.

TO ANYONE READING THIS BUGZILLA - DO NOT OPEN THE ATTACHMENT, IT CAN BE MALICIOUS.

The description of the bug provides literally no information for troubleshooting the problem, as if to force the reviewer to open the attachment.

Nishant, if you're a real person, send a response to this BZ by Friday, or I'm also going to report your email address to your company's IT as compromised.

Comment 2 Nishant Chauhan 2020-10-20 08:11:29 UTC
Created attachment 1722805 [details]
cluster-01 "oc get co"

Comment 3 Standa Laznicka 2020-10-20 08:26:29 UTC
The current attachment shows everything's alright.

If you encounter the problem again, reopen this bugzilla and attach a link to a must-gather: https://docs.openshift.com/container-platform/4.5/support/gathering-cluster-data.html.

Please know that merely learning that the operator is degraded has informational value nearing zero when troubleshooting the problem. Actually seeing the reported conditions would help, but a must-gather would probably still be necessary.
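
For reference, a must-gather is usually collected and packaged along these lines (a sketch; the destination path is illustrative):

# oc adm must-gather --dest-dir=/tmp/must-gather     ## collects cluster diagnostic data into the given directory
# tar czf must-gather.tar.gz -C /tmp must-gather     ## packages it for upload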

Comment 4 Nishant Chauhan 2020-10-20 09:42:57 UTC
Hi Standa,

Thanks a lot for your response!

And sorry, actually the ".docx" contained screenshots I took of the "oc get co" output, and in the Bugzilla portal there is no option to add more than one attachment.

For sure, now noted.

As for the bug I raised, it is just an observation I had with the newly installed clusters: the authentication operator was failing while the rest of the operators were fine.
The cluster is 19 days old, and in the last 13 days the "oauth-openshift" pod restarted 3 times; I wanted to know whether this is normal behavior or a configuration issue.

Can I upload the "must-gather" data to my company's cloud file-sharing portal <https://ibm.ent.box.com/>, since the complete log data is 135 MB?
Or should I attach part of the "must-gather" data in zip/tar format here in Bugzilla?

Or do you want me to monitor this behavior for a few more days?

OCP version:
==============
# oc version
Client Version: 4.6.0-0.nightly-s390x-2020-09-05-222506
Server Version: 4.6.0-0.nightly-s390x-2020-10-03-051120
Kubernetes Version: v1.19.0+db1fc96

here is the snippet:
====================

# oc get pods -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-84444458d9-2v9bd   1/1     Running   0          13d
oauth-openshift-84444458d9-6cx72   1/1     Running   3          13d

# oc get co
NAME                                       VERSION                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      47h
cloud-credential                           4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
cluster-autoscaler                         4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
config-operator                            4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
console                                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      9h
csi-snapshot-controller                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
dns                                        4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
etcd                                       4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
image-registry                             4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
ingress                                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
insights                                   4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
kube-apiserver                             4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
kube-controller-manager                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
kube-scheduler                             4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
kube-storage-version-migrator              4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
machine-api                                4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
machine-approver                           4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
machine-config                             4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
marketplace                                4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
monitoring                                 4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      3d5h
network                                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
node-tuning                                4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
openshift-apiserver                        4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      47h
openshift-controller-manager               4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      3d22h
openshift-samples                          4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      13d
operator-lifecycle-manager                 4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
operator-lifecycle-manager-catalog         4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      5h7m
service-ca                                 4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d
storage                                    4.6.0-0.nightly-s390x-2020-10-03-051120   True        False         False      19d

Thanks!

Comment 5 Standa Laznicka 2020-10-20 10:56:39 UTC
The pods shouldn't usually restart, although if authentication keeps working, I wouldn't worry too much about it. However, if the operator goes degraded, there might be an issue somewhere.

Given that you provided "s390x" as the architecture, I'm assuming the installation is UPI? You may want to check what the operator actually reports when it goes degraded (see `oc get co authentication -o yaml` and look for the Degraded condition). It might also be an issue with your underlying infrastructure. If, however, it appears more like a problem in the product, I'd ask you to please share a must-gather collected when the issue occurs (or shortly afterwards) so that we can observe it in the logs.
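
For example, the Degraded condition and its message can be pulled out directly (a sketch; these are the standard clusteroperator status fields):

# oc get co authentication -o yaml     ## full operator status, including all conditions
# oc get co authentication -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}'     ## just the Degraded message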

It's ok to upload your must-gather in your company's sharing portal as long as I'm able to access it :)

Comment 6 Standa Laznicka 2020-10-20 11:01:23 UTC
I'll set the Target Release to 4.7 for now so that the bug does not block the 4.6 release, as we don't have anything actionable at this moment.

Comment 8 Nishant Chauhan 2020-10-23 14:10:49 UTC
Hi Standa,

Sorry for the late response. I found one more restart of the pod on the server, and this cluster is mostly running OpenShift's own component workloads.

# oc get pods -n openshift-authentication
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-84444458d9-2v9bd   1/1     Running   0          17d
oauth-openshift-84444458d9-6cx72   1/1     Running   4          17d
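
If it helps, the reason for the most recent restart can usually be recovered from the container's last terminated state (a sketch; these are standard pod status fields, using the pod name from the output above):

# oc get pod oauth-openshift-84444458d9-6cx72 -n openshift-authentication -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'     ## e.g. Error, OOMKilled
# oc describe pod oauth-openshift-84444458d9-6cx72 -n openshift-authentication     ## shows Last State and recent events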


You might have received the mail from nishantchauhan@in.ibm.com with the link; if not, please let me know and I will send it again.

Thanks!

Comment 9 Nishant Chauhan 2020-10-23 14:13:29 UTC
(In reply to Nishant Chauhan from comment #8)

To clarify: you should have received the link to "must-gather-server01.tar.gz" via the IBM Box cloud sharing portal, sent from my ID "nishantchauhan@in.ibm.com".

Comment 10 Standa Laznicka 2020-10-27 11:43:04 UTC
Hello,

Yes, I got the email and viewed the must-gather. I noticed the authentication operator was misbehaving - https://bugzilla.redhat.com/show_bug.cgi?id=1891758. This could have triggered some of the reported degrades.

I can also see in the logs of the etcd operator that on 2020-10-21 there were issues connecting to etcd and, eventually, leader election was triggered. Was there an issue with the cluster's infrastructure that could have caused this?
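
(For reference, the relevant messages can be located with something like the following; the namespace and deployment names are the standard ones, but treat this as a sketch, and <master-node-name> is a placeholder:)

# oc logs -n openshift-etcd-operator deployment/etcd-operator | grep -i 'leader'     ## the operator's view of leader changes
# oc logs -n openshift-etcd etcd-<master-node-name> -c etcd | grep -i 'election'     ## election messages from an etcd member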

Comment 11 Nishant Chauhan 2020-10-27 14:30:16 UTC
(In reply to Standa Laznicka from comment #10)
> Hello,
> 
> Yes, I got the email and viewed the must-gather. I noticed the
> authentication operator was misbehaving -
> https://bugzilla.redhat.com/show_bug.cgi?id=1891758. This could have
> triggered some of the reported degrades.

Thanks for updating!!

> 
> I can also see in the logs of the etcd operator that on 2020-10-21 there
> were issues connecting to etcd and, eventually, leader election was
> triggered. Was there an issue with the cluster's infrastructure that could
> have caused this?

No, there were no infra issues. When I was working with the previous version "4.6.0-0.nightly-s390x-2020-09-30-110053", the cluster was crashing quite frequently, but after upgrading to version "4.6.0-0.nightly-s390x-2020-10-03-051120" it was much better.

Comment 12 Standa Laznicka 2020-11-16 11:56:14 UTC
I'm flooded with other work. If you get the chance, please check whether the issues persist with the current stable and supported release.

Comment 13 Nishant Chauhan 2020-11-17 04:01:44 UTC
Hi Standa,

This issue seems to be resolved in the version I am working on now. I have not seen any restarts in the last 10 days.

# oc project openshift-authentication
Now using project "openshift-authentication" on server "https://api.ocp-t8359001.lnxne.boe:6443".
# oc get pods
NAME                               READY   STATUS    RESTARTS   AGE
oauth-openshift-7d8b779b85-dpg69   1/1     Running   0          10d
oauth-openshift-7d8b779b85-t6r5w   1/1     Running   0          10d
# oc version
Client Version: 4.6.0-rc.4
Server Version: 4.6.1
Kubernetes Version: v1.19.0+d59ce34


Thanks!

Comment 14 Standa Laznicka 2020-11-18 09:22:11 UTC
That's great to hear! I'm going to close this bugzilla so that it no longer appears on my list; feel free to reopen if the issue reappears in the near future.

Comment 15 Nishant Chauhan 2020-11-18 09:33:07 UTC
Sure, Thanks Standa!!

