Bug 1958643 - All pods creation stuck due to SR-IOV webhook timeout
Summary: All pods creation stuck due to SR-IOV webhook timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: zenghui.shi
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-05-09 13:42 UTC by Sabina Aledort
Modified: 2021-07-27 23:07 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:07:32 UTC
Target Upstream Version:
Embargoed:


Attachments
sriov-network-operator.log (1.11 MB, text/plain)
2021-05-09 13:42 UTC, Sabina Aledort


Links
System                  ID                                         Private  Priority  Status  Summary                                    Last Updated
Github                  openshift sriov-network-operator pull 502  0        None      open    Bug 1958643: Fix network injector webhook  2021-05-10 08:08:09 UTC
Red Hat Product Errata  RHSA-2021:2438                             0        None      None    None                                       2021-07-27 23:07:50 UTC

Description Sabina Aledort 2021-05-09 13:42:55 UTC
Created attachment 1781294
sriov-network-operator.log

Description of problem:
OLM is failing due to SR-IOV webhook error:
failed calling webhook "network-resources-injector-config.k8s.io": Post "https://network-resources-injector-service.openshift-sriov-network-operator.svc:443/mutate?timeout=10s": no endpoints available for service "network-resources-injector-service"

Version-Release number of selected component (if applicable):
registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle:v4.8.0-202105080740.p0-1

How reproducible:
Always, when deploying the latest 4.8 SR-IOV operator from Brew.

Steps to Reproduce:
Deploy the latest 4.8 SR-IOV operator from Brew.

Actual results:
OLM is failing due to the SR-IOV webhook: the packageserver CSV is in the Failed phase, and its replicaset cannot create pods.

openshift-operator-lifecycle-manager               packageserver                                  Package Server               0.17.0                             Failed

# oc describe replicaset -n openshift-operator-lifecycle-manager packageserver-5f5c58776d
...
Events:
  Type     Reason        Age                     From                   Message
  ----     ------        ----                    ----                   -------
  Warning  FailedCreate  3m13s (x92 over 7h39m)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "network-resources-injector-config.k8s.io": Post "https://network-resources-injector-service.openshift-sriov-network-operator.svc:443/mutate?timeout=10s": no endpoints available for service "network-resources-injector-service"

Expected results:
OLM should run and the SR-IOV deployment should succeed.

Additional info:
This is blocking our CI as we are unable to deploy PTP and SRIOV operators.
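
When triaging this kind of failure, it can help to split the error message into the webhook name and the Service the API server tried to reach. The following is a minimal sketch that does so for the message quoted above. The helper name parse_webhook_error is hypothetical and not part of any OpenShift tooling, and the message format is assumed to match the one in this report.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Error text copied from the FailedCreate event in this report.
MESSAGE = (
    'Internal error occurred: failed calling webhook '
    '"network-resources-injector-config.k8s.io": Post '
    '"https://network-resources-injector-service.openshift-sriov-network-operator.svc:443/mutate?timeout=10s": '
    'no endpoints available for service "network-resources-injector-service"'
)


def parse_webhook_error(message):
    """Hypothetical helper: extract webhook name, target Service, and timeout."""
    m = re.search(r'failed calling webhook "([^"]+)": Post "([^"]+)": (.*)$', message)
    if m is None:
        return None
    webhook, url, reason = m.groups()
    parts = urlsplit(url)
    # In-cluster Service DNS names have the form <service>.<namespace>.svc
    service, namespace = parts.hostname.split(".")[:2]
    return {
        "webhook": webhook,
        "service": service,
        "namespace": namespace,
        "port": parts.port,
        "path": parts.path,
        "timeout": parse_qs(parts.query).get("timeout", [None])[0],
        "reason": reason,
    }


info = parse_webhook_error(MESSAGE)
print(info)
```

The extracted service and namespace identify which Service had no endpoints, that is, no ready pods behind it at the time of the call.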

Comment 1 Yuval Kashtan 2021-05-09 15:49:16 UTC
We also know that the previous build (v4.8.0.202105042126.p0-1) did not exhibit this error.

Comment 2 Sebastian Scheinkman 2021-05-09 17:20:13 UTC
The problem was caused by upgrading the admission webhook to version v1. I opened a PR to fix it:

https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/131
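
Comment 2 does not say which part of the v1 upgrade caused the failure. One documented difference between admissionregistration.k8s.io/v1beta1 and v1 is the default failurePolicy (Ignore in v1beta1, Fail in v1). Whether that default is what the PR addressed is an assumption and is not confirmed in this report. The toy model below is not Kubernetes code. It only illustrates why an unreachable webhook can block pod creation under Fail but not under Ignore. The function name admit_pod is hypothetical.

```python
# Toy model, not Kubernetes code. Defaults as documented for the two API versions.
DEFAULT_FAILURE_POLICY = {
    "admissionregistration.k8s.io/v1beta1": "Ignore",
    "admissionregistration.k8s.io/v1": "Fail",
}


def admit_pod(api_version, webhook_reachable, failure_policy=None):
    """Return True if pod creation proceeds past this (simplified) webhook call."""
    policy = failure_policy or DEFAULT_FAILURE_POLICY[api_version]
    if webhook_reachable:
        # Assumption: a reachable webhook allows the pod.
        return True
    # The call itself failed, for example because the Service has no endpoints.
    return policy == "Ignore"


for version in DEFAULT_FAILURE_POLICY:
    print(version, admit_pod(version, webhook_reachable=False))
```

The ?timeout=10s in the failing request also matches the documented v1 default timeoutSeconds of 10 (v1beta1 defaulted to 30), although the operator may have set that value explicitly.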

Comment 5 zhaozhanqi 2021-05-11 03:09:13 UTC
@Sebastian Scheinkman

It seems I did not hit this issue when I set up version 4.8.0-202105080740.p0:

# oc get csv
NAME                                           DISPLAY                   VERSION                 REPLACES   PHASE
sriov-network-operator.4.8.0-202105080740.p0   SR-IOV Network Operator   4.8.0-202105080740.p0              Succeeded


# oc get pod
NAME                                     READY   STATUS    RESTARTS   AGE
network-resources-injector-8mksv         1/1     Running   0          4m28s
network-resources-injector-949kk         1/1     Running   0          4m43s
network-resources-injector-glskv         1/1     Running   0          5m1s
operator-webhook-5kglx                   1/1     Running   0          4m43s
operator-webhook-qssgz                   1/1     Running   0          4m28s
operator-webhook-x9k8b                   1/1     Running   0          5m1s
sriov-cni-j6ntc                          2/2     Running   0          4m40s
sriov-device-plugin-g7hfr                1/1     Running   0          4m19s
sriov-network-config-daemon-wb424        1/1     Running   0          5m4s
sriov-network-config-daemon-xk48z        1/1     Running   0          4m44s
sriov-network-operator-cd85d8457-5swhp   1/1     Running   0          5m34s
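
A check like the one above can be automated. The following is a minimal sketch that scans saved `oc get pod` output for pods that are not Running or not fully ready. The function name unready_pods is hypothetical, the rows are a subset copied from this comment, and the column layout is assumed to be the default one shown here.

```python
# Subset of the `oc get pod` output from this comment.
POD_TABLE = """\
NAME                                     READY   STATUS    RESTARTS   AGE
network-resources-injector-8mksv         1/1     Running   0          4m28s
operator-webhook-5kglx                   1/1     Running   0          4m43s
sriov-cni-j6ntc                          2/2     Running   0          4m40s
sriov-network-operator-cd85d8457-5swhp   1/1     Running   0          5m34s
"""


def unready_pods(table):
    """Hypothetical helper: names of pods that are not Running or not fully ready."""
    bad = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        have, want = ready.split("/")
        if status != "Running" or have != want:
            bad.append(name)
    return bad


print(unready_pods(POD_TABLE))
```

An empty result means every listed pod is Running with all of its containers ready.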

Comment 6 zhaozhanqi 2021-05-11 03:13:50 UTC
I also tried upgrading from sriov-network-operator.4.8.0-202105042126.p0 to 4.8.0-202105080740.p0:

# oc get csv -n openshift-sriov-network-operator
NAME                                           DISPLAY                   VERSION                 REPLACES                                       PHASE
sriov-network-operator.4.8.0-202105080740.p0   SR-IOV Network Operator   4.8.0-202105080740.p0   sriov-network-operator.4.8.0-202105042126.p0   Succeeded

Comment 7 zhaozhanqi 2021-05-17 06:00:12 UTC
Hi, Sabina

Could you help verify this bug, or provide the steps to reproduce it? It does not happen on our QE side. Thanks.

Comment 8 Sabina Aledort 2021-05-18 07:21:26 UTC
(In reply to zhaozhanqi from comment #7)
> Hi, Sabina
> 
> Could you help verify this bug, or provide the steps to reproduce it?
> It does not happen on our QE side. Thanks.

Hi, 

@Sebastian Scheinkman's PR fixed the issue.
All the pods are up and running.

Tested image:
registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-sriov-network-operator-bundle:v4.8.0.202105111002.p0-1

Comment 9 zhaozhanqi 2021-05-26 04:05:07 UTC
Thanks, Sabina

Moving this bug to VERIFIED based on comment 8.

Comment 12 errata-xmlrpc 2021-07-27 23:07:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

