Bug 1695475 - [network-operator] The network operator does not report the correct status, causing installation to fail when installing a cluster with SRIOV enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: zenghui.shi
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-04-03 08:03 UTC by Meng Bo
Modified: 2019-06-04 10:47 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:00 UTC
Target Upstream Version:


Attachments
network_operator_log (152.04 KB, text/plain)
2019-04-03 08:03 UTC, Meng Bo


Links
System                  ID              Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:0758  None      None    None     2019-06-04 10:47:10 UTC

Description Meng Bo 2019-04-03 08:03:46 UTC
Created attachment 1551276
network_operator_log

Description of problem:
Setting up a cluster with SRIOV enabled fails: the installation eventually reports a failure because the network operator is still updating.

When checking the network operator after the installation has finished (with a failure), no version is populated.


Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-02-133735

How reproducible:
always

Steps to Reproduce:
1. Generate the manifests and update cluster-network-03-config.yaml with the following:
# openshift-install create manifests
# cat manifests/cluster-network-03-config.yaml
apiVersion: "operator.openshift.io/v1"
kind: "Network"
metadata:
  name: "cluster"
spec:
  serviceNetwork:
  - "172.30.0.0/16"
  clusterNetwork:
  - cidr: "10.128.0.0/14"
    hostPrefix: 23
  defaultNetwork:
    type: OpenShiftSDN
    openshiftSDNConfig:
      mode: NetworkPolicy
  additionalNetworks:
    - type: Raw
      name: sriov-conf
      rawCNIConfig: '{
        "type": "sriov",
        "name": "sriov-network",
        "ipam": {
                "type": "host-local",
                "subnet": "10.11.11.0/24",
                "routes": [{
                        "dst": "0.0.0.0/0"
                }],
                "gateway": "10.11.11.1"
        }
      }'

2. Install the cluster
# openshift-install create cluster

3. Check the network operator
# oc get clusteroperator network

4. Check the cluster version
# oc get clusterversion

5. Check the operator pod log
# oc logs -f network-operator-d6c8c48b7-w8cm7 -n openshift-network-operator


Actual results:
Step2: The cluster installation eventually fails.
INFO Waiting up to 30m0s for the cluster at https://API_SERVER:6443 to initialize...
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 64% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 88% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 90% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 91% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 95% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 97% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 98% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 98% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.0.0-0.nightly-2019-04-02-133735: 99% complete
DEBUG Still waiting for the cluster to initialize: Cluster operator network is still updating
FATAL failed to initialize the cluster: Cluster operator network is still updating: timed out waiting for the condition

Step3: There is no operator version populated.
# oc get clusteroperator network
NAME      VERSION   AVAILABLE   PROGRESSING   FAILING   SINCE
network             True        False         False     30m

Step4: The clusterversion shows that the cluster is not AVAILABLE.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-04-02-133735   False       True          25m     Unable to apply 4.0.0-0.nightly-2019-04-02-133735: the cluster operator network has not yet successfully rolled out

Step5: Full operator log attached.
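
The empty VERSION cell in Step3 and the blocking operator in Step2 can also be picked out mechanically. The following is only a minimal sketch, assuming the fixed-width table layout and the installer message wording shown above; `version_column` and `blocking_operators` are hypothetical helper names and are not part of `oc` or `openshift-install`.

```shell
# Hypothetical helper: read an `oc get clusteroperator` table on stdin and
# print "<name> <version>" per row, or "<name> <empty>" when the VERSION
# cell is blank. Assumes VERSION is immediately followed by AVAILABLE in
# the header, as in the output above.
version_column() {
  awk '
    NR == 1 {
      start = index($0, "VERSION")
      stop  = index($0, "AVAILABLE")
      next
    }
    NF > 0 {
      cell = substr($0, start, stop - start)
      gsub(/[ \t]/, "", cell)
      print $1, (cell == "" ? "<empty>" : cell)
    }
  '
}

# Hypothetical helper: read openshift-install output on stdin and print each
# operator named in a "Cluster operator X is still updating" message, once.
blocking_operators() {
  sed -n 's/.*Cluster operator \([^ ]*\) is still updating.*/\1/p' | sort -u
}
```

Possible usage: `oc get clusteroperator network | version_column`, or pipe the saved installer output into `blocking_operators`.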


Expected results:
The cluster should be set up successfully with SRIOV enabled.

Additional info:

Comment 1 Meng Bo 2019-04-03 10:24:09 UTC
Some more info about the cluster.

The sriov services are running well under the correct project.

# oc get po,ds,sa -n openshift-sriov
NAME                            READY   STATUS    RESTARTS   AGE
pod/sriov-cni-8g555             1/1     Running   0          20m
pod/sriov-cni-92fkn             1/1     Running   0          20m
pod/sriov-cni-bnxsz             1/1     Running   0          20m
pod/sriov-cni-k4tdr             1/1     Running   0          26m
pod/sriov-cni-n4hqp             1/1     Running   0          26m
pod/sriov-cni-vfc8k             1/1     Running   0          26m
pod/sriov-device-plugin-5r55w   1/1     Running   0          26m
pod/sriov-device-plugin-94mjs   1/1     Running   0          26m
pod/sriov-device-plugin-k2rrx   1/1     Running   0          20m
pod/sriov-device-plugin-m7mwp   1/1     Running   0          19m
pod/sriov-device-plugin-mn62z   1/1     Running   0          25m
pod/sriov-device-plugin-rbv2q   1/1     Running   0          19m

NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.extensions/sriov-cni             6         6         6       6            6           beta.kubernetes.io/os=linux   26m
daemonset.extensions/sriov-device-plugin   6         6         6       6            6           beta.kubernetes.io/os=linux   26m

NAME                                 SECRETS   AGE
serviceaccount/builder               2         20m
serviceaccount/default               2         24m
serviceaccount/deployer              2         20m
serviceaccount/sriov-cni             2         26m
serviceaccount/sriov-device-plugin   2         26m
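
The "running well" observation can be checked without reading the table by eye. This is a rough sketch only, assuming the combined `oc get po,ds,sa` output format above, where DaemonSet rows are prefixed with `daemonset` and the columns are ordered NAME, DESIRED, CURRENT, READY; `unready_daemonsets` is a hypothetical helper name.

```shell
# Hypothetical helper: read `oc get po,ds,sa` output on stdin and print any
# DaemonSet row whose READY count ($4) differs from DESIRED ($2).
# Prints nothing when every DaemonSet is fully ready.
unready_daemonsets() {
  awk '$1 ~ /^daemonset/ && $2 != $4 { print $1, "desired=" $2, "ready=" $4 }'
}
```

Possible usage: `oc get po,ds,sa -n openshift-sriov | unready_daemonsets`. For the output in this comment it would print nothing, since both DaemonSets show 6 desired and 6 ready.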

Comment 2 zenghui.shi 2019-04-04 11:51:41 UTC
fix merged in CNO: https://github.com/openshift/cluster-network-operator/pull/138

Comment 4 Meng Bo 2019-04-11 06:44:32 UTC
Tested with build 4.0.0-0.nightly-2019-04-10-182914

Issue has been fixed.

The cluster setup finishes successfully, and the network operator gets the correct version and status.

The version field has been added to both the sriov-cni and sriov-device-plugin DaemonSets.

Comment 6 errata-xmlrpc 2019-06-04 10:47:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

