Bug 1809614
Summary: | oc adm must-gather fails on disconnected IPv6 environments because it's unable to reach quay.io | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Marius Cornea <mcornea> |
Component: | oc | Assignee: | Maciej Szulik <maszulik> |
Status: | CLOSED NOTABUG | QA Contact: | zhou ying <yinzhou> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.3.z | CC: | aos-bugs, jokerman, mfojtik |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-03-10 21:55:54 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Marius Cornea
2020-03-03 14:13:00 UTC
This is a generic problem with `oc adm must-gather` in disconnected environments; see this docs bug: https://bugzilla.redhat.com/show_bug.cgi?id=1771435. If anything were to change in the product, it would be on the `oc adm mirror` or `oc adm must-gather` side of things, definitely not the installer, so I'm moving this to the oc component.

I mirrored the must-gather image to the disconnected registry that I used for the initial deployment:

```
oc image mirror quay.io/openshift/origin-must-gather:latest registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
```

But when I run `oc adm must-gather` it gets stuck:

```
oc adm must-gather --image registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
[must-gather ] OUT Using must-gather plugin-in image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
[must-gather ] OUT namespace/openshift-must-gather-8hq7j created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-lldv8 created
[must-gather ] OUT pod for plug-in image registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest created
```

The pod is stuck in the Init state:

```
openshift-must-gather-8hq7j   must-gather-k626f   0/1   Init:0/1   0   13s
```

```
[kni@provisionhost-0 ~]$ oc -n openshift-must-gather-8hq7j get pods must-gather-k626f -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks-status: ""
  creationTimestamp: "2020-03-10T18:50:34Z"
  generateName: must-gather-
  labels:
    app: must-gather
  name: must-gather-k626f
  namespace: openshift-must-gather-8hq7j
  resourceVersion: "242963"
  selfLink: /api/v1/namespaces/openshift-must-gather-8hq7j/pods/must-gather-k626f
  uid: a8c09f54-72cd-423e-949e-f16b1da35b56
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - 'trap : TERM INT; sleep infinity & wait'
    image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
    imagePullPolicy: Always
    name: copy
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /must-gather
      name: must-gather-output
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-gl9wz
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: default-dockercfg-sjk9h
  initContainers:
  - command:
    - /usr/bin/gather
    image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
    imagePullPolicy: Always
    name: gather
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /must-gather
      name: must-gather-output
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-gl9wz
      readOnly: true
  nodeName: master-2.ocp-edge-cluster.qe.lab.redhat.com
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 0
  tolerations:
  - operator: Exists
  volumes:
  - emptyDir: {}
    name: must-gather-output
  - name: default-token-gl9wz
    secret:
      defaultMode: 420
      secretName: default-token-gl9wz
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-03-10T18:50:34Z"
    message: 'containers with incomplete status: [gather]'
    reason: ContainersNotInitialized
    status: "False"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-03-10T18:50:34Z"
    message: 'containers with unready status: [copy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-03-10T18:50:34Z"
    message: 'containers with unready status: [copy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-03-10T18:50:34Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
    imageID: ""
    lastState: {}
    name: copy
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: fd2e:6f44:5dd8:c956::107
  initContainerStatuses:
  - image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
    imageID: ""
    lastState: {}
    name: gather
    ready: false
    restartCount: 0
    state:
      waiting:
        reason: PodInitializing
  phase: Pending
  qosClass: BestEffort
  startTime: "2020-03-10T18:50:34Z"
```

The issue in the previous comment seems to have been caused by another BZ getting the cluster into a broken state. I could run `oc adm must-gather --image registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest` against a healthy cluster.
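The mirroring step in the description rewrites only the registry host of the image reference, keeping the repository path (`openshift/origin-must-gather`) and tag (`latest`) intact. A minimal sketch of that rewrite, assuming the mirror registry from this report; the `mirror_ref` helper name is invented for illustration and is not part of `oc`:

```shell
# Disconnected mirror registry used in this report.
MIRROR="registry.ocp-edge-cluster.qe.lab.redhat.com:5000"

# Hypothetical helper: replace the registry host (everything up to the
# first '/') of an image reference with the mirror registry, keeping the
# repository path and tag unchanged.
mirror_ref() {
  echo "$1" | sed "s|^[^/]*/|$MIRROR/|"
}

mirror_ref quay.io/openshift/origin-must-gather:latest
# -> registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
```

The rewritten reference is exactly what is then passed to `oc adm must-gather --image ...`; the point of the bug is that both the `gather` init container and the `copy` container must pull this mirrored reference, since the cluster has no route to quay.io over the disconnected IPv6 network.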
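When the must-gather pod sits in `Init:0/1` as above, the first clue is the waiting reason of the `gather` init container in the pod YAML. A minimal sketch of extracting it with standard tools; the YAML heredoc below is an abridged stand-in for the real `oc -n <ns> get pod <pod> -o yaml` output, and the `/tmp/pod.yaml` path is arbitrary:

```shell
# Abridged stand-in for the pod YAML captured in this report.
cat > /tmp/pod.yaml <<'EOF'
status:
  initContainerStatuses:
  - image: registry.ocp-edge-cluster.qe.lab.redhat.com:5000/openshift/origin-must-gather:latest
    name: gather
    ready: false
    state:
      waiting:
        reason: PodInitializing
EOF

# Print the waiting reason of the first waiting (init) container.
awk '/waiting:/ {f=1; next} f && /reason:/ {print $2; exit}' /tmp/pod.yaml
# -> PodInitializing
```

Against a live cluster, `oc -n openshift-must-gather-8hq7j describe pod must-gather-k626f` and the namespace events would show whether the pull from the mirror registry is actually failing, which is what distinguishes this report (cluster broken by another BZ) from a genuine mirroring problem.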