Bug 1579227
| Summary: | [upgrade]service-catalog upgrade to 3.10 failed | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Zihan Tang <zitang> |
| Component: | Service Catalog | Assignee: | Jay Boyd <jaboyd> |
| Status: | CLOSED DUPLICATE | QA Contact: | Zihan Tang <zitang> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | ||
| Version: | 3.10.0 | CC: | chezhang, jaboyd, jiazha, wmeng, zhsun, zitang |
| Target Milestone: | --- | ||
| Target Release: | 3.10.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2018-05-22 14:03:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Comment 1
Zihan Tang
2018-05-17 08:36:23 UTC
Is this deployment still available? I'd like to get the output from oc describe pod XXX. Or can I ssh into the deployment to look around?

In another v3.9 environment, service-catalog upgraded to 3.10 successfully, so this may be caused by the environment. Bug 1579261, about the ASB upgrade failing, was reported against the same environment.

The daemonset was updated to list the image as v3.10: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.10
but the pod is running a v3.9 image:
[root@ip-172-18-12-195 ~]# oc describe daemonset controller-manager -n kube-service-catalog
Name: controller-manager
Selector: app=controller-manager
Node-Selector: node-role.kubernetes.io/master=true
Labels: app=controller-manager
Annotations: <none>
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 1
Number of Nodes Scheduled with Up-to-date Pods: 1
Number of Nodes Scheduled with Available Pods: 1
Number of Nodes Misscheduled: 0
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=controller-manager
Service Account: service-catalog-controller
Containers:
controller-manager:
Image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.10
Port: 6443/TCP
Command:
/usr/bin/service-catalog
Args:
controller-manager
--secure-port
6443
-v
3
--leader-election-namespace
kube-service-catalog
--leader-elect-resource-lock
configmaps
--broker-relist-interval
5m
--feature-gates
OriginatingIdentity=true
--feature-gates
AsyncBindingOperations=true
Environment:
K8S_NAMESPACE: (v1:metadata.namespace)
Mounts:
/var/run/kubernetes-service-catalog from service-catalog-ssl (ro)
Volumes:
service-catalog-ssl:
Type: Secret (a volume populated by a Secret)
SecretName: controllermanager-ssl
Optional: false
Events: <none>
$ oc get daemonsets controller-manager -n kube-service-catalog -o yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
creationTimestamp: 2018-05-21T07:12:06Z
generation: 2
labels:
app: controller-manager
name: controller-manager
namespace: kube-service-catalog
resourceVersion: "9789"
selfLink: /apis/extensions/v1beta1/namespaces/kube-service-catalog/daemonsets/controller-manager
uid: 44e758d3-5cc6-11e8-9299-0e9a287ac26e
spec:
revisionHistoryLimit: 10
selector:
matchLabels:
app: controller-manager
template:
metadata:
creationTimestamp: null
labels:
app: controller-manager
spec:
containers:
- args:
- controller-manager
- --secure-port
- "6443"
- -v
- "3"
- --leader-election-namespace
- kube-service-catalog
- --leader-elect-resource-lock
- configmaps
- --broker-relist-interval
- 5m
- --feature-gates
- OriginatingIdentity=true
- --feature-gates
- AsyncBindingOperations=true
command:
- /usr/bin/service-catalog
env:
- name: K8S_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.10
imagePullPolicy: IfNotPresent
name: controller-manager
ports:
- containerPort: 6443
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/kubernetes-service-catalog
name: service-catalog-ssl
readOnly: true
dnsPolicy: ClusterFirst
nodeSelector:
node-role.kubernetes.io/master: "true"
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: service-catalog-controller
serviceAccountName: service-catalog-controller
terminationGracePeriodSeconds: 30
volumes:
- name: service-catalog-ssl
secret:
defaultMode: 420
items:
- key: tls.crt
path: apiserver.crt
- key: tls.key
path: apiserver.key
secretName: controllermanager-ssl
templateGeneration: 2
updateStrategy:
rollingUpdate:
maxUnavailable: 1
type: RollingUpdate
status:
currentNumberScheduled: 1
desiredNumberScheduled: 1
numberAvailable: 1
numberMisscheduled: 0
numberReady: 1
observedGeneration: 1
updatedNumberScheduled: 1
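One detail in the dump above is easy to miss: metadata.generation is 2 while status.observedGeneration is 1, which suggests the DaemonSet controller has not yet processed the updated spec. A minimal Python sketch of that check, assuming the object was saved with -o json instead of -o yaml and parsed into a dict. The function name rollout_lag is hypothetical and not part of oc or any library:

```python
def rollout_lag(daemonset: dict) -> int:
    """Return how many spec generations the controller has not yet observed.

    0 means the controller has seen the latest spec; a positive value means
    the spec was changed but the controller has not processed the change.
    """
    generation = daemonset["metadata"]["generation"]
    # observedGeneration may be absent if the controller never reconciled.
    observed = daemonset.get("status", {}).get("observedGeneration", 0)
    return generation - observed


# Values copied from the DaemonSet dump above.
ds = {
    "metadata": {"generation": 2},
    "status": {"observedGeneration": 1},
}
print(rollout_lag(ds))  # prints 1
```

A non-zero result would point at whatever reconciles DaemonSets rather than at the Service Catalog images themselves.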
The pod is still showing the old version, though: image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29
$ oc describe pod -n kube-service-catalog controller-manager-6fj6c
Name: controller-manager-6fj6c
Namespace: kube-service-catalog
Node: ip-172-18-12-195.ec2.internal/172.18.12.195
Start Time: Mon, 21 May 2018 03:12:06 -0400
Labels: app=controller-manager
controller-revision-hash=2156777767
pod-template-generation=1
Annotations: openshift.io/scc=restricted
Status: Running
IP: 10.2.0.11
Controlled By: DaemonSet/controller-manager
Containers:
controller-manager:
Container ID: docker://7444d9002f61620a2f27cac3621e752a400a40d14a47451b1251e59931e28d6e
Image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29
Image ID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:0b5ad372b0e0e63d94d087311aaabb9af21a090b6487d9e820a4de9df9c85b83
Port: 8080/TCP
Command:
/usr/bin/service-catalog
Args:
controller-manager
--port
8080
-v
3
--leader-election-namespace
kube-service-catalog
--leader-elect-resource-lock
configmaps
--broker-relist-interval
5m
--feature-gates
OriginatingIdentity=true
State: Running
Started: Mon, 21 May 2018 03:55:31 -0400
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Mon, 21 May 2018 03:53:47 -0400
Finished: Mon, 21 May 2018 03:54:37 -0400
Ready: True
Restart Count: 5
Environment:
K8S_NAMESPACE: kube-service-catalog (v1:metadata.namespace)
Mounts:
/var/run/kubernetes-service-catalog from service-catalog-ssl (ro)
/var/run/secrets/kubernetes.io/serviceaccount from service-catalog-controller-token-bwkp6 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
service-catalog-ssl:
Type: Secret (a volume populated by a Secret)
SecretName: apiserver-ssl
Optional: false
service-catalog-controller-token-bwkp6:
Type: Secret (a volume populated by a Secret)
SecretName: service-catalog-controller-token-bwkp6
Optional: false
QoS Class: BestEffort
Node-Selectors: node-role.kubernetes.io/master=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events: <none>
$ oc get pod -n kube-service-catalog controller-manager-6fj6c -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
openshift.io/scc: restricted
creationTimestamp: 2018-05-21T07:12:06Z
generateName: controller-manager-
labels:
app: controller-manager
controller-revision-hash: "2156777767"
pod-template-generation: "1"
name: controller-manager-6fj6c
namespace: kube-service-catalog
ownerReferences:
- apiVersion: extensions/v1beta1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: controller-manager
uid: 44e758d3-5cc6-11e8-9299-0e9a287ac26e
resourceVersion: "9447"
selfLink: /api/v1/namespaces/kube-service-catalog/pods/controller-manager-6fj6c
uid: 44ed3929-5cc6-11e8-9299-0e9a287ac26e
spec:
containers:
- args:
- controller-manager
- --port
- "8080"
- -v
- "3"
- --leader-election-namespace
- kube-service-catalog
- --leader-elect-resource-lock
- configmaps
- --broker-relist-interval
- 5m
- --feature-gates
- OriginatingIdentity=true
command:
- /usr/bin/service-catalog
env:
- name: K8S_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29
imagePullPolicy: IfNotPresent
name: controller-manager
ports:
- containerPort: 8080
protocol: TCP
resources: {}
securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
runAsUser: 1000090000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/kubernetes-service-catalog
name: service-catalog-ssl
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: service-catalog-controller-token-bwkp6
readOnly: true
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: service-catalog-controller-dockercfg-tfxjp
nodeName: ip-172-18-12-195.ec2.internal
nodeSelector:
node-role.kubernetes.io/master: "true"
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1000090000
seLinuxOptions:
level: s0:c10,c0
serviceAccount: service-catalog-controller
serviceAccountName: service-catalog-controller
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
volumes:
- name: service-catalog-ssl
secret:
defaultMode: 420
items:
- key: tls.crt
path: apiserver.crt
secretName: apiserver-ssl
- name: service-catalog-controller-token-bwkp6
secret:
defaultMode: 420
secretName: service-catalog-controller-token-bwkp6
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2018-05-21T07:12:06Z
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: 2018-05-21T07:55:32Z
status: "True"
type: Ready
- lastProbeTime: null
lastTransitionTime: 2018-05-21T07:12:14Z
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://7444d9002f61620a2f27cac3621e752a400a40d14a47451b1251e59931e28d6e
image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.9.29
imageID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog@sha256:0b5ad372b0e0e63d94d087311aaabb9af21a090b6487d9e820a4de9df9c85b83
lastState:
terminated:
containerID: docker://18ca99cb95325c39938ca22e8325e24f6f8b7e548153d372774fef87cc31ac37
exitCode: 255
finishedAt: 2018-05-21T07:54:37Z
reason: Error
startedAt: 2018-05-21T07:53:47Z
name: controller-manager
ready: true
restartCount: 5
state:
running:
startedAt: 2018-05-21T07:55:31Z
hostIP: 172.18.12.195
phase: Running
podIP: 10.2.0.11
qosClass: BestEffort
startTime: 2018-05-21T07:12:06Z
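The pod dump carries the label pod-template-generation "1" and the v3.9.29 image, while the DaemonSet reports templateGeneration 2 and the v3.10 image. A rough Python sketch of that comparison, assuming both objects were fetched with -o json and parsed into dicts. The helper names pod_is_stale and image_mismatches are hypothetical:

```python
def pod_is_stale(daemonset: dict, pod: dict) -> bool:
    """True if the pod was created from an older DaemonSet template."""
    template_gen = daemonset["spec"]["templateGeneration"]
    # The label value is a string in the pod metadata, e.g. "1".
    pod_gen = int(pod["metadata"]["labels"]["pod-template-generation"])
    return pod_gen < template_gen


def image_mismatches(daemonset: dict, pod: dict) -> dict:
    """Map container name -> (wanted image, running image) where they differ."""
    wanted = {c["name"]: c["image"]
              for c in daemonset["spec"]["template"]["spec"]["containers"]}
    running = {c["name"]: c["image"] for c in pod["spec"]["containers"]}
    return {name: (image, running.get(name))
            for name, image in wanted.items()
            if running.get(name) != image}


# Trimmed-down objects holding only the fields quoted in this report.
REGISTRY = "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog"
daemonset = {"spec": {
    "templateGeneration": 2,
    "template": {"spec": {"containers": [
        {"name": "controller-manager", "image": REGISTRY + ":v3.10"}]}},
}}
pod = {
    "metadata": {"labels": {"pod-template-generation": "1"}},
    "spec": {"containers": [
        {"name": "controller-manager", "image": REGISTRY + ":v3.9.29"}]},
}

print(pod_is_stale(daemonset, pod))
print(image_mismatches(daemonset, pod))
```

Comparing the template generation as well as the image helps here because the args also changed between the two versions (--port 8080 versus --secure-port 6443), so the image tag is not the only difference.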
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
registry.reg-aws.openshift.com:443/openshift3/registry-console v3.9 34d5fa3a2e44 4 days ago 244 MB
registry.reg-aws.openshift.com:443/openshift3/ose-node v3.10 2476ee92df86 5 days ago 1.2 GB
registry.reg-aws.openshift.com:443/openshift3/ose-control-plane v3.10 a752326cdcd8 5 days ago 633 MB
registry.reg-aws.openshift.com:443/openshift3/ose-pod v3.10.0-0.47.0 0ae5c896b92c 5 days ago 214 MB
registry.reg-aws.openshift.com:443/openshift3/openvswitch v3.9.29 7e1947e66b50 12 days ago 1.48 GB
registry.reg-aws.openshift.com:443/openshift3/node v3.9.29 5b0cba6e43a1 12 days ago 1.46 GB
registry.reg-aws.openshift.com:443/openshift3/ose-deployer v3.9.29 17067913d670 12 days ago 1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose v3.9 a5cc46c3354e 12 days ago 1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose v3.9.29 a5cc46c3354e 12 days ago 1.23 GB
registry.reg-aws.openshift.com:443/openshift3/ose-web-console v3.9.29 ae9986489fc7 12 days ago 466 MB
registry.reg-aws.openshift.com:443/openshift3/ose-pod v3.9.29 490f82693c58 12 days ago 214 MB
registry.reg-aws.openshift.com:443/openshift3/ose-template-service-broker v3.9.29 e017c1d36a84 12 days ago 299 MB
registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog v3.9.29 6800c342ce7e 12 days ago 288 MB
registry.access.redhat.com/rhel7/etcd latest 924412659272 12 days ago 256 MB
registry.access.redhat.com/rhel7/etcd 3.2.15 4f35b6516d22 6 weeks ago 256 MB
registry.reg-aws.openshift.com:443/openshift3/ose v3.10 a5bb4dd94b69 7 weeks ago 1.59 GB
I added a new label to the daemonset for the service-catalog controller. We should have seen the catalog controller pod restart, but that is not happening.
It looks like the kube controller pod is in trouble; it's in CrashLoopBackOff:
kube-system master-controllers-ip-172-18-12-195.ec2.internal 0/1 CrashLoopBackOff 81 7h
Tail of the master-controllers pod log:
I0521 14:49:15.867483 1 aws.go:1026] Building AWS cloudprovider
I0521 14:49:15.867540 1 regions.go:74] found AWS region "us-east-1"
I0521 14:49:15.867557 1 aws_credentials.go:103] registering credentials provider for AWS region "us-east-1"
I0521 14:49:15.867575 1 plugins.go:41] Registered credential provider "aws-ecr-us-east-1"
I0521 14:49:15.870583 1 log_handler.go:32] AWS API Send: ec2metadata GetMetadata &{GetMetadata GET /meta-data/instance-id <nil> <nil>} <nil>
I0521 14:49:15.870615 1 log_handler.go:37] AWS API ValidateResponse: ec2metadata GetMetadata &{GetMetadata GET /meta-data/instance-id <nil> <nil>} <nil> 200 OK
I0521 14:49:15.870782 1 log_handler.go:27] AWS request: ec2 DescribeInstances
I0521 14:49:15.966435 1 log_handler.go:32] AWS API Send: ec2 DescribeInstances &{DescribeInstances POST / 0xc4211ff040 <nil>} {
InstanceIds: ["i-07bf0c9c6b2f7f248"]
}
I0521 14:49:15.966504 1 log_handler.go:37] AWS API ValidateResponse: ec2 DescribeInstances &{DescribeInstances POST / 0xc4211ff040 <nil>} {
InstanceIds: ["i-07bf0c9c6b2f7f248"]
} 401 Unauthorized
F0521 14:49:15.966744 1 controllermanager.go:166] error building controller context: cloud provider could not be initialized: could not init cloud provider "aws": error finding instance i-07bf0c9c6b2f7f248: "error listing AWS instances: \"AuthFailure: AWS was not able to validate the provided access credentials\\n\\tstatus code: 401, request id: 1b9749b1-0985-4297-985b-0a59125f3678\""
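In a tail like the one above, only the final F-prefixed line is the fatal error that terminates the process; the I-prefixed lines are informational. A small Python sketch for pulling fatal entries out of a saved log, assuming the usual glog line layout (level letter, MMDD, time, thread id, file:line, message). The regex and the fatal_entries name are illustrative assumptions, and the sample F line is truncated from the one quoted above:

```python
import re

# Level letter, MMDD date, time, thread id, source file:line, then the message.
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)


def fatal_entries(log_text: str) -> list:
    """Return (source, message) for every fatal (F-prefixed) log line."""
    entries = []
    for line in log_text.splitlines():
        match = GLOG_LINE.match(line.strip())
        # Continuation lines (e.g. request bodies) do not match and are skipped.
        if match and match.group("level") == "F":
            entries.append((match.group("source"), match.group("message")))
    return entries


sample = "\n".join([
    "I0521 14:49:15.867483 1 aws.go:1026] Building AWS cloudprovider",
    "InstanceIds: [\"i-07bf0c9c6b2f7f248\"]",
    # Truncated copy of the fatal line from the log above.
    "F0521 14:49:15.966744 1 controllermanager.go:166] "
    "error building controller context: cloud provider could not be initialized",
])

for source, message in fatal_entries(sample):
    print(source, "->", message)
```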
sdodson reviewed and found that the password in the 3.10 configuration file /etc/origin/master/master.env was different from the one used in 3.9 (located in /etc/sysconfig/atomic-openshift-master-controllers). He updated master.env, restarted the controllers, and got past the AuthFailure issue.

The next blocking problem is this in the controller-manager pod:
I0521 16:48:22.319671 1 request.go:1099] body was not decodable (unable to check for Status): Object 'Kind' is missing in 'Error: 'x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "service-catalog-signer")' Trying to reach: 'https://172.31.138.133:443/apis/servicecatalog.k8s.io/v1beta1?timeout=32s''
F0521 16:48:22.319888 1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

Similar errors are being logged in the Service Catalog API server pod, indicating that it doesn't recognize the service-catalog-signer cert. This is a new cert created during the upgrade. The catalog pods should be restarted by the master controller, but we appear to be in a bit of a deadlock/race condition: the master controller cannot start until the catalog API server serves a trusted cert, and the catalog pods are not restarted until the master controller is running.

I deleted the catalog API server pod. The master controller then started successfully and restarted the Catalog's API server and controller manager pods. Both the Catalog API server and the Catalog controller manager pods now show Service Catalog version v3.10.0-0.47.0 (built 2018-05-16T01:36:14Z). Catalog has been upgraded successfully.

I don't believe this is normally a problem; these pods are usually restarted and pick up the new certs properly. I asked sdodson why this deployment ended up with the wrong credentials in master.env.
His reply: I'm not sure, I was very surprised to see different credentials between the two files. If anything I was expecting to see them present in /etc/sysconfig/atomic-openshift-master-controllers but to be absent in /etc/origin/master/master.env. We've got a bug (https://bugzilla.redhat.com/show_bug.cgi?id=1571608) where we're going to copy the old file to the new file during upgrade which may make this go away.

*** This bug has been marked as a duplicate of bug 1571608 ***
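For anyone hitting the same symptom, the mismatch between /etc/sysconfig/atomic-openshift-master-controllers and /etc/origin/master/master.env can be spotted without printing any secret values. A minimal Python sketch, assuming both files hold plain KEY=VALUE lines. The variable names in the sample are placeholders, since this report does not list the actual keys, and the helper names are hypothetical:

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes so "x" and x compare equal.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def differing_keys(old_text: str, new_text: str) -> list:
    """Return the sorted names of keys whose values differ or exist in only one file."""
    old, new = parse_env(old_text), parse_env(new_text)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Placeholder contents; real files would be read from the two paths above.
old_env = "AWS_ACCESS_KEY_ID=aaa\nAWS_SECRET_ACCESS_KEY=old\n"
new_env = "# written by the upgrade\nAWS_ACCESS_KEY_ID=aaa\nAWS_SECRET_ACCESS_KEY=new\nDEBUG_LOGLEVEL=2\n"

print(differing_keys(old_env, new_env))
```

Only key names are reported, so the output can be pasted into a bug comment without leaking credentials.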