Bug 1575717 - Build default setting of node selectors does not work when cri-o is the container runtime
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
unspecified
medium
Target Milestone: ---
: 3.10.0
Assignee: Ben Parees
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-07 17:52 UTC by Hongkai Liu
Modified: 2018-05-08 11:25 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-08 11:25:25 UTC
Target Upstream Version:


Attachments

Description Hongkai Liu 2018-05-07 17:52:28 UTC
Description of problem:
I cannot restrict build pods to selected nodes when cri-o is the container runtime.


Version-Release number of selected component (if applicable):

# yum list installed | grep openshift
atomic-openshift.x86_64       3.10.0-0.32.0.git.0.2b17fd0.el7

# oc describe node | grep -i run
 Container Runtime Version:  cri-o://1.10.0-beta.1
                    runtime=cri-o

How reproducible:
Always.

Steps to Reproduce:
# head /etc/origin/master/master-config.yaml 
admissionConfig:
  pluginConfig:
    BuildDefaults:
      configuration:
        apiVersion: v1
        env: []
        kind: BuildDefaultsConfig
        nodeSelector:
          build: build
        resources:

### the goal here is to restart the master-api/controllers
### however, with crio, the script does not work (see the following bz)
### https://bugzilla.redhat.com/show_bug.cgi?id=1574660
### So I tried this
# oc delete pod master-controllers-ip-172-31-56-80.us-west-2.compute.internal
pod "master-controllers-ip-172-31-56-80.us-west-2.compute.internal" deleted
# oc delete pod master-api-ip-172-31-56-80.us-west-2.compute.internal
pod "master-api-ip-172-31-56-80.us-west-2.compute.internal" deleted


# oc new-project testproject
# oc new-app https://github.com/openshift/cakephp-ex

### No node with the desired label
# oc get node -l build=build
No resources found.
# oc get node -l region=primary
NAME                                          STATUS    ROLES     AGE       VERSION
ip-172-31-55-235.us-west-2.compute.internal   Ready     compute   59m       v1.10.0+b81c8f8

Actual results:
# oc get pod -o wide
NAME                 READY     STATUS    RESTARTS   AGE       IP           NODE
cakephp-ex-1-build   1/1       Running   0          1m        172.22.0.5   ip-172-31-55-235.us-west-2.compute.internal

Expected results:
The above build pod should be Pending, since no node has the label build=build.

Additional info:

Comment 1 Ben Parees 2018-05-07 17:55:12 UTC
Please provide the build pod yaml so we can see whether the nodeSelector was set on it.

Comment 2 Ben Parees 2018-05-07 18:02:12 UTC
Also have you tried setting the nodeselector via the build overrider instead of the defaulter?

if your project has a default nodeselector, the build defaulter may not be able to apply its default nodeselector.
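For reference, an overrider stanza in master-config.yaml would look something like this (a sketch, to be merged into the existing pluginConfig section):

```yaml
admissionConfig:
  pluginConfig:
    BuildOverrides:
      configuration:
        apiVersion: v1
        kind: BuildOverridesConfig
        nodeSelector:
          build: build
```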

Comment 4 Hongkai Liu 2018-05-07 18:07:10 UTC
Hi Ben,

What is the right way to restart master-api/controllers so that changes to master-config take effect?
It seems that deleting the api/controller pods with oc did not do the job.

Comment 5 Ben Parees 2018-05-07 18:11:59 UTC
correct, because your pod already has a nodeselector value:

 nodeSelector:
    region: primary

the build defaulter will only apply its nodeselector if there is not one already on the pod.

Since your pod already has a nodeselector (probably from the project defaults) you need to use the overrider.

Note that once you do this, you are going to end up w/ pods that have two nodeselectors:

region=primary
build=build


So you'll need to add the "build=build" label to some of your region=primary nodes, if you want the build pods to be able to run somewhere.
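The scheduler treats the combined selector conjunctively: a node is a fit only if its labels contain every key/value pair in the pod's nodeSelector. A toy sketch of that matching rule (not OpenShift code, just the semantics):

```python
# Toy model of nodeSelector matching: every key/value pair in the pod's
# selector must be present in the node's labels.
def matches(node_labels, pod_selector):
    return all(node_labels.get(k) == v for k, v in pod_selector.items())

pod_selector = {"region": "primary", "build": "build"}

compute_node = {"region": "primary"}                    # typical compute node
labeled_node = {"region": "primary", "build": "build"}  # after adding the label

print(matches(compute_node, pod_selector))  # False -> pod stays Pending
print(matches(labeled_node, pod_selector))  # True  -> pod can schedule
```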

Comment 6 Ben Parees 2018-05-07 18:13:00 UTC
I don't know how the config is picked up since the refactor, that's a question for the master team.  I think the config is now stored in a configmap that is mounted into the api/controller pod..so you may have to edit that configmap directly.

Comment 7 Hongkai Liu 2018-05-07 18:47:20 UTC
# ps -ef | grep master
root     20565 20554  6 14:30 ?        00:15:55 openshift start master controllers --config=/etc/origin/master/master-config.yaml --listen=https://0.0.0.0:8444 --loglevel=2
root     20612 20601  4 14:30 ?        00:09:32 openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2

Looks like the config file is still used in the api/controllers.

Is this new?
```
the build defaulter will only apply its nodeselector if there is not one already on the pod.
```

When docker was the runtime, I made the same change to master-config and it worked with only the BuildDefaults change (no override change was necessary).

Now I have also added the override part and deleted the api/controller pods again:
# cat /etc/origin/master/master-config.yaml 
admissionConfig:
  pluginConfig:
    BuildDefaults:
      configuration:
        apiVersion: v1
        env: []
        kind: BuildDefaultsConfig
        nodeSelector:
          build: build
        resources:
          limits: {}
          requests: {}
    BuildOverrides:
      configuration:
        apiVersion: v1
        kind: BuildOverridesConfig
        nodeSelector:
          build: build

When I start a build, the build pod still runs. :(

Also tried to understand the configMap:

# oc get configmap
NAME                                 DATA      AGE
extension-apiserver-authentication   6         4h
kube-controller-manager              0         4h
kube-scheduler                       0         4h
openshift-master-controllers         0         4h
# oc get pod -o yaml | grep -i map


It seems that no config map is used in the pods. Which config map should I edit?

Comment 8 Ben Parees 2018-05-07 19:01:43 UTC
> Looks like the config file is still used in the api/controllers.

if those processes are running inside containers, then those paths are relative to the container, not the host filesystem.


> Is this new?

no.  the build defaulter will only apply its nodeselector key/value pair if the pod does not already have any nodeselector key/value pairs.  That has always been the case.

> When I start a build, the build pod still runs.

check the pod yaml to see if the nodeselector value was applied.  but i'm guessing your changes to master-config did not take effect.


> It seems that no config map is used in the pods. Which config map should I edit?

you'll have to ask the master team how config is managed since they made their changes.

Comment 9 Ben Parees 2018-05-07 19:13:46 UTC
so I spoke w/ Scott Dodson and he says that the master-config is bind mounted into the controller container.  So changing it on the host and restarting the controller container should be sufficient to pick up your changes.

If you turn on loglevel 5 for the controller process and gather logs we can see exactly what the build overrider is doing.

Comment 10 Hongkai Liu 2018-05-07 19:57:54 UTC
# vi /etc/origin/master/master.env
DEBUG_LOGLEVEL=5

And then I deleted the pod with oc.
# ps -ef | grep master

It still shows 
--config=/etc/origin/master/master-config.yaml --loglevel=2

So I need a way to stop the controller container.

# crictl ps | grep controllers
95526916794a6       registry.reg-aws.openshift.com:443/openshift3/ose-control-plane@sha256:7d5395addf13b47e75e65609fde5d7639487f695f86beb5fd64bc035bb819a63          5 hours ago         CONTAINER_RUNNING   controllers         0


# crictl stop 95526916794a6

# oc get pod
NAME                                                            READY     STATUS             RESTARTS   AGE
master-api-ip-172-31-56-80.us-west-2.compute.internal           1/1       Running            0          20m
master-controllers-ip-172-31-56-80.us-west-2.compute.internal   0/1       CrashLoopBackOff   6          20m
master-etcd-ip-172-31-56-80.us-west-2.compute.internal          1/1       Running            0          5h


# oc logs master-controllers-ip-172-31-56-80.us-west-2.compute.internal
# See the attachment

Comment 12 Hongkai Liu 2018-05-07 20:02:48 UTC
This one showed up again with cri-o:
https://bugzilla.redhat.com/show_bug.cgi?id=1570877

So the question is:
Is crictl stop controller_container_id the right way to stop the container under cri-o?

Comment 13 Ben Parees 2018-05-07 20:04:54 UTC
> Is it the right way to stop the container under crio?
> crictl stop controller_container_id

I don't know, that would be a question for the crio team.  It certainly looks reasonable.  Did the container not stop?  Why is your master-controller now in a crashloop?  What do its logs show?


In general, please open a separate bug for dealing w/ the issues you are having getting your master-config updated in your control plane.  And mark it as blocking this bug.  It should be assigned to the master team.

Comment 14 Hongkai Liu 2018-05-07 20:17:08 UTC
The log is in comment 11.
Ok. I will create another bz.

Comment 15 Hongkai Liu 2018-05-07 21:20:08 UTC
Since I could not recover the broken master-controller, I reran the commands on a new cluster.

And it works this time (I do not know why it broke on the first cluster; hoping it is a one-time thing. I will create a bz if I can replicate it).

crictl stop controller_container_id is the right command for the restart (I can see the log level changed in the ps output and, more importantly, the selector works).

oc delete pod won't make the new config take effect.

So I have got what I want, but ...

# cat /etc/origin/master/master-config.yaml 
admissionConfig:
  pluginConfig:
    BuildDefaults:
      configuration:
        apiVersion: v1
        env: []
        nodeSelector:
          build: build
        kind: BuildDefaultsConfig
        resources:
          limits: {}
          requests: {}
    BuildOverrides:
      configuration:
        apiVersion: v1
        kind: BuildOverridesConfig
    PodPreset:

I did not change the BuildOverrides part, and it works as I expected.


# oc get pod
NAME                 READY     STATUS    RESTARTS   AGE
cakephp-ex-1-build   0/1       Pending   0          29m


# oc get pod cakephp-ex-1-build -o yaml | grep nodeSelector: -A3
  nodeSelector:
    build: build
    region: primary
  restartPolicy: Never

=============================
Although this is what I want, it is different from your logic description above in comment 8.
Let me know if you want me to file a bz for it or provide more information.


We can also see this in the controller log:

# oc logs master-controllers-ip-172-31-23-32.us-west-2.compute.internal -f | grep -v graph_builder.go | grep -E  "override|build|Build"
...
I0507 20:44:48.222188       1 create_dockercfg_secrets.go:75] Adding service account builder
I0507 20:44:48.258434       1 create_dockercfg_secrets.go:80] Updating service account builder
I0507 20:44:48.279673       1 create_dockercfg_secrets.go:460] Token secret for service account aaa/builder is not populated yet
I0507 20:44:48.279782       1 create_dockercfg_secrets.go:351] The dockercfg secret was not created for service account aaa/builder, will retry
I0507 20:44:48.290739       1 create_dockercfg_secrets.go:80] Updating service account builder
I0507 20:44:48.290246       1 create_dockercfg_secrets.go:441] Creating token secret "builder-token-bqtbt" for service account aaa/builder
I0507 20:44:48.302359       1 create_dockercfg_secrets.go:460] Token secret for service account aaa/builder is not populated yet
I0507 20:44:48.302389       1 create_dockercfg_secrets.go:351] The dockercfg secret was not created for service account aaa/builder, will retry
I0507 20:44:48.302418       1 create_dockercfg_secrets.go:441] Creating token secret "builder-token-bqtbt" for service account aaa/builder
I0507 20:44:48.303195       1 create_dockercfg_secrets.go:149] Adding token secret aaa/builder-token-bqtbt
I0507 20:44:48.317588       1 create_dockercfg_secrets.go:460] Token secret for service account aaa/builder is not populated yet
I0507 20:44:48.317677       1 create_dockercfg_secrets.go:351] The dockercfg secret was not created for service account aaa/builder, will retry
I0507 20:44:48.321971       1 create_dockercfg_secrets.go:147] Updating token secret aaa/builder-token-bqtbt
I0507 20:44:48.322008       1 create_dockercfg_secrets.go:478] Creating dockercfg secret "builder-dockercfg-4rff2" for service account aaa/builder
I0507 20:44:48.337397       1 create_dockercfg_secrets.go:80] Updating service account builder
I0507 20:45:10.989424       1 reflector.go:428] github.com/openshift/origin/pkg/build/generated/informers/internalversion/factory.go:58: Watch close - *build.Build total 0 items received
I0507 20:45:13.150968       1 buildconfig_controller.go:100] Handling BuildConfig aaa/cakephp-ex (0)
I0507 20:45:13.151167       1 util.go:62] Current builds: 0, SuccessfulBuildsHistoryLimit: 5
I0507 20:45:13.151572       1 util.go:79] Current builds: 0, FailedBuildsHistoryLimit: 5
I0507 20:45:13.151651       1 buildconfig_controller.go:116] Running build for BuildConfig aaa/cakephp-ex (0)
I0507 20:45:13.150998       1 image_trigger_controller.go:361] Started syncing resource "buildconfigs.build.openshift.io/aaa/cakephp-ex"
I0507 20:45:13.151854       1 buildconfigs.go:213] Requesting build for BuildConfig based on image triggers aaa/cakephp-ex: &build.BuildRequest{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"cakephp-ex", GenerateName:"", Namespace:"aaa", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Revision:(*build.SourceRevision)(nil), TriggeredByImage:(*core.ObjectReference)(0xc426c83f80), From:(*core.ObjectReference)(0xc428448d10), Binary:(*build.BinaryBuildSource)(nil), LastVersion:(*int64)(nil), Env:[]core.EnvVar(nil), TriggeredBy:[]build.BuildTriggerCause{build.BuildTriggerCause{Message:"Image change", GenericWebHook:(*build.GenericWebHookCause)(nil), GitHubWebHook:(*build.GitHubWebHookCause)(nil), ImageChangeBuild:(*build.ImageChangeCause)(0xc427685c20), GitLabWebHook:(*build.GitLabWebHookCause)(nil), BitbucketWebHook:(*build.BitbucketWebHookCause)(nil)}}, DockerStrategyOptions:(*build.DockerStrategyOptions)(nil), SourceStrategyOptions:(*build.SourceStrategyOptions)(nil)}
I0507 20:45:13.189253       1 build_controller.go:333] Handling build aaa/cakephp-ex-1 (New)
I0507 20:45:13.189284       1 policy.go:36] Using *policy.SerialPolicy run policy for build aaa/cakephp-ex-1
I0507 20:45:13.193231       1 image_trigger_controller.go:363] Finished syncing resource "buildconfigs.build.openshift.io/aaa/cakephp-ex" (42.230306ms)
E0507 20:45:13.195055       1 buildconfig_controller.go:139] gave up on Build for BuildConfig aaa/cakephp-ex (0) due to fatal error: the LastVersion(1) on build config aaa/cakephp-ex does not match the build request LastVersion(0)
I0507 20:45:13.195144       1 buildconfig_controller.go:243] Will not retry fatal error for key aaa/cakephp-ex: fatal: the LastVersion(1) on build config aaa/cakephp-ex does not match the build request LastVersion(0)
I0507 20:45:13.195221       1 buildconfig_controller.go:100] Handling BuildConfig aaa/cakephp-ex (1)
I0507 20:45:13.195301       1 util.go:62] Current builds: 0, SuccessfulBuildsHistoryLimit: 5
I0507 20:45:13.195378       1 util.go:79] Current builds: 0, FailedBuildsHistoryLimit: 5
I0507 20:45:13.212381       1 util.go:171] /var/run/secrets/openshift.io/push will be used for docker push in cakephp-ex-1-build
I0507 20:45:13.212413       1 util.go:179] /var/run/secrets/openshift.io/pull will be used for docker pull in cakephp-ex-1-build
I0507 20:45:13.212964       1 defaults.go:48] Applying defaults to build aaa/cakephp-ex-1
I0507 20:45:13.212996       1 defaults.go:51] Applying defaults to pod aaa/cakephp-ex-1-build
I0507 20:45:13.213259       1 overrides.go:50] Applying overrides to build aaa/cakephp-ex-1
I0507 20:45:13.213640       1 build_controller.go:894] Pod aaa/cakephp-ex-1-build for build aaa/cakephp-ex-1 (New) is about to be created
I0507 20:45:13.221808       1 build_controller.go:921] Created pod aaa/cakephp-ex-1-build for build aaa/cakephp-ex-1 (New)
I0507 20:45:13.222002       1 build_controller.go:1053] Updating build aaa/cakephp-ex-1 (New) -> Pending
I0507 20:45:13.228368       1 factory.go:1147] About to try and schedule pod cakephp-ex-1-build
I0507 20:45:13.228882       1 scheduler.go:439] Attempting to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:13.229519       1 scheduler.go:191] Failed to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:13.229558       1 factory.go:1265] Unable to schedule aaa cakephp-ex-1-build: no fit: 0/16 nodes are available: 16 node(s) didn't match node selector.; waiting
I0507 20:45:13.229630       1 factory.go:1375] Updating pod condition for aaa/cakephp-ex-1-build to (PodScheduled==False)
I0507 20:45:13.230287       1 build_controller.go:1136] Patching build aaa/cakephp-ex-1 (New) with buildUpdate(phase: "Pending", reason: "", message: "", outputRef: "docker-registry.default.svc:5000/aaa/cakephp-ex:latest", podName: "cakephp-ex-1-build", pushSecret: {builder-dockercfg-4rff2})
I0507 20:45:13.230397       1 taint_manager.go:345] Noticed pod update: types.NamespacedName{Namespace:"aaa", Name:"cakephp-ex-1-build"}
I0507 20:45:13.230526       1 disruption.go:328] addPod called on pod "cakephp-ex-1-build"
I0507 20:45:13.230573       1 disruption.go:403] No PodDisruptionBudgets found for pod cakephp-ex-1-build, PodDisruptionBudget controller will avoid syncing.
I0507 20:45:13.230592       1 disruption.go:331] No matching pdb for pod "cakephp-ex-1-build"
I0507 20:45:13.230611       1 pvc_protection_controller.go:276] Got event on pod aaa/cakephp-ex-1-build
I0507 20:45:13.243077       1 pvc_protection_controller.go:276] Got event on pod aaa/cakephp-ex-1-build
I0507 20:45:13.243063       1 disruption.go:340] updatePod called on pod "cakephp-ex-1-build"
I0507 20:45:13.244638       1 disruption.go:403] No PodDisruptionBudgets found for pod cakephp-ex-1-build, PodDisruptionBudget controller will avoid syncing.
I0507 20:45:13.244774       1 disruption.go:343] No matching pdb for pod "cakephp-ex-1-build"
I0507 20:45:13.248591       1 factory.go:1147] About to try and schedule pod cakephp-ex-1-build
I0507 20:45:13.248677       1 scheduler.go:439] Attempting to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:13.253130       1 scheduler.go:191] Failed to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:13.252011       1 build_controller.go:333] Handling build aaa/cakephp-ex-1 (New)
I0507 20:45:13.253225       1 factory.go:1265] Unable to schedule aaa cakephp-ex-1-build: no fit: 0/16 nodes are available: 16 node(s) didn't match node selector.; waiting
I0507 20:45:13.253523       1 factory.go:1375] Updating pod condition for aaa/cakephp-ex-1-build to (PodScheduled==False)
I0507 20:45:13.253323       1 build_controller.go:1053] Updating build aaa/cakephp-ex-1 (New) -> Pending
I0507 20:45:13.255337       1 build_controller.go:1136] Patching build aaa/cakephp-ex-1 (New) with buildUpdate(phase: "Pending", reason: "", message: "", podName: "cakephp-ex-1-build")
W0507 20:45:13.253614       1 factory.go:1304] Request for pod aaa/cakephp-ex-1-build already in flight, abandoning
I0507 20:45:13.259905       1 build_controller.go:333] Handling build aaa/cakephp-ex-1 (Pending)
I0507 20:45:14.232228       1 factory.go:1147] About to try and schedule pod cakephp-ex-1-build
I0507 20:45:14.232246       1 scheduler.go:439] Attempting to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:14.233056       1 scheduler.go:191] Failed to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:14.233122       1 factory.go:1265] Unable to schedule aaa cakephp-ex-1-build: no fit: 0/16 nodes are available: 16 node(s) didn't match node selector.; waiting
I0507 20:45:14.233179       1 factory.go:1375] Updating pod condition for aaa/cakephp-ex-1-build to (PodScheduled==False)
I0507 20:45:16.235438       1 factory.go:1147] About to try and schedule pod cakephp-ex-1-build
I0507 20:45:16.235485       1 scheduler.go:439] Attempting to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:16.235867       1 scheduler.go:191] Failed to schedule pod: aaa/cakephp-ex-1-build
I0507 20:45:16.235899       1 factory.go:1265] Unable to schedule aaa cakephp-ex-1-build: no fit: 0/16 nodes are available: 16 node(s) didn't match node selector.; waiting
I0507 20:45:16.235996       1 factory.go:1375] Updating pod condition for aaa/cakephp-ex-1-build to (PodScheduled==False)

Comment 16 Ben Parees 2018-05-07 21:28:03 UTC
> Although this is what I want, it is different from your logic description above in comment 8.

That's right. I forgot how the build defaulter applies itself to the pod: it actually mutates the pod definition before creating the pod, so it makes sense that it would apply the default nodeSelector label to the pod.
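A rough model of the behavior observed in this thread (assumed semantics, inferred from the resulting pod yaml, not the actual origin code): the defaulter merges its nodeSelector keys into the pod spec before creation without overwriting any that are already set, which is why the pod ends up with both labels:

```python
# Assumed behavior, inferred from this bug: build defaults merge their
# nodeSelector into the pod's existing selector, never overwriting keys.
def apply_default_selector(pod_selector, default_selector):
    merged = dict(default_selector)
    merged.update(pod_selector)  # keys already on the pod win
    return merged

project_default = {"region": "primary"}  # set by the project defaults
build_default = {"build": "build"}       # from BuildDefaultsConfig

print(apply_default_selector(project_default, build_default))
# -> {'build': 'build', 'region': 'primary'}
```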

That said, given that the pod now has the nodeselector, is it behaving as you expected?  If so, can this be closed?

Comment 17 Hongkai Liu 2018-05-08 11:25:25 UTC
Thanks for guiding me to find the solution to my issue.
Let me close this.

