| Summary: | [DOCKER] OpenShift node fails to run pods due to error: runContainer: Error response from daemon: devmapper: | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Elvir Kuric <ekuric> |
| Component: | Containers | Assignee: | Daniel Walsh <dwalsh> |
| Status: | CLOSED NOTABUG | QA Contact: | weiwei jiang <wjiang> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.3.1 | CC: | amurdaca, aos-bugs, bchilds, dwalsh, ekuric, imcleod, jeder, jhonce, jokerman, mmccomas, mpatel, nagrawal, tcarlin, vgoyal |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | aos-scalability-34 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-04-18 15:37:46 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Created attachment 1215345 [details]
logs
I would like to try to reproduce this, can you provide more info about the workload that leads to this? How many pods are you creating? All at once?

I see dmesg is filled with `device-mapper: block manager: validator mismatch (old=index vs new=btree_node)` messages. Can you follow the steps here to try to reproduce this other bug, which leads to similar errors: https://bugzilla.redhat.com/show_bug.cgi?id=1286500#c2 ? It should already be fixed in the RHEL version you are running, but I want to make sure.

(In reply to Matthew Wong from comment #2)
> I would like to try to reproduce this, can you provide more info about the
> workload that leads to this? How many pods are you creating? All at once?
>
> I see dmesg is filled with `device-mapper: block manager: validator mismatch
> (old=index vs new=btree_node)` messages, can you follow the steps here to
> try to reproduce this other bug that leads to similar errors
> https://bugzilla.redhat.com/show_bug.cgi?id=1286500#c2 ? It should already
> be fixed in the RHEL version you are running but I want to make sure.

A bit more about the environment:

- This is a 220-node OpenShift cluster in RHEV. The RHEV storage domain is an iSCSI domain, and all VMs (the OpenShift machines) use this storage.
- In that environment I was creating up to 20k pods (hello-pod). I was using https://github.com/openshift/svt/tree/master/openshift_scalability (cluster-loader) for mass pod creation. It creates 5 pods per run, waits for them to reach the "Ready" state, then creates the next set, and so on, up to the number of pods specified in the YAML.

Looks like something is wrong with the thin pool.

- Around 20 MB of metadata space is used, but data space used seems to be 0.
- In the lvs output there are two _tdata volumes and two _tmeta volumes; there should be one of each.

So I suspect that if we wipe out the storage and associated data and restart, it might be just fine. But the question remains how we reached this state, and that might be a real bug. Could you clean this up and see if you can reproduce with docker-1.12?

Are you able to reproduce it? I need a reproducer too; that will help figure out how the thin pool reached this state. Switching to needinfo.

Reiterating Vivek's comment: this seems to have gotten into an invalid and completely unexpected configuration. We really need a reproducer to even begin to sort this out.

I logged into the system and the thin pool looks fine to me now. I downloaded one image (fedora) and tried to start a container, and it fails. I suspect the following is the core error message:

Nov 15 09:45:40 ansible.example.net kernel: SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)

I set SELinux to permissive mode and container start began working. So SELinux does not like something.

Dan, I am attaching the docker daemon debug logs for a container start. Do you have any idea what might be going on?

Created attachment 1220858 [details]
fedora container start logs
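A rough way to check the thin-pool symptoms described above (duplicate hidden _tdata/_tmeta volumes, zero data usage with non-zero metadata usage) is sketched below. This is only a diagnostic sketch; it assumes the volume group and pool are named docker_vg/docker-pool as in the lvs output in the description below, and none of it modifies anything:

lvs -a -o +chunksize,devices docker_vg   # hidden [docker-pool_tdata]/[docker-pool_tmeta] LVs should each appear exactly once
dmsetup status docker_vg-docker--pool    # raw pool status from device-mapper: used/total data blocks and metadata blocks
docker info | grep -i space              # compare with what the docker daemon itself reports for data/metadata space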
Give me the audit logs. We had a bug a while back with /dev/mqueue not handling labeling properly. But I think that error is ignored.

Dan,
I think the following is the AVC:
type=AVC msg=audit(1479311551.514:28746): avc: denied { transition } for pid=63850 comm="exe" path="/usr/bin/bash" dev="dm-3" ino=46139410 scontext=system_u:system_r:unconfined_service_t:s0 tcontext=system_u:system_r:svirt_lxc_net_t:s0:c223,c419 tclass=process
Also, attaching the audit logs I captured.
Created attachment 1221256 [details]
docker run audit logs
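For reference, AVC denials like the one quoted above can be pulled out of the audit log with the standard audit tools; a generic sketch, assuming auditd is logging to its default location:

ausearch -m AVC -ts recent                    # list recent AVC denial records
ausearch -m AVC -ts recent | audit2allow -w   # describe why each access was denied by the loaded policy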
Ping Elko, this is on the blocker list for 3.4. Please advise.

Does reinstalling docker-selinux show any errors? Did you see any errors when you installed docker-selinux?

ps -eZ | grep docker

Does this show them running as unconfined_service_t or docker_t?

Elvir, switching back to needinfo from you based on Dan's comment above.

(In reply to Daniel Walsh from comment #19)
> Does reinstalling docker-selinux show any errors? Did you see any errors
> when you installed docker-selinux?
>
> ps -eZ | grep docker
>
> Does this show them running as unconfined_service_t or docker_t?

I do not know for this specific node; on other machines (created from the same template) it is docker_t. I did not reinstall docker-selinux. The initial installation of docker-selinux did not report any errors (I prepared the VM template in advance and installed the docker* and atomic-openshift* packages, and it was ok), and yum worked fine to fetch and install these packages.

Is this still an issue?
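A sketch of the checks Dan asks for, run on an affected node. The package and service names come from the version list in the description below; everything else is generic and only illustrative:

ps -eZ | grep docker             # the daemon should be running as docker_t, not unconfined_service_t
semodule -l | grep docker        # confirm the docker SELinux policy module is actually loaded
yum reinstall -y docker-selinux  # rerunning the install scriptlets will surface any policy load errors
systemctl restart docker         # a restart is likely needed for the daemon to pick up the correct domain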
Description of problem:
OpenShift node will fail to start new pod(s)

Version-Release number of selected component (if applicable):
kernel-3.10.0-512.el7.x86_64
atomic-openshift-dockerregistry-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-pod-3.3.1.1-1.git.0.629a1d8.el7.x86_64
docker-common-1.10.3-55.el7.x86_64
atomic-openshift-clients-3.3.1.1-1.git.0.629a1d8.el7.x86_64
docker-1.10.3-55.el7.x86_64
atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-tests-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-clients-redistributable-3.3.1.1-1.git.0.629a1d8.el7.x86_64
docker-rhel-push-plugin-1.10.3-55.el7.x86_64
python-docker-py-1.9.0-1.el7.noarch
tuned-profiles-atomic-openshift-node-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-master-3.3.1.1-1.git.0.629a1d8.el7.x86_64
tuned-profiles-atomic-2.7.1-3.el7.noarch
docker-selinux-1.10.3-55.el7.x86_64
atomic-openshift-3.3.1.1-1.git.0.629a1d8.el7.x86_64
atomic-openshift-sdn-ovs-3.3.1.1-1.git.0.629a1d8.el7.x86_64

How reproducible:
I am not sure how to reproduce this. I have seen it after creating many pods with OpenShift running in a RHEV environment, but I do not have clear steps leading to this issue.

Steps to Reproduce:
Not available at this time.

Actual results:
It fails to start new pods.

Expected results:
New pods start.

Additional info:
I have seen this issue on multiple nodes. On all of them, "lvs -a -o +chunksize" shows "0" in the data field while the metadata field shows a value, and "docker info" shows the matching "Data Space Used" of 0. Once in this state, it is not possible to restart either docker or the OpenShift node.

logs:

# systemctl status atomic-openshift-node -l
● atomic-openshift-node.service - Atomic OpenShift Node
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
           └─openshift-sdn-ovs.conf
   Active: active (running) since Fri 2016-10-21 04:33:14 EDT; 1 weeks 1 days ago
     Docs: https://github.com/openshift/origin
 Main PID: 3706 (openshift)
   Memory: 87.5M
   CGroup: /system.slice/atomic-openshift-node.service
           ├─3706 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
           └─3977 journalctl -k -f

Oct 29 11:59:42 dhcp8-207.example.net atomic-openshift-node[3706]: I1029 11:59:42.360946 3706 proxier.go:502] Setting endpoints for "default/kubernetes:https" to [172.16.7.164:8443 172.16.7.165:8443 172.16.7.166:8443]
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: I1029 11:59:43.060455 3706 reconciler.go:254] MountVolume operation started for volume "kubernetes.io/secret/b0e4ee95-9dea-11e6-9452-001a4b1000fd-default-token-z0kid" (spec.Name: "default-token-z0kid") to pod "b0e4ee95-9dea-11e6-9452-001a4b1000fd" (UID: "b0e4ee95-9dea-11e6-9452-001a4b1000fd"). Volume is already mounted to pod, but remount was requested.
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: I1029 11:59:43.085378 3706 operation_executor.go:740] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/b0e4ee95-9dea-11e6-9452-001a4b1000fd-default-token-z0kid" (spec.Name: "default-token-z0kid") pod "b0e4ee95-9dea-11e6-9452-001a4b1000fd" (UID: "b0e4ee95-9dea-11e6-9452-001a4b1000fd").
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: I1029 11:59:43.325623 3706 docker_manager.go:1759] Need to restart pod infra container for "hellopods16_clusterproject1(b0e4ee95-9dea-11e6-9452-001a4b1000fd)" because it is not found
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: E1029 11:59:43.329808 3706 docker_manager.go:1979] Failed to create pod infra container: RunContainerError; Skipping pod "hellopods91_clusterproject6(0812d7f7-9ded-11e6-9452-001a4b1000fd)": runContainer: Error response from daemon: devmapper: Error mounting '/dev/mapper/docker-252:1-50359601-5b962068483609e4a11a55974c2cfbf0d96c770f70a94ab5a5192d11a0bb41d0-init' on '/var/lib/docker/devicemapper/mnt/5b962068483609e4a11a55974c2cfbf0d96c770f70a94ab5a5192d11a0bb41d0-init': input/output error
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: E1029 11:59:43.329926 3706 pod_workers.go:183] Error syncing pod 0812d7f7-9ded-11e6-9452-001a4b1000fd, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: devmapper: Error mounting '/dev/mapper/docker-252:1-50359601-5b962068483609e4a11a55974c2cfbf0d96c770f70a94ab5a5192d11a0bb41d0-init' on '/var/lib/docker/devicemapper/mnt/5b962068483609e4a11a55974c2cfbf0d96c770f70a94ab5a5192d11a0bb41d0-init': input/output error"
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: E1029 11:59:43.534035 3706 docker_manager.go:1979] Failed to create pod infra container: RunContainerError; Skipping pod "hellopods16_clusterproject1(b0e4ee95-9dea-11e6-9452-001a4b1000fd)": runContainer: Error response from daemon: devmapper: Error mounting '/dev/mapper/docker-252:1-50359601-f0c66b17fda4917f8c36dc0775ed441f8ec9ee2b41ab2ae908b5ea73996eb22d-init' on '/var/lib/docker/devicemapper/mnt/f0c66b17fda4917f8c36dc0775ed441f8ec9ee2b41ab2ae908b5ea73996eb22d-init': input/output error
Oct 29 11:59:43 dhcp8-207.example.net atomic-openshift-node[3706]: E1029 11:59:43.534108 3706 pod_workers.go:183] Error syncing pod b0e4ee95-9dea-11e6-9452-001a4b1000fd, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: devmapper: Error mounting '/dev/mapper/docker-252:1-50359601-f0c66b17fda4917f8c36dc0775ed441f8ec9ee2b41ab2ae908b5ea73996eb22d-init' on '/var/lib/docker/devicemapper/mnt/f0c66b17fda4917f8c36dc0775ed441f8ec9ee2b41ab2ae908b5ea73996eb22d-init': input/output error"
Oct 29 11:59:45 dhcp8-207.example.net atomic-openshift-node[3706]: E1029 11:59:45.099663 3706 thin_pool_watcher.go:72] encountered error refreshing thin pool watcher: error performing thin_ls on metadata device /dev/mapper/docker_vg-docker--pool_tmeta: Error running command `thin_ls --no-headers -m -o DEV,EXCLUSIVE_BYTES /dev/mapper/docker_vg-docker--pool_tmeta`: exit status 1
Oct 29 11:59:45 dhcp8-207.example.net atomic-openshift-node[3706]: output:
root@dhcp8-207: /var/log #

-----

# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 30
Server Version: 1.10.3
Storage Driver: devicemapper
 Pool Name: docker_vg-docker--pool
 Pool Blocksize: 524.3 kB
 Base Device Size: 10.74 GB
 Backing Filesystem: xfs
 Data file:
 Metadata file:
 Data Space Used: 0 B
 Data Space Total: 22.42 GB
 Data Space Available: 22.42 GB
 Metadata Space Used: 24.56 MB
 Metadata Space Total: 46.14 MB
 Metadata Space Available: 21.57 MB
 Udev Sync Supported: true
 Deferred Removal Enabled: true
 Deferred Deletion Enabled: true
 Deferred Deleted Device Count: 0
 Library Version: 1.02.135-RHEL7 (2016-09-28)
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
 Authorization: rhel-push-plugin
Kernel Version: 3.10.0-512.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.3 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 7.638 GiB
Name: dhcp8-207.example.net
ID: KJ5J:FJPP:5HCT:CLM2:5V2S:TKDG:UVOE:VCAV:WG4G:IQVD:FACE:BOBN
WARNING: bridge-nf-call-iptables is disabled
Registries: registry.ops.openshift.com (insecure), registry.access.redhat.com (secure), docker.io (secure)
root@dhcp8-207: /var/log #

----

# lvs -a -o +chunksize
  LV                  VG        Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Chunk
  docker-pool         docker_vg twi-aot--- 20.88g             0.00   53.24                           512.00k
  [docker-pool_tdata] docker_vg Twi-ao---- 20.88g                                                          0
  [docker-pool_tdata] docker_vg Twi-ao---- 20.88g                                                          0
  [docker-pool_tmeta] docker_vg ewi-ao---- 44.00m                                                          0
  [docker-pool_tmeta] docker_vg ewi-ao---- 44.00m                                                          0
  [lvol0_pmspare]     docker_vg ewi------- 24.00m                                                          0
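One possible cleanup sequence matching Vivek's earlier suggestion to wipe the storage and start over. This is only a sketch and it is destructive (all local images and containers on the node are lost); it assumes the pool is docker_vg/docker-pool as shown above and that /etc/sysconfig/docker-storage-setup is configured to recreate it:

systemctl stop atomic-openshift-node docker   # stop the node service and the docker daemon first
rm -rf /var/lib/docker/*                      # remove the devicemapper metadata along with all images/containers
lvremove -y docker_vg/docker-pool             # drop the broken thin pool and its hidden _tdata/_tmeta LVs
docker-storage-setup                          # rebuild the pool from /etc/sysconfig/docker-storage-setup
systemctl start docker atomic-openshift-node  # bring the daemon and the node back up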