Bug 1523319
| Summary: | Error starting daemon: error initializing graphdriver: devmapper: Unable to take ownership of thin-pool (atomicos-docker--pool) that already has used data blocks | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Micah Abbott <miabbott> |
| Component: | docker | Assignee: | Daniel Walsh <dwalsh> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.5 | CC: | ajia, amurdaca, dwalsh, hasuzuki, lsm5, qcai, santiago, vgoyal, walters |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-11 00:01:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Micah Abbott 2017-12-07 17:41:11 UTC
Forgot to link to the nightly compose: http://download-node-02.eng.bos.redhat.com/nightly//RHAH-7.5.20171206.n.5/compose/Server/x86_64/ostree/repo/

I think this is a configuration error. Docker is being passed a thin pool which is not empty, and Docker can't take ownership of a thin pool which is not empty. How do you reproduce this issue? What's the workflow for creating the thin pool and passing it to Docker? I think a reset of storage and restarting Docker should fix this.

Well, it looks like a freshly installed RHEL Atomic Host 7.4.x always has a non-empty thin pool by default.

# lvs
  LV          VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool atomicos twi-a-t--- <12.64g             0.15   0.12

(In reply to Vivek Goyal from comment #3)
> How do you reproduce this issue? What's the workflow of creating thin pool
> and passing it to docker.

1) Boot the RHELAH 7.4.3 qcow2 image
2) Add an ostree remote that points to the 7.5 compose in comment #2

   # ostree remote add --no-gpg-verify custom http://download-node-02.eng.bos.redhat.com/nightly//RHAH-7.5.20171206.n.5/compose/Server/x86_64/ostree/repo/

3) Rebase to the 7.5 compose

   # rpm-ostree rebase custom:rhel-atomic-host/7/x86_64/standard

4) Reboot
5) Observe errors

> I think reset of storage and restarting docker should fix this.

Yes, this does work around the issue. But I don't think customers are going to want to have to delete all their Docker data when they are upgrading to RHELAH 7.5, right?

(In reply to CAI Qian from comment #4)
> Well, it looks like a freshly installed RHEL Atomic Host 7.4.x always has a
> non-empty thin pool by default.
>
> # lvs
>   LV          VG       Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
>   docker-pool atomicos twi-a-t--- <12.64g             0.15   0.12

This is strange. So Docker has not run at all yet? Is there any magic going on w.r.t. atomic system containers? These kinds of errors happen when Docker has been started and has set up the thin pool, and then somebody has deleted /var/lib/docker (but not the thin pool) and restarted Docker. So do the following:

- start docker
- stop docker
- rm -rf /var/lib/docker
- start docker

and you will see the error that Docker can't take ownership of a non-empty thin pool. The problem is that Docker thinks it is running for the first time (because /var/lib/docker/ is empty), but the thin pool is not empty; it has data created by the previous run of Docker. That's why when a storage reset happens, we remove the thin pool as well as /var/lib/docker/. So I am not sure what is happening in the atomic world that somehow triggers the above sequence, where the thin pool is old but Docker has lost its old data and is starting fresh.

(In reply to Micah Abbott from comment #5)
> (In reply to Vivek Goyal from comment #3)
>
> > How do you reproduce this issue? What's the workflow of creating thin pool
> > and passing it to docker.
>
> 1) Boot the RHELAH 7.4.3 qcow2 image
> 2) Add an ostree remote that points to the 7.5 compose in comment #2
>
> # ostree remote add --no-gpg-verify custom
> http://download-node-02.eng.bos.redhat.com/nightly//RHAH-7.5.20171206.n.5/
> compose/Server/x86_64/ostree/repo/
>
> 3) Rebase to the 7.5 compose
>
> # rpm-ostree rebase custom:rhel-atomic-host/7/x86_64/standard
>
> 4) Reboot
> 5) Observe errors
>
> > I think reset of storage and restarting docker should fix this.
>
> Yes, this does work around the issue. But I don't think customers are going
> to want to have to delete all their Docker data when they are upgrading to
> RHELAH 7.5, right?

Is there any chance that we lost the old /var/lib/docker/ after the upgrade?
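For reference, a minimal sketch of the failure sequence described above, assuming a devicemapper-backed Atomic Host whose thin pool was set up ahead of time (e.g. by container-storage-setup). Removing only /var/lib/docker while leaving the populated thin pool behind reproduces the ownership error, and the storage-reset workaround mentioned in the thread removes both together:

    # First run: docker initializes and takes ownership of the thin pool
    systemctl start docker
    systemctl stop docker

    # Delete docker's metadata but leave the thin pool (and its used blocks) behind
    rm -rf /var/lib/docker

    # Docker now believes it is starting fresh, but the pool is not empty, so
    # startup fails with the "Unable to take ownership of thin-pool" error
    systemctl start docker

    # Workaround (destroys all container data): remove the thin pool and
    # /var/lib/docker together, then let docker re-initialize a clean pool
    atomic storage reset
    systemctl start docker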
This is starting to look like a side effect of the interaction between 'rpm-ostree' and the 7.5 kernel as described here: https://bugzilla.redhat.com/show_bug.cgi?id=1428677#c17

Some relevant IRC logs:

<caiqian> vgoyal, yes, the /var/lib/docker was lost after upgrade to 7.5
<vgoyal> caiqian: we should not lose /var/lib/docker/ after upgrade
<vgoyal> caiqian: container-storage-setup does not purge /var/lib/docker/ AFAIK
<vgoyal> caiqian: "atomic storage reset" can do it
<vgoyal> caiqian: is /var/lib/docker on rootfs or a special mount
<vgoyal> caiqian: could it be that after upgrade it is overlayfs on a separate volume somehow mounted over /var/lib/docker/
<caiqian> vgoyal, rootfs
<vgoyal> caiqian: i am just scratching my head. i don't know how did we end up in a situation where old /var/lib/docker/ is not visible
<caiqian> vgoyal, wow, the old /var/lib/docker hide here
<caiqian> find /sysroot/ostree/deploy/rhel-atomic-host/var/lib/docker/
<caiqian> vgoyal, something atomic host upgrade replace /var/lib/docker with a new one.
<vgoyal> caiqian: i suspect that this has something to with ostree and rebase process
<vgoyal> caiqian: findmnt says that pre-upgrade my /var/lib/docker is here.
<vgoyal> ─/var /dev/mapper/atomicos-root[/ostree/deploy/rhel-atomic-host/var]
<vgoyal> basically /ostree/deploy/rhel-atomic-host/var is mounted on /var
<caiqian> vgoyal, yes, after upgrade, /var is no longer a separate mount

Colin, what do you think?

@miabbott does systemctl --failed show the tmpfiles.d generator failed?

@vivek: Can you glance at https://bugzilla.redhat.com/show_bug.cgi?id=1428677 ? Does anything in the container stack use O_TMPFILE?

`git grep O_TMPFILE` in projectatomic/docker has zero hits.

The RHELAH 7.5 ISO also immediately blows up with the same issue from https://bugzilla.redhat.com/show_bug.cgi?id=1428677 ... which I may just add EINVAL to our "ignore" list, but I'd really like a statement from the kernel side that O_TMPFILE is expected to work.

(In reply to Colin Walters from comment #10)
> @miabbott does systemctl --failed show the tmpfiles.d generator failed?

Nope.

-bash-4.2$ systemctl --failed
UNIT            LOAD   ACTIVE SUB    DESCRIPTION
● docker.service loaded failed failed Docker Application Container Engine
● kdump.service  loaded failed failed Crash recovery kernel arming

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

2 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

(In reply to Colin Walters from comment #11)
> The RHELAH 7.5 ISO also immediately blows up with the same issue from
> https://bugzilla.redhat.com/show_bug.cgi?id=1428677 ... which I may just add
> EINVAL to our "ignore" list, but I'd really like a statement from the kernel
> side that O_TMPFILE is expected to work.

The O_TMPFILE issue is now fixed in the 7.5 kernel. So is this bug still happening?
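For reference, a short sketch of the checks discussed in the IRC log above, assuming the deployment name rhel-atomic-host used in CAI Qian's find command. The goal is to confirm where /var is mounted from after the rebase and whether the "lost" Docker metadata is still sitting under the ostree deployment's var directory:

    # Pre-upgrade, /var should come from the deployment's var directory, e.g.
    # /dev/mapper/atomicos-root[/ostree/deploy/rhel-atomic-host/var]
    findmnt /var

    # Check whether the old Docker metadata survived under the sysroot
    ls /sysroot/ostree/deploy/rhel-atomic-host/var/lib/docker/

    # Check which units actually failed on boot (docker.service and
    # kdump.service in the output quoted above)
    systemctl --failed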
I can't reproduce this, just did a quick rebase:

# rpm-ostree status
State: idle
Deployments:
● rhah-20180205.n.0:rhel-atomic-host/7/x86_64/standard
                Version: 7.5.0 (2018-02-05 18:36:43)
                 Commit: 90db2b58446d807c92d02443bed3560f18a6f1ba7f6ddf640b2559000cd27046

  rhel-atomic-host-ostree:rhel-atomic-host/7/x86_64/standard
                Version: 7.4.5 (2018-02-05 00:07:59)
                 Commit: bb8c244eadbb48f5dfceaaf44630a82c986dfcc9ee431143a53935b2f3a2dcd0
           GPGSignature: Valid signature by 567E347AD0044ADE55BA8A5F199E2F91FD431D51

Micah, is this still valid?

I am going to claim this is fixed in the current release. Micah, please reopen if that is not the case.

Agreed, can't reproduce using the latest RHELAH 7.5 compose.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1071

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
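For reference, a quick verification sketch along the lines of the last comments, assuming a host that kept its original atomicos/docker-pool and has been rebased to the fixed 7.5 compose; the check is simply that Docker starts cleanly against the pre-existing thin pool:

    # Confirm the booted deployment is the 7.5 compose
    rpm-ostree status

    # Docker should start without the devmapper ownership error
    systemctl start docker
    systemctl status docker

    # The original thin pool should still be present and in use by docker
    lvs atomicos/docker-pool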