Bug 1177056
| Summary: | Huge metadata with thinp | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Matěj Cepl <mcepl> |
| Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> |
| lvm2 sub component: | Thin Provisioning | QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED WONTFIX | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | agk, cmarthal, heinzm, jbrassow, msnitzer, prajnoha, thornber, zkabelac |
| Version: | 7.1 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-18 17:08:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1469559 | | |
| Attachments: | | | |
Description
Matěj Cepl
2014-12-24 01:35:13 UTC
Created attachment 972608 [details]
output of vgcfgbackup
Created attachment 972609 [details]
crash on startup of the system with too little space for metadata (perhaps none)
Created attachment 972610 [details]
/var/log/messages from the crashed system (up to the crash itself)
Created attachment 972611 [details]
/var/log directory from the recovery system when working on the recovery
(In reply to Matěj Cepl from comment #0)

> Created attachment 972607 [details]
> compressed metadata of rhel/pool00 thinp pool
>
> Description of problem:
> I have lost two days recovering my LVM thinp-based system (root, /home,
> and swap all used to be on the thin pool; I have now moved root to a
> normal LV; swap and /home are encrypted via cryptsetup), after the
> metadata on the thinp pool got to 100%. Originally the metadata size was
> 112 MB, but even after increasing it to 436 MB, the metadata still take
> around 90% of the metadata space. I am attaching the compressed metadata
> of the thinp pool (and vgcfgbackup output for the whole LVM system).
>
> So the first issue is that the metadata seem to grow very fast. The
> second problem is that the crash happened at all. If the metadata run
> out, then I would expect a normal -ENOSPC or something of that kind, not
> a complete crash of the system and inability to boot. Another screenshot
> is attached.
>
> The third issue is that recovery is far too hard. I really like the idea
> of bug 1136979 comment 2. An admin shouldn't be asked to run anything
> more complicated than something like lvconvert --repair rhel/pool00.

Just a few comments from me on where we need to look.

A few things happened together. The monitoring daemon was not used, which is the primary guard against running out of space.

'lvconvert --repair' failed to do its work because it was rather unexpected that a repair from 112 MB into a 'double-sized' new metadata volume simply resulted in still 100% used metadata; with 436 MB it did start, but, as mentioned, at 90% fullness. Here we need a few words from Joe about this behaviour.

Another issue: 'swapping' of metadata strangely changes the chunk size, which seems to be a clear bug in the lvm2 code.
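For context, the simple recovery flow the reporter asks for maps roughly onto two lvm2 commands. This is a hedged sketch, assuming the VG 'rhel' and thin pool 'pool00' names from this report, and that the pool is deactivated before repair:

```
# While the pool is still operational and the VG has free space,
# the metadata LV can simply be grown:
lvextend --poolmetadatasize +100M rhel/pool00

# Once metadata is exhausted, deactivate the pool and let lvm2
# drive thin_repair into a freshly allocated metadata LV:
lvchange -an rhel/pool00
lvconvert --repair rhel/pool00
```

As the comments below discuss, at the time of this report the second step did not behave as expected on this pool's metadata.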
(In reply to Zdenek Kabelac from comment #6)
> A few things happened together - the monitoring daemon was not used -
> which is the primary guard against running out of space.

Just to emphasize: I hope the monitoring daemon was used in normal production usage (although when looking at `pgrep -f -l event` I don't see any running dmeventd - is it run only on demand/per event?). It was missing only in the recovery image (run from the DVD installation "Troubleshooting system"), where for some reason /usr/sbin/dmeventd is not present.

(In reply to Zdenek Kabelac from comment #6)
> the monitoring daemon was not used

Actually, on my rather default system I get this:

mitmanek:~# dmevent_tool -m
rhel-old_pool00_t_meta0 not monitored
rhel-tmp                not monitored
rhel-swap_base          not monitored
rhel-debian             not monitored
rhel-root               not monitored
rhel-home_base          not monitored
rhel-pool00             not monitored
home                    not monitored
rhel-pool00-tpool       not monitored
rhel-pool00_tdata       not monitored
rhel-filemon            not monitored
rhel-widle              not monitored
rhel-pool00_tmeta       not monitored
swap                    not monitored
rhel-new_home_base      not monitored
mitmanek:~#

I feel anxious ... am I doomed to experience the same crash again?

Also, /dev/rhel/home_base and /dev/rhel/swap_base are the foundation volumes for encryption into the home and swap volumes respectively. I don't know if it matters.

(In reply to Matěj Cepl from comment #8)
> (In reply to Zdenek Kabelac from comment #6)
> > the monitoring daemon was not used
>
> Actually, on my rather default system I get this:
>
> rhel-pool00 not monitored

It's probably worth adding a comment here. lvm.conf:

thin_pool_autoextend_threshold = 100
thin_pool_autoextend_percent = 20

By default the monitoring action is NOT enabled - the user must enable it deliberately, since it obviously requires more free space in the VG to be available. (The threshold should be something like 75%.)

On the other hand, the tools could possibly be a bit more 'noisy' when monitoring is not enabled and a thin pool is created.
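The defaults quoted above effectively disable auto-extension (a threshold of 100 means "never"). A sketch of the relevant /etc/lvm/lvm.conf settings, using the 75% figure suggested in the comment (not a shipped default):

```
activation {
    # Any value below 100 enables auto-extension of thin pools;
    # 100 (the default at the time of this report) disables it.
    thin_pool_autoextend_threshold = 75

    # When the threshold is crossed, grow the pool by 20% of its
    # current size (this requires free space in the VG).
    thin_pool_autoextend_percent = 20
}
```

Whether a given pool is actually being watched by dmeventd can also be checked with `lvs -o+seg_monitor rhel/pool00`.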
Although we have 2 bugs in one here - changing to Joe to resolve the metadata resize weirdness. I'll do a separate fix for '--chunksize' & lvconvert.

The metadata in the attachment are not the 'original' 112 MB (that seems to be lost for now) - but even the resized 440 MB show strange results and occupy 90% of the metadata space. Interestingly, a conversion 440->112->224 then shows just 50%.

(In reply to Matěj Cepl from comment #2)
> Created attachment 972609 [details]
> crash on startup of the system with too little space for metadata
> (perhaps none)

We really need the preceding thin-pool errors. This screen grab is useless, other than telling us that XFS hit a NULL pointer (which obviously is a bug; cc'ing Eric, _but_ I really doubt enough context was provided for Eric or any other XFS developer to _really_ fix this, so they'll have to resort to trying to reproduce it).

I suspect that the default 'no_space_timeout' of 60 seconds expired and the thin-pool switched to read-only mode, at which point write IOs were returned to XFS as errors.

(In reply to Matěj Cepl from comment #3)
> Created attachment 972610 [details]
> /var/log/messages from the crashed system (up to the crash itself)

Again, there are no messages about thinp, so this messages file is useless. We need console logging, or, if a crashdump was collected, the output of 'dmesg' from the crash utility.

I see no evidence of this being a thin-pool BUG; reassigning to zkabelac, as the default of _not_ monitoring thin pools seems very flawed. Not to mention the fix needed from comment #10.

The chunk size is now preserved in lvconvert with the upstream patch:

https://www.redhat.com/archives/lvm-devel/2015-January/msg00068.html

However, lvm2 cannot deal with the problem of doubling the metadata size and still having 100% fullness of metadata. So passing back to Joe.

The 'default' monitoring mechanism is yet to be decided.
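The metadata 'swapping' discussed above can be reproduced for offline inspection. A hedged sketch, assuming the rhel/pool00 names from this report and an inactive pool (the scratch LV size must match the pool's metadata LV):

```
# Create a scratch LV and swap it with the pool's metadata LV;
# after the swap, the old metadata lives in rhel/temp_meta:
lvcreate -n temp_meta -L 440M rhel
lvconvert --thinpool rhel/pool00 --poolmetadata rhel/temp_meta

# Activate the swapped-out metadata and inspect it with the
# thin-provisioning tools:
lvchange -ay rhel/temp_meta
thin_check /dev/rhel/temp_meta
thin_dump /dev/rhel/temp_meta > pool00_metadata.xml
```

It is this swap path that, per comment #6, was altering the chunk size until the patch referenced above.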
(In reply to Zdenek Kabelac from comment #12)
> chunksize is now preserved in lvconvert with upstream patch:
>
> https://www.redhat.com/archives/lvm-devel/2015-January/msg00068.html
>
> However lvm2 cannot deal with the problem of doubling metadata size and
> still having 100% fullness of metadata.
>
> So passing back to Joe.
>
> The 'default' monitoring mechanism is yet to be decided.

A couple of things we need to check on for this bug:
1) is the change in metadata size now properly handled?
2) we need to make it easier to run 'repair'

zkabelac, could you get this started and check whether #1 is still a problem?

(In reply to Jonathan Earl Brassow from comment #14)
> couple things that we need to check on for this bug:
> 1) is change in metadata size now properly handled
> 2) we need to make it easier to run 'repair'
>
> zkabelac, could you get this started and check if #1 is still a problem?

I believe #1 is already solved (need zkabelac to confirm). We are working hard on solving #2 - that will be a separate bug.

So, if #1 is solved, then we can close this bug with an appropriate resolution and proceed with #2 on a separate bug.

There have been many improvements in the thin_repair tool - so I'd guess this particular issue would already be handled correctly with the current release. But it's too late for any more RHEL 7 investigation - if the issue reoccurs on RHEL 8, we need to create a new bug.
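For completeness, the manual path through the thin_repair tool mentioned in the closing comment looks roughly like this. A hedged sketch: the LV names are illustrative, and the output LV must be at least as large as the damaged metadata:

```
# Copy-and-fix the damaged metadata into a fresh LV, then verify it:
thin_repair -i /dev/rhel/old_meta -o /dev/rhel/new_meta
thin_check /dev/rhel/new_meta

# Swap the repaired metadata back into the pool:
lvconvert --thinpool rhel/pool00 --poolmetadata rhel/new_meta
```

This is essentially what `lvconvert --repair` automates; the bug tracked here was that, at the time, the automated path produced a doubled-size metadata LV that was still reported as 100% (or 90%) full.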