Bug 1570902

Summary: LXC domains with nbd attached qcow2 image creates kernel stack trace
Product: [Community] Virtualization Tools
Reporter: ralph.schmieder
Component: libvirt
Assignee: Libvirt Maintainers <libvirt-maint>
Status: CLOSED DEFERRED
Severity: unspecified
Priority: unspecified
Version: unspecified
CC: berrange, libvirt-maint, ralph.schmieder
Hardware: x86_64
OS: Linux
Last Closed: 2024-12-17 12:25:32 UTC
Type: Bug
Attachments:
- libvirt stack trace and read errors from dmesg

Description ralph.schmieder 2018-04-23 16:38:24 UTC
Created attachment 1425701 [details]
libvirt stack trace and read errors from dmesg

Description of problem:
When running an LXC domain in libvirt where the disk is defined as

    <filesystem type='file' accessmode='passthrough'>
      <driver type='nbd' format='qcow2' wrpolicy='immediate'/>
      <source file='/var/local/some_disk.qcow2'/>
      <target dir='/'/>
    </filesystem>

then the domain comes up and runs fine (in this case, an Alpine 3.7 container).

However, when stopping/destroying the domain, read errors from the nbd device and a kernel stack trace (attached) can be observed, and several zombie processes are left behind.

This was discussed on IRC; the gist of the brainstorming:

<danpb> we're putting the qemu-nbd process into the same cgroup  as the rest of the container
<danpb> and thus just relying on all pids in the cgroup being purged
<danpb> there's nothing that ensures we kill qemu-nbd  last
<cbosdonnat> danpb, hum... that would be interesting to try that indeed
<danpb> so any process in the container could still be reading/writing files in the mount on top of the  NBD volume at the time qemu-nbd is killed
<danpb> i think we need to take qemu-nbd out of the cgroup and  use  qemu-nbd -d   to explicitly terminate it at the right time
<danpb> this would also solve the memory pressure deadlocks we sometimes can hit
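The fix danpb proposes can be sketched in shell. This is only an illustration of the intended ordering, not libvirt's actual implementation; the device path and mount point below are assumptions, and libvirt would perform the equivalent steps internally.

```shell
# Sketch of the proposed fix: keep qemu-nbd OUT of the container's
# cgroup and detach it explicitly, in order, at shutdown.
# /dev/nbd0 and /mnt/ctr-root are assumed paths for illustration.
IMG=/var/local/some_disk.qcow2
DEV=/dev/nbd0

# Domain start: attach the qcow2 image to an NBD device, then mount it.
qemu-nbd --connect="$DEV" --format=qcow2 "$IMG"
mount "$DEV" /mnt/ctr-root

# ... container runs with its root filesystem on $DEV ...

# Domain stop: terminate container processes first, unmount, and only
# THEN disconnect the NBD device, so no process is still reading or
# writing the mount when qemu-nbd goes away.
umount /mnt/ctr-root
qemu-nbd -d "$DEV"    # explicit, correctly ordered disconnect
```

The key point is that `qemu-nbd -d` runs last, rather than qemu-nbd being killed in arbitrary order alongside the container's other processes during cgroup teardown.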

Version-Release number of selected component (if applicable):

libvirt-daemon-lxc-3.7.0-4.fc27.x86_64
Linux somebox 4.15.17-300.fc27.x86_64 #1 SMP Thu Apr 12 18:19:17 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:
always, using the Alpine 3.7 container

Steps to Reproduce:
1. start the container using a root filesystem on an nbd-attached qcow2 image
2. stop the container
3. observe the problem
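The steps above can be sketched with virsh as follows. The domain name vm_1 is taken from comment 1, and lxc:/// is the standard libvirt LXC driver connection URI; the rest assumes the domain XML from the description is already defined.

```shell
# Rough reproduction sketch (assumes the domain is defined as 'vm_1').
virsh -c lxc:/// start vm_1

# ... let the container boot ...

virsh -c lxc:/// destroy vm_1            # triggers the failure

# Observe the fallout on the host:
dmesg | tail -n 50                       # nbd read errors + stack trace
ps axo stat,pid,comm | awk '$1 ~ /^Z/'   # leftover zombie processes
```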

Actual results:
- domain does not completely stop
- no "stopped" event seen
- several processes listed as zombies on the host
- kernel stack trace

Expected results:
- clean shutdown
- event "stopped" emitted
- no hanging processes


Additional info:
see attachment

Comment 1 ralph.schmieder 2018-04-23 17:17:52 UTC
When destroying the domain, the following can additionally be seen:

error: Failed to destroy domain vm_1
error: internal error: Some processes refused to die

Comment 2 Daniel Berrangé 2024-12-17 12:25:32 UTC
Thank you for reporting this issue to the libvirt project. Unfortunately we have been unable to resolve this issue due to insufficient maintainer capacity and it will now be closed. This is not a reflection on the possible validity of the issue, merely the lack of resources to investigate and address it, for which we apologise. If you none the less feel the issue is still important, you may choose to report it again at the new project issue tracker https://gitlab.com/libvirt/libvirt/-/issues The project also welcomes contribution from anyone who believes they can provide a solution.