Bug 1926589 - "Too many open files" in vdsm.log after 380 migrations
Summary: "Too many open files" in vdsm.log after 380 migrations
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: SuperVDSM
Version: 4.40.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.3
Assignee: Nir Soffer
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-09 06:54 UTC by Yedidyah Bar David
Modified: 2022-04-20 06:32 UTC
CC: 8 users

Fixed In Version: vdsm-4.50.0.3
Clone Of:
Environment:
Last Closed: 2021-12-23 20:59:48 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?


Attachments
Partial strace (335.15 KB, application/gzip)
2021-02-09 10:20 UTC, Yedidyah Bar David


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 113395 0 master ABANDONED supervdsm_server: Fix fd leak in _runAs 2021-02-15 12:40:00 UTC
oVirt gerrit 113401 0 master ABANDONED zombiereeaper: Disable for testing fd leak 2021-02-14 09:45:29 UTC
oVirt gerrit 113472 0 master MERGED commands: Support asynchronous wait 2021-10-14 14:55:12 UTC
oVirt gerrit 113473 0 master MERGED hba: Use commands.wait_async() 2021-10-14 18:30:10 UTC
oVirt gerrit 113474 0 master MERGED iscsiadm: Use commands.wait_async() 2021-10-14 18:30:13 UTC
oVirt gerrit 113475 0 master MERGED supervdsm_server: Simplify communication with child 2021-10-14 18:30:15 UTC
oVirt gerrit 113476 0 master MERGED supervdsm_server: Remove temporary variables 2021-10-14 18:30:18 UTC
oVirt gerrit 113477 0 master MERGED supervdsm_server: Remove safe_poll 2021-10-14 18:30:20 UTC
oVirt gerrit 113478 0 master MERGED supervdsm_server: Simplify error handling 2021-10-14 18:30:22 UTC
oVirt gerrit 113479 0 master MERGED supervdsm_server: Use commands.wait_async() 2021-10-14 18:30:25 UTC
oVirt gerrit 113480 0 master MERGED v2v: Always wait for virt-v2v 2021-10-15 22:39:48 UTC
oVirt gerrit 113481 0 master MERGED zombiereaper: Reap out last traces of it 2021-10-19 14:43:59 UTC
oVirt gerrit 117087 0 master MERGED supervdsm_server: Fix race when runAs times out 2021-10-14 18:30:27 UTC

Description Yedidyah Bar David 2021-02-09 06:54:10 UTC
Description of problem:

A migration loop succeeded for 380 iterations and then failed, with this in vdsm.log:

2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could not connect to storageServer (hsm:2446)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2443, in connectStorageServer
    conObj.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 449, in connect
    return self._mountCon.connect()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 171, in connect
    self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", line 210, in mount
    cgroup=cgroup)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in mount
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
OSError: [Errno 24] Too many open files

Some more details in [1]. I am currently trying to reproduce along with parallel loops that log ls -l /proc/*/fd/*. Will attach logs if it happens again.

[1] https://lists.ovirt.org/archives/list/devel@ovirt.org/thread/U353TBTVJIC7VXKXZ4WUWXLA5HFWZLWT/

Comment 1 Yedidyah Bar David 2021-02-09 10:20:02 UTC
It seems like supervdsm is leaking FDs, based on checking the number of open files over time. To get an idea of what code might be leaking them, I ran strace on it [1]. I picked at random one fd that seemed to have never been closed (178), opened by pid 45415 (one of the children of 28956), and am attaching the result of [2].

[1] strace -s 4096 -tt -f -p 28956 -o /tmp/supervdsm-strace1.log

[2] grep -C1000 '^45415\>.*\<178\>' /tmp/supervdsm-strace1.log

Comment 2 Yedidyah Bar David 2021-02-09 10:20:46 UTC
Created attachment 1755893 [details]
Partial strace

Comment 3 Yedidyah Bar David 2021-02-09 10:29:00 UTC
The "problem" seems like it might have been here:

45415 09:38:44.185890 pipe2([178, 179], O_CLOEXEC) = 0
45415 09:38:44.185916 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd935ffb9d0) = 580072
580072 09:38:44.186785 set_robust_list(0x7fd935ffb9e0, 24) = 0
45415 09:38:44.186935 close(179)        = 0
45415 09:38:44.187083 getpid()          = 28956
45415 09:38:44.187474 poll([{fd=44, events=POLLIN}], 1, 60000 <unfinished ...>

Meaning, we called pipe2, which opened two FDs 178 and 179, and we closed 179, but never closed 178. Why? Perhaps (later):

45415 09:38:44.201062 kill(580072, SIGTERM <unfinished ...>
580072 09:38:44.201074 <... read resumed>"\200\3X\3\0\0\0Byeq\0.", 13) = 13
45415 09:38:44.201079 <... kill resumed>) = 0
580072 09:38:44.201092 --- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=28956, si_uid=0} ---
45415 09:38:44.201097 wait4(580072,  <unfinished ...>
580072 09:38:44.201109 getpid( <unfinished ...>
45415 09:38:44.201113 <... wait4 resumed>0x7fd935ff8d44, WNOHANG, NULL) = 0
580072 09:38:44.201152 <... getpid resumed>) = 580072
45415 09:38:44.201158 wait4(580072,  <unfinished ...>
580072 09:38:44.201167 write(10, "\17", 1 <unfinished ...>
45415 09:38:44.201172 <... wait4 resumed>0x7fd935ff9014, WNOHANG, NULL) = 0
580072 09:38:44.201211 <... write resumed>) = 1
45415 09:38:44.201216 close(118 <unfinished ...>
28956 09:38:44.201220 <... poll resumed>) = 1 ([{fd=9, revents=POLLIN}])
580072 09:38:44.201259 rt_sigreturn({mask=[]} <unfinished ...>
45415 09:38:44.201266 <... close resumed>) = 0
28956 09:38:44.201270 read(9,  <unfinished ...>
580072 09:38:44.201301 <... rt_sigreturn resumed>) = 13
45415 09:38:44.201306 close(44 <unfinished ...>
28956 09:38:44.201310 <... read resumed>"\17", 128) = 1
580072 09:38:44.201456 exit_group(0 <unfinished ...>
45415 09:38:44.201463 <... close resumed>) = 0
28956 09:38:44.201467 poll([{fd=9, events=POLLIN}], 1, -1 <unfinished ...>
580072 09:38:44.201790 <... exit_group resumed>) = ?
45415 09:38:44.201799 getpid( <unfinished ...>
580072 09:38:44.202672 +++ exited with 0 +++
45415 09:38:44.202688 <... getpid resumed>) = 28956
45415 09:38:44.202701 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=580072, si_uid=107, si_status=0, si_utime=0, si_stime=0} ---
28956 09:38:44.202708 <... poll resumed>) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
45415 09:38:44.202721 getpid( <unfinished ...>
28956 09:38:44.202726 restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
45415 09:38:44.202735 <... getpid resumed>) = 28956

I do not know vdsm's code, but searching a bit, I think this is the code doing this (in lib/vdsm/supervdsm_server.py):

        pipe, hisPipe = Pipe()
        with closing(pipe), closing(hisPipe):
            proc = Process(target=child, args=(hisPipe,))

Searching the net for 'python "with statement" with multiple does not call __exit__' finds [1].

So perhaps it's because we get the SIGCHLD and, per [1], this causes us to not call __exit__ on hisPipe.

There are quite a few guesses here. Not sure if this is at all relevant.

[1] https://bugs.python.org/issue29988

Comment 4 Nir Soffer 2021-02-09 11:27:57 UTC
(In reply to Yedidyah Bar David from comment #3)

Thanks for this research!

> I do not know vdsm's code, but searching a bit I think this is the code
> doing this (in lib/vdsm/supervdsm_server.py)
> 
>         pipe, hisPipe = Pipe()
>         with closing(pipe), closing(hisPipe):
>             proc = Process(target=child, args=(hisPipe,))
> 
> Searching the net for 'python "with statement" with multiple does not call
> __exit__' finds me [1].
> 
> So perhaps it's because we get the SIGCHLD and this, with [1], causes us to
> not call __exit__ on hisPipe.
> 
> There are quite many guesses here. Not sure if this is at all relevant.
> 
> [1] https://bugs.python.org/issue29988

The issue seems to exist even with a single with statement. This is a very bad Python
bug; we assume that "with" works and rely on it everywhere, like most Python code.

The main issue in vdsm and supervdsm is that we use SIGCHLD to reap child processes
asynchronously, and this means any code in vdsm can be interrupted by signals during
normal flows.
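
To illustrate, here is a minimal sketch of this async-reaping pattern (an illustration only, not vdsm's actual zombiereaper module): a SIGCHLD handler reaps exited children with waitpid(WNOHANG), so the signal can interrupt whatever the process happens to be doing at the time.

    import os
    import signal
    import subprocess
    import time

    def _reap_children(signum, frame):
        # Called on every SIGCHLD; reap any exited child without blocking.
        # Because the handler fires asynchronously, it can interrupt whatever
        # code was running, including fd cleanup paths.
        while True:
            try:
                pid, _ = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return

    signal.signal(signal.SIGCHLD, _reap_children)

    # Any child exiting from now on interrupts the main process with SIGCHLD.
    subprocess.Popen(["true"])
    time.sleep(0.5)  # the handler reaps the child in the background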

Typically servers accept only termination signals, so these races are not
a practical problem. A server must be able to cope with abnormal termination,
so bad signal handling during shutdown is not a practical issue.

We use async reaping in these cases:

def _rescan():
    """
    Called from supervdsm to perform rescan as root.
    """
    timeout = config.getint('irs', 'scsi_rescan_maximal_timeout')

    p = commands.start(
        [constants.EXT_FC_SCAN],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        zombiereaper.autoReapPID(p.pid)
        raise Error("Timeout scanning (pid=%s)" % p.pid)

    if p.returncode != 0:
        raise Error("Scan failed with rc=%s out=%r err=%r"
                    % (p.returncode, out, err))

If we don't use that, a blocked fc-scan can block storage code for unlimited
time, since the operation takes a global storage lock. But I don't think it
is a good idea to leave a blocked scan running and try to do other scans, since
the other scan will block anyway in the kernel.

I think this should be solved by moving this code to a single thread that will
perform scans serially, and will wait until the current scan ends.

If we have a single thread doing scans, it will be busy waiting for the child.
Callers will wait on threading.Event() with a timeout and will have the same
semantics (scan timed out). We will have a new condition - a scan that fails
because the scanner thread is busy. This is similar in semantics to a new scan
blocked in the kernel, so it can be handled in the same way.
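
A rough, simplified sketch of that idea (hypothetical names, not vdsm code): each scan runs in a worker that keeps waiting for the child, only one scan may run at a time, and callers wait on a threading.Event() with a timeout.

    import subprocess
    import threading

    class SerialScanner:
        """Run one scan at a time; callers wait with a timeout."""

        def __init__(self, cmd):
            self._cmd = cmd
            self._lock = threading.Lock()
            self._busy = False

        def scan(self, timeout):
            with self._lock:
                if self._busy:
                    # Same semantics as a new scan blocked in the kernel.
                    raise RuntimeError("scanner busy")
                self._busy = True

            done = threading.Event()
            result = {}

            def worker():
                try:
                    # The worker always waits for the child, so no zombie is
                    # left behind and no SIGCHLD handler is needed.
                    result["rc"] = subprocess.call(self._cmd)
                finally:
                    with self._lock:
                        self._busy = False
                    done.set()

            threading.Thread(target=worker, daemon=True).start()
            if not done.wait(timeout):
                raise TimeoutError("scan timed out (scanner still waiting)")
            return result["rc"]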


def session_rescan(timeout=None):
    # Note: keeping old behaviour of taking the module lock while starting the
    # command, and releasing the lock while the scan command is running. This
    # looks like a bug since the purpose of the lock is preventing concurrent
    # iscsiadm commands, but taking a lock for the entire operation may cause
    # bigger issues.

    args = [constants.EXT_ISCSIADM, "-m", "session", "-R"]

    with _iscsiadmLock:
        p = commands.start(
            args,
            sudo=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE)

    try:
        out, err = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        zombiereaper.autoReapPID(p.pid)
        raise IscsiSessionRescanTimeout(p.pid, timeout)

Same issue as fc-scan; the same solution can be applied: a single thread doing
these scans.

    def _abort(self):
        self._aborted = True
        if self._proc is None:
            logging.warning(
                'Ignoring request to abort job %r; the job failed to start',
                self._id)
            return
        if self._proc.returncode is None:
            logging.debug('Job %r killing virt-v2v process', self._id)
            try:
                self._proc.kill()
            except OSError as e:
                if e.errno != errno.ESRCH:
                    raise
                logging.debug('Job %r virt-v2v process not running',
                              self._id)
            else:
                logging.debug('Job %r virt-v2v process was killed',
                              self._id)
            finally:
                for pid in self._proc.pids:
                    zombiereaper.autoReapPID(pid)

This may leave a zombie virt-v2v process accessing storage. Certainly not a
good idea. I think this usage should be removed, and we should just wait
for the child process.
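
A hedged sketch of that alternative for the abort path (illustrative only; it assumes a plain subprocess.Popen, unlike vdsm's actual wrapper object, and is not the merged patch):

    import errno

    def abort_job(proc):
        # Kill the running virt-v2v process and wait for it, instead of
        # handing the pid to zombiereaper.
        if proc.returncode is None:
            try:
                proc.kill()
            except OSError as e:
                # Process already gone.
                if e.errno != errno.ESRCH:
                    raise
        # Waiting reaps the child synchronously, so no SIGCHLD handler is
        # needed and no zombie is left behind.
        proc.wait()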

    def _runAs(self, user, groups, func, args=(), kwargs={}):
        def child(pipe):
            res = ex = None
            try:
                uid = resolveUid(user)
                if groups:
                    gids = [resolveGid(g) for g in groups]

                    os.setgid(gids[0])
                    os.setgroups(gids)
                os.setuid(uid)

                res = func(*args, **kwargs)
            except BaseException as e:
                ex = e

            pipe.send((res, ex))
            pipe.recv()

        pipe, hisPipe = Pipe()
        with closing(pipe), closing(hisPipe):
            proc = Process(target=child, args=(hisPipe,))
            proc.start()

            needReaping = True
            try:
                if not safe_poll(pipe, RUN_AS_TIMEOUT):
                    try:
                        os.kill(proc.pid, signal.SIGKILL)
                    except OSError as e:
                        # Don't add to zombiereaper if the PID no longer exists
                        if e.errno == errno.ESRCH:
                            needReaping = False
                        else:
                            raise

                    raise Timeout()

                res, err = pipe.recv()
                pipe.send("Bye")
                proc.terminate()

                if err is not None:
                    raise err

                return res

            finally:
                # Add to zombiereaper if process has not been waited upon
                if proc.exitcode is None and needReaping:
                    zombiereaper.autoReapPID(proc.pid)

Here again there is no good reason to leave a zombie around; we should
wait for the child.
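
For example, a simplified sketch of a _runAs-like helper that always reaps its child (hypothetical code, not the merged fix):

    import os
    import signal
    from contextlib import closing
    from multiprocessing import Pipe, Process

    def run_as_and_wait(child, timeout):
        # Run child(pipe) in a subprocess and always reap it synchronously,
        # so no zombiereaper/SIGCHLD handler is needed. The child is expected
        # to send (result, error) and then wait for a final message, as in
        # the code quoted above.
        pipe, his_pipe = Pipe()
        with closing(pipe), closing(his_pipe):
            proc = Process(target=child, args=(his_pipe,))
            proc.start()
            try:
                if not pipe.poll(timeout):
                    os.kill(proc.pid, signal.SIGKILL)
                    raise TimeoutError("child timed out")
                res, err = pipe.recv()
                pipe.send("Bye")
                if err is not None:
                    raise err
                return res
            finally:
                # join() waits for and reaps the child on every path,
                # including timeout and error, so no zombie is left around.
                proc.join()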

If we remove the 4 places using zombiereaper, we can stop registering
SIGCHLD, and this should avoid the fd leaks and remove some complexity
from vdsm.

However, this is not a trivial change for the storage code performing scans.

Since 3 of the uses of zombiereaper are related to storage, we can move this bug
to storage.

The virt-v2v case should be handled by the virt team.

Regardless of the vdsm issues, I think we need a RHEL bug - if this is solved
upstream in Python 3.9, we would like a backport to Python 3.6.

Marcin, what do you think?

Comment 5 Nir Soffer 2021-02-09 12:01:50 UTC
I posted a quick minimal fix - can you try it with your reproducer?

Build should be available here:
https://jenkins.ovirt.org/job/vdsm_standard-check-patch/26535/

Comment 6 Nir Soffer 2021-02-10 14:49:10 UTC
Summary of what we found so far.

The tested flow moves the host running the hosted engine to local maintenance
mode and back to normal mode. In this flow, we move the host to maintenance,
which causes all running VMs to migrate to the other host.

According to Didi:

> > I suppose normally this should not happen more than once every few
> > months, for the HE VM. I suspect the fd leak might affect also
> > non-HE-VM migrations, which likely happen more often.

This is not a common flow, so if we have a small fd leak during this flow it will
take a lot of time to reach this state.

When supervdsmd exceeds the maximum number of open files, it is basically disabled,
since most of the operations it performs run commands, and we need 4 file
descriptors to run any command. I don't think the system can recover from
this state. I'll open another bug to track this issue.

We have a simple workaround if we reach this state: restart the vdsmd and
supervdsmd services.

My theory was that the reason for the leak is a SIGCHLD signal received during
cleanup, causing file descriptors to leak. The reason we get such signals is
that we register a signal handler and reap terminated processes that were not
waited for in some flows.


What we tested:

- https://gerrit.ovirt.org/c/vdsm/+/113395 use try-finally instead
  of with to cleanup file descriptors in validateAccess flow. This
  change does not seem to have any effect.

- https://gerrit.ovirt.org/c/vdsm/+/113401 disable zombiereaper so
  vdsm and supervdsm do not get a SIGCHLD signal for every child
  process. This seems to solve the leak, but the patch is not
  correct as is; more work is needed to do this properly.


How to reproduce the issue:

On a host (or VM) set up for running OST:

    export OST_IMAGES_DISTRO=el8

    . lagofy.sh he-basic-suite-master

    while true; do
        run_tc he-basic-suite-master/test-scenarios/012_local_maintenance_sdk_pytest.py
    done

To watch file descriptors we add this script on both hosts:

    # cat count-supervdsm-fds.sh
    while true; do
        pid=$(pidof supervdsmd)
        fds=$(ls -1 /proc/$pid/fd | wc -l)
        echo "$(date) pid=$pid fds=$fds"
        sleep 10
    done

And run it:

    # nohup sh count-supervdsm-fds.sh > count-supervdsm-fds.log &

Didi also collected output of ls -lh /proc/*/fd to files.


The test was started around Tue Feb  9 14:55.

On Tue Feb  9 19:54 we upgraded vdsm on host 1 to a version
disabling zombiereaper (https://gerrit.ovirt.org/c/vdsm/+/113401).

On Tue Feb  9 21:56 the test started to fail. According to
Didi, the way this test is written, it will continue to
fail on subsequent runs, so the tests effectively ended at
that point.


On host 0 we see:

[root@lago-he-basic-suite-master-host-0 ~]# awk 'NR % 360 == 1' count-supervdsm-fds.log
Tue Feb  9 14:55:06 CET 2021 pid=754314 fds=120
Tue Feb  9 15:55:16 CET 2021 pid=754314 fds=199
Tue Feb  9 16:55:27 CET 2021 pid=754314 fds=272
Tue Feb  9 17:55:37 CET 2021 pid=754314 fds=335
Tue Feb  9 18:55:48 CET 2021 pid=754314 fds=401
(Upgraded vdsm on other host here)
Tue Feb  9 19:55:58 CET 2021 pid=754314 fds=468
Tue Feb  9 20:56:08 CET 2021 pid=754314 fds=539
Tue Feb  9 21:56:18 CET 2021 pid=754314 fds=550
(Test started to fail here)
Tue Feb  9 22:56:28 CET 2021 pid=754314 fds=550
...
Wed Feb 10 11:58:17 CET 2021 pid=754314 fds=550


On host 1 we see:

[root@lago-he-basic-suite-master-host-1 ~]# awk 'NR % 360 == 1' count-supervdsm-fds.log
Tue Feb  9 15:04:27 CET 2021 pid=741766 fds=130
Tue Feb  9 16:04:37 CET 2021 pid=741766 fds=205
Tue Feb  9 17:04:47 CET 2021 pid=741766 fds=277
Tue Feb  9 18:04:58 CET 2021 pid=741766 fds=354
Tue Feb  9 19:05:08 CET 2021 pid=741766 fds=427
(Upgraded vdsm here)
Tue Feb  9 20:05:18 CET 2021 pid=1006247 fds=30
Tue Feb  9 21:05:28 CET 2021 pid=1006247 fds=28
(Test started to fail here)
Tue Feb  9 22:05:38 CET 2021 pid=1006247 fds=29
...
Wed Feb 10 11:07:42 CET 2021 pid=1006247 fds=29


So clearly disabling zombiereaper fixes this leak.



What kind of file descriptors do we leak?

Host 0:

Tue Feb  9 14:55:06 CET 2021 pid=754314 fds=120

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-145507.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     95 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb  9 17:55:37 CET 2021 pid=754314 fds=335

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-175531.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    309 pipe
     17 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb  9 21:56:18 CET 2021 pid=754314 fds=550

# awk '/754314/ {print $11}' ls-l-proc-fd-20210209-215617.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    525 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#696969
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Host 1:

Tue Feb  9 15:04:27 CET 2021 pid=741766 fds=130

# awk '/741766/ {print $11}' ls-l-proc-fd-20210209-150429.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    105 pipe
     16 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb  9 19:05:08 CET 2021 pid=741766 fds=427

# awk '/741766/ {print $11}' ls-l-proc-fd-20210209-190501.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
    401 pipe
     17 socket
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb  9 20:05:18 CET 2021 pid=1006247 fds=30

# awk '/1006247/ {print $11}' ls-l-proc-fd-20210209-200510.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     17 socket
      4 pipe
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null

Tue Feb  9 21:05:28 CET 2021 pid=1006247 fds=28

# awk '/1006247/ {print $11}' ls-l-proc-fd-20210209-215612.log | awk -F ":" '{print $1}' | sort | uniq -c | sort -rn
     17 socket
      3 pipe
      3 anon_inode
      1 /var/log/vdsm/supervdsm.log
      1 /var/lib/sss/mc/passwd
      1 /var/lib/sss/mc/group
      1 /tmp/#697104
      1 /run/vdsm/supervdsmd.lock
      1 /dev/null


So we leak only pipes. This is likely from code running commands.
We create 2 pipes for every command, one for stdout, one for stderr.

My guess is that receiving a signal while running a command causes
cleanup code in vdsm or the Python subprocess module to fail and
leave file descriptors open.
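
For example, each command started with piped stdout and stderr holds two extra pipe read ends in the parent until they are closed (a quick check, assuming Linux and /proc):

    import os
    import subprocess

    def count_fds():
        # Number of open file descriptors of this process (Linux only).
        return len(os.listdir("/proc/self/fd"))

    before = count_fds()
    p = subprocess.Popen(["true"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print("after start:", count_fds() - before)         # typically +2
    p.communicate()   # reads to EOF, closes both pipes, reaps the child
    print("after communicate:", count_fds() - before)   # back to 0 if nothing leaked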

To understand this leak better, we can attach strace to supervdsm and
check what happens when syscalls fail with EINTR. I'm not sure it is
worth the effort, since removing zombiereaper seems to eliminate the
root cause of this issue and make vdsm more robust.


In summary we have:

- A pipe leak, caused by using zombiereaper to wait for some commands.
  Should be fixed by ripping out zombiereaper from vdsm.

- supervdsm does not recover from the "Too many open files" error. It can
  recover from this by terminating; systemd will restart the service.
  Will be tracked in another bug.
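
One way to remove the zombiereaper dependency is to reap timed-out children from a background thread instead of a SIGCHLD handler (this is the direction of the "commands: Support asynchronous wait" patches linked above). A minimal sketch of that idea, assuming a hypothetical helper rather than vdsm's actual commands.wait_async() API:

    import threading

    def wait_async(proc):
        # Wait for and reap `proc` (a subprocess.Popen) from a background
        # thread, so the caller can give up after a timeout without leaving
        # a zombie and without installing a SIGCHLD handler.
        threading.Thread(
            target=proc.wait,
            name="wait-%s" % proc.pid,
            daemon=True,
        ).start()

    # Usage in a timeout path, replacing zombiereaper.autoReapPID(p.pid):
    #
    #     except subprocess.TimeoutExpired:
    #         wait_async(p)
    #         raise Error("Timeout scanning (pid=%s)" % p.pid)

Callers that opened stdout/stderr pipes still need to close or drain them themselves; the helper only guarantees the child is reaped.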

Comment 7 Nir Soffer 2021-02-10 14:55:03 UTC
Tal, this is a bug in supervdsm, caused by using zombiereaper, which
is maintained by the infra team. But the reason we have zombiereaper
is these 4 flows:
- fc scan
- iscsi scan
- validating access 
- running virt-v2v

To fix this issue we need to remove the usage of zombiereaper from
the storage code and the virt code, and then remove zombiereaper from vdsm.

Most of the work is storage related. We can file separate bugs
for the virt-v2v and infra work.

Comment 8 Martin Perina 2021-02-11 10:27:18 UTC
I don't think we need separate bugs for each team; let's just cooperate and sync on this single bug. It's worth delivering the fix at once and not in parts.

Comment 9 Nir Soffer 2021-02-14 10:00:33 UTC
(In reply to Martin Perina from comment #8)
> I don't think we need separate bugs for each team; let's just cooperate and
> sync on this single bug. It's worth delivering the fix at once and not in parts.

I posted patches for fixing both storage and virt issues:
https://gerrit.ovirt.org/q/topic:kill-zombiereaper+is:open

I will need reviews from Milan and Marcin for the relevant
parts.

This needs more testing, but I think we can plan this for 4.4.6.

Comment 10 Sandro Bonazzola 2021-07-05 12:53:56 UTC
We are past the 4.4.7 dev freeze and the bug is not marked as a blocker; moving it out to 4.4.8.

Comment 11 D. White 2021-08-14 01:26:49 UTC
Hi,
In a thread on the oVirt mailing list, Nir requested that I share my experience with vdsm that may be related to this same bug (cf. https://lists.ovirt.org/archives/list/users@ovirt.org/thread/YVD2IDBACJX4CILGSCW77WZEWAB2TLNX/#DR7CKPUEHQZSUMFSW3HSJLHE5V5PUIBY).

My environment is a 3-node HCI cluster with Gluster storage, but right now, Gluster is only running on 1 host (so no replication).

I was hoping to upgrade / replace the storage on my 2nd host today, but aborted that maintenance when I found that I couldn't even get into the Manager. I should probably mention that I think I may have screwed up the lvm when I rebuilt the RAID array, as I didn't specify the inode size in `mkfs.xfs -i` ... so the new volumes have the default inode size (256), and not what they should be (512). To correct that, I guess I'll specify the correct inode size on the 2nd host, and then once I finish getting the storage upgraded on my 2nd host, I'll  go back to the first, and re-build it again.

I'm running on RHEL 8.4, but I don't have a support subscription (even though I don't have a support subscription, I do see the ability to open a support ticket, so am happy to do so if that would be helpful at all).

As Nir suggested, I restarted the vdsmd service on all 3 hosts. That appears to have fixed the "storage" issue, and I now am able to run "hosted-engine --vm-status" and see some expected stdout. 
Unfortunately, my hosted-engine is still not starting (although I can now clearly connect to the hosted-engine storage), and I see this output every time I try to start the hosted-engine:

[root@cha2-storage ~]# hosted-engine --vm-start
Command VM.getStats with args {'vmID': 'ffd77d79-a699-455e-88e2-f55ee53166ef'} failed:
(code=1, message=Virtual machine does not exist: {'vmId': 'ffd77d79-a699-455e-88e2-f55ee53166ef'})
VM in WaitForLaunch

Prior to restarting vdsm on all 3 hosts, I generated an sosreport on two of my hosts, and am happy to provide that for further troubleshooting. 
In at least one of these cases, I noticed the following in stdout when I generated the sosreport: 
[plugin:vdsm] command 'vdsm-client Host getConnectedStoragePools' timed out after 300s
[plugin:vdsm] command 'vdsm-client Host getStorageDomains' timed out after 300s

I see this in /var/log/vdsm/vdsm.log:

> 2021-08-13 14:08:10,844-0400 ERROR (Reactor thread) [ProtocolDetector.AcceptorImpl] Unhandled exception in acceptor (protocoldetector:76)
> Traceback (most recent call last):
>   File "/usr/lib64/python3.6/asyncore.py", line 108, in readwrite
>   File "/usr/lib64/python3.6/asyncore.py", line 417, in handle_read_event
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 57, in handle_accept
>   File "/usr/lib/python3.6/site-packages/yajsonrpc/betterAsyncore.py", line 173, in _delegate_call
>   File "/usr/lib/python3.6/site-packages/vdsm/protocoldetector.py", line 53, in handle_accept
>   File "/usr/lib64/python3.6/asyncore.py", line 348, in accept
>   File "/usr/lib64/python3.6/socket.py", line 205, in accept
> OSError: [Errno 24] Too many open files

The storage is mounted, but here's what I see:

[root@cha2-storage dwhite]# hosted-engine --vm-status
The hosted engine configuration has not been retrieved from shared storage. Please ensure that ovirt-ha-agent is running and the storage server is reachable.

[root@cha2-storage dwhite]# systemctl status ovirt-ha-agent
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-08-13 11:10:51 EDT; 2h 44min ago
Main PID: 3591872 (ovirt-ha-agent)
    Tasks: 1 (limit: 409676)
   Memory: 21.5M
   CGroup: /system.slice/ovirt-ha-agent.service
           └─3591872 /usr/libexec/platform-python /usr/share/ovirt-hosted-engine-ha/ovirt-ha-agent

Aug 13 11:10:51 cha2-storage.mgt.example.com systemd[1]: Started oVirt Hosted Engine High Availability Monitoring Agent.

Any time I try to do anything like connect the engine storage, disconnect the engine storage, or connect to the console, it just sits there and doesn't do anything, and I eventually have to ctrl-c out of it. Maybe I have to be patient? When I ctrl-c, I get a traceback error:

[root@cha2-storage dwhite]# hosted-engine --console
^CTraceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 214, in <module>
    args.command(args)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 42, in func
    f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_setup/vdsm_helper.py", line 91, in checkVmStatus
    cli = ohautil.connect_vdsm_json_rpc()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 472, in connect_vdsm_json_rpc
    __vdsm_json_rpc_connect(logger, timeout)
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 395, in __vdsm_json_rpc_connect
    timeout=timeout)
  File "/usr/lib/python3.6/site-packages/vdsm/client.py", line 154, in connect
    outgoing_heartbeat=outgoing_heartbeat, nr_retries=nr_retries)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 426, in SimpleClient
    nr_retries, reconnect_interval)
  File "/usr/lib/python3.6/site-packages/yajsonrpc/stompclient.py", line 448, in StandAloneRpcClient
    client = StompClient(utils.create_connected_socket(host, port, sslctx),
  File "/usr/lib/python3.6/site-packages/vdsm/utils.py", line 379, in create_connected_socket
    sock.connect((host, port))
  File "/usr/lib64/python3.6/ssl.py", line 1068, in connect
    self._real_connect(addr, False)
  File "/usr/lib64/python3.6/ssl.py", line 1059, in _real_connect
    self.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 1036, in do_handshake
    self._sslobj.do_handshake()
  File "/usr/lib64/python3.6/ssl.py", line 648, in do_handshake
    self._sslobj.do_handshake()


This is what I see in /var/log/ovirt-hosted-engine-ha/broker.log:

MainThread::WARNING::2021-08-11 10:24:41,596::storage_broker::100::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(__init__) Can't connect vdsm storage: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,596::broker::69::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Failed initializing the broker: Connection to storage server failed
MainThread::ERROR::2021-08-11 10:24:41,598::broker::71::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 64, in run
    self._storage_broker_instance = self._get_storage_broker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/broker.py", line 143, in _get_storage_broker
    return storage_broker.StorageBroker()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 97, in __init__
    self._backend.connect()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py", line 375, in connect
    sserver.connect_storage_server()
  File "/usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/lib/storage_server.py", line 451, in connect_storage_server
    'Connection to storage server failed'
RuntimeError: Connection to storage server failed

MainThread::ERROR::2021-08-11 10:24:41,599::broker::72::ovirt_hosted_engine_ha.broker.broker.Broker::(run) Trying to restart the broker
MainThread::INFO::2021-08-11 10:24:42,439::broker::47::ovirt_hosted_engine_ha.broker.broker.Broker::(run) ovirt-hosted-engine-ha broker 2.4.7 started
MainThread::INFO::2021-08-11 10:24:44,442::monitor::45::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Searching for submonitors in /usr/lib/python3.6/site-packages/ovirt_hosted_engine_ha/broker/submonitors
MainThread::INFO::2021-08-11 10:24:44,443::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load
MainThread::INFO::2021-08-11 10:24:44,449::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor cpu-load-no-engine
MainThread::INFO::2021-08-11 10:24:44,450::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor engine-health
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mem-free
MainThread::INFO::2021-08-11 10:24:44,451::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor mgmt-bridge
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor network
MainThread::INFO::2021-08-11 10:24:44,452::monitor::62::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Loaded submonitor storage-domain
MainThread::INFO::2021-08-11 10:24:44,452::monitor::63::ovirt_hosted_engine_ha.broker.monitor.Monitor::(_discover_submonitors) Finished loading submonitors

AFTER I generated the sosreports, I ran systemctl restart vdsmd on all 3 hosts.
I found the output on 1 of the hosts really interesting when I then followed up by running systemctl status vdsmd:

                                                                ERROR FINISH thread <Thread(event/1, started daemon 139673041090304)> failed
                                                                 Traceback (most recent call last):
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
                                                                     ret = func(*args, **kwargs)
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/clientIF.py", line 256, in contEIOVms
                                                                     vmObj.maybe_resume()
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1636, in maybe_resume
                                                                     self.cont()
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1711, in cont
                                                                     self._underlyingCont()
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4130, in _underlyingCont
                                                                     self._dom.resume()
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
                                                                     ret = attr(*args, **kwargs)
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
                                                                     ret = f(*args, **kwargs)
                                                                   File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
                                                                     return func(inst, *args, **kwargs)
                                                                   File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2370, in resume
                                                                     raise libvirtError('virDomainResume() failed')
                                                                 libvirt.libvirtError: Failed to acquire lock: No space left on device

Let me know how I can help. I'll keep these sosreports for safe keeping.

Comment 12 D. White 2021-08-14 01:35:43 UTC
Sorry to spam the Bugzilla list a 2nd time, but it appears that I just needed to be a little bit patient after restarting vdsmd on all 3 of my hosts.
I can confirm that the hosted-engine is now operational and healthy.

Comment 13 Nir Soffer 2021-10-19 17:20:57 UTC
Didi, do you have an easy way to test this fix, maybe running
your OST test with current vdsm master?

Comment 14 Yedidyah Bar David 2021-11-14 08:22:33 UTC
Ran this over the weekend:

while true; do ost_run_tc he-basic-suite-master/test-scenarios/test_012_local_maintenance_sdk.py;
echo 'Test finished, sleeping'; sleep 10; done

And it finished so far 518 loops.

[root@ost-he-basic-suite-master-host-0 ~]# ps auxww | grep '[s]upervdsm'
root       76939  0.2  0.7 1498284 40028 ?       S<sl Nov02  40:57 /usr/bin/python3 /usr/share/vdsm/supervdsmd --sockfile /run/vdsm/svdsm.sock
[root@ost-he-basic-suite-master-host-0 ~]# ls -l /proc/76939/fd/* | wc -l
38

[root@ost-he-basic-suite-master-host-0 ~]# rpm -q vdsm
vdsm-4.50.0.1-24.git1969ab99c.el8.x86_64

This is slightly old - 1969ab99c was merged Oct 26 - because I reused an existing setup and didn't bother upgrading first - but it seems to be at least newer than all the patches linked from this bug, so probably ok.

Looks good to me. Thanks!

Comment 15 Nir Soffer 2021-11-14 09:53:54 UTC
(In reply to Yedidyah Bar David from comment #14)
Thanks for testing!

I think we can mark this verified, based on your tests.

Comment 17 Sandro Bonazzola 2022-04-20 06:32:24 UTC
This bugzilla is included in oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

