Bug 2027419

Summary: [RFE] Improve the thread model of nbdkit-retry-filter
Product: Red Hat Enterprise Linux 9
Component: nbdkit
Version: 9.0
Hardware: x86_64
OS: Unspecified
Status: CLOSED MIGRATED
Severity: medium
Priority: low
Reporter: mxie <mxie>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Virtualization Bugs <virt-bugs>
Docs Contact:
CC: chhu, eblake, hongzliu, juzhou, lersek, mzhan, rjones, tyan, tzheng, virt-maint, vwu, xiaodwan
Target Milestone: rc
Target Release: ---
Keywords: FutureFeature, MigratedToJIRA, Triaged
Flags: pm-rhel: mirror+
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-07-07 20:51:36 UTC
Type: Feature Request
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description mxie@redhat.com 2021-11-29 15:32:03 UTC
Description of problem:
[RFE] Improve the thread model of nbdkit-retry-filter

Version-Release number of selected component (if applicable):
nbdkit-1.28.2-2.el9.x86_64


How reproducible:
100%

Steps to Reproduce:
1. Check the thread model of nbdkit-vddk-plugin and nbdkit-retry-filter

# nbdkit vddk --dump-plugin | grep thread
max_thread_model=parallel
thread_model=parallel

# nbdkit vddk --filter=retry --dump-plugin | grep thread
max_thread_model=parallel
thread_model=serialize_requests


Actual results:
The retry filter reduces the thread model of the vddk plugin from parallel to serialize_requests during v2v conversion; for details see https://bugzilla.redhat.com/show_bug.cgi?id=2018463#c6


Expected results:
The retry filter should not downgrade the vddk plugin's thread model; ideally it should allow parallel requests to pass through.

Additional info:

Comment 1 Eric Blake 2021-11-29 16:03:30 UTC
Oooh, that's tricky.  Consider what happens if we allow parallel requests:
thread1            thread2
enter retry_pwrite
 - call next->pwrite
                   enter retry_pwrite
                     - call next->pwrite
 - plugin fails
 - want to restart plugin
                     - plugin completes (may be success or failure)

We can't close the old connection to the plugin until all pending operations finish, then when we connect to the new plugin, we must replay ALL pending requests that were not successful prior to the reconnect, which includes queuing up any data to be written, and so forth.  But just because 

Thinking aloud here: we need some sort of gating.  On entry, every command checks if the connection is live (proceed immediately, and increment semaphore) or pending restart (wait for the connection to come back up).  On success or failure after retries expired, decrement semaphore and return.  On failure (with retries still permitted), toggle the pending restart flag, decrement semaphore, then block until notified of successful restart.  It may be easier with a background thread whose job is to wake up when we have marked a pending restart, wait for the semaphore to drop to 0, then perform the restart and notify all pending threads waiting on the restart.
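
A purely illustrative pthread sketch of that gating idea (the names gate_enter, gate_exit and restarter_thread are hypothetical, not nbdkit API; the failing request's own retry loop and the retry limit are omitted for brevity):

/* Requests pass through a gate while the connection is live; the first
 * failure marks a pending restart, and a background thread reopens the
 * plugin once all in-flight requests have drained. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static unsigned in_flight;          /* the "semaphore" from the comment above */
static bool restart_pending;        /* set by the first failing request */

/* Called on entry to every command. */
static void
gate_enter (void)
{
  pthread_mutex_lock (&lock);
  while (restart_pending)           /* wait for the connection to come back */
    pthread_cond_wait (&cond, &lock);
  in_flight++;
  pthread_mutex_unlock (&lock);
}

/* Called when a command completes; failed == true requests a restart. */
static void
gate_exit (bool failed)
{
  pthread_mutex_lock (&lock);
  in_flight--;
  if (failed)
    restart_pending = true;
  pthread_cond_broadcast (&cond);   /* wake the restarter and any waiters */
  pthread_mutex_unlock (&lock);
}

/* Background thread: waits until a restart is pending and the gate has
 * drained, reopens the plugin connection, then releases the waiters. */
static void *
restarter_thread (void *arg)
{
  (void) arg;
  for (;;) {
    pthread_mutex_lock (&lock);
    while (!restart_pending || in_flight > 0)
      pthread_cond_wait (&cond, &lock);
    /* ... close and reopen the plugin connection here ... */
    restart_pending = false;
    pthread_cond_broadcast (&cond);
    pthread_mutex_unlock (&lock);
  }
  return NULL;
}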

Comment 2 Eric Blake 2021-11-29 16:05:09 UTC
(In reply to Eric Blake from comment #1)
> We can't close the old connection to the plugin until all pending operations
> finish, then when we connect to the new plugin, we must replay ALL pending
> requests that were not successful prior to the reconnect, which includes
> queuing up any data to be written, and so forth.  But just because 

Sorry for the incomplete thought.

But just because one thread fails does not mean they all will (although it may be likely: if one thread failed because of a dropped connection, the others will likely fail for the same reason, but we only want one restart, not N restarts).
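
One way to collapse N concurrent failures into a single restart is a connection "generation" counter: each request remembers the generation it was issued on, and only the first failure seen on a given generation performs the reconnect. A minimal sketch, where do_reconnect() is a hypothetical placeholder for reopening the plugin:

#include <stdint.h>
#include <pthread.h>

static pthread_mutex_t gen_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t conn_generation;    /* bumped after every successful reconnect */

extern int do_reconnect (void);     /* hypothetical: reopen the plugin connection */

/* Called by a request that failed; failed_gen is the generation the
 * request was issued on.  Returns 0 if a usable connection now exists. */
static int
maybe_restart (uint64_t failed_gen)
{
  int r = 0;

  pthread_mutex_lock (&gen_lock);
  if (failed_gen == conn_generation) {
    /* First failure seen on this generation: do the reconnect here. */
    r = do_reconnect ();
    if (r == 0)
      conn_generation++;
  }
  /* Otherwise another thread already reconnected; the caller simply
   * retries on the new connection. */
  pthread_mutex_unlock (&gen_lock);
  return r;
}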

Comment 3 Laszlo Ersek 2021-11-30 07:45:15 UTC
Do we know how the VDDK library behaves when it has queued some requests, those requests are executing asynchronously, and the network connection fails? "nbdkit-vddk-plugin" currently collects requests from all nbdkit threads into a queue, then that queue "fans out" to async vddk operations. Will all those async operations gracefully fail (report error in their completion callbacks) when the network breaks? If the VDDK library gets confused in this situation, then we can't wait for all requests after the first failed one to "drain cleanly", so that we can reconnect to the server and resubmit the requests.

Comment 4 Richard W.M. Jones 2021-11-30 16:49:52 UTC
It's VDDK, so the answer is unlikely to be good.  In this case the TCP connection
eventually times out and is closed and subsequent requests fail.  The retry
filter gets around this by doing a complete reopen of the VDDK connection
(ie. going through VixDiskLib_ConnectEx and _Open again).

What's not clear to me is what happens to the ReadAsync and WriteAsync calls
which are in flight at this time, although again, it being VDDK, the behaviour is
unlikely to be a sensible one.

Let me test this.

Comment 9 Laszlo Ersek 2021-12-01 16:34:58 UTC
Thinking aloud to myself -- this is an approach without a background
thread:

enum conn_state {
  CST_GOOD,
  CST_BROKEN,
  CST_GIVE_UP
};

struct retry_state {
  enum conn_state cst;
  unsigned reqs_in_flight;
  unsigned retry_count;
  pthread_mutex_t mutex;
  pthread_cond_t progress;
};

static bool
progress (struct retry_state *rs)
{
  return rs->cst == CST_GOOD ||
         rs->cst == CST_GIVE_UP ||
         rs->reqs_in_flight == 0;
}

static int
process_synch_request (struct retry_state *rs, ...)
{
  int r;
  bool retry = true;

  pthread_mutex_lock (&rs->mutex);

  do {
    bool progress_before_update;

    while (!progress (rs))
      pthread_cond_wait (&rs->progress, &rs->mutex);

    switch (rs->cst) {
    case CST_GOOD:
      rs->reqs_in_flight++;

      pthread_mutex_unlock (&rs->mutex);
      r = request (...);
      pthread_mutex_lock (&rs->mutex);

      progress_before_update = progress (rs);

      rs->reqs_in_flight--;
      if (r == 0) {
        retry = false;
        rs->cst = CST_GOOD;
        rs->retry_count = 0;
      } else if (rs->cst == CST_GOOD)
        rs->cst = CST_BROKEN;

      if (!progress_before_update && progress (rs))
        pthread_cond_broadcast (&rs->progress);

      break;

    case CST_BROKEN:
      assert (rs->reqs_in_flight == 0);
      if (rs->retry_count < max_retry_count) {
        reconnect (...);
        rs->retry_count++;
        rs->cst = CST_GOOD;
      } else
        rs->cst = CST_GIVE_UP;
      break;

    case CST_GIVE_UP:
      r = -EIO;
      retry = false;
      break;
    }
  } while (retry);

  pthread_mutex_unlock (&rs->mutex);

  return r;
}
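
A hypothetical usage sketch, just to show how a filter's data callbacks could drive the state machine above; this is not the real nbdkit filter API, and io_op / retry_filter_op stand in for the elided "request (...)" arguments:

typedef int (*io_op) (void *opaque);   /* stands in for the "request (...)" call */

static struct retry_state rs = {
  .cst = CST_GOOD,
  .mutex = PTHREAD_MUTEX_INITIALIZER,
  .progress = PTHREAD_COND_INITIALIZER,
};

static int
retry_filter_op (io_op op, void *opaque)
{
  /* In the pseudocode above, "request (...)" would become op (opaque)
   * and "reconnect (...)" would reopen the plugin connection. */
  return process_synch_request (&rs, op, opaque);
}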

Comment 10 Laszlo Ersek 2021-12-02 08:58:11 UTC
Not a functional difference, just a simplification / clarification, for comment 9:

@@ -23,11 +23,12 @@
       progress_before_update = progress (rs);
 
       rs->reqs_in_flight--;
+      assert (rs->cst == CST_GOOD || rs->cst == CST_BROKEN);
       if (r == 0) {
         retry = false;
         rs->cst = CST_GOOD;
         rs->retry_count = 0;
-      } else if (rs->cst == CST_GOOD)
+      } else
         rs->cst = CST_BROKEN;
 
       if (!progress_before_update && progress (rs))

Comment 11 Laszlo Ersek 2023-03-16 15:11:41 UTC
In the medium term I'd like to take a stab at this; at a minimum it should be a learning experience for me with nbdkit. Assigning the BZ to myself.