Bug 479226 - [FOCUS] Blast hung on 16hour port bounce test.
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.0
Hardware: x86_64 All
Priority: low Severity: medium
Assigned To: Red Hat Real Time Maintenance
David Sommerseth
Depends On:
Blocks:
Reported: 2009-01-08 01:50 EST by IBM Bug Proxy
Modified: 2016-05-22 19:27 EDT (History)
3 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-09-12 15:19:06 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
patch-scsi_dh-put (1.48 KB, application/octet-stream)
2009-01-08 01:50 EST, IBM Bug Proxy

Description IBM Bug Proxy 2009-01-08 01:50:28 EST
=Comment: #0=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Test:

6 blast threads running continuously.
One of the paths is bounced every 10 minutes in a while loop:
while true
do
    port offline    # placeholder: command that takes the port offline
    sleep 10m
    port online     # placeholder: command that brings the port back online
    sleep 10m
done


This test ran for 16 hours before 2 of the 6 blast threads hung.
The system is up and running, but the blast threads are hung.
=Comment: #1=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
 
Currently I suspect there is some corruption in the request_queue data structure.
Here is my analysis.


PID: 576    TASK: ffff81007e4d2b20  CPU: 3   COMMAND: "kmpathd/3"
 #0 [ffff81007e4d7be0] schedule at ffffffff8128531c
 #1 [ffff81007e4d7c98] io_schedule at ffffffff81285859
 #2 [ffff81007e4d7cb8] get_request_wait at ffffffff8112a0b2
 #3 [ffff81007e4d7d48] __make_request at ffffffff8112b640
 #4 [ffff81007e4d7dd8] generic_make_request at ffffffff8112854f
 #5 [ffff81007e4d7e68] process_queued_ios at ffffffff881c182b
 #6 [ffff81007e4d7e98] run_workqueue at ffffffff8104d5ab
 #7 [ffff81007e4d7ed8] worker_thread at ffffffff8104e425
 #8 [ffff81007e4d7f28] kthread at ffffffff8105144f
 #9 [ffff81007e4d7f48] kernel_thread at ffffffff8100d048


kmpathd is waiting for some free request buffers.
So what happened to all the buffers? Who consumed them?
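For context, here is a paraphrased sketch of the 2.6-era get_request_wait() path seen in
the stack above (simplified, not verbatim kernel source): the task sleeps on q->rq.wait[rw]
until someone returns a request to the pool, so if requests are leaked the sleeper never wakes.

/* Paraphrased sketch of get_request_wait(); simplified, not verbatim source. */
static struct request *get_request_wait(struct request_queue *q, int rw)
{
        struct request_list *rl = &q->rq;
        struct request *rq;

        do {
                DEFINE_WAIT(wait);

                prepare_to_wait_exclusive(&rl->wait[rw], &wait,
                                          TASK_UNINTERRUPTIBLE);
                rq = get_request(q, rw, NULL, GFP_NOIO); /* NULL while the pool is exhausted */
                if (!rq)
                        io_schedule();          /* the io_schedule frame in the stack above */
                finish_wait(&rl->wait[rw], &wait);
        } while (!rq);

        return rq;      /* only returns once blk_put_request() frees a request */
}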

crash> struct request_queue ffff81014df657d8
struct request_queue {
  queue_head = {
    next = 0xffff81014df657d8,
    prev = 0xffff81014df657d8
  },  <<< next==prev==request_queue -> Meaning that the elv_queue_empty() returns TRUE.
  last_merge = 0x0,
  elevator = 0xffff81014ed2d3c0,
  rq = {
    count = {128, 47},
    starved = {0, 0},
    elvpriv = 175,
    rq_pool = 0xffff81014ed2d7c0,
    wait = {{
        lock = { << READ LOCK
          lock = {
            wait_lock = {
              raw_lock = {
                slock = 0
              },
              break_lock = 0
            },
            wait_list = {
              prio_list = {
                next = 0xffff81014df65820,
                prev = 0xffff81014df65820
              },
              node_list = {
                next = 0xffff81014df65830,
                prev = 0xffff81014df65830
              }
            },
            owner = 0x0 << NO owner.  So no one is holding the lock
          },
          break_lock = 0
        },
        task_list = { << Task list is not empty. This is what I suspect may be corrupted.
                            << More analysis below.
          next = 0xffff81007e4d7cf8,
          prev = 0xffff810061eb1ac0

**********
struct __wait_queue {
        unsigned int flags;
#define WQ_FLAG_EXCLUSIVE       0x01
        void *private;
        wait_queue_func_t func;
        struct list_head task_list;
};

                __add_wait_queue_tail(q, wait);

static inline void __add_wait_queue_tail(wait_queue_head_t *head,
                                                wait_queue_t *new)
{
        list_add_tail(&new->task_list, &head->task_list);
}

So task_list sits at offset 0x18 within wait_queue_t, meaning the containing wait_queue_t is at task_list - 0x18.

0xffff81007e4d7cf8 - 0x18 = 0xffff81007e4d7ce0
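As a minimal illustration of that arithmetic, here is a userspace sketch (my addition, assuming
the x86_64 layout of struct __wait_queue shown above):

/* Userspace sketch of the container_of arithmetic; assumes x86_64 padding. */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

struct __wait_queue {                    /* mirrors the kernel definition above */
        unsigned int flags;              /* 4 bytes + 4 bytes of padding */
        void *private;                   /* 8 bytes */
        void *func;                      /* wait_queue_func_t, 8 bytes */
        struct list_head task_list;      /* therefore starts at offset 0x18 */
};

int main(void)
{
        unsigned long task_list_addr = 0xffff81007e4d7cf8UL;

        printf("offsetof(task_list) = %#zx\n",
               offsetof(struct __wait_queue, task_list));                /* 0x18 */
        printf("containing wait_queue_t = %#lx\n",
               task_list_addr - offsetof(struct __wait_queue, task_list));
        return 0;
}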

crash> struct __wait_queue 0xffff81007e4d7ce0
struct __wait_queue {
  flags = 0, 
  private = 0x0, 
  func = 0, 
  task_list = {
    next = 0x0, 
    prev = 0x3bf53af539f538f5
  }
}
<< This appears corrupt. >>>

**********
        }
      }, {
        lock = { << WRITE LOCK
          lock = {
            wait_lock = {
              raw_lock = {
                slock = 3341
              },
              break_lock = 0
            },
            wait_list = {
              prio_list = {
                next = 0xffff81014df65868,
                prev = 0xffff81014df65868
              },
              node_list = {
                next = 0xffff81014df65878,
                prev = 0xffff81014df65878
              }
            },
            owner = 0x0 << NO owner.  So no one is holding the lock
          },
          break_lock = 0
        },
        task_list = {  << Task list is empty.
          next = 0xffff81014df65898,
          prev = 0xffff81014df65898
        }
      }}
  },
  request_fn = 0xffffffff8805abb2,
....
...
}



=Comment: #2=================================================
Michael S. Anderson <andmike@linux.vnet.ibm.com> - 
Question on the comment above. Is the output of __wait_queue correct or a cut-and-paste error?

When I look at the __wait_queue I get the following.

 struct __wait_queue ffff81007e4d7ce0
struct __wait_queue {
  flags = 0x1, 
  private = 0xffff81007e4d2b20, 
  func = 0xffffffff81051573 <autoremove_wake_function>, 
  task_list = {
    next = 0xffff810141081ac0, 
    prev = 0xffff81014df65850
  }
}

This seems valid as private points back to the kmpathd/3 task
=Comment: #3=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
(In reply to comment #2)
> Question on the comment above. Is the output of __wait_queue correct or a
> cut-and-paste error?
> 
> When I look at the __wait_queue I get the following.
> 
>  struct __wait_queue ffff81007e4d7ce0
> struct __wait_queue {
>   flags = 0x1, 
>   private = 0xffff81007e4d2b20, 
>   func = 0xffffffff81051573 <autoremove_wake_function>, 
>   task_list = {
>     next = 0xffff810141081ac0, 
>     prev = 0xffff81014df65850
>   }
> }
> 
> This seems valid as private points back to the kmpathd/3 task
> 

Hrm.. you are right. Last night I was trying this on the 3c24 crash dump. I am debugging both dumps.
Thanks for checking it. :)
=Comment: #4=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Here is more analysis... All the data structures appear intact.
The request_queue is empty. Then why are 3 threads waiting for requests?
Is there some bug in the logic of calling process_queued_ios() and then doing
  if (!must_queue)
                dispatch_queued_ios(m);
?
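For reference, the decision being questioned looks roughly like this in the dm-mpath code of
that era (a simplified, paraphrased sketch, not verbatim source):

/* Paraphrased sketch of dm-mpath's process_queued_ios(); simplified, not verbatim. */
static void process_queued_ios(struct work_struct *work)
{
        struct multipath *m =
                container_of(work, struct multipath, process_queued_ios);
        struct pgpath *pgpath;
        unsigned must_queue = 1;
        unsigned long flags;

        spin_lock_irqsave(&m->lock, flags);

        if (!m->current_pgpath)
                __choose_pgpath(m);

        pgpath = m->current_pgpath;

        /* keep queueing only while path-group init is pending (queue_io)
         * or there is no usable path and queue_if_no_path is set */
        if ((pgpath && !m->queue_io) ||
            (!pgpath && !m->queue_if_no_path))
                must_queue = 0;

        spin_unlock_irqrestore(&m->lock, flags);

        if (!must_queue)
                dispatch_queued_ios(m);         /* reissues the queued bios */
}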

Here is the analysis:

Total 3 processes are waiting for the requests to show up.

struct request_queue {
  queue_head = {
    next = 0xffff81014df657d8,
    prev = 0xffff81014df657d8
  },
...
}

crash> struct __wait_queue 0xffff81007e4d7ce0
struct __wait_queue {
  flags = 1,
  private = 0xffff81007e4d2b20,  >>>>>>> kmpathd/3
  func = 0xffffffff81051573 <autoremove_wake_function>,
  task_list = {
    next = 0xffff810141081ac0,
    prev = 0xffff81014df65850
  }
}
crash>  struct __wait_queue 0xffff810141081aa8
struct __wait_queue {
  flags = 1,
  private = 0xffff81014f4e0aa0,  >>>>>>>>>>> smartd
  func = 0xffffffff81051573 <autoremove_wake_function>,
  task_list = {
    next = 0xffff810061eb1ac0,
    prev = 0xffff81007e4d7cf8
  }
}
crash> struct __wait_queue 0xffff810061eb1aa8
struct __wait_queue {
  flags = 1,
  private = 0xffff81007e491580, >>>>>>>>> multipath
  func = 0xffffffff81051573 <autoremove_wake_function>,
  task_list = {
    next = 0xffff81014df65850,
    prev = 0xffff810141081ac0
  }
}
crash> struct __wait_queue 0xffff81014df65838
struct __wait_queue {
  flags = 1307990064,
  private = 0x0,
  func = 0,
  task_list = {
    next = 0xffff81007e4d7cf8,  // Back to where we started.
    prev = 0xffff810061eb1ac0
  }
}

And the request queue is empty:


struct elevator_queue {
  ops = 0xffffffff813c8e70,
  elevator_data = 0xffff81014df63c00,
..
}

crash> struct elevator_ops 0xffffffff813c8e70
struct elevator_ops {
  ....
  elevator_queue_empty_fn = 0xffffffff811314ac <cfq_queue_empty>,
  ...
}

static int cfq_queue_empty(struct request_queue *q)
{
        struct cfq_data *cfqd = q->elevator->elevator_data;

        return !cfqd->busy_queues; // So let us look at busy_queues field.
}

crash> struct cfq_data 0xffff81014df63c00
struct cfq_data {
  queue = 0xffff81014df657d8, 
  service_tree = {
    rb = {
      rb_node = 0x0
    }, 
    left = 0x0
  }, 
  busy_queues = 0,  <<<<< 
  rq_in_driver = 0, 
  sync_flight = 0, 
...
}

So at this point, I guess we need to focus on whether the initial decision to call process_queued_ios()
and that area of the code is correct. Mike, any thoughts?
=Comment: #7=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Found the bug. 
The issue is with scsi_dh_rdac. In a couple of places it is missing a blk_put_request() for the
corresponding blk_get_request().

File : scsi_dh_rdac.c
Functions: submit_inquiry() , send_mode_select()

These two functions get the request_queue buffer through rdac_failover_get() ->
get_rdac_req() -> blk_get_request() -> get_request().

But the corresponding blk_put_request() (and hence freed_request()) is missing.

This causes a request_queue buffer leak. Since the bug is in the device handler code,
the leak happens only on a path bounce.

The port bounce test caused enough path bounces for the leak to exceed the
request-pool watermark, holding up the threads.

Further analysis revealed that this problem has already been taken care of in mainline.

The simple fix is to add blk_put_request() in the submit_inquiry() and send_mode_select() functions.
That will fix this problem. But I will be closely comparing the -rt version of the device handler code
with mainline to see if we are missing any other important changes, and may come up with a
broader patch.
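For illustration, the shape of the fix is the usual get/put pairing (a hedged sketch of the
pattern only, not the attached patch; rdac_submit_cmd() is an illustrative name):

/* Sketch of the fix pattern, not the attached patch-scsi_dh-put; the helper
 * name is illustrative. Every blk_get_request() in the RDAC handler needs a
 * matching blk_put_request() once the command completes, so the request goes
 * back to q->rq instead of leaking on each failover. */
static int rdac_submit_cmd(struct request_queue *q, struct request *rq)
{
        int err;

        /* rq was obtained earlier via get_rdac_req() -> blk_get_request() */
        err = blk_execute_rq(q, NULL, rq, 1);   /* issue and wait for completion */

        blk_put_request(rq);    /* the missing release: frees the request back
                                   to the pool via freed_request() */
        return err;
}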
=Comment: #8=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 

patch-scsi_dh-put

This patch fixes the buffer leak problem.
=Comment: #11=================================================
Venkateswarara Jujjuri <jvrao@us.ibm.com> - 
Moving it to FIX_BY_IBM. Will send out a separate mail to RH with the patch.
Comment 1 IBM Bug Proxy 2009-01-08 01:50:32 EST
Created attachment 328444 [details]
patch-scsi_dh-put
Comment 2 IBM Bug Proxy 2009-06-29 12:20:47 EDT
------- Comment From sripathik@in.ibm.com 2009-06-29 12:20 EDT-------
To RH: We saw this fix in kernel-rt-2.6.29.4-23.el5rt.src.rpm. Hence we are closing this bug on IBM side.
