Bug 1503411

Summary: [iSCSI] Incorrect number of tcmu-runner daemons reported after GWs go down and come back up
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Tejas <tchandra>
Component: iSCSI
Assignee: Jason Dillaman <jdillama>
Status: CLOSED WONTFIX
QA Contact: Tejas <tchandra>
Severity: medium
Docs Contact: Erin Donnelly <edonnell>
Priority: unspecified
Version: 3.0
CC: ceph-eng-bugs, ceph-qe-bugs, edonnell, jdillama, tchandra
Target Milestone: rc
Target Release: 3.*
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Incorrect number of `tcmu-runner` daemons reported after iSCSI target LUNs fail and recover
After iSCSI target Logical Unit Numbers (LUNs) recover from a failure, the `ceph -s` command in certain cases outputs an incorrect number of `tcmu-runner` daemons.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-26 16:14:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1494421

Description Tejas 2017-10-18 04:42:19 UTC
Description of problem:


Version-Release number of selected component (if applicable):
ceph version 12.2.1-14.el7cp
libtcmu-1.3.0-0.4.el7cp.x86_64

The 'ceph -s' command output reports the number of active tcmu-runner daemons. I am disabling the network interface on the GW nodes and, after a while, bringing it back up.
Command used:
ifdown <eth>
ifup <eth>

Total LUNs: 122
Expected tcmu-runner daemons: 488 (122 LUNs x 4 gateways)
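A minimal reproduction sketch of the failure injection described above (the gateway hostnames, interface name, and outage duration are placeholders, not taken from this setup):

  # Sketch only: bounce the network on each iSCSI gateway, then recheck the service map.
  GWS="gw1 gw2 gw3 gw4"   # placeholder gateway hostnames
  IFACE="eth0"            # placeholder interface name

  for gw in $GWS; do
      ssh "$gw" "ifdown $IFACE"    # take the gateway off the network
  done

  sleep 300                        # leave the gateways unreachable for a while

  for gw in $GWS; do
      ssh "$gw" "ifup $IFACE"      # bring the network back up
  done

  ceph -s | grep tcmu-runner       # compare the reported count with the expected 488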

After the network on 1 GW has gone down:
 ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 257 daemons active   <----------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9960 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean




After all 4 GWs have gone down and come back up:
~]# ceph -s
  cluster:
    id:     2057393b-ce5e-4821-9eb0-96519e801921
    health: HEALTH_OK
 
  services:
    mon:         3 daemons, quorum havoc,mustang,skytrain
    mgr:         mustang(active)
    osd:         20 osds: 20 up, 20 in
    rgw:         1 daemon active
    tcmu-runner: 31 daemons active    <---------------
 
  data:
    pools:   13 pools, 842 pgs
    objects: 1140k objects, 3320 GB
    usage:   9961 GB used, 12284 GB / 22245 GB avail
    pgs:     842 active+clean
 
  io:
    client:   10743 B/s rd, 111 MB/s wr, 10 op/s rd, 511 op/s wr
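A quick way to compare the reported count against the expected one by parsing the plain 'ceph -s' text shown above (sketch only; it relies on the "tcmu-runner: N daemons active" line format):

  expected=488                                           # 122 LUNs x 4 gateways
  reported=$(ceph -s | awk '/tcmu-runner:/ {print $2}')
  if [ "$reported" -ne "$expected" ]; then
      echo "tcmu-runner daemon count mismatch: reported=$reported expected=$expected"
  fi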

Comment 3 Jason Dillaman 2017-10-18 13:16:44 UTC
@Tejas: the service daemons have a 60-second grace period (by default). Did you check the daemon state after 60 seconds had passed?
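For reference, a small polling sketch that waits out the default 60-second grace period before trusting the reported count (the interval and number of iterations are arbitrary):

  # Sketch only: re-check the service map for ~2 minutes after the gateways recover.
  for i in $(seq 1 12); do
      sleep 10
      ceph -s | grep tcmu-runner
  done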