Bug 2219440 - MGR process hung - pybind/cephfs: holds GIL during rmdir
Summary: MGR process hung - pybind/cephfs: holds GIL during rmdir
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Mgr Plugins
Version: 5.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 6.1z1
Assignee: Patrick Donnelly
QA Contact: Hemanth Kumar
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On: 2219093
Blocks: 2219173 2221020
Reported: 2023-07-03 17:42 UTC by Vikhyat Umrao
Modified: 2023-10-06 17:41 UTC
CC List: 22 users

Fixed In Version: ceph-17.2.6-84.el9cp
Doc Type: Bug Fix
Doc Text:
.Python tasks no longer wait for the GIL
Previously, the Ceph Manager daemons held the Python Global Interpreter Lock (GIL) during some RPCs. Due to this, other Python tasks in the queue starved while waiting for the GIL. With this fix, the GIL is released during all `libcephfs` or `librbd` calls, and other Python tasks can acquire the GIL normally.
Clone Of: 2219093
Environment:
Last Closed: 2023-08-03 16:45:10 UTC
Embargoed:
hyelloji: needinfo-


Links:
- Ceph Project Bug Tracker 61869 (last updated 2023-07-03 17:43:04 UTC)
- GitHub ceph/ceph pull 52322: "quincy: pybind: drop GIL during library callouts" (open, last updated 2023-07-06 16:57:06 UTC)
- Red Hat Issue Tracker RHCEPH-6954 (last updated 2023-07-03 17:45:11 UTC)
- Red Hat Knowledge Base (Solution) 7037839 (last updated 2023-10-06 17:41:45 UTC)
- Red Hat Product Errata RHBA-2023:4473 (last updated 2023-08-03 16:46:02 UTC)

Description Vikhyat Umrao 2023-07-03 17:42:19 UTC
+++ This bug was initially created as a clone of Bug #2219093 +++

Description of problem:
Opening another BZ quickly, since the case attached to this BZ is hot. The active MGR process on bmwcefesp000001 hung, causing a failover to the standby MGR daemon on bmwcefesp000003.

Version-Release number of selected component (if applicable):
External ODF RHCS 5.1

Additional info:

- Core dumps and "ceph -s" output are incoming.

A perf dump was taken while the MGR process was hung. It's in supportshell at:

0700-mgr-bmwcefesp000001-perf-top.txt

--- Additional comment from  on 2023-07-01 23:20:01 UTC ---

Unfortunately we hit https://bugzilla.redhat.com/show_bug.cgi?id=2170213 so we weren't able to capture a coredump.

-Brandon

--- Additional comment from  on 2023-07-01 23:28:07 UTC ---

[ansautomat@bmwcefesp000001 ~]$ sudo podman exec -it ceph-mon-$(hostname -s) ceph -s
  cluster:
    id:     8d2323ac-0820-49ad-b917-059ba00794f2
    health: HEALTH_WARN
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum bmwcefesp000003,bmwcefesp000002,bmwcefesp000001 (age 2d)
    mgr: bmwcefesp000003(active, since 63m), standbys: bmwcefesp000002, bmwcefesp000001
    mds: 2/2 daemons up, 1 standby
    osd: 324 osds: 324 up (since 23h), 324 in (since 23h)
 
  data:
    volumes: 1/1 healthy
    pools:   13 pools, 10625 pgs
    objects: 106.42M objects, 41 TiB
    usage:   131 TiB used, 485 TiB / 616 TiB avail
    pgs:     10625 active+clean
 
  io:
    client:   16 MiB/s rd, 94 MiB/s wr, 865 op/s rd, 1.87k op/s wr

[ansautomat@bmwcefesp000001 ~]$ sudo podman exec -it ceph-mon-$(hostname -s) ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
    mgr.bmwcefesp000001 crashed on host bmwcefesp000001.advancedp.org at 2023-07-01T23:00:55.691802Z

--- Additional comment from  on 2023-07-01 23:44:12 UTC ---

^^The above "ceph -s" output was taken after we SIGKILLED the hung mgr on bmwcefesp000001.^^


- Mgr logs located in supportshell:

0710-ceph-mgr.bmwcefesp000001.log


- As stated, a coredump could not be generated due to https://bugzilla.redhat.com/show_bug.cgi?id=2170213:

[ansautomat@bmwcefesp000001 ~]$ sudo coredumpctl dump -o mgr-bmwcefesp000001-coredump.txt
           PID: 861251 (ceph-mgr)
           UID: 167 (167)
           GID: 167 (167)
        Signal: 6 (ABRT)
     Timestamp: Sat 2023-07-01 23:00:56 UTC (21min ago)
  Command Line: /usr/bin/ceph-mgr --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false --foreground -i bmwcefesp000001
    Executable: /usr/bin/ceph-mgr
Control Group: /machine.slice/libpod-d030bd2c8bdfe2219c8bf0f888b0ff4df66f96c31d6eb5257c65b43fbc5fbfa0.scope
          Unit: libpod-d030bd2c8bdfe2219c8bf0f888b0ff4df66f96c31d6eb5257c65b43fbc5fbfa0.scope
         Slice: machine.slice
       Boot ID: 00792aa77e3a47fba7e8d900139eacf7
    Machine ID: 096034123a2b40fd9421f3196736558d
      Hostname: bmwcefesp000001.advancedp.org
       Storage: none
       Message: Process 861251 (ceph-mgr) of user 167 dumped core.
Coredump entry has no core attached (neither internally in the journal nor externally on disk).

- Brandon

--- Additional comment from Vikhyat Umrao on 2023-07-02 02:42:00 UTC ---

This is from the DXC/BMW CephFS escalation - https://bugzilla.redhat.com/show_bug.cgi?id=2218688.

Greg and I were called into the customer remote session after we were informed that they did not see standby MGRs in `ceph -s`. When we joined the remote session, these were the events:

- MGR03 was the active MGR, and 01 and 02 were not visible as standbys in `ceph -s`.
- We checked MGR01 and MGR02; both had processes up and running.
- Restarting 02 made it join as a standby, with 03 as the active MGR.
- Before restarting 01, we decided to capture perf top, MGR logs, and a coredump:
  + perf top
  + MGR logs are available
  + but the coredump capture failed, as explained above
    - To capture the coredump we ran `kill -SIGABRT <01 PID>`; this made MGR01 restart, and it joined 02 as a standby with 03 remaining active.

--- Additional comment from Brad Hubbard on 2023-07-02 04:11:20 UTC ---

It may be easier to get CPU usage per thread and then a couple of thread dumps to see what the thread with high CPU usage is actually doing. If we do get a core, make sure we identify which thread has high CPU usage, to make it easier to identify where we should focus.

--- Additional comment from Vikhyat Umrao on 2023-07-02 04:15:13 UTC ---

(In reply to Brad Hubbard from comment #5)
> It may be easier to get CPU usage per thread and then a couple of thread dumps
> to see what the thread with high CPU usage is actually doing. If we do get a
> core, make sure we identify which thread has high CPU usage, to make it easier
> to identify where we should focus.

Slack link where the wallclock profiler and poor man's profiler were discussed: https://redhat-internal.slack.com/archives/C05EQ6XHL78/p1688270148381289

--- Additional comment from Brad Hubbard on 2023-07-02 04:15:52 UTC ---

Please also get output showing whether the implicated thread is in D state or not, and also wchan details (some incantation of ps should get you these fields).

--- Additional comment from Brad Hubbard on 2023-07-02 04:26:59 UTC ---

If a coredump is gathered, please also provide full details of the exact container image in use when the coredump happened, to expedite setting up an environment to analyse the core.

--- Additional comment from Ashish Singh on 2023-07-02 05:33:26 UTC ---

We could be hitting the bug https://bugzilla.redhat.com/show_bug.cgi?id=2106031.

For now we're capturing the following:
 - Debug logs (debug_mgr 20, debug_monc 20, debug_ms 5)
 - Threads for MGR daemon
 - Thread dump for threads with High CPU

--- Additional comment from Karun Josy on 2023-07-02 10:24:20 UTC ---

Hello everyone,

In the remote session, we noticed only 1 active MGR and no standby again.
I restarted one to have at least 1 standby,
and decided to collect the logs from the third one, which is missing in `ceph -s`.

*** MGR03 ---> bmwcefesp000003 is the one that is currently running but missing in `ceph -s` ***


Logs collected:

In supportshell:
- Debug logs from MGR03 (debug_mgr 20, debug_monc 20, debug_ms 5) -----> /cases/03544845/0760-mgr03_journal_20230702_06:19:36.txt
- netstat output ----> /cases/03544845/0770-ntstat_bmwcefesp000003_06:11:04.txt
- pstack text: /cases/03544845/0800-core-pid-dump-2023-07-02.tar.gz/core-pid-dump-2023-07-02
- gcore pid: /cases/03544845/0800-core-pid-dump-2023-07-02.tar.gz/core.768914
- container version: advantage-ceph-images-rhceph_rhceph-5-rhel8:5-235

- # rpm -qa | grep glibc and # rpm -qa | grep ceph from within container
- top -H -p 768914
- sudo cat /proc/768914/stack
- sudo cat /proc/768914/status
-----------------------

 ~]$ top -H -p 768914

top - 05:50:36 up 2 days, 14:46,  1 user,  load average: 1.74, 1.78, 1.81
Threads: 163 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.1 us,  2.5 sy,  0.0 ni, 91.1 id,  0.1 wa,  0.5 hi,  0.7 si,  0.0 st
MiB Mem : 386352.3 total, 281340.1 free,  93432.1 used,  11580.0 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used. 290653.7 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 822152 ceph      20   0 6995340   1.0g  53016 R  16.3   0.3  35:53.05 msgr-worker-2
 822151 ceph      20   0 6995340   1.0g  53016 S  14.6   0.3  45:11.16 msgr-worker-1
 834736 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:20.85 crash
 834775 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:18.67 telemetry
 834776 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:19.98 mgr-fin
 834787 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:18.12 mgr-fin
 834788 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   9:32.67 mgr-fin
 834793 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:20.85 mgr-fin
 834796 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   9:23.91 mgr-fin
 834798 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   1:21.94 mgr-fin
 834831 ceph      20   0 6995340   1.0g  53016 S   0.7   0.3   2:41.94 dashboard
 834735 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   2:10.11 balancer
 834738 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:18.10 devicehealth
 834741 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   2:10.28 pg_autoscaler
 834742 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:21.20 progress
 834754 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3  13:07.54 prometheus
 834755 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   2:41.95 prometheus
 834766 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:31.76 mgr-fin
 834767 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:23.17 mgr-fin
 834768 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:22.24 mgr-fin
 834772 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:30.30 mgr-fin
 834792 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   1:19.69 mgr-fin
 834794 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   8:15.70 mgr-fin
 834797 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3  11:26.84 mgr-fin
 834899 ceph      20   0 6995340   1.0g  53016 S   0.3   0.3   0:05.19 mgr-fin
 768914 ceph      20   0 6995340   1.0g  53016 S   0.0   0.3   0:07.32 ceph-mgr
 822148 ceph      20   0 6995340   1.0g  53016 S   0.0   0.3  12:45.56 log



$ sudo cat /proc/768914/status
Name:   ceph-mgr
Umask:  0022
State:  S (sleeping)
Tgid:   768914
Ngid:   768914
Pid:    768914
PPid:   768853
TracerPid:      0
Uid:    167     167     167     167
Gid:    167     167     167     167
FDSize: 4096
Groups:
NStgid: 768914  55
NSpid:  768914  55
NSpgid: 768914  55
NSsid:  768853  1
VmPeak:  7006864 kB
VmSize:  6995340 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:   1095708 kB
VmRSS:   1070588 kB
RssAnon:         1017572 kB
RssFile:           53016 kB
RssShmem:              0 kB
VmData:  6638364 kB
VmStk:       968 kB
VmExe:      4892 kB
VmLib:    129984 kB
VmPTE:      3492 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    0
Threads:        163
SigQ:   0/1545256
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000001000
SigIgn: 0000000000000000
SigCgt: 00000001c18044eb
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000800425fb
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        2
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        1633
nonvoluntary_ctxt_switches:     76



$ sudo cat /proc/768914/stack
[<0>] futex_wait_queue_me+0xa3/0x100
[<0>] futex_wait+0x11f/0x210
[<0>] do_futex+0x2f9/0x4d0
[<0>] __x64_sys_futex+0x145/0x1f0
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ansautomat@bmwcefesp000003 ~]$



# rpm -qa | grep ceph
python3-cephfs-16.2.7-126.el8cp.x86_64
ceph-base-16.2.7-126.el8cp.x86_64
ceph-prometheus-alerts-16.2.7-126.el8cp.noarch
ceph-mgr-diskprediction-local-16.2.7-126.el8cp.noarch
ceph-mgr-rook-16.2.7-126.el8cp.noarch
ceph-osd-16.2.7-126.el8cp.x86_64
libcephfs2-16.2.7-126.el8cp.x86_64
python3-ceph-argparse-16.2.7-126.el8cp.x86_64
ceph-common-16.2.7-126.el8cp.x86_64
ceph-selinux-16.2.7-126.el8cp.x86_64
ceph-grafana-dashboards-16.2.7-126.el8cp.noarch
ceph-mgr-dashboard-16.2.7-126.el8cp.noarch
ceph-mgr-k8sevents-16.2.7-126.el8cp.noarch
ceph-mgr-16.2.7-126.el8cp.x86_64
ceph-iscsi-3.5-3.el8cp.noarch
ceph-mon-16.2.7-126.el8cp.x86_64
ceph-immutable-object-cache-16.2.7-126.el8cp.x86_64
cephfs-mirror-16.2.7-126.el8cp.x86_64
nfs-ganesha-ceph-3.5-1.el8cp.x86_64
libcephsqlite-16.2.7-126.el8cp.x86_64
cephadm-16.2.7-126.el8cp.noarch
ceph-mgr-cephadm-16.2.7-126.el8cp.noarch
ceph-mgr-modules-core-16.2.7-126.el8cp.noarch
ceph-radosgw-16.2.7-126.el8cp.x86_64
ceph-mds-16.2.7-126.el8cp.x86_64
python3-ceph-common-16.2.7-126.el8cp.x86_64


# rpm -qa | grep glibc
glibc-minimal-langpack-2.28-189.5.el8_6.x86_64
glibc-common-2.28-189.5.el8_6.x86_64
glibc-2.28-189.5.el8_6.x86_64



37905aa113d2  bmwinfesv000031.advantagedp.org/advantage-ceph-images-rhceph_rhceph-5-rhel8:5-235                                         26 hours ago  Up 26 hours ago              ceph-mgr-bmwcefesp000003

--- Additional comment from Karun Josy on 2023-07-02 10:31:47 UTC ---

Current status:

We have 1 MGR active and 1 MGR on standby.
MGR03 ---> bmwcefesp000003 is currently running but still missing in `ceph -s`.
The debug logs and coredump are from this node, MGR03.
We decided not to restart the service and to keep it as it is for the time being, for further debugging if required.
We asked the customer to monitor the cluster continuously to make sure we always have at least 1 standby.

We will eventually need to restart MGR03 as well. Please confirm that we have all the required logs before we go ahead.

###############

Note:

We collected /proc stack and status output from a working standby MGR for comparison. Adding it here in the hope that it helps.


output from node 2 (standby mgr MGR02)
---------------------------------------

$ sudo ps aux | grep mgr
root     3174559  0.0  0.0  70152  2476 ?        Ss   04:16   0:00 /usr/bin/conmon --api-version 1 -c b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58 -u b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58 -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58/userdata -p /run/containers/storage/overlay-containers/b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58/userdata/pidfile -n ceph-mgr-bmwcefesp000002 --exit-dir /run/libpod/exits --full-attach -s -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58/userdata/oci-log --conmon-pidfile //run/ceph-mgr --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg b2bc3b7d9a72a7a0734c6b9b2b22048ecd96193d914a457dcc1e9b83ddf1cf58
167      3174634  1.5  0.1 5305704 434428 ?      Sl   04:16   1:45 /usr/bin/ceph-mgr --cluster ceph --setuser ceph --setgroup ceph --default-log-to-stderr=true --err-to-stderr=true --default-log-to-file=false --foreground -i bmwcefesp000002
ansauto+ 3342669  0.0  0.0 221940  1196 pts/3    S+   06:12   0:00 grep --color=auto mgr


$ sudo cat /proc/3174559/stack
[<0>] do_sys_poll+0x4b8/0x570
[<0>] __x64_sys_poll+0x37/0x130
[<0>] do_syscall_64+0x5b/0x1b0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xc6


$ sudo cat /proc/3174559/status
Name:   conmon
Umask:  0022
State:  S (sleeping)
Tgid:   3174559
Ngid:   0
Pid:    3174559
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 64
Groups:
NStgid: 3174559
NSpid:  3174559
NSpgid: 3174559
NSsid:  3174559
VmPeak:    70152 kB
VmSize:    70152 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:      2476 kB
VmRSS:      2476 kB
RssAnon:             748 kB
RssFile:            1728 kB
RssShmem:              0 kB
VmData:      660 kB
VmStk:       132 kB
VmExe:        80 kB
VmLib:     13712 kB
VmPTE:       160 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:        1
SigQ:   0/1545256
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000010200
SigIgn: 0000000000001000
SigCgt: 0000000180004006
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       thread vulnerable
Cpus_allowed:   ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:      0-1
voluntary_ctxt_switches:        1725
nonvoluntary_ctxt_switches:     1


--- Additional comment from Vikhyat Umrao on 2023-07-02 14:50:25 UTC ---

From Slack:

Ranjith Rajaram
  6:58 AM
Currently they are running with two managers (one active and one standby). The other one is in a hung state (003). We have asked the customer to keep monitoring the Ceph status. If they find any of the existing MGRs not listed in the output, we have asked them to restart it. 003 is reserved for data collection.

6:59
we have performed two mgr restart so far

pdonnell
  7:08 AM
my initial analysis is that it appears the mgr is stuck on the DaemonServer::lock



pdonnell
  7:08 AM
which is acquired in some questionable places like DaemonServer::ms_handle_authentication
7:08
that appears to have completely blocked the messenger thread reading messages on the wire

7:09
which is why it stopped doing anything

--- Additional comment from Vikhyat Umrao on 2023-07-02 14:52:08 UTC ---

This one might be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2106031, which Radek is actively looking into.

--- Additional comment from Radoslaw Zarzynski on 2023-07-02 15:27:53 UTC ---

Hi Karun!

Thanks for the mgr03 log. Let's take a look. This instance booted up yesterday.

```
[rzarzyns@supportshell-1 ~]$ less /cases/03544845/0760-mgr03_journal_20230702_06:19:36.txt
Jul 01 19:06:45 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-01T19:06:45.827+0000 7fef0db2a500  0 ceph version 16.2.7-126.el8cp (fe0af61d104d48cb9d116cde6e593b5fc8c197e4) pacific (stable), process ceph-mgr, pid 55
```

The last 2 minutes of the log are flooded *solely* with entries from `_check_auth_rotating()`.


```
Jul 02 06:18:46 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:18:45.999+0000 7feefe18b700  5 --2- 192.78.200.27:0/55 >> [v2:192.78.200.26:3300/0,v1:192.78.200.26:6789/0] conn(0x561104c79000 0x5610fd6f3e00 secure :-1 s=THROTTLE_DONE pgs=43442 cs=0 l=1 rev1=1 rx=0x5611042fd050 tx=0x5611046280c0).handle_message got 24 + 0 + 0 byte message. envelope type=54 src mon.1 off 0
Jul 02 06:18:46 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:18:46.844+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:18:46.845689+0000)
...
Jul 02 06:20:06 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:20:06.862+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:20:06.862975+0000)
Jul 02 06:20:07 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:20:07.862+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:20:07.863107+0000)
Jul 02 06:20:08 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:20:08.862+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:20:08.863246+0000)
Jul 02 06:20:09 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:20:09.862+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:20:09.863376+0000)
Jul 02 06:20:10 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:20:10.862+0000 7feef9181700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2023-07-02T05:20:10.863496+0000)
```

These *lderr* entries are generated by the following code:

```
  ldout(cct, 10) << "_check_auth_rotating renewing rotating keys (they expired before " << cutoff << ")" << dendl;
  if (!rotating_secrets->need_new_secrets() &&
      rotating_secrets->need_new_secrets(issued_at_lower_bound)) {
    // the key has expired before it has been issued?
    lderr(cct) << __func__ << " possible clock skew, rotating keys expired way too early"
               << " (before " << issued_at_lower_bound << ")" << dendl;
  }
```

We know this ceph-mgr instance is running with `debug_monc=20/20`;
therefore I would expect to see:

1. the `ldout(10)` entry just before the `if`;
2. the unconditional `ldout(10)` from `MonClient::tick()` or
   `MonClient::_finish_auth()`, as they are the sole callers
   of `_check_auth_rotating()`:

```
int MonClient::_check_auth_tickets()
{
  ceph_assert(ceph_mutex_is_locked(monc_lock));
  if (active_con && auth) {
    // ...
    _check_auth_rotating();
  }
  return 0;
}

void MonClient::_finish_auth(int auth_err)
{
  ldout(cct,10) << __func__ << " " << auth_err << dendl;
  // ...
  if (!auth_err && active_con) {
    ceph_assert(auth);
    _check_auth_tickets();
  }
```

```
void MonClient::tick()
{
  ldout(cct, 10) << __func__ << dendl;

  utime_t now = ceph_clock_now();

  auto reschedule_tick = make_scope_guard([this] {
      schedule_tick();
    });

  _check_auth_tickets();
```

To exemplify the expected content, let me show entries just a couple of seconds older:

```
Jul 02 06:18:42 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:18:42.591+0000 7fedb55fe700 10 monclient: tick
Jul 02 06:18:42 bmwcefesp000003.advancedp.org ceph-mgr-bmwcefesp000003[768841]: 2023-07-02T06:18:42.591+0000 7fedb55fe700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2023-07-02T06:18:12.592768+0000)
```

Is a 2-minute glitch in logging something you guys have ever encountered?
May we have a fresh log from mgr03?

Regards,
Radek

--- Additional comment from Ranjith Rajaram on 2023-07-02 15:38:00 UTC ---

Quick question (posting it to the BZ). Please NACK if you do not think it is related.

We have a hung Ceph manager; it may be hung for a different reason.

We took a coredump of the hung ceph-mgr. The hung manager seems to have been trying to delete an entry from the `_deleting` directory before going into the hung state:

Path '/volumes/_deleting/4a00a3ea-c792-4b29-9746-fb840f5a790b/0d13f4b6-fa25-4461-9356-d149d9634676/kpi_orchestration_11/write-dag-run-config/2023-06-12T12:03:02.937843+00:00'

(gdb) bt 77
#0  futex_wait_cancelable (private=0, expected=0, futex_word=0x7fedc060eb88) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5611073bd568, cond=0x7fedc060eb60) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7fedc060eb60, mutex=0x5611073bd568) at pthread_cond_wait.c:655
#3  0x00007fef027bc8f0 in ?? () at ../../../../../libstdc++-v3/src/c++11/condition_variable.cc:118 from /lib64/libstdc++.so.6
#4  0x00007fedc060eb30 in ?? ()
#5  0x00007feef1e9b5ab in std::condition_variable::wait<Client::make_request(MetaRequest*, const UserPerm&, InodeRef*, bool*, mds_rank_t, ceph::bufferlist*)::<lambda()> > (__p=..., __lock=..., this=0x7fedc060eb60)
    at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/src/log/Entry.h:35
#6  Client::make_request (this=0x5611073bd400, request=0x561104b65200, perms=..., ptarget=0x0, pcreated=0x0, use_mds=<optimized out>, pdirbl=0x0) at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/src/client/Client.cc:1867
#7  0x00007feef1ebbea6 in Client::_rmdir (this=0x5611073bd400, dir=0x56110b85b000, name=<optimized out>, perms=...) at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/src/client/Client.cc:13637
#8  0x00007feef1ebc705 in Client::unlinkat (this=0x5611073bd400, dirfd=-100, relpath=<optimized out>, flags=<optimized out>, perm=...) at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/src/client/Client.cc:7197
#9  0x00007feef2274a87 in __pyx_pf_6cephfs_9LibCephFS_94rmdir (
    __pyx_v_path=b'/volumes/_deleting/4a00a3ea-c792-4b29-9746-fb840f5a790b/0d13f4b6-fa25-4461-9356-d149d9634676/kpi_orchestration_11/write-dag-run-config/2023-06-12T12:03:02.937843+00:00', __pyx_v_self=0x7fee30464160)
    at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/build/src/pybind/cephfs/cephfs.c:19171
#10 __pyx_pw_6cephfs_9LibCephFS_95rmdir (__pyx_v_self=<cephfs.LibCephFS at remote 0x7fee30464160>,
    __pyx_v_path=b'/volumes/_deleting/4a00a3ea-c792-4b29-9746-fb840f5a790b/0d13f4b6-fa25-4461-9356-d149d9634676/kpi_orchestration_11/write-dag-run-config/2023-06-12T12:03:02.937843+00:00')
    at /usr/src/debug/ceph-16.2.7-126.el8cp.x86_64/build/src/pybind/cephfs/cephfs.c:19059
#11 0x00007fef0d56c05d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at /usr/src/debug/python3-3.6.8-51.el8.x86_64/Python/ceval.c:1303

We still have the following two subvolumes in `_deleting`:

drwxrwxrwx. 3 root root 2 Aug  8  2022 4a00a3ea-c792-4b29-9746-fb840f5a790b
drwxrwxrwx. 3 root root 1 Jun 23 13:32 fcf8112c-d8f4-49a1-a807-1e5222e672cf

If the hung manager is holding any lock and we know it is not proceeding further, will the other active manager be able to clear up the deleted volume?

--- Additional comment from Patrick Donnelly on 2023-07-02 15:47:51 UTC ---

The cephfs call stacks are irrelevant. See:

Thread 12 (Thread 0x7feefb185700 (LWP 822162)):
#0  0x00007fef0339081d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fef03389ac9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005610fa9bd6cf in ClusterState::set_service_map(ServiceMap const&) ()
#3  0x00005610faa271ea in Mgr::handle_service_map(boost::intrusive_ptr<MServiceMap>) ()
#4  0x00005610faa29ec4 in Mgr::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#5  0x00005610faa33e95 in MgrStandby::ms_dispatch2(boost::intrusive_ptr<Message> const&) ()
#6  0x00007fef047a92aa in DispatchQueue::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#7  0x00007fef0485af91 in DispatchQueue::DispatchThread::entry() () from target:/usr/lib64/ceph/libceph-common.so.2
#8  0x00007fef033871cf in start_thread () from target:/lib64/libpthread.so.0
#9  0x00007fef01ddadd3 in clone () from target:/lib64/libc.so.6

has DaemonServer::lock; wants ClusterState::lock

Thread 22 (Thread 0x7fedd2e39700 (LWP 834730)):
#0  0x00007fef0338d838 in pthread_cond_timedwait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x00007fef0d4c7edc in take_gil () from target:/lib64/libpython3.6m.so.1.0
#2  0x00007fef0d4c80fd in PyEval_RestoreThread () from target:/lib64/libpython3.6m.so.1.0
#3  0x00005610faa1e993 in with_gil_t::with_gil_t(without_gil_t&) ()
#4  0x00005610fa976226 in ActivePyModules::get_python(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()

Probably has ClusterState::lock; wants GIL


Thread 3 (Thread 0x7feeff18d700 (LWP 822150)):
#0  0x00007fef0339081d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fef03389ac9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005610fa986d17 in std::mutex::lock() ()
#3  0x00005610fa9dee2c in DaemonServer::ms_handle_authentication(Connection*) ()
#4  0x00007fef04906e55 in MonClient::handle_auth_request(Connection*, AuthConnectionMeta*, bool, unsigned int, ceph::buffer::v15_2_0::list const&, ceph::buffer::v15_2_0::list*) () from target:/usr/lib64/ceph/libceph-common.so.2
#5  0x00007fef0489165f in ProtocolV2::_handle_auth_request(ceph::buffer::v15_2_0::list&, bool) () from target:/usr/lib64/ceph/libceph-common.so.2
#6  0x00007fef0489261e in ProtocolV2::handle_auth_request_more(ceph::buffer::v15_2_0::list&) () from target:/usr/lib64/ceph/libceph-common.so.2
#7  0x00007fef0489b0c3 in ProtocolV2::handle_frame_payload() () from target:/usr/lib64/ceph/libceph-common.so.2
#8  0x00007fef0489b380 in ProtocolV2::handle_read_frame_dispatch() () from target:/usr/lib64/ceph/libceph-common.so.2
#9  0x00007fef0489b575 in ProtocolV2::_handle_read_frame_epilogue_main() () from target:/usr/lib64/ceph/libceph-common.so.2
#10 0x00007fef0489b622 in ProtocolV2::_handle_read_frame_segment() () from target:/usr/lib64/ceph/libceph-common.so.2
#11 0x00007fef0489c781 in ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int) () from target:/usr/lib64/ceph/libceph-common.so.2
#12 0x00007fef04884eec in ProtocolV2::run_continuation(Ct<ProtocolV2>&) () from target:/usr/lib64/ceph/libceph-common.so.2
#13 0x00007fef0484d3f9 in AsyncConnection::process() () from target:/usr/lib64/ceph/libceph-common.so.2
#14 0x00007fef048a7507 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () from target:/usr/lib64/ceph/libceph-common.so.2
#15 0x00007fef048ada1c in std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from target:/usr/lib64/ceph/libceph-common.so.2
#16 0x00007fef027c2ba3 in execute_native_thread_routine () from target:/lib64/libstdc++.so.6
#17 0x00007fef033871cf in start_thread () from target:/lib64/libpthread.so.0
#18 0x00007fef01ddadd3 in clone () from target:/lib64/libc.so.6

wants DaemonServer::lock; blocking reads on the AsyncConnection center? (No more messages read off the wire)


Question is: who has the GIL?

--- Additional comment from Greg Farnum on 2023-07-02 15:48:51 UTC ---

Ranjith: failed manager gets blocklisted, so it won’t be holding any locks or able to prevent progress by the others. :)

--- Additional comment from Patrick Donnelly on 2023-07-02 16:00:39 UTC ---

Thread 77 (Thread 0x7fedc0614700 (LWP 834795)):
#0  0x00007fef0338d44c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x00007fef027bc8f0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from target:/lib64/libstdc++.so.6
#2  0x00007feef1e9b5ab in Client::make_request(MetaRequest*, UserPerm const&, boost::intrusive_ptr<Inode>*, bool*, int, ceph::buffer::v15_2_0::list*) () from target:/lib64/libcephfs.so.2
#3  0x00007feef1ebbea6 in Client::_rmdir(Inode*, char const*, UserPerm const&) () from target:/lib64/libcephfs.so.2
#4  0x00007feef1ebc705 in Client::unlinkat(int, char const*, int, UserPerm const&) () from target:/lib64/libcephfs.so.2
#5  0x00007feef2274a87 in __pyx_pw_6cephfs_9LibCephFS_95rmdir () from target:/lib64/python3.6/site-packages/cephfs.cpython-36m-x86_64-linux-gnu.so


has the GIL!!

https://github.com/ceph/ceph/blob/c42efbf5874de8454e4c7cb3c22bd41bcc0e71f5/src/pybind/cephfs/cephfs.pyx#L1356

this probably explains soooo many performance problems with the mgr. It's often blocking WITH THE GIL waiting for rmdir response from the mds!!!
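
To make the blocking chain from the dumps above concrete, here is a minimal, hypothetical Python sketch. The locks and thread names are illustrative stand-ins only, not the mgr's actual classes: one thread holds the "GIL" while waiting for a reply, and the stall propagates through the lock chain all the way to the messenger.

```
import threading
import time

gil = threading.Lock()            # stand-in for the Python GIL
cluster_lock = threading.Lock()   # stand-in for ClusterState::lock
daemon_lock = threading.Lock()    # stand-in for DaemonServer::lock
mds_reply = threading.Event()     # stand-in for the awaited MDS reply

def rmdir():                      # like Thread 77: holds the GIL, waits for the MDS
    with gil:
        mds_reply.wait()

def get_python():                 # like Thread 22: holds ClusterState::lock, wants the GIL
    with cluster_lock:
        with gil:
            pass

def dispatch():                   # like Thread 12: holds DaemonServer::lock, wants ClusterState::lock
    with daemon_lock:
        with cluster_lock:
            pass

def messenger():                  # like Thread 3: wants DaemonServer::lock to keep reading messages
    with daemon_lock:
        pass

threads = [threading.Thread(target=f) for f in (rmdir, get_python, dispatch, messenger)]
for t in threads:
    t.start()
    time.sleep(0.2)               # force the acquisition order seen in the dumps

print([t.is_alive() for t in threads])  # [True, True, True, True]: the whole chain is stalled
mds_reply.set()                         # the reply finally arrives; the chain unwinds
for t in threads:
    t.join()
```

Until that reply arrives, all four threads stay parked, which matches the sleeping-but-unresponsive process state captured earlier.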

--- Additional comment from Patrick Donnelly on 2023-07-02 16:02:19 UTC ---

(In reply to Patrick Donnelly from comment #18)
> Thread 77 (Thread 0x7fedc0614700 (LWP 834795)):
> #0  0x00007fef0338d44c in pthread_cond_wait@@GLIBC_2.3.2 () from
> target:/lib64/libpthread.so.0
> #1  0x00007fef027bc8f0 in
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
> target:/lib64/libstdc++.so.6
> #2  0x00007feef1e9b5ab in Client::make_request(MetaRequest*, UserPerm
> const&, boost::intrusive_ptr<Inode>*, bool*, int,
> ceph::buffer::v15_2_0::list*) () from target:/lib64/libcephfs.so.2
> #3  0x00007feef1ebbea6 in Client::_rmdir(Inode*, char const*, UserPerm
> const&) () from target:/lib64/libcephfs.so.2
> #4  0x00007feef1ebc705 in Client::unlinkat(int, char const*, int, UserPerm
> const&) () from target:/lib64/libcephfs.so.2
> #5  0x00007feef2274a87 in __pyx_pw_6cephfs_9LibCephFS_95rmdir () from
> target:/lib64/python3.6/site-packages/cephfs.cpython-36m-x86_64-linux-gnu.so
> 
> 
> has the GIL!!
> 
> https://github.com/ceph/ceph/blob/c42efbf5874de8454e4c7cb3c22bd41bcc0e71f5/
> src/pybind/cephfs/cephfs.pyx#L1356
> 
> this probably explains soooo many performance problems with the mgr. It's
> often blocking WITH THE GIL waiting for rmdir response from the mds!!!

https://tracker.ceph.com/issues/61869
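
For reference, the fix tracked there (GitHub PR 52322 in the links above, "quincy: pybind: drop GIL during library callouts") releases the GIL around blocking library calls. Here is a minimal, hypothetical Cython sketch of that pattern, assuming a binding shaped roughly like the `rmdir` wrapper in `cephfs.pyx`; declarations, argument encoding, and error handling are simplified:

```
# Hypothetical, simplified shape of a cephfs.pyx-style binding; the real
# wrapper carries more argument handling and richer error reporting.
cdef extern from "cephfs/libcephfs.h" nogil:
    ctypedef struct ceph_mount_info:
        pass
    int ceph_rmdir(ceph_mount_info *cluster, const char *path)

cdef class LibCephFS:
    cdef ceph_mount_info *cluster   # set up by the mount path, elided here

    def rmdir(self, path):
        cdef char *_path = path     # assumes `path` is already bytes
        cdef int ret
        # Before the fix, the callout below ran with the GIL held, so one
        # slow MDS reply (as in the rmdir stack above) starved every other
        # Python task in the manager. The fix wraps each libcephfs callout
        # in `with nogil:` so other threads can take the GIL in the meantime.
        with nogil:
            ret = ceph_rmdir(self.cluster, _path)
        if ret < 0:
            raise OSError(-ret, "error in rmdir")
```

The doc text above describes the same treatment being applied across all `libcephfs` and `librbd` callouts.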

--- Additional comment from  on 2023-07-02 16:34:23 UTC ---

Fresh logs from the hung mgr process are available in supportshell at:

0850-ceph-mgr.bmwcefesp000003.log

--- Additional comment from Brad Hubbard on 2023-07-02 23:13:08 UTC ---

Jumped the gun on POST, rolling back.

--- Additional comment from Kentaro Nakai on 2023-07-02 23:43:51 UTC ---

///

- Enabled the customer escalation flag (YES), with severity/priority set to urgent; the escalation was raised as ACE EN-63511 & Watch-list Entry 7727.

///

Comment 11 errata-xmlrpc 2023-08-03 16:45:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 6.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:4473

