Bug 1844578
Summary: anaconda very rarely crashes at startup with a pygobject traceback

Product: Red Hat Enterprise Linux 8
Component: pygobject3
Version: 8.3
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jan Stodola <jstodola>
Assignee: Ray Strode [halfline] <rstrode>
QA Contact: Desktop QE <desktop-qa-list>
CC: anaconda-maint-list, dhorak, jkonecny, jkoten, lmiksik, mclasen, mkolman, pniahodk, rvykydal, tpelka, tpopela, vponcova, walters
Target Milestone: rc
Target Release: 8.0
Keywords: OtherQA
Hardware: Unspecified
OS: Unspecified
Fixed In Version: pygobject3-3.28.3-2.el8
Last Closed: 2020-11-04 01:35:45 UTC
Type: Bug
Bug Blocks: 1812825
Description
Jan Stodola
2020-06-05 17:14:06 UTC
It seems to be an issue in the Python bindings for GLib. Reassigning.

Created attachment 1696028 [details]
Log from anaconda kickstart test with the Traceback
We've seen a similar problem lately (June 05, python3-gobject-3.28.3-1.el8.x86_64) in the same call from Anaconda in our nightly kickstart tests:
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:ERROR:anaconda.modules.common.task.task:Thread AnaTaskThread-StartRHSMTask-1 has failed: Traceback (most recent call last):
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/site-packages/pyanaconda/threading.py", line 280, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: threading.Thread.run(self)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/threading.py", line 864, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: self._target(*self._args, **self._kwargs)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/common/task/task.py", line 97, in _task_run_callback
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: self._set_result(self.run())
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/site-packages/pyanaconda/modules/subscription/initialization.py", line 74, in run
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: rhsm_config_proxy.Set("logging.default_log_level", get_variant(Str, "DEBUG"), "")
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib/python3.6/site-packages/dasbus/client/handler.py", line 419, in _call_method
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: parameters = get_variant(in_type, parameters)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib/python3.6/site-packages/dasbus/typing.py", line 124, in get_variant
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: return Variant(type_string, value)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 172, in __new__
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: v = creator._create(format_string, value)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: File "/usr/lib64/python3.6/site-packages/gi/overrides/GLib.py", line 121, in _create
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription: builder = GLib.VariantBuilder.new(gvtype)
01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:AttributeError: type object 'VariantBuilder' has no attribute 'new'
01:29:54,779 WARNING org.fedoraproject.Anaconda.Modules.Subscription:INFO:anaconda.threading:Thread Done: AnaTaskThread-StartRHSMTask-1 (140054874896128)
The problem happened only for one of ~190 tests in one of 6 nightly runs.
FWIW, the pygobject master branch is passing all upstream tests on s390x - http://multiarch-ci.khw1.lab.eng.bos.redhat.com:8080/job/pygobject/label=s390x/ - and https://gitlab.gnome.org/GNOME/pygobject/-/issues/247 is the last/only s390x (endianness) issue I recall for pygobject.

For the record, the traceback in comment #8 is x86_64 nested virtualization in PSI OpenStack (but we don't know yet if the root cause is the same as in the image build failure). I believe the original s390x case is also a nested-virt situation: the builder is a VM and runs the image installation in a nested VM.

> 01:29:54,773 WARNING org.fedoraproject.Anaconda.Modules.Subscription:AttributeError: type object 'VariantBuilder' has no attribute 'new'
This looks like a problem with gobject-introspection to me, not like a problem with pygobject.
What does the XML for VariantBuilder look like?

The XML I'm referring to is /usr/share/gir-1.0/GLib-2.0.gir.

Created attachment 1696118 [details]
/usr/share/gir-1.0/GLib-2.0.gir from gobject-introspection-devel-1.56.1-1.el8.s390x.rpm

I'm attaching /usr/share/gir-1.0/GLib-2.0.gir from gobject-introspection-devel-1.56.1-1.el8.s390x.rpm.
So I don't know if it's related, but looking at a diff of the build logs for gobject-introspection between x86_64 and s390x I see this:

╎❯ diff -u ../../../build.log ../../../build.log.s390x | egrep -3 '^\+.*warning:'
 CC girepository/cmph/libcmph_la-brz.lo
 make[2]: Leaving directory '/builddir/build/BUILD/gobject-introspection-1.56.1'
+girepository/cmph/brz.c: In function 'brz_pack':
+girepository/cmph/brz.c:813:18: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
+  *g_is_ptr++ = (cmph_uint32)g_i;
+                ^
+girepository/cmph/brz.c: In function 'brz_bmz8_search_packed':
+girepository/cmph/brz.c:934:11: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
+  h1_ptr = (cmph_uint8 *) g_is_ptr[h0];
+           ^
+girepository/cmph/brz.c: In function 'brz_fch_search_packed':
+girepository/cmph/brz.c:998:11: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
+  h1_ptr = (cmph_uint8 *) g_is_ptr[h0];
+           ^
 make[2]: Entering directory '/builddir/build/BUILD/gobject-introspection-1.56.1'

I hadn't heard of cmph before, but a quick google search shows it's a library for producing perfect hashes: https://sourceforge.net/p/cmph/git/ci/master/tree/

Looking at the code around the warnings is a little bleak:

    #if defined (__ia64) || defined (__x86_64__)
        *g_is_ptr++ = (cmph_uint64)g_i;
    #else
        *g_is_ptr++ = (cmph_uint32)g_i;
    #endif
    ...
    #if defined (__ia64) || defined (__x86_64__)
        g_is_ptr = (cmph_uint64 *)packed_mphf;
    #else
        g_is_ptr = packed_mphf;
    #endif

        h1_ptr = (cmph_uint8 *) g_is_ptr[h0];

(the same #if/read pattern appears in both brz_bmz8_search_packed and brz_fch_search_packed)

So, apparently it thinks the only 64-bit arches are Intel ones! Might be a red herring, of course.
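The danger those warnings point at can be sketched in a few lines (the value below is a hypothetical pointer, not anything from the actual build): storing a pointer-sized value through a 32-bit integer, as brz.c does on non-Intel 64-bit arches, silently drops the high bits.

```python
import struct

# Hypothetical 64-bit pointer-sized value; illustrates the cast in
# brz.c only, it is not an address from the real build.
g_i = 0x7F00_0000_1234

# `*g_is_ptr++ = (cmph_uint32)g_i;` stores the value into a 32-bit
# slot: pack the full 64-bit value, keep only the low 4 bytes.
stored, = struct.unpack("<I", struct.pack("<Q", g_i)[:4])

# The later `h1_ptr = (cmph_uint8 *) g_is_ptr[h0];` read would get
# back a corrupted pointer on any 64-bit arch that isn't ia64/x86_64.
assert stored == g_i & 0xFFFF_FFFF   # only the low 32 bits survive
assert stored != g_i                 # the high bits are gone
```

As the follow-up comment notes, this only matters if the brz.c code path is actually exercised, which it apparently is not.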
Looking at the latest upstream version of cmph shows that code hasn't really changed since it was bundled. I think the warnings are a red herring. Looking in the source I see:

gthash.c: cmph_config_set_algo (config, CMPH_BDZ);

and the warnings are in brz.c, so I don't think that code is actually getting used. (I mean, it's choosing the algorithm in bdz.c, not the algorithm in brz.c.)

So comment 0 and comment 8 do seem related. In comment 0 there's this code:

    # Since we discarded all leaf types, this must be a container
    builder = GLib.VariantBuilder.new(gvtype)
    ...
    for i in value:
        dup = variant_type_from_string(remainer_format).dup_string()
        builder.add_value(self._create(dup, i))

leading to the quizzical error:

    TypeError: argument self: Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder

Comment 8 is failing in the first line above, where builder is assigned, but the error message is different:

    AttributeError: type object 'VariantBuilder' has no attribute 'new'

I do see the word threading in the traceback. Does python3 do away with the global interpreter lock? If so, I wonder if we're hitting some sort of concurrency issue. That might explain why it's intermittently failing.

Might want to drop brz.c then.

Apparently, the GIL is still a thing in python3, so I don't think it's a threading issue.

The weird error message from comment 0 comes from here:

    static gboolean
    _pygi_marshal_from_py_interface_object(...)
    {
        if (PyObject_IsInstance (py_arg, iface_cache->py_type) ||
            (PyObject_TypeCheck(py_arg, &PyGObject_Type) &&
             g_type_is_a (G_OBJECT_TYPE (pygobject_get (py_arg)), iface_cache->g_type))) {
            ...
        } else {
            PyObject *module = PyObject_GetAttrString(py_arg, "__module__");

            PyErr_Format (PyExc_TypeError, "argument %s: Expected %s, but got %s%s%s",
                          arg_cache->arg_name ? arg_cache->arg_name : "self",
                          ( (PyGIInterfaceCache *)arg_cache)->type_name,
                          module ? PYGLIB_PyUnicode_AsString(module) : "",
                          module ? "." : "",
                          py_arg->ob_type->tp_name);
        }
    }

I don't know under what circumstances the interface cache is populated, but you could imagine g_type_is_a failing if the gobject associated with the python module got disposed inadvertently.

So this error message:

    AttributeError: type object 'VariantBuilder' has no attribute 'new'

is coming from here:

    from ._gi import \
        Repository, \
        ...

    repository = Repository.get_default()
    ...

    class IntrospectionModule(object):
        ...
        def __getattr__(self, name):
            info = repository.find_by_name(self._namespace, name)
            if not info:
                raise AttributeError("%r object has no attribute %r" % (
                    self.__name__, name))

The C half of this returns NULL (not info) here:

    GIBaseInfo *
    g_irepository_find_by_name (GIRepository *repository,
                                const gchar  *namespace,
                                const gchar  *name)
    {
        ...
        entry = g_typelib_get_dir_entry_by_name (typelib, name);
        if (entry == NULL)
            return NULL;
        ...
    }

g_typelib_get_dir_entry_by_name calls g_typelib_get_dir_entry over and over again with different indices, and that is just:

    return (DirEntry *)&typelib->data[header->directory + (index - 1) * header->entry_blob_size];

typelib->data is the mmaped typelib file. Maybe the typelib file is getting truncated?

Radek, do you happen to have the full journal output?

(In reply to Ray Strode [halfline] from comment #22)
> So this error message:
>
>     AttributeError: type object 'VariantBuilder' has no attribute 'new'
>
> is coming from here:
...
>             raise AttributeError("%r object has no attribute %r" % (
>                 self.__name__, name))

This was wrong. The error messages are close, but subtly different. There would need to be a '%r' or so after the word object.
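The mmaped-typelib theory above can be illustrated with a small stdlib sketch (a stand-in file, not a real typelib): once a mapping goes away, every later lookup through it fails. Python's mmap object guards this with an exception; the C code reading typelib->data would instead crash or return garbage.

```python
import mmap
import tempfile

# Stand-in for a typelib: a small file we map read-only.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"GTYP" + b"\x00" * 60)   # hypothetical header bytes
    f.flush()

    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    assert data[:4] == b"GTYP"        # lookups work while mapped

    data.close()                      # the mapping is torn down
    try:
        data[:4]                      # any later lookup now fails
    except ValueError as err:
        print("stale mapping:", err)
```

In the C case there is no such guard, which is exactly what the gdb munmap experiment later in this bug demonstrates.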
Doing a grep, it's probably coming from python directly:

╎❯ egrep " object '%.*has no attribute" /usr/* -R
Binary file /usr/lib64/libpython3.8.so matches
Binary file /usr/lib64/libpython3.8.so.1.0 matches
╎❯ strings /usr/lib64/libpython3.8.so | egrep "object '.*has no attribute"
type object '%.50s' has no attribute '%U'
╎❯

If we look at GLib.VariantBuilder with python, we see it's implemented with a metaclass:

>>> type(GLib.VariantBuilder)
<class 'gi.types.StructMeta'>

The implementation is here:

    class StructMeta(type, MetaClassHelper):
        def __init__(cls, name, bases, dict_):
            ...
            cls._setup_fields()
            cls._setup_methods()
            ...

_setup_methods mentioned above is just:

    def _setup_methods(cls):
        for method_info in cls.__info__.get_methods():
            setattr(cls, method_info.__name__, method_info)

get_methods() calls g_struct_info_get_method in a loop g_struct_info_get_n_methods() times, so 'new' should be getting setattr'd onto the class here. Indeed, running

>>> delattr(GLib.VariantBuilder, 'new')
>>> GLib.VariantBuilder.new(GLib.VariantType("(b)"))
...
AttributeError: type object 'VariantBuilder' has no attribute 'new'

leads to the same error message, so perhaps the method is somehow getting deleted, or perhaps _setup_methods is never getting called.

One side note: next in the code after _setup_methods is this bit:

    for method_info in cls.__info__.get_methods():
        if method_info.is_constructor() and \
                method_info.__name__ == 'new' and \
                (not method_info.get_arguments() or
                 cls.__info__.get_size() == 0):
            cls.__new__ = staticmethod(method_info)
            # Boxed will raise an exception
            # if arguments are given to __init__
            cls.__init__ = nothing
            break

So interestingly enough, GLib.VariantBuilder.new() isn't even necessary!
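The delattr experiment above can be reduced to a self-contained sketch (toy classes standing in for gi.types.StructMeta, not the real pygobject code): a metaclass attaches methods with setattr when the class is created, and deleting such an attribute afterwards reproduces the exact AttributeError wording from the bug.

```python
# Toy mimic of gi.types.StructMeta: methods get attached via setattr
# at class-creation time, so a later delattr (or a skipped
# _setup_methods) leaves the class without them.
class StructMeta(type):
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._setup_methods()

    def _setup_methods(cls):
        # Stand-in for looping over cls.__info__.get_methods().
        setattr(cls, "new", classmethod(lambda c: c()))

class VariantBuilder(metaclass=StructMeta):
    pass

assert isinstance(VariantBuilder.new(), VariantBuilder)

delattr(VariantBuilder, "new")
try:
    VariantBuilder.new()
except AttributeError as err:
    # type object 'VariantBuilder' has no attribute 'new'
    assert "has no attribute 'new'" in str(err)
```

This shows only that the symptom is reproducible either way; it does not distinguish "deleted later" from "never set up", which is the open question in the thread.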
__new__ gets set up, so the line that was failing:

    builder = GLib.VariantBuilder.new(gvtype)

could have been written:

    builder = GLib.VariantBuilder(gvtype)

If this version works reliably, that would suggest delattr is somehow getting called (versus _setup_methods not getting called in the first place), I suppose.

(In reply to Ray Strode [halfline] from comment #23)
> Radek, do you happen to have the full journal output?

Unfortunately in this case of failure the journal is not collected from the system. I could try to run some batches of kickstart tests with Anaconda updated so we are able to collect the journal, but it will take some time and the chances we hit the condition are uncertain.

(In reply to Radek Vykydal from comment #26)
> Unfortunately in this case of failure the journal is not collected from the
> system. I could try to run some batches of kickstart tests with Anaconda
> updated so we are able to collect the journal but it will take some time and
> the chances we hit the condition are uncertain.

We could update Anaconda so that it shuts down when we pass the get_variant place (so that the otherwise uninteresting test finishes quickly), and update the testing tooling so that the test is not stopped when hitting a traceback (and allows ssh access to the machine).

(In reply to Radek Vykydal from comment #27)
> We could update Anaconda so that it shuts down when we pass the get_variant
> place (so that the test that is not interesting is finished quickly and the
> update testing tooling so that the test is not stopped when hitting a
> traceback (and allows ssh access to the machine).

Or better, update just the kickstart tests tooling to filter the tests quickly so that we don't modify the condition by using an updates image. Clearing needinfo flag.

(In reply to Radek Vykydal from comment #27)
> We could update Anaconda so that it shuts down when we pass the get_variant
> place (so that the test that is not interesting is finished quickly and the
> update testing tooling so that the test is not stopped when hitting a
> traceback (and allows ssh access to the machine).

That is worth trying. I'm also wondering whether we can change the kickstart tests to extend/disable the timeout when this error is detected. We will look at the possibilities.

Clearing needinfo, GLib-2.0.gir was provided in comment 15.

I am now running batches of nightly rhel8 kickstart tests (~180 tests/installations), killing tests that pass the get_variant line. One batch takes about 30 minutes. No luck with reproducing yet. I'll try to get more of our PSI OpenStack resources involved to run more batches.

Is there any way you could somehow run the tests with this scratch build?

Task info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29194774

It adds some debug spew that may help characterize the problem.
(In reply to Ray Strode [halfline] from comment #32)
> is there any way you could somehow run the tests with this scratch build?
>
> Task info:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29194774
>
> it adds some debug spew that may help characterize the problem.

We'd probably need to use an updates image, which can modify the condition, or maybe we could build an iso with the scratch build. Bad news is that so far I've run some 9 batches without any hit.

Created attachment 1696310 [details]
journal

So I've succeeded in reproducing the issue from comment #8 (AttributeError: type object 'VariantBuilder' has no attribute 'new'). Attaching the journal. The VM does not have bridged networking, so to access it you'd need to ssh to the hypervisor (I'd provide access) and from the hypervisor do ssh root.122.55 to the installer VM.

(In reply to Radek Vykydal from comment #34)
> Created attachment 1696310 [details]
> journal
>
> So I've succeeded in reproducing the issue from comment #8 (AttributeError:
> type object 'VariantBuilder' has no attribute 'new').

Err, actually it is the case from the description:

    TypeError: argument self: Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder

Created attachment 1696314 [details]
journal

Actually reproduced also the case from comment #8 (AttributeError: type object 'VariantBuilder' has no attribute 'new'). Attaching the journal. Access from the hypervisor would be ssh root.122.182.

(In reply to Radek Vykydal from comment #34)
> So I've succeeded in reproducing the issue from comment #8 (AttributeError:
> type object 'VariantBuilder' has no attribute 'new').
> Attaching the journal.
> The VM does not have bridged networking so to access it you'd need
> ssh to hypervisor (I'd provide access) and from hypervisor do
> ssh root.122.55 to installer VM.
The access (password is fedora):

[rvykydal@prnot3 ~]$ ssh fedora.136.57
fedora.136.57's password:
Last login: Tue Jun 9 12:55:51 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-0 ~]$ ssh root.122.55
Last login: Tue Jun 9 12:55:12 2020 from 192.168.122.1
[anaconda root@localhost ~]#

(In reply to Radek Vykydal from comment #36)
> Created attachment 1696314 [details]
> journal
>
> Actually reproduced also the case from comment #8:
> (AttributeError: type object 'VariantBuilder' has no attribute 'new').
> Attaching the journal.
> Access from the hypervisor would be ssh root.122.182

The access (password is fedora):

[rvykydal@prnot3 ~]$ ssh fedora.138.28
fedora.138.28's password:
Last login: Tue Jun 9 12:59:15 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-1 ~]$ ssh root.122.182
Last login: Tue Jun 9 13:01:22 2020 from 192.168.122.1
[anaconda root@localhost ~]#

I'll try to hit the issue with an updates image containing the scratch build.

So I had one theory that perhaps the typelib was inadvertently getting unmapped out from under pygobject. I tried to simulate that, by running:

[anaconda root@localhost ~]# /usr/libexec/platform-python
Python 3.6.8 (default, Apr 3 2020, 16:09:51)
[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gi
>>> from gi.repository import GLib
>>> builder = GLib.VariantBuilder.new(GLib.VariantType("(su)"))

[anaconda root@localhost ~]# cat /proc/$(pidof platform-python3.6)/maps | grep typelib
7fd829d58000-7fd829d86000 r--p 00000000 fd:00 12761 /usr/lib64/girepository-1.0/GLib-2.0.typelib

Then in gdb running:

(gdb) call (void) munmap(0x7fd829d58000,0x7fd829d86000-0x7fd829d58000)

and going back to python and running:

>>> builder2 = GLib.VariantBuilder.new(GLib.VariantType("(su)"))

This led to different symptoms, a crash:

Program received signal SIGSEGV, Segmentation fault.
0x00007fd82d55b10f in g_function_info_get_flags () from /lib64/libgirepository-1.0.so.1
(gdb) up
#1 0x00007fd82d7aaa4a in _function_info_descr_get () from /usr/lib64/python3.6/site-packages/gi/_gi.cpython-36m-x86_64-linux-gnu.so
(gdb)
#2 0x00007fd83c363091 in type_getattro () from /lib64/libpython3.6m.so.1.0
(gdb)
#3 0x00007fd83c361d52 in PyObject_GetAttr () from /lib64/libpython3.6m.so.1.0
(gdb)
#4 0x00007fd83c3b6005 in _PyEval_EvalFrameDefault () from /lib64/libpython3.6m.so.1.0

But note frame 2 is type_getattro, which is the same function where the error

    AttributeError: type object 'VariantBuilder' has no attribute 'new'

is raised. So maybe we're in the right ballpark. The crash is likely from here (though I lack debug symbols to be sure):

    /* Look for the attribute in the metatype */
    meta_attribute = _PyType_Lookup(metatype, name);

    if (meta_attribute != NULL) {
        meta_get = Py_TYPE(meta_attribute)->tp_descr_get;

which calls:

    _function_info_descr_get (PyGICallableInfo *self, PyObject *obj, PyObject *type)
    {
        ...
        flags = g_function_info_get_flags ( (GIFunctionInfo*) self->base.info);

which dereferences:

    g_function_info_get_flags (GIFunctionInfo *info)
    {
        ...
        blob = (FunctionBlob *)&rinfo->typelib->data[rinfo->offset];

After my gdb shenanigans, data is unmapped, thus the crash.

So does this help us with the "real failure"? Maybe, a little. Looking at the code again:

    /* Look for the attribute in the metatype */
    meta_attribute = _PyType_Lookup(metatype, name);

    if (meta_attribute != NULL) {
        meta_get = Py_TYPE(meta_attribute)->tp_descr_get;

        if (meta_get != NULL && PyDescr_IsData(meta_attribute)) {
            /* Data descriptors implement tp_descr_set to intercept
             * writes. Assume the attribute is not overridden in
             * type's tp_dict (and bases): call the descriptor now.
             */
            return meta_get(meta_attribute, (PyObject *)type,
                            (PyObject *)metatype);
        }
        Py_INCREF(meta_attribute);
    }

We see that if we get to tp_descr_get (as we apparently should under working conditions), we return from the function. tp_descr_get can't return NULL in our case, so this suggests the real failure is causing

    meta_attribute = _PyType_Lookup(metatype, name);

to return NULL. (The error message is raised at the very bottom of that function, so it can't be returning early in our failure case.)

So while I'm waiting on the latest debug output from QE, I decided to toy around in a local VM. I didn't expect to be able to reproduce the problem (since it takes many iterations during automated testing to hit the bug), but I was hopeful I might be able to spot some sort of otherwise silent heap-corrupting bug with valgrind. So I booted into the installer, killed anaconda, installed valgrind, edited /usr/libexec/anaconda/start-module to run the module through valgrind, then started anaconda. No luck:

[anaconda root@localhost tmp]# journalctl | grep 2239
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Memcheck, a memory error detector
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Command: /usr/libexec/platform-python -m pyanaconda.modules.subscription
Jun 09 18:58:26 localhost.localdomain org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Process terminating with default action of signal 15 (SIGTERM)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== at 0x6300EA1: poll (in /usr/lib64/libc-2.28.so)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== by 0x1939B9B5: ??? (in /usr/lib64/libglib-2.0.so.0.5600.4) ct_FastCallKeywords (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== by 0x54C03A5: ??? (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== by 0x54C0EA7: _PyEval_EvalFrameDefault (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== by 0x541D5F5: ??? (in /usr/lib64/libpython3.6m.so.1.0)
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== HEAP SUMMARY:
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== in use at exit: 23,309,258 bytes in 208,281 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== total heap usage: 657,309 allocs, 449,028 frees, 85,971,811 bytes allocated
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== of which reachable via heuristic:
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== length64 : 792 bytes in 15 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== newarray : 1,826 bytes in 31 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== suppressed: 0 bytes in 0 blocks
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== Rerun with --leak-check=full to see details of leaked memory
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239==
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== For lists of detected and suppressed errors, rerun with: -s
Jun 09 19:01:58 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com org.fedoraproject.Anaconda.Modules.Subscription[2199]: ==2239== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

So it's worth pointing out that every time it's crashed, it's crashed while trying to resolve the parameters in a dbus RPC call. In comment 0, comment 8 and comment 36 it's crashing on rhsm_config_proxy.Set(). Set is a remote procedure call, not a native method. In comment 34, it's crashing when trying to request a name on the bus: self.proxy.RequestName (also a remote procedure call).

So, I've started to think about what sorts of things could make this non-deterministic, and one thought I had was "Maybe it's garbage collection kicking in?". So I tried manually adding gc.collect() throughout the problematic areas, but it didn't help in creating a reliable reproducer, unfortunately. We could also try the opposite: do a debug build with gc.disable() and see if that makes the problem go away completely. If it does have an effect, it might give us a hint on where the problem is.

I want to revisit comment 21, since I've looked at it a bit closer now. The failing if statement is:

    if (PyObject_IsInstance (py_arg, iface_cache->py_type) ||
        (PyObject_TypeCheck(py_arg, &PyGObject_Type) &&
         g_type_is_a (G_OBJECT_TYPE (pygobject_get (py_arg)), iface_cache->g_type))) {

The object at hand is not a gobject, so the second half is completely irrelevant. That means

    PyObject_IsInstance (py_arg, iface_cache->py_type)

is failing: the builder instance is no longer a PyObject of the type expected in the cache. (I meant to say "of the type expected in the cache" in the last comment.) Maybe iface_cache->py_type is getting overwritten with a new type?

I've been doing a bit of code reading, so I'm just going to dump my findings here.
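The "py_type got replaced" idea can be sketched in pure Python (toy classes, nothing from pygobject): if the class object cached for an isinstance() check is ever re-created, instances of the new class fail the check against the old cached one, even though both classes carry the same name, much like PyObject_IsInstance failing against iface_cache->py_type.

```python
# First class object: stands in for the py_type cached in the
# argument cache when the method was first invoked.
class VariantBuilder:
    pass

cached_py_type = VariantBuilder

# Hypothetical re-creation of the type (e.g. the module attribute
# being set up a second time): a brand-new class object, same name.
class VariantBuilder:
    pass

builder = VariantBuilder()     # instance of the new class

assert not isinstance(builder, cached_py_type)        # cache check fails
assert type(builder).__name__ == cached_py_type.__name__  # same name!
```

That identity-vs-name mismatch would also explain the confusing "Expected GLib.VariantBuilder, but got gi.repository.GLib.VariantBuilder" wording: both names describe the same-looking type, but they are different class objects.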
iface_cache->py_type is apparently set up in pygi_arg_interface_setup: iface_cache->py_type = pygi_type_import_by_gi_info ( (GIBaseInfo *) iface_info);• which is roughly implemented as: module_name = g_strconcat ("gi.repository.", namespace_, NULL);• py_module = PyImport_ImportModule (module_name);• py_object = PyObject_GetAttrString (py_module, name);• so it's basically doing the equivalent of: from gi.repository.GLib as GLib py_type = getattr(GLib, "VariantBuilder") the /chain for setting up py_type for the self argument of this failing builder.add_value method call is something like: pygi_arg_interface_setup -> pygi_arg_interface_new_from_info -> pygi_arg_struct_new_from_info -> _arg_cache_new_for_interface -> _function_with_instance_cache_generate_args_cache_real -> _callable_cache_init -> _function_with_instance_cache_init -> pygi_method_cache_new -> _wrap_g_callable_info_invoke _wrap_g_callable_info_invoke is notable: PyObject *• _wrap_g_callable_info_invoke (PyGIBaseInfo *self, PyObject *py_args,• PyObject *kwargs)• {• ... if (self->cache == NULL) {• ... function_cache = pygi_method_cache_new (self->info);• ... self->cache = (PyGICallableCache *)function_cache;• ... }• ... return pygi_callable_info_invoke (self->info, py_args, kwargs, self->cache, NULL); }• So if the cache isn't already set up, it sets it up then it invokes the method which marshals the args and uses ffi. The full chain for the marshalling of the self argument I think should be something like: pygi_callable_info_invoke -> pygi_function_cache_invoke -> _function_cache_invoke_real -> pygi_invoke_c_callable -> _invoke_marshal_in_args -> arg_type_class_from_py_marshal Note our error message comes from _pygi_marshal_from_py_interface_object which is weird for this type of call. It's an analog to arg_type_class_from_py_marshal but for a different type of argument! 
arg_type_class_from_py_marshal is for boxed structs (like GVariantBuilder), _pygi_marshal_from_py_interface_object is for GObject arguments. It's like it thinks the builder variable is a GObject instead of GBoxed. They way it decides is from the cache generated in _wrap_g_callable_info_invoke. Part of that cache generation involves calling _arg_cache_new_for_interface which does this: {• ... info_type = g_base_info_get_type ( (GIBaseInfo *)iface_info);• ... switch (info_type) {• ... case GI_INFO_TYPE_OBJECT:• case GI_INFO_TYPE_INTERFACE:• return pygi_arg_gobject_new_from_info (type_info,• arg_info,• transfer,• direction,• iface_info,• callable_cache);• case GI_INFO_TYPE_BOXED:• ... return pygi_arg_struct_new_from_info (type_info,• arg_info,• transfer,• direction,• iface_info);• ... }• So apparently info_type is the wrong value. iface_info comes from one frame higher _function_with_instance_cache_generate_args_cache_real, which does: interface_info = g_base_info_get_container ((GIBaseInfo *) callable_info); callable_info should be a GIFunctionInfo and it's container should be a GIStructInfo, but the container is apparently a GIObjectInfo or GIInterfaceInfo. The GIFunctionInfo is supposed to come from the stuff i talked about in comment 25: class StructMeta(type, MetaClassHelper):• def __init__(cls, name, bases, dict_):• ... cls._setup_fields()• cls._setup_methods()• ... _setup_methods mentioned above is just: def _setup_methods(cls):• for method_info in cls.__info__.get_methods():• setattr(cls, method_info.__name__, method_info)• So at some point the above code should have done setattr(GLib.VariantBuilder, "add_value", method_info), where method_info was a GIFunctionInfo who's container was GIStructInfo. No answers yet, but at least I understand a little better how it's supposed to work. I'll try to a build that disables Garbage Collection to see if that makes the problem go away. this build (assuming it succeeds) should disable garbage collection. 
If garbage collection is implicated, this build will never hit the bug:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637

this build (assuming it succeeds) should make it garbage collect much more frequently. If garbage collection is implicated, this build may help it hit the bug more often:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Created attachment 1696433 [details]
journal for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469

Log from kstest run hitting a traceback using the iso containing this scratch build:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
(updated version of build from comment #32)

Just to be clear about the instrumenting of the kickstart tests to pick up the reproducer fast, I am killing the test if it reaches:
https://github.com/rhinstaller/anaconda/blob/8bea2c61f155a2fd5e17f53b7a6dd5656951c8e6/pyanaconda/modules/subscription/initialization.py#L76
by looking for the message in the log. The reproducer happens a line before:
https://github.com/rhinstaller/anaconda/blob/8bea2c61f155a2fd5e17f53b7a6dd5656951c8e6/pyanaconda/modules/subscription/initialization.py#L74
I can modify this to your needs.

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection. If
> garbage collection is implicated, this build will never hit the bug:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
>
> this build (assuming it succeeds) should make it garbage collect much more
> frequently. If garbage collection is implicated, this build may help it hit
> the bug more often:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Jirka, could you please create boot.isos for both? I'll run the test batch with them.
(In reply to Radek Vykydal from comment #49)
> Created attachment 1696433 [details]
> journal for scratch build
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
>
> Log from kstest run hitting a traceback using the iso containing this
> scratch build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> (updated version of build from comment #32)

Access to the vm (password is fedora):

[rvykydal@prnot3 ~]$ ssh fedora.136.57
fedora.136.57's password:
Last login: Wed Jun 10 08:08:36 2020 from 10.43.26.115
[fedora@host-f32-2ff2ef-0 ~]$ ssh root.122.114
Last login: Wed Jun 10 08:10:11 2020 from 192.168.122.1
[anaconda root@localhost ~]#

Created attachment 1696466 [details]
another log for scratch build https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469 (.debug)

(In reply to Radek Vykydal from comment #49)
> Created attachment 1696433 [details]
> journal for scratch build
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
>
> Log from kstest run hitting a traceback using the iso containing this
> scratch build:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29217469
> (updated version of build from comment #32)

Another instance of the reproducer for the same pygobject scratch build (.debug)

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection. If
> garbage collection is implicated, this build will never hit the bug:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
>
> this build (assuming it succeeds) should make it garbage collect much more
> frequently. If garbage collection is implicated, this build may help it hit
> the bug more often:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Thank you Jirka for building the isos. Hunts for the reproducer for both builds are running.
Created attachment 1696491 [details]
log with Traceback (.debug.1 scratch build)

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should disable garbage collection. If
> garbage collection is implicated, this build will never hit the bug:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637

Attaching log with Traceback hit after a few batches (normally we hit the Traceback after ~7-10 batches)

Access to the VM:

[rvykydal@prnot3 ~]$ ssh fedora.137.248
fedora.137.248's password:
Last login: Wed Jun 10 13:34:22 2020 from 10.43.26.115
[fedora@race-hunter1-a052d7-1 ~]$ ssh root.122.28
Last login: Wed Jun 10 13:36:00 2020 from 192.168.122.1
[anaconda root@localhost ~]#

We do see the typelib is still mapped:

Jun 09 20:38:05 localhost org.fedoraproject.Anaconda.Modules.Subscription[1897]: 7f8cba7e4000-7f8cba812000 r--p 00000000 fd:00 141820 /usr/lib64/girepository-1.0/GLib-2.0.typelib

So that definitively rules out spurious munmap calls being the problem.
This is interesting:

Jun 09 20:38:05 localhost org.fedoraproject.Anaconda.Modules.Subscription[1897]: GLib.VariantBuilder: ['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__gtype__', '__hash__', '__info__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_free_on_dealloc', 'copy']

It's supposed to be:

['__class__', '__del__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__gtype__', '__hash__', '__info__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_free_on_dealloc', 'add_value', 'close', 'copy', 'end', 'new', 'open', 'ref', 'unref']

So 'add_value', 'close', 'end', 'new', 'open', 'ref', and 'unref' are missing; these come from the typelib. 'copy' is present. It comes from pygboxed.c:

{ "copy", (PyCFunction) gboxed_copy, METH_NOARGS },

Indeed, if I repeat the same experiment from comment 25, but try to get rid of "copy" instead of "new", it fails:

>>> delattr(GLib.VariantBuilder, 'copy')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: copy
>>>

So it does seem like perhaps all the attributes are getting deleted on the object, or aren't getting added in the first place. Maybe we should be printing out GLib.VariantBuilder.__dict__ as well. I suspect it's pretty bare.

As an experiment, I decided to change the code to use GObjectMeta instead of StructMeta, to see if it would fail in the same way. Unfortunately, it seems to still call _setup_methods, so all the methods get properly added.
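The observed dir() output (only 'copy' present, all typelib methods missing) is exactly what you would get if the metaclass bailed out before _setup_methods ran. A hypothetical pure-Python analogue (all names here are invented for the sketch; this is not pygobject's actual code):

```python
class SetupMeta(type):
    """Stand-in for StructMeta: attaches 'typelib' methods in __init__,
    but returns early when a (simulated) pytype cache entry exists."""
    pytype_cache = {}  # stand-in for g_type.pytype

    def __init__(cls, name, bases, dict_):
        super().__init__(name, bases, dict_)
        if SetupMeta.pytype_cache.get(name) is not None:
            return  # early return: the setattr loop below never runs
        for method_name in ("add_value", "close", "end", "new", "open"):
            setattr(cls, method_name, lambda self, _n=method_name: _n)

# Simulate the bug: something already populated the cache entry
# before the class got created.
SetupMeta.pytype_cache["VariantBuilder"] = object()

class VariantBuilder(metaclass=SetupMeta):
    def copy(self):
        # Analogue of "copy" from pygboxed.c: defined on the type
        # directly, so it survives even when _setup_methods is skipped.
        return "copy"
```

With the cache entry pre-set, hasattr(VariantBuilder, "copy") is True while hasattr(VariantBuilder, "add_value") is False, mirroring the dir() output above.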
Looking in StructMeta, I do see an early return:

class StructMeta(type, MetaClassHelper):
    def __init__(cls, name, bases, dict_):
...
        # Avoid touching anything else than the base class.
        g_type = cls.__info__.get_g_type()
        if g_type != TYPE_INVALID and g_type.pytype is not None:
            return
...
        cls._setup_methods()

So perhaps we're hitting that. Indeed, as an experiment, if I add:

if g_type.name == 'GVariantBuilder':
    return

before cls._setup_methods(), the problem reproduces. So maybe g_type.pytype is somehow inadvertently getting set?

There's also this chunk of code that's interesting:

# Check if there is already a Python wrapper that is not a parent class
# of the wrapper being created. If it is a parent, it is ok to clobber
# g_type.pytype with a new child class wrapper of the existing parent.
# Note that the return here never occurs under normal circumstances due
# to caching on the __dict__ itself.
if g_type != TYPE_NONE:
    type_ = g_type.pytype
    if type_ is not None and type_ not in bases:
        self.__dict__[name] = type_
        return type_
...
wrapper = metaclass(name, bases, dict_)

So the code is expecting that, if the pytype already set on the object is a parent, the pytype will get clobbered when running through the metaclass. As mentioned in the previous comment, that doesn't happen though:

# Avoid touching anything else than the base class.
g_type = cls.__info__.get_g_type()
if g_type != TYPE_INVALID and g_type.pytype is not None:
    return

So if pytype is set to "Boxed", it's never going to get "VariantBuilder"'s methods and pytype. I don't know why pytype would be Boxed, though (or if it is).

(In reply to Radek Vykydal from comment #55)
> (In reply to Ray Strode [halfline] from comment #48)
> > this build (assuming it succeeds) should disable garbage collection.
> > If garbage collection is implicated, this build will never hit the bug:
> >
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247637
>
> Attaching log with Traceback hit after a few batches (normally we hit
> Traceback after ~7-10 batches)

Okay, so the GC is in the clear. I think the next move should be to instrument the code to print when pytype is set. That way we can see if it's the reason the methods aren't getting added to the object.

(In reply to Ray Strode [halfline] from comment #48)
> this build (assuming it succeeds) should make it garbage collect much more
> frequently. If garbage collection is implicated, this build may help it hit
> the bug more often:
>
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646

Interestingly, after 8 batches, when we are usually close to hitting a reproducer, no reproducer here. I'll wait a bit more before shutting down this series.

(In reply to Radek Vykydal from comment #59)
> (In reply to Ray Strode [halfline] from comment #48)
> > this build (assuming it succeeds) should make it garbage collect much more
> > frequently. If garbage collection is implicated, this build may help it hit
> > the bug more often:
> >
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29247646
>
> Interestingly, after 8 batches when we are usually close to hitting a
> reproducer, no reproducer here. I'll wait a bit more before shutting down
> this series.

Maybe it's a timing thing; the GCs slow it down enough that it never fails?

I've put a new build here:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29279557

This one prints a stack trace any time someone tries to access pytype on GVariantBuilder.

So at this point, since we've ruled out the GC, and very likely ruled out heap corruption, the main contender in my mind has again turned back to some sort of concurrency race involving multiple threads.
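The instrumentation idea can be sketched in pure Python with a property that records a stack trace on every assignment (purely an illustration; the real pytype slot lives in pygobject's C code, so the actual instrumented build patches that instead):

```python
import io
import traceback

class InstrumentedGType:
    """Hypothetical stand-in for a g_type whose pytype slot logs every
    assignment, so we can see who sets it and from where."""

    def __init__(self):
        self._pytype = None
        self.set_log = []  # one recorded stack trace per assignment

    @property
    def pytype(self):
        return self._pytype

    @pytype.setter
    def pytype(self, value):
        buf = io.StringIO()
        traceback.print_stack(file=buf)  # record who is setting pytype
        self.set_log.append(buf.getvalue())
        self._pytype = value
```

Reads stay cheap; every write leaves a trace behind, which is what lets you catch an unexpected writer in a rarely reproducing race.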
Looking at the subscription module, it does have two threads running:

class SubscriptionService(KickstartService):
...
    def __init__(self):
...
        self._rhsm_startup_task = StartRHSMTask()
...
    def run(self):
        """Initiate RHSM service startup before starting the main loop.

        This way RHSM service can startup in parallel without blocking
        startup of the Subscription module.
        """
        self._rhsm_startup_task.start()
        super().run()

_rhsm_startup_task.start() starts the thread where the crash happens, and super().run() is this:

def run(self):
    """Run the loop."""
    log.debug("Publish the service.")
    self.publish()
    log.debug("Start the loop.")
    self._loop.run()

We can see the first of those debug messages in the log shortly before the crash:

Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.common.base.base:Publish the service.

self.publish() does two things:

DBus.publish_object(SUBSCRIPTION.object_path, SubscriptionInterface(self))
DBus.register_service(SUBSCRIPTION.service_name)

The first shows up right away in the log:

Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:dasbus.connection:Publishing an object at /org/fedoraproject/Anaconda/Modules/Subscription.

But then we jump over to the other thread:

Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:anaconda.core.dbus:Connecting to the Anaconda bus at unix:abstract=/tmp/dbus-rmvcQJefAZ,guid=ff26b875e7c66ca5840fc0565ee0d50d.
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.subscription.initialization:subscription: creating /etc/yum.repos.d
Jun 10 12:41:58 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:program:Running...
systemctl start rhsm.service

and jump back, where we see DBus.register_service run:

Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:dasbus.connection:Registering a service name org.fedoraproject.Anaconda.Modules.Subscription.

and back to the second thread (the one that ultimately tanks):

Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:program:Return code: 0
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: DEBUG:anaconda.modules.subscription.initialization:subscription: setting RHSM log level to DEBUG

Looks like we try to get two bus connections on the system bus at the same time:

Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:dasbus.connection:Connecting to the system bus.
Jun 10 12:41:59 localhost org.fedoraproject.Anaconda.Modules.Subscription[2042]: INFO:dasbus.connection:Connecting to the system bus.

Both threads are trying to do bus calls at the same time. One thread is doing:

result = self.proxy.RequestName(...)

and the other thread is doing:

rhsm_config_proxy.Set(...)

The DBus proxy object code implicitly gets a connection to the bus during this call. The crash is shortly after the connection is established; it's here:

if in_type is not None:
    parameters = get_variant(in_type, parameters)

get_variant is just a wrapper around GLib.Variant provided by pygobject. Since the other thread is exercising pretty much the same code path, it's likely at around the same place in the code. So there is a strong possibility we're hitting a thread safety issue here.

So I have a theory that's similar to comment 57, but tweaked to take into account interactions that may be caused by multiple threads. Both threads are essentially accessing GLib.VariantBuilder at the same time.
This means this code is running in two threads concurrently:

class IntrospectionModule(object):
    def __getattr__(self, name):
        info = repository.find_by_name(self._namespace, name)
...
        elif isinstance(info, RegisteredTypeInfo):
            g_type = info.get_g_type()
...
            # Create a wrapper.
...
            elif isinstance(info, (StructInfo, UnionInfo)):
                if g_type.is_a(TYPE_BOXED):
                    bases = (Boxed,)
...
                metaclass = StructMeta
...
            type_ = g_type.pytype
            if type_ is not None and type_ not in bases:
                self.__dict__[name] = type_
                return type_
...
            wrapper = metaclass(name, bases, dict_)
...
            g_type.pytype = wrapper

As mentioned in comment 57, it's possible for the wrapper to be created incompletely, without fully setting up the object, if pytype is set. If that happens, the incomplete wrapper will get used as the new type object from then on out.

But there's a check for whether pytype is already set, and an early return. Normally, if pytype is set to anything but "Boxed", it will just get returned and used, avoiding the problematic incomplete wrapper.

Of course, in the scenario where two threads are running concurrently, pytype won't be "Boxed"; it will be either None or one of the wrappers. So, in theory, the early return should fire, and there should be no issues. But what if pytype gets set from the other thread right after the check in this thread? It would have the very symptoms we're seeing!

One niggle is that I know, thanks to the global interpreter lock, that two threads don't exactly run simultaneously. They take turns, one thread waiting on the lock and the other executing code until it drops the lock. The lock gets dropped at certain preemption points (like when calling into glib code). I don't know for sure that there is such a preemption point after the

if type_ is not None and type_ not in bases:

check, and before the

if g_type != TYPE_INVALID and g_type.pytype is not None:

check. But there might be, and if there is, then it seems to fit the bug pretty well.
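The suspected interleaving can be made deterministic in a pure-Python sketch by injecting a "preemption" callback between the cache check and the wrapper construction. All names below are invented; this models the check-then-publish flow and one possible lock-based fix, not pygobject's actual code or the actual patch:

```python
import threading

class FakeGType:
    # Minimal stand-in for the introspection g_type with its pytype slot.
    def __init__(self):
        self.pytype = None

_creation_lock = threading.Lock()

def get_wrapper_racy(g_type, preempt=lambda: None):
    # IntrospectionModule.__getattr__ analogue: check, (GIL drop!), build.
    if g_type.pytype is not None:
        return g_type.pytype
    preempt()  # another thread may publish a competing wrapper right here
    # StructMeta.__init__ analogue: re-check pytype and bail out early,
    # so the wrapper comes back without its typelib methods attached.
    methods_attached = g_type.pytype is None
    wrapper = {"methods_attached": methods_attached}
    if g_type.pytype is None:
        g_type.pytype = wrapper
    return wrapper

def get_wrapper_locked(g_type):
    # One way to close the window: hold a lock across check and publish,
    # so no other thread can slip a wrapper in between.
    with _creation_lock:
        if g_type.pytype is not None:
            return g_type.pytype
        wrapper = {"methods_attached": True}
        g_type.pytype = wrapper
        return wrapper
```

Calling get_wrapper_racy with a preempt callback that publishes a competing wrapper yields a wrapper whose methods were never attached, the same "has no attribute" symptom as the traceback, while the locked variant always returns one fully set-up wrapper.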
Created attachment 1696587 [details]
Patch that I think will fix the problem.
Assuming my theory is correct, I believe the above patch should fix the problem.
I'm putting a scratch build here:

Task info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29285014

Moving to POST. If it checks out after a day of testing tomorrow, I'll do an official build. Radek, can you qa ack?

Created attachment 1696598 [details]
Patch that I think will fix the problem, now with 100% more patch

attachment 1696587 [details] only had the spec file changes. It lacked the actual patch.

Thank you Ray, to me the fix seems very promising.

I am not sure I can formally give qa_ack being a devel, but I'll definitely test the build, running it on as many resources as I can get (in PSI OpenStack). Normally the condition is hit sooner than ~10 batches are run [1] (each 170 installations, taking ~35 mins to run), so about 1 in 1500 installations. I'll run several series of batches with patched pygobject and one or two series without the patch in parallel.

[1] we had only 2 exceptions:
- using the updates image we needed 14 batches
- using the gcextra build we didn't hit the condition even after 31 batches

but I think we had enough cases hitting the condition as expected that it is very likely that if we don't hit it with the patch, it is because the cause is actually fixed.

(In reply to Ray Strode [halfline] from comment #64)
> I putting a scratch build here:
>
> Task info:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=29285014
>
> Moving to POST. If it checks out after a day of testing tomorrow I'll do an
> official build. Radek, can you qa ack?

New ISO is created for this build:
http://10.43.136.2/trees/rhel-8-scratch/bug1844578_task29285014_boot.iso

Starting the tests.

(In reply to Radek Vykydal from comment #66)
> I am not sure I can formally give qa_ack being a devel, but I'll definitely
> test the build running it on as much resources I can get (in PSI OpenStack).

ah, thanks for the work you put in getting a rare bug to reproduce many times over the last days. I assumed only someone with QE chops would be able to pull that off...
tpelka, is this something you can qa ack?

Current results of series of batches run in parallel:

The issue was hit in both non-patched instances.
- original1: reproduced in 2nd batch
- original2: reproduced in 5th batch

With patched pygobject the issue has not been hit after
- patched1: 7 batches
- patched2: 5 batches
- patched3: 5 batches

Seems good so far.
test case
So this test, I think, simulates the bug. If I run it locally, within a minute or so I see:
╎❯ python3 threadtest.py
type object 'VariantBuilder' has no attribute 'new'
'VariantBuilder' object has no attribute 'unref'
If I use a version of pygobject3 with my patch, I don't see the problem after several minutes.
I think the delattr stuff I'm doing in the test isn't really kosher, though, and it's leading to a fairly fast memory leak, unfortunately.
(In reply to Radek Vykydal from comment #70)
> With patched pygobject the issue has not been hit after
> - patched1: 7 batches
> - patched2: 5 batches
> - patched3: 5 batches

Still not hit after 10, 8, 8 batches. I'll do one more update in the evening. I'll also run a regular complete run of the kickstart tests (without the instrumentation looking for the reproducer) with the iso.

(In reply to Radek Vykydal from comment #72)
> (In reply to Radek Vykydal from comment #70)
> > With patched pygobject the issue has not been hit after
> > - patched1: 7 batches
> > - patched2: 5 batches
> > - patched3: 5 batches
>
> Still not hit after 10, 8, 8 batches. I'll do one more update in the evening.
> I'll also run regular complete run of kickstart tests (without
> instrumentation looking for the reproducer) with the iso.

17, 13, 12 batches without hitting the problem. I am releasing the resources, considering the test PASSED.

great, i'll do a build now then.

Great work Ray with the debugging and Radek with testing! You guys did an amazing job here!

(In reply to Radek Vykydal from comment #72)
> I'll also run regular complete run of kickstart tests (without
> instrumentation looking for the reproducer) with the iso.

There are no regressions in the complete kickstart tests run using the iso with the patched pygobject3.

(In reply to Jiri Konecny from comment #75)
> Great work Ray with the debugging and Radek with testing! You guys did an
> amazing job here!

Indeed, totally amazing! :)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: GNOME security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4451