SIGSEGV in memory profiler (when `memalloc_add_event` calls `traceback_free`) #11751

oranav · 2024-12-17T10:17:17Z

We're hitting SIGSEGVs every now and then with the memory profiler.

Python version is 3.11.11. ddtrace is 2.17.3. We're using the amd64 architecture.

I've extracted a native stack traceback from the coredump:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fb47c227f1f in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fb47c1d8fb2 in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  0x00007fb47a2f222f in ?? () from /app/venv/lib/python3.11/site-packages/ddtrace/internal/datadog/profiling/ddup/../libdd_wrapper-glibc-x86_64.so
#4  <signal handler called>
#5  0x00007fb47c523964 in _Py_Dealloc (op=<unknown at remote 0x7fb47ac9c230>) at Objects/object.c:2390
#6  0x00007fb4781a86e5 in traceback_free () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#7  0x00007fb4781a7a40 in memalloc_add_event.part () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#8  0x00007fb4781a7cbe in memalloc_malloc () from /app/venv/lib/python3.11/site-packages/ddtrace/profiling/collector/_memalloc.cpython-311-x86_64-linux-gnu.so
#9  0x00007fb47c545be6 in PyObject_Malloc (size=44) at Objects/obmalloc.c:712
#10 _PyBytes_FromSize (size=11, use_calloc=0) at Objects/bytesobject.c:103
#11 0x00007fb47c583a47 in PyBytes_FromStringAndSize (size=11, str=0x7fb3efd2d820 "<REDACTED>") at Objects/bytesobject.c:136

It seems to me that this call accesses some invalid memory.

I believe #11460 might fix it. A possible explanation: if reservoir sampling yields the same index in two threads, both threads decide to ditch the same traceback, and `traceback_free` then gets called twice on the same pointer (as long as it isn't guarded by a lock).
I'm not sure that's the case, but it's a possible explanation.
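To make the hypothesis above concrete, here is a minimal Python sketch of reservoir sampling where the slot swap is guarded by a lock, so each evicted sample is freed by exactly one thread. This is not ddtrace's code: `Traceback`, `Reservoir`, and `run` are hypothetical names invented for illustration, and the real `_memalloc` extension is written in C. Without the lock, two threads could read the same old entry from the same slot and both free it, which is the suspected double `traceback_free`.

```python
import random
import threading


class Traceback:
    """Hypothetical stand-in for a sampled traceback; counts how often it is freed."""

    def __init__(self, ident):
        self.ident = ident
        self.free_count = 0

    def free(self):
        self.free_count += 1


class Reservoir:
    """Fixed-size reservoir sample with lock-guarded slot replacement (illustrative only)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.slots = []
        self.seen = 0
        self._rng = random.Random(seed)
        self._lock = threading.Lock()

    def add(self, tb):
        with self._lock:
            self.seen += 1
            if len(self.slots) < self.capacity:
                # Reservoir not full yet: keep the sample, nothing to free.
                self.slots.append(tb)
                return
            index = self._rng.randrange(self.seen)
            if index < self.capacity:
                # Swap under the lock so only this thread owns the old entry.
                evicted = self.slots[index]
                self.slots[index] = tb
            else:
                # The new sample was not selected; it is the one to discard.
                evicted = tb
        # Exactly one thread holds a reference to `evicted` here.
        evicted.free()


def run(threads=4, per_thread=1000, capacity=8):
    """Feed samples from several threads and return the reservoir and all samples."""
    reservoir = Reservoir(capacity)
    batches = [[Traceback((t, i)) for i in range(per_thread)] for t in range(threads)]

    def worker(batch):
        for tb in batch:
            reservoir.add(tb)

    workers = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return reservoir, [tb for batch in batches for tb in batch]
```

After `run()`, every sample should have been freed at most once, and the samples still held in the reservoir should not have been freed at all. Whether #11460 uses a lock in this way or a different mechanism is not stated in this thread.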


sanchda · 2024-12-17T14:03:06Z

👋 Thank you for the report, @oranav. #11460 is indeed the fix for this.

sanchda · 2024-12-20T14:00:23Z

FYI, this will be released (later today, I hope) in 2.18.1. I'm also attempting to back-port to the 2.17 and 2.16 lines (🤞). It'll be part of mainline starting in the 2.19.0 release.

sanchda · 2024-12-20T22:19:51Z

Confirming that 2.18.1 shipped. Would love to hear some folks weigh in on whether or not it solved this problem for them.

apenney · 2025-01-08T14:35:00Z

We have some really serious production issues with 2.18.1 (and previous versions), with 2.14.1 being the last working one for us. We get a bunch of SIGSEGVs and cripplingly high CPU/memory usage until everything dies. I opened a support ticket (1984876) that I wanted to highlight here in case it's related. (It might be a separate issue, but this is the only issue we found that seemed semi-related.)

sanchda · 2025-01-08T14:40:31Z

@apenney following up with our support organization for the circumstantial details in your ticket. Will respond with top priority.

sanchda · 2025-01-08T14:45:29Z

@apenney I'm not sure yet whether your report is related to this one. I'm going to try to override our support processes and will iterate through there (it's a lot easier for me to review customer environments in the context of a support ticket than to ask for some kind of painful back-and-forth over GitHub Issues).

Note that my fix here DID introduce a performance regression, which has been fixed and backported by @nsrip-dd, but it has yet to land in a release.

sanchda · 2025-01-08T17:52:26Z

@apenney still investigating this. Let me break down where we're at.

1. The crashes I see don't appear to be related to this ticket. Maybe they were at one point, or maybe I'm missing them (sorry!), but the memalloc issues described by the OP no longer appear to be relevant.
2. However, there are a number of crashes from other components of dd-trace-py. I'm coordinating with the appropriate engineers to gain some insight into things.
3. The SIGILL trap you posted is problematic, and we don't have automatic detection for SIGILL just yet. I'm somewhat hoping that these issues are resolved by addressing the problems in category 2. If not, I'm also proposing that we upgrade our crash analysis infrastructure simultaneously.

Anyway @apenney, in terms of timeline, here's what you can expect.

1. I'll back off on posting updates on this issue in this thread, unless you have something tactical to share or if our other lines of communication don't sync up within some appropriate amount of time.
2. Ownership of your ticket will transition to a different part of our org (not me, probably).
3. Focus is on understanding the crashes. Unfortunately, it's really hard to pinpoint overhead until we have crashes sorted out. If you have evidence for overhead being a totally orthogonal issue, please share your findings in a ticket and we might be able to divide-and-conquer things more effectively on this end.

taegyunkim added the Profiling Continous Profling label Dec 17, 2024

sanchda self-assigned this Dec 17, 2024
