Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix leaked zombie processes when closing surfaces #4605

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

goonzoid
Copy link
Member

@goonzoid goonzoid commented Jan 5, 2025

Full discussion in #4554, which this fixes.

I don't love the way the killCommand function ends up looking here, but I believe it to be functionally correct.

I think we would really benefit from reworking and simplifying much of this subprocess handling code, but it's quite involved and there is of course risk that we re-introduce these kinds of bugs again. I may have a play with it at some point.

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

I checked out your branch on my Ubuntu 24.04 and built it (zig build -Doptimize=ReleaseFast). It's not working very well though. When I press ctrl-d in a tab, the ghostty window hangs and I have to press "Force Quit" in the popup that appears after a few seconds.

Log output after pressing ctrl-d:

Destroying EGL context
warning(io_exec): error getting pgid for kill

image

@goonzoid
Copy link
Member Author

goonzoid commented Jan 5, 2025

Yikes! That’s no good. I think I know what the issue is though. Thanks for checking it out!

I’ll try to get a Linux VM set up soon so that I can fix it correctly for both platforms.

@goonzoid
Copy link
Member Author

goonzoid commented Jan 5, 2025

Okay, thanks to @slovdahl's feedback and a fresh look at the code, I realized there's a much simpler fix available, which I've just pushed.

I think there's a higher than average risk of future bugs here with any slight changes in this area, so I still think it may be worth a rethink of much of subprocess management code. That said, this PR should fix at least some (possibly all) zombie leak cases without introducing any risk of us getting stuck in a loop and hanging.

Note that I still haven't had a chance to test on Linux, so someone should definitely do that before we merge.

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

I'm afraid the ghostty process still hangs when pressing ctrl-d on my Ubuntu 24.04 at least. 🤔

@goonzoid goonzoid mentioned this pull request Jan 5, 2025
@goonzoid
Copy link
Member Author

goonzoid commented Jan 5, 2025

Hmm, that's surprising. Is it possible that you're still building the previous version of my branch? Note that I did a force-push in order to update this PR, so you may need to do a hard reset or similar (a simple pull likely won't do it).

@jorismak
Copy link

jorismak commented Jan 5, 2025

Doing a first pull of your repo/branch and build now, let's see.
But in the issue / discussion, there was talk of the same happening on MacOS. Could be a complete different issue of course, but maybe it's not a Linux specific thing...

@jorismak
Copy link

jorismak commented Jan 5, 2025

info: ghostty version=1.0.2-HEAD+d58b4353
info: ghostty build optimize=ReleaseFast
info: runtime=apprt.Runtime.gtk
info: font_backend=font.main.Backend.fontconfig_freetype
info: dependency harfbuzz=8.4.0
info: dependency fontconfig=21402
info: renderer=renderer.OpenGL
info: libxev backend=main.Backend.io_uring

When exiting the shell with ctrl+d or exit still this and then nothing:

info(io_exec): started subcommand path=/bin/sh pid=46244
info(io_exec): subcommand cgroup=/user.slice/user-1000.slice/[email protected]/app.slice/app-ghostty-transient-46203.scope/surfaces/6802C70.scope
info(grid): reallocating GPU buffer old=0 new=8
warning(stream): unimplemented OSC callback: terminal.osc.Command{ .end_of_command = terminal.osc.Command.Command__struct_94772{ .exit_code = null } }
warning(stream): unimplemented OSC command: end_of_command
Using integer scale 2 for EGL window (1384 930 => 2768 1860)
Destroying EGL context

Tried again with gtk-single-instance forced to false, but still the same.

And with closing the window through Gnome / window manager:

info(io_exec): started subcommand path=/bin/sh pid=46855
info(io_exec): subcommand cgroup=-
info(grid): reallocating GPU buffer old=0 new=8
warning(stream): unimplemented OSC callback: terminal.osc.Command{ .end_of_command = terminal.osc.Command.Command__struct_94772{ .exit_code = null } }
warning(stream): unimplemented OSC command: end_of_command
Using integer scale 2 for EGL window (1384 930 => 2768 1860)
Destroying EGL context
Destroying EGL context
info(io_exec): read thread got quit signal
info(surface): surface closed addr=2ace9060

During the gtk2 era when I used glib, it was also tricky to get threads to nicely quit. Resulted in 'the thread must always exit itself', you can't force-abort it from the main thread, or it will stay behind.

But, as I said, that was ages ago, and I hope that modern Zig doesn't have the same issues as an ageing glib version :).

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

Hmm, that's surprising. Is it possible that you're still building the previous version of my branch? Note that I did a force-push in order to update this PR, so you may need to do a hard reset or similar (a simple pull likely won't do it).

Yeah, I deleted the local branch before fetching your branch again. Re-did it once more and verified I have d58b435 locally as my HEAD.

image

Noticed something interesting now though. This is ctrl-d with more than one tab open (which still hangs):

image

But, if I press ctrl-d with only one tab open, the window is closed as expected. The process keeps running in the background though, and nothing else than Destroying EGL context is logged.

@jorismak
Copy link

jorismak commented Jan 5, 2025

Never tested more tabs, so I don't know the 'original' 1.0.0 / 1.0.1 behaviour. But this branch indeed 'hangs' / stalls when opening multiple tabs and closing one with ctrl+d. Funnily, if I 'split right' for example, it seems to close OK with ctrl+d.

I don't know if it's related. But if I have an alacritty terminal running with top inside it, I see very high CPU idle with 0% wait. The moment I start a ghostty shell, the waits reach up to just over 10% (and thus my idle CPU goes down, while I don't have the feeling it's doing anything more).

If I then close the ghostty window with ctrl+d, the waits keep showing. If I type killall ghostty in another alacritty terminal, the waits instantly drop back to 0% again.

image

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

I see very high CPU idle with 0% wait. The moment I start a ghostty shell, the waits reach up to just over 10% (and thus my idle CPU goes down, while I don't have the feeling it's doing anything more).

Likely unrelated: #3224

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

Never tested more tabs, so I don't know the 'original' 1.0.0 / 1.0.1 behaviour. But this branch indeed 'hangs' / stalls when opening multiple tabs and closing one with ctrl+d. Funnily, if I 'split right' for example, it seems to close OK with ctrl+d.

FWIW, windows close normally with ctrl-d for me on latest main.

@jorismak
Copy link

jorismak commented Jan 5, 2025

Never tested more tabs, so I don't know the 'original' 1.0.0 / 1.0.1 behaviour. But this branch indeed 'hangs' / stalls when opening multiple tabs and closing one with ctrl+d. Funnily, if I 'split right' for example, it seems to close OK with ctrl+d.

FWIW, windows close normally with ctrl-d for me on latest main.

Yeah, for me as well. The problem was never with windows not closing. The problem is with the ghostty process not exiting when the shell decides to quit.

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

Built myself a debug build. All cases with only a single tab.

On main (but using epoll instead of io_uring) when I press ctrl-d and the window is closed but the process remains:

debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(io_thread): mailbox message=termio.message.Message{ .write_small = termio.message.MessageData(u8,38).Small{ .data = { 4, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170 }, .len = 1 } }
debug(io_exec): child process exited status=127 runtime=21997ms
debug(app): mailbox message=surface_message
debug(gtk_surface): gl surface unrealized
debug(opengl): program destroyed id=165
debug(opengl): program destroyed id=162
debug(renderer_thread): mailbox message=renderer.message.Message{ .reset_cursor_blink = void }
Destroying EGL context
debug(app): focus event focused=false
debug(io_exec): termios change mode=pty.Mode{ .canonical = true, .echo = true }
debug(renderer_thread): mailbox message=renderer.message.Message{ .focus = false }
debug(io_thread): mailbox message=termio.message.Message{ .focused = false }
debug(gtk): window destroy
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface

Pressing the X button in the top right corner to close the window:

debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(gtk): window close request
debug(app): focus event focused=false
debug(gtk_surface): gl surface unrealized
debug(opengl): program destroyed id=165
debug(opengl): program destroyed id=162
Destroying EGL context
Destroying EGL context
debug(renderer_thread): mailbox message=renderer.message.Message{ .focus = false }
debug(io_thread): mailbox message=termio.message.Message{ .focused = false }
debug(gtk): tab box destroy
debug(gtk_surface): gl destroy
debug(renderer_thread): starting renderer thread shutdown
debug(renderer_thread): renderer thread exited
debug(io_thread): starting IO thread shutdown
debug(io_exec): process group killed pgid=576934
debug(io_exec): waitpid result=0
debug(io_exec): process group killed pgid=576934
debug(io_exec): waitpid result=576934
info(io_exec): read thread got quit signal
debug(io_thread): IO thread exited
info(surface): surface closed addr=7e5e51a62000
debug(gtk): window destroy
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface

This PR when pressing ctrl-d:

debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(io_thread): mailbox message=termio.message.Message{ .write_small = termio.message.MessageData(u8,38).Small{ .data = { 4, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170, 170 }, .len = 1 } }
debug(io_exec): child process exited status=0 runtime=14748ms
debug(app): mailbox message=surface_message
debug(renderer_thread): mailbox message=renderer.message.Message{ .reset_cursor_blink = void }
debug(gtk_surface): gl surface unrealized
debug(opengl): program destroyed id=165
debug(opengl): program destroyed id=162
debug(io_exec): termios change mode=pty.Mode{ .canonical = true, .echo = true }
Destroying EGL context
debug(app): focus event focused=false
debug(renderer_thread): mailbox message=renderer.message.Message{ .focus = false }
debug(io_thread): mailbox message=termio.message.Message{ .focused = false }
debug(gtk): window destroy
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface

This PR when pressing the X button in the top right corner:

debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface
debug(gtk): window close request
debug(app): focus event focused=false
debug(gtk_surface): gl surface unrealized
debug(opengl): program destroyed id=165
debug(opengl): program destroyed id=162
Destroying EGL context
Destroying EGL context
debug(io_thread): mailbox message=termio.message.Message{ .focused = false }
debug(renderer_thread): mailbox message=renderer.message.Message{ .focus = false }
debug(gtk): tab box destroy
debug(gtk_surface): gl destroy
debug(renderer_thread): starting renderer thread shutdown
debug(renderer_thread): renderer thread exited
debug(io_thread): starting IO thread shutdown
debug(io_exec): process group killed pgid=580075
debug(io_exec): waitpid result=0
debug(io_exec): process group killed pgid=580075
debug(io_exec): waitpid result=580075
info(io_exec): read thread got quit signal
debug(io_thread): IO thread exited
info(surface): surface closed addr=73b675df0000
debug(gtk): window destroy
debug(app): mailbox message=redraw_surface
debug(app): mailbox message=redraw_surface

@goonzoid
Copy link
Member Author

goonzoid commented Jan 5, 2025

Thanks for all the info folks. Clearly I need to get a Linux test setup before proceeding much further. This all works well on MacOS (at least for me). I have some theories about what might be causing the stuck process, but don't want to keep guessing when I can't test it out locally.

@slovdahl
Copy link
Contributor

slovdahl commented Jan 5, 2025

I tried to debut the GTK side of this a little bit. First time I'm touching both Zig and GTK, so bear with me.. 😁

I'm not sure if it's the problem, but one thing that I find a bit strange is that App.deleteSurface is never called when the window is closed after pressing ctrl-d with only a single tab open.. The only place the method is called from is errdefer self.app.core_app.deleteSurface(self); in the realize method in apprt/gtk/Surface.zig. AFAICT, this means that quit_timer is never called.

cc'ing @jcollie as the original author of the quit_timer code paths.

@goonzoid
Copy link
Member Author

goonzoid commented Jan 8, 2025

Small progress here: managed to get a linux environment stood up today, and can reproduce the behavior exactly as described above. Ran out of time to debug but should be able to get to that soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants