test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

Open
alexanderlaw opened this issue Dec 17, 2024 · 1 comment
Labels: a/test/flaky (Area: related to flaky tests), c/compute (Component: compute, excluding postgres itself), t/bug (Issue Type: Bug)


Numerous test_physical_replication_config_mismatch_too_many_known_xids failures (mostly on the relatively slow aarch64, though the last one is from x64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10153/12340853499/index.html#testresult/d2d742bbd5eafbc2/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10150/12324065238/index.html#/testresult/62cbf9b86ee197f6
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10139/12356033335/index.html#/testresult/ba041a552edad115
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9976/12347310946/index.html#/testresult/bec99e17e8048588
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10131/12322904295/index.html#/testresult/110ba3a81c35c6d7
....
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10126/12322979734/index.html#/testresult/e8a5900970131270
with the error:
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 19647 failed: FATAL: sorry, too many clients already

show that the test can't survive a slow backend shutdown.

I could reproduce this locally, and with verbose logging, I see:

PG:2024-12-16 19:24:15.009 GMT [postgres][1826200:17][client backend] [[unknown]] LOG:  statement: SHOW neon.max_file_cache_size
PG:2024-12-16 19:24:15.009 GMT [postgres][1826091:219][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.010 GMT [postgres][1826091:220][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:221][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] Swap file cache page
PG:2024-12-16 19:24:15.011 GMT [][1826021:134][postmaster] DEBUG:  forked new backend, pid=1826217 socket=9
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:222][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:18][client backend] [[unknown]] DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [[unknown]][1826217:1][not initialized] [[unknown]] LOG:  connection received: host=127.0.0.1 port=49494
PG:2024-12-16 19:24:15.011 GMT [postgres][1826217:2][client backend] [[unknown]] FATAL:  sorry, too many clients already
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:19][client backend] [[unknown]] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:20][client backend] [[unknown]] DEBUG:  proc_exit(0): 2 callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:21][client backend] [[unknown]] LOG:  disconnection: session time: 0:00:00.031 user=cloud_admin database=postgres host=127.0.0.1 port=49488
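
To spot the same pattern in a longer log, the excerpt above can be scanned mechanically. The following is a rough sketch only: the function name `find_rejections_during_exit` and the regular expression are illustrative, not part of the Neon test suite, and the "exiting" heuristic (a backend counts as exiting from its first `shmem_exit`/`proc_exit` DEBUG message until its `disconnection` message) is an assumption that requires DEBUG-level logging.

```python
import re

# Prefix format as seen in the excerpt:
# PG:<date> <time> GMT [<db>][<pid>:<n>][<backend type>] [<app>] <LEVEL>:  <message>
# The [<app>] field is absent on postmaster lines, so it is optional here.
LINE_RE = re.compile(
    r"^PG:\S+ \S+ GMT \[.*?\]\[(?P<pid>\d+):\d+\]\[[^\]]*\]"
    r"(?: \[.*?\])? (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)

def find_rejections_during_exit(lines):
    """Return (rejected_pid, [pids still exiting]) for each rejected connection."""
    exiting = set()   # backends that started exit callbacks but have not logged disconnection
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a Postgres log line in the expected format
        pid, msg = int(m["pid"]), m["msg"]
        if msg.startswith(("shmem_exit(", "proc_exit(")):
            exiting.add(pid)
        elif msg.startswith("disconnection:"):
            exiting.discard(pid)
        elif m["level"] == "FATAL" and "too many clients already" in msg:
            hits.append((pid, sorted(exiting)))
    return hits

# Lines copied from the excerpt above.
SAMPLE = """
PG:2024-12-16 19:24:15.011 GMT [][1826021:134][postmaster] DEBUG:  forked new backend, pid=1826217 socket=9
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:18][client backend] [[unknown]] DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [[unknown]][1826217:1][not initialized] [[unknown]] LOG:  connection received: host=127.0.0.1 port=49494
PG:2024-12-16 19:24:15.011 GMT [postgres][1826217:2][client backend] [[unknown]] FATAL:  sorry, too many clients already
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:21][client backend] [[unknown]] LOG:  disconnection: session time: 0:00:00.031 user=cloud_admin database=postgres host=127.0.0.1 port=49488
""".strip().splitlines()

if __name__ == "__main__":
    for rejected, still_exiting in find_rejections_during_exit(SAMPLE):
        print(f"backend {rejected} rejected while {still_exiting} were still exiting")
```

On the sample lines this reports backend 1826217 as rejected while backend 1826200 was still running its exit callbacks.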

This test sets max_connections=2, but one connection is permanently occupied by compute_ctl:activity_monitor, and the other can still be occupied by a backend (1826200 in this case) that has just executed a psql query, such as the ones issued in Endpoint.start():

        self.log_config_value("shared_buffers")
        self.log_config_value("neon.max_file_cache_size")
        self.log_config_value("neon.file_cache_size_limit")
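
One conceivable way to make a caller tolerant of the previous backend still holding its slot is to retry the connection for a short time. This is only a sketch under assumptions, not the fix adopted in the project: `connect_with_retry` and its parameters are hypothetical names, and the helper takes any zero-argument `connect` callable so that it does not depend on a particular driver.

```python
import time

def connect_with_retry(connect, attempts=5, delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    Only errors whose text contains "too many clients already" are treated
    as transient and retried; any other error is re-raised immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:
            transient = "too many clients already" in str(exc)
            if not transient or attempt == attempts:
                raise
            # Give the exiting backend time to release its connection slot.
            sleep(delay)
```

With psycopg2 the call would look something like `connect_with_retry(lambda: psycopg2.connect(dsn))`, where `dsn` is a placeholder; the text of the `OperationalError` shown in the report contains the server's "too many clients already" message, which is what the helper checks for.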

The issue can be reproduced easily by adding this sleep to proc_exit():

@@ -107,6 +107,7 @@ proc_exit(int code)
     if (MyProcPid != (int) getpid())
         elog(PANIC, "proc_exit() called in child process");
 
+    pg_usleep(100000);
     /* Clean up everything that must be cleaned up */
     proc_exit_prepare(code);

alexanderlaw added the t/bug (Issue Type: Bug) label Dec 17, 2024
github-actions bot added the external (A PR or Issue is created by an external user) label Dec 17, 2024
jcsp removed the external (A PR or Issue is created by an external user) label Jan 6, 2025
ololobus added the c/compute (Component: compute, excluding postgres itself) and a/test/flaky (Area: related to flaky tests) labels Jan 7, 2025


ololobus commented Jan 7, 2025

@knizhnik will have a look
