test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

alexanderlaw · 2024-12-17T05:00:00Z

Numerous test_physical_replication_config_mismatch_too_many_known_xids failures (occurred mostly on relatively slow aarch64, but the last one is from x64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10153/12340853499/index.html#testresult/d2d742bbd5eafbc2/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10150/12324065238/index.html#/testresult/62cbf9b86ee197f6
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10139/12356033335/index.html#/testresult/ba041a552edad115
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9976/12347310946/index.html#/testresult/bec99e17e8048588
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10131/12322904295/index.html#/testresult/110ba3a81c35c6d7
....
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10126/12322979734/index.html#/testresult/e8a5900970131270
with the error:
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 19647 failed: FATAL: sorry, too many clients already

show that the test can't survive a slow backend shutdown.

I could reproduce this locally, and with verbose logging, I see:

PG:2024-12-16 19:24:15.009 GMT [postgres][1826200:17][client backend] [[unknown]] LOG:  statement: SHOW neon.max_file_cache_size
PG:2024-12-16 19:24:15.009 GMT [postgres][1826091:219][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.010 GMT [postgres][1826091:220][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:221][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] Swap file cache page
PG:2024-12-16 19:24:15.011 GMT [][1826021:134][postmaster] DEBUG:  forked new backend, pid=1826217 socket=9
PG:2024-12-16 19:24:15.011 GMT [postgres][1826091:222][client backend] [compute_ctl:activity_monitor] DEBUG:  [NEON_SMGR] neon_get_request_lsns request lsn 0/14F91B0, not_modified_since 0/14F9178
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:18][client backend] [[unknown]] DEBUG:  shmem_exit(0): 4 before_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [[unknown]][1826217:1][not initialized] [[unknown]] LOG:  connection received: host=127.0.0.1 port=49494
PG:2024-12-16 19:24:15.011 GMT [postgres][1826217:2][client backend] [[unknown]] FATAL:  sorry, too many clients already
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:19][client backend] [[unknown]] DEBUG:  shmem_exit(0): 6 on_shmem_exit callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:20][client backend] [[unknown]] DEBUG:  proc_exit(0): 2 callbacks to make
PG:2024-12-16 19:24:15.011 GMT [postgres][1826200:21][client backend] [[unknown]] LOG:  disconnection: session time: 0:00:00.031 user=cloud_admin database=postgres host=127.0.0.1 port=49488

This test sets max_connections=2, but one connection is permanently occupied by compute_ctl:activity_monitor and another one can be still occupied by a backend (1826200, in this case), that just executed psql query, like in Endpoint.start():

        self.log_config_value("shared_buffers")
        self.log_config_value("neon.max_file_cache_size")
        self.log_config_value("neon.file_cache_size_limit")

The issue can be reproduced easily with this sleep:

@@ -107,6 +107,7 @@ proc_exit(int code)
     if (MyProcPid != (int) getpid())
         elog(PANIC, "proc_exit() called in child process");
 
+pg_usleep(100000);
     /* Clean up everything that must be cleaned up */
     proc_exit_prepare(code);

The text was updated successfully, but these errors were encountered:

ololobus · 2025-01-07T16:39:41Z

@knizhnik will have a look

alexanderlaw added the t/bug Issue Type: Bug label Dec 17, 2024

github-actions bot added the external A PR or Issue is created by an external user label Dec 17, 2024

jcsp removed the external A PR or Issue is created by an external user label Jan 6, 2025

ololobus added c/compute Component: compute, excluding postgres itself a/test/flaky Area: related to flaky tests labels Jan 7, 2025

ololobus assigned knizhnik Jan 7, 2025

knizhnik mentioned this issue Jan 8, 2025

Increase max connection for replica to prevent test flukyness #10306

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

alexanderlaw commented Dec 17, 2024

ololobus commented Jan 7, 2025

test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167

Comments

alexanderlaw commented Dec 17, 2024

ololobus commented Jan 7, 2025