test_physical_replication_config_mismatch_too_many_known_xids is failing on slow backend exit #10167
Labels
a/test/flaky
Area: related to flaky tests
c/compute
Component: compute, excluding postgres itself
t/bug
Issue Type: Bug
Numerous test_physical_replication_config_mismatch_too_many_known_xids failures (mostly on the relatively slow aarch64, but the last one is from x64):
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10153/12340853499/index.html#testresult/d2d742bbd5eafbc2/
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10150/12324065238/index.html#/testresult/62cbf9b86ee197f6
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10139/12356033335/index.html#/testresult/ba041a552edad115
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9976/12347310946/index.html#/testresult/bec99e17e8048588
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10131/12322904295/index.html#/testresult/110ba3a81c35c6d7
....
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-10126/12322979734/index.html#/testresult/e8a5900970131270
with the error:
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 19647 failed: FATAL: sorry, too many clients already
show that the test can't survive a slow backend shutdown.
I could reproduce this locally, and with verbose logging, I see:
This test sets max_connections=2, but one connection is permanently occupied by compute_ctl:activity_monitor, and the other can still be held by a backend (1826200 in this case) that has just executed a psql query and has not exited yet, as in Endpoint.start():
The issue can be reproduced easily with this sleep:
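Independently of the reproduction, one way a test could tolerate the slow backend exit is to retry the connection while the previous backend still holds its slot. This is only a sketch, not the fix adopted in the repository: `connect_with_retry` and its `attempts`, `delay`, and `sleep` parameters are hypothetical names, and matching on the "too many clients" text in the error message is an assumption based on the error quoted above.

```python
import time


def connect_with_retry(connect, attempts=10, delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    Hypothetical helper: `connect` is any zero-argument callable that
    returns a connection or raises. Only errors whose message contains
    "too many clients" are retried; anything else is re-raised at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:
            # Assumption: a slot still held by an exiting backend shows up
            # as "sorry, too many clients already" in the error text.
            if "too many clients" not in str(exc):
                raise
            last_error = exc
            # Give the old backend time to exit before the next attempt.
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error
```

In a test this might be called as `connect_with_retry(lambda: psycopg2.connect(dsn))`, where `dsn` is the endpoint's connection string. Bounding the number of attempts keeps a real connection leak visible as a failure instead of hiding it behind an endless wait.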