KAFKA-18073: Prevent dropped records from failed retriable exceptions #18146

twthorn · 2024-12-11T18:42:53Z

If any operation on a record (e.g., converting the record fails because of a connectivity issue with Schema Registry, or a transformation fails) results in a retriable exception that exhausts its retry configuration (the default is no retries), the record is still dropped, even with errors.tolerance set to none (the default). This causes silent, unexpected data loss by default for any Kafka Connect source or sink connector. We should prevent this: when a retriable exception can no longer be retried, fail loudly and do not drop the record.

Changes

To fix this for both source and sink connectors, we fix the logic in RetryWithToleranceOperator. Specifically, if a RetriableException can no longer be retried, we either return null (skip the record) or raise an exception, depending on the error tolerance.
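
The decision described above can be sketched as follows. This is a minimal illustration, not the actual RetryWithToleranceOperator code: the class `ExhaustedRetrySketch`, its `Tolerance` enum, the local `RetriableException` stand-in, and the method `onRetriesExhausted` are all hypothetical names introduced for this example.

```java
/** Hypothetical, simplified sketch; not the real Kafka Connect class. */
final class ExhaustedRetrySketch {
    /** Mirrors the two values of errors.tolerance. */
    enum Tolerance { NONE, ALL }

    /** Local stand-in for Kafka Connect's RetriableException (assumption for this sketch). */
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    /** Invoked once a retriable failure can no longer be retried. */
    static <V> V onRetriesExhausted(RetriableException error, Tolerance tolerance) {
        if (tolerance == Tolerance.ALL) {
            return null; // tolerated: the caller skips this record
        }
        throw error; // tolerance is NONE: surface the failure instead of dropping the record
    }
}
```

With `Tolerance.ALL` the caller receives null and skips the record; with `Tolerance.NONE` the original exception propagates, so the task fails rather than silently losing data.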

Testing

In AbstractWorkerSourceTaskTest, add several new tests for the different logical branches, depending on errors.tolerance and whether convert or transform succeeds or fails. We provide a way to avoid mocking the transformationChain so that the underlying retryWithToleranceOperator is used. Although this expands the scope of the logic tested, it is intentional and beneficial, as these mocks prevented this critical bug from being discovered earlier.

Also add similar tests in the worker sink task tests.

Also add tests in RetryWithToleranceOperatorTest for some additional cases of errors.tolerance none vs. all. Refactor some unclear naming.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

twthorn · 2024-12-11T18:55:37Z

@chia7712 Could you take a look at this when you get the chance? I saw you on the past couple PRs for this file. This bug causes data loss by default for all Kafka Connect source connectors. Thanks for the help!

gharris1727 · 2024-12-11T19:54:51Z

connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java

@@ -396,12 +396,25 @@ boolean sendRecords() {
        for (final SourceRecord preTransformRecord : toSend) {
            ProcessingContext<SourceRecord> context = new ProcessingContext<>(preTransformRecord);
            final SourceRecord record = transformationChain.apply(context, preTransformRecord);
+            // If the result of a transformation is null, then the record should be filtered/skipped & there was no error
+            if (record == null) {


The "retriable exception causes data loss" also applies to the transformations. If the transformation chain gives up retrying, it will return null with context.failed().

I think you can let the null pass through convertTransformedRecord, and then have separate null and context.failed() checks there that cover all of the transformation and conversion steps.
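
A rough sketch of the separate checks suggested here, assuming a simplified context that only tracks a failed flag. `RecordOutcomeSketch`, its `Context` class, and `classify` are hypothetical names for illustration and do not reflect the real ProcessingContext API beyond the `failed()` check mentioned above.

```java
/** Hypothetical sketch of telling a filtered record apart from a failed one. */
final class RecordOutcomeSketch {
    enum Outcome { SEND, FILTERED, FAILED }

    /** Simplified stand-in for a processing context; only records whether a stage failed. */
    static final class Context {
        private boolean failed;

        void markFailed() {
            failed = true;
        }

        boolean failed() {
            return failed;
        }
    }

    /** Classifies the result after all transformation and conversion steps have run. */
    static Outcome classify(Context context, Object processedRecord) {
        if (context.failed()) {
            return Outcome.FAILED; // a stage gave up: this is an error, not a filter
        }
        if (processedRecord == null) {
            return Outcome.FILTERED; // null without failure: intentionally dropped
        }
        return Outcome.SEND;
    }
}
```

Checking the failure flag before the null check matters: a null result alone cannot distinguish a deliberate filter from a stage that gave up.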

Nice catch, updated

gharris1727

I'm thinking now that this fix is in the "wrong place", it's in the WorkerSourceTask, when it's also a problem on the sink task side.

I think the problem is really inside of the RetryWithToleranceOperator, and it shouldn't return null after a RetriableException when it's not within tolerance limits; it should throw an exception like other non-retriable exceptions.

twthorn · 2024-12-11T23:55:22Z

Good point, will update

twthorn · 2024-12-13T16:46:04Z

Hi @gharris1727 Can you take another look when you get the chance? Many thanks

twthorn · 2024-12-19T22:17:08Z

@gharris1727 gentle nudge on this, added tests to sink as well, all tests are passing. Please let me know if there's anything else I can do.

twthorn · 2025-01-07T20:20:41Z

Hi @gharris1727 do you think you could take another look? Thank you again for the help.

gharris1727

Thanks @twthorn for your patience, I was celebrating the winter holidays and then got sick afterwards.

gharris1727 · 2025-01-08T20:45:26Z

...untime/src/main/java/org/apache/kafka/connect/runtime/errors/RetryWithToleranceOperator.java

                    context.error(e);
-                    return null;
+                    markAsFailed();
+                    if (withinToleranceLimits()) {
+                        return null;
+                    } else {
+                        throw new ConnectException("Exceeded deadline & tolerance for retriable exception", e);
+                    }


This duplicates (and double-wraps) the exception, once with "Tolerance exceeded in error handler" and once with "Exceeded deadline & tolerance for retriable exception".

I think we can reuse the existing handling for non-retriable exceptions in #execAndHandleError by just rethrowing e right after the trace message.
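
To illustrate the single-wrap idea, here is a simplified sketch under stated assumptions: the inner retry loop rethrows the original exception, and only the outer handler tolerates or wraps it. The method names echo `execAndRetry` and `execAndHandleError`, but the bodies, the count-based retry budget, the boolean tolerance flag, and the local exception classes are assumptions of this example, not the real implementation.

```java
import java.util.function.Supplier;

/** Hypothetical two-layer sketch; not the real RetryWithToleranceOperator. */
final class SingleWrapSketch {
    /** Local stand-ins for the Kafka Connect exception types (assumptions for this sketch). */
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    static class ConnectException extends RuntimeException {
        ConnectException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Inner layer: retries, then rethrows the original exception without wrapping it. */
    static <V> V execAndRetry(Supplier<V> operation, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return operation.get();
            } catch (RetriableException e) {
                if (attempt >= maxRetries) {
                    throw e; // retry budget used up: let the outer layer decide
                }
            }
        }
    }

    /** Outer layer: the only place where a failure is tolerated or wrapped. */
    static <V> V execAndHandleError(Supplier<V> operation, int maxRetries, boolean tolerateErrors) {
        try {
            return execAndRetry(operation, maxRetries);
        } catch (RuntimeException e) {
            if (tolerateErrors) {
                return null; // within tolerance: skip the record
            }
            throw new ConnectException("Tolerance exceeded in error handler", e);
        }
    }
}
```

Because the inner layer never wraps, the exception that reaches the caller has exactly one wrapper, and its direct cause is the original retriable exception.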

gharris1727 · 2025-01-08T20:54:22Z

connect/runtime/src/test/java/org/apache/kafka/connect/runtime/WorkerTestUtils.java

@@ -155,4 +165,29 @@ public static void assertAssignment(boolean expectFailed,
        assertEquals(expectedDelay, assignment.delay(),
                "Wrong rebalance delay in " + assignment);
    }
+
+    public static TransformationChain getTransformationChain(RetryWithToleranceOperator toleranceOperator, List<Object> results) {


Could you add generic arguments to TransformationChain and RetryWithToleranceOperator in this method and callers?

gharris1727 · 2025-01-08T21:04:33Z

...untime/src/main/java/org/apache/kafka/connect/runtime/errors/RetryWithToleranceOperator.java

+                        return null;
+                    } else {
+                        throw new ConnectException("Exceeded deadline & tolerance for retriable exception", e);
+                    }
                }
                if (stopping) {
                    log.trace("Shutdown has been scheduled. Marking operation as failed.");


This data loss scenario is making me think about this stopping flag, and whether it could cause data loss. Running out of retries and stopping retries due to a shutdown should probably behave similarly, and probably shouldn't skip the record.

But stopping is only set by #triggerStop/WorkerTask#cancel, which is a hard-shutdown operation after no further data is expected from the task, and no offsets are being committed.

I think we can probably leave this in-place, and if we ever change task shutdown we can address it then.

Agreed. In both source/sinks the worker doesn't commit the offsets if it's been cancelled. So data loss should not be possible because of that conditional check. Will leave as is.

gharris1727

LGTM, thanks so much for reporting and fixing this @twthorn!

KAFKA-18073 Prevent dropped records when conversions fail & errors.tolerance is none

fe6a055

github-actions bot added triage PRs from the community connect labels Dec 11, 2024

twthorn changed the title ~~KAFKA-18073 Prevent dropped records when conversions fail & errors.tolerance is none~~ KAFKA-18073: Prevent dropped records when conversions fail & errors.tolerance is none Dec 11, 2024

mumrah added the ci-approved label Dec 11, 2024

gharris1727 reviewed Dec 11, 2024

KAFKA-18073 Prevent dropped records from failed transformations

b71c788

twthorn requested a review from gharris1727 December 11, 2024 22:45

gharris1727 reviewed Dec 11, 2024

KAFKA-18073 Prevent RetryWithToleranceOperator from dropping records on failed retriable exceptions

35d24f8

twthorn requested a review from gharris1727 December 12, 2024 19:32

twthorn changed the title ~~KAFKA-18073: Prevent dropped records when conversions fail & errors.tolerance is none~~ KAFKA-18073: Prevent dropped records from failed retriable exceptions Dec 12, 2024

twthorn added 2 commits December 18, 2024 14:46

KAFKA-18073 Add worker sink task tests, refactor helpers to worker test utils

7221321

Merge branch 'trunk' into KAFKA-18073

97cbad2

github-actions bot removed the triage PRs from the community label Dec 19, 2024

twthorn added 2 commits December 19, 2024 11:15

Empty commit to rerun tests

fbae415

Merge branch 'trunk' into KAFKA-18073

1931d11

gharris1727 reviewed Jan 8, 2025

twthorn added 2 commits January 8, 2025 18:11

KAFKA-18073 Add generic args to utils method, avoid rewrap exception

e231ce6

KAFKA-18073 Remove redundant code from execAndRetry

700be75

twthorn requested a review from gharris1727 January 9, 2025 17:18

gharris1727 approved these changes Jan 9, 2025

gharris1727 merged commit b35c294 into apache:trunk Jan 9, 2025
9 checks passed

gharris1727 pushed a commit that referenced this pull request Jan 9, 2025

KAFKA-18073: Prevent dropped records from failed retriable exceptions (#18146) Reviewers: Greg Harris <[email protected]>

fff17fe
