Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discard and Retry buttons not working for jobs that use retry_on #78

Closed
andbar opened this issue Feb 21, 2024 · 3 comments · Fixed by #85
Closed

Discard and Retry buttons not working for jobs that use retry_on #78

andbar opened this issue Feb 21, 2024 · 3 comments · Fixed by #85

Comments

@andbar
Copy link
Contributor

andbar commented Feb 21, 2024

We're experiencing an issue where jobs that have a retry_on will give the error Job with id '#####################' not found when we click the Discard or Retry button for that failed job. This does not happen when we click the button for jobs that failed on the first attempt and do not have a retry_on.

I think the root issue is this:

When a job has retry_on, if it fails, the solid_queue_jobs entry for that job gets a finished_at timestamp and the retry creates a new solid_queue_jobs entry with a new id but the same active_job_id. That will continue for the number of retry_on attempts. The final job that fails and does not queue another retry will have finished_at: nil and it will have a related solid_queue_failed_executions entry created with a job_id that matches. This is all as expected.

Example: if retry_on is set to 5 attempts, the database will end up with 5 solid_queue_jobs rows that match that same active_job_id. The first 4 will have finished_at timestamps, the last one will not, and the last one will link to a solid_queue_failed_executions database row (but none of the first 4 do). This is the behavior we are seeing and I'm assuming that's working as expected.

The issue comes in when mission control tries to retrieve the failed job when you click the Discard or Retry button. It is using the active_job_id; in tracing the code, it appears to end up at line 172 of the active_job/queue_adapters/solid_queue_ext.rb file (https://github.com/basecamp/mission_control-jobs/blob/c14ab260b493195cfa99863e8a939882605e8ed6/lib/active_job/queue_adapters/solid_queue_ext.rb#L172).

There it is attempting to find the solid queue job by active_job_id, but there are multiple rows that match in this scenario due to the retry_on. The job that is returned in that query is the first one that matches (the oldest one), which does not have a matching solid_queue_failed_executions row (it is the first failed attempt). When the check on line 173 happens (matches_relation_filters?(job)), which checks for a matched status on line 281 (which is the failed status), it returns false due to that particular job not having a failed_execution, because it was not the most recent failed attempt but rather the first one. So then it returns the error that it cannot find the job.

It seems that in this scenario, when there is more than one row in the solid_queue_jobs table that match that active_job_id, if it is looking for a failed job it needs to find the most recent one with finished_at: nil which will then have a corresponding failed_execution. Or perhaps it should find the job by the id (the solid_queue_jobs id), rather than by the active_job_id?

@rosa
Copy link
Member

rosa commented Feb 22, 2024

Ahhh, totally @andbar! I encountered the same thing in a different scenario and fixed it on https://github.com/basecamp/mission_control-jobs/pull/36, but I missed this one. I think the same patch can be applied to https://github.com/basecamp/mission_control-jobs/blob/c14ab260b493195cfa99863e8a939882605e8ed6/lib/active_job/queue_adapters/solid_queue_ext.rb#L172,

@andbar
Copy link
Contributor Author

andbar commented Feb 22, 2024

Yeah, looks like that change should work for this scenario as well. I can make a PR with that change if that will help you out, or would you prefer to do it yourself? Thanks for all the work you've put in on solid_queue and mission_control, they've been working well for us since we switched over to them.

@rosa
Copy link
Member

rosa commented Feb 22, 2024

Thanks a lot @andbar! Really appreciate your kind words.

About this issue, as you wish! I'm happy to do the change and also happy to merge a PR from you with that change 😊 I probably won't get to it tomorrow as I need to do other work but will do it on Monday.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants