Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hubot responds 4 times to one message. But only some times? #1715

Open
johnseekins-pathccm opened this issue Feb 29, 2024 · 28 comments
Open

Comments

@johnseekins-pathccm
Copy link

johnseekins-pathccm commented Feb 29, 2024

We've written a custom hubot script to interact with AWS and report back to users. One of our responses will respond 4 times (with the exact same data, over ~20 minutes) in certain circumstances, and I can't understand what's happening.

So it starts normal enough:

robot.respond(/ecs list stale tasks( in )?([A-Za-z0-9-]+)?/i, async res => {
    const cluster = res.match[2] || defaultCluster
    const services = await paginateServices(ecsClient, cluster)

So then I can say something like @hubot ecs list stale tasks in Production and Hubot will come back with data about what we consider stale tasks.

What's interesting is that depending on the cluster I select, Hubot will either reply once (expected) or 4 times (far less expected):

  • @hubot ecs list stale tasks in Production responds 4 times with the exact same data, over about 20 minutes.
  • @hubot ecs list stale tasks in Development responds once.
  • @hubot ecs list stale tasks in Staging responds once.

We're using https://github.com/hubot-friends/hubot-slack as our adapter.

We only run a single instance of Hubot (because otherwise it can go split-brain) in a docker container in ECS. We are on the most recent release of Hubot (11.1.1).

My instinct is that it's something in the data, or how I'm handling pagination with AWS requests. But I'm honestly not sure. So any hints that y'all could provide would be amazing.

I'll provide the entire script below, just please don't judge my JS too harshly, I've never been good at it.

Actual script
use strict";
// Description:
//   Tool to introspect into ECS
//   Upstream docs: https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/ecs/
//
// Dependencies:
//   "@aws-sdk/client-ecs": "^3"
//
// Configuration:
//   HUBOT_AWS_REGION
//   HUBOT_AWS_ACCESS_KEY_ID
//   HUBOT_AWS_SECRET_ACCESS_KEY
//
// Commands:
//   hubot ecs list clusters - returns all ECS clusters in the defined region
//   hubot ecs describe cluster <name> - returns details about a single ECS cluster
//   hubot ecs list services (in <cluster>) - returns all ECS services in the defined cluster (defaults to Production)
//   hubot ecs describe service <name> (in <cluster>) - returns details about a single ECS service in a defined cluster (defaults to Production)
//   hubot ecs list stale tasks (in <cluster>) - returns details about ECS services that are considered 'stale' in a defined cluster (defaults to Production)
//
// Author:
//   <>

import {
  ECSClient,
  DescribeClustersCommand,
  DescribeServicesCommand,
  DescribeTaskDefinitionCommand,
  ListClustersCommand, 
  ListServicesCommand
} from '@aws-sdk/client-ecs' // ES Modules import

const region = process.env.HUBOT_AWS_REGION
const defaultCluster = 'Production'
process.env.AWS_ACCESS_KEY_ID = process.env.HUBOT_AWS_ACCESS_KEY_ID
process.env.AWS_SECRET_ACCESS_KEY = process.env.HUBOT_AWS_SECRET_ACCESS_KEY

/*
 * Variables for stale ECS deploys
 */
const staleDays = 45
const shortExpireDays = 2
const longExpireDays = 7
// days * 24 (hours) * 60 (minutes) * 60 (seconds) * 1000 (milliseconds) === number of milliseconds
const expiredSecs = staleDays * 24 * 60 * 60 * 1000
const shortExpireSecs = (staleDays - shortExpireDays) * 24 * 60 * 60 * 1000
const longExpireSecs = (staleDays - longExpireDays) * 24 * 60 * 60 * 1000
const ignoredServices = []

async function paginateClusters (ecsClient) {
  const input = {
    maxResults: 100
  }
  let command = new ListClustersCommand(input)
  let response
  try {
    response = await ecsClient.send(command)
  } catch (err) {
    console.log(`[ERROR] Request to AWS failed: ${err}`)
  }
  const results = response.clusterArns
  while (response.nextToken) {
    input.nextToken = response.nextToken
    command = new ListClustersCommand(input)
    try {
      response = await ecsClient.send(command)
    } catch (err) {
      console.log(`[ERROR] Request to AWS failed: ${err}`)
    }
    results.push(response.clusterArns)
  }
  return results
}

async function paginateServices (ecsClient, cluster) {
  const input = {
    cluster,
    maxResults: 100
  }
  let command = new ListServicesCommand(input)
  let response
  try {
    response = await ecsClient.send(command)
  } catch (err) {
    console.log(`[ERROR] Request to AWS failed: ${err}`)
  }
  const results = response.serviceArns
  while (response.nextToken) {
    input.nextToken = response.nextToken
    command = new ListServicesCommand(input)
    try {
      response = await ecsClient.send(command)
    } catch (err) {
      console.log(`[ERROR] Request to AWS failed: ${err}`)
    }
    results.push(response.serviceArns)
  }
  return results
}

export default async robot => {
  robot.logger.debug(`Running AWS commands in ${region}`)
  let ecsClient
  try {
    ecsClient = new ECSClient({ region })
  } catch (err) {
    robot.logger.error(`Can't make client connection: ${err}`)
  }

  /*
   * List Clusters
   */
  robot.respond(/ecs list clusters/i, async res => {
    const clusters = await paginateClusters(ecsClient)
    const clusterNames = clusters.map((x) => x.split('/')[x.split('/').length - 1]).sort()
    let output = `We have ${clusterNames.length} Clusters:\n\`\`\``
    for (let i = 0; i < clusterNames.length; i++) {
      output += `\n\t${clusterNames[i]}`
    }
    output += '\n```'
    res.send(output)
  })
  /*
   * Cluster Details
   */
  robot.respond(/ecs describe cluster ([A-Za-z0-9-]+)?/i, async res => {
    cluster = res.match[1] || defaultCluster
    const input = {
      clusters: [cluster],
      include: []
    }
    const command = new DescribeClustersCommand(input)
    try {
      const response = await ecsClient.send(command)
      res.send(`ECS Cluster \`${res.match[1]}\`:\n\`\`\`\nActive Services: ${response.clusters[0].activeServicesCount}\nRunning Tasks: ${response.clusters[0].runningTasksCount}\tPending Tasks: ${response.clusters[0].pendingTasksCount}\n\`\`\``)
    } catch (err) {
      robot.logger.error(`Request to AWS failed: ${err}`)
    }
  })

  /*
   * List Services in Cluster
   */
  robot.respond(/ecs list services( in )?([A-Za-z0-9-]+)?/i, async res => {
    const cluster = res.match[2] || defaultCluster
    const services = await paginateServices(ecsClient, cluster)
    const serviceNames = services.map((x) => x.split('/')[x.split('/').length - 1]).sort()
    let output = `We have ${serviceNames.length} Services configured in \`${cluster}\`:\n\`\`\``
    for (let i = 0; i < serviceNames.length; i ++) {
      output += `\n\t${serviceNames[i]}`
    }
    output += '\n```'
    res.send(output)
  })

  /*
   * Service Details
   */
  robot.respond(/ecs describe service ([A-Za-z0-9-]+)( in )?([A-Za-z0-9-]+)?/i, async res => {
    const cluster = res.match[3] || defaultCluster
    const service = res.match[1]
    let input = {
      cluster,
      services: [service],
      include: []
    }
    let command = new DescribeServicesCommand(input)
    let serviceData
    try {
      serviceData = await ecsClient.send(command)
      serviceData = serviceData.services[0]
    } catch (err) {
      robot.logger.error(`Request to AWS failed: ${err}`)
    }
    const taskDef = serviceData.taskDefinition

    input = {
      taskDefinition: taskDef
    }
    command = new DescribeTaskDefinitionCommand(input)
    let label
    try {
      const taskData = await ecsClient.send(command)
      label = taskData.taskDefinition.containerDefinitions[0].dockerLabels['com.datadoghq.tags.version']
    } catch (err) {
      robot.logger.error(`Task Definition Request to AWS failed: ${err}`)
    }
    let output = `\`${service}\` in \`${cluster}\`:\n\`\`\`\nCurrent Deployment at ${serviceData.deployments[0].createdAt}:\n`
    output += `\tDeployed Version: ${label}\n`
    output += `\tStatus: ${serviceData.deployments[0].rolloutState}\n`
    output += `\tDesired: ${serviceData.deployments[0].desiredCount}\tRunning: ${serviceData.deployments[0].runningCount}\tPending: ${serviceData.deployments[0].pendingCount}\tFailed: ${serviceData.deployments[0].failedCount || 0}`
    if (serviceData.deployments.length > 1) {
      output += `\nOlder Deployment at ${serviceData.deployments[1].createdAt}\n`
      output += `\tStatus: ${serviceData.deployments[1].rolloutState}\n`
      output += `\tDesired: ${serviceData.deployments[1].desiredCount}\tRunning: ${serviceData.deployments[1].runningCount}\tPending: ${serviceData.deployments[1].pendingCount}\tFailed: ${serviceData.deployments[1].failedCount || 0}`
    }
    if (serviceData.events.length > 0) {
      output += '\nRecent Events:'
      let count = 5
      if (serviceData.events.length < count) {
        count = serviceData.events.length
      }
      for (let i = 0; i < count; i++) {
        output += `\n\t${serviceData.events[i].message} at ${serviceData.events[i].createdAt}`
      }
    }
    output += '\n```'
    res.send(output)
  })
  /*
   * Stale Deploys
   */
  robot.respond(/ecs list stale tasks( in )?([A-Za-z0-9-]+)?/i, async res => {
    const cluster = res.match[2] || defaultCluster
    const services = await paginateServices(ecsClient, cluster)
    // no need to sort these results
    const serviceNames = services.map((x) => x.split('/')[x.split('/').length - 1])
    const staleDateShort = new Date(Date.now() - shortExpireSecs)
    const staleDateLong = new Date(Date.now() - longExpireSecs)
    const expiredDate = new Date(Date.now() - expiredSecs)
    let ignored = []
    let shortExp = []
    let longExp = []
    let exp = []
    /*
     * Collect service data
     */
    for (let i = 0; i < serviceNames.length; i++) {
      const service = serviceNames[i]
      if (ignoredServices.includes(service)) {
        ignored.push(service)
        continue
      }
      let input = {
        cluster,
        services: [service],
        include: []
      }
      let command = new DescribeServicesCommand(input)
      let serviceData
      try {
        serviceData = await ecsClient.send(command)
        serviceData = serviceData.services[0]
      } catch (err) {
        robot.logger.error(`Request to AWS failed: ${err}`)
      }
      const deployDate = new Date(serviceData.deployments[0].createdAt)
      // skip any service newer than our longest expiration window
      if (deployDate > staleDateLong) {
        continue
      }

      const servString = `\`${service}\` (deployed ${deployDate.toISOString()})`
      if (deployDate < expiredDate) {
        exp.push(servString)
      } else if (deployDate < staleDateShort) {
        shortExp.push(servString)
      } else {
        longExp.push(servString)
      }
    }
    /*
     * Output
     */
    let output
    if (exp.length < 1 && longExp.length < 1 && shortExp.length < 1) {
      output = `I couldn't find any stale tasks in \`${cluster}\``
    } else {
      output = `I found these old deployments in \`${cluster}\`:`
      if (exp.length > 0) {
        exp = exp.sort()
        output += `\n\tExpired:`
        for (let i = 0; i < exp.length; i++) {
          output += `\n\t\t${exp[i]}`
        }
      }
      if (shortExp.length > 0) {
        shortExp = shortExp.sort()
        output += `\nExpiring in <= ${shortExpireDays} days:`
        for (let i = 0; i < shortExp.length; i++) {
          output += `\n\t\t${shortExp[i]}`
        }
      }
      if (longExp.length > 0) {
        longExp = longExp.sort()
        output += `\nExpiring in <= ${longExpireDays} days:`
        for (let i = 0; i < longExp.length; i++) {
          output += `\n\t\t${longExp[i]}`
        }
      }
    }
    if (ignored.length > 0) {
      output += '\nI ignored these services:'
      for (let i = 0; i < ignored.length; i++) {
        output += `\n\t\`${ignored[i]}\``
      }
    }
    res.send(output)
  })
}
@joeyguerra
Copy link
Member

Nothing in the code immediately stands out to me as the culprit. I have a few probing questions:

  • how is Hubot hosted? i.e. in kubernetes, on an EC2 instance, ????
  • how many instances of Hubot are running?
  • Does a single instance of Hubot have access to Prod, Dev and Stage?
  • What version of Hubot is running?

@johnseekins-pathccm
Copy link
Author

johnseekins-pathccm commented Feb 29, 2024

Nothing in the code immediately stands out to me as the culprit. I have a few probing questions:

* how is Hubot hosted? i.e. in kubernetes, on an EC2 instance, ????

Docker container in ECS

* how many instances of Hubot are running?

1

* Does a single instance of Hubot have access to Prod, Dev and Stage?

Yes. Read-only access. And importantly, it's to the ECS Clusters. Not separate accounts/environments/etc.

* What version of Hubot is running?

11.1

@joeyguerra
Copy link
Member

Does it only respond 4 times when the value is "Production"?

@johnseekins-pathccm
Copy link
Author

Or when I leave it to "default". So when cluster === Production.

@joeyguerra
Copy link
Member

what chat adapter are you using?
Does it respond 4 times with the same answer?

@johnseekins-pathccm
Copy link
Author

johnseekins-pathccm commented Feb 29, 2024

what chat adapter are you using? Does it respond 4 times with the same answer?

https://github.com/hubot-friends/hubot-slack
Yep. Exact same response, 4 times. Also takes about 20 minutes to get all four replies.

(Updated all that in the initial question, too)

@joeyguerra
Copy link
Member

Ok. I've seen this behavior before during development. The issue was that the code failed to acknowledge the message. In that situation. the Slack system will "retry sending the message". Here's where the code is supposed to acknowledge the message.

I also see an issue in the Slack Adapter. It's not awaiting robot.receive. I'm unsure what that will cause, but I'll have to push a fix for that.

@joeyguerra
Copy link
Member

Can you start Hubot with HUBOT_LOG_LEVEL=debug to see what line of code the execution is getting to?

@joeyguerra
Copy link
Member

I've also added the await call in the Slack Adapter.

@johnseekins-pathccm
Copy link
Author

Seems to just...receive the message multiple times? To be clear...I definitely only typed it once, but this pattern (and I'm hesitant to give you full log messages...) looks like it's just...getting the message again.
Screenshot 2024-02-29 at 4 15 02 PM

@johnseekins-pathccm
Copy link
Author

Updated to the new adapter and I still get the duplicate messages. :(

@joeyguerra
Copy link
Member

joeyguerra commented Feb 29, 2024

Another thought is to await res.send because it's async.

@johnseekins-pathccm
Copy link
Author

await res.send() also doesn't help.

@joeyguerra
Copy link
Member

Is it odd that the envelope_id is different for each of those messages?

@joeyguerra
Copy link
Member

Can you run a Hubot instance locally on your machine and replace the behavior?

@xurizaemon
Copy link
Collaborator

xurizaemon commented Mar 1, 2024

It sounds like you might have a plausible cause, so add several grains of salt to anything in this comment :)

When I've observed Hubot get into a repeats-replies state, I had a suspicion it related to functionality such as remind-her or polling plugins (eg watch statuspage, report when status changes). It seemed like the use of setTimeout() or setInterval() could create concurrent threads. (The fact that you see it reply four times specifically suggests to me this doesn't quite fit ... but maybe there's a magic number in that system I don't know about.)

If the current best theory doesn't pan out, maybe consider which plugins could be disabled to isolate the behaviour?

@johnseekins-pathccm
Copy link
Author

johnseekins-pathccm commented Mar 1, 2024

There is a timeout in the slack response! Because this query to AWS is relatively slow, that doesn't entirely surprise me:

{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Text = @hubot ecs list stale tasks"}
{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Event subtype = undefined"}
{"level":20,"time":1709307623089,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Received generic message: message"}
{"level":20,"time":1709307623090,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"Message '@hubot ecs list stale tasks' matched regex //^\\s*[@]?Hubot[:,]?\\s*(?:ecs list stale tasks( in )?([A-Za-z0-9-]+)?)/i/; listener.options = { id: null }"}
{"level":20,"time":1709307626395,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"eventHandler {
 \"envelope_id\": \"bd22596e-ee19-4201-8250-792f91fc96d7\",
 \"body\": {
    \"token\": \"<>\",
    \"team_id\": \"<>\",
    \"context_team_id\": \"<>\",
    \"context_enterprise_id\": null,
    \"api_app_id\": \"<>\",
    \"event\": {
      \"client_msg_id\": \"<>\",
      \"type\": \"message\",
      \"text\": \"<@hubot> ecs list stale tasks\",
      \"user\": \"<>\",
      \"ts\": \"1709307622.850469\",
      \"blocks\": [
        {
          \"type\": \"rich_text\",
          \"block_id\": \"5X8EE\",
          \"elements\": [
            {
              \"type\": \"rich_text_section\",
              \"elements\": [
                {
                  \"type\": \"user\",
                  \"user_id\": \"<>\"
                },
                {
                  \"type\": \"text\",
                  \"text\": \" ecs list stale tasks\"
                }
              ]
            }
          ]
        }
      ],
      \"team\": \"<>\",
      \"channel\": \"<>\",
      \"event_ts\": \"1709307622.850469\",
      \"channel_type\": \"channel\"
    },
    \"type\": \"event_callback\",
    \"event_id\": \"<>\",
    \"event_time\": 1709307622,
    \"authorizations\": [
      {
        \"enterprise_id\": null,
        \"team_id\": \"<>\",
        \"user_id\": \"<>\",
        \"is_bot\": true,
        \"is_enterprise_install\": false
      }
    ],
    \"is_ext_shared_channel\": false,
    \"event_context\": \"<>\"
  }
  \"event\": {
    \"client_msg_id\": \"<>\",
    \"type\": \"message\",
    \"text\": \"<@hubot> ecs list stale tasks\",
    \"user\": \"<>\",
    \"ts\": \"1709307622.850469\",
    \"blocks\": [
      {
        \"type\": \"rich_text\",
        \"block_id\": \"5X8EE\",
        \"elements\": [
          {
            \"type\": \"rich_text_section\",
            \"elements\": [
              {
                \"type\": \"user\",
                \"user_id\": \"<>\"
              },
              {
                \"type\": \"text\",
                \"text\": \" ecs list stale tasks\"
              }
            ]
          }
        ]
      }
    ],
    \"team\": \"<>\",
    \"channel\": \"<>\",
    \"event_ts\": \"1709307622.850469\",
    \"channel_type\": \"channel\"
  },
  \"retry_num\": 1,
  \"retry_reason\": \"timeout\",
  \"accepts_response_payload\": false
}"
}
{"level":20,"time":1709307626395,"pid":11932,"hostname":"John-Seekins-MacBook-Pro-16-inch-2023-","name":"Hubot","msg":"event {
  \"envelope_id\": \"<>\",
  \"body\": {
    \"token\": \"<>\",
    \"team_id\": \"<>\",
    \"context_team_id\": \"<>\",
    \"context_enterprise_id\": null,
    \"api_app_id\": \"<>\",
    \"event\": {
      \"client_msg_id\": \"<>\",
      \"type\": \"message\",
      \"text\": \"<@hubot> ecs list stale asks\",
      \"user\": \"<>\",
      \"ts\": \"1709307622.850469\",
      \"blocks\": [
        {
          \"type\": \"rich_text\",
          \"block_id\": \"5X8EE\",
          \"elements\": [
            {
              \"type\": \"rich_text_section\",
              \"elements\": [
                {
                  \"type\": \"user\",
                  \"user_id\": \"<>\"
                },
                {
                  \"type\": \"text\",
                  \"text\": \" ecs list stale tasks\"
                }
              ]
            }
          ]
        }
      ],
      \"team\": \"<>",
      \"channel\": \"<>\",
      \"event_ts\": \"1709307622.850469\",
      \"channel_type\": \"channel\"
    },
    \"type\": \"event_callback\",
    \"event_id\": \"<>\",
    \"event_time\": 1709307622,
    \"authorizations\": [
      {
        \"enterprise_id\": null,
        \"team_id\": \"<>\",
        \"user_id\": \"<>\",
        \"is_bot\": true,
        \"is_enterprise_install\": false
      }
    ],
    \"is_ext_shared_channel\": false,
    \"event_context\": \"<>\"
  },
  \"event\": {
    \"client_msg_id\": \"<>\",
    \"type\": \"message\",
    \"text\": \"<@hubot> ecs list stale tasks\",
    \"user\": \"<>",
    \"ts\": \"1709307622.850469\",
    \"blocks\": [
      {\n        \"type\": \"rich_text\",
        \"block_id\": \"5X8EE\",
        \"elements\": [
          {
            \"type\": \"rich_text_section\",
            \"elements\": [
              {
                \"type\": \"user\",
                \"user_id\": \"<>\"
              },
              {
                \"type\": \"text\",
                \"text\": \" ecs list stale tasks\"
              }
            ]
          }
        ]
      }
    ],
    \"team\": \"<>\",
    \"channel\": \"<>\",
    \"event_ts\": \"1709307622.850469\",
    \"channel_type\": \"channel\"
  },
  \"retry_num\": 1,
  \"retry_reason\": \"timeout\",
  \"accepts_response_payload\": false}
 user = <>"
}

@johnseekins-pathccm
Copy link
Author

It's definitely me racing a timeout! I changed the code to batch AWS requests more efficiently and I'm no longer getting duplicate messages!

Relevant code:

  /*
   * Stale Deploys
   */
  robot.respond(/ecs list stale tasks( in )?([A-Za-z0-9-]+)?/i, async res => {
    const cluster = res.match[2] || defaultCluster
    const services = await paginateServices(ecsClient, cluster)
    // no need to sort these results
    const serviceNames = services.map((x) => x.split('/')[x.split('/').length - 1])
    const staleDateShort = new Date(Date.now() - shortExpireSecs)
    const staleDateLong = new Date(Date.now() - longExpireSecs)
    const expiredDate = new Date(Date.now() - expiredSecs)
    let ignored = []
    let shortExp = []
    let longExp = []
    let exp = []
    /*
     * Collect service data
     */
    const chunkSize = 10
    for (let i = 0; i < serviceNames.length; i += chunkSize) {
      let chunk = serviceNames.slice(i, i + chunkSize)
      const ignoredFromChunk = chunk.filter((service) => ignoredServices.includes(service))
      ignored.push.apply(ignored, ignoredFromChunk)
      chunk = chunk.filter((service) => !ignoredServices.includes(service))
      if (chunk.length < 1) {
        continue
      }
      let input = {
        cluster,
        services: chunk,
        include: []
      }
      let command = new DescribeServicesCommand(input)
      let serviceData
      try {
        serviceData = await ecsClient.send(command)
        serviceData = serviceData.services
      } catch (err) {
        robot.logger.error(`Request to AWS failed: ${err}`)
      }
      for (let idx = 0; idx < serviceData.length; idx++) {
        const deployDate = new Date(serviceData[idx].deployments[0].createdAt)
        // skip any service newer than our longest expiration window
        if (deployDate > staleDateLong) {
          continue
        }

        const servString = `\`${serviceData[idx].serviceName}\` (deployed ${deployDate.toISOString()})`
        if (deployDate < expiredDate) {
          exp.push(servString)
        } else if (deployDate < staleDateShort) {
          shortExp.push(servString)
        } else {
          longExp.push(servString)
        }
      }
    }

@joeyguerra
Copy link
Member

Well done tracking down this bug.

I don't see the code that "batches the AWS requests". Would you mind pointing it out for me? I'd love to see how you solved it.

I'm also curious if there's a move I can make to the Slack Adapter to either not let this situation happen or make it very visible that it's happening.

@johnseekins-pathccm
Copy link
Author

Sure. The batch happens here:

const chunkSize = 10
    for (let i = 0; i < serviceNames.length; i += chunkSize) {
      let chunk = serviceNames.slice(i, i + chunkSize)
      const ignoredFromChunk = chunk.filter((service) => ignoredServices.includes(service))
      ignored.push.apply(ignored, ignoredFromChunk)
      chunk = chunk.filter((service) => !ignoredServices.includes(service))
      if (chunk.length < 1) {
        continue
      }
      let input = {
        cluster,
        services: chunk,
        include: []
      }
      let command = new DescribeServicesCommand(input)
      let serviceData
      try {
        serviceData = await ecsClient.send(command)
        serviceData = serviceData.services
      } catch (err) {
        robot.logger.error(`Request to AWS failed: ${err}`)
      }

Let's expand that a bit:

Instead of doing

    for (let i = 0; i < serviceNames.length; i++) {
      const service = serviceNames[i]
      if (ignoredServices.includes(service)) {
        ignored.push(service)
        continue
      }
      let input = {
        cluster,
        services: [service],
        include: []
      }
      let command = new DescribeServicesCommand(input)
      let serviceData
      try {
        serviceData = await ecsClient.send(command)
        serviceData = serviceData.services[0]
      } catch (err) {
        robot.logger.error(`Request to AWS failed: ${err}`)
      }

I now loop through the list of serviceArns in groups of 10 (and do some filtering). This means that I would send a request like ['service1', 'service2', ..., 'service10'] instead of [service1], [service2], etc. Reducing the time taken collecting data from AWS by a factor of 10.

I think perhaps surfacing the request timeout (somehow) would be amazing. Just so we know it's there.

@joeyguerra
Copy link
Member

joeyguerra commented Mar 1, 2024

I see. The new code changes

let input = {
  cluster,
  services: [service],
  include: []
}

to

let input = {
  cluster,
  services: services,
  include: []
}

where services is an array of service names without the ignored ones.

@johnseekins-pathccm
Copy link
Author

Closing this as it seems to be more an issue with timeouts within adapters. Thanks for the help!

@johnseekins-pathccm
Copy link
Author

I've been running in to this issue again. In some cases, I can improve the performance of the commands Hubot runs and avoid the race. In others, it's just an expensive query (or an expensive series of queries). I wasn't fully clear if this was adapter-specific or if this was a problem within Hubot itself.
@joeyguerra Should I resubmit this with the slack adapter or is there more debugging we can do here?

@joeyguerra
Copy link
Member

joeyguerra commented Dec 20, 2024

Which slack adapter are you using?

@johnseekins-pathccm
Copy link
Author

@joeyguerra
Copy link
Member

I'm happy to get on Discord to troubleshoot. I'm in Central time.

Other reasons that I've seen multiple responses to a single message is that multiple regexes (in the respond setup) match the message and so multiple handlers execute.

@johnseekins-pathccm
Copy link
Author

I appreciate the discord offer, but I don't actually have an account currently. I'll look in to that.

I could see multiple responses from different regexes happening, sure, but wouldn't the responses then be different? Not the same response 4 times?

@joeyguerra
Copy link
Member

MS Teams and Zoom would work too.

Yes, if the handlers do and respond with different results, they would be different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@joeyguerra @xurizaemon @johnseekins-pathccm and others