[BUG] Unhelpful crash message when ECS pidMode is configured incorrectly #22940

m0rg-dev opened this issue Feb 17, 2024 · 2 comments

@m0rg-dev

Agent Environment

Agent 7.50.3 on AWS ECS Fargate, using the latest container image.

Stack trace
| 1708126659023 | panic: runtime error: index out of range [0] with length 0                                                                                                                                                                                                                                                                                                                                                        |
| 1708126659023 | goroutine 378 [running]:                                                                                                                                                                                                                                                                                                                                                                                          |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/util.(*ChunkAllocator[...]).Accept(0x5dcf9a0, {0xc001bb0080?, 0xc, 0x10}, 0x681)                                                                                                                                                                                                                                                                                     |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/util/chunking.go:96 +0x2f9                                                                                                                                                                                                                                                                                                           |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/util.ChunkPayloadsBySizeAndWeight[...](0xc00118b720, 0xc001bd25f0, 0x48, 0xf4240?)                                                                                                                                                                                                                                                                                   |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/util/chunking.go:166 +0x2c5                                                                                                                                                                                                                                                                                                          |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.chunkProcessesBySizeAndWeight({0xc001bb0080?, 0xc, 0x10}, 0xc001a8b680, 0x4044b33333333333?, 0x0?, 0xc001bd25f0)                                                                                                                                                                                                                                              |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/chunking.go:42 +0x326                                                                                                                                                                                                                                                                                                         |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.chunkProcessesAndContainers(0x97d03d8?, {0xc000152720, 0x3, 0xc001c0d860?}, 0xc001bad560?, 0xc001afcdb0?)                                                                                                                                                                                                                                                     |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:414 +0x118                                                                                                                                                                                                                                                                                                         |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.createProcCtrMessages(0xc000074360, 0x40401c28f5c28f5c?, {0xc000152720?, 0x0?, 0x403a8a3d70a3d70a?}, 0x0?, 0x3ff9eb851eb851ec?, 0x57eac212, {0x0, 0x0}, ...)                                                                                                                                                                                                  |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:371 +0x5d                                                                                                                                                                                                                                                                                                          |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.(*ProcessCheck).run(0xc000b5e480, 0x57eac212, 0x1)                                                                                                                                                                                                                                                                                                            |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:277 +0x6fe                                                                                                                                                                                                                                                                                                         |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.(*ProcessCheck).Run(0x3?, 0xc00144ce00, 0xc001ac53b0)                                                                                                                                                                                                                                                                                                         |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/process.go:351 +0xe5                                                                                                                                                                                                                                                                                                          |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).runCheckWithRealTime(0xc0002392c0, {0x6ed99b0, 0xc000b5e480}, 0xc001ac53b0)                                                                                                                                                                                                                                                                    |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:182 +0xc6                                                                                                                                                                                                                                                                                                           |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).runnerForCheck.func2({0x1, 0x1})                                                                                                                                                                                                                                                                                                               |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:350 +0x65                                                                                                                                                                                                                                                                                                           |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/checks.(*runnerWithRealTime).run(0xc000b4df40)                                                                                                                                                                                                                                                                                                                       |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/checks/runner.go:73 +0x35a                                                                                                                                                                                                                                                                                                           |
| 1708126659023 | github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).Run.func1()                                                                                                                                                                                                                                                                                                                                    |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:287 +0x5c                                                                                                                                                                                                                                                                                                           |
| 1708126659023 | created by github.com/DataDog/datadog-agent/pkg/process/runner.(*CheckRunner).Run                                                                                                                                                                                                                                                                                                                                 |
| 1708126659023 |  /omnibus/src/datadog-agent/src/github.com/DataDog/datadog-agent/pkg/process/runner/runner.go:285 +0x3c9                                                                                                                                                                                                                                                                                                          |
| 1708126659055 | process-agent exited with code 2, signal 0, restarting in 2 seconds

Describe what happened:

Got this crash while setting up DataDog process monitoring on a Fargate task. Eventually realized I'd made a mistake and set pidMode = task on one of the container definitions instead of on the root task config; moving it to the task level fixed the immediate issue.

Describe what you expected:

Any diagnostic message that would help me discover the configuration issue.

Steps to reproduce the issue:

Set up an ECS task with multiple containers and don't set pidMode at the task level. There are more conditions, but I'm not sure exactly what they are yet. In local testing I discovered it's apparently sensitive to whether the datadog-agent container's name comes alphabetically before or after the other containers in the task (more on that below).

Additional environment details (Operating System, Cloud provider, etc):


Here's how far I've gotten in debugging the crash:

func (c *ChunkAllocator[T, P]) Accept(ps []P, weight int) {
	if c.idx >= len(c.chunks) {
		// If we are outside of the range of allocated chunks, allocate a new one
		c.chunks = append(c.chunks, *new(T))
		c.props = append(c.props, chunkProps{})
	}
	if c.OnAccept != nil {
		c.OnAccept(&c.chunks[c.idx])
	}
	c.AppendToChunk(&c.chunks[c.idx], ps)
	c.props[c.idx].size += len(ps)
	c.props[c.idx].weight += weight
}

The specific line is c.props[c.idx].size += len(ps). We know c.idx is zero from the panic message, and we also know that c.idx >= len(c.chunks) was false, because otherwise the if branch would have been taken and c.props would have a 0th element. Substituting: !(c.idx >= len(c.chunks)) => c.idx < len(c.chunks) => 0 < len(c.chunks). That is, c.chunks does have (at least) a 0th element, while the panic message tells us c.props has length 0.

pkg/process/util/chunking.go itself correctly maintains the relationship between c.chunks and c.props, but c.chunks can also escape the package by reference via GetChunks:

func (c *ChunkAllocator[T, P]) GetChunks() *[]T {
	return &c.chunks
}

If code elsewhere acquires a reference to c.chunks via GetChunks and then appends to it, that violates Accept's assumption that len(c.props) >= len(c.chunks). This is what happens here:

func chunkProcessesBySizeAndWeight(procs []*model.Process, ctr *model.Container, maxChunkSize, maxChunkWeight int, chunker *util.ChunkAllocator[model.CollectorProc, *model.Process]) {
	if ctr != nil && len(procs) == 0 {
		// can happen in two scenarios, and we still need to report the container
		// a) if a process is skipped (e.g. disallowlisted)
		// b) if process <=> container mapping cannot be established (e.g. Docker on Windows)
		appendContainerWithoutProcesses(ctr, chunker.GetChunks())
		return
	}

func appendContainerWithoutProcesses(ctr *model.Container, collectorProcs *[]model.CollectorProc) {
	if len(*collectorProcs) == 0 {
		*collectorProcs = append(*collectorProcs, model.CollectorProc{})
	}
	collectorProc := &(*collectorProcs)[len(*collectorProcs)-1]
	collectorProc.Containers = append(collectorProc.Containers, ctr)
}

If either of the "two scenarios" referenced in chunkProcessesBySizeAndWeight's comment occurs, and the container with unmappable processes is the first one inspected (which is why order matters!) so that appendContainerWithoutProcesses sees an empty collectorProcs, c.chunks will be extended but c.props won't. When a later call to chunkProcessesBySizeAndWeight reaches util.ChunkPayloadsBySizeAndWeight, which calls Accept, this crash occurs.
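
To make that sequence concrete, here is a minimal standalone sketch of the same failure. It is not the agent's code: the allocator type is a simplified, non-generic stand-in I made up for util.ChunkAllocator, chunks hold plain strings, and acceptPanics is a hypothetical helper that just reports whether Accept panicked.

```go
package main

import "fmt"

// chunkProps mirrors the per-chunk bookkeeping kept next to each chunk.
type chunkProps struct {
	size   int
	weight int
}

// allocator is a simplified stand-in for util.ChunkAllocator (assumption:
// only the fields relevant to the crash are modelled).
type allocator struct {
	idx    int
	chunks [][]string
	props  []chunkProps
}

// GetChunks hands out a pointer to the chunk slice, as the original does.
func (a *allocator) GetChunks() *[][]string { return &a.chunks }

// Accept has the same shape as the original: props only grows when
// idx is past the end of chunks.
func (a *allocator) Accept(ps []string, weight int) {
	if a.idx >= len(a.chunks) {
		a.chunks = append(a.chunks, nil)
		a.props = append(a.props, chunkProps{})
	}
	a.chunks[a.idx] = append(a.chunks[a.idx], ps...)
	a.props[a.idx].size += len(ps) // panics if props is shorter than chunks
	a.props[a.idx].weight += weight
}

// acceptPanics calls Accept once and returns the panic message, or ""
// if Accept completed normally.
func acceptPanics(a *allocator) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	a.Accept([]string{"proc"}, 1)
	return ""
}

func main() {
	a := &allocator{}

	// Append through the leaked pointer, the way
	// appendContainerWithoutProcesses does on an empty slice:
	// chunks grows, props does not.
	chunks := a.GetChunks()
	*chunks = append(*chunks, nil)

	fmt.Println("len(chunks):", len(a.chunks), "len(props):", len(a.props))
	fmt.Println("Accept panic:", acceptPanics(a))
}
```

A fresh allocator that nobody has appended to behind its back goes through Accept without trouble; only the out-of-band append puts the two slices out of step.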

My first instinct would be to say that GetChunks shouldn't exist (pass chunker down to appendContainerWithoutProcesses and use Accept with an empty process list? I dunno) but I'm seeing this code for the first time so take it with a grain of salt.
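
Roughly what I have in mind, as a sketch only and not a patch against the real code: the allocator, collectorProc, ensureChunk, and AppendContainer names below are all hypothetical, and the types are simplified stand-ins for util.ChunkAllocator and model.CollectorProc. The idea is that every path that can grow chunks goes through one helper that grows props with it. One simplification to be aware of: this attaches the container to the chunk at idx, whereas the existing appendContainerWithoutProcesses uses the last chunk.

```go
package main

import "fmt"

// collectorProc is a simplified stand-in for model.CollectorProc.
type collectorProc struct {
	Processes  []string
	Containers []string
}

type chunkProps struct {
	size   int
	weight int
}

// allocator is a simplified stand-in for util.ChunkAllocator.
type allocator struct {
	idx    int
	chunks []collectorProc
	props  []chunkProps
}

// ensureChunk (hypothetical) is the only place chunks grows, so props
// always grows with it. It returns the chunk at the current index.
func (a *allocator) ensureChunk() *collectorProc {
	if a.idx >= len(a.chunks) {
		a.chunks = append(a.chunks, collectorProc{})
		a.props = append(a.props, chunkProps{})
	}
	return &a.chunks[a.idx]
}

// Accept adds processes to the current chunk and updates its bookkeeping.
func (a *allocator) Accept(ps []string, weight int) {
	c := a.ensureChunk()
	c.Processes = append(c.Processes, ps...)
	a.props[a.idx].size += len(ps)
	a.props[a.idx].weight += weight
}

// AppendContainer (hypothetical) records a container that has no mapped
// processes without exposing the chunk slice to the caller.
func (a *allocator) AppendContainer(ctr string) {
	c := a.ensureChunk()
	c.Containers = append(c.Containers, ctr)
}

func main() {
	a := &allocator{}
	a.AppendContainer("sidecar")   // container with no mapped processes comes first
	a.Accept([]string{"nginx"}, 1) // a later container with processes no longer panics
	fmt.Println(len(a.chunks), len(a.props), a.chunks[0].Containers, a.chunks[0].Processes)
}
```

With something like this, GetChunks would only be needed for reading the final result, not for mutation.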

Presumably in this case process <=> container mapping cannot be established because of the pidMode configuration issue, but I'm not certain. I don't know if there's a good way to detect that setting from within the container, but this crash definitely shouldn't be happening.

@henare

henare commented Mar 27, 2024

For me the change in container image name appears to be directly related. I had pidMode set correctly and live process data flowing for all containers in the task definition.

When I updated the Datadog container image name to point to an internal ECR repo, so we could control the version tag and enable a read-only root filesystem, this bug showed up. Changing the container image back to the Datadog published one "fixed" it again.

@Just-Drue

Had same issue with pidMode, fixing that resolved it.
