[BUG] Unhelpful crash message when ECS pidMode is configured incorrectly #22940
Comments
For me the change in container image name appears to be directly related. When I updated the Datadog container image name to point to an internal ECR repo, so we could control the version tag and enable a read-only root filesystem, this bug showed up. Changing the container image back to the Datadog-published one "fixed" it again.
Had the same issue with
Agent Environment
Agent 7.50.3 on AWS ECS Fargate, using the latest container image.
Stack trace
Describe what happened:
Got this crash while setting up Datadog process monitoring on a Fargate task. Eventually realized I'd made a typo and set `pidMode = task` on one of the container definitions instead of the root task config, and fixing that fixed the immediate issue.

Describe what you expected:
Any diagnostic message that would help me discover the configuration issue.
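For anyone who hits the same typo: `pidMode` is a task-level field of the ECS task definition, not a container-level one. A hedged sketch of where it belongs (illustrative family/container names and image tag, trimmed to the relevant fields):

```json
{
  "family": "my-app",
  "pidMode": "task",
  "containerDefinitions": [
    { "name": "app", "image": "my-app:latest" },
    { "name": "datadog-agent", "image": "datadog/agent:7.50.3" }
  ]
}
```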
Steps to reproduce the issue:
Set up an ECS task with multiple containers and don't set `pidMode`, sort of. There are more conditions but I'm not sure exactly what they are yet (in local testing I discovered it's apparently sensitive to whether the `datadog-agent` container's name comes alphabetically before or after any other containers in the task; more on that in a bit).

Additional environment details (Operating System, Cloud provider, etc):
Here's how far I've gotten in debugging the crash:
datadog-agent/pkg/process/util/chunking.go
Lines 85 to 98 in abce0cb
The specific line is `c.props[c.idx].size += len(ps)`; we know `c.idx` is zero from the panic message, but we also know that `c.idx >= len(c.chunks)` is not true, because otherwise the `if` would have been taken and `c.props` would have a 0th element. Substituting, `!(c.idx >= len(c.chunks))` => `c.idx < len(c.chunks)` => `0 < len(c.chunks)`; that is, `c.chunks` does have (at least) a 0th element.

`pkg/process/util/chunking.go` correctly maintains the relationship between `c.chunks` and `c.props`, but `c.chunks` can also escape the module by reference via `GetChunks`:

datadog-agent/pkg/process/util/chunking.go
Lines 100 to 102 in abce0cb
If another section were to acquire a reference to `c.chunks` via `GetChunks`, and then `append` to it, that would violate `Accept`'s assumption that `len(c.props) >= len(c.chunks)`. This occurs:

datadog-agent/pkg/process/checks/chunking.go
Lines 15 to 22 in abce0cb

datadog-agent/pkg/process/checks/chunking.go
Lines 45 to 51 in abce0cb
If either of the "two scenarios" referenced in `chunkProcessesBySizeAndWeight`'s comment occurs, and the container with unmappable processes is the first one inspected (hence why order matters!) so that `appendContainerWithoutProcesses` sees an empty `collectorProcs`, `c.chunks` will be extended but `c.props` won't. When `chunkProcessesBySizeAndWeight` later calls `utils.ChunkPayloadsBySizeAndWeight`, which calls `Accept`, this crash occurs.

My first instinct would be to say that
`GetChunks` shouldn't exist (pass `chunker` down to `appendContainerWithoutProcesses` and use `Accept` with an empty process list? I dunno), but I'm seeing this code for the first time, so take it with a grain of salt.

Presumably in this case
`process <=> container mapping cannot be established` because of the `pidMode` configuration issue, but I'm not certain. I don't know if there's a good way to detect that setting from within the container, but this crash definitely shouldn't be happening.