[BUG] Unhelpful crash message when ECS pidMode is configured incorrectly #22940
Comments
For me the change in container image name appears to be directly related. When I updated the Datadog container image name to point to an internal ECR repo, so we could control the version tag and enable a read-only root filesystem, this bug showed up. Changing the container image back to the Datadog-published one "fixed" it again.
Had the same issue with
Agent Environment
Agent 7.50.3 on AWS ECS Fargate, using the latest container image.
Stack trace
Describe what happened:
Got this crash while setting up Datadog process monitoring on a Fargate task. Eventually realized I'd made a typo and set `pidMode = task` on one of the container definitions instead of the root task config, and fixing that fixed the immediate issue.

Describe what you expected:
Any diagnostic message that would help me discover the configuration issue.
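For anyone who hits the same typo: `pidMode` is a task-level field of the ECS task definition, not a container-level one. A hedged sketch of where it belongs (illustrative family/container names and image tag, trimmed to the relevant fields):

```json
{
  "family": "my-app",
  "pidMode": "task",
  "containerDefinitions": [
    { "name": "app", "image": "my-app:latest" },
    { "name": "datadog-agent", "image": "datadog/agent:7.50.3" }
  ]
}
```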
Steps to reproduce the issue:
Set up an ECS task with multiple containers and don't set `pidMode`, sort of. There are more conditions but I'm not sure exactly what they are yet (in local testing I discovered it's apparently sensitive to whether the `datadog-agent` container's name comes alphabetically before or after any other containers in the task; more on that in a bit).

Additional environment details (Operating System, Cloud provider, etc):
Here's how far I've gotten in debugging the crash:
datadog-agent/pkg/process/util/chunking.go
Lines 85 to 98 in abce0cb
The specific line is `c.props[c.idx].size += len(ps)`; we know `c.idx` is zero from the panic message, but we also know that `c.idx >= len(c.chunks)` is not true, because otherwise the `if` would have been taken and `c.props` would have a 0th element. Substituting, `!(c.idx >= len(c.chunks))` => `c.idx < len(c.chunks)` => `0 < len(c.chunks)`; that is, `c.chunks` does have (at least) a 0th element.

`pkg/process/util/chunking.go` correctly maintains the relationship between `c.chunks` and `c.props`, but `c.chunks` can also escape the module by reference via `GetChunks`:

datadog-agent/pkg/process/util/chunking.go
Lines 100 to 102 in abce0cb
If another section were to acquire a reference to `c.chunks` via `GetChunks`, and then `append` to it, that would violate `Accept`'s assumption that `len(c.props) >= len(c.chunks)`. This occurs:

datadog-agent/pkg/process/checks/chunking.go
Lines 15 to 22 in abce0cb

datadog-agent/pkg/process/checks/chunking.go
Lines 45 to 51 in abce0cb
If either of the "two scenarios" referenced in `chunkProcessesBySizeAndWeight`'s comment occurs, and the container with unmappable processes is the first one inspected (hence why order matters!) so that `appendContainerWithoutProcesses` sees an empty `collectorProcs`, `c.chunks` will be extended but `c.props` won't. When `chunkProcessesBySizeAndWeight` later calls `utils.ChunkPayloadsBySizeAndWeight`, which calls `Accept`, this crash occurs.

My first instinct would be to say that
`GetChunks` shouldn't exist (pass `chunker` down to `appendContainerWithoutProcesses` and use `Accept` with an empty process list? I dunno), but I'm seeing this code for the first time, so take it with a grain of salt.

Presumably in this case
`process <=> container mapping cannot be established` because of the `pidMode` configuration issue, but I'm not certain. I don't know if there's a good way to detect that setting from within the container, but this crash definitely shouldn't be happening.