
[Performance] How to minimize/restrict memory utilization? #662

Open
akamensky opened this issue Dec 31, 2021 · 20 comments
Labels: performance (performance related issues and questions)

Comments

@akamensky

Problem description

We just moved our large system from graphite-web to carbonapi -> go-carbon, which gives much better performance and also lets us scale horizontally easily (with multiple backend servers defined in carbonapi).

My team manages the monitoring system, but not the individual dashboards or how they query; that is done by others. This results in very large queries to carbonapi (both in the number of metrics requested at once and in the date range).

The above causes carbonapi to run out of memory every few minutes on a 128 GB server. This never happened with graphite-web (it was slow, but its memory footprint was pretty small). I've tried to tune the following:

  • Number of concurrent connections -- no real effect on memory usage
  • Setting caches to XX MBs -- only seems to result in much faster RSS growth (OOM within seconds)
  • Completely disabling the cache -- only seems to result in much faster RSS growth (OOM within seconds)
  • Timeouts -- no effect on memory usage
  • Changing between v1 and v2 backends -- v2 backends consume memory VERY quickly (within 2-3 seconds of startup it is already OOM); v1 memory growth is much slower

It seems related to having multiple backend servers and needing to merge responses, which is done in memory. There is no setting that really controls that, at least none documented.

So my question is -- how to restrict memory usage in carbonapi to avoid OOM?

carbonapi's version
v0.15.4

Did this happen before
N/A, it did not happen before, but that was graphite-web with a single data server, not carbonapi with multiple backends.

carbonapi's config

listen: "0.0.0.0:8081"

prefix: ""
useCachingDNSResolver: false
cachingDNSRefreshTime: "1m"

expvar:
  enabled: true
  pprofEnabled: false
  listen: ""

cpus: 10
concurency: 1000
maxBatchSize: 100
idleConnections: 10
pidFile: ""

cache:
  type: "mem"
  size_mb: 0
  defaultTimeoutSec: 60

graphite:
  host: ""
  interval: "60s"
  prefix: "carbon.api"
  pattern: "{prefix}.{fqdn}"

upstreams:
  tldCacheDisabled: true
  buckets: 10
  slowLogThreshold: "10s"
  timeouts:
    find: "60s"
    render: "60s"
    connect: "200ms"
  concurrencyLimitPerServer: 0
  keepAliveInterval: "5s"
  maxIdleConnsPerHost: 100
  doMultipleRequestsIfSplit: false
  backends:
    - "http://backend1:8080"
    - "http://backend2:8080"
    - "http://backend3:8080"
  # carbonsearch is not used if empty
  carbonsearch:

logger:
  - logger: ""
    file: "stderr"
    level: "error"
    encoding: "console"
    encodingTime: "iso8601"
    encodingDuration: "seconds"

backend software and config

go-carbon(s):

[common]
user = "user"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "local"
metric-interval = "10s"
max-cpu = 20

[whisper]
data-dir = "/data/graphite/whisper/"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
workers = 20
max-updates-per-second = 0
max-creates-per-second = 0
hard-max-creates-per-second = false
sparse-create = false
flock = false
enabled = true
hash-filenames = true

[cache]
max-size = 6000000
write-strategy = "sorted"
input-buffer=600000

[udp]
listen = ":2003"
enabled = true
log-incomplete = false
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
buffer-size = 0

[pickle]
listen = ":2004"
max-message-size = 67108864
enabled = false
buffer-size = 0

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
read-timeout = "30s"

[grpc]
listen = "127.0.0.1:7003"
enabled = true

[tags]
enabled = false
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
local-dir = "/var/lib/graphite/tagging/"
tagdb-timeout = "1s"

[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
buckets = 10
metrics-as-counters = false
read-timeout = "60s"
write-timeout = "60s"
query-cache-enabled = true
query-cache-size-mb = 0
find-cache-enabled = true
trigram-index = true
scan-frequency = "5m0s"
max-globs = 10000
fail-on-max-globs = false
graphite-web-10-strict-mode = true
internal-stats-dir = ""
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
enabled = false
path = "/var/lib/graphite/dump/"
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false

[[logging]]
logger = ""
file = "/var/log/go-carbon/go-carbon.log"
level = "warn"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

Query that causes problems

  1. What's the query? I can't provide the exact query, but an example would be using aliasByNode with multiple wildcards.
  2. How many metrics does it fetch? Thousands.
  3. What's the resolution of those metrics? We don't have insight into that, as it is defined by individual dev teams, so resolutions may vary within a single request. However, from what I can see, 90% of all metrics in the system use 5s resolution (devs need that to troubleshoot issues, so we cannot really lower the resolution to longer intervals). After 7 days those 5s points are aggregated to 1m (by go-carbon), then after 30 days to 10min, at which point they are kept for a total of 2 years.
  4. How many datapoints per metric do you have? Not sure I understand the question; see the aggregation/retention policy above.
akamensky added the performance label Dec 31, 2021
@Civil
Member

Civil commented Jan 5, 2022

Hi,

Some comments:

Number of concurrent connections -- no real effect on memory usage

It should depend on how many actual connections your users generate.

Setting caches to XX MBs -- only seems to result in much faster RSS growth (OOM within seconds)

That is odd. Basically, if inserting a value would overflow the size limit, the cache evicts some random items (https://github.com/dgryski/go-expirecache/blob/master/cache.go#L93-L95). That causes more load on Go's GC, but apart from that it should slow down the RSS growth.

But let's keep that thought about GC in mind for now.
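For reference, here is a minimal, self-contained sketch of that size-capped, random-eviction idea (my own simplification for illustration; the real logic lives in go-expirecache at the link above). Note that every evicted value becomes garbage that Go's GC then has to collect, which is where the extra GC load comes from.

package main

import "fmt"

// byteCache is a toy size-capped cache: when an insert would push the total
// past maxBytes, arbitrary ("random") entries are dropped until the new value fits.
type byteCache struct {
    maxBytes  int
    totalSize int
    items     map[string][]byte
}

func (c *byteCache) set(key string, value []byte) {
    for c.totalSize+len(value) > c.maxBytes && len(c.items) > 0 {
        for victim, data := range c.items { // map iteration order is unspecified, i.e. effectively random
            delete(c.items, victim)
            c.totalSize -= len(data)
            break
        }
    }
    c.items[key] = value
    c.totalSize += len(value)
}

func main() {
    c := &byteCache{maxBytes: 1024, items: map[string][]byte{}}
    for i := 0; i < 10; i++ {
        c.set(fmt.Sprintf("key-%d", i), make([]byte, 300))
    }
    fmt.Println("entries kept:", len(c.items), "cache bytes:", c.totalSize)
}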

Completely disabling the cache -- only seems to result in much faster RSS growth (OOM within seconds)

That actually can happen: the more actual requests you do, the more garbage you'll have for Go to collect.

Changing between v1 and v2 backends -- v2 backends consume memory VERY quickly (within 2-3 seconds of startup it is already OOM); v1 memory growth is much slower

That is extremely weird, as backendv1 is converted internally to backendv2 (there is just an extra step to pre-populate some things). Basically, for each backends section you get a backendsv2 section that behaves the same (https://github.com/go-graphite/carbonapi/blob/main/zipper/config/config.go#L128-L155).

Basically your section:

  backends:
    - "http://backend1:8080"
    - "http://backend2:8080"
    - "http://backend3:8080"

Is equivalent to:

    doMultipleRequestsIfSplit: true
    backendsv2:
        backends:
          -
            groupName: "backends"
            protocol: "carbonapi_v2_pb"
            lbMethod: "broadcast"
            maxBatchSize: 100
            keepAliveInterval: "10s"
            maxIdleConnsPerHost: 1000
            doMultipleRequestsIfSplit: true
            servers:
                - "http://backend1:8080"
                - "http://backend2:8080"
                - "http://backend3:8080"

It would also override the global doMultipleRequestsIfSplit and set it to true to mimic the old behavior. In that case the difference is that your requests will be split into chunks of maxBatchSize metrics, and several of them will be sent in parallel as separate requests to all your backends.

That would be friendlier towards Go's GC.
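To make the splitting concrete, here is a hedged sketch of what "split by maxBatchSize and send the chunks in parallel" amounts to conceptually (my illustration, not carbonapi's actual code; splitIntoBatches is a made-up helper name):

package main

import "fmt"

// splitIntoBatches chunks the requested metric list into groups of at most
// maxBatchSize; each chunk becomes its own request to the backends, and with
// doMultipleRequestsIfSplit enabled those requests go out concurrently.
// Assumes maxBatchSize > 0 (per the docs quoted later, 0 means "do not split").
func splitIntoBatches(metrics []string, maxBatchSize int) [][]string {
    var batches [][]string
    for len(metrics) > 0 {
        n := maxBatchSize
        if n > len(metrics) {
            n = len(metrics)
        }
        batches = append(batches, metrics[:n])
        metrics = metrics[n:]
    }
    return batches
}

func main() {
    metrics := make([]string, 0, 250)
    for i := 0; i < 250; i++ {
        metrics = append(metrics, fmt.Sprintf("app.host%03d.cpu", i))
    }
    // maxBatchSize: 100, as in the config above => 3 requests per backend
    fmt.Println("requests per backend:", len(splitIntoBatches(metrics, 100)))
}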

It seems related to having multiple backend servers and needing to merge responses, which is done in memory. There is no setting that really controls that, at least none documented.

maxBatchSize controls that to some extent. Also, carbonapi_v3_pb saves a bit on extra information, but since it supports sending multiple requests in a single HTTP request it might be more taxing on the GC.

concurrencyLimitPerServer also limits the number of parallel connections; in your case I would change it to some smaller value (0 = unlimited), as that would allow the server to merge more small requests together.
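As an illustration of what such a per-server limit does (again a sketch of the general idea, not carbonapi's implementation), a buffered channel used as a semaphore bounds the number of in-flight requests, and with it the number of responses that must be held in memory at the same time:

package main

import (
    "fmt"
    "sync"
)

func main() {
    const limit = 4 // analogue of a non-zero concurrencyLimitPerServer
    sem := make(chan struct{}, limit)

    var wg sync.WaitGroup
    for i := 0; i < 20; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot; blocks while `limit` requests are in flight
            defer func() { <-sem }() // release the slot once the response has been handled
            fmt.Println("fetching batch", id)
        }(i)
    }
    wg.Wait()
}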

So my question is -- how to restrict memory usage in carbonapi to avoid OOM?

With go-carbon, the ways to deal with that are rather limited, unfortunately (it doesn't support sending pre-aggregated replies and will always do its best to send you all the data it can). To be honest, I'm not sure there is any way to limit the reply size on the go-carbon side, which would be the better approach here. Otherwise, you can limit the number of concurrent queries and actually enable caches (the less you need to go to the backends, the better).

There were some efforts by @msaf1980 to improve how caching is done in general. You might want to look at his work (it's currently in master; I haven't cut a release yet as I want to fix a few issues first).

It would also help if you could collect some heap profiles and share the SVG: https://go.dev/doc/diagnostics#profiling

carbonapi provides a way to enable expvar and pprof on a separate port:
https://github.com/go-graphite/carbonapi/blob/main/cmd/carbonapi/carbonapi.example.yaml#L27-L30

You can enable it there and collect profiles with `curl http://carbonapi:[expvar_listen_port]/debug/pprof/heap > heap.pprof`, then use the docs I linked above. Or you can use this article as a reference: https://medium.com/compass-true-north/memory-profiling-a-go-service-cd62b90619f9 (it seems to list all the steps).

@Civil
Member

Civil commented Jan 6, 2022

Oh, and I forgot to mention: since there is some evidence that this might actually be GC pressure, it would be great to see what Go version you are using, and maybe you can play a bit with the GOGC value (https://pkg.go.dev/runtime). There are numerous articles on how to do that and what it means:

  1. https://dave.cheney.net/tag/gogc
  2. https://archive.fosdem.org/2019/schedule/event/gogc/

And some others. So lowering it might help, if garbage collection pressure is actually the issue here.
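For completeness, the same knob is also exposed programmatically via runtime/debug, which may be handier for quick experiments than rebuilding a container with a different GOGC environment variable (a minimal sketch, not carbonapi code):

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Default is 100; GOGC=50 in the environment is equivalent to this call.
    // A lower value makes the collector run more often, trading CPU time for
    // a smaller heap between collections.
    previous := debug.SetGCPercent(50)
    fmt.Println("GC target percentage lowered from", previous, "to 50")
}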

@akamensky
Author

akamensky commented Jan 6, 2022

it would be great to see what Go

Currently I am using the image from Docker Hub, but I've also built the tip with Go 1.17, and in both cases the behavior is similar. Though I did not measure the time it takes to OOM in either case.

I have changed the setup from 1 Grafana -> 1 carbonapi -> 2+ go-carbon to 1 Grafana -> 1 Nginx LB -> 2+ carbonapi+go-carbon, which naturally spreads requests in round-robin fashion across multiple carbonapi instances. While they won't share caches etc., that has improved the memory situation a lot (not a single OOM just yet). Memory still grows to a fairly high number, but much more slowly, and it is eventually released.

@Civil
Member

Civil commented Jan 6, 2022

When you have backends in broadcast mode, carbonapi will fetch replies from all your backends and try to merge them, so if you have 2 copies of data you'll need about 3x the memory (for a short period of time) to store that data.
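A rough sketch of why that costs memory (my illustration, not carbonapi's actual merge code): all replies are held at once while a third, merged series is filled point by point from whichever reply has a value.

package main

import (
    "fmt"
    "math"
)

// mergeSeries keeps the value from primary unless it is NaN (absent), in which
// case the value from secondary (another backend's reply) is used. Both inputs
// plus the merged output are alive at the same time, hence roughly 3x the data size.
func mergeSeries(primary, secondary []float64) []float64 {
    merged := make([]float64, len(primary))
    for i, v := range primary {
        if math.IsNaN(v) && i < len(secondary) {
            merged[i] = secondary[i]
            continue
        }
        merged[i] = v
    }
    return merged
}

func main() {
    a := []float64{1, math.NaN(), 3}
    b := []float64{math.NaN(), 2, 3}
    fmt.Println(mergeSeries(a, b)) // [1 2 3]
}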

About releasing memory - Go, like most GC-based languages, does not really like to do that, so even if memory is no longer in use it won't be released to the OS for quite some time. That is expected. That's why carbonapi exports some metrics itself; for actual memory usage you should refer to those.

Overall, for heavy requests I would recommend considering migrating the backend to graphite-clickhouse/carbon-clickhouse and enabling backend-side aggregation. Not only would you not need broadcast mode (if you have replication enabled on the ClickHouse side, it ensures the data is the same across all your replicas), it can also pre-aggregate responses based on what Grafana requested. That usually gives some reduction in the amount of data you need to fetch and process. However, it obviously has its own drawbacks (you'll need to manage a ClickHouse installation, you'll need to migrate the data somehow, and ClickHouse in general is slower for single reads and small numbers of updates, but faster for bulk reads and writes).

@akamensky
Author

akamensky commented Jan 7, 2022

so if you have 2 copies of data

No, we don't have copies of the data; it is all spread across different servers (using carbon-relay-ng and consistent hashing), as the write throughput of a single server (RAID10 SATA3 SSDs) is not enough anymore (unless we go for oh-so-expensive NVMe storage). But it does merge, since with consistent hashing we can't predict which metrics go where, and a single dashboard may load metrics from different storage hosts. That is expected.

The unexpected part is how much memory it wants to use. The first setup (which had OOMs) has 128 GB RAM, which carbonapi consumed in a matter of seconds in some instances. The current setup is 2 servers, each with 256 GB RAM, and carbonapi periodically gets close to the max.

Comparing it to graphite-web -- the same data requested uses barely 2 MB of RAM. But it is entirely possible that the merging is done on disk there? (That could explain how insanely slow it is to serve data from 2 sources.)

@akamensky
Author

We did not have this issue for a while, and recently started hitting it again. I did another round of config tuning, lowering concurrency, maxBatchSize, upstreams.backendsv2.maxBatchSize and upstreams.carbonsearchv2.maxBatchSize. That has fixed it again for now.

Perhaps it would be good to create performance tuning documentation on how to tune for different use cases?

@easterhanu

We just went into production with carbonapi 0.16.0~1 and also quickly ended up with memory issues - carbonapi had maybe 100 MiB of data in the cache, but process memory consumption was 15 GiB and increasing. We have two go-carbon carbonserver backends in broadcast mode, but the memory issue can be replicated with a single backend as well.

After some testing, we think this is a memory leak related to the carbonapi response cache and the JSON response format. Here's a small bash test script to run locally on an idle carbonapi server - it keeps requesting the same data in a loop once a second, increasing maxDataPoints by one to force a cache miss every time, and records carbonapi's RSS (resident set size) memory usage and the change between requests:

#!/bin/bash

carbonapi_pid=$(pgrep -u carbon carbonapi)
if [ -z "$carbonapi_pid" ]; then
    printf "carbonapi is not running\n"
    exit 0
fi

render_url="localhost:8080/render"
target="testing.carbonapi.*.runtime.mem_stats.*"
range="from=-48h&until=now"
format="json"
request="${render_url}/render?target=${target}&${range}&format=${format}"

printf "Teasing carbonapi at $render_url\n"
rss_before=$(ps -q $carbonapi_pid --no-headers -o rss)
for points in {1000..2000}; do
    curl --silent --show-error "${request}&maxDataPoints=$points" > /dev/null || break
    rss_after=$(ps -q $carbonapi_pid --no-headers -o rss)
    printf "%s # carbonapi RSS: %9d bytes (delta %6d bytes)\n" \
        "maxDataPoints=$points" $rss_after $(($rss_after - $rss_before))
    rss_before=$rss_after
    sleep 1
done

With carbonapi response cache enabled and backend cache disabled, i.e.:

cache:
    type: "mem"
    size_mb: 0
    defaultTimeoutSec: 60

backendCache:
    type: "null"

...and running the script on a small test VM, carbonapi runs out of memory pretty fast:

# JSON format, response cache on, backend cache off = OOM
$ ./carbonapi_oom.sh
Teasing carbonapi at localhost:8080/render
maxDataPoints=1000 # carbonapi RSS:     38140 KiB (delta  18396 KiB)
maxDataPoints=1001 # carbonapi RSS:     43920 KiB (delta   5780 KiB)
maxDataPoints=1002 # carbonapi RSS:     57980 KiB (delta  14060 KiB)
...
maxDataPoints=1095 # carbonapi RSS:    374736 KiB (delta  12164 KiB)
maxDataPoints=1096 # carbonapi RSS:    386352 KiB (delta  11616 KiB)
curl: (52) Empty reply from server # kernel OOM killer

The selected metrics query and the size of the response affect the memory consumption rate, but the point here is that we hardly ever see the RSS figure going down. Sometimes the delta stays at zero for a few requests, but overall it's an almost linear increase.

However, if we switch from the response cache to the backend cache, or request the data in CSV format instead of JSON, carbonapi's memory consumption stays perfectly under control:

# JSON format, response cache off, backend cache on  = OK
# CSV format,  response cache on,  backend cache off = OK
# CSV format,  response cache off, backend cache on  = OK
$ ./carbonapi_oom.sh
Teasing carbonapi at localhost:8080/render
maxDataPoints=1000 # carbonapi RSS:     37192 KiB (delta  18056 KiB)
maxDataPoints=1001 # carbonapi RSS:     46680 KiB (delta   9488 KiB)
maxDataPoints=1002 # carbonapi RSS:     53200 KiB (delta   6520 KiB)
...
maxDataPoints=1048 # carbonapi RSS:     92164 KiB (delta   5680 KiB)
maxDataPoints=1049 # carbonapi RSS:     78564 KiB (delta -13600 KiB)
maxDataPoints=1050 # carbonapi RSS:     78884 KiB (delta    320 KiB)
maxDataPoints=1051 # carbonapi RSS:     91740 KiB (delta  12856 KiB)
...
maxDataPoints=1161 # carbonapi RSS:    125932 KiB (delta  13300 KiB)
maxDataPoints=1162 # carbonapi RSS:    106184 KiB (delta -19748 KiB)
maxDataPoints=1163 # carbonapi RSS:     94868 KiB (delta -11316 KiB)
maxDataPoints=1164 # carbonapi RSS:     94700 KiB (delta   -168 KiB)
...
maxDataPoints=2000 # carbonapi RSS:     96560 KiB (delta -24092 KiB)

So a workaround for carbonapi's huge memory usage seems to be disabling the response cache and relying on the backend cache instead.

We are not Go experts here, but a colleague of mine tried profiling the issue, and all the excess memory seems to be used by MarshalJSON(). Maybe there's something wrong in the way the JSON data is processed, or in the way response cache eviction works.

@Civil
Member

Civil commented Jan 26, 2023

Can you also grab and share a memory profile?

@msaf1980
Collaborator

msaf1980 commented Jan 26, 2023

If the cache is enabled, that's the problem: different maxDataPoints values produce different data sets (because a different cache key is used).
It works as expected, but it can potentially be exploited in an untrusted environment.

For example, this is the key-building function:

func responseCacheComputeKey(from, until int64, targets []string, format string, maxDataPoints int64, noNullPoints bool, template string) string {
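For illustration only (the real implementation is the function whose signature is quoted above; the sketch below is an assumption about its general shape, with made-up names): because maxDataPoints is part of the key, requests that differ only in that parameter land in separate cache entries, which is exactly what the test script above exploits.

package main

import (
    "fmt"
    "strconv"
    "strings"
)

// computeKeySketch mimics the idea of responseCacheComputeKey: the key includes
// maxDataPoints, so otherwise identical requests create distinct cache entries.
func computeKeySketch(from, until int64, targets []string, format string, maxDataPoints int64) string {
    return strings.Join([]string{
        strconv.FormatInt(from, 10),
        strconv.FormatInt(until, 10),
        strings.Join(targets, ","),
        format,
        strconv.FormatInt(maxDataPoints, 10),
    }, ":")
}

func main() {
    targets := []string{"testing.carbonapi.*.runtime.mem_stats.*"}
    fmt.Println(computeKeySketch(-172800, 0, targets, "json", 1000))
    fmt.Println(computeKeySketch(-172800, 0, targets, "json", 1001)) // different key => cache miss
}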

@msaf1980
Collaborator

msaf1980 commented Jan 26, 2023

@easterhanu If you need to protect against it, set a cache size limit.

@easterhanu

@Civil here's a heap profile moments before OOM:

heap.pprof.gz

@msaf1980 as far as I can tell, setting the response cache size_mb to, for example, 50 or 100 MiB has no effect on the test result - RSS still just keeps increasing until carbonapi runs out of memory. According to carbonapi's internal cache_size metric, the cache size itself is not the problem; memory is spent somewhere else (e.g. in production we've seen tens of MiB in the cache, but several GiB of memory in use).

@msaf1980
Collaborator

msaf1980 commented Jan 26, 2023

We are not Go experts here, but a colleague of mine tried profiling the issue, and all the excess memory seems to be used by MarshalJSON(). Maybe there's something wrong in the way the JSON data is processed, or in the way response cache eviction works.

Direct writes to http.ResponseWriter may be a solution. I did some testing earlier, but didn't make a PR - in our environment we don't have memory overload (16 GB is not too costly for huge installations).
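A sketch of that direct-write idea (one possible approach under my own assumptions, not an existing carbonapi code path): encode the JSON straight into the http.ResponseWriter, so no large intermediate []byte has to be allocated, kept, or cached.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// renderHandler streams the JSON response directly to the client instead of
// marshaling it into a buffer first.
func renderHandler(w http.ResponseWriter, r *http.Request) {
    response := []map[string]interface{}{{
        "target":     "example.metric",
        "datapoints": [][2]float64{{1.0, 1674000000}, {2.0, 1674000060}},
    }}

    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(response); err != nil {
        // Headers are already sent at this point, so just log the failure.
        log.Printf("encoding response failed: %v", err)
    }
}

func main() {
    http.HandleFunc("/render", renderHandler)
    log.Fatal(http.ListenAndServe("localhost:8081", nil))
}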

@easterhanu

We did some more testing and debugging, and we think the root cause of the response cache's huge memory consumption is this:

n := len(results) * (len(results[0].Name) + len(results[0].PathExpression) + 128*len(results[0].Values) + 128)

bf0ffdc changed the way the byte slice is created from var b []byte to b := make([]byte, 0, n), but we believe the math for calculating the value of n is wrong. Adding some printf statements shows the cap of the created slice being ~15x larger than the len of the actual data requires. When these oversized slices are then stored in the response cache, they end up holding a lot of memory that is not really used for anything by the application, but that Go's garbage collector cannot clean up either.
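A self-contained sketch of the effect (illustrative only; the plain map below stands in for carbonapi's response cache): a slice allocated with an over-estimated capacity keeps its whole backing array alive for as long as the cache holds a reference to it, while copying into a right-sized slice before caching would let the oversized buffer be collected.

package main

import "fmt"

var cache = map[string][]byte{} // stands in for the response cache

// renderJSON over-allocates its buffer, in the spirit of the n calculation above.
func renderJSON(payload []byte) []byte {
    b := make([]byte, 0, 15*len(payload)) // cap is ~15x what is actually written
    b = append(b, payload...)
    return b
}

func main() {
    payload := make([]byte, 1<<20) // 1 MiB of rendered JSON

    resp := renderJSON(payload)
    cache["key-a"] = resp // retains cap(resp) bytes (~15 MiB), not len(resp)

    // Possible workaround: copy into a right-sized slice before caching, so the
    // oversized backing array becomes garbage and can be collected.
    trimmed := make([]byte, len(resp))
    copy(trimmed, resp)
    cache["key-b"] = trimmed

    fmt.Printf("len=%d cap=%d (cached as-is) vs len=%d cap=%d (trimmed)\n",
        len(resp), cap(resp), len(trimmed), cap(trimmed))
}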

@easterhanu

There are other memory concerns too. During the past weekend we had two production servers running carbonapi with just the backend cache enabled (as a workaround for the response cache issues). Server A's carbonapi had steady memory consumption of around ~50 MiB, whereas server B's carbonapi got killed by the kernel OOM killer after reaching almost 30 GiB.

The carbonapi settings were the same on both servers, but B was serving some really heavy wildcard requests which would often fail with something like:

WARN    zipper  errors occurred while getting results   {"type": "protoV2Group", "name": "http://xxxxx", "type": "fetch", "request": "&MultiFetchRequest{Metrics:[]FetchRequest{FetchRequest{ ...  (insert tons of FetchRequests for different metrics)  ...  "errors": "max tries exceeded", "errorsVerbose": "max tries exceeded\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:25\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: failed to fetch data from server/group\nHTTP Code: 500\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:27\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\n\nCaused By: error while fetching Response\n\ngithub.com/go-graphite/carbonapi/zipper/types.init\n\t/root/go/src/github.com/go-graphite/carbonapi/zipper/types/errors.go:34\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6321\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.doInit\n\t/usr/local/go/src/runtime/proc.go:6298\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:233\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"}

Sometimes we could also see "could not expand globs - context canceled" errors on the go-carbon side.

Switching doMultipleRequestsIfSplit: true to doMultipleRequestsIfSplit: false seemed to fix the query issues and also memory consumption, although we do not exactly understand why.

@msaf1980
Collaborator

This needs some research. Can you test a version from a custom branch?

n := len(results) * (len(results[0].Name) + len(results[0].PathExpression) + 128*len(results[0].Values) + 128)

It's an attempt to get an approximate buffer size. Without it there are too many reallocations. You can see the benchmark in PR #729.

Maybe the logic needs to change, or we should switch to direct writes.

@easterhanu

@msaf1980 which branch would you like us to test?

@msaf1980
Collaborator

msaf1980 commented Jan 31, 2023

@msaf1980 which branch would you like us to test?

Tomorrow I'll write down the branch names. I'll adapt the direct-write branch to the current master, and maybe also a version with an updated buffer preallocation policy.

@msaf1980
Collaborator

msaf1980 commented Feb 1, 2023

Switching doMultipleRequestsIfSplit: true to doMultipleRequestsIfSplit: false seemed to fix the query issues and also memory consumption, although we do not exactly understand why.

Hm, if this is true, then JSON marshaling is not the cause of the high memory usage.

From documentation:
Only affects cases with maxBatchSize > 0. If set to `false` requests after split will be sent out one by one, otherwise in parallel

@Civil I don't use go-carbon under heavy load. Is doMultipleRequestsIfSplit: false recommended for this?

@Civil
Member

Civil commented Feb 1, 2023

@Civil I don't use go-carbon under heavy load. Is doMultipleRequestsIfSplit: false recommended for this?

It is niche. To get better results you need:

  1. Multiple go-carbon servers; actually, the more the better.
  2. The servers' I/O subsystem should handle multiple concurrent requests well.
  3. Individual requests (after the split) should be small or medium-sized.

It mostly allows you to save on the network and exploit the potential concurrency of the underlying storage.

If you have a small number of servers, or slow I/O that does not handle concurrent requests well, I wouldn't recommend it, as I would expect worse performance.

Potentially there is room to implement a heuristic that alternates between the two, but that would require getting some information from go-carbon and much more performance data than I can gather myself.

And I would strongly advise against splitting globs or anything fancy if your backend is a database that can scale by itself (e.g. ClickHouse).

@Knud3

Knud3 commented Jun 2, 2023

doMultipleRequestsIfSplit: false did not help at all in our single go-carbon and single carbonapi setup. When doing a massive render request it starts eating memory, up to somewhere around 40-50 GB, and then crashes. The pod does not restart because the memory limit was not reached, but main will parse config as yaml {"config_file": "/etc/carbonapi/carbonapi.yaml"} appears in the logs after the crash, so the process did restart itself.

Any settings to try?

notFoundStatusCode: 404
cache:
  type: "mem"
  size_mb: 1024
  defaultTimeoutSec: 600
backendCache:
  type: "mem"
  size_mb: 4096
  defaultTimeoutSec: 10800
truncateTime:
  "8760h": "1h"
  "2160h": "10m"
  "1h": "1m"
  "0": "10s"
cpus: 6
concurency: 1000
combineMultipleTargetsInOne: true
idleConnections: 200
upstreams:
  graphite09compat: false
  buckets: 10
  keepAliveInterval: "15s"
  timeouts:
    find: "300s"
    render: "300s"
    connect: "500ms"
  backendsv2:
    backends:
      -
        groupName: "go-carbon"
        protocol: "carbonapi_v3_pb"
        lbMethod: "all"
        doMultipleRequestsIfSplit: true
        maxTries: 3
        maxBatchSize: 500
        concurrencyLimit: 0
        servers:
          - "http://carbonserver:8080"
expireDelaySec: 600
unicodeRangeTables:
  - "Latin"
  - "Common"
