Cost_Final table update policy fails with out of memory exception. #1188

Open · RolandKrummenacher opened this issue Dec 5, 2024 · 13 comments · May be fixed by #1216
Assignees: flanakin
Labels: Skill: KQL (KQL queries and Data Explorer integration) · Tool: FinOps hubs (Data pipeline solution) · Type: Bug 🐛 (Something isn't working)

Comments

@RolandKrummenacher (Collaborator)

🐛 Issue

During data ingestion, the Cost_Final_V1_0 update policy (function Costs_transform_v1_0) fails due to an "out of memory" exception.
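
For context, an update policy runs a transform function over each batch ingested into a source table. A rough, hedged sketch of how such a policy is typically wired in Kusto; the source table name here is an assumption, not necessarily the hub's actual schema:

```kusto
// Hedged sketch only — the actual FinOps hub policy definition may differ.
// The update policy on Cost_Final_v1_0 runs Costs_transform_v1_0() over
// each batch ingested into the (assumed) raw source table "Costs_raw".
.alter table Cost_Final_v1_0 policy update
@'[{"IsEnabled": true, "Source": "Costs_raw", "Query": "Costs_transform_v1_0()", "IsTransactional": true}]'
```

With IsTransactional set to true, a failure in the transform (such as this out-of-memory error) fails the whole ingestion.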

👣 Steps to Reproduce

  1. Setup

    • Configure the FinOps Toolkit with the default settings, using the smallest ADX SKU.
  2. Prepare Data

    • Ensure price data has already been ingested.
  3. Trigger Issue

    • Ingest 50 MB of cost data in Snappy-compressed Parquet format.
  4. Verify Ingestion Failure

    • Run the command .show ingestion failures in the ingestion database.
    • Confirm the presence of an "out of memory" exception.
  5. Verify Operation Failure

    • Run the command .show operations in the ingestion database.
    • Confirm that the Costs_transform_v1_0 update policy has failed. (A diagnostic sketch for steps 4 and 5 follows this list.)
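
A minimal sketch of the verification commands from steps 4 and 5; the time window and projected columns are illustrative, and each command should be run separately:

```kusto
// Step 4: look for the "out of memory" text in recent ingestion failures.
.show ingestion failures
| where FailedOn > ago(1d)
| project FailedOn, Table, Details

// Step 5: confirm the failed update-policy operation.
.show operations
| where StartedOn > ago(1d) and State == "Failed"
| project StartedOn, Operation, Status
```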

🤔 Expected Outcome

The ingestion process should successfully complete without any errors.

  • No "out of memory" exceptions should occur during the operation.
  • The Costs_transform_v1_0 update policy should execute without failures, ensuring proper data transformation and ingestion.

ℹ️ Additional context

I suspect that the join in the Costs_transform_v1_0 function might be part of the problem. It could potentially be replaced with a lookup. Other KQL optimizations may also apply, such as materializing the price lookup before joining; see the sketch below.
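
A hedged sketch of that idea, with assumed table and column names (not the actual Costs_transform_v1_0 source): materialize the price dimension once, then use lookup, which is optimized for joining a large fact table against a small dimension table.

```kusto
// Materialize the (assumed) price dimension once so it isn't recomputed.
let prices = materialize(
    Prices_final_v1_0
    | project x_SkuMeterId, x_EffectiveUnitPrice   // assumed column names
);
// lookup instead of join: the right side is treated as a dimension table.
Costs_raw
| lookup kind=leftouter prices on x_SkuMeterId
```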

🙋‍♀️ Ask for the community

We could use your help:

  1. Please vote this issue up (👍) to prioritize it.
  2. Leave comments to help us solidify the vision.
@RolandKrummenacher RolandKrummenacher added Needs: Triage 🔍 Untriaged issue needs to be reviewed Type: Bug 🐛 Something isn't working labels Dec 5, 2024
@RolandKrummenacher (Collaborator, Author)

Discussed with @flanakin during our debug session today.

@MSBrett (Contributor) commented Dec 5, 2024

I ran this command to clear the data from the Prices table, which seems to let the ingestion complete:
.drop extents from Prices_final_v1_0
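
If dropping all price data is too aggressive, .drop extents can also be scoped by a query so only older extents are removed; a sketch, with an arbitrary retention window:

```kusto
// Drop only extents older than 30 days instead of the whole table's data.
.drop extents <| .show table Prices_final_v1_0 extents
| where CreatedOn < ago(30d)
```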

@RolandKrummenacher (Collaborator, Author)

I also suspect that the parse_resourceid(ResourceId) call is not performant. It looks like this function executes a subquery for every record.

@mschwit commented Dec 9, 2024

I'm also getting similar errors, but for different queries. The errors I usually get look like this (even with larger ADX SKUs, not just the smallest):
Expression.Error: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }').
[0]Kusto.Data.Exceptions.KustoServicePartialQueryFailureLowMemoryConditionException: Query execution lacks memory resources to complete (80DA0007): Partial query failure: Low memory condition (E_LOW_MEMORY_CONDITION). (message: 'hr: '2161770503' 'Engine under memory pressure, context: sequence datum materialize': ', details: 'StgError { kind: HResult(2161770503, "Engine under memory pressure, context: sequence datum materialize"), source: None }').

@RolandKrummenacher (Collaborator, Author)

> I'm also getting similar errors, but for different queries. Errors I usually get are like this (even with bigger and not the smallest ADX SKUs): Expression.Error: Query execution lacks memory resources to complete (80DA0007) […]

Do you see this error with your own custom queries or with the FinOps toolkit's queries?

@mschwit commented Dec 9, 2024

I get this when using the Power BI reports https://github.com/microsoft/finops-toolkit/blob/dev/src/power-bi/kql/CostSummary.pbip and https://github.com/microsoft/finops-toolkit/blob/dev/src/power-bi/kql/RateOptimization.pbip

Even with a cluster with 128 GB of memory, I get errors such as:

Query execution has exceeded the allowed limits (80DA0001): 'summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits).. [0]Kusto.Data.Exceptions.KustoServicePartialQueryFailureLimitsExceededException: Query execution has exceeded the allowed limits (80DA0001): 'summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits).. Timestamp=2024-12-09T13:36:49.8502403Z ClientRequestId=KPBI;81e77b52-ff9a-44f7-a72a-df89a26f30f3;28f52463-ef89-4443-b8b9-a7b1747e05b4;d7b4848d-1f8f-44e2-ba8f-ed785386a4cb ActivityId=90f5e9a6-b259-4e53-bfe4-49207348f03d ActivityType=GW.Http.CallContext ServiceAlias=FINOPSTESTMCB MachineName=KEngine000000 ProcessName=Kusto.WinSvc.Svc ProcessId=6836 ThreadId=7808 ActivityStack=(Activity stack: CRID=KPBI;81e77b52-ff9a-44f7-a72a-df89a26f30f3;28f52463-ef89-4443-b8b9-a7b1747e05b4;d7b4848d-1f8f-44e2-ba8f-ed785386a4cb ARID=90f5e9a6-b259-4e53-bfe4-49207348f03d > GW.Http.CallContext/90f5e9a6-b259-4e53-bfe4-49207348f03d) MonitoredActivityContext=(ActivityType=GW.Http.CallContext, Timestamp=2024-12-09T13:35:54.0927543Z, ParentActivityId=90f5e9a6-b259-4e53-bfe4-49207348f03d, TimeSinceStarted=55757.486 [ms])ErrorCode= ErrorReason= ErrorMessage= DataSource= DatabaseName= ClientRequestId= ActivityId=00000000-0000-0000-0000-000000000000 UnderlyingErrorCode=80DA0001 UnderlyingErrorMessage='summarize' operator has exceeded the memory budget (5368709120) during evaluation. Results may be incorrect or incomplete (E_RUNAWAY_QUERY; see https://aka.ms/kustoquerylimits). . The exception was raised by the IDbCommand interface. Table: HybridBenefitCosts.

@RolandKrummenacher (Collaborator, Author)

> Even with a cluster of 128 GB Memory, I e.g. get this error:

Although it's a similar error, I don't think it's directly related to this issue, since I see the problem at ingestion time, not at query time.
However, are you running the reports with daily or monthly granularity? Have you tried limiting the Power BI report to a one-month time range?

@mschwit commented Dec 9, 2024

I tried limiting the report to just 2 days with a monthly scope, but it seems that the query first builds the full result set and only filters it afterwards.
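
If that is the case, pushing the date filter ahead of the aggregation is the usual KQL remedy, since summarize then only sees the requested window. A hedged sketch with illustrative FOCUS-style column names:

```kusto
// Filter to the requested window first so summarize aggregates less data.
Costs
| where ChargePeriodStart >= startofmonth(now())   // illustrative window
| summarize TotalCost = sum(EffectiveCost) by SubAccountName
```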

@flanakin flanakin added Tool: FinOps hubs Data pipeline solution Skill: KQL KQL queries and Data Explorer integration and removed Needs: Triage 🔍 Untriaged issue needs to be reviewed labels Dec 10, 2024
@flanakin flanakin added this to the 2024-12 - December milestone Dec 10, 2024
@flanakin (Collaborator)

> I'm also suspecting that the parse_resourceid(ResourceId) call is not performant. It looks like this function is executing a subquery for every record.

I was able to ingest data from our test account without a problem. The largest snappy parquet file was 11.7 MB. That said, I also suspect the parse_resourceid() function. There's a join to look up a resource type display name for the ResourceType column.

I enabled 7-day retention for the raw data and ingested 13 months of data. I ran the Costs_transform_v1_0() function (which is what the update policy uses) with and without the join, and compared the following query metrics (a measurement sketch follows the table):

| Metric | Before | After | Diff | Percent |
| --- | --- | --- | --- | --- |
| Query time | 98.188s | 18.984s | -79.204s | -80.6% |
| CPU time | 45.391s | 33.047s | -12.391s | -27.3% |
| Data scanned | 199.1 MB | 203.5 MB | +4.4 MB | +2.21% |
| Peak memory | 6.9 GB | 985.7 MB | -5.9 GB | -86.087% |
| Data size | 72.1 MB | 72.2 MB | +82.3 B | +0.11% |
| Row count | 39,409 | 39,250 | -159 | -0.40% |
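
One way such before/after metrics can be collected is from the cluster's recent query history; a hedged sketch (exact columns can vary by service version):

```kusto
// Run the transform, then pull resource statistics for recent executions.
.show queries
| where StartedOn > ago(1h) and Text has "Costs_transform_v1_0"
| project StartedOn, Duration, TotalCpu, MemoryPeak
```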

I'll work on removing or replacing this. I have a thought on how we can achieve this with a slightly different approach.

@flanakin (Collaborator)

Right now, the parse_resourceid() function runs for every row, but it's not needed for most. If I move it to only be called when needed (keeping the join), the performance is even better, but the memory is a little higher than it was without the join.
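
A hedged sketch of that "call only when needed" shape, with assumed table, column, and predicate names (not the actual toolkit code):

```kusto
// Enrich only the rows missing resource details; pass the rest through.
let raw = Costs_raw;                                    // assumed source
union
    (raw | where isnotempty(x_ResourceType)),           // already populated
    (raw
     | where isempty(x_ResourceType)
     | extend Parsed = parse_resourceid(ResourceId))    // enrich subset only
```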

Performance

  • Query time: 00:06.821 (6.821 seconds)
  • Total CPU time: 00:05.609 (5.609 seconds)
  • Data scanned (estimated): 212.4 MB (222,742,533 bytes)
  • Peak memory: 1.1 GB (1,171,552,608 bytes)

Result

  • Data size: 70.1 MB (73,521,723 bytes)
  • Rows count: 41,292

@flanakin (Collaborator)

Okay, adding on to the last 2 changes:

  1. Remove the ResourceTypes table join from the parse_resourceid() function
  2. Only call the parse_resourceid() function when needed (and not for every row)
  3. Create a new resource_type() function that does an inline mapping for resource type display names without doing a join (see the sketch at the end of this comment)
  4. Replace the parse_resourceid() call with resource_type() only when needed
  5. Fix a bug where x_ResourceType was being set incorrectly in 0.7

Query time is a tad faster, but the memory hasn't changed since the last test. This is likely because the lookup isn't happening much. I'll submit a few PRs covering these changes.

Performance

  • Query time: 00:06.421 (6.421 seconds)
  • Total CPU time: 00:05.281 (5.281 seconds)
  • Data scanned (estimated): 212.5 MB (222,831,681 bytes)
  • Peak memory: 1.1 GB (1,177,547,424 bytes)

Result

  • Data size: 70.1 MB (73,460,816 bytes)
  • Rows count: 40,860
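
A hedged sketch of what an inline mapping like the new resource_type() function could look like; the mappings shown are illustrative, not the toolkit's actual list:

```kusto
// case() maps resource types to display names inline, avoiding a join
// against a ResourceTypes table. The mappings here are examples only.
let resource_type_sketch = (resourceType: string) {
    case(
        resourceType =~ 'microsoft.compute/virtualmachines', 'Virtual machine',
        resourceType =~ 'microsoft.storage/storageaccounts', 'Storage account',
        resourceType   // fall back to the raw value
    )
};
print resource_type_sketch('microsoft.compute/virtualmachines')
```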

@flanakin (Collaborator)

@mschwit Can you submit a separate issue for the problem you're running into? I have 3 optimizations that I want to look into after the ingestion memory problem is resolved. Having a separate issue with your specific details will help ensure that doesn't get missed.

@flanakin flanakin assigned flanakin and unassigned MSBrett Dec 15, 2024
@flanakin flanakin linked a pull request Dec 16, 2024 that will close this issue
@mschwit commented Dec 16, 2024

Dear @flanakin, I created an additional issue at #1220
