
Ingest-OptimizationCSVExportsToLogAnalytics Runbook with "consumptionexports" parameter failing consistently. #998

Closed
davemilnercapg opened this issue Sep 20, 2024 · 29 comments · Fixed by #1048

@davemilnercapg

🐛 Problem

During the daily scheduled run of the Ingest-OptimizationCSVExportsToLogAnalytics runbook, the runbook completes successfully for many other parameters, but always fails on the "consumptionexports" parameter task.

This results in all of the recommendations having blank Cost components.

👣 Repro steps

Deploy Azure Optimization Engine to an Azure subscription and set it up to monitor many subscriptions; over time, the runbook fails.

🤔 Expected

Ingest-OptimizationCSVExportsToLogAnalytics completes successfully for all parameters and storage buckets.

📷 Screenshots

[screenshots]

ℹ️ Additional context

It seems to be failing while processing different blobs, not always the same one. So could this be an out-of-memory issue?

🙋‍♀️ Ask for the community

We could use your help:

  1. Please vote this issue up (👍) to prioritize it.
  2. Leave comments to help us solidify the issue and add context or similar experiences.
@helderpinto
Member

@davemilnercapg, thanks for reporting this issue. Can you please share what the "Exception" tab says for these failing jobs?

[screenshot]

@davemilnercapg
Author

The "Exception" tab shows:

Thread failed to start. (Thread failed to start. (Exception of type 'System.OutOfMemoryException' was thrown.))

@helderpinto
Member

helderpinto commented Sep 20, 2024

It's unusual to see such errors in this runbook. Are you using the latest version of AOE? Can you share the latest message written in the job output, and whether you see the same last message in every job failure?

@davemilnercapg
Author

We are using the latest version of AOE; it was upgraded recently. However, the issue with this runbook started before the upgrade and persisted after it.

The latest message written in the job output varies, depending on the last record processed prior to experiencing the system out of memory error.

Here are 3 examples of the last message written on 3 separate runs:

[screenshot]

@helderpinto
Member

I see... There was probably some other issue a few months ago, and AOE is now lagging behind, trying to process very old CSVs. As you can see from the dates, it is trying to process blobs from June! To work around this problem, please identify the date/time of the last blob that was successfully processed. You can normally find it in the 5th line of the job output, something like this:

Processing blobs modified after 2024-09-18T22:26:51.000Z

Then open the AOE Storage Account, use the Storage Browser, navigate to the consumptionexports container, sort the rows by Last modified, and finally delete all the blobs from that date onwards, up to the earliest date you need for your historical cost analysis.

After that, test the runbook again by starting it and passing consumptionexports as input. You should see signs of recovery. If this solves the problem and you want to backfill some past consumption data, I can explain how to achieve it.
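For reference, a minimal sketch of that backlog cleanup with Az.Storage (the resource group, storage account name, and dates below are placeholders for your deployment's values):

$ctx = (Get-AzStorageAccount -ResourceGroupName "rg-aoe" -Name "aoestorage").Context

# Last successfully processed blob (from the job output) and the earliest
# date still needed for historical cost analysis
$lastProcessed = [datetime]"2024-09-18T22:26:51Z"
$keepFrom = [datetime]"2024-09-01T00:00:00Z"

# Delete the backlog: blobs newer than the last processed one but older than
# the date to keep; drop -WhatIf once the listed blobs look right
Get-AzStorageBlob -Container "consumptionexports" -Context $ctx |
    Where-Object { $_.LastModified -gt $lastProcessed -and $_.LastModified -lt $keepFrom } |
    Remove-AzStorageBlob -WhatIf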

@helderpinto
Member

[screenshot]

@davemilnercapg
Author

Thx, Helder.

I started out with about 27k objects in that container. After purging 10k down to 17k objects, I ran it again, and it ran out of memory.

Purge 2: down to 12k objects. Currently running.

For the long term, how should we manage the growth in these storage accounts? Is there a scheduled cleanup task in AOE, or do we need to write one? Can we force garbage collection in the runbooks so they don't run out of memory? Or should we manage large blob stores another way?

@davemilnercapg
Author

These Azure worker agents are allocated 400 MB of memory.

@helderpinto
Member

27k objects is really a huge number for the consumption exports. Can you confirm the AOE storage account has the Clean6MonthsOldBlobs lifecycle management rule?

If it doesn't, please create one such rule and configure it to delete blobs at least 30 days after being modified.

If it does, it means you have a large number of subscriptions in your environment and you should maybe export consumption data at the EA/MCA level (which generates one large blob per day), instead of doing it per subscription (potentially hundreds of blobs per day). Can you confirm the number of subscriptions monitored by AOE and whether you are under an EA or an MCA?

[screenshot]
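For reference, a minimal sketch of creating such a rule with Az.Storage (the resource names and the 30-day window below are placeholders):

# Delete base blobs 30 days after their last modification
$action = Add-AzStorageAccountManagementPolicyAction -BaseBlobAction Delete -DaysAfterModificationGreaterThan 30

# Scope the rule to the consumption exports container
$filter = New-AzStorageAccountManagementPolicyFilter -BlobType blockBlob -PrefixMatch "consumptionexports/"

$rule = New-AzStorageAccountManagementPolicyRule -Name "Clean6MonthsOldBlobs" -Action $action -Filter $filter
Set-AzStorageAccountManagementPolicy -ResourceGroupName "rg-aoe" -StorageAccountName "aoestorage" -Rule $rule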

@davemilnercapg
Author

[screenshot]

The lifecycle management rule is in place and enabled.

AOE is monitoring a large number of subscriptions. With the CAF architecture design and the subscription democratization approach, the number of subscriptions is large and will keep increasing.

Yes we are under an EA.

Do you have a recommended approach or artifacts to help export consumption data at the EA/MCA level to generate one large blob per day? Is this a configuration or customization of AOE?

@helderpinto
Member

helderpinto commented Sep 23, 2024

To change the consumption export scope from subscription to EA billing account, you must do the following:

  1. Create, in the AOE Automation Account, an AzureOptimization_ConsumptionScope variable set to BillingAccount
  2. Ensure you have an AzureOptimization_BillingAccountID variable set to your EA billing account ID
  3. Ensure you already granted the Enterprise Enrollment Reader role to the AOE Automation managed identity (see docs)

I guess that 2. and 3. are already done, because you said earlier that the other runbooks were running without issues. If all the Reservations and Savings Plans workbooks are loading correctly, then for sure 2. and 3. are done.

After you complete the steps above, you should get, in the next job, a single consumption export file for the whole EA.
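For step 1, a minimal sketch with Az.Automation (the resource group and Automation Account names below are placeholders):

# Creates the variable that switches the consumption export scope
New-AzAutomationVariable -ResourceGroupName "rg-aoe" `
    -AutomationAccountName "aoe-automation" `
    -Name "AzureOptimization_ConsumptionScope" `
    -Value "BillingAccount" -Encrypted $false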

I hope this helps!

@davemilnercapg
Author

Thx Helder!

We set #1 and confirmed the other prerequisite setup. Now running to obtain the larger single consumption export file. We also anticipate an out-of-memory exception and are in the process of migrating this to a hybrid worker agent. That should get us covered - will update with results.

@davemilnercapg
Author

OK, results after those modified settings - we are getting the following error:

Exception calling "GetBytes" with "1" argument(s): "Array cannot be null. Parameter name: chars" (Exception calling "GetBytes" with "1" argument(s): "Array cannot be null. Parameter name: chars" (Array cannot be null. Parameter name: chars))

The exception seems to be happening on line 257 due to a null input.

The last output of the logs are:

Found 7074 new blobs to process...

About to process 2024-08-07-e904450b-efb3-4723-add4-48176d3c50eb-1.csv...

Items:

  1. I do not see any use of the variables AzureOptimization_ConsumptionScope or AzureOptimizationBillingAccountId in the code for the Ingest-OptimizationCSVExportsToLogAnalytics runbook. Am I missing something?

Please let me know if you have any insights.

@helderpinto
Member

helderpinto commented Sep 23, 2024

The changes you made influence the outcome of the Export-ConsumptionToBlobStorage runbook. This runbook now uploads a single consumption CSV to the consumptionexports storage container. Later, the Ingest-OptimizationCSVExportsToLogAnalytics runbook collects this CSV blob's content and ingests it into Log Analytics.

The log messages you are reporting are a symptom that old CSVs are still being processed. The 2024-08-07-e904450b-efb3-4723-add4-48176d3c50eb-1.csv blob might have corrupted content that is blocking the runbook from moving forward and processing the next blobs. Can you please download it and confirm whether it shows signs of corruption? If you delete that blob, the runbook will continue processing.

@davemilnercapg
Author

Thx. I am seeing a new 300 MB file that represents the billing-account-level consumption.

What I am seeing with the old CSVs is that a number of them are 0 bytes in size. This is what produces the error on line 257 with GetBytes.

[screenshot]

I'm trying to delete all the 0-size CSVs in the portal, but without much success.

I'm not sure what is causing the zero-size consumption files - would this be "corruption" of some kind?

I will probably need to insert an empty-blob check into that script to continue processing from where I am.
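For reference, a minimal sketch of purging the 0-size blobs with Az.Storage instead of the portal (resource names are placeholders):

$ctx = (Get-AzStorageAccount -ResourceGroupName "rg-aoe" -Name "aoestorage").Context

# List the zero-length blobs and delete them; drop -WhatIf when satisfied
Get-AzStorageBlob -Container "consumptionexports" -Context $ctx |
    Where-Object { $_.Length -eq 0 } |
    Remove-AzStorageBlob -WhatIf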

@helderpinto
Member

helderpinto commented Sep 24, 2024

Thanks for the additional details. The version of the Ingest-OptimizationCSVExportsToLogAnalytics runbook in your AOE deployment seems not to be prepared to deal with 0-byte blobs, which, in the case of consumption exports, come from subscriptions that do not have any associated Azure costs. Can you confirm the version you have is the one here? The latest version deals well with empty blobs.

As an additional check, can you add the following instruction at line 220 of the runbook?

$unprocessedBlobs = $unprocessedBlobs | Where-Object { $_.Length -gt 0 }

@davemilnercapg
Author

I updated the version of that runbook, thanks. I also added the additional check at line 220. I have also approached this from a different angle and created a new maintenance runbook, CleanUp-ZeroLengthBlobs, which attacks the problem from that side. That also seems to be working...

@davemilnercapg
Author

Latest update: after the addition at line 220 and the script update, the script chugged through all of the older outstanding small files successfully and brought the last-processed date up to 9/22/2024.

After setting the consumption scope to BillingAccount, the nightly export produced a 135 MB CSV file with 100,000 lines.

This failed on the nightly processing for that file - logs:

Processing blobs modified after 2024-09-23T13:05:57.000Z (line 83999) and ingesting them into the AzureOptimizationConsumptionV1_CL table...

2024-09-22-71446200-AmortizedCost-1-final.csv found (modified on 2024-09-25T12:17:49.000Z)

Found 1 new blobs to process...

About to process 2024-09-22-71446200-AmortizedCost-1-final.csv...

From there I get an error message:

Exception calling "GetBytes" with "1" argument(s): "Array cannot be null. Parameter name: chars" (Exception calling "GetBytes" with "1" argument(s): "Array cannot be null. Parameter name: chars" (Array cannot be null. Parameter name: chars))

So what it looks like to me is that the addition at line 220 filters out all the small zero-byte files, but the zero-value lines within the bigger files are still causing problems. Troubleshooting...

@helderpinto
Member

Seemingly, log ingestion failed between line 83999 and line 89999. Can you find anything odd in those lines? Additionally, the only reason why GetBytes (a few lines below) would fail is being passed a $null $jsonObject - and I can't find a good reason for it to become $null. The preceding ConvertTo-Json cmdlet doesn't produce $null, even if the input object is itself $null.
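For illustration, a hedged sketch of the kind of guard being discussed; $jsonObject follows the runbook's naming, but the surrounding chunking loop and $csvObjectSlice are assumptions:

# Convert the current slice of CSV rows to JSON for ingestion
$jsonObject = $csvObjectSlice | ConvertTo-Json -Depth 3

if ([string]::IsNullOrEmpty($jsonObject))
{
    # Nothing to ingest for this slice - skip instead of calling GetBytes on $null
    Write-Warning "Skipping empty chunk."
}
else
{
    $body = [System.Text.Encoding]::UTF8.GetBytes($jsonObject)
    # ... send $body to the Log Analytics ingestion endpoint ...
}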

@davemilnercapg
Author

I am not seeing anything odd in those lines. I am seeing $jsonObject appearing as $null, though. Still troubleshooting...

@davemilnercapg
Author

OK, I made the following mods and ran it:

[screenshot]

I had to disable the Write-Output and Write-Warning calls, as they were running it out of memory. However, the logs showed this:

[screenshot]

This looks like it worked successfully. Please let me know if it looks correct...

@helderpinto
Member

It seems you didn't change the essentials of the algorithm; you just added a couple more checks to ensure a correct $jsonObject reaches the GetBytes method. I am still curious about what made this happen - you're the first customer in years reporting this type of issue. Anyway, I am thinking of adding the same checks to the code. Thanks for sharing the workaround!

Now comes the moment of truth :-) All this effort is relevant only if the consumption-related workbooks load correctly. Can you confirm?

@davemilnercapg
Author

Yes, I am seeing output from all of the consumption-related workbooks:

[screenshots]

I see most of the report data starting about a month ago. We may be missing earlier data because some of the blobs piled up and we deleted the ones from older dates.

@helderpinto
Member

It is looking good now! If you want to backfill consumption data for older dates, you just have to trigger the Export-ConsumptionToBlobStorage runbook and specify the TargetStartDate and TargetEndDate parameters as needed. For example, to export consumption for Aug 11th, both parameters should be set to 2024-08-11. To export consumption between Aug 1st and Aug 11th (inclusive), set TargetStartDate=2024-08-01 and TargetEndDate=2024-08-11. On the next day, you should see all this historical data in the workbooks.

Be careful, however, with the amount of data to export in each job run. You said earlier that a single day generates 135 MB of data, so it is probably better to export no more than 3-4 consecutive days per job. Also, if you want more than 30 days of historical data, check the Log Analytics workspace retention, which is 30 days by default (free retention).
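For reference, a minimal sketch of triggering such a backfill with Az.Automation (resource names are placeholders; the parameter names come from the runbook):

# Exports consumption for Aug 1st-4th (inclusive) in a single job
Start-AzAutomationRunbook -ResourceGroupName "rg-aoe" `
    -AutomationAccountName "aoe-automation" `
    -Name "Export-ConsumptionToBlobStorage" `
    -Parameters @{ TargetStartDate = "2024-08-01"; TargetEndDate = "2024-08-04" }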

@helderpinto
Member

@davemilnercapg, can you confirm whether the issue is definitely resolved? Thanks.

@davemilnercapg
Author

@helderpinto - I can confirm that, with the changes I made and posted above, I have seen the script run successfully multiple days in a row. The run from last night processed a 117k-row single file successfully, without errors. Without those changes, as the script currently stands, it will fail consistently with null references sent to GetBytes.

IMO the root cause of this, both in EA mode (one large file) and with 27k smaller files, is that many subscriptions have no current consumption values being output, so they result in either a file or a line being created with zero consumption values. These are not filtered out, so they cause either a null-reference error or, for larger environments running at the billing account level with EA-scale data coming in, an out-of-memory error.

No, it is not resolved without the changes I specified; with them, it is. To resolve this properly, incorporate those changes and test in a medium-to-large subscription environment.

@davemilnercapg
Author

The specific types of small subscriptions I see producing zero-value exports are MSDN subscriptions, VSPE subscriptions (which are likely present in all customer tenants), and some other small ones.

@helderpinto
Member

Thanks for the feedback, @davemilnercapg. Let's keep this bug open; it will be closed once the suggested changes are incorporated. On a side note, MSDN subscriptions and the like do not appear in the single, EA-level file, because they are not part of the agreement. Therefore, the zero-byte lines must have a different cause. Nevertheless, let's simply discard those rare situations in the runbook code.

@helderpinto
Member

Hi, @davemilnercapg

There is a PR (#1048) that will fix this issue. If you want to try out the fix before the release, you just have to make sure you are running AOE on the latest version (September release) and then update the Ingest-OptimizationCSVExportsToLogAnalytics runbook with this code.

Please make sure you back up your current code, so that you're able to roll back in case the proposed fix isn't effective.

If you are not on the latest AOE release, check here how to upgrade.
