Overview

The following describes an issue I've been dealing with for some time, which I really only began to encounter after upgrading to Veeam Backup & Replication v12.

My current organization manages thousands of Horizon desktops spread across many datacenters around the globe. Each site is managed by at least one VMware vCenter server, which in turn houses multiple Horizon desktop pods. We use Veeam Backup & Replication v12 to perform VM-level backups of critical virtual infrastructure, including desktop image templates and full-clone desktops. Occasionally these jobs report a failure because one or more VMs fail to back up with the following errors:

4:07:17 AM Failed to create VM snapshot. Error: CreateSnapshot failed, vmRef vm-224105, timeout 1800000, snName VEEAM BACKUP TEMPORARY SNAPSHOT, snDescription Please do not delete this snapshot. It is being used by Veeam Backup., memory False, quiesce False 00:02
4:07:29 AM Error: An error occurred while saving the snapshot: A digest operation has failed. (An error occurred while saving the snapshot: A digest operation has failed., An error occurred while taking a snapshot: A digest operation has failed.)

4:07:29 AM Session with ID "6423ea62-3887-4aa6-85b6-c03b8ce41d45" is not started yet.

4:07:31 AM Processing finished with errors at 8/12/2023 4:07:31 AM
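If you have a lot of job logs to wade through, a few lines of Python can tell you which VMs hit this error. This is only a rough sketch: it assumes you have the session log as plain text, the sample lines are copied (and shortened) from the log above, and the function name summarize_digest_failures is my own invention, not anything Veeam ships.

```python
import re

# Sample lines copied from the job log above (first line shortened).
SAMPLE_LOG = """\
4:07:17 AM Failed to create VM snapshot. Error: CreateSnapshot failed, vmRef vm-224105, timeout 1800000, snName VEEAM BACKUP TEMPORARY SNAPSHOT
4:07:29 AM Error: An error occurred while saving the snapshot: A digest operation has failed.
4:07:31 AM Processing finished with errors at 8/12/2023 4:07:31 AM
"""

VMREF_RE = re.compile(r"vmRef (vm-\d+)")
DIGEST_MSG = "A digest operation has failed"


def summarize_digest_failures(log_text):
    """Return (sorted unique vmRefs, number of lines mentioning the digest error)."""
    vm_refs = sorted(set(VMREF_RE.findall(log_text)))
    digest_lines = sum(1 for line in log_text.splitlines() if DIGEST_MSG in line)
    return vm_refs, digest_lines


vm_refs, digest_lines = summarize_digest_failures(SAMPLE_LOG)
print(vm_refs, digest_lines)
```

The vmRef value is the vCenter managed object ID, so you can use it to look up which desktop is affected.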

The jobs retry the failed VMs three times, waiting 5 minutes between attempts, and on occasion complete successfully. Otherwise, the job continues to fail on those VMs, and that's that. The intermittent nature of this problem is ultimately what's been bugging me, as it's clearly not something as clear-cut as "this VM is corrupt" or "there's something wrong with the job settings." After googling, opening tickets, and researching the problem, I believe I've run into something whose solution is, at the very least, not easy to find. It may also be unique to our environment, or at least to organizations that manage multiple VMware Horizon desktop pods within a single vCenter environment.

Important things to note:

This issue appears to be specific to hosting multiple Horizon desktop pods within the same vCenter. Or, it could be a single pod with multiple desktop pools, but my environment is obviously much larger than just that.
VMware tends to frown upon hosting multiple Horizon pods within the same vCenter. Please review the following KB article for more details: https://kb.vmware.com/s/article/80673
Any changes recommended here may not be supported or recommended by your vendor and/or organization, so please don't blame me if you followed this guide and something bad happened. You can, however, blame me if something good happens.
Software versions in use at the time of writing:

Veeam Backup & Replication 12.0.0.1420 P20230223
vCenter 7.0u3h 20395099
ESXi 7.0u3n 21930508
Horizon 8.4.0 build 19067837

Symptoms and Troubleshooting:

As stated above, Veeam backups fail with a repetitive error about digest operations failing. If you google this error, you may come across this KB article:

[VMC on AWS] Unable to snapshot a VM, with error: A digest operation has failed. (82425)

The mention of CBRC (Content-Based Read Cache) is the key here. However, the KB article focuses on migration from an AWS SDDC, so you may think, "Yeah, this doesn't apply to me." That's what I did, and I was only halfway correct.

At this point, the next logical step is to take a snapshot of the VM manually to see whether that basic functionality works. In my case, it worked fine. And, as mentioned previously, the issue is intermittent, which by definition implies that sometimes snapshots work.

So what is one to do? Open a ticket with Veeam, of course! Well, I did that too, and their response was essentially that Veeam only uses vSphere API (VIM API) calls to create snapshots, and since the issue is intermittent, I should open a ticket with VMware. Sigh.

If at this point you now open a ticket with VMware, you may very well find that they point back to the above KB82425, and tell you to go through all sorts of headaches powering down VMs and deleting digest files and what-have-you. Yeah, I said, "Nope" too.

So now we're back to square one. Well, do you remember how I said this issue appears to be specific to hosting multiple Horizon pods within the same vCenter? And that CBRC is actually the key? If you're here reading this article, you may very well suffer from this specific and chronic issue:

Frequent and constant "Update option values" vCenter tasks. It may look a little something like this:

Seemingly innocuous, these little boogers are actually the root of the problem. If you search the webs for this condition, you should easily find the following KB article:

Horizon environment trigger "Update option values" tasks every 5 minutes in vCenter Web Client Recent Tasks (56982)

The key here is that you have one or more Horizon connection servers, pointing to vCenter servers, that have one or more desktop pools with varying "Storage Accelerator" settings applied. As the above KB article leaves a lot to be desired in the form of screenshots and context, here's a summary:

Each Connection server has a vCenter connection. Within that vCenter connection configuration, you should see a Storage tab. This tab allows you to enable or disable the View Storage Accelerator feature:

As described on the following VMware page:

https://docs.vmware.com/en/VMware-Horizon/2209/virtual-desktops/GUID-B4A4A8BA-FCB5-4311-A53F-4C5D917DD2DB.html

CBRC uses ESXi host memory to cache virtual machine disk data, reduce IOPS, and improve performance during boot storms, when many machines start up or run anti-virus scans at once. By reducing the number of IOPS during boot storms, View Storage Accelerator lowers the demand on the storage array, which lets you use less storage to support your Horizon 8 deployment. The feature is also beneficial when administrators or users load applications or data frequently.

Sounds like a great feature, right? Now, there's also a second place where this can be enabled and disabled: the desktop pool. If you edit an existing desktop pool, you should see something that looks a little like this:

The above pool has View Storage Accelerator disabled, but it's enabled at the vCenter level. If you're like us, you probably have more than one desktop pool configured. If these pools and vCenter connections vary in whether View Storage Accelerator is enabled, then BAM: you have the above "Update option values" flood in vCenter. Well, what difference does that make? Glad you asked.
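If you want to hunt down the odd pools out, the check itself is simple once you have the settings written down somewhere. Here's a minimal sketch, assuming you've collected the View Storage Accelerator flags by hand or from an export. The vCenter names, pool names, and the find_mixed_vsa_settings function are all made up for illustration; this does not talk to Horizon at all.

```python
def find_mixed_vsa_settings(vcenter_vsa, pools):
    """Report pools whose VSA flag differs from their vCenter connection's flag.

    vcenter_vsa: {vcenter_name: bool}, the flag on each vCenter connection
    pools: list of (pool_name, vcenter_name, bool) tuples
    Returns {vcenter_name: [pool_name, ...]} for every mismatch found.
    """
    mixed = {}
    for pool_name, vcenter_name, pool_flag in pools:
        if pool_flag != vcenter_vsa[vcenter_name]:
            mixed.setdefault(vcenter_name, []).append(pool_name)
    return mixed


# Hypothetical inventory, typed in by hand for illustration.
vcenter_vsa = {"vc-site-a": True, "vc-site-b": True}
pools = [
    ("pool-finance", "vc-site-a", True),
    ("pool-kiosk", "vc-site-a", False),  # the odd one out
    ("pool-dev", "vc-site-b", True),
]

print(find_mixed_vsa_settings(vcenter_vsa, pools))
```

An empty result means every pool agrees with its vCenter connection. Anything else is a candidate for the task flood described here.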

As the above VMware KB articles describe, CBRC is an ESXi host-level setting. If, for example, you create your first desktop pool and you enable this View Storage Accelerator, the connection server will go through and enable the following Advanced System Settings:

(Note that 1024 is the default, and it is the size specified in the vCenter connection setting.)

Now, if you create a second desktop pool and, say, forget to enable View Storage Accelerator (VSA), the connection server says to itself, "Hey, we need CBRC disabled," and it goes and disables it on every host in vCenter with, you guessed it, an "Update option values" task. Now CBRC is disabled. But wait, you need it enabled for your first pool! "I better go toggle that back on again," says the connection server with much religious fervor. And again you have "Update option values" tasks, toggling that setting. Over and over again. Until your task logs are full and your heart burns with frustration.
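To make the tug-of-war concrete, here's a toy model of it. This is purely my mental picture of the behavior described above, not how the connection server is actually implemented: each sync cycle, every pool's desired CBRC state gets pushed to the host in turn, and every change counts as one "Update option values" task. The pool names and the simulate_sync_cycles function are hypothetical.

```python
def simulate_sync_cycles(pool_wants_cbrc, cycles):
    """Toy model: return the list of host setting changes over N sync cycles.

    pool_wants_cbrc: {pool_name: bool}, True if the pool has VSA enabled
    Each entry in the result is (cycle, pool_name, new_host_state).
    """
    host_cbrc_enabled = False  # assume the host starts with CBRC off
    tasks = []
    for cycle in range(cycles):
        for pool_name, wants_cbrc in pool_wants_cbrc.items():
            if host_cbrc_enabled != wants_cbrc:
                host_cbrc_enabled = wants_cbrc
                tasks.append((cycle, pool_name, wants_cbrc))
    return tasks


mixed = simulate_sync_cycles({"pool-a": True, "pool-b": False}, cycles=3)
uniform = simulate_sync_cycles({"pool-a": True, "pool-b": True}, cycles=3)
print(len(mixed), len(uniform))  # prints "6 1"
```

With mixed settings the host flips twice per cycle, forever. With uniform settings there's one change to get in sync and then silence.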

What does this have to do with Veeam backups, you ask? Or have you begun to see the connection?

If you haven't guessed it yet, here's the problem: as the ESXi hosts have their CBRC settings constantly toggled off and on again like an episode of The IT Crowd gone amok, the VMs get mixed signals about whether they should be using CBRC. When Veeam tries to back up the VM, the snapshot fails because CBRC was enabled but now isn't. When it tries again in 5 minutes, maybe the setting has been toggled back, and maybe it works. It's a crapshoot at this point.

THE SOLUTION:

VMware KB56982 is actually the answer. You need to do one of the following:

Standardize View Storage Accelerator on all vCenter connections, whether enabled or disabled
Standardize View Storage Accelerator on all Desktop Pools, whether enabled or disabled (this may be a real pain if you have hundreds of pools)
Skip the View Storage Accelerator disabling functionality

Ultimately, I lean towards the last option, as it's difficult (or nearly impossible) to maintain standardization across an incredibly complex environment. But that's just my choice. Following the steps outlined in the above KB to modify the ADAM database tells the connection servers to stop disabling CBRC on the hosts. You'll have to do this on every connection server (or connection server pair, if you run them in an HA pair) for the change to fully take effect.

Once you've done this, you may see some tasks fly by as they finally synchronize and toggle all the servers back to "enabled," but then it should stop. And once it's stopped, you may need to retry your VM backups a couple times, or else trigger an Active Full backup of the offending VM(s). But at this point, you should be good.

Hope this helps!

Randomness of Gabe 2.0

Veeam Backups of Horizon Desktop VMs intermittently failing with "A digest operation has failed."
