Veeam Backups of Horizon Desktop VMs intermittently failing with "A digest operation has failed."
Overview
The following describes an issue I've been dealing with for some time, which I really only began to encounter after upgrading to Veeam Backup & Replication v12.
My current organization manages thousands of Horizon desktops spread across many datacenters around the globe. Each site is managed by at least one VMware vCenter server, which in turn houses multiple Horizon desktop pods. We use Veeam Backup & Replication v12 to perform VM-level backups of critical virtual infrastructure, including desktop image templates and full-clone desktops. Occasionally, these jobs report a failure because one or more VMs fail to back up with the following errors:
4:07:17 AM Failed to create VM snapshot. Error: CreateSnapshot failed, vmRef vm-224105, timeout 1800000, snName VEEAM BACKUP TEMPORARY SNAPSHOT, snDescription Please do not delete this snapshot. It is being used by Veeam Backup., memory False, quiesce False 00:02
4:07:29 AM Error: An error occurred while saving the snapshot: A digest operation has failed. (An error occurred while saving the snapshot: A digest operation has failed., An error occurred while taking a snapshot: A digest operation has failed.)
4:07:29 AM Session with ID "6423ea62-3887-4aa6-85b6-c03b8ce41d45" is not started yet.
4:07:31 AM Processing finished with errors at 8/12/2023 4:07:31 AM
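If you have a pile of these session logs and want to know which VMs are hitting the digest error specifically, a quick script can do the sorting. The following is only a rough sketch: it assumes the log has been exported as plain text with one entry per line, and it assumes a digest error belongs to the most recent `vmRef` seen in a "CreateSnapshot failed" line. The function name and the pairing heuristic are mine, not anything Veeam provides.

```python
import re

# Matches the vmRef in the "CreateSnapshot failed" line, as seen in the log above.
VM_REF = re.compile(r"CreateSnapshot failed, vmRef (vm-\d+)")
DIGEST_ERROR = "A digest operation has failed"


def find_digest_failures(lines):
    """Return vmRefs whose snapshot failure is followed by a digest error."""
    failures = []
    last_vm_ref = None
    for line in lines:
        match = VM_REF.search(line)
        if match:
            last_vm_ref = match.group(1)
        # Assumption: a digest error refers to the most recent vmRef seen.
        if DIGEST_ERROR in line and last_vm_ref and last_vm_ref not in failures:
            failures.append(last_vm_ref)
    return failures


# Abbreviated copies of the two log lines shown above.
sample_log = [
    "4:07:17 AM Failed to create VM snapshot. Error: CreateSnapshot failed, vmRef vm-224105, timeout 1800000",
    "4:07:29 AM Error: An error occurred while saving the snapshot: A digest operation has failed.",
]

print(find_digest_failures(sample_log))  # ['vm-224105']
```

The resulting `vm-...` references can then be looked up in vCenter to see which pools and hosts the affected desktops live on.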
The jobs retry the failed VMs three times, waiting 5 minutes between attempts, and on occasion complete successfully. Otherwise, the job continues to fail on those VMs, and that's that. The intermittent nature of this problem is ultimately what's been bugging me, as it's clearly not something as clear-cut as "this VM is corrupt" or "there's something wrong with the job settings." After googling, opening tickets, and researching the problem, I believe I've encountered something whose solution is, at the very least, not easy to find. It may also be unique to our environment, or at least to organizations that manage multiple VMware Horizon desktop pods within a single vCenter environment.
Important things to note:
- This issue appears to be specific to hosting multiple Horizon desktop pods within the same vCenter. Or, it could be a single pod with multiple desktop pools, but my environment is obviously much larger than just that.
- VMware tends to frown upon hosting multiple Horizon pods within the same vCenter. Please review the following KB article for more details: https://kb.vmware.com/s/article/80673
- Any changes recommended here may not be supported or recommended by your vendor and/or organization, so please don't blame me if you followed this guide and something bad happened. You can, however, blame me if something good happens.
- Software versions in use at the time of writing:
- Veeam Backup & Replication 12.0.0.1420 P20230223
- vCenter 7.0u3h 20395099
- ESXi 7.0u3n 21930508
- Horizon 8.4.0 build 19067837
Symptoms and Troubleshooting:
Seemingly innocuous, these little boogers (the "update option values" tasks flooding the vCenter task list) are actually the root of the problem. If you search the webs for this condition, you should easily find the following KB article:
CBRC uses ESXi host memory to cache virtual machine disk data, reduce IOPS, and improve performance during boot storms, when many machines start up or run anti-virus scans at once. By reducing the number of IOPS during boot storms, View Storage Accelerator lowers the demand on the storage array, which lets you use less storage to support your Horizon 8 deployment. The feature is also beneficial when administrators or users load applications or data frequently.
Sounds like a great feature, right? Now, there's also a second place where this can be enabled and disabled: the desktop pool. If you edit an existing desktop pool, you should see something that looks a little like this:
The above pool has View Storage Accelerator disabled, but it's enabled at the vCenter level. If you're like us, you probably have more than one desktop pool configured. If any of these pools and vCenter connections vary between having View Storage Accelerator enabled and disabled, then BAM: you have the above "update option values" flood in vCenter. Well, what difference does that make? Glad you asked.
As the above VMware KB articles describe, CBRC is an ESXi host-level setting. If, for example, you create your first desktop pool and enable View Storage Accelerator, the connection server will go through and enable the following Advanced System Settings:
Now, if you create a second desktop pool and, say, forget to enable View Storage Accelerator (VSA), the connection server says to itself, "Hey, we need CBRC disabled," and it goes and disables it on every host in vCenter with, you guessed it, an "update option values" task. Now CBRC is disabled. But wait, you need it enabled for your first pool! "I better go toggle that back on again," says the connection server with much religious fervor. And again, you have "update option values" tasks, this time re-enabling that setting. Over and over again. Until your task logs are full and your heart burns with frustration.
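To make the ping-pong concrete, here's a toy model of it. This is not how the connection server is actually implemented; it simply assumes that each pool, in turn, forces the host-level CBRC setting to match its own VSA setting, and counts how many times the setting has to change. The function and parameter names are hypothetical.

```python
def count_toggle_tasks(pool_settings, initial_state, passes):
    """Count host setting changes when each pool in turn forces its own value.

    pool_settings: list of booleans, True if the pool has VSA enabled.
    initial_state: whether CBRC starts out enabled on the host.
    passes: how many times the (assumed) reconciliation loop visits every pool.
    """
    state = initial_state
    tasks = 0
    for _ in range(passes):
        for wants_cbrc in pool_settings:
            if state != wants_cbrc:
                # Each change stands in for one "update option values" task.
                state = wants_cbrc
                tasks += 1
    return tasks


# One pool with VSA enabled, one without: every pass flips the setting twice.
print(count_toggle_tasks([True, False], initial_state=False, passes=3))  # 6

# Both pools agree: one change to reach the desired state, then nothing.
print(count_toggle_tasks([True, True], initial_state=False, passes=3))   # 1
```

Under this model, mixed pools generate tasks forever, while consistent pools settle after at most one change, which matches the flood-versus-quiet behavior described above.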
What does this have to do with Veeam backups, you ask? Or have you begun to see the connection?
If you haven't guessed it yet, here's the problem: as the ESXi hosts are having their CBRC settings constantly toggled off and on again like an episode of The IT Crowd gone amok, the VMs are getting mixed signals on whether or not they should be using CBRC. When Veeam tries to back up the VM, the snapshot fails because CBRC was enabled, but now it's not. When it tries again in 5 minutes, maybe the setting was toggled back, and maybe it works? It's a crapshoot at this point.
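Here's a small sketch of why the retries feel like a coin flip. It's a deliberately simplified model with made-up numbers: it assumes the host's CBRC setting flips at a fixed interval (in reality the timing is irregular), that a snapshot only succeeds while CBRC is enabled, and that Veeam makes one initial attempt plus three retries spaced 5 minutes apart. All names and the toggle interval are hypothetical.

```python
def host_cbrc_enabled(minute, toggle_interval, starts_enabled=True):
    """Return the (modeled) CBRC state at a given minute."""
    flips = minute // toggle_interval
    return starts_enabled if flips % 2 == 0 else not starts_enabled


def first_successful_attempt(start_minute, toggle_interval, retries=3, wait=5):
    """Return the index of the first attempt that succeeds, or None.

    Attempt 0 is the initial run; attempts 1..retries are the job retries.
    Assumption: a snapshot succeeds only while CBRC is enabled on the host.
    """
    for attempt in range(retries + 1):
        minute = start_minute + attempt * wait
        if host_cbrc_enabled(minute, toggle_interval):
            return attempt
    return None


# Setting flips every 7 minutes: the job fails twice, then a retry gets lucky.
print(first_successful_attempt(start_minute=8, toggle_interval=7))    # 2

# Setting flips every 40 minutes: every attempt lands in a disabled window.
print(first_successful_attempt(start_minute=45, toggle_interval=40))  # None
```

Same VM, same job settings, and the outcome depends entirely on where the attempts land relative to the toggling.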
THE SOLUTION:
VMware KB56982 is actually the answer. You need to either:
- Standardize View Storage Accelerator on all vCenter connections, either enabled or disabled
- Standardize View Storage Accelerator on all Desktop Pools, either enabled or disabled
- This may be a real pain if you have hundreds of pools
- Skip the View Storage Accelerator (VSA) disabling functionality
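If you go the standardization route, the first step is finding out which pools disagree with the vCenter-level setting. A minimal sketch, assuming you've already collected each pool's VSA setting into a dictionary by whatever means you prefer (the pool names and the data structure below are invented for illustration):

```python
def find_vsa_mismatches(vcenter_vsa_enabled, pools):
    """Return the names of pools whose VSA setting differs from vCenter's.

    vcenter_vsa_enabled: True if VSA is enabled on the vCenter connection.
    pools: dict mapping pool name -> True if VSA is enabled on that pool.
    """
    return sorted(
        name for name, enabled in pools.items() if enabled != vcenter_vsa_enabled
    )


# Hypothetical inventory: VSA is enabled at the vCenter level.
pools = {
    "finance-full-clones": True,
    "dev-full-clones": False,
    "kiosk-desktops": True,
}

print(find_vsa_mismatches(True, pools))  # ['dev-full-clones']
```

An empty list means the pools and the vCenter connection agree, which is the state you want before expecting the task flood to stop.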