Hello and welcome to my blog; this is my first post. I’m a junior developer in test here at Black Marble, and my day-to-day role is starting to encompass DevOps as well as testing.
There’s one project we’ve been working on which I have been trying to automate using an end-to-end build and deploy with Lab Management. That exercise is coming up in another blog post, as I’m still working on it.
What I wanted to share with you all today is a problem I’ve encountered (and solved) while trying to accomplish that project.
For those not in the know, Microsoft Lab Management is a piece of test environment management software, most notably used for managing VMs. We use it at Black Marble for provisioning dev and test environments for our projects, and for the most part it’s fairly straightforward for me or Tom Barnes (my counterpart in test) to roll out a new environment to order for a given project, once IT have built the template/stored environment we’re using for that project.
In my Auto Deploy exercise I was doing some really heavy moving around of environment Snapshots on the environment I was deploying to.
Basically, the 4-box environment (consisting of a SharePoint 2013 server, a CRM 2011 server, a SQL 2012 server and a DC) had 1 snapshot at any one time: a sterile point in time prior to me attempting to deploy on top of it. If my auto deploy didn’t work due to an environment setting, I went and changed the setting, created a new snapshot and deleted the old one.
Lo and behold, today I tried to revert to my newest clean snapshot and Lab Manager told me:
A) It couldn’t apply the snapshot.
B) The snapshot tree was corrupted and I needed to create a new tree.
My immediate response was to run crying to Rik Hepworth, our resident witch doctor for all things SCVMM, as this had happened before and he’d fixed it last time, believing that it was just a corrupted snapshot. We found the problem this time was far more... odd.
We logged onto the SCVMM server, opened the System Center Virtual Machine Manager console and examined my environment (now in a mixed state), and the reason for the snapshot tree being utterly rodded was fairly evident.
The Truth Comes Out
Lab Manager lies! To itself and to the user.
My SharePoint VM had 4 snapshots
My CRM VM had 3 snapshots
My DC had 1 snapshot
My SQL server had 1 snapshot
...on top of the one I’d just created prior to getting into this mess.
But wait, hadn’t I been deleting the old snapshots as I no longer needed them? Well, yes, but that didn’t actually mean Lab Manager went and did it.
Lab Manager perceived that it only had one snapshot for the whole environment, which was not the truth by a country mile. What had happened repeatedly, it seems, is that the deletion jobs I had triggered in Lab Manager had occasionally failed within SCVMM, but Lab Manager didn’t know this because it doesn’t talk to SCVMM all that well. Lab Manager fires and forgets commands at SCVMM. Near as we can figure, it keeps its own version of the truth regarding its machine snapshots and holds its hands over its ears shouting “la la la” when the real world doesn’t match up with what it’s expecting.
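To make that mismatch concrete, here is a minimal Python sketch of the kind of sanity check I would have liked to have. Everything in it is hypothetical: the VM names, the `find_snapshot_drift` helper and the hard-coded counts are stand-ins for whatever you would read out of Lab Manager and SCVMM yourself, and it doesn’t call either product. The SCVMM numbers are the stale snapshots listed above plus the one fresh snapshot per VM.

```python
def find_snapshot_drift(expected, actual):
    """Return {vm: (expected_count, actual_count)} for every VM whose counts differ."""
    drift = {}
    for vm in sorted(set(expected) | set(actual)):
        exp = expected.get(vm, 0)
        act = actual.get(vm, 0)
        if exp != act:
            drift[vm] = (exp, act)
    return drift


# What Lab Manager believed: one snapshot per VM (names are illustrative).
lab_manager_view = {"SharePoint": 1, "CRM": 1, "DC": 1, "SQL": 1}

# What SCVMM actually held: the stale snapshots plus the new one.
scvmm_view = {"SharePoint": 5, "CRM": 4, "DC": 2, "SQL": 2}

drift = find_snapshot_drift(lab_manager_view, scvmm_view)
for vm, (exp, act) in drift.items():
    print(f"{vm}: Lab Manager thinks {exp}, SCVMM has {act} ({act - exp} stale)")
```

The point is only that the two views need comparing by something other than Lab Manager itself, since it won’t report a failed deletion.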
All well and good to know for future but what about my poor environment I’d been slaving over for 3 weeks?
All was not lost
My environment was not beyond recovery. In SCVMM we paused everything else running on the same tin as my environment. This was to minimize the risk of jobs timing out mid-process, something we’ve had problems with previously.
For each VM we opened its properties and removed each of the previous checkpoints until we had the latest snapshot only (as shown below).
Update: This happened to me again today, and it seems the problem can vary in severity. I had to delete the ENTIRE snapshot tree, not just the latest snapshot. Until I had done this, all new SCVMM jobs to apply new snapshots to the environment en masse failed (I had to do them one by one), and any new snapshots I made suffered the same problem. Manually apply a safe snapshot to each machine in your environment and then, if possible (in my experience this gives the most reliable results), delete the entire tree and snapshot the environment again to create a new one.
We then applied this snapshot to each of the VMs in the environment one by one (rather than Lab Manager’s shotgun approach of doing them all at once which, depending on how much resource your tin has, can behave very erratically).
Once the VMs were back at this checkpoint we deleted the old checkpoint and created a new one (because we’re paranoid that the old one was inherently shifty).
We then fired them back up and presto, my dead environment was resurrected via SCVMM necromancy. It was even in the right state by the time all the differencing disks had been merged.
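For anyone who prefers the recovery steps written down as a checklist, here is a rough Python sketch that only builds the ordered list of actions we worked through by hand. It is an illustration, not a tool: the `build_recovery_plan` helper, the action names and the checkpoint names are all made up for this post, and nothing here talks to SCVMM. The thing it tries to capture is the ordering, with every job done one VM at a time rather than all at once.

```python
def build_recovery_plan(checkpoints_by_vm, keep, replacement):
    """Return an ordered list of (vm, action, target) steps, one VM at a time."""
    plan = []
    # Pass 1: strip every VM back to the single checkpoint we trust.
    for vm, checkpoints in checkpoints_by_vm.items():
        for checkpoint in checkpoints:
            if checkpoint != keep:
                plan.append((vm, "remove_checkpoint", checkpoint))
    # Pass 2: apply the trusted checkpoint to each VM individually.
    for vm in checkpoints_by_vm:
        plan.append((vm, "apply_checkpoint", keep))
    # Pass 3: swap the old checkpoint for a freshly created one.
    for vm in checkpoints_by_vm:
        plan.append((vm, "remove_checkpoint", keep))
        plan.append((vm, "create_checkpoint", replacement))
    # Pass 4: start the machines again.
    for vm in checkpoints_by_vm:
        plan.append((vm, "start_vm", None))
    return plan


# Illustrative checkpoint names; "clean" is the snapshot we want to keep.
environment = {
    "SharePoint": ["old-1", "old-2", "old-3", "old-4", "clean"],
    "CRM": ["old-1", "old-2", "old-3", "clean"],
    "DC": ["old-1", "clean"],
    "SQL": ["old-1", "clean"],
}

plan = build_recovery_plan(environment, keep="clean", replacement="clean-new")
for number, (vm, action, target) in enumerate(plan, start=1):
    print(number, vm, action, target or "")
```

Pausing everything else on the same host, as we did before starting, is left out of the sketch because it depends entirely on what else is running on your tin.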
Be careful with heavy use of snapshots in Lab Management: the larger the environment, the more likely it is that things will go wrong. Lab Manager will not tell you when a snapshot deletion has failed; it really is a fire-and-forget process as far as it’s concerned.
If you get a scary error when your environment is in a mixed state after attempting to apply a snapshot, fear not. So long as you have a witch doctor on hand, your environment can be recovered to the state you originally intended.
If you are curious about Lab Management and SCVMM, I strongly recommend you look at the following blogs.
- Our witch doctor Rik Hepworth’s blog for an IT Pro’s perspective and war stories with Lab Manager and SCVMM.
- Our director of Engineering and ALM MVP Richard Fennell, who has also worked extensively with this product and knows a quite frankly scary amount about its innards.
Check back in the next week for my completed lab deploy war story, and thanks for reading!