Logic App Flat File Schemas and BizTalk Flat File Schemas

I recently started working on a new Logic App for a customer using the Enterprise Integration Pack for Visual Studio 2015, I was greeted with a familiar sight, the schema generation tools from BizTalk, but with a new lick of paint Open-mouthed smile

The Logic App requires the use of Flat File schemas, so I knocked up a schema from the  instance I’d been provided and went to validate it against the instance used to generate it (since it should validate).

My Flat File was a bit of a pain to be frank in that it had ragged endings, that is to say some sample rows might look a bit like:




Which I’ve worked with before….but couldn’t quite remember how I solved exactly other than tinkering with the element properties.

I generated the schema against the lines with the additional column and erroneously set the last field property Nillable to True. When I went to validate the instance lo’ and behold it wasn’t a valid instance and I had little information about why.

So I fired up my BizTalk 2013 R2 virtual machine (I could have used my 2016 one to be fair if I hadn’t sent it to the farm with old yellow last week) and rinsed and repeated the Flat File Schema Wizard.

So I got a bit more information this time, namely that the sample I’d been provided was missing a CR/LF on the final line and that the Nillable I’d set on the last column was throwing a wobbler by messing up the following lines.

Setting the field’s  Nillable property back to false, but it’s Min and Max occurs to 0 and 1 respectively and I had a valid working schema.

So I copied the schema back to my Logic Apps VM and attempted to revalidate my file (with its final line CR/LF amended). To my annoyance, invalid instance!

I was boggled at this quite frankly but some poking around the internet led me to this fellow’s Blog.


In short there’s an attribute added on a Flat File schema which denotes the extension class to be used by the schema editor, when built by a BizTalk Flat File Schema Wizard it’s set to


When generated by the Enterprise Integration Pack it’s


Changing this attribute value in my BizTalk generated Flat File schema and presto, the schema could validate the instance it was generated from.

So in short I’ll say the flavour of the Schema designer tools in the Enterprise Integration Pack seem to throw out errors and verbosity a little different to its BizTalk ancestor, it still throws out mostly the same information, but in different places:

  • EIP

You only get a generic error message in the output log. Go and examine the errors log for more information.


  • BizTalk

You get the errors in the output log, and in the error log.


In my case the combination of my two errors (the flat file being malformed slightly, and my Nillable field change) in the EIP only gave me an error “Root element is missing” which wasn’t particularly helpful and the BizTalk tooling did give me a better fault diagnosis.

On the bright side the two are more or less interchangeable. Something to bear in mind if you’re struggling with a Flat Schema and have a BizTalk development environment on hand.

SharePoint 2010 Integrations with TFS 2017.3

An interesting aspect of a TFS Upgrade I found myself in recently was to do with SharePoint Integrations with Team Foundation Server. As a bit of a pre-amble, back in the day, when TFS was a bit more primitive, it didn’t have a rich dashboard experience like it does today, and so leaned on it’s distant cousin SharePoint for Project Portal support. A lot of my customers as a consequence of this have significant document repositories in SharePoint Foundation instances installed as part of their Team Foundation Server installations. This in itself isn’t an issue, as you can decouple a SharePoint server from TFS and have the two operate independently of one another, but some users will miss the “Navigate to Project Portal” button from their project dashboard.

I found an easy answer to this is to use the Mark Down widget to give them the link they want/need so they can still access their project documentation, but of course the work item and report add-ins will cease to function. Encouraging users to adopt the newer TFS/Azure Dev Ops Dashboards is key here.

But enough waffling. A technical conundrum faced in this recent upgrade I thought worth mentioning was around the SharePoint integrations themselves. The customer was migrating from a 2012.4 server to a 2017.3 server (as tempting as 2018.3 was, IT governance in their organization meant SQL 2016 onwards was a no no). Their old TFS server was cohabiting with a SharePoint Foundation 2010 server which was their BA’s document repository and point of entry for having a look at the state of play for each Team Project. The server had two Team Project Collections for two different teams who were getting two shiny new TFS Servers.

One team was going to the brand new world and still needed the SharePoint integrations. The other was going to the new world with no SharePoint (huzzah!). Initially the customer had wanted to move the SharePoint server to one of the new servers, and thus be able to decommission the old Windows Server 2008R2 server it was living on.

This proved to be problematic in that the new server was on Windows Server 2016, which even had we upgraded SharePoint to the last version which supported TFS integration which was SharePoint 2013, 2013 is too old and creaky to run on Windows Server 2016. So we had to leave the SharePoint server where it was. This then presented a new problem. According to the official documentation, to point the SharePoint 2010 server at TFS 2017 I would have had to install the 2010 Connectivity Extensions which shipped with the version of TFS I was using. So that would be TFS 2017’s SharePoint 2010 extensions. But we already had a set of SharePoint 2010 extensions installed on that server….specifically TFS 2012’s SharePoint 2010 extensions.

To install 2017’s version of the extensions I’d have had to uninstall or upgrade TFS 2012.4 since two versions of TFS don’t really like cohabiting on the same server and the newer version tends to try (helpfully I should stress) upgrading the older version. The customer didn’t like the idea of their fall-back/mothballed server being rendered unusable if they needed to spin it up again. This left me in something of a pickle between a rock and a hard place.

So on a hunch I figured why not go off the beaten path and ignore the official documentation (something I don’t make a habit of), and suggested to the customer “let’s just point the TFS 2012’s SharePoint 2010 extensions at TFS 2017 and see what happens!”. So, I reconfigured the SharePoint 2010 Extensions the TFS 2017 Admin console and guess what? Worked like a charm. The SharePoint 2010 Project Portals quite happily talked to the TFS 2017 Server with no hiccups and no loss of functionality.

Were I to make an educated guess, I’d say the TFS 2017, SharePoint 2010 extensions are the same as TFS 2012’s but with a new sticker on the front, or only minor changes.

TFS 2018.3 Upgrade–Where’s my Code Content in the Code Hub?

Firstly it’s been a while (about 4 years or so) since I last blogged, and so there’s going to be a bit of a gap in what I blog about now versus what I used to blog about. Why the big gap, and what’s been happening between I might talk about in a future post but for now I thought I’d share a recent war story of a TFS Upgrade gone wrong, which had a pretty arcane error.

The client had previously been using an on-premise TFS 2012.4 server and had expressed an interest in upgrading to a newer version,  TFS 2018.3 to be precise.

Naturally our first recommendation was to move to Azure DevOps (formerly VSTS, VSO) but this wasn’t a good fit for them at the point in time, issues around Data sovereignty and all that and Brexit generally putting the spook into everyone in the UK who had been eyeing the European Data Centres.

Non the less we built some future proofing into the new TFS server we built and configured for them, chiefly using SQL Server 2017 to mitigate any problems with them updating to Azure DevOps Server 2019/2020 which is due for release before the end of the year, or just after the turn of the year if the previous release cadence of TFS is any indicator.

We performed a dry-run upgrade installation, and then a production run, the client I suspect brushed through their dry-run testing a little too quickly and failed to notice an issue which appeared in the production run.

There was no content in the Code Hub file content.


Opening the Editor also just showed a blank editing pane. Examining the page load using web debugging tools showed us the following javascript error.


: Script error for “BuiltInExtensions/Scripts/TFS.Extension”

We also saw certain page assets failing to load in the network trace.


Editor.main.css looked to be quite important given the context of the page we were in. It was also noted in the network trace we had quite a number of 401’s and many of the page assets were not displaying correctly in TFS (like the Management/Admin Cog, the TFS logo in the top right, folder and branch icons in the Code Hub source control navigator). We were stumped at first, a support call from Microsoft in the end illuminated us to the issue. The client had a group policy setting which prevented membership of a role assignment in Local Policies from being modified.


When adding the IIS feature to Windows Server, the local user group IIS_IUSRS normally gets added to this role. In the clients case, because of this group policy setting which prevented role assignments from being made, this had not occurred. No error had been raised in the feature enablement, and so no one knew anything had gone amiss when setting up the server.

This local user group contains (as I understand it) the application pool identities when creating application pools in IIS. TFS’s app pool needs this impersonation policy to load certain page assets as the currently signed in user. Some group policy changes later we were able to add the local group this by hand and resolve the issue (combined with an iisreset command). It’s been explained to me this used to be a more common error back in the days of TFS 2010 and 2012 but is something of a rarity these days hence no luck with any Google-Fu or other inquiries we made to the error.

Interestingly a week later I was performing another TFS Upgrade for a different client, they were going to TFS 2017.3 and the Code Hub, Extension Management, and Dashboards were affected by the same error, fortunately recent experience helped us resolve that quickly.

VS2013 Update 2 RC–Ouch (Part 1)

So recently the Visual Studio Team released the Update 2 RC for the 2013. Among other things it has updates for project templates including the Universal Application project template.

This particular feature is something that my current development team were interested in as the project we’re currently working on is a Windows 8.1 Application, our client has strongly hinted they want it for Windows Phone 8.1 in the not too far future.

With that in mind one of our developers downloaded and installed the Update onto his development rig. The projects he added (portable class libraries) as it turned out had a solid dependency on the the Update. This meant no one on the rest of the team could build the solution without first unloading the projects he had added.

This also meant the build agent couldn’t build the solution since msbuild couldn’t open the project files. With work having already been done in these projects we all started rolling our Visual Studio installations forward onto the Update. (Whilst our Team Lead/Engineering Manager applied the update to the Build Agent).

That’s when the problems started.

I’ve been working on the CodedUI tests for this project for about 8-10 weeks now. They’re all hand coded, I use the CodedUI Test Builder tooling in Visual Studio to have peaks at the various properties of the controls I’m trying to find. After the this tooling stopped working. I’d open the Tool and then…nothing. Checking the event logs on my rig showed that the codeduitest.exe had crashed due to not being able to load any of the test extensions.

I attempted a repair installation to no avail (hoping that perhaps some of the assemblies had been corrupted). At this point I decided it would be best to reinstall visual studio 2013. My colleague kicked off an installation of VS2013 RTM for me the same evening after I left.

When I got into work the next day VS started and the CodedUI tool started! Great….but then I discovered that it was not very functional. It couldn’t pick up any objects. Poking around I found that to my horror the RTM install had decided to split itself between the root of Program Files(x86) and Program Files(X86)\Visual Studio 12.0.

I made a guess that some of the dlls that the IDE used weren’t where it was expecting despite the numerous registry keys that should have been pointing Visual Studio around to where it’s innards were strewn about. I tried moving the erroneously placed components back into the the 2013 folder but this caused the IDE to fail to even start.

Needless to say I uninstalled it again then found out to my dismay that it would only let me install it into the root of X86. Reading into it if the VS installation detects that a dependency of the installation is already installed (I believe it’s the C++ binaries) it forces you to install VS in the same location.

Because VS does not cleanly uninstall I had no means of getting rid of the pointer that was forcing it into the X86 folder, short of going through my registry with regedit and destroying everything labelled Visual Studio 12.0/2013 with extreme prejudice, even then it wasn’t guaranteed to work. So I had to have my rig re-imaged and wiped down. All so I could put the correct tooling back on to do my job properly, and overall I’d lost about 3 days of time on the sprint we were on due to the faffing around trying to get my machine into working order.

You’d think by this point how could it get any worse. Well it did but that’s a separate blog post (see part 2)

The moral of this story isn’t “Don’t install VS2013 Update 2 RC!!”, it should be tested and used by the community, just not on a production project in my opinion. Our first developer to install the update should have consulted with the rest of the team before doing so and as a team we would have weighed the risks versus the gains before deciding whether to install it. The simple reason being it’s an RC not the finished product. I can’t help but wonder if we’d waited a couple of months for an RTM version would I have had the same problems upgrading.

My Domain Controller Doesn’t Think It’s a Domain Controller

I’ve been helping our other Tester Tom Barnes on a project he’s been lead tester on for a couple of months, mostly running acceptance tests here and there when I’ve had a spare couple of minutes.

As mentioned in previous posts (and in Richard and Rik’s blogs) we use a lot of SCVMM virtual environments at black marble, presented through TFS Lab Management. This project was no different, our test environment consisting of a DC, SQL Server and Several SharePoint servers

So today I thought, while waiting for a big database operation to finish on another project, I’d test another user story for functional completeness. I remoted onto one of a client VM’s (which point at the SharePoint web server via host file configuration) and resumed my session from the previous day.

None of the websites I was intending to test were working, 404’s around the board. My immediate thought was to check the SharePoint server to see whether a deployment had gone amiss. I attempted to remote onto the SharePoint server using the SP Admin account, only to be told my password was incorrect. So I tried the domain admin account and ran into the same problem. Once again no luck.

I thought to check the domain controller since I knew we’d been running PowerShell scripts which modify password behaviour in AD, I was hoping someone hadn’t accidentally turned on the password expiry policy.

I couldn’t login to the DC with the Domain Admin. Lab then thought to give me a further bit of worrying information “Lab cannot determine whether the machine you are trying to connect to is a DC or joined to a domain”.

To quote Scooby Doo “Ruh Oh!”

I logged in using the machine admin account and the problem became fairly obvious on logging in, the desktop was quite helpful in informing that….

The DC was running in Safe-Mode.

For those unaware of what Safe-Mode does, it disables a lot of services and programs from starting up, in the case of a DC one of these is Active Directory Domain Services (and probably a host more). No AD Domain Services, no authentication, no authentication means lots of other services/applications which use dedicated service accounts fall flat on their face. Notable examples being:

  • CRM
  • SharePoint
  • SQL
  • TFS

So for all intents and purposes, my DC was not actually behaving like a Domain Controller.

So I restarted it….and it started in Safe Mode again…much to my annoyance. It did this without fail during successive restarts, no option on start-up was given to not start in Safe Mode and nothing in the event logs indicated the system had suffered a catastrophic problem on start-up for it to boot into Safe Mode.

Some quick Google-Fu showed the the problem, and more importantly how to fix it.


Something or Someone had turned Safe Boot on in System Configuration. Funnily enough turning this off meant the DC booted normally! You can find System Configuration in Server 2012 by using a normal search on pressing the Windows key.

Still haven’t found out what turned it on in the first place mind, but I’ll burn that bridge down if I have to cross it again.

Anyhow thanks for reading.


Lab Automated Build & Deploy: An Exercise in Pain.

Well firstly apologies for anyone who was hanging onto my every word and waiting for this follow up post to my previous article.

I was so terribly busy at the office (with the project this exercise pertained to) that I never had time to finish the automated build deploy I was working on in that last blog post. Or document it when I did. So here it is, and I apologise for the lateness of it.

The original title of this post was Lab Automated Build & Deploy: A Novice Attempt, however after finally crossing the t’s and dotting the i’s I can assure you that the current title was quite apt.

Got there in the end, but for my first “rodeo” I sure picked a corker 🙂

Step 1: The Environment

At Black Marble we’re quite adept at utilising Microsoft Test Manager Lab Management to procure VM’s for our development and testing cycle, I had a sterile 4 box test rig sitting in the Library which I could roll out as I pleased.

Looks a bit like this

So to ensure our deployment was always on a sterile environment I rolled out a new one. It consisted of a DC, SQL Server, SharePoint2013 server and a CRM server. I then made sure that I put any settings in that I couldn’t easily do through PowerShell and then took a snap shot of the environment using MTM’s Environment Viewer.

Gotcha: I had to amend my snapshot later to include the PowerShell setting set-executionpolicy unrestricted, as most of our deployment scripts were PowerShell based. For some reason this didn’t seem to reliably take from the deployment scripts themselves.

Gotcha: My first run of the process utterly rodded the environment because the snapshot was dodgy, IT had to go into SCVMM to manually restore the environment to a precursor snapshot. Something lab couldn’t do as it reckoned the entire snapshot tree was in an unusable state. This happened every time I tried rolling the environment back to it’s snapshot. Evidently taking the environment snapshot is a delicate process at times. While taking a snapshot don’t interrupt your lab manager environment viewer session. IT swear blindly this can cause behind the scenes shenanigans to occur. In the end we took a new snapshot.

Gotcha: The second run of the process with the new snapshot failed! In IT’s words lab manager is a touch daft, it fires off all snapshot reversion processes to SCVMM at once, that’s 4 jobs, if any one of them fail it counts the entire lot as a failure. The reason in our case was we had too many Virtual Machines running on our tin and the snapshot process was timing out because of low resources. We moved some stuff into the VM Library and turned/paused everything else we could that wasn’t needed.

 For those last 2 Gotchas, and the tale that followed on from them, see my first post Lab & SCVMM – They go together like a horse and carriage for more information on this. Probably also worth noting a patch for SCVMM around the time dramatically improved it’s stability when performing large numbers of operations.

With an environment in a mostly ready state I started working on my Build. The default Lab Deploy build template found in VS2012 runs as a post build action, you can either specify an existing/completed build, or queue a new one from another pre-defined build definition. I chose the latter as I wanted ideally the development teams and myself to have most recent versions of the product in test every morning.

All I did to start with was clone our C.I build. I customized the build template used for this build to accept a file path to a PowerShell script file, which it would then execute post drop. I was able to use the DropLocation parameter in the build definition workflow with some string manipulation to invoke the PowerShell script file which isincluded in the project in every build.


My Re order drop location process, firstly I convert the Workspace mapping of the Master post build script to a server mapping. I then use some string.Format shenanigans to construct the full file path and arguments  (taking the Drop location created for me already by the template) to the to the Post Build Master script. Which I then use when I use an Invoke process workflow item to fire up Power Shell on the build server/agent!.


Passing the Post Drop script file location as it exists in a workspace. I had to do this because the existing post script section has no Macro for the drop location which my scripts NEEDED.

I probably could have used the Post Drop PowerShell Script File and arguments section in the Build process if not for that at the time I had no way of getting hold of the drops location parameter ( and as our drop folders are all version stamped they change build for build). Probably could have done it through some MS build wizardry but that’s hindsight for you.

The PowerShell file well was really just a collection of Robocopy commands to shuffle around the drops folder into something resembling where my deployment scripts would be want to find things. Basically making my life easier when I wanted to copy the deployable files from the drops location to each machine in the environment. You could just as easily use another command line copy tool (or do it using basic copy commands).

This was for the most part a walk in the park.
What came next was a royal headache.

2.0 The Lab Build

This part of the problem would have been made a lot easier had my understanding of how Lab Management and TFS Build string together when deploying to a Lab environment.

As I now understand it here is the sequence of events

  1. Your base build starts, this is kicked off by the build controller and delegated to the relevant build agent. Your base build post build scripts are run by the TFSBuild service account on the build agent itself.
  2. Your Lab Build starts, this runs on the build agent as normal, your deployment scripts are run from the machines in the environment you designate in the process wizard.

    I initially thought they were run by the Visual Studio Test agent on that machine (oh how wrong I was), which is typically an account with admin rights. Your scripts can get here from the build location by some back door voodoo magic between TFS and Lab Management which ignores petty things like cross domain access and authentication! Not that I used this, I did it the hard way.

Gotcha! I found was that reverting to a snapshot of the environment in which it was turned off. Our IT have a thing about stateful snapshots and memory usage, so naturally all my snapshots were stateless (i.e the environment was turned off!). This has the logical repercussion that meant everything that came after the base build failed spectacularly. This was a relatively easy fix, modifying the lab deploy build process template to include an additional step where it turned the Environment back on. It’s kind of nice in that it waits for the lab isolation to finish configuring before it attempts to run any scripts.

For the record, not a custom workflow activity, it’s in there already.



So I now had a set of scripts that development had written to set the solution up, I wrapped these in another set of scripts specific to each environment that would “pull” the deployables from the drops location down to each machine (where the developer scripts were expecting them). Oh but here was a real big problem, these scripts were running as the domain admin of the test environment, which was on an entirely different domain to where our drops location was!
There was no way I was going to be allowed or able to turn this environment into a trusted domain, and I couldn’t seem to change the identity of the test agents either. No matter what I did in the UI they seemed intent on running as NT Authority.

At this point there were two avenues of work available to me. I could execute all the Robocopy PS scripts from the build process template taking advantage of the previously mentioned voodoo magic. This still didn’t get me around the deployment problem given the scripts (by judicious use of whoami) weren’t running as the accounts they needed to.

My second option was use Net Use, and map the drop location as a drive, as I could provide credentials in the script.

I opted for the latter which in hindsight was the more difficult option. My decision not to use the first option was helped by the interface in the process editor for entering the scripts is an abomination in the eyes of good UI design.

I noticed almost immediately that the copy scripts were working in some cases and not in others when trying to pull my files down. Some detective work on what was different between the files that were being copied and the files that weren’t yielded the Gotcha! below.

Gotcha!: Robocopy here was not my friend! I had been using the /Copyall switch on my Robocopy commands, which is equivalent to /copy + all :args, the audit information required higher security privileges than what were available to the account executing the copy script (which was TFSLab at this moment in time, the credentials we supplied for net-use in creating the network share). Omitting /Copyall and using just /copy:d solved the problem. Additionally Robocopy does not replace files if they exist already. So either use a blank snapshot (where the files have never been copied down before) or ensure your scripts can delete the files in place if you need to. I used Robocopy’s /PURGE switch to mirror (and recursively delete) a blank folder created via MKDIR.

So now we had all the deployables pulled down correctly to their correct machines. So it should have been as simple as executing the child scripts.

Nobody except the CRM box was playing ball…grrr.


3.0 SharePoint & SQL deployment

So the problems here were that operations performed on these two machines required specific users with specific access rights to perform their actions. I needed to execute my scripts as SharePoint Admin on the SharePoint Server and as the Domain Admin on the SQL Server.

For love nor money could I get the test agents to execute the scripts as anything other than NT System. I tried all sorts of dastardly things, like switching the Powershell execution context using Enter-PSSession (which I don’t recommend doing for long as TFS Build loses all ability to log the deployment once you execute the session switch) to adding NT System to some Security Groups it really shouldn’t be in.

All in all I didn’t like either of these solutions (though had I pursued them fully I’m sure they would have worked).

They were horrible, the first solution meant I could not easily debug the process as I had no built in logging, and also meant I had hard coded security credentials in my deployment scripts.

The second option is such bad practice that if my IT department saw what I did with the NT System account they’d shoot me on sight because it really is very very bad practice.

I was mulling it over until one day one of my more learned colleagues (the mighty “Boris”) clued me in that the Test Agent UI would execute the scripts if, and only if I was using it as an interactive process, not as a service. What I actually needed to change was the LAB AGENT service account, important distinction here. The Lab Agent is separate to the Test Agent, and has no configuration tool unlike the test agent.


Note the little blighter in the image. It’s this you have to change to provide a different execution account to your scripts, and it’s easy enough to do (right click Properties > Log On and provide the account).

Comprehension dawned. (FYI I used the latter)


4.0 Result!

And suddenly everything started coming together, I had a couple of niggles here and there, scripts that had the odd wrong parameter etc.

But overall the process now worked.

So end to end. My first Lab Deploy build goes a little like this.

LabBaseBuild runs, this is a customized workflow that as a post build action takes the drop location parameter generated during our normal CI build workload, and passes it to a PowerShell script that wraps around 3 child scripts. These child scripts move the CRM, SQL and SharePoint deployables into a mirror of the folder structures the developer scripts are expecting.

The Lab Deploy build then kicks off. This is a customized workflow that includes an environment start step after applying a stateless Snapshot to the environment. The build then runs 3 Scripts




Each of these scripts is passed the build drop location as a parameter, maps this location using Net Use (using the lab service account’s credentials as it’s quite Security Permission light, this allows us to jump across from the test domain to our black marble domain).

The scripts first pull down the folders they need from the drops location using good old Robocopy.

Each script then calls the deployment script for each feature.
CRM uses a custom .exe that calls into the CRM SDK for importing and publishing a CRM Solution, uploading reporting services reports and importing/assigning user security roles.

The SQL script applies a DACPAC to a specific database, imports a new set of SSIS packages (ispac files), calls to an external data source to update a database contents with the clients application specific data (which updates on a daily basis).

It then Truncates a load of transaction logs, and performs a DB Shrink, then executes our SSIS packages and imports up to 92 million + rows of data.

The SP script upgrades a set of WSP’s, activates the new feature for a site collection, configures a SharePoint search business data connectivity application, configures a SharePoint search service application and executes an indexing of the Application DB on the SQL server.

And optionally rolls out the entire scripted deployment of a Site Collection and a Test site with test user accounts already set up.

The Build then, as it is, ends.


5.0 In Conclusion

To say I learnt a lot doing this would be an understatement, but I could have picked a far easier project to do my first build deploy.

I think for any first timer using this technology is correctly interpreting the way in which the various components of the build process talk to one another is key!

Basically to do it right first time you need to know what does what, and who is it doing it as 🙂

Thanks for reading.

Lab and SCVMM – They go together like a horse and carriage

Hello welcome to my blog, this being my first post. I’m a Junior developer in test here at Black Marble and my day to day job roles are starting encompass dev-ops as well as testing.

So there’s this one project we’ve been working on which I have been working to automate using end to end Build Deploy with Lab Management. That exercise is coming up in another Blog post as I’m still working on it.

What I wanted to share with you all today is a problem I’ve encountered (and solved) while trying to accomplish that project.


Lab Management

For those not in the know Microsoft Lab Management is a piece of Test Environment management software, most notably for managing VM’s. We use it at Black Marble for provisioning Dev and Test environments for our projects and for the most part it’s fairly straight forward for me or Tom Barnes (my counterpart in test) to role out a new environment to order for a given project once IT have built the template/stored environment we’re using for that project.

In my Auto Deploy exercise I was doing some really heavy moving around of environment Snapshots on the environment I was deploying to.

Basically the 4 box environment (consisting of a SharePoint 2013 Server, CRM 2011 Server, SQL 2012 server and a DC) had at any one time 1 snapshot. A sterile point in time prior to me attempting to deploy on top of it. If my auto deploy didn’t work due to an environment setting I went and changed the setting, created a new snapshot and deleted the old one.

Lo and behold today I try to revert to my newest clean snapshot and Lab Manager tells me

A) It couldn’t apply the Snapshot

B) The snapshot tree was corrupted and that I needed to create a new tree.

My immediate response was to run crying to Rik Hepworth, our resident witch doctor for all things SCVMM. As this had happened before, and he’d fixed it last time believing that it was just a corrupted snapshot. We found the problem this time was far more….odd.

We opened the System Centre Virtual Machine Manager Console, logged onto the SCVMM server and examined my environment (now in a mixed state), and the reason for the snapshot tree being utterly rodded was fairly evident.


The Truth Comes Out

Lab Manager lies! To itself and to the user.

My SharePoint VM had 4 snapshots

My CRM VM had 3 snapshots

MY DC had 1 snapshots

MY SQL server had 1 snapshots

……On top of the one I’d just created prior into getting into this mess.

But wait hadn’t I been deleting the old snapshots as I no longer needed them? Well yes but that didn’t actually mean Lab Manager went and did it.

Lab Manager perceived that it only had one snapshot for the whole environment, which was not the truth by a country mile. What had happened repeatedly it seems is that the deletion jobs I had triggered in Lab Manager had failed occasionally within SCVMM but Lab Manager didn’t know this because it doesn’t talk to SCVMM all that well. Lab Fires and Forgets commands at SCVMM. Near as we can figure it keeps it’s own version of the truth regarding it’s machine snapshots and holds it’s hands over it’s ears shouting la la la when the real world doesn’t match up with what it’s expecting.

All well and good to know for future but what about my poor environment I’d been slaving over for 3 weeks?

All was not lost

Fixing It

My environment was not beyond recovery, in SCVMM we paused everything else running on the same tin as my environment. This was to minimize the risk of jobs timing out mid-process. Something we’ve had problems with previously

For each VM we opened it’s properties and removed each of the previous checkpoints until we had the latest snapshot only. (As shown below)



Update: This happened again to me today, and it seems the problem can be vary in severity, I had to delete the ENTIRE snapshot tree, not just the latest snapshot. Until I had done this all new SCVMM jobs to apply new snapshots to the environment en masse failed (I had to do them one by one), and any new snapshots I made had suffered the same problem. Manually apply a safe snapshot to each machine in your environment and then is possible (for in my experience most reliable results) delete the entire tree and then snapshot it again to create a new one.

We then individually applied this snapshot to each of the VM’s in the environment one by one (rather than Lab Managers Shot Gun approach of doing them all at once which depending on how much resource your tin has can behave very erratically).

Once the VM’s were back at this checkpoint we deleted the old checkpoint and created a new one (because we’re paranoid that the old one was inherently shifty)

We then fired them back up and presto my dead environment was resurrected via SCVMM necromancy. It was even in the right state by the time all the difference disks had been merged.


In Summary

Be careful with heavy use of snapshots in Lab Management, the larger the environment the more likely things will go wrong. Lab Manager will not tell you when a snapshot deletion has failed, it really is a fire and forget process as far as it’s concerned.

If you get a scary error when your environment is in a mixed state after attempting to apply a snapshot then fear not, so long as you have a witch doctor on hand your environment can be recovered to the state you originally intended.

If you are entertaining curiosity about Lab Management and SCVMM I strongly recommend you look at the following blogs.

Check back in the next week for my completed lab deploy war story, and thanks for reading!