Building environments for Lab Manager: Why bare metal scripting fails

In the world of DevOps it’s all about the scripts: I’ve seen some great work done by some clever people to create complex environments with multiple VMs all from scratch using PowerShell. That’s great, but unfortunately in the world of Lab Manager it just doesn’t work well at all.

We’ve begun the pretty mammoth task of generating a new suite of VMs for our Lab Manager deployment to allow the developers and testers to create multi-machine environments. I had hoped to follow the scripting path and create these things much more on the fly, but it wasn’t to be.

I hope to document our progress over the next few weeks. This post is all about the aims and the big issues we have that make us take the path we are following.

Needs and Wants

Let’s start with our requirements:

  • A flexible, multi-server environment with a range of Microsoft software platforms to allow devs to work on complex projects.
  • All servers must be part of the same domain.
  • All products must be installed according to best practice – no running SharePoint as local service here!
  • Multiple versions of products are needed: SharePoint 2010 and 2013; CRM 4 and 2011; SQL 2008 R2 and 2012; BizTalk 2010 and 2013.
  • ‘Flexible’ VMs running IIS or bare server 2008 R2/2012 are needed.
  • No multi-role servers. We learned from past mistakes: Don’t put SQL on the DC because Lab Manager Network Isolation causes trouble.
  • Environments must only consist of the VMs that are needed.
  • Lab Manager should present these as available VMs that can be composed into an environment: No saving complete environments until they have been composed for a project.
  • Developers want to be able to run the same VMs locally on their own workstations for development; Lab Environments are for testing and UAT but we need consistency across them all.

That’s quite a complex set of needs to meet. What we decided to build was the following suite of VMs:

  • Domain Controller. Server 2012.
  • SQL 2012 DB server. Server 2012.
  • SharePoint 2013 (WFE+APP on one box). Server 2012. Uses SQL 2012 for DB.
  • Office Web Apps 2013. Server 2012.
  • Azure Workflow Server (for SharePoint 2013 workflows). Server 2012.
  • CRM 2011 Server. Server 2012. Uses SQL 2012 for DB.
  • BizTalk 2013 Server. Server 2012. Uses SQL 2012 for DB.
  • IIS 8 server. Server 2012.
  • ‘Flexible’ Server 2012. For when you just want a random server for something.
  • SQL 2008 R2 server. Server 2008 R2.
  • SharePoint 2010 (WFE+APP+OWA on one box). Server 2008 R2. Uses SQL 2008 R2 for DB.
  • CRM 4. Server 2008 R2. Uses SQL 2008 R2 for DB.
  • BizTalk 2010. Server 2008 R2. Uses SQL 2008 R2 for DB.
  • IIS 7.5 server. Server 2008 R2.
  • ‘Flexible’ Server 2008 R2.

In infrastructure terms we end up with a number of important elements:

  • Our AD domain and DNS domain: <domain>.local.
  • For SharePoint 2013 Apps we need a different domain. For ease this is apps.<domain>.local.
  • To ensure we can use SSL for our web sites we need a CA. This is used to issue device certs and web certs. For simplicity a wildcard cert (*.<domain>.local) is issued.
  • Services such as SharePoint web applications all get DNS registrations. Each service gets an IP address and these addresses are bound to servers in addition to their primary IPs.
  • All services are configured to work on our private network. If we want them to work on the public network (Lab machines, excluding the DC, can have multiple NICs and multiple networks) then we’ll deal with that once the environment is composed and deployed through Lab Manager.

Problems and constraints

The biggest problem with Lab Manager is the way network isolation works. Lab asks SCVMM to deploy a new environment. If network isolation is required (because you are deploying a DC and member servers more than once, through many copies of the same servers), Lab creates a new Hyper-V virtual network (named with a GUID) and connects the VMs to it. It then configures static addresses on that network for the VMs, starting with the DC and counting up.

My experience is that trying to be clever with servers that are sysprepped and then run scripts simply confuses the life out of Lab. You really need your VMs to be fully working and finished right out of the gate. Unfortunately, that means building the full environment by hand, completely, and then storing all the VMs with SCVMM before importing each one into Lab.

There are still a few wrinkles that I know we have to iron out, even with this approach:

  • In the wonderful world of the SharePoint 2013 app model we need subdomains and wildcard DNS entries. We also need multiple IP addresses on the server. Right now we haven’t tested this with Lab. What we are hoping is that we can build our environment on the same address space as Lab uses. Lab counts up from 1, so our additional IPs will count down from 254. What we don’t know is whether Lab will remove all the IP addresses from the NICs when it configures the machines. If it does, we will need some PowerShell that runs to configure the VMs correctly (see the sketch after this list).
  • Since we don’t have to use all the VMs we have built in any given environment, DNS registration becomes important. Servers should register themselves with the DNS running on our DC, but we need to make sure we don’t have incorrect registrations hanging around.
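If Lab does strip those extra addresses, the fix-up we have in mind looks something like this minimal sketch. The adapter name, subnet and addresses are placeholders, and netsh keeps it workable on both Server 2008 R2 and Server 2012 VMs:

# Hypothetical post-deployment fix-up: re-add secondary IPs if Lab removes them.
# Adapter name and addresses are example values only - adjust per VM.
$adapter  = 'Local Area Connection'
$extraIps = '192.168.23.254', '192.168.23.253'   # counting down from .254

foreach ($ip in $extraIps) {
    netsh interface ipv4 add address name="$adapter" address=$ip mask=255.255.255.0
}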

Both of these areas will have to be addressed during this week, so I’ll post an update on how we get on.

Consistency is still key

Even though we can’t use scripts to create our environment from bare metal, consistency is still really important. We are therefore using scripts to ensure that we follow a set of fixed, replicable steps for each VM build. By leaving the scripts on the VM when we’ve finished, we also have some documentation of what is configured. We are also trying, where possible, to ensure that if a script is re-run it won’t cause havoc by creating duplicate configurations or corrupting existing ones.
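As a flavour of what re-runnable means in practice, here is a minimal sketch; the site name, port and path are invented for illustration, and the same check-before-create shape applies whatever is being configured:

# Minimal re-runnable sketch: only create the site if it isn't already there.
Import-Module WebAdministration   # IIS cmdlets, present on our IIS VMs

$siteName = 'DemoSite'            # hypothetical example values
if (-not (Test-Path "IIS:\Sites\$siteName")) {
    New-Website -Name $siteName -Port 8080 -PhysicalPath 'C:\Sites\DemoSite'
}
else {
    Write-Host "$siteName already exists; nothing to do."
}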

Each of our VMs has been generated in SCVMM using templates we built for the two base operating systems. That avoids differences in OS install and allows me to get a new VM running in minutes with very little involvement. By scripting our steps, should things go badly wrong we can throw away a VM and run through those steps again. It’s getting trickier as we move forward, though; rebuilding our SQL boxes once we have SharePoint installed would be a pain.

Frustration begets good practice

In fact, one of the most useful things to come out of this project so far is a growing set of robust PowerShell modules that perform key functions for us. These are things we had scripted already, but in the past they were scripts we edited and ran manually as part of install procedures. Human intervention meant we could get away with simpler scripts. This week I have been shifting some of the things those scripts do into functions. The functions are much more complex, as they carefully check for success and failure at every step. The end result, however, is a separation between the functions that do the work and the parameters that change from job to job. The scripts we create for any given installation are now simpler and are paired with an appropriate module of functions.
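To illustrate the pattern (this is not one of our actual modules; the function name and default zone are invented, and the DNS cmdlets assume the Server 2012 DnsServer module on the DC):

# Sketch of a module function: it does the work and checks each step, while
# the install script supplies only the per-job values. Placeholder names.
function Add-LabDnsRecord {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$true)][string]$Name,
        [Parameter(Mandatory=$true)][ipaddress]$IPv4Address,
        [string]$ZoneName = 'domain.local'
    )
    # Don't create a duplicate if the record already exists
    $existing = Get-DnsServerResourceRecord -ZoneName $ZoneName -Name $Name `
        -RRType A -ErrorAction SilentlyContinue
    if ($existing) {
        Write-Verbose "A record '$Name' already exists in $ZoneName; skipping."
        return
    }
    Add-DnsServerResourceRecordA -ZoneName $ZoneName -Name $Name `
        -IPv4Address $IPv4Address -ErrorAction Stop
}

# The per-installation script then reduces to calls like:
# Add-LabDnsRecord -Name 'intranet' -IPv4Address 192.168.23.250 -Verbose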

Deciding where to spend time

Many will read this blog and raise their hands in despair at the time we are spending to build this environment. Surely the scripted approach is better? Interestingly, our developers would disagree. They want to be able to get a new environment up and running quickly. The truth is that our Lab/SCVMM solution can push a new multi-server rig out and have it live and usable far quicker than we could do with bare metal scripts. The potential for failure using scripts if things don’t happen in exactly the right order is quite high. More importantly, if devs are sat on their hands then billable time is being wasted. Better to spend the time up front to give us an environment with a long life.

Dev isn’t production, except it is, sort of…

The crux of this is that we need to build a production-grade environment. Multiple times. With a fair degree of variation each time. If I was deploying new servers to production my AD and network infrastructure would already be there. I could script individual roles for new servers. If I was building a throwaway test or training rig where adherence to best practice wasn’t critical then I could use the shortcuts and tricks that allow scripted builds.

Development projects run into difficulties when the dev environment doesn’t match the rigor of production. We’ve had issues with products like SharePoint, where development machines have run a simple next-next-finish wizard approach to installation. Things work in that kind of installation that fail in a production best practice installation. I’m not jumping for joy over the time it’s taking to build our new rigs, but right now I think it’s the best way.

Adding DHCP Option 119 (Domain Search List) to Windows Server 2008 R2

If you’ve seen my recent blog post on making Android work in Hyper-V you will have seen my problems around DNS resolution when in the office. That turned out to be down to the DHCP options being handed back by our Server 2008 R2 box. Or rather, it was what wasn’t being handed back to the client that was the problem.

Linux and Apple OSX (as it turns out) both want a response to option 119, which defines the domain suffix search order. Windows does not request this option, and the Windows DHCP server does not offer it at all.

There are a few locations on the web that talk around this, but none properly documents the process of adding the custom option to your DHCP server and, more importantly, exactly how to encode the response (thanks go to Matt for describing that one in enough detail!).

What option 119 looks like

Option 119 hands back a byte array that encodes the domain suffix search list. The trouble is, it’s not a simple matter of turning text into ASCII hex values: there is a structure to it.

Our domain is blackmarble.co.uk. I found a handy web site that sped up the process of getting the hex of that.

b l a c k m a r b l e . c o . u k
0x62 0x6c 0x61 0x63 0x6b 0x6d 0x61 0x72 0x62 0x6c 0x65 0x2e 0x63 0x6f 0x2e 0x75 0x6b

The client expects the full domain to be split on the separating periods (which are dropped), with each section prefixed by its length (e.g. 11 blackmarble, 2 co, 2 uk), and then the whole string is null-terminated.

  b l a c k m a r b l e   c o   u k  
0x0b 0x62 0x6c 0x61 0x63 0x6b 0x6d 0x61 0x72 0x62 0x6c 0x65 0x02 0x63 0x6f 0x02 0x75 0x6b 0x00
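If you’d rather not work the bytes out by hand, a short PowerShell sketch can do the encoding. The function name is my own invention; it simply follows the layout described above (which is the RFC 3397 wire format):

# Encode a DNS search list as a DHCP option 119 byte array: each label is
# prefixed by its length and each domain is null-terminated.
function ConvertTo-Option119Bytes {
    param([string[]]$SearchDomains)
    $bytes = foreach ($domain in $SearchDomains) {
        foreach ($label in $domain.Split('.')) {
            [byte]$label.Length
            [System.Text.Encoding]::ASCII.GetBytes($label)
        }
        [byte]0
    }
    $bytes | ForEach-Object { '0x{0:x2}' -f $_ }
}

# ConvertTo-Option119Bytes 'blackmarble.co.uk'
# -> 0x0b 0x62 0x6c ... 0x75 0x6b 0x00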

Adding the option to Server 2008 R2

To add the DHCP option to Windows Server 2008 R2 you need to add a new option to the IPV4 section. Select ‘Set Predefined Options’ from the context menu on the IPV4 heading beneath the name of the server in the DHCP console.

[Screenshot: ‘Set Predefined Options’ context menu]

In the dialog that appears, click the Add button.

[Screenshot: the predefined options dialog]

Enter the name for the new option. It needs to be of type Byte, and make sure you check the Array option; the Code is 119, and you can add a description if you want to.

[Screenshot: defining the new option (type Byte, array, code 119)]

Once you’ve defined the option you can add it either to the main Server Options section, a scope or an individual address reservation.

[Screenshot: option 119 added to Server Options]

Each of the hex values you worked out (see my example above) needs to be added in turn to create the full byte array.
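If you prefer the command line, netsh should be able to do the same job. This is an untested sketch; check the optiondef/optionvalue syntax on your own server before trusting it:

# Untested sketch: define option 119 and set it at server level with netsh.
# Use 'scope x.x.x.x set optionvalue' instead to target a single scope.
netsh dhcp server add optiondef 119 "Domain Search List" BYTE 1 comment="RFC 3397 search list"
netsh dhcp server set optionvalue 119 BYTE 0x0b 0x62 0x6c 0x61 0x63 0x6b 0x6d 0x61 0x72 0x62 0x6c 0x65 0x02 0x63 0x6f 0x02 0x75 0x6b 0x00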

Hopefully this pulls all the necessary information into one handy reference and saves somebody else the confusion I experienced.

Server Core, Hyper-V and VLANs: An Odyssey

A sensible plan

This is a torrid tale of frustration and annoyance, tempered by the fun of digging through system commands and registry entries to try and get things working.

We’ve been restructuring our network at Black Marble. The old single subnet was creaking and we were short of addresses, so we decided to subnet, with separate subnets for physical servers, virtual internal and virtual development servers, desktops, wifi and so on. We don’t have a huge amount of network equipment, and we needed to put virtual servers hosted on Hyper-V onto separate networks, so we decided to use VLANs.

Our new infrastructure has one clever switch that can generate all the VLANs we need, link those VLANs to IP subnets and provide all the routing between them. By doing it this way we can, with careful configuration and use of the 802.1Q VLAN standard, present any subnet on any port of any switch. Hyper-V servers can have a single physical interface carrying traffic for multiple VLANs to the virtual switch, with individual VMs assigned to specific VLANs.
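As an aside, on Server 2012 and later hosts that per-VM assignment can be scripted with the Hyper-V PowerShell module. This is just a sketch with example values; our 2008 R2 cluster predates these cmdlets, as the rest of this tale makes clear:

# Tag a VM's virtual NIC onto VLAN 42 (example VM name and ID)
Set-VMNetworkAdapterVlan -VMName 'TestVM' -Access -VlanId 42

# Confirm the VLAN configuration for the VM
Get-VMNetworkAdapterVlan -VMName 'TestVM'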

We did the heavy lifting of the network move without touching our Hyper-V cluster, placing all the NICs of all the servers on the VLAN corresponding to our old IP subnet. We then tested VLANs over the virtual switch in Hyper-V using a separate server and made sure we knew how to configure the switch and Hyper-V to make it all work.

Then we came to the cluster. Running Windows 2008 R2 Server Core.

Since we built the cluster, Andy and I have decided that if we ever rebuild it, Server Core will not be used. It’s just too darn hard to configure when you really need to, and this is one of those times.

A tricky situation

Before we began to muck around with the VLAN settings, we needed to change the default gateway that the servers used. The old default gateway was the address of our ISA (now a shiny TMG) server. That box is still there, but now we have the router at the heart of the network, whose address is the new default gateway.

To change the default gateway on server core we need a command line tool. Enter Netsh, stage left.

We first need to list the interfaces so we know what we’re doing. IPConfig will list the interfaces and their IP settings. Old lags will no doubt abbreviate the netsh commands that we need next but I’ll write them out in full so they make sense.

Give me a list of the physical network adapters and their connection status: netsh interface show interface

Show me the IPV4 interfaces: netsh interface ipv4 show interface

To change the default gateway we must issue a set command with all the IP settings; just entering the gateway will not work, because the set command wipes the current settings first:
netsh interface ipv4 set address name="<name>" source=static address=x.x.x.x mask=255.255.255.0 gateway=x.x.x.x
Where <name> is the name shown in the IPV4 interface list, which will match the one in the ipconfig output for the interface whose gateway you want to change. We’re using a class C subnet structure; your network mask may vary.

It’s worth pointing out that we stopped the cluster service on each server whilst we did this (changing servers one by one so we kept our services running).

We had two interfaces to change. One corresponded to the NIC used to manage the server and the other to the NIC used by the virtual switch for Hyper-V. That accounted for two of the four NICs on our Sun X2200-M2s, with the SAN iSCSI network taking a third. The SAN used a Broadcom, the spare was the other Broadcom, and the remaining two connections used the two nVidia NICs on the Sun (that will become important shortly).

A sudden problem

Having sorted the IP networking, our next step was to sort out the VLAN configuration. To do that we changed the switch port connected to the NIC hosting the Hyper-V virtual switch from being an untagged member of only our server subnet VLAN to being a tagged member of that VLAN and of the new VLAN corresponding to our subnet for virtual internal servers.

The next step was to set the VLAN ID for a test VM (we could ignore the host, as it doesn’t share the virtual switch; it has its own dedicated NIC).

The snag was, the checkbox to enable VLAN IDs was disabled when we looked in Hyper-V Manager, both for the virtual switch and for the NIC in the VM.

Some investigation and checking of our test server showed that the physical network driver has a setting, Priority and VLAN, that needs to be set to enable priority and VLAN tagging of traffic, and that the default state is priority only. On full server that’s a checkbox in the driver settings. On Server Core…?

So, first of all we tried to find the device itself. Sadly, the server decided that remote management of devices from another server wasn’t going to be allowed, despite reporting that it should be. So we searched for command-line tools.

To query the machine so it lists the installed device drivers: sc query type= driver (note the space before ‘driver’)

That will give you a list, allowing you to find the network device name.

For our server, that came back with nvenetfd – the nVidia NIC.

To use that name and find the file responsible: sc qc <device name> (in our case, nvenetfd)

That returned the nvm62x64.sys driver file. Nothing about settings, but it allowed us to check the driver versions. Hold that thought, I’ll come back to it shortly.

Meanwhile

We’d also been poking at the test server, looking at the NIC settings. Logic suggested that the settings should be in the registry; all we had to do was find them.

I’ll save you the hunt:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}

That key holds all the network adapters. Beneath it are numbered subkeys (0000, 0001, etc.), and the contents of those keys enabled us to figure out which key matched which adapter. Comparing the test server with the Server Core Hyper-V box, we found a string value called *PriorityVlanTag, which had a value of 3 on the test server (priority and VLAN enabled) and 1 on the Hyper-V box. We set the Hyper-V box to 3. Nothing. No change. We rebooted. Still nothing.
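For anyone repeating the hunt, here is a rough PowerShell sketch of that process; the 0001 subkey is an example, so identify yours from its DriverDesc value first:

# List the adapter subkeys with their descriptions, then set *PriorityVlanTag
# on the right one. '0001' is an example subkey; some subkeys deny access,
# hence the SilentlyContinue.
$class = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}'

Get-ChildItem $class -ErrorAction SilentlyContinue | ForEach-Object {
    $props = Get-ItemProperty $_.PSPath
    if ($props.DriverDesc) { '{0}  {1}' -f $_.PSChildName, $props.DriverDesc }
}

Set-ItemProperty -Path "$class\0001" -Name '*PriorityVlanTag' -Value '3'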

Then we noticed that in the key for the NIC there was a subkey, \Ndi\params, containing a key called *PriorityVlanTag. That key held settings listing the options displayed in the GUI settings dialog, along with the value that gets set. For the nVidia, that value was 2, not 3. We duly changed the value and tried again. Nothing.

So we decided to update the drivers. This brings us back to where I left us earlier with the sc command.

To update a driver on Server Core, you need to unpack the driver files into a folder and then run the following:
pnputil -i -a <explicit path to driver inf file>

After failing to get any other drivers to install it looked like we had the latest version and the system was not letting go. So we did some more research on the internet (what did we ever do before the internet?).

It transpires, for those of you with Sun servers, that the nVidia cards appear not to support VLAN IDs on traffic, despite having all the settings to suggest that they do.

Darn.

A way forward

Fortunately we have a spare Broadcom in each of our Hyper-V host servers, and we are now switching the virtual switch binding from the nVidia to the Broadcom on each of them. We didn’t even have to hack around with registry settings: once we did that, the VLAN ID settings in Hyper-V simply sprang into life.

The moral of this story is that if you want to use VLAN IDs with Hyper-V and your server has nVidia network adapters (and certainly if it’s a Sun X2200-M2), stop now before you lose your hair. You need to use another NIC if you have one, or install one if you don’t. Hopefully, however, the command-line tools and registry keys above will help other travellers who find themselves in a similar situation.

Fixing SharePoint 2007 IIS WAMREG DCOM 10016 activation errors on Server 2008 R2

Anybody who works with SharePoint will grumble if you mention DCOM activation permissions. No matter how hard we try, how many patches we install (or how hard we try to ignore it), granting activation and launch permissions to the SharePoint service accounts is like plugging a dike with water-soluble filler.

On Server 2008 R2 our job is made that much harder by the fact that, by default, even administrators can’t edit the security settings for the IIS WAMREG service (GUID {61738644-F196-11D0-9953-00C04FD919C1}, for when you see it in your application event log).

The fix is to change the default permissions on a registry key, which you can only do by taking ownership of the key. My only comment would be that those permissions were locked down for a good reason in Server 2008 R2 and it’s somewhat frustrating that we need to do this.

Anyway, the key you are looking for is:

HKEY_CLASSES_ROOT\AppID\{61738644-F196-11D0-9953-00C04FD919C1}

To change the ownership you need to click the Advanced button on the Permissions tab of the properties dialog, then select the Owner tab. I’d recommend changing the owner to the Administrators group rather than a specific user, and make sure the permissions for TrustedInstaller are the same after you’ve finished as they were before you started.

Once done, you can edit the DCOM permissions for the IIS WAMREG service in the same way as on other versions of Server 2008.

NewSID fails on Windows Server 2008 R2

The title says it all. I’m currently building a virtual lab to test DirectAccess, and every time I run NewSID on Windows Server 2008 R2 the system bluescreens irrevocably on reboot. I’ve now switched to using sysprep to change the SID. Here’s hoping the Sysinternals guys update what is undoubtedly one of the most useful tools around!