The blogs of Black Marble staff

Analysis tools not to be without

Maybe it is just my background in network analysis, but I do feel any developer working with remote servers needs a protocol analyzer, in just the same way as you need the SQL profiler when working with a Microsoft SQL server, especially with auto-generated code. Without tools like these how can you work out what is actually being sent on the wire? And if the data on the wire is wrong, what hope is there for the higher levels in the application?

So two new versions of old favorites both now supporting Vista

Update on my old Dell

I posted in the past about all the problems I had with my overheating Dell 5150, and with the problems trying to get the Vista betas working on it. Well, an update on both:

  1. After putting new thermal grease on the CPU heatsink it never overheated again
  2. With the release version of Vista every bit of hardware (bar the modem which I quickly found a driver for) was detected and worked first time and I have seen no problems with the PC since.

So I have a healthy, three year old laptop running Vista with the Aero interface (and the 5150 does have a nice high res. screen) perfectly adequately.

All that said, I have still moved on to one of our new company-standard Acer 8210 laptops with a Core2Duo running Vista.

Who stole Microsoft Marketing and where are they?

Microsoft marketing have as of today vanished and are presumed missing. They normally show themselves sometime at the end of the product cycle to replace cool and interesting project names with irrelevant and confusing messages about the product. Without the Microsoft marketing department's intervention a large proportion of the work I do explaining the Microsoft developer story might not be needed.

So what is the evidence for these mysterious and oft-unseen people? In the last year great project names have gone: Avalon became WPF, Indigo became WCF, WF became WF (mmm), Longhorn became Vista. Now Soma (VP Developer Division at Microsoft) has announced that WPF/e (WPF everywhere) is going to be called, wait for it, wait for it, Silverlight.



WOW, a really cool name. A bit too cool if you ask me, and too close to Straylight from Neuromancer. Are they planning something? A lead-in for conspiracy theorists, maybe.

Follow up on our nVidia RAID problems

I had posted on problems with the nVidia RAID on our SunFire servers. Well, I think I now have the root cause of the problems, and it is not the Sun hardware, the nVidia RAID or the Windows 64-bit drivers.

All the problems we had were when we used mirrored pairs of Western Digital 500GB SATA drives that we had bought in a single batch of four drives. Identical drives bought on another day were fine, as were 500GB drives from Maxtor and Hitachi.

After testing these four drives we found three of them kept developing low-level, unfixable bad blocks, irrespective of the PC they were used in, whether Sun or any other brand. It seems that when one of these bad blocks was hit:

  • the nVidia RAID caused the mirrored pair to lose their sync and the server hung. When rebooted, the server had two drives and the mirror had to be recreated.
  • if Windows Software Mirroring was used we just lost the mirrored pair; at least there was no server hang. The mirror would try to recover with mixed success, usually failing at the same point each time. Sometimes, however, it worked, hence all our confusion in finding the root cause of the problem.

Given this experience we are staying with Windows Software Mirroring as at least the server does not hang.

Now in twenty-odd years in this business I have never had three out of four drives fail in a single batch; in fact I don't think I have ever had a 'dead on arrival' hard disk from any of the big name brands.

My guess is these four drives were dropped at some point after they left the factory QA department and before we bought them. The faulty three are off back to WD under warranty; I wonder if the fourth will survive? It is certainly not going into any system that is critical.

What happened to the idealists?

Douglas Coupland's Jpod has been doing the rounds in the office of late. I enjoyed MicroSerfs, so approached Jpod with excitement.

Frankly, I'm disappointed.

It's not the writing - I've enjoyed pretty much all of his books. It's not that the books are similar in approach and style (they are) but rather the contrast in the lives of the characters.

Overall, MicroSerfs was optimistic. The characters in the book were using their talent to make the world a better place. The technology in Jpod is cynically created to make the most money. I finished MicroSerfs feeling good about what I do for a living; I'm struggling through Jpod as it slowly destroys that feeling.

Let's set aside whether this contrast is intentional - I don't want to discuss what Mr Coupland is trying to say. What I want to get across is something that I have felt for a while and which Jpod merely reinforced:

The IT industry is becoming more and more cynical.

Perhaps this is a function of its age and maturity; perhaps it has more to do with the complexity of modern IT solutions; perhaps it is that we have accomplished so much so quickly that progress can only become harder and slower.

When I started working, the University for which I worked was only just embracing desktop computers. I was involved in promoting desktop PCs and workgroup servers to departments and it was an exciting time. Throughout my career there, I was involved in the creation of new services that were intended to make people's lives better, easier, simpler, more efficient, and I got a great deal of satisfaction from it.

I still get satisfaction from delivering those kinds of solutions, and I like to think that my colleagues and I here at Black Marble still aim to make the world a better place through technology, in our own way.

I'm less convinced that the rest of the world still feels that way. What do you think?

Analysing Active Directory

I think I've mentioned before how I've been updating our IT infrastructure. Company growth has meant a need for expanded services. Add to that new versions of SharePoint and Exchange, mix in a need to run virtual servers for development and you have a need for more tin.

Over the past six months I've expanded our domain to keep pace with our growing needs. The number of physical servers we have has increased, with a few more virtual servers for specific roles that I prefer to keep separate but which don't really merit their own box.

As part of this growth, I added a second domain controller. Our existing DC was also running Exchange 2003, and this situation has caused me the most headaches in the sliding block puzzle of service upgrade and migration: we couldn't demote the DC on our old server because of Exchange 2003, but I was reluctant to put in Exchange 2007 until I had redundancy of critical services (DC, DNS, etc).

Updating Domains, getting ready for Exchange

I will admit at this point that my knowledge of AD is not as deep as I would like, although it is increasing daily. That does mean, however, that I check before I leap - find articles on MSDN, TechNet and the wider blogosphere to find the pitfalls so I avoid pratfalls.

So, I read carefully about raising the functional level of the Forest and Domain when installing a 2003 R2 domain, made sure everything was patched and service packed before starting, read and re-read the instructions. When confident I had run through all the prerequisites I ran dcpromo to add my domain controller.

I was then left with two servers, both of which had the necessary tools to manage AD, both of which were registered in DNS as DCs, both of which appeared to be fine.

Nothing I read suggested that I needed to check anything else to make sure the process had completed... (You can see where this is going, can't you...?)

Exchange 2007 - the big transition

Over the first weekend in April we transitioned from Exchange 2003 to Exchange 2007. Once again, I did my reading. I ran the Exchange Best Practice Analyser and made sure that our Exchange 2003 installation was in tip-top condition. I compared two or three different sets of instructions on how to run through the process, settling on one from an Exchange community site because of some extra little nuggets of insight it contained.

The transition went relatively smoothly. The new server went in, was configured correctly and the Exchange 2007 site was connected to the Exchange 2003 site. Mailboxes were transferred (we had a problem with one, but we fixed it) and clients were checked to have connected to the new server.

Once happy, we uninstalled the old Exchange, as per instructions.

It took a full day, but we were being careful and thorough. We thought it had gone fine.

The next step would be to remove our old DC from the AD and decommission the server. Being cautious, we wanted to test that things wouldn't stop if we removed the old DC, so we unplugged the network cable...


Everything stopped - Exchange clients disconnected, logons stopped, everything!

Is there a doctor in the house?

Stage one when hit with a problem - gather as much information as possible.

We looked at our systems, we checked logs, we watched the Outlook clients connecting to Exchange. When we disconnected our old DC, nothing seemed to want to talk to the new DC. I checked the Exchange server settings and made sure the server was set to use the new DC for its configuration and all seemed fine.

We noticed an error saying that the clients couldn't connect to a Global Catalogue server, so I did some more reading, realised that the old DC was our global catalogue server and followed the steps to change the role over to the new DC. Everything said it had worked, but nothing changed.

I did some more reading about role masters and set the new DC to be the master for each role - at least I thought I did - through the Active Directory Users and Computers tool. Still nothing.

At this point I decided that either I could spend days or weeks researching and prodding, or I could call in the cavalry. The support team we have access to as a Gold Partner are fantastic - I can never praise them enough - and sure enough I had people on the problem within an hour of logging the call.

Because we initially thought the problem was with our Exchange config, we dealt with a very efficient Exchange support guy. He worked methodically through the problem, and started to look deeper into our domain and DCs as he zeroed in on it being a domain issue.

At this point, I encountered the AD support tools being used in anger for the first time. I passed the support guys dozens of log files. We also discovered what appeared to be the problem - my new DC wasn't really a DC!

That last statement is a bit too simplistic. Our new DC was happily replicating the AD. It reported everything being fine when examined with replmon. Both DCs agreed on their view of the world.

What I didn't know was that in addition to the AD replicas, a NETLOGON share is created on the new DC by dcpromo. I also did not know that this process had failed - at no point did anything tell me. Because there was no share, the server was not dealing with client requests correctly, which is why our systems had a fit when I unplugged the old DC.
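A check for the missing share is easy to script. What follows is only a minimal Python sketch under stated assumptions, not part of any support tool: it assumes a healthy domain controller exposes both NETLOGON and SYSVOL as shares (SYSVOL is added here as an assumption, since the text above only names NETLOGON), that the script runs on a Windows machine able to reach the server, and that `dc2` is a made-up server name. The function names are hypothetical too.

```python
import os
from typing import Callable, List, Sequence

# Shares assumed to be present on a healthy domain controller.
# NETLOGON comes from the story above; SYSVOL is an added assumption.
REQUIRED_SHARES = ("NETLOGON", "SYSVOL")


def unc_path(server: str, share: str) -> str:
    # Build a UNC path: two backslashes, server, one backslash, share.
    return "\\\\" + server + "\\" + share


def missing_dc_shares(
    server: str,
    exists: Callable[[str], bool] = os.path.isdir,
    shares: Sequence[str] = REQUIRED_SHARES,
) -> List[str]:
    # Return the names of the required shares that cannot be reached.
    # `exists` is injectable so the logic can be exercised without a network.
    return [share for share in shares if not exists(unc_path(server, share))]
```

Calling `missing_dc_shares("dc2")` straight after dcpromo would return an empty list on a server where both shares can be reached, and the names of whichever shares are absent otherwise. It says nothing about why a share is missing, only that it is.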

Peering into a deep, dark well

Having identified the fault, my Exchange guy called in an AD specialist to assist. He ably worked through the fault. There is a sequence of steps to follow which will trigger a rebuild of the netlogon share. We worked through them. They didn't work. We knew they didn't work because the share wasn't created. Apart from a couple of event log messages which I didn't consider to be helpful, nothing told us what was wrong.

Having failed to rebuild the share on the new DC, my AD ninja looked at the old DC. He decided to rebuild the same share on the existing DC, the thinking being that the replication was failing because of a fault on the source, rather than the destination. In order to do this, the domain group policies would be destroyed and rebuilt as defaults.

This process took some time, but to cut a very long story short, it appears that our default group policy objects were corrupted, which was blocking the replication. By deleting them and rebuilding the sysvol directory structure on our original DC, then forcing a rebuild on the new DC, the AD was fixed.

My eternal gratitude to the Microsoft support guys. My point, long and meandering though the journey has been, is this: At no point did I see anything which suggested corruption of those objects. At no point did I see anything which suggested they were the cause of the replication fault.

My toolbox is missing!

In order to get the information the support guys needed, I had to install first the Support Tools from the installation media and then the resource kit tools downloaded via the web. Those tools should have been installed by default, or at least should have been added when I created my new DC.

Even when I'd installed the tools, they didn't really give me much information. Now, I will readily admit here that I am new to the tools, and continued reading will doubtless help me in this regard, but the key point is a simple one:

I can't see what's going on!

Shhh... say it quietly... NDS

I supported IT solutions including Novell servers for fifteen years before joining Black Marble. In my previous role we had some thirty servers with a fairly complex, but well structured NDS directory. Over those years, we had some problems with replication and corruption, and every time we did, we started with the same procedure: We watched.

What Active Directory is lacking, in my humble opinion, is an equivalent of the Novell DSTrace tool. DSTrace allows you to watch the activity of your directory replicas. By careful use of the various options you can configure your servers to show you replication traffic, requests and responses and more. Colour coding allows you to spot errors and warnings and after a while you start to see patterns in the mass of text. If we had an NDS problem we could use DSTrace to get a feel for the cause - you could see if there were corrupt objects which weren't replicating between servers. You could even figure out which servers were right and wrong.

Once you'd seen the fault, the dsrepair tool allowed you to tackle it either with surgical precision or with heavy artillery. You could force a replication of an individual object, overwriting the corrupted copy by force, or use drastic measures like deleting a replica of the directory or a partition.

Where are those tools for Active Directory? If they exist, please tell me, because I'd like to get my hands on them. I can't imagine dealing with huge installations of AD without that kind of toolset.

A wishlist...

What would I like to see then? I'm writing this post before I start rummaging around the web, and if I find examples of these tools I'll post about them.

  1. A tool which checks the integrity of the directory and its objects, and identifies where replicas on different servers disagree.
  2. A tool that allows me to see all the AD traffic in real time - logging to a database might be useful, but just seeing the messages on screen would be a start. I want to be able to toggle different messages - errors, warnings, replication traffic, client requests and responses etc to get a feel for what works and what doesn't.
  3. A tool to allow me to fix individual objects - to replace them from backup or to overwrite them with a copy from another replica (by far my preferred method).

If this lot already exists then tell me. If there are good books on the subject then point me at them. I've found some support articles which are helpful, but not as much as I'd like. I'm not precious - if this all stems from a fundamental misunderstanding or lack of knowledge on my part I'm happy to admit my mistake. However, at this point I'm leaning more to it being an indication that AD still hasn't matured to the level of NDS in terms of management and control.
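To make the first wishlist item concrete, here is a minimal Python sketch of the comparison such a tool would perform. It is an illustration only, not a real AD utility: it assumes each replica has somehow been exported to a mapping of object name to a version stamp, and how that export would be done is left open. The function name `diff_replicas` and the sample objects are invented for the example.

```python
from typing import Dict, List


def diff_replicas(first: Dict[str, str], second: Dict[str, str]) -> Dict[str, List[str]]:
    # Compare two snapshots mapping object name -> version stamp.
    first_names, second_names = set(first), set(second)
    return {
        # Objects that exist on one replica but not on the other.
        "only_in_first": sorted(first_names - second_names),
        "only_in_second": sorted(second_names - first_names),
        # Objects present on both replicas but with different versions.
        "different": sorted(
            name for name in first_names & second_names
            if first[name] != second[name]
        ),
    }


# Hypothetical snapshots of the same directory taken from two servers.
old_dc = {"CN=Alice": "v3", "CN=Bob": "v7", "CN=Default Policy": "v2"}
new_dc = {"CN=Alice": "v3", "CN=Bob": "v8", "CN=Carol": "v1"}

report = diff_replicas(old_dc, new_dc)
for category, names in report.items():
    print(category, names)
```

With the sample data, `CN=Default Policy` is reported as existing only on the first server, `CN=Carol` only on the second, and `CN=Bob` as present on both but disagreeing. A real tool would also need to say which copy is the right one, which is the hard part.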

An extreme hour

I went to the Yorkshire Extreme Programming Club last night; the meeting included an Extreme Hour. It was an interesting experience: the idea is that in an hour you go through a number of 10 minute XP iterations, doing 'development' by drawing the solution on a white board.

Yesterday we had three separate groups of six, each with two customers, two developers (pair programming, i.e. with one pen) and two QA/testers, and all had to design moon cheese harvesting solutions.

So did I learn much new about XP? Well, not sure, but the exercise does show the importance of communication within the group. The problem with running the exercise in a social setting (a pub) is that the beer is not conducive to structured thinking, and a subject like moon cheese gives huge scope for flights of fancy - my group ended up in a discussion of whether to outsource the production of killer robots (to protect the cheese mine from Clangers - don't ask how we got to this point) or whether the actual robot manufacturing was a task our developers should do (they were not keen to draw 10,000 robots, I cannot think why).

What I would say if you are thinking of running such a session is:

  1. Be tight on the time keeping - like a scrum sprint an XP iteration should finish on time, even if features are cut
  2. With six people in a group I would be tempted to have one customer, two teams of pair programmers (four people) and a single tester. This I think would help show the issues of inter-team communications

If nothing else this was a good social evening, so if you are in the area pop along to the next meeting; see the web site for details.

When the server just stops

Today I rebuilt a PC with new drives and all seemed OK, but after 15 minutes or so it kept stopping (no nice shutdown), irrespective of what the PC was doing. I even swapped back the old disks all to no avail, hence I was stumped for a while.

The problem turned out to be that the case was somewhat full and a wire was stopping the system fan turning, so the CPU was going into thermal shutdown and leaving no event log messages. The motherboard was not complaining either, as the fan was still drawing current.

I should have spotted this one earlier, as it has happened before, so it goes on my blog of stupid things you forget to check for!