Some Points with working with SQL Always On Availability Groups (when doing a TFS Migration)
Just shy of a full year from my last post. Crikey
So it’s another TFS related post, but not specifically to do with TFS. I still perform TFS migrations/upgrades from time to time here at Black Marble as we still have many customers who can’t migrate to Azure DevOps for one reason or another.
A few weeks ago we performed a TFS upgrade for a customer, bringing them from TFS 2013.1 to Azure DevOps Server 2019.0.1.
The new SQL Server installation that underpinned this new environment was a SQL Server 2016 Always On Availability Group, with 2 synchronous nodes and one async node contained in a figurative shed somewhere a few miles away.
Now our initial migration/upgrade to this server had gone well, there a dozen or so smaller collections we moved over with the initial migration but due to deadlines/delivery/active sprints etc there were a couple of teams who’s collections couldn’t be moved in that initial setup. So we came back a few months later to move these last few collections which coincidentally happened to be the biggest.
The biggest of the collections was approximately ~160gb in size, which wasn’t the biggest collection I’d seen, (not by a looooong shot) but not small by any means.
Could we get this thing into the Availability Group? Nope. Every time we completed the “Promote Database” Wizard the database would fail to appear in the “synchronized” or even “synchronizing” state on any of the nodes. Inspection of the file system on each of the secondary nodes didn’t even show the database files as being present.
So we had a think about it (rinsed repeated a few times) and someone called their SQL DBA friend who told us the GUI based wizard hates anything over ~150gb. We should let the wizard generate the script for us, and run it ourselves.
Well lo and behold the promotion of the databases worked…mostly. We saw the files appear on disk on all servers in the group, and they appeared as databases in the availability group but with warnings on them that they still weren’t syncing.
So on a hunch I re-ran the GUI wizard to promote the databases again, which among other things performs a series of validation checks. The key validation check (and the backbone of my hunch) is “is this database already in an AG?”. The answer was yes, and this seemed to shock SSMS into recognizing that the job was complete and the synchronization status of the synchronous and asynchronous replicas jumped into life.
My guess is that promoting a db into an AG is a bit of a brittle process, and if some thread in the background dies or the listener object in memory waiting for that thread dies then SSMS never knows what the state of the job is. Doing it via script is more resilient, but still not bullet proof.
Also worth noting for anyone who isn’t a SQL boffin, an async replica will never show a database as “synchronized” only ever “synchronizing”. Makes sense when you think about it! (Don’t let your customers get hung up on it).