I have previously completed some work for a mid-sized organization, and some of the things I’ve come across in the few short weeks I’ve had access to their systems are, frankly, astonishing. I’ve barely even scratched the surface and we’re already well into “so bad it’s not even wrong” territory here; a few examples:

  • Critical Servers not under warranty
  • Critical Servers not backed up
  • Servers backed up to disks on the same disk array as the live data
  • Backups taken to tape and then left in the tape loader indefinitely
  • Servers not patched, ever
  • Antivirus not installed on servers, or installed but disabled “for performance reasons”
  • Server hardware/software not monitored
  • Speed/Duplex on all network interfaces set manually
  • Public address ranges for internal network addressing
  • A single broadcast domain for all devices on the network
  • 120m+ Ethernet cable runs
  • Cat3 cable runs within the core infrastructure (undocumented)
  • New user credentials sent by email, via an externally hosted mail system, to user’s line manager
  • All IT Staff granted unaudited access to the entirety of the file servers
  • All IT Staff local admins on all servers

And that’s just the stuff I’ve found so far and haven’t already repressed. I honestly have no idea where to begin; if there were such a thing as a Worst Practice Guide, they’d have not only followed it to the letter, but added their own extensive appendices as well.

It would appear that the Exchange 2010 EMC isn’t particularly bright; when you launch it, it picks a CAS (Client Access Server) from AD to connect to. This is fine.

However, should that server cease to exist, by which I mean Exchange is uninstalled and it is properly decommissioned, then the EMC will continue to try and connect to it. Even after the connection fails, it’ll keep on merrily plugging away at the non-existent server, never considering that there are probably other servers it could try.

This is very annoying and seemingly very stupid behaviour. To work around it, close the EMC, fire up your registry editor of choice, locate the following key: HKCU\Software\Microsoft\Exchangeserver\v14\AdminTools\ and delete the NodeStructureSettings value. This will reset the EMC and cause it to pick a new CAS to connect to; it may also affect other settings that you’ve changed in the console.

Another option is to close the EMC, navigate to C:\users\<username>\AppData\Roaming\Microsoft\MMC\ and delete the Exchange Management Console file. This will also reset the EMC and will definitely reset any customisations you’ve made to the console.
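
If you find yourself doing this regularly, both workarounds can be scripted from PowerShell; a rough sketch, assuming default paths (close the EMC before running either, and bear in mind the second option wipes all your console customisations):

```powershell
# Option 1: delete the cached CAS connection so the EMC picks a new one on next launch
Remove-ItemProperty -Path "HKCU:\Software\Microsoft\ExchangeServer\v14\AdminTools" `
    -Name "NodeStructureSettings" -ErrorAction SilentlyContinue

# Option 2: the nuclear option - delete the saved console file entirely
Remove-Item "$env:APPDATA\Microsoft\MMC\Exchange Management Console" -ErrorAction SilentlyContinue
```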

Why you should ever have to do this is something of a mystery to me; perhaps Microsoft just never expected anyone to decommission an Exchange server once it was built.

So, you’ve got an Exchange 2007 CCR Cluster set up and all is well in the world; your data is safely replicated offsite so that in the event of a disaster, you can have your users back up and emailing in the time it takes a DNS record to update.

But then, disaster! A different disaster to the previously mentioned one, obviously, because this disaster causes a cluster failover and the connection between nodes is down for just long enough that they get out of sync and require a reseed to fix.

At this point I’d like to jump off on a slight tangent to bemoan the inconsistency with which Exchange 2007 handles the interruption of replication traffic. On the one hand, you can shut down one node for a couple of hours and when you bring it back up again replication resumes quite happily, but on the other hand if you have 5 minutes of iffy network connectivity, suddenly the databases* are irrevocably out of sync and need to be reseeded.
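
For reference, the reseed itself is done from the Exchange Management Shell against the passive node; a sketch with placeholder cluster and storage group names, so adjust for your own environment before trusting it:

```powershell
# Check the health of the storage group copy first
Get-StorageGroupCopyStatus -Identity "CCRCLUSTER\First Storage Group"

# Suspend replication, then reseed the passive copy from scratch;
# -DeleteExistingFiles clears out the stale database and log files.
# Replication resumes automatically once the seed completes.
Suspend-StorageGroupCopy -Identity "CCRCLUSTER\First Storage Group"
Update-StorageGroupCopy -Identity "CCRCLUSTER\First Storage Group" -DeleteExistingFiles
```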

Anyway, you’re not too concerned by this turn of events because, while a reseed of your ~90GB database takes a few hours, it’s not like the cluster is going to fail back while you’ve got databases in an inconsistent state, is it?

Well, it shouldn’t have happened, but it did; the bloody thing failed back while halfway through reseeding the database and then, obviously, couldn’t mount it at the other end. This posed something of a problem, because Move-ClusteredMailboxServer (or the GUI equivalent) gets upset when your databases aren’t in sync and refuses to let you fail over and Restore-StorageGroupCopy would have forcibly mounted the database sans up-to-date logs and effectively reverted it to the state it was in before it all failed over the first time, binning a lot of emails in the process.

Thankfully, Move-ClusteredMailboxServer has a very handy -IgnoreDismounted option which allows you, when you’re really sure, to skip all replication health checks on dismounted databases, letting you fail the server over and remount the (more) up-to-date version of the database, whereupon you can attempt to re-reseed it. So, if you find yourself in a similar quandary, with your databases all out of sync and at risk of losing hours’ or even days’ worth of data, before starting a restore from tape and printing your CV, you can always try:

Move-ClusteredMailboxServer -Identity <CCR Cluster Name> -TargetMachine <Target Cluster Node> -IgnoreDismounted

Just remember that there’s a good chance of at least some data loss, but if you’re in a position to need to use it, the alternative is probably a lot of data loss, so it’s a risk that might be worth taking.

* I know that technically Exchange 2007 CCR replicates at the Storage Group level rather than Database level, but as you can only have a single database in a CCR replicated Storage Group and “database” is easier to type, I’ve used it instead.

Well, the BBC iPlayer for Android has been released and I’m really disappointed.

Before the app I could go to the iPlayer website on my phone and stream recorded TV and Radio programs over 3G or Wi-Fi; no Live streaming, but otherwise pretty good.

With the app, however, I can’t stream anything unless I’m on a Wi-Fi connection and, as far as I can see, there’s no way to override it. So the fact that they now offer live streams is all but worthless: if I’m somewhere with Wi-Fi, I’m usually somewhere with a TV or Radio. What’s even worse is that they’ve now applied the same fucking policy to the mobile version of the iPlayer website too, so I can’t even stream *that* over 3G any more.

Why have they done it? No idea, but it’s bloody stupid. By all means make it default to Wi-Fi only to stop all the idiots complaining when they stream the entire Eastenders back catalogue over 3G and run up a £3,000 phone bill, but I’m not one of those idiots; I want live 3G streaming and, at the very least, I want my recorded 3G streaming back. I can only imagine that the mobile networks threatened to block all iPlayer traffic if the BBC released their app with 3G support, because we all know that’s easier than actually upgrading your networks to meet demand.

Oh, and it can’t run in the background, or when your phone switches the screen off, either.

It’s been removed from my phone after a grand total of 8 minutes. Not happy.

Update: According to the FAQ here “[they] are working to make the service available on 3G networks in a future release of the BBC iPlayer Android App.” So that’s all OK then…

Update: This turned out to be a Nagios-related PowerShell script running against Exchange that was being launched by a service running as LocalSystem, which didn’t have permissions to perform various tasks within Exchange. As soon as we stopped running the script, the errors went away. Still no idea why the errors were popping up on servers in the Org that weren’t referenced by the task, but that’s Exchange for you.

Right, I’m throwing this out on the tiny off-chance that anyone has come across it and knows of a solution, because so far, Microsoft support haven’t and don’t.

Frequent entries in the Application logs of all Exchange 2010 Servers as follows:

(Process w3wp.exe, PID <PID>) “RBAC authorization returns Access Denied for user <Mailbox Server Computer Account>. Reason: No role assignments associated with the specified user were found on Domain Controller <Domain Controller FQDN>”

Several things.

1) Everything in <> has obviously been changed by me to remove details of my internal infrastructure, the actual errors contain real PID, account and server values. In all cases, the computer account is that of the Mailbox server, even though the error shows up on Mailbox, CAS and UM servers.

2) This is not, I repeat, not the same issue as you’ll find all over Google with a very similar error message that features a user account rather than a computer account. That one is usually caused by people not setting up permissions for their administrators properly in the ECP or broken permissions inheritance on accounts.

3) This error has survived a complete rebuild (OS and Exchange) of the Mailbox server, a re-running of the domain/forest prep tools and a couple of weeks’ examination by Microsoft Support. We’re currently looking at rebuilding all the other 2010 servers to see if it survives that too.
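
For anyone who fancies poking at the same problem, the (apparently missing) role assignments can be inspected from the Exchange Management Shell; a sketch with placeholder account and server names, and no guarantee it turns up anything useful:

```powershell
# List any RBAC role assignments for the Mailbox server's computer account
# (computer accounts take a trailing $)
Get-ManagementRoleAssignment -RoleAssignee "DOMAIN\MBXSERVER$"

# Exchange servers normally get their rights via the Exchange Servers group,
# so it's worth confirming the computer account is actually a member
# (requires the AD PowerShell module)
Get-ADGroupMember "Exchange Servers" | Where-Object { $_.Name -eq "MBXSERVER" }
```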

Any suggestions will be gratefully accepted.

I was going to write a long and rambling post about this, but I helpfully found someone who’s already done it for me: http://www.west-wind.com/weblog/posts/32765.aspx

Some salient highlights:

This is an application I’ve created and have full control over in terms of build, but I can’t figure out why it will not pin to the Windows 7 taskbar. All other applications pin just fine, but this particular one will not pin or be dragged onto the taskbar. Not from a running application, not from a dragged shortcut or by using the context menu to pin it to the taskbar. The Task menu that pops up has nothing more than a Close this Application on it instead of the usual pin options.

Windows has a few reserved names that include things like Documentation, Help, Setup, Readme etc. that are not pinned to the taskbar. These are exceptions in Windows […] But it turns out the rules in Windows aren’t exact matches, but it looks for anything that contains these names. So anything that contains the word Help in the EXE name is considered a special item.

These ‘restricted’ values are determined by a registry key at:


with these default values:

Documentation;Help;Install;More Info;Readme;Read me;Read First;Setup;Support;What’s New;Remove

One can only speculate why in the hell Microsoft decided a) To do this and b) Not to document it anywhere or c) Make it apparent in the UI that there’s a reason you can’t pin these files.

In addition to this, it’s fairly well known that you can’t pin things from Network locations either, though this can be worked around by doing the following:

  1. Create a new empty text file in a local folder
  2. Rename the file to the same as your desired network-located file
  3. Right-Click on the file and Pin it to the Start Menu/Task Bar
  4. Right-Click on the pinned item (If it’s on the Task Bar you’ll then need to right-click again on the item in the pop-up list) and choose properties
  5. Change the Target and Start In values to the network-located file, pick an icon if needed and click OK
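
If you’d rather not create and rename dummy files by hand, most of the steps above can be replaced by knocking up the shortcut in PowerShell via the WScript.Shell COM object; a sketch with placeholder paths (the pinning itself is still a manual right-click):

```powershell
# Create a local .lnk pointing at the network-located executable;
# once it exists locally it can be pinned like any other shortcut
$shell = New-Object -ComObject WScript.Shell
$lnk = $shell.CreateShortcut("$env:USERPROFILE\Desktop\MyApp.lnk")
$lnk.TargetPath = "\\server\share\MyApp.exe"   # placeholder network path
$lnk.WorkingDirectory = "\\server\share"
$lnk.Save()
```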

Again, it’s unclear why Microsoft chose to arbitrarily restrict your ability to pin things in this manner, but I really hope they fix it at some point because it’s an awful lot of arsing around to do something that should be very straightforward.

Actual vendor response when we asked them for some high-level documentation regarding the installation and configuration of their software prior to rolling it out to a bunch of sites we manage:

“Why do you want that? It’s not normally something we’d give out…or do”

They then proceeded to admit that they don’t have any internal documentation for install or configuration, let alone anything to give to customers, and so would have to write some from scratch if we really, really wanted it.


Seriously Microsoft, I’m starting to go right off you.

Forefront Threat Management Gateway 2010, your ISA Server replacement, your “Needs a 64-bit Operating System” ISA Server replacement that requires Server 2008 or Server 2008 R2, doesn’t fucking support IPv6?

IPv6 Support and remote installation
Forefront TMG does not support IPv6.

I mean yes, it supports DirectAccess and thus parts of the IPv6 stack, but that’s it. No IPv6 routing, firewalling, proxying or anything else. You know, IPv6, that silly internet standard that’s been around since 1998 and that is enabled by default in all the supported operating systems for TMG.

“Forefront” indeed.

As you may know, all of the Microsoft engineers involved with creating Exchange 2007 were tragically afflicted by sudden onset amnesia, which prevented them from remembering any of the new features or technology they had developed for it when they came to start work on Exchange 2010. This may seem like an unlikely occurrence, but it is the only possible scenario that could explain why Microsoft made the design decisions that they did.

In an Exchange 2007 HA scenario, such as CCR, your clients would point to the mailbox server cluster name. This cluster name would always point at the Active node of your cluster so that no matter your failover situation, your clients could connect to their respective databases. Now you still had to make sure that there were CAS and HT servers available for each AD site with a Mailbox server for mail flow and OWA to work correctly, but that was easily achievable (albeit with a bit of registry fiddling to force the sitename if the servers weren’t actually in the right site) and even if you didn’t have them, at least users could still access their mailboxes and queue messages in their outbox for delivery when everything came back up again.

Leap forward to Exchange 2010; Microsoft, in their infinite wisdom, have changed things so that the clients now connect to the CAS servers and not the mailbox servers. The CAS server they connect to is based on the RPCClientAccessServer attribute on the mailbox database that their mailbox resides on. Microsoft recommend that you use NLB to cluster CAS servers and then modify the RPCClientAccessServer attribute to point to the NLB name; otherwise, if you lose a CAS server, all the clients pointing to it will break. Of course, you can’t NLB across subnets, so if you’re replicating your mailboxes off site (and you’d be stupid not to if you can) then you can’t have CAS resiliency without stretched VLANs or other nasty kludges. Microsoft’s actual recommendation for this scenario is to manually update the DNS record for the unreachable CAS array to point to the one on your other site, which is fine if it happens in-hours and you actually know about it as it happens, rather than 10 minutes later (or even the next morning) as the support calls come in. Don’t worry though, they have a solution for that too; just buy, install and configure SCOM to monitor the Mailbox server and alert you when it fails over.
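
For the curious, the attribute in question lives on each mailbox database and is set from the shell; a sketch with placeholder names of the sort of change you’d be making by hand in the middle of a site failure:

```powershell
# See which CAS (or CAS array) each database points its clients at
Get-MailboxDatabase | Format-Table Name, RpcClientAccessServer

# Repoint a database at the CAS array in the surviving site
Set-MailboxDatabase "DB01" -RpcClientAccessServer "outlook.example.com"
```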

There was a half-arsed fix planned for SP1 (it would have automatically updated the RPCClientAccessServer value to point to the CAS array on your other site), but your users would have had to restart Outlook before it would read the updated attribute, so it wouldn’t have been a huge improvement, and in any case it didn’t make it into the service pack. There’s no ETA that I’m aware of.

So, to summarise, in a multi-site environment with Exchange mailbox clustering, Microsoft have managed to go from Exchange 2007 with a fully functional HA solution to Exchange 2010 with an HA solution that’s only fully functional if you’ve purchased their monitoring products, have staff on-hand 24/7 to make DNS changes or don’t mind extensive user disruption if you have a site or server failure.

Honestly, for all its new features and improvements, Exchange 2010 is easily the worst-implemented version of Exchange since 5.5; it’s like they decided that anything that would help with a smooth transition from 2007 was entirely too much effort and that anyone who deviates from the Exchange infrastructure that Microsoft envisioned should be punished for daring to do so. I’m genuinely regretting starting the 2007->2010 transition now; it’s caused immeasurably more hassle than is justified by the benefits it provides.

There are three things that almost everyone I meet in IT seems incapable of understanding: Share Permissions vs NTFS Permissions, NTFS Full Control vs Modify permissions, and Group Policy vs Local Permissions.

For those who don’t know, Windows folders presented over a network via a CIFS share have two levels of permissions. First there are Share Permissions, which are mostly a lingering reminder of the pre-NTFS days, when they were the only way to control access to network resources; they’re pretty basic, with only Read, Change & Full Control available to you. Then there are NTFS permissions, which are the normal Windows file system permissions and have a dizzying array of permission settings available, but usually come down to List, Read, Modify & Full Control.

Best practice is to either leave the share permissions as the default (Everyone: Read) or, if some or all users need to modify files in the share, change them to Everyone: Change. That’s it. Everything else should be handled through NTFS permissions (or other local file system permissions where available) because they’re more granular, can be used to set different permissions on files and subfolders of a share (unlike share permissions, which apply to the share as a whole) and, importantly, apply equally to users who log onto the machine locally and those connecting via the share. Yet somehow, people are obsessed with screwing around with share permissions: adding random users or groups to them when they’ve already set Everyone: Change, or leaving them as Everyone: Read and then complaining that users can’t write to the share, but can write to the same location locally on the server if they connect via RDP or console.
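
To make that concrete, here’s roughly what the best practice looks like from an elevated prompt; a sketch with placeholder paths and group names:

```powershell
# Share permissions: Everyone gets Change, and that's all the share ever needs
net share Data="D:\Data" /GRANT:Everyone,CHANGE

# The real access control is NTFS permissions, which are granular, can differ
# per subfolder, and apply to local/RDP access as well as access via the share
icacls "D:\Data" /grant "DOMAIN\Data-Users:(OI)(CI)M"        # Modify
icacls "D:\Data\Reports" /grant "DOMAIN\Managers:(OI)(CI)RX" # Read-only subfolder
```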

Seguing nicely into NTFS permissions, we find the inability of people to understand the difference between Modify and Full Control. Ever dealt with a third party while troubleshooting a permissions issue with one of their apps? They will almost always tell you “The users need Full Control on that folder/file, not just Modify”, but will almost never be able to tell you why, or what the difference is. I’ll tell you what the difference is: “Change Permissions”, “Take Ownership” and “Delete subfolders and files”. Those are the only things that Full Control gives you over Modify, namely the ability to alter the permissions on the object, to take ownership of it, and to delete subfolders and files even if the Delete permission has not been granted on the subfolder or file. In other words, for 99% of use cases, no difference whatsoever.
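
You can see the same distinction in icacls terms, where the simple rights map almost directly; a sketch with placeholder paths and groups:

```powershell
# Modify (M): read, write, execute and delete - all most apps ever need
icacls "D:\Apps\SomeApp" /grant "DOMAIN\App-Users:(OI)(CI)M"

# Full Control (F): Modify plus Change Permissions (WDAC), Take Ownership (WO)
# and Delete Subfolders and Files (DC) - rarely necessary for ordinary users
icacls "D:\Apps\SomeApp" /grant "DOMAIN\App-Admins:(OI)(CI)F"
```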

Finally, Group Policy vs Local Permissions; not NTFS permissions, mind, but user account permissions. Specifically, this refers to the way that people are unable to grasp that having local Administrator rights on a machine does not magically allow you to do things that are restricted by Group Policy. If your user account or PC has a GPO applied to it that prevents access to the Control Panel, then adding your account to the local Administrators group isn’t going to make the slightest bit of difference to your ability to access said Control Panel. What it will do, of course, is allow you to work around Group Policy settings in certain circumstances by modifying registry permissions so that the system can’t apply the GPOs at startup/login, but most people aren’t aware that that’s even possible, let alone how to go about it, so my point remains valid.

If I could only make people understand these three things, I’d probably eliminate 20% of the support issues I have to deal with on a daily basis. Oh well, a man can dream…