While most of our client deployments have gone quite smoothly when it comes to stretching Exchange 2010 Database Availability Groups across multiple sites and WAN links, I recently found myself troubleshooting an intermittent issue at one client. The environment’s topology was fairly straightforward: two DAG members in one data center for local high availability and one DAG member in an alternate data center for remote site resiliency. Creating the DAG, adding members, and adding mailbox database copies all presented no issues during the initial deployment, although we did need to resolve some problems with database copy replication across the WAN.
As we approached our anticipated IT pre-pilot for the new Exchange 2010 environment, we started to notice significant issues in DAG communications across the WAN. Specifically, we saw the following symptoms fairly consistently, although at times everything worked just fine:
- From the primary data center, viewing the mailbox database and associated copy status from the Exchange Management Console listed mount states for some databases as “Unknown” and copy status for all remote database copies as “ServiceDown.” Running Get-MailboxDatabaseCopyStatus against the DAG member(s) in the remote data center reflected the same results. Databases in an “Unknown” mount state corresponded to cases where the database was activated in one data center and status was being queried across the WAN from the other data center.
- Running “Get-DatabaseAvailabilityGroup -Status” would take an extremely long time to complete.
- Occasionally, databases would be listed in a dismounted state and, upon attempting to mount, an error message stating “Automount consensus not reached” would be returned and the mount would fail.
- Event logs on DAG members in both data centers would sporadically report FailoverClustering events indicating that nodes in the respective remote data center had been removed from cluster membership.
- Test-ReplicationHealth against DAG members across the WAN to the remote data center reported failures for ActiveManager (“Active Manager is in an unknown state”) and TasksRpcListener (“An error occurred while communicating with the Microsoft Exchange Replication service to test the health of the Tasks RPC Listener”).
The issue was clearly related to RPC requests traversing the WAN and failing somewhere along the path from source to destination. As a next step, I ran the “Validate a Configuration Wizard” for the DAG’s underlying Windows Failover Cluster and, sure enough, RPC errors were reported for queries that had to cross the WAN to reach cluster nodes in the other data center. At that point, it was time to install Wireshark and run packet captures on both sides of the WAN while executing Exchange actions or the cluster validation wizard, to determine what was happening to the traffic.
Review of the packet captures revealed that packets larger than a standard 1500-byte frame were being sent between DAG/cluster members, and that those packets were being fragmented in transit from source to destination. Disabling various TCP offload features of the NIC driver in use on the DAG/cluster members (the vmxnet3 Ethernet Adapter) brought the packet size down to 1500 bytes, but the problems still occurred. Running ping tests between the two data centers (ping -f -l <packet_size> hostname) revealed that the largest packet succeeding across the WAN carried a 1468-byte payload. Once the MTU of the NIC was reduced to match this value (via NETSH), everything began working perfectly: the “Validate a Configuration Wizard” completed without any unexpected warnings and all Exchange-related functionality was restored.
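It helps to keep the arithmetic behind that ping sweep in mind: the -l switch sets the ICMP payload size, and the packet on the wire is 28 bytes larger (20-byte IP header plus 8-byte ICMP header, assuming no IP options). A minimal sketch of that reasoning is below; the can_traverse predicate here merely simulates the WAN behavior we observed, standing in for a real ping -f -l probe:

```python
# Sketch: relate `ping -f -l <payload>` sizes to on-the-wire packet sizes,
# and binary-search for the largest payload that fits the path.

IP_HEADER = 20    # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes

def wire_size(ping_payload: int) -> int:
    """Total IP packet size produced by `ping -l <ping_payload>`."""
    return ping_payload + IP_HEADER + ICMP_HEADER

def largest_payload(can_traverse, lo=0, hi=1472):
    """Binary-search the largest payload for which can_traverse(size) is True."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if can_traverse(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# SIMULATED path that cannot pass anything over 1496 bytes, matching the
# 1468-byte payload ceiling found with `ping -f -l`. In practice you would
# shell out to ping here instead of calling this stand-in.
simulated_ping = lambda payload: wire_size(payload) <= 1496

best = largest_payload(simulated_ping)
print(best, wire_size(best))  # → 1468 1496
```

On a healthy 1500-byte MTU link, the largest unfragmentable ping payload is 1472 bytes; a 1468-byte ceiling therefore suggests an effective path MTU of 1496 bytes, so dropping the NIC MTU to 1468 was a safely conservative setting.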
While disabling offload features on the NIC and reducing its MTU worked around the problem, it was neither ideal nor a long-term solution for this environment. Ultimately, determining where the issue lay in the WAN was key to resolving it without requiring non-standard configurations on servers throughout the environment. In working with the client’s networking team, we learned that their particular WAN connectivity provided a Layer 2 Ethernet hand-off to each data center, such that no router was in place on either side. This explained why larger-than-MTU packets were making it onto the WAN and being fragmented in the process. Coordination between the WAN provider and the remote data center’s networking team was required to determine where in the network path a device was unable to handle even a standard 1500-byte packet properly.
Ultimately, there were a few remediation options for this client: either obtain jumbo frame support across the Layer 2 WAN or place routers on either side of it. Both options would remove the need for non-standard configurations on the servers themselves while still resolving the communication issues between the data centers.