Exchange 2010 CAS high availability across datacenters

  • Thread starter Ruben van Gogh
  • Start date Views 655

Ruben van Gogh

#1
Hiya,

A colleague of mine asked a question about the 'new' high availability functions in Exchange 2010 across datacenters.
As we talked about the flexibility of DAG, which is a great improvement for cross-datacenter availability over Exchange 2007 because it's no longer restricted to a single Active Directory site, we started wondering about the availability options for the CAS server role.

Of course, CAS can be deployed in an NLB configuration, but this will function within a single site only.

For the sake of argument, let's say I have datacenter A and B, where A is the active datacenter. For this example, let's state that I _won't_ use NLB for the CAS roles in each site.

I deploy Mailbox Role A (MBX_A) in datacenter A and Mailbox Role B (MBX_B) in Datacenter B.
Then I deploy CAS role A (CAS_A) in datacenter A and CAS role B (CAS_B) in Datacenter B.

In case of a failure of MBX_A it fails over to MBX_B and CAS_A will point to MBX_B.

But what happens if CAS_A fails? How will the clients get to CAS_B in the other datacenter?

Regards,

Ruben
 
Lee c y

#2
I am not entirely sure of the official solution, but this is based on my own understanding.

Create a CAS array in Exchange, e.g. Casarray1. This should usually point to the NLB cluster host name, but since there is no "real" NLBed CAS, just manually create the A record for Casarray1 and configure it with the IP of CAS_A. Set the TTL of this record to 5 minutes.

Configure all databases in Datacenter A to have RpcClientAccessServer set to Casarray1.

If there is a failover to Datacenter B, manually modify the A record of Casarray1 and configure it to point to the IP of CAS_B.
Assuming all the databases fail over successfully to MBX_B, the clients should reconnect automatically, but there is a longer delay and the process is triggered manually.
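
To make the delay concrete, here is a minimal Python sketch of the idea, not of Exchange or any real DNS server. It models a single A record with a 300-second TTL and a client that caches the answer until the TTL expires. The class names, the record name `casarray1`, and the IP addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ARecord:
    ip: str
    ttl: int  # seconds


class CachingClient:
    """Hypothetical client that caches a resolved IP until the TTL expires."""

    def __init__(self, zone):
        self.zone = zone
        self.cache = {}  # name -> (ip, expires_at)

    def resolve(self, name, now):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]  # still within TTL: reuse the old answer
        record = self.zone[name]
        self.cache[name] = (record.ip, now + record.ttl)
        return record.ip


# Assumed addresses: 10.0.1.10 stands in for CAS_A, 10.0.2.10 for CAS_B.
zone = {"casarray1": ARecord(ip="10.0.1.10", ttl=300)}
client = CachingClient(zone)

before = client.resolve("casarray1", now=0)

# Manual failover step: repoint the A record to CAS_B.
zone["casarray1"] = ARecord(ip="10.0.2.10", ttl=300)

during = client.resolve("casarray1", now=120)  # cache not yet expired
after = client.resolve("casarray1", now=300)   # TTL elapsed, record re-read
```

In this model the client keeps talking to the old address until its cached answer expires, so the worst-case delay after the manual change is roughly one TTL. Real clients and resolvers may add their own caching on top of that.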
 
Ruben van Gogh

#3
Hi,

Thanks for your reply, but I already figured that one out. It is not an automatic failover like the one DAG does. So in my opinion it's not only an unofficial solution, it's not even a solution :)

Regards,

Ruben
 
Lee c y

#4
In almost all cases, customers do not want automatic failover across datacenters. It is too risky.

The recovery time allowed by the SLA for a datacenter failure is usually longer than for in-site failures.

I suspect that when MS releases the official guidance for cross-site failover, it will also recommend a manual process.
From this link: http://technet.microsoft.com/en-us/library/dd351049(EXCHG.140).aspx

"A datacenter or site failure is managed differently from the typical failures that cause server or database failover. In a high availability configuration, automatic recovery is initiated by the system, and the failure typically leaves the messaging system in a fully functional state. By contrast, a datacenter failure is considered to be a disaster recovery event, and as such, recovery must be manually performed and completed in order for the client service to be restored, and for the outage to end. As with many disaster recovery scenarios, prior planning and preparation can simplify the recovery process and reduce the duration of the outage.

By combining the built-in site failover support in Microsoft Exchange Server 2010 with your proper planning, a second datacenter can be rapidly activated to serve the failed datacenter's clients. The proper planning and preparation involves not only the deployment of the second datacenter resources, but also pre-configuration of those resources to minimize the changes required at datacenter failover time."

Cheers
 
Ruben van Gogh

#5
I don't agree with you on that one.

Especially when I look at the DAC functions MS has built for DAG to prevent split-brain scenarios, which tend to happen in automatic failovers.

I'm surprised to see that no one from MSFT has responded to this thread yet.

Regards,

Ruben
 
Lee c y

#6
Sorry, all the acronyms are getting to me. I know what the split-brain scenario is, but what is "DAC functions"?
 
Ruben van Gogh

#7
Oops, typo. It's "function", not "functions".

DAC stands for Datacenter Activation Coordination. This was developed to prevent split-brain scenarios in datacenter failovers, which is an automated process. Therefore, I don't believe that MSFT's position on datacenter failovers is that automatic is too risky.
 
Lee c y

#8
OK, now I remember what DAC is. Even with DAC, the whole datacenter switchover is a manual process.

First, DAC needs to be manually enabled if your DAG contains 3 or more servers.
Let's say the scenario is that datacenter A has 2 DAG servers and datacenter B has a single DAG server. If there is a power outage in datacenter A, by default the datacenter B server will not mount the database until a majority of the members in the DAG are online. This is to prevent a split-brain scenario: there is a possibility that an admin has triggered a recovery in datacenter B and then datacenter A's power suddenly resumes, so both sides will "think" that they own the active copy of the database.
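
The majority rule in this example can be checked with a few lines of Python. This is only a sketch of simple node-majority arithmetic, assuming an odd member count with no witness vote. The function names are invented and nothing here calls Exchange or the Windows cluster service.

```python
def votes_needed(total_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return total_members // 2 + 1


def has_quorum(online_members: int, total_members: int) -> bool:
    """True if the online members alone form a majority of the DAG."""
    return online_members >= votes_needed(total_members)


# The scenario above: 2 DAG members in datacenter A, 1 in datacenter B.
dag = {"A": 2, "B": 1}
total = sum(dag.values())

b_alone = has_quorum(dag["B"], total)  # datacenter A has lost power
a_alone = has_quorum(dag["A"], total)  # datacenter B is unreachable
```

With three members a majority is two, so the single server in datacenter B cannot keep quorum on its own, while the two servers in datacenter A can.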

By turning on DAC, you can mount a database even if a majority of DAG members is not online. So in the above example, an admin can initiate a recovery in datacenter B. If, however, datacenter A's power resumes, the servers in datacenter A will not mount the database even if they have majority. The PAM will coordinate with the SAM to check the current state of the DAG before mounting any databases, so that split-brain scenarios can be prevented.

In summary, the presence of DAC does not imply that automatic failover across datacenters is possible, but it is a mechanism to prevent split brain in certain (manual) datacenter switchover scenarios.

Cheers
 
Brian Day MCITP [MVP]

#9
You're forgetting about Autodiscover.

If you have a cross-datacenter failure then you can do a couple things that I know of.

1. Leave the RpcClientAccessServer attribute alone on the databases and repoint the CAS array A record to the IP of CAS-B. As long as the TTL is low this should get things going soon.

2. Modify the RpcClientAccessServer attribute on the databases to point to CAS-B, and Autodiscover-capable clients will learn the new endpoint.
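
The two options differ only in which level of indirection gets changed. The Python sketch below models that with two plain dictionaries: one maps a database to the name in its RpcClientAccessServer attribute, the other maps names to IPs. The database name `DB1`, the host names, and the addresses are all hypothetical, and no real Exchange or DNS API is involved.

```python
def client_endpoint(db, databases, dns):
    """Return the IP a client of `db` ends up connecting to."""
    return dns[databases[db]]


# Assumed starting state: DB1 points at the CAS array name, which resolves to CAS-A.
dns = {"casarray1": "10.0.1.10", "cas-b": "10.0.2.10"}
databases = {"DB1": "casarray1"}

# Option 1: leave the database attribute alone, repoint the A record.
dns_opt1 = dict(dns, casarray1=dns["cas-b"])
opt1 = client_endpoint("DB1", databases, dns_opt1)

# Option 2: leave DNS alone, change the attribute on the database.
databases_opt2 = dict(databases, DB1="cas-b")
opt2 = client_endpoint("DB1", databases_opt2, dns)
```

Both paths end at the same address. In practice, option 1 depends on DNS caches expiring and option 2 depends on clients picking up the change through Autodiscover.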

Datacenter failovers, as mentioned in other replies, are typically *not* a great thing to have done automatically, and as long as you have an HA solution for CAS in place in Site-A then you'll be OK. :)

-b
Brian Day / MCSA / CCNA, Exchange/AD geek.
 
Nadim J

#10


Sorry if I'm being dense here, but am I the only one who thinks that Ruben van Gogh's original question was never answered fully?

I have to design a solution for a 2,000 user company that absolutely cannot have any Single Points of Failure (SPoF). The marketing around DAG is great until you realise you have the slight problem of making your CAS servers resilient, ordinarily taken care of by using a CAS Array and NLB. However, correct me if I'm wrong, a CAS array can only encompass CAS servers from the same AD site. This leaves me with a bit of a problem... (Our setup is the end goal of most organisations I have dealt with over the last 5 years. That is to move to a dual datacentre setup with no SPoF.)

The environment:

-12 AD sites with 2,000 users spread across all (10 user sites, 2 datacentres)
-Datacentres (DC1 and DC2) will hold 1 mailbox server each (MBX1 and MBX2). Each mailbox server to handle 50% of the total AD (user) sites
-A third mailbox server held at Site1 (specifically to avoid using a FSW (another SPoF), a 3 node DAG enables you to use node majority only)

If I take users from Site1, who are ordinarily connected to a CAS server at their local site, in turn connected to MBX1 (no point in putting CAS servers in DC1, as a failure of comms between Site1 and DC1 will result in users not being automatically redirected to MBX2), the only way I can get their Outlook clients to resume connectivity to DC1 in the event of the local CAS going down is to create more than one CAS in an array. Due to the CAS Array limitation (intra-site only), I need to have a total of 10 CAS arrays across all user sites, containing a minimum of 2 CAS servers each, just to address my problem. If I could create a CAS Array encompassing CAS servers from multiple sites I'd be sorted. Another reason why this limitation is awkward to work with is that, like most orgs, we are heavily reliant on VMware/Hyper-V. Two CAS servers in an array on the same VMware platform doesn't give us the best resiliency either, due to dependency on the same single SAN!

Clearly it's impractical and costly to have 20 CAS servers for a 2,000 user org. My question therefore is: is there any way to configure CAS Arrays to achieve cross-site failover for Client Access as well as for mailbox servers?

Thanks
 