Troubleshooting DAGs

Status
Not open for further replies.

Korbyn

Has anyone found a good article on actually troubleshooting a DAG? After spending the whole weekend trying both to enable the backup and to fix the db copies, I finally removed all copies, removed all the files from the passive nodes, and removed the reple folders in each of the source server Translog folders. For good measure, I also reset the transaction logs to 0 (dismount the DB, run eseutil /mh, make sure it shows a clean shutdown, remove everything from the log folder, remount). After that I was finally able to reseed the databases. Resetting the translogs revealed both corrupt and locked transaction logs (definitely run RU2).

However, one of the newly "reseeded" databases is reporting a Copy Queue Length of over 23,000, which was kind of why I thought it was time to start over in the first place. I suspect a minor bug, as LastLogCopyNotified, LastLogCopied, LastLogInspected, and LastLogReplayed are all accurate; it's just LastLogGenerated and CopyQueueLength that are screwed up.

Here's hoping the third time's a charm. It's just annoying that all the technotes and blogs regurgitate the same thing, "make sure the copy and replay queues are less than 10", but nothing covers what to do when they're not...
 
Fazal Muhammad Khan_

Thank you for your post. On troubleshooting DAGs, I could only find:

http://exchangemaster.wordpress.com/tag/dag/

http://runebelune.spaces.live.com/?_c11_BlogPart_BlogPart=blogview&_c=BlogPart&partqs=cat%3DExchange%253bDatabase%2520Availability%2520Group%253b%2520DAG%253b%25202010%253b%2520Datacenter%253b%2520Failure

Given time, I am sure some MVPs or Microsoft moderators will help you out with this.

If you don't mind, would you tell me about the current Exchange design you have deployed?

I would like to know what led you to troubleshoot the DAG.

The reason I am asking is that I have deployed DAGs for a couple of my customers and they seem to be working "so far, so good".

Another reason for asking is that maybe we can learn something from the issues you encountered, or maybe highlight some points you missed.

I recently wrote a blog post on DAG in DAC mode that could be helpful to you:

http://fazalmkhan.spaces.live.com/blog/cns!38CB9E0022685FEC!444.entry

Waiting for your reply.

Regards

Fazal M Khan
 
Korbyn

It's a fairly "standard" DAG configuration: 2 servers with the CAS/HUB roles, and 2 other separate servers with the MBX role installed. Unfortunately, when they set this all up they didn't check whether their backup software supported E2010 yet, which it didn't, so they've run out of log space a couple of times, extended the disk, and I'm sure somewhere along the line removed log files. At some point the Replay Queue started growing, and it looks like that coincided with when they installed RU1, which pooched one database, and then RU2, which pooched the other 3.

I've attempted to recreate the db copy for one of the databases 4 times now, all with the same result, which I'll attach below. I attempted the 4th db, which is almost 300 GB, and could see from the status that the copy queue length was going to be over 500,000, so I canceled that. I've already recommended to the client that they create several new databases with replicas and move the mailboxes in; I don't believe 300 GB is a good starting size... I'd also prefer to have a better answer, or at least a reason for the Copy Queue Length being corrupted like this, and a solution better than creating new databases.

I've definitely learned a few things along the way. One example: you need to physically remove the edb files and logs of databases that have been removed in order to back up Exchange 2010 using Windows Backup...

As you can see below, LastLogGenerated, or at least what it thinks it is, is 23841 and the Copy Queue Length is 23821; the difference between them happens to equal LastLogReplayed. Things increment fine, but because the Copy Queue Length is so high, failover is not allowed, at least not with Lossless, and I don't really want to chance the other options at this time.

RunspaceId : 9f00b572-740a-4422-b9
Identity : IT\EXGEMB01
Name : IT\EXGEMB01
DatabaseName : IT
Status : Healthy
MailboxServer : EXGEMB01
ActiveDatabaseCopy : exgemb02
ActivationSuspended : False
ActionInitiator : Service
ErrorMessage :
ErrorEventId :
ExtendedErrorInfo :
SuspendComment :
SinglePageRestore : 0
ContentIndexState : Crawling
CopyQueueLength : 23821
ReplayQueueLength : 0
LatestAvailableLogTime : 4/5/2010 11:26:45 AM
LastCopyNotificationedLogTime : 4/5/2010 11:26:45 AM
LastCopiedLogTime : 4/5/2010 11:26:45 AM
LastInspectedLogTime : 4/5/2010 11:26:45 AM
LastReplayedLogTime : 4/5/2010 11:26:45 AM
LastLogGenerated : 23841
LastLogCopyNotified : 20
LastLogCopied : 20
LastLogInspected : 20
LastLogReplayed : 20
LatestFullBackupTime : 4/5/2010 2:53:45 AM
LatestIncrementalBackupTime :
LatestDifferentialBackupTime :
LatestCopyBackupTime :
SnapshotBackup : True
SnapshotLatestFullBackup : True
SnapshotLatestIncrementalBackup :
SnapshotLatestDifferentialBackup :
SnapshotLatestCopyBackup :
LogReplayQueueIncreasing : False
LogCopyQueueIncreasing : False
OutstandingDumpsterRequests : {}
OutgoingConnections :
IncomingLogCopyingNetwork :
SeedingNetwork :
ActiveCopy : False
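
The relationship between those counters is easy to verify from the status output above. A minimal Python sketch follows; the dictionary and the function name are made up for illustration, and nothing here is meant to describe how Exchange itself computes the queue.

```python
# Values copied from the status output above.
status = {
    "CopyQueueLength": 23821,
    "LastLogGenerated": 23841,
    "LastLogCopied": 20,
    "LastLogInspected": 20,
    "LastLogReplayed": 20,
}

def queue_tracks_generated(s):
    """Hypothetical check: does queue + last replayed equal last generated?"""
    return s["CopyQueueLength"] + s["LastLogReplayed"] == s["LastLogGenerated"]

# The copy really is caught up: copied, inspected and replayed all agree.
caught_up = s_equal = (
    status["LastLogCopied"] == status["LastLogInspected"] == status["LastLogReplayed"]
)

print(queue_tracks_generated(status))  # True, since 23821 + 20 == 23841
print(caught_up)                       # True
```

If both checks hold, the reported queue is just the gap to a LastLogGenerated value that no log on disk backs up, which fits the suspicion that LastLogGenerated is the wrong number rather than the queue being real.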
 
jader3rd

Was your newly reseeded database one that previously existed, was removed from the computer, and was then added back?
 
Xiu Zhang

It seems to be the same issue Jader mentioned.

I recommend using the Performance Monitor tool to check the CopyQueueLength value.

1. Administrative Tools.

2. Performance Monitor.

3. Find ReplayQueueLength and CopyQueueLength counters under the MSExchange Replication performance object.

Regards,

Xiu
 
Korbyn

The replay queue length was so large because of transaction log corruption, an issue that was resolved in RU2. There is zero reason for the Copy Queue Length to be 23,734 after reseeding. It's like 23,734 is the new zero. If you look closely at the status here, CopyQueueLength + LastLogReplayed = LastLogGenerated. LastLogGenerated is very much incorrect: E040000006B is the last log in the Translog folder, and 6B hex = 107 decimal. The issue seems to be the same as in the thread that jader3rd mentioned, except these are production databases. The perf counters are at zero, and recycling the replication service resulted in a drop of 5...

CopyQueueLength : 23734
ReplayQueueLength : 0
LatestAvailableLogTime : 4/6/2010 8:27:19 AM
LastCopyNotificationedLogTime : 4/6/2010 8:27:19 AM
LastCopiedLogTime : 4/6/2010 8:27:19 AM
LastInspectedLogTime : 4/6/2010 8:27:19 AM
LastReplayedLogTime : 4/6/2010 8:27:19 AM
LastLogGenerated : 23841
LastLogCopyNotified : 107
LastLogCopied : 107
LastLogInspected : 107
LastLogReplayed : 107
LatestFullBackupTime : 4/6/2010 4:06:00 AM
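
To spell out the hex arithmetic, here is a small Python sketch. It assumes the log name is a three-character prefix (E04 here) followed by the generation number in hexadecimal; the function name and the `prefix_len` parameter are hypothetical, used only for this illustration.

```python
def log_generation(log_name, prefix_len=3):
    """Hypothetical helper: decimal generation from a name like 'E040000006B'."""
    stem = log_name.split(".")[0]      # tolerate a trailing '.log'
    return int(stem[prefix_len:], 16)  # the remainder is read as hexadecimal

last_on_disk = log_generation("E040000006B")  # 0x6B

# Values copied from the status output above.
copy_queue_length = 23734
last_log_replayed = 107
last_log_generated = 23841

print(last_on_disk)                                                  # 107
print(copy_queue_length + last_log_replayed == last_log_generated)   # True
print(last_log_generated - last_on_disk)                             # 23734
```

In other words, the last log on disk matches LastLogReplayed, and the whole reported queue is the difference between that and the stale LastLogGenerated value.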

When I recreate the database copy, and when I run update-mailbox... -catalogOnly, I get the following error. But I only get this on the two databases that have the above issue with the LastLogGenerated number being wrong.

A server-side seed operation has failed. Error: An error occurred while performing the seed operation, which may indicate a problem with the source disk. Error: An error occurred while updating the search catalog files from server 'EXGEMB02' to 'EXGEMB01'. Error: Can't dismount the search catalog. Error: Microsoft.Exchange.Search.Common.FteCatalogNotFoundException: SearchCatalog.Dismount failed, error 0x80043629 ---> System.ComponentModel.Win32Exception: Unknown error (0x80043629)
--- End of inner exception stack trace -
at Microsoft.Exchange.Cluster.Replay.CiFilesSeederInstance.<>c__DisplayClass5.<SeedThreadProcInternal>b__2(Object , EventArgs )
at Microsoft.Exchange.Cluster.Replay.CiFilesSeederInstance.RetryCiOperation(EventHandler evt) [Database: IT, Server: EXGEMB01.ab.ca]
+ CategoryInfo : InvalidOperation: (:) [Add-MailboxDatabaseCopy], CiSeederGenericException
+ FullyQualifiedErrorId : F89C965D,Microsoft.Exchange.Management.SystemConfigurationTasks.AddMailboxDatabaseCopy

There is an article that's seemingly related, http://support.microsoft.com/kb/977952, but it doesn't really offer a workaround. I'm at the point of telling the client that the databases are fubared, and that they should create new ones, move the mailboxes, and remove these databases. Hopefully by SP1 they will have started to take care of these kinds of issues. Creating net-new databases isn't my idea of a solution.

In case someone is looking into the issue, here is why I started troubleshooting this: the client had disk problems that led to event ID 2156, which stopped log replay on the passive server. RU2 may prevent the problem in the future, but it didn't fix this server's issue. I cleared the corrupted logs by removing the copies, dismounting the databases, running eseutil /mh to confirm a Clean Shutdown, removing all files and directories in the trans log folders on both nodes, removing the databases on the passive node, and remounting the databases. I ran a full backup and then tried recreating the database copies. That's where the number from the Replay Queue Length for some reason moved over to the CopyQueueLength, and I haven't been able to get rid of it since. There is no actual queue, the queues are empty, but because that number is so high, we can't fail over the database to the passive node.
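
The clean-shutdown check in that sequence can be scripted before any log files are touched. A rough Python sketch, assuming the eseutil /mh output has already been captured as text and contains a `State:` line; the function names are hypothetical, and the sample text is an abbreviated, made-up stand-in rather than real output from these servers.

```python
import re

def database_state(mh_output):
    """Return the value of the first 'State:' line, or None if there is none."""
    match = re.search(r"^[ \t]*State:[ \t]*(.+?)[ \t]*$", mh_output, re.MULTILINE)
    return match.group(1) if match else None

def is_clean_shutdown(mh_output):
    """True only when the header reports a clean shutdown."""
    return database_state(mh_output) == "Clean Shutdown"

# Abbreviated, invented sample for illustration only.
sample = """
            State: Clean Shutdown
"""

if is_clean_shutdown(sample):
    print("Clean shutdown reported; logs are not needed to mount the database.")
else:
    print("Not a clean shutdown; do not remove any log files.")
```

Treating anything other than an explicit "Clean Shutdown" as unsafe is deliberate: a missing or unexpected state line should stop the procedure rather than let it continue.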

It's going to take a day or two for the client to add some new disk to the server before we start moving mailboxes, so if anyone has some divine inspiration, I'm all ears.
 