Troubleshooting Failed Migrations

Continuing our hybrid migration blog post series (see part 1 here and part 2 here) we arrived at troubleshooting failed migrations.

A ‘failed migration’ is when the status of the move request shows as ‘failed’, and we have one or more failures logged in the move report. The move is stopped and needs the administrator’s attention to investigate the reason for the failure. Sometimes resuming the move can help, especially if there were temporary issues on the Exchange Online side that have since been addressed.

Before getting into troubleshooting, I recommend you check the ‘Minimum Requirements’; those are the things we know will break migrations (and we see them do so).

We also recommend that you bypass network devices such as firewalls and reverse proxies during migrations, to reduce source network latency and avoid frequent transient communication errors that would result in mailbox locks and slow migrations.

When troubleshooting Office 365 migrations, the Exchange Admin Center GUI is often helpful and quite verbose about the reason for the failure, and in many cases it includes a link to the corresponding documentation page for more information on the specific issue.

Let me briefly show you some useful info that we can see in the (Classic) Exchange Admin Center. As a note, at the time of writing this article, the New Exchange Admin Center doesn’t currently show all of this information.

migtr01.jpg

The following can be seen in the above screenshot:

  • We have one migration batch called “Test Hybrid Migration” of type Exchange Remote Move
  • Direction of the move is Onboarding (from on-premises to the cloud)
  • The current status is Syncing (things are going well so far)
  • There is only one migration user in the batch (looking at the Total column)
  • The user is not Synced (hasn’t reached the Incremental Sync at 95%), not Finalized (hasn’t reached the 100% completion) and not Failed (didn’t encounter a fatal failure)

After clicking on View details, we also see:

  • Who created the batch (crystal@mytenant.onmicrosoft.com),
  • When it was created and started (New-MigrationBatch -AutoStart),
  • When it should complete (after the initial sync is done)
  • There is no last synced time because the status is syncing and no initial sync has been done
  • Also, the associated endpoint is the name of my migration endpoint (Get-MigrationEndpoint) through which I am running the batch.

migtr02.jpg

After a little while, the user failed because the ExchangeGuid was missing on the mail user object in Exchange Online:

migtr03.jpg

In such a situation, the migration service failed to inject the move request because the user failed validation. This means that we don’t have a move request for this user and therefore will have no move report.
If you were to click on ‘Download the report for this user’, you would get an empty .txt file.
Let me show you what this failure looks like in PowerShell and what objects are created and available for us to check there.
With the Get-MigrationBatch command, we can see the name of the batch, the status, the type and how many users are contained in the batch:

migtr04.jpg

To see all properties, run Get-MigrationBatch |FL.

Some other attribute values for this batch, which you also saw in the Exchange Admin Center GUI, were:

CreationDateTime           : 6/1/2020 8:23:35 AM
StartDateTime              : 6/1/2020 8:23:34 AM
LastSyncedDateTime         :
SubmittedByUser            : crystal@<mytenant>.onmicrosoft.com
BatchDirection             : Onboarding
SourceEndpoint             : Hybrid Exchange Miry

If I had multiple batches and I was interested in seeing this particular one, I would run: Get-MigrationBatch "Test Hybrid Migration". If I wanted to see all batches that have failures, I would run: Get-MigrationBatch -Status SyncedWithErrors

Going further with PowerShell, if I want to see the migration user contained in that batch, I would do it like this: Get-MigrationUser -BatchId "Test Hybrid Migration". To see all the details on the migration user, I would again append |FL

migtr05.jpg

This error is self-explanatory: the ExchangeGuid is missing on the user. I can also see this with the Get-MailUser command for this migration user:

migtr06.jpg

From the Get-MigrationUser output, I can also see the RequestGuid is empty, so this also tells me that there is no move request / move report for this migration user. I can run Get-MoveRequest <user> or Get-MigrationUser -BatchId "Test Hybrid Migration" | Get-MoveRequest to confirm this.

migtr07.jpg

In cases where the error message on the migration user is not so obvious and you still don’t have a move request created for it, you can check Get-MigrationUserStatistics with the verbose DiagnosticInfo option: Get-MigrationUserStatistics <user identity> -DiagnosticInfo verbose |FL and see whether more details are available.
I will now go through some more command examples in case you want to play around and check simple or more complicated things in PowerShell. Some things can only be checked from PowerShell. If you have a move request that has failed or is progressing slowly, you can read more about analyzing move reports with PowerShell in a later part of this blog series.

To get an overview of migration statistics:

Get-MigrationStatistics

migtr08.jpg

To get all migration users, their status and corresponding batches:

Get-MigrationUser

migtr09.jpg

To get a specific migration user:

Get-MigrationUser <email address>

migtr10.jpg

To check the error on a specific migration user:

Get-MigrationUser <email address> |FL errorsummary
Get-MigrationUser <email address> |FL

migtr11.jpg

To get all failed migration users:

Get-MigrationUser -Status Failed

migtr12.jpg

To get all failed migration users and their errors:

Get-MigrationUser -Status Failed | FT identity , errorsummary
Get-MigrationUser -Status Failed | FL identity , errorsummary

migtr13.jpg

To get migration users from a particular batch:

Get-MigrationUser -BatchId "Batch Name"

migtr14.jpg

To get all migration batches:

Get-MigrationBatch

migtr15.jpg

To get a particular batch:

Get-MigrationBatch "Batch Name"

migtr16.jpg

Checking move requests (specific for hybrid remote moves)

To get all existing move requests:

Get-MoveRequest

migtr17.jpg

To get move request statistics for a specific move request:

Get-MoveRequestStatistics "User"

migtr18.jpg

Note that there are 2 main types of failures:

  • Transient exceptions, for example DataExportTransientException
  • Permanent exceptions, for example StoragePermanentException

Note: For a move request to be in a Failed state, we would need to have a permanent failure. Too many transient failures (usually more than 60) will eventually cause a permanent failure. Too many transient failures can also slow down your migration considerably.

To see the failures (transient or permanent), you would run commands similar to the ones below, or export the statistics to an XML file (discussed in a later part of this blog series).

To store the move report in a variable:

$stats = Get-MoveRequestStatistics "Affected User" -IncludeReport

To check all failures and their count:

$stats.report.Failures | group failuretype | Format-Table -AutoSize

migtr19.jpg
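Building on the grouping by failure type, here is a rough sketch that buckets the failures into transient and permanent ones. Get-FailureBucket is a hypothetical helper (not an Exchange cmdlet), and classifying by the words "Transient" / "Permanent" in the FailureType name is my assumption, based on the example names DataExportTransientException and StoragePermanentException:

```powershell
# Hypothetical helper: buckets failures by whether the FailureType name
# contains "Transient" or "Permanent" (a naming-based assumption).
function Get-FailureBucket {
    param(
        [Parameter(Mandatory)]
        [object[]]$Failure
    )
    $Failure | Group-Object -Property {
        if ($_.FailureType -like '*Transient*') { 'Transient' }
        elseif ($_.FailureType -like '*Permanent*') { 'Permanent' }
        else { 'Other' }
    } | Select-Object Name, Count
}

# Usage against a move report stored in $stats (see above):
# Get-FailureBucket -Failure $stats.Report.Failures
```

If the report contains no failures at all, the mandatory parameter will reject the empty collection, so check that $stats.Report.Failures is populated first.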

To check full details of the last failure:

$stats.report.Failures[-1]

migtr20.jpg

To check the last 2 failures:

$stats.Report.Failures | select -last 2

To check the first failure:

$stats.report.Failures[0]

To check the first 3 failures:

$stats.Report.Failures | select -first 3

If there are a lot of failures, you can create a list of the failures with the PowerShell Index number associated with each failure by running the following:

$i=0;$stats.report.Failures | % { $_ | Select-Object @{name="index";expression={$i}},timestamp,failurecode,failuretype,failureside;$i++} | ft

migtr21.jpg

Using this output, you can then easily identify the index number you want to focus on and enclose that failure index number in [brackets], for example:

$stats.report.Failures[4]

migtr22.jpg

To get failed move requests:

Get-MoveRequest -MoveStatus Failed

migtr23.jpg
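Once the underlying cause has been addressed, a failed move can be resumed, as mentioned at the start of this post. Below is a minimal sketch using the Resume-MoveRequest cmdlet. The identity is a placeholder, and the -WhatIf line only previews what would be resumed. For moves created through a migration batch you may prefer to restart them from the batch instead; that preference is my assumption, not something this post prescribes.

```powershell
# Preview which failed move requests would be resumed (no changes are made).
Get-MoveRequest -MoveStatus Failed | Resume-MoveRequest -WhatIf

# Resume a single failed move request after fixing the cause.
# 'user@contoso.com' is a placeholder identity.
Resume-MoveRequest -Identity 'user@contoso.com'
```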

Most frequent failures

Here is a list of most frequently seen failures in hybrid migrations (and when I say ‘most frequent’ I mean ‘most frequent’ issues that we see in support, not that you will see those errors in every migration). Note that not all are permanent failures, meaning not all these will cause your migrations to fail.

  • “User is already being moved” – reference here
  • “You can’t use the domain because it’s not an accepted domain for your organization” – reference here
  • “Target mailbox doesn’t have an smtp proxy matching ‘.mail.onmicrosoft.com’” – reference here
  • “MigrationPermanentException: Cannot find a recipient that has mailbox GUID” – reference here. Note that another possible scenario for this error is when we cannot find a ComponentShared Mailbox by its GUID on the Exchange Online side. A ComponentShared mailbox is used to host data from other Office 365 workloads like Teams, OneDrive for Business and SharePoint. You would check (in Exchange Online PowerShell) these mailbox GUIDs with the command: Get-MailboxLocation -User <SMTP>. If the mailbox GUID in the error belongs to a component shared mailbox, please log a case with Microsoft Support.
  • “You must specify the PrimaryOnly parameter” – reference here
  • “The remote server returned an Error 404” or “HTTP request has exceeded the allotted timeout” – reference here
  • “The remote server returned an error: (403) Forbidden” – reference here
  • “Access is denied” – reference here
  • “Couldn’t switch the mailbox into Sync Source mode” – reference here
  • “CommunicationErrorTransientException – The remote endpoint no longer recognizes this sequence. This is most likely due to an abort on the remote endpoint. The value of wsrm:Identifier is not a known Sequence identifier. The reliable session was faulted.” – reference here
  • “The server was unable to process the request due to an internal error.  For more information about the error, either turn on IncludeExceptionDetailInFaults …” – references here and here
  • “TooManyBadItemsPermanentException” – Failed to find a principal from the source forest or target forest – references here and here
  • “The data consistency score (Investigate) for this request is too low” – reference here. Note that we will have more on Data Consistency Score later in the blog post series.
  • “Exception has been thrown by the target of an invocation.” – reference here
  • “Transient error CommunicationErrorTransientException has occurred. The system will retry” – reference here
  • “The Mailbox ‘<username>@contoso.com’ isn’t enabled for unified messaging.” – reference here
  • “Failed to convert the source mailbox ‘Primary (00000000-0000-0000-0000-000000000000)’ to mail-enabled user after the move.” or “Unable to update Active Directory information for the source mailbox at the end of the move.” – reference here
  • “Target user <User> already has a primary mailbox”. Note: pay special attention to the scenario; it matters whether you get this error in onboarding (move to Exchange Online) or offboarding (move from Exchange Online). For onboarding moves, please see this, and for offboarding see this. For onboarding, follow this. Note on scenario 1 step 7 in that article: it is not supported to remote restore a disconnected mailbox from an Exchange 2010 on-premises source server; the source needs to be Exchange 2013 at a minimum.
  • “StalledDueTo_Target*” when you move mailboxes to Office 365 Exchange Online – reference here. More on this when we discuss slow migrations in the next part of this blog post series.
  • “MapiExceptionTooComplex: Unable to query table rows. (hr=0x80040117, ec=-2147221225)” – reference here.
  • “Mailbox Replication Proxy Service can’t process this request because it has reached the maximum number of active MRS connections allowed” – reference here.

A few more troubleshooting tips

MoveOptions Parameter

Often mailbox moves fail because of corrupt items or elements in a mailbox. These mailbox move failures can be avoided by excluding those (often corrupt) elements from being migrated.

The MoveOptions parameter (previously known as the SkipMoving parameter which is being deprecated) can be added to the onboard or offboard request from PowerShell with the values of:

‘SkipFolderRules, SkipFolderACLs, SkipFolderPromotedProperties, SkipFolderViews, SkipFolderRestrictions, SkipContentVerification, SkipPerObjectIndex’.

This will tell the migration to skip these elements when performing the move. We recommend you perform these skips under the guidance of Microsoft Support.

You can review a move report from a previously failed move attempt and get some clues on what exclusions you should consider making.

For example, this failure below means that we have a search folder on the source mailbox where the query (restriction) is too complex and cannot be created on the target.

migtr24.jpg

Sometimes the failure identifies the actual problematic source folder so you can look more at the DataContext content. You can then either delete the query on the source mailbox or just skip the migration of the queries (search folders) so that you can complete the migration:

Set-MoveRequest user@contoso.com -MoveOptions @{add="SkipFolderRestrictions"}

Mailbox Integrity checks

If you migrate a mailbox (primary mailbox or archive) to Exchange Online and the size is bigger than 10GB, this is considered a large mailbox and the MRS will perform an ISinteg task to ensure integrity of the mailbox that is being moved.

If you suspect that your move is stuck on the ISinteg task, you can check the move report in EXO PowerShell and search for all entries containing the IsInteg keyword:

$stats = Get-MoveRequestStatistics <user> -IncludeReport
$stats.report.Entries | where { [string] $_ -like "*IsInteg*" } | % {[string] $_}

If that shows completed, this means there are no issues. Otherwise, you can try running the same command MRS is using on your Exchange on-premises environment, in EMS:

New-MailboxRepairRequest <migration user identity> -CorruptionType MessageId

For more info on the New-MailboxRepairRequest cmdlet, you can check here.

Depending on the Exchange Server version, you can then check the status of the repair request.

For Exchange 2013 and later, use this cmdlet:

Get-MailboxRepairRequest -Mailbox <user identity>

For Exchange 2010 version, you would need to look in Event Viewer for the following events:

  • Event 10047 when the repair request is started
  • Event 10062 when a corruption is detected and repaired
  • Event 10048 when the repair completes successfully

You can also try to move a mailbox locally from one server to another, remove the local move request and then retry migration of the mailbox to Exchange Online.

Testing MRS service

One utility that can be used for troubleshooting the mailbox move operation is the Test-MRSHealth cmdlet. Note that it cannot be run against the Office 365 side, since the cmdlet is not available to tenant administrators. However, at least in my experience, I have never encountered a situation where the MRS service was stopped on the Office 365 side (and was not automatically recovered within seconds). We can use this utility to test the Mailbox Replication Service health on-premises. Also on-premises, you can check whether MRSProxy is enabled on the EWS virtual directories and whether the EWS application pool is started in IIS Manager.
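As a minimal sketch of those on-premises checks, run from the Exchange Management Shell. The server name 'EXCH01' is a placeholder I made up; replace it with one of your own servers running the Mailbox Replication Service:

```powershell
# Run in the Exchange Management Shell on-premises.
# 'EXCH01' is a placeholder for one of your servers running MRS.
Test-MRSHealth -Identity 'EXCH01' | Format-List

# Check whether MRSProxy is enabled on the EWS virtual directories.
Get-WebServicesVirtualDirectory | Format-List Identity, MRSProxyEnabled
```

The state of the EWS application pool still has to be checked separately in IIS Manager.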

Event Viewer Diagnostic logging

When performing a mailbox move, you can turn up diagnostic logging on the Mailbox Replication Service or another component such as ASP.NET to get better, more granular events in the event log on-premises.

In most situations, you don’t actually get useful events in the on-premises Event Viewer when troubleshooting an Exchange Online remote move, because those events are written in the datacenter. Still, the default event logging can sometimes provide enough information about the issue; take for example event 1309 from ASP.NET, where the description is self-explanatory: the MRSProxy service being disabled.

If you do find a relevant event log for the affected Exchange Online remote move in the event viewer and this is related to MRS, you can turn up diagnostic logging for the MRS service with the following cmdlet:

Get-EventLogLevel ‘MSExchange Mailbox Replication*’ | Set-EventLogLevel -Level Expert

Then reproduce the issue or wait for it to be reproduced again and then check in the Event Viewer logs for any relevant events.
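Expert-level logging is noisy, so it is worth putting the levels back once you are done. Here is a small sketch that captures the current levels before raising them and restores them afterwards. It assumes the objects returned by Get-EventLogLevel expose the Identity and EventLevel properties shown in that cmdlet's output:

```powershell
# Capture the current levels first so they can be restored later.
$originalLevels = Get-EventLogLevel 'MSExchange Mailbox Replication*'

# Raise logging while reproducing the issue.
$originalLevels | Set-EventLogLevel -Level Expert

# ... reproduce the issue and review the Event Viewer logs ...

# Restore each category to the level it had before.
foreach ($category in $originalLevels) {
    Set-EventLogLevel -Identity $category.Identity -Level $category.EventLevel
}
```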

Tracking incoming failed requests from EXO

Especially useful for communication or timeout failures, there are 3 main logs on the Exchange on-premises servers that let you track MRS requests and understand whether an MRS request from Exchange Online reached your Exchange server or not. These often help us narrow the issue down to a network device in front of the Exchange servers that is most likely terminating the connection instead of passing it on. Or, if the request does reach the Exchange servers, we can see where it is stuck and get a better understanding of the problem on the Exchange on-premises side.

Exchange on-premises server logs to track an EXO Incoming MRS request:

  • HTTPerr logs: %SystemRoot%\System32\LogFiles\HTTPERR
  • IIS logs for Default Web Site (DWS): %SystemDrive%\inetpub\logs\LogFiles\W3SVC1 (timestamps in UTC)

The name of the IIS logs contains the date of the log, for example u_ex190930.log is from Sept 30, 2019.

  • HTTPProxy logs for EWS (available in Exchange 2013 or later): %ExchangeInstallPath%Logging\HttpProxy\Ews

The name of the HTTPProxy logs contains the date and hour starting to log, for example HttpProxy_2019093014-10.LOG (10th log from Sept 30, 2019, starting hour 14:00 UTC)
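To find the right files quickly, the naming patterns described above can be turned into a small helper. This is only a sketch: Get-MrsLogFileHint is a hypothetical function name, and it assumes the failure timestamp you pass in is already in UTC:

```powershell
# Hypothetical helper: derives the IIS log file name and the HttpProxy
# log file name prefix from a UTC timestamp, following the naming
# patterns described above.
function Get-MrsLogFileHint {
    param(
        [Parameter(Mandatory)]
        [datetime]$UtcTimestamp
    )
    [pscustomobject]@{
        IisLogName         = 'u_ex{0:yyMMdd}.log' -f $UtcTimestamp
        HttpProxyLogPrefix = 'HttpProxy_{0:yyyyMMddHH}-' -f $UtcTimestamp
    }
}

# Example: a failure logged on Sept 30, 2019 at 14:25 UTC.
Get-MrsLogFileHint -UtcTimestamp ([datetime]'2019-09-30T14:25:00')
```

The HttpProxy value is only a prefix, because the sequence number after the hyphen (the 10 in the example above) cannot be derived from the timestamp.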

A few things to mention here:

  • Always correlate the timestamp (HH:MM:SS) of a failure in the move report with these logs (IIS and HTTPProxy logs are in the UTC time zone)
  • A failed request will normally not have a 200 status code (if you see 200 in the logs, you are probably not looking at the failed request). The exception is a request that timed out: you might still see it here with a 200 status code and possibly a higher time-taken value
  • If you see the failed request in the HTTPerr logs, it probably won’t be present in the IIS logs or HTTPProxy logs, because it is stuck in front of IIS. Check the particular reason in the HTTPerr logs and check for IIS misconfiguration
  • If you see the failed requests in the IIS logs, you can do IIS failed request tracing on that status code and then check the detailed error in the HttpProxy logs
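As a sketch of how to scan an IIS log for candidate failed MRS requests, the function below parses W3C-format lines and returns requests to mrsproxy.svc whose status code is not 200. Select-MrsProxyFailure is a hypothetical helper, the sample log lines are made up for illustration, and it assumes your IIS logs include the cs-uri-stem and sc-status fields:

```powershell
# Hypothetical helper: parses W3C-format IIS log lines and returns
# requests to mrsproxy.svc that did not end with status 200.
function Select-MrsProxyFailure {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string[]]$LogLine
    )
    $fields = @()
    foreach ($line in $LogLine) {
        if ($line -like '#Fields:*') {
            # The #Fields header names the columns of the lines that follow.
            $fields = ($line -replace '^#Fields:\s*', '').Trim() -split '\s+'
            continue
        }
        if ($line.StartsWith('#') -or -not $line.Trim() -or $fields.Count -eq 0) { continue }
        $values = $line -split ' '
        $entry = [ordered]@{}
        for ($i = 0; $i -lt $fields.Count -and $i -lt $values.Count; $i++) {
            $entry[$fields[$i]] = $values[$i]
        }
        if ($entry['cs-uri-stem'] -like '*mrsproxy.svc*' -and $entry['sc-status'] -ne '200') {
            [pscustomobject]$entry
        }
    }
}

# Made-up sample lines, for illustration only.
$sample = @(
    '#Fields: date time cs-method cs-uri-stem sc-status time-taken',
    '2019-09-30 14:25:01 POST /EWS/mrsproxy.svc 200 1200',
    '2019-09-30 14:25:07 POST /EWS/mrsproxy.svc 401 15',
    '2019-09-30 14:25:09 POST /EWS/Exchange.asmx 500 30'
)
$failures = @(Select-MrsProxyFailure -LogLine $sample)
$failures | Format-Table
```

Against a real log you would pass the file content, for example Select-MrsProxyFailure -LogLine (Get-Content <path to log>). Because requests that time out can still be logged with a 200 status code, this filter will miss them; for those, look at the time-taken value instead.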

This concludes Part 3 of this blog series. We will be talking about troubleshooting slow migrations next!

I would like to thank the following people for contributing to this blog and for their time and patience in reading it: Angus Leeming, William Rall, Brad Hughes, Chris Boonham, Ben Winzenz, Cristian Dimofte, Nicu Simion, Nino Bilic, Timothy Heeney

Mirela Buruiana

