Friday, December 23, 2011

SharePoint 2010 Migration

After a few months of preparation and testing, we finally upgraded our production portal from MOSS 2007 to SharePoint 2010 about two weeks ago. No major issues have been reported or found so far, and the overall feedback from end users is positive.

We used the database attach approach for the migration, i.e. install a new SharePoint 2010 farm and mount the old MOSS 2007 content databases. The process looks straightforward: run preupgradecheck on the old farm and fix the issues it reports, run Test-SPContentDatabase on the new farm and fix those issues, and finally run Mount-SPContentDatabase. Fixing the issues wasn't hard, just time-consuming.

We spent quite a lot of effort on content analysis and cleanup of our 60+ GB of content databases. In the end, all our custom WebParts, DLLs, user controls and some other SharePoint resources were repackaged into one solution called "SP2010Migration", which builds a single solution package for everything.



The SP2010Migration solution package was meant to be deployed only once. It shouldn't be used again unless we need to build a new farm from scratch. The solution package deliberately contains no Feature, because one could cause trouble in later deployments. That way we can still package any WebPart or component into a Feature in the future without worrying about Feature conflicts.

One interesting thing is that all user controls loaded by SmartParts still worked fine after the migration. But we hit some issues with the DataFormWebPart (DFWP). For example, relative links were wrong after the migration, and we had to write a console app to replace every "{@FileRef}" with "/{@FileRef}" inside each DFWP's XSLT across the whole farm.
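The text replacement at the heart of that fix can be sketched in a few lines. This is a hypothetical Python illustration, not the console app we actually ran: the function name fix_fileref_links is made up, and the guard against double-prefixing is an added assumption so that the fix can safely be run more than once. Reading and writing each DFWP's XSLT through the SharePoint object model is left out.

```python
import re

# Match "{@FileRef}" only when it is not already preceded by "/",
# so a second run does not turn "/{@FileRef}" into "//{@FileRef}".
_FILEREF = re.compile(r"(?<!/)\{@FileRef\}")


def fix_fileref_links(xsl: str) -> str:
    """Return the XSL text with every bare {@FileRef} made server-relative."""
    return _FILEREF.sub("/{@FileRef}", xsl)


if __name__ == "__main__":
    before = '<a href="{@FileRef}"><xsl:value-of select="@Title"/></a>'
    print(fix_fileref_links(before))
```

Because of the lookbehind, links that already start with a slash (or that sit behind a full URL ending in "/") are left untouched.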

Another DFWP issue was more confusing, because all we saw on the DFWP page was an error:

Unable to display this Web Part. To troubleshoot the problem, open this Web page in a Microsoft SharePoint Foundation-compatible HTML editor such as Microsoft SharePoint Designer. If the problem persists, contact your Web server administrator. Correlation ID:…

Using the Correlation ID we could easily find the error details in the ULS log:

Error while executing web part: System.StackOverflowException: Operation caused a stack overflow. At Microsoft.Xslt.NativeMethod.CheckForSufficientStack() at SyncToNavigator(XPathNavigator , XPathNavigator ) at …
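Looking up a Correlation ID amounts to filtering the log lines that mention it. A minimal sketch in Python, assuming the ULS log has already been read as lines of text; the function name and the sample lines below are made up for illustration and are not real log output:

```python
from typing import Iterable, List


def find_by_correlation_id(lines: Iterable[str], correlation_id: str) -> List[str]:
    """Return the log lines that mention the given correlation ID (case-insensitive)."""
    needle = correlation_id.strip().lower()
    if not needle:
        # An empty ID would match every line, so treat it as "nothing to find".
        return []
    return [line.rstrip("\r\n") for line in lines if needle in line.lower()]


if __name__ == "__main__":
    # Hypothetical sample lines; the ID is a placeholder, not one from our farm.
    sample = [
        "w3wp.exe\tError while executing web part\t11111111-2222-3333-4444-555555555555\n",
        "w3wp.exe\tSome unrelated entry\t99999999-0000-0000-0000-000000000000\n",
    ]
    for hit in find_by_correlation_id(sample, "11111111-2222-3333-4444-555555555555"):
        print(hit)
```

The match is a plain substring search, so it does not depend on the exact column layout of the log file.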

I tested the same setup locally and everything seemed okay. How could an out-of-the-box SharePoint WebPart hit a “stack overflow” error in production when it had worked fine before the migration? It turned out that a time-out occurred internally during the DFWP's XSL transform in the production environment. The time-out threshold is only 1 second, so any transform that takes longer than that causes the error. We have a big list on our production server, and the DFWP displays hundreds of rows in a big table, which triggered the time-out.

That’s something new in SharePoint 2010, and an annoying move by Microsoft. It's great to introduce new stuff, but it's also important to keep the old stuff working, right? Why not turn that new “time-out” feature off by default and give end users an option to set it?

The worst thing is that there's no way to change that 1-second time-out setting! Microsoft provided "three solutions" for this issue:

1.) Tune the XSL to shorten the transform time.
2.) Don't use the DFWP; use another WebPart instead.
3.) Write code that inherits from the DFWP and build your own.

Following those suggestions, we finally got the DFWP working again after tweaking its settings, e.g. fewer columns and a smaller page size. That's of course not an ideal way to solve the problem. We hope Microsoft will provide a better solution for this issue.

[2012-3 Update]: The DFWP time-out value is now configurable at the farm level with the latest SharePoint CU. Refer to this.

Wednesday, December 07, 2011

A SharePoint Double Hop Issue

A SharePoint DataForm Web Part sometimes stopped working properly after we migrated from SharePoint 2007 to a SharePoint 2010 environment. The original SharePoint 2007 farm had only one front-end server, while the new SharePoint 2010 farm includes two front-end servers and one application server. NTLM authentication is used in both the SharePoint 2007 and 2010 environments.

The DataForm Web Part works okay in SharePoint Designer. It invokes the SharePoint Profile Service to retrieve some user profile data.

The ULS log shows a (401) Unauthorized error:


w3wp.exe (0x1150) Error while executing web part: System.Net.WebException: The remote server returned an error: (401) Unauthorized. at System.Net.HttpWebRequest.GetResponse() at ....


Apparently the service call was routed to the other front-end server and got an access error there. We verified that the SharePoint Web Services on both front-end servers do have anonymous access enabled. So why did the access error still happen?

Since the user has already authenticated to the site, the service call inside the DataForm Web Part automatically impersonates the original user instead of going out as an anonymous user. That call then fails on the other front-end server, because with the NTLM setup in our environment the user's credentials cannot be passed on to a second server. This is a typical NTLM double-hop issue.

Why doesn't the service call stay on the local machine? Well, it does sometimes, and that's why it works sometimes. The problem is caused by the round-robin DNS setup. To resolve it, simply add entries to each front-end server's hosts file so that the site's domain name(s) point to the local server. Then such service calls always go to the local machine and the double-hop issue goes away.
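For illustration, the hosts file lines for one front-end server could be generated like this. It is only a sketch: the function name, the IP address 10.0.0.11 and the host name portal.example.com are placeholders, not values from our farm, and which local address to use is an assumption to check against your own setup.

```python
from typing import Iterable, List


def hosts_entries(host_names: Iterable[str], local_ip: str) -> List[str]:
    """Build hosts-file lines that map each site host name to this server's own IP."""
    entries = []
    for name in host_names:
        name = name.strip()
        if name:  # skip blank names
            # hosts-file format: IP address, whitespace, host name
            entries.append(f"{local_ip}\t{name}")
    return entries


if __name__ == "__main__":
    # Placeholder values; run once per front-end server with that server's own IP.
    for line in hosts_entries(["portal.example.com"], "10.0.0.11"):
        print(line)
```

Each front-end server gets its own copy of these lines with its own address, so a service call made from that server resolves to itself instead of going through round-robin DNS.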