Tuesday, March 24, 2015

Server reboot takes 1 hour - Microsoft Unable to resolve this case

Scope agreement for the reported issue.

Issue Definition: Server "xxxxxxxxx" takes a long time to restart, while a shutdown followed by a cold start is normal. When the restart is initiated from the Start menu, the complete reboot cycle takes more than 20 minutes; however, if we shut down the machine and perform a cold start, the machine comes up in no more than 5 minutes. This is a domain controller, and the issue started after the server was updated with the latest hotfixes/Windows updates. The reboot gets stuck while shutting down the machine, and we see a black screen on the iLO console during that time.

Environment:
OS Name Microsoft® Windows Server® 2008 Standard
Version 6.0.6002 Service Pack 2 Build 6002
System Name xxxxxxxxx
System Manufacturer HP
System Model ProLiant DL360 G7
System Type x64-based PC
Time Zone Eastern Standard Time
Installed Physical Memory (RAM) 12.0 GB
Total Physical Memory 12.0 GB
Available Physical Memory 10.3 GB
Total Virtual Memory 13.6 GB
Available Virtual Memory 12.2 GB
Page File Space 2.00 GB
Page File C:\pagefile.sys

Scope Agreement: The case will be marked as resolved and ready to close once we have successfully resolved the issue "Server "xxxxxxx" takes a long time to restart, while a shutdown followed by a cold start is normal", or if we conclude that this is by-design behavior. Any further issues that require dedicated troubleshooting will be considered a new case and will be handled by the concerned team. If the issue is found to be caused by a third-party application, we will provide best-effort support and you will have to engage the application vendor.

Please Note: The Microsoft tickets are server and issue specific. One ticket can be used only for one server and one issue. If the problem is found to be due to third-party code we will provide information to substantiate this.

We will now begin working together to resolve your issue. If you do not agree with the scope defined above, or would like to amend it, please let me know as soon as possible. If you have any questions or concerns, please don't hesitate to contact me.

Best Regards,

xxxxxxx


I just sent the action plan to Marcus.
Here it is:

Hey xxxxxxx,
I am the current owner of your case, where rebooting takes over an hour but a shutdown and start takes 5 minutes.
From the previous engineer’s notes, the action plan was to test with Symantec removed.

Have you had a chance to perform that yet?
If so, are you still having the issue?

If so, the next action plan would be to:

1. Set up the server for a complete memory dump.

2. From diagnostics, it appears your machine is set up for a complete dump using NMI.

3. Service 'HP ProLiant System Shutdown Service' is configured to start automatically on this machine.

This service may cause a memory dump to be corrupted or not created at all.

4. Please stop and disable ‘HP ProLiant System Shutdown Service’.

5. After configuring the server, shut down, start, then perform a restart to reproduce the issue.

6. When the server is hung on shutdown, crash the server using the NMI switch.

This may or may not work depending on where the machine is actually hanging.

7. If it does not work, we can then proceed with getting a reboot xperf trace.
I uploaded 2k8r2-x64-xperf.zip to your workspace.
Please download it and extract it on the server having the issue.

Open admin command prompt

From extracted directory run xperf.mgr.bat /?

If prompted to set DisablePagingExecutive, type yes

Run xperf.mgr.bat rebootcycle

Wait for machine to reboot
Log on
Xperf will now compile files into c:\*.etl

From admin command prompt run xperf.mgr.bat clean

Zip and upload c:\*.etl
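
The last step above (zip the traces before uploading) can be scripted. This is only a minimal sketch: the function name and the archive path are my own placeholders, and the only thing taken from the action plan is that the traces are `*.etl` files in one folder.

```python
import glob
import os
import zipfile


def zip_etl_traces(trace_dir, archive_path):
    """Pack every *.etl file found directly in trace_dir into one zip archive."""
    etl_files = sorted(glob.glob(os.path.join(trace_dir, "*.etl")))
    if not etl_files:
        raise FileNotFoundError("no .etl files found in " + trace_dir)
    # ETL traces usually compress well, so use DEFLATE rather than plain storage.
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in etl_files:
            # Store only the file name so the archive carries no folder prefix.
            archive.write(path, arcname=os.path.basename(path))
    # Report which traces went into the archive.
    return [os.path.basename(p) for p in etl_files]
```

On the affected server this would be called as something like `zip_etl_traces("C:\\", "C:\\reboot-traces.zip")`, where the archive name is made up for illustration.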

####################


CASE ARCHIVED

Since your downtime window is more than a couple of weeks away, we typically request archiving the case until you have new information.

When a case is archived, you can always contact me with the case number for the same issue and I will reopen it. If I am not available, you can also call back into the queue.

  


You have next steps of:
-   Gathering a complete dump
-   Capturing an xperf rebootcycle trace, either if the dump fails or alongside it in the same downtime window. It may be best to do both, since it takes so long to get a window approved.
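
For the complete dump step, the preparation from the action plan (complete dump, NMI crash, HP shutdown service stopped and disabled) can be sketched as a short list of console commands. This is a sketch under assumptions, not what the engineer actually sent: the `CrashDumpEnabled` and `NMICrashDump` values under the `CrashControl` key are the standard Windows settings as far as I know, and the service key name is a placeholder, since the case only gives the display name (`sc getkeyname` can look up the real one).

```python
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def dump_prep_commands(shutdown_service_key):
    """Return console commands that prepare a server for an NMI-triggered complete dump."""

    def reg_dword(name, value):
        # Build a 'reg add' command that writes one DWORD under CrashControl.
        return 'reg add "{}" /v {} /t REG_DWORD /d {} /f'.format(
            CRASH_CONTROL_KEY, name, value
        )

    return [
        reg_dword("CrashDumpEnabled", 1),  # 1 selects a complete memory dump
        reg_dword("NMICrashDump", 1),      # allow an NMI to crash the system
        "sc stop {}".format(shutdown_service_key),
        "sc config {} start= disabled".format(shutdown_service_key),
    ]


# "ExampleShutdownSvc" is a hypothetical key name, not the real HP service name.
for command in dump_prep_commands("ExampleShutdownSvc"):
    print(command)
```

The script only prints the commands so they can be reviewed before being run from an admin command prompt. The registry changes take effect after a restart, which the shutdown and start in the action plan already covers.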




We were unable to create a crash dump using the NMI switch. Hence, we generated an ETL (event trace log) through the xperf tool.



Results from xperf were inconclusive.

The only item of interest is Cpqnicmgmt.exe, which was the only process/service that stood out in the services view.

When would be a good time to have a call and discuss next steps?

Thanks,


##########

Our next steps would be to determine why NMI is not working.
It is very rare for NMI not to work.
There may be some configuration changes that we can make to get NMI to work.
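
One configuration item that could be checked here is the page file size. This is my own assumption and is not raised anywhere in the case notes: as I understand Microsoft's guidance, a complete memory dump needs a page file on the boot volume of at least physical RAM plus about 1 MB, unless a dedicated dump file is configured. A rough check against the figures in the Environment section, with a made-up function name:

```python
MB_PER_GB = 1024


def pagefile_shortfall_mb(ram_gb, pagefile_gb, overhead_mb=1):
    """Return how many MB the page file falls short of holding a complete dump (0 if enough)."""
    # Assumed rule of thumb: page file >= physical RAM + a small header overhead.
    required_mb = ram_gb * MB_PER_GB + overhead_mb
    return max(0, required_mb - pagefile_gb * MB_PER_GB)


# Figures from the Environment section: 12.0 GB RAM, 2.00 GB page file.
print(pagefile_shortfall_mb(12.0, 2.0))
```

With these numbers the page file comes out roughly 10 GB short, so under this assumption it would be worth ruling out before concluding that NMI itself is not working.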

Other than that, we could run some isolation tests, such as testing in safe mode or disabling processes/services and retesting.

Lastly we may have to hook up a debugger to the server to try and capture additional information when it hangs.

If this is just a single server with this issue, it may be best to rebuild.

Please let me know if you have availability during my working hours or how you would like to proceed.

Thanks,


xxxxxxx

After discussion with case contact and reviewing what has been done, here are the current limitations we are hitting:

For slow reboot issues we have only a couple of tools that can be used to diagnose the issue.

ISOLATE:
We would first try to isolate the issue by looking at recent changes on the server such as hotfixes, application installs or updates, drivers.
We would also test to see if the same issue occurs while booting into safe mode.

TROUBLESHOOT:
The primary tool would be capturing a complete memory dump of the server while it is in a hung state.
The secondary tool would be to capture an Xperf trace of the server reboot.

For this case we tried to isolate the issue but found no recent changes, and booting into safe mode still had the issue. We then tried to capture a complete memory dump using NMI but were unable to, possibly because of where the server is in the shutdown process. We were then able to capture an Xperf trace, but it did not show anything conclusive because the trace stopped before the hang occurred.

At this point, we have exhausted all of the standard troubleshooting techniques. Since this is a single server issue, the server is a domain controller, and the server does not have any special configuration, the best action would be to rebuild the server.

Thanks,


xxxxxx xxxxx

Windows Platforms Core – Reliability Team – Support Escalation Engineer
Office: xxxxxxx
Working Hours: Mon – Fri 8AM - 5PM EST






###############################################



Microsoft recommended rebuilding this server --- :-(




