In a customer's SCOM infrastructure I ran into a very strange problem. The infrastructure consists of two management servers and one SQL server. Only one of the management servers stopped responding to its agents from time to time, with the following symptoms:
- All agent computers that had this management server as their primary management server suddenly sent critical alerts.
- The System event log contained a huge number of errors with event ID 7011 from Service Control Manager.
- The management server did not respond on TCP port 5723.
- The management server appeared to be running, and all services seemed to be up.
- After restarting the HealthService on the management server, everything was fine until the problem returned a few hours or days later.
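The port symptom above is easy to check from any other machine. Here is a minimal sketch in Python, assuming plain TCP reachability is a good enough indicator. The function name `is_port_open` and the host name in the usage comment are my own placeholders, not anything from SCOM:

```python
import socket


def is_port_open(host: str, port: int = 5723, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Usage (hypothetical host name, replace with your management server):
#   is_port_open("scom-ms01.example.local")
```

A successful connect only shows that something is listening on the port. It does not prove that the HealthService is processing agent data, so treat it as a quick first check.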
I browsed through every event log to get a picture of the situation. In the System event log I found many errors; the detailed error description was…
“A timeout (30000 milliseconds) was reached while waiting for a transaction response from the HealthService service.”
After some investigation I found an event logged shortly before the error occurred, which finally pointed me in the right direction…
“Installation Successful: Windows successfully installed the following update: Definitions update für Microsoft Endpoint Protection – KB2461484 (Definition 1.133.361.0)”
“Automatic Updates is now resumed.”
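Looking for update events logged shortly before each timeout can also be done with a small script once the events are exported. The following is only a sketch: the `Event` tuple, the function name, the 10-minute window and the sample timestamps are all assumptions of mine for illustration, not data from the customer's server.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    time: datetime
    source: str
    event_id: int


def updates_before_timeouts(events, window=timedelta(minutes=10)):
    """Map each 7011 timeout to the WindowsUpdateClient events in the window before it."""
    timeouts = [e for e in events
                if e.source == "Service Control Manager" and e.event_id == 7011]
    updates = [e for e in events if e.source == "WindowsUpdateClient"]
    return {
        t.time: [u for u in updates if t.time - window <= u.time <= t.time]
        for t in timeouts
    }


# Invented sample data, for illustration only.
sample = [
    Event(datetime(2012, 1, 1, 3, 0), "WindowsUpdateClient", 19),
    Event(datetime(2012, 1, 1, 3, 2), "Service Control Manager", 7011),
    Event(datetime(2012, 1, 1, 9, 0), "Service Control Manager", 7011),
]

for when, hits in updates_before_timeouts(sample).items():
    print(when, "preceded by", len(hits), "update event(s)")
```

If most timeouts turn out to have an update event just before them, that supports the suspicion. It does not prove that the update client is the cause.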
It seemed that the WindowsUpdateClient was causing problems. To me it looked as if the update service was preventing the HealthService from running. SCCM 2012 was also deployed, and after some more investigation I disabled the following service…
…and disabled the task “Configuration Manager Health Evaluation” on the management server…
…then everything ran smoothly, and the management server never stopped responding again.
I know this is not a final solution, but it helps at least for the moment. I will try to find a permanent fix, but until then I will leave it this way.
I hope you find this helpful…