For a few weeks now I have been having a problem with SCOM agents in child domains. I am currently implementing SCOM for an international company which has branch offices in China, the US, Germany, Switzerland and other countries. The Active Directory topology has a root domain and a separate child domain for each country. In some of the child domains the SCOM agents sometimes appear grey in the SCOM console and sometimes switch back to green. In addition, they write event 20070 into the OperationsManager event log…
Daniele Grandini and Marnix Wolf have each written an article about this problem, but have not been able to solve it so far. I recommend reading their posts first because I am not going to repeat what they have already written.
The official statement from Microsoft support is that there is currently no solution available for this problem.
In my experience, the mean of the collected samples for the agent-simulated query is ~505 ms in the problematic domain. In a healthy domain the same query takes ~114 ms.
Solution (Microsoft): Install a domain controller from the problematic domain in the same site where your management server lives.
There is also a TechNet thread about this problem; see here.
I am also currently troubleshooting this problem together with Microsoft support. To find out when the problem occurs and what latency breaks the agent's authentication, I modified the very useful PowerShell script from Daniele Grandini. Because I had trouble with the special characters in his script and also with querying the SID attribute in this part …
I adjusted it to…
and finally added the functionality which writes the following information into a text file…
- Milliseconds Query 1 (this query is just to get the agent's SID)
- Milliseconds Query 2 (this is the important query which the agent actually performs)
- Availability of the agent (true = agent is green; false = agent is grey)
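To make the shape of such a log record concrete, here is a minimal Python sketch of timing two queries and writing the three values as one ";"-delimited line. It is not the PowerShell script itself: the function names `timed_ms` and `format_log_line`, the column order, and the rounding to whole milliseconds are assumptions for illustration, and the two "queries" are simple stand-ins rather than real Active Directory lookups.

```python
import time


def timed_ms(query):
    """Run a zero-argument callable and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = query()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def format_log_line(query1_ms, query2_ms, agent_available):
    """Build one record: query 1 ms; query 2 ms; availability (true/false).

    The column order and the ';' delimiter layout are assumptions.
    """
    availability = "true" if agent_available else "false"
    return f"{query1_ms:.0f};{query2_ms:.0f};{availability}"


if __name__ == "__main__":
    # Hypothetical stand-ins for the two directory queries.
    _, q1 = timed_ms(lambda: time.sleep(0.01))
    _, q2 = timed_ms(lambda: time.sleep(0.02))
    print(format_log_line(q1, q2, agent_available=True))
```

Keeping one record per line makes the file easy to import into a spreadsheet later.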
Daniele Grandini explains very well how to troubleshoot this problem, but for completeness I will repeat it here.
Here I would like to share my version of the script, which might also help you figure out what is wrong in your environment, and explain how I am using it.
1) On your management server execute…
nltest /dsgetdc:[agentdomain where you experience the error]
e.g. nltest /dsgetdc:china.root.com
–> This returns the domain controller which your management server uses to query the china domain, e.g. ChinaDomainController.china.root.com
2) Define the variables in the script accordingly…
$domain –> domain controller from step 1
$agent –> FQDN of an agent you are having trouble with, e.g. in the china.root.com domain
3) You could run the script manually….
Alternatively, as I do, use the Windows Task Scheduler to run the script every 5 minutes, which dumps the log files into c:\temp on the management server. I am currently running 4 tasks for different domains in different countries…
4) Next, I import the log files into Excel (delimiter “;”). From there you can calculate the average query time or create a chart to visualize the performance data. According to Daniele Grandini, if the query time is near 1000 ms, the agent is experiencing this problem.
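If you prefer not to use Excel, the same averages can be computed with a few lines of Python. The sketch below assumes each log line holds three ";"-separated values in the order query 1 ms, query 2 ms, availability (true/false). That layout, the function name `summarize`, and treating "near 1000 ms" as 90% of the threshold or more are assumptions, so adjust them to your log files.

```python
import csv
import io
from statistics import mean


def summarize(log_text, threshold_ms=1000.0, near_factor=0.9):
    """Summarize query 2 timings from ';'-delimited log text."""
    rows = []
    for record in csv.reader(io.StringIO(log_text), delimiter=";"):
        if len(record) < 3:
            continue  # skip blank or incomplete lines
        try:
            q1_ms, q2_ms = float(record[0]), float(record[1])
        except ValueError:
            continue  # skip header or malformed lines
        available = record[2].strip().lower() == "true"
        rows.append((q1_ms, q2_ms, available))

    all_q2 = [q2 for _, q2, _ in rows]
    green_q2 = [q2 for _, q2, ok in rows if ok]
    grey_q2 = [q2 for _, q2, ok in rows if not ok]
    return {
        "samples": len(rows),
        "mean_query2_ms": mean(all_q2) if all_q2 else None,
        "mean_query2_green_ms": mean(green_q2) if green_q2 else None,
        "mean_query2_grey_ms": mean(grey_q2) if grey_q2 else None,
        # Samples at or above near_factor * threshold_ms count as "near".
        "near_threshold": sum(1 for q2 in all_q2 if q2 >= near_factor * threshold_ms),
    }


if __name__ == "__main__":
    sample = "20;114;true\n25;505;false\n30;980;false\n"
    for key, value in summarize(sample).items():
        print(key, value)
```

Splitting the mean by availability shows whether slow queries actually coincide with the agent being grey.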
I am currently working on steps 3 and 4 and hope to learn more about why these agents have this problem and when exactly they turn grey.
I will update this post as soon as I have new information on this topic.
Download the PowerShell script here.