One thing I am missing in OMS is an easy way to monitor Windows services and processes. There are several ways to monitor such components with OMS, for example:
- Run a PowerShell script on the server to check the status of the Windows service or process, then use the HTTP Data Collector API to ingest the result into Azure Log Analytics. From there you would create a custom alert.
- If you want to know whether a Windows service has stopped, you could also check for Windows events that indicate that a service stopped. This is (as far as I know) not consistent across all Windows operating systems, but collecting these events with the OMS agent would be a good approach. From there you would create a custom alert in the OMS portal.
I think these are the two main approaches to solve this problem. What I was looking for was a slicker approach to figure out if a Windows service has been stopped or a certain process has been started. In this post I would like to cover both scenarios.
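For the event-based approach mentioned above, a query along these lines might serve as a starting point. It is only a sketch under several assumptions: the System event log is collected by the agent including Information-level events, the service stop is logged as Service Control Manager event 7036 (which, as noted, may not hold on every Windows version), the message text is in English, and the Print Spooler service is used purely as an example.

```kusto
// Sketch: look for "service stopped" events in the System log.
// Assumes event 7036 is logged and the message text is English.
Event
| where TimeGenerated > ago(5m)                                  // Only recent events
| where EventLog == "System" and Source == "Service Control Manager"
| where EventID == 7036                                          // Service state change event
| where RenderedDescription has "Print Spooler"                  // Example service name
    and RenderedDescription has "stopped"                        // Only the stopped state
| project TimeGenerated, Computer, RenderedDescription
```

An alert on this query would fire when the number of results is greater than 0.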
Process Monitoring
In case you want to be alerted if a certain process has been started, in our example notepad.exe, you could do this by using performance counters.
If you start Perfmon and search for the Process object, you see all the process instances running at this moment. Here I started notepad.exe…
The Process object has many performance counters, as you can see here…
Which should we choose now? Well, if you think about it, it is simple. EVERY process has a process ID: no process, no process ID. Therefore we need to collect the ID Process counter of the instance we are interested in, in this case notepad. How do we get the proper performance counter syntax to add it in OMS? If you added the counter in Perfmon you just need to check the properties of the counter…
…like \Process(notepad)\ID Process.
If you are more into PowerShell you could find the process counters like this…
((Get-Counter -ListSet * | Where-Object {$_.CounterSetName -eq "Process"}).PathsWithInstances |
Where-Object {$_ -like "*notepad*"})
So far we know the counter we want to collect. Now we simply add this to OMS. Go to the Windows Performance Counters settings page in the OMS portal…
…add the counter WITHOUT the leading “\”, like this: Process(notepad)\ID Process
At this point the OMS agent starts collecting this performance counter every 10 seconds.
Next we need to create the Azure Log Analytics query to figure out when a notepad instance was last seen. The final query looks like this…
Perf // Get Performance log
| where ObjectName == "Process" // Get the Process object
| where InstanceName == "notepad" // Get the notepad instance
| sort by TimeGenerated desc // Sort the data by TimeGenerated
| summarize LastTime = arg_max(TimeGenerated,*) by Computer // Figure out which is the last data received
| where LastTime > ago(5m) // Check if there is a result in the past 5 minutes
In the Azure Log Analytics portal it looks like this…
As you can see, we receive one result. If no notepad process had been running, the query would return an empty result list.
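A slightly leaner variant of the query could look like the sketch below. It is an alternative, not the query used in the screenshots: it filters on the time range first, additionally restricts on the counter name (assuming the counter is stored with CounterName "ID Process"), and uses max() instead of arg_max() because only the timestamp is needed.

```kusto
// Sketch: same idea as above, filtering early to reduce the data scanned.
Perf
| where TimeGenerated > ago(5m)                       // Only the past 5 minutes
| where ObjectName == "Process"                       // Process object
    and CounterName == "ID Process"                   // Assumed counter name
    and InstanceName == "notepad"                     // notepad instance
| summarize LastTime = max(TimeGenerated) by Computer // Last sample per computer
```

Like the original query, it returns one row per computer that reported a notepad instance in the past 5 minutes, and an empty result otherwise.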
As a last step we need to create an alert in the OMS portal like this…
…then save your alert settings. You will receive an alert shortly after a notepad process has been started.
Windows Service Monitoring
For (most) Windows services we could use the same approach as described above, but with inverted logic. Our goal is to receive an alert if a Windows service has been stopped. For this example we use the Print Spooler service.
The Print Spooler service starts an exe called spoolsv.exe, as you can see on the properties page…
In the Perfmon GUI, the process itself is called spoolsv…
…and the counter name is \Process(spoolsv)\ID Process…
As described above, add the counter to collect it with Azure Log Analytics, like this: Process(spoolsv)\ID Process…
At this point the OMS agent starts collecting this performance counter every 10 seconds. Of course you could change the interval to a less aggressive value if you want, e.g. 60 seconds.
Next we again need to create the Azure Log Analytics query. Basically we could use the same query as for the notepad process example above, but we need to change the instance name to spoolsv…
Perf // Get Performance log
| where ObjectName == "Process" // Get the Process object
| where InstanceName == "spoolsv" // Get the spoolsv instance
| sort by TimeGenerated desc // Sort the data by TimeGenerated
| summarize LastTime = arg_max(TimeGenerated,*) by Computer // Figure out which is the last data received
| where LastTime > ago(5m) // Check if there is a result in the past 5 minutes
In the Azure Log Analytics portal it looks like this…
Again we receive a result. But we want to know if the service is stopped, which means NO process is running.
We can simply cover this logic in the alert settings by specifying that an alert is sent if there is less than 1 result…
…next save your alert settings. You will receive an alert shortly after the spooler service has been stopped…
This example works great for services that create their own dedicated process instance. If the service runs in a shared context like svchost, you cannot easily discover the process instance, and because of that this approach does not work.
I hope you like it!
Great solution. However, this is only useful for 1 computer. How can I set up this monitoring for multiple computers?
Hi
The performance counters get collected for any computer which is connected to the workspace. You just need to modify the Azure Log Analytics query and filter for those computers you want to get the information for, that’s all.
Cheers,
Stefan
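As a rough sketch of such a filter, the query below restricts the spoolsv check to a list of computers. The computer names are hypothetical placeholders, and the CounterName filter assumes the counter is stored as "ID Process".

```kusto
// Hypothetical computer names; replace with the ones in your workspace.
let monitoredComputers = dynamic(["srv01.contoso.com", "srv02.contoso.com"]);
Perf
| where TimeGenerated > ago(5m)                       // Only the past 5 minutes
| where ObjectName == "Process"
    and CounterName == "ID Process"                   // Assumed counter name
    and InstanceName == "spoolsv"
| where Computer in~ (monitoredComputers)             // Case-insensitive match on the list
| summarize LastTime = max(TimeGenerated) by Computer // Last sample per computer
```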
Thanks Stefan. To clarify, modifying the query to get multiple computers isn’t a problem. It’s when we use the query for an alert. That’s great that I can say a service is running on 80 computers, but not the 81st–how do I know which one? How do I modify the query in such a way that it will return only the computers where the service is NOT running, and alert on each individual computer?
Hi
Well, in this case you could create a group in ALA based on a query getting all monitored computers (Group A). Then you take the list of computers returned by the query that checks for the running service (List A). Next you determine which computers from Group A are missing in List A. Those are the computers which do not have the specified service running. In the alert settings you can trigger if the count is greater than 0.
If I find time I will try to build it :).
I hope it still helps,
Stefan
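The comparison described in this reply could be sketched roughly as follows. This is an untested sketch, not a finished solution: it assumes the Heartbeat table is a reasonable stand-in for "all monitored computers" (Group A), that computer names match between Heartbeat and Perf, and that the counter is stored with CounterName "ID Process".

```kusto
// Sketch: computers that sent a heartbeat but reported no spoolsv process.
let lookback = 5m;
Heartbeat
| where TimeGenerated > ago(lookback)
| distinct Computer                                   // Group A: all reporting computers
| join kind=leftanti (
    Perf
    | where TimeGenerated > ago(lookback)
    | where ObjectName == "Process"
        and CounterName == "ID Process"               // Assumed counter name
        and InstanceName == "spoolsv"
    | distinct Computer                               // List A: computers with the process
) on Computer                                         // Keep Group A rows missing from List A
```

Each returned row is a computer without a running spoolsv process, so the alert would trigger when the number of results is greater than 0.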
It is obvious that monitoring at an interval of once every 5 minutes, plus the time needed to send the metrics to Azure or the event log events to Log Analytics, is not a useful solution, since very often the SLA of business services requires a faster response.
MS needs to remove the 5-minute restriction or offer a separate subscription plan that allows alerts to be sent more often.
What do you think about it? Maybe we have additional tricks in OMS?
Well, yes, that’s the case if you use the Log Analytics query in Azure Monitor (which is the new way of alerting). MS recognized this problem and transformed some of the logs into metric data, which lets you alert in near real time. I suggest reading this post here: https://azure.microsoft.com/en-us/blog/faster-metric-alerts-for-logs-now-in-limited-public-preview/ Have fun.
Cheers,
Stefan