Recently we’ve had some serious storage-related latency issues, which in some cases even resulted in frozen VMs. In this post I would like to share how we troubleshot these issues with the help of VMware vRealize Log Insight and vRealize Operations Manager.
A great source of information for troubleshooting is of course the set of ESXi log files, especially the vmkernel.log and the storagerm.log for storage-related problems. But when you have a large number of hosts connecting to several storage arrays, troubleshooting becomes very difficult and time-consuming. In our case the latencies appeared at different times and on different hosts, in spikes whose impact and duration varied every time.
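ESXi logs a well-known "performance has deteriorated" message in vmkernel.log when device latency climbs. As an illustration, here is a small Python sketch that pulls those messages out of a log excerpt; the device names, timestamps and values below are made up for the example.

```python
import re

# Illustrative vmkernel.log lines; real entries carry more prefix detail.
sample_log = """\
2015-04-21T20:15:02Z vmkernel: Device naa.60060160a1b02e00 performance has deteriorated. I/O latency increased from average value of 1832 microseconds to 48211 microseconds.
2015-04-21T20:17:45Z vmkernel: Device naa.60060160a1b02e01 performance has deteriorated. I/O latency increased from average value of 2110 microseconds to 51003 microseconds.
"""

pattern = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<peak>\d+) microseconds"
)

def extract_latency_events(text):
    """Return (device, avg_us, peak_us) tuples for every deterioration message."""
    return [(m["device"], int(m["avg"]), int(m["peak"]))
            for m in pattern.finditer(text)]

for device, avg, peak in extract_latency_events(sample_log):
    print(f"{device}: {avg} us -> {peak} us ({peak / avg:.1f}x increase)")
```

A script like this can give a quick per-device overview on a single host, but it doesn't scale across dozens of hosts and arrays — which is exactly where Log Insight comes in.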
VMware’s vRealize Log Insight is a great product which delivers real-time log management for VMware environments, with machine-learning-based Intelligent Grouping, high-performance search and better troubleshooting across physical, virtual and cloud environments.
Installation is very simple and in our environment took approximately two hours. After that all our vCenter servers and cluster hosts were sending their logs to the Log Insight server, which was showing us relevant data within a matter of minutes.
Log Insight ships with some great dashboards covering specific areas of your VMware environment. In this section I will show how Log Insight helped us troubleshoot our storage-related problems. The screenshots were made with a custom time period selected, to show the results in our log files once the problem was solved. The image below shows the General – Problems dashboard.
The following issues are shown:
- A lot of VMFS Heartbeat timed-out and restored events in the vSphere problems by type pie chart
- Latency problems, as shown in the Average ESX/ESXi SCSI Latency graph
- vSphere connectivity lost events for the Storage component
In the following screenshot the Storage – Overview dashboard is shown.
The following issues are apparent:
- VMware VMFS heartbeat events by status at particular times during the day
- VMware VMFS heartbeat events by datastore, showing that multiple datastores were involved at the same time
- VMware VMFS heartbeat timeouts on several hosts and volumes
In the screenshot below the Storage – SCSI Latency / Errors dashboard is shown.
This dashboard also shows that:
- Latency problems on a lot of volumes, as shown in the Average ESX/ESXi SCSI Latency graph (note that the latency problems were gone as soon as the fix was applied, around 22:00 on the 21st of April)
- Errors on multiple devices, paths and hostnames, indicating that the problem was located in a central component of the SAN infrastructure
Next we looked into the Storage – SCSI Sense Codes dashboard.
This dashboard shows SCSI errors by device and sense data. It is recommended to look up what these sense codes mean.
Recently, VMware vExpert Florian Grehl developed a great tool on his website to help you decode sense codes: check out his ESXi SCSI Sense Code Decoder at http://www.virten.net/vmware/esxi-scsi-sense-code-decoder/
The 0x6 0x29 0x0 sense data showed that a “Power on, reset, or bus device reset occurred”, confirming that latencies were causing connection timeouts and resets.
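For readers unfamiliar with the format: the triple consists of a sense key (0x6 = UNIT ATTENTION) plus an ASC/ASCQ pair (0x29/0x0 = power on, reset, or bus device reset occurred). A minimal decoding sketch in Python, with a deliberately tiny lookup table — the full lists live in the SCSI (SPC) standard and in tools such as the decoder linked above:

```python
# Minimal lookup tables covering only a few common values.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
}

ASC_ASCQ = {
    (0x29, 0x0): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x3F, 0xE): "REPORTED LUNS DATA HAS CHANGED",
}

def decode_sense(key, asc, ascq):
    """Translate a sense key / ASC / ASCQ triple into human-readable text."""
    key_text = SENSE_KEYS.get(key, f"unknown sense key {key:#x}")
    detail = ASC_ASCQ.get((asc, ascq), f"unknown ASC/ASCQ {asc:#x}/{ascq:#x}")
    return f"{key_text}: {detail}"

print(decode_sense(0x6, 0x29, 0x0))
```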
We then looked into the Interactive Analysis feature of Log Insight with certain filters. With this feature we were able to show exactly when specific events, such as the heartbeat errors shown below, were occurring.
We also did this with the filters shown below:
With this data we also wanted to get a better view of exactly how high the latencies were, on which datastores or devices they occurred, and whether we could discover a certain trend or behaviour in our environment.
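Conceptually, finding the spikes is simple once you have per-datastore latency samples: flag any sample that is far above the datastore’s normal baseline. A small sketch of that idea, using made-up sample data (in practice the series would come from exported metrics) and the median as a robust baseline:

```python
from statistics import median

# Hypothetical per-datastore latency samples in milliseconds
# (e.g. 5-minute intervals); the names and values are illustrative.
samples = {
    "DS01": [4, 5, 4, 6, 48, 52, 5, 4],
    "DS02": [3, 4, 3, 3, 4, 4, 3, 4],
}

def find_spikes(series, multiple=3.0):
    """Flag sample indexes whose value exceeds `multiple` times the median."""
    baseline = median(series)
    return [i for i, v in enumerate(series) if v > multiple * baseline]

for datastore, series in samples.items():
    spikes = find_spikes(series)
    if spikes:
        print(f"{datastore}: spike(s) at sample index {spikes}")
```

The median is used rather than the mean because a handful of large spikes would drag the mean up and hide smaller excursions.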
VMware vRealize Operations
VMware vRealize Operations (vROps, formerly vCOps) is a great tool to proactively identify and solve emerging issues with predictive analytics and smart alerts, ensuring optimum performance and availability of applications and infrastructure across virtualized environments. We already had this solution installed in our vSphere environment, so we just had to build custom dashboards for troubleshooting this problem.
I configured a custom dashboard showing our two production clusters with their configured datastores. The dashboard was built with the Generic Scoreboard and Metric Graph widgets, showing the Disk Command Latency status and metric data for each datastore in the cluster.
With the help of this custom dashboard we discovered the exact time a latency spike occurred and how high it was. With this data we could analyse the performance data and logging on the storage arrays. We also discovered that the latencies almost always occurred on one side of our storage cluster, which consists of two controllers: at any given moment the latencies were detected on the datastores managed by either controller A or controller B.
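The pattern became obvious once we grouped the spikes by the controller owning each datastore. A sketch of that grouping step, with hypothetical datastore-to-controller mappings and spike times:

```python
from collections import Counter

# Made-up mapping of datastores to the storage-cluster controller that owns
# them, plus observed spike events (datastore, time-of-day).
controller_of = {"DS01": "A", "DS02": "A", "DS03": "B", "DS04": "B"}
spikes = [("DS01", "20:15"), ("DS02", "20:15"), ("DS03", "06:30")]

def spikes_per_controller(spikes, controller_of):
    """Count latency spikes per owning controller to spot a one-sided pattern."""
    return Counter(controller_of[ds] for ds, _ in spikes)

print(spikes_per_controller(spikes, controller_of))
```

When one controller dominates the count in a given time window, that points away from individual LUNs or hosts and towards a shared component — in our case the inter-controller link.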
Members of the storage team were given access to the dashboard so they could monitor it alongside their own monitoring and logging tools. After consulting their storage vendor, they discovered that the interlink connection between both controllers was running at full capacity at the times we saw the latency spikes. The vendor advised installing a new software version on the storage array, which fixed the problem by allowing the interlink connection to run in duplex mode instead of one-way, increasing its effective performance.
The new software version was installed on the 21st of April around 22:00 and had an immediate effect on our latency spikes, as visible in the screenshot below. vCOps/vROps confirmed we had found the solution for our latency problems.
While troubleshooting this we checked a lot of things:
- Firmware levels of the hardware and HBA
- Driver levels for the Fibre Channel adapter
- Multipathing and ALUA settings
- Correct configuration of the LUNs on the storage arrays for VMware
- Queue depth recommendations and settings
- Fibre Channel network traffic and bandwidth usage
- Whether VMware hosts were using the optimal paths to the right controller of the storage array
Log Insight and vROps were key to analysing the data efficiently and troubleshooting effectively. By sharing our data and dashboards with other teams we also helped the storage team and their storage vendor’s support find the cause of these latencies faster.