It's a privilege to be part of the Global Support team at Nutanix. This blog post is purely my own view and does not reflect any company or team policies. We receive calls from customers, and each day brings a different story. The Nutanix product portfolio has grown from a traditional storage stack into a cloud-native-ready platform with many new products. Cloud-ready products and a simplified UI/user experience are Nutanix's main goals, but there is a lot of hard work and code beneath the surface to make things easy for administrators and users.
For example, a company CEO and a Nutanix administrator can both use the LCM/1-click button to upgrade AOS/AHV/firmware and watch their favorite game while the upgrades happen behind the scenes. Let me start with some basic things that come to mind when a customer calls in, and how we capture details quickly to begin the troubleshooting process. The first goal is to understand the customer's infrastructure and setup rather than jump straight into fixing the ongoing issue. A better understanding of the infrastructure will help you resolve most open issues quickly. NCC health checks, Prism Alerts/Tasks, and the Prism Central capacity planner can quickly point to problems with a specific Nutanix cluster.
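As a quick first pass from any CVM, the health checks and in-progress tasks are usually where we start. A minimal example (exact flags may vary slightly by NCC/AOS version):
ncc health_checks run_all => Run the full NCC health check suite
ecli task.list include_completed=false => List the tasks that are still in progress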
ncli cluster info => Show the AOS version, cluster UUID, and NCC version details
nodetool -h 0 ring => Cassandra ring showing the nodes participating in the storage cluster
cluster status | grep -v UP => Check for services that are not UP
ncli ru list => Show the hardware (rackable unit) details
ncli host list => Show additional details about each host in the cluster
svmips && hostips && ipmiips && svmips -d => Print the cluster IPs (CVM, host, and IPMI)
AOS Upgrade: allssh cat ~/config/upgrade.history
Hypervisor Upgrade: allssh cat ~/config/hypervisor_upgrade.history
NCC Upgrade: allssh cat ~/config/ncc_upgrade.history
Maintenance Mode: allssh cat ~/config/maintenance_mode.history
CVM Memory: allssh cat ~/config/memory_update.history
Hardware Change: allssh cat ~/config/hardware.history
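To review all of these histories in one pass, the files can simply be read together (a convenience variant of the commands above, assuming the default ~/config paths):
allssh "cat ~/config/*.history"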
There are multiple pages where you can check more details via a CVM, such as Curator (2010), Cerebro (2020), Stargate (2009), Ergon (2090), and Acropolis (2030). Don't open these pages if you are not sure what you are looking for; let the support engineers drive it further.
links http:0:2009 => Stargate
links http:0:2010 => Curator
links http:0:2012 => Chronos
links http:0:2014 => Alert Manager
links http:0:2016 => Pithos
links http:0:2020 => Cerebro
links http:0:2025 => Arithmos
links http:0:2030 => Acropolis
links http:0:2031 => Hyperint
links http:0:2070 => From PC only
links http:0:2090 => Ergon
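If you only need a non-interactive snapshot of one of these pages (for example, to paste into a support case), links can dump the page to text; the same caution about interpreting them applies:
links -dump http:0:2010 | head -50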
Important ports:
TCP 2009 Stargate
TCP 2010 Curator
TCP 2012 Chronos
TCP 2016 Pithos
TCP 2020 Cerebro leader CVM
TCP 2030 Acropolis
TCP 2074 Communication between the Nutanix Guest Agent (NGA) in the guest and the NGT service on the CVM
TCP 5900 VNC port used to connect to a VM console via the AHV host
TCP 9440 Prism (Prism Element/Prism Central web UI)
Leaders (current service leaders, read from Zookeeper):
( ztop=/appliance/logical/leaders; for z in $(zkls $ztop | egrep -v 'vdisk|shard'); do [[ "${#z}" -gt 40 ]] && continue; leader=$(zkls $ztop/$z | grep -m1 ^n) || continue; echo "$z" $(zkcat $ztop/$z/$leader | strings); done | column -t; )
If we need to understand more about Nutanix Files, we use the commands below:
List the FSVMs and their IP addresses — a few alternatives as well:
nutanix@CVM:$ ncli file-server list
nutanix@CVM:$ ncli fs ls
nutanix@CVM:$ afs info.nvmips
nutanix@CVM:$ afs info.nvm_info
Find out the Minerva leader:
nutanix@CVM:$ service=/appliance/logical/pyleaders/minerva_service; echo "Minerva ==>" $(zkcat $service/`zkls $service | head -1`)
nutanix@CVM:$ afs info.get_leader
The version of the Nutanix Files product installed:
nutanix@FSVM:$ afs version
nutanix@CVM:$ ncli file-server list | grep Version
List all the File Shares:
nutanix@CVM:$ ncli fs list-all-fs-shares
nutanix@FSVM:$ afs share.list
Details of a specific FSVM: nutanix@FSVM:$ afs fs.info
Check the health of the File Server: nutanix@FSVM:~$ afs smb.health_check
Get the timezone of the File Server: nutanix@FSVM:~$ afs fs.get_timezone
Print the UUIDs of the FSVMs: nutanix@FSVM:~$ afs ha.minerva_ha_print_nvm_uuid_owners
High Availability status: nutanix@FSVM:~$ afs ha.minerva_check_ha_state
SMB protocol version: nutanix@FSVM:~$ afs smb.get_conf "server max protocol" section=global
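Note that the nutanix@FSVM commands above are run after SSHing into one of the FSVMs from a CVM; the FSVM IPs come from the listing commands shown earlier, for example (<fsvm_ip> is one of the printed addresses):
nutanix@CVM:$ afs info.nvmips
nutanix@CVM:$ ssh nutanix@<fsvm_ip>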
When there is a maintenance activity and you need to place the CVM/host in maintenance mode, remember the commands below (one typical sequence is sketched after this list).
ncli host list
ncli host edit id=<cvm_host_id> enable-maintenance-mode=true
ncli host edit id=<cvm_host_id> enable-maintenance-mode=false
cvm_shutdown -P now => Power-off CVM gracefully
cvm_shutdown -r 0 => Reboot the CVM
acli host.enter_maintenance_mode_check <host_ip>
acli host.enter_maintenance_mode <host_ip>
acli host.exit_maintenance_mode <host_ip>
acli ha.get
acli ha.update enable_failover=True
acli ha.update enable_failover=False
acli ha.update reservation_type=kAcropolisHANoReservations
acli ha.update reservation_type=kAcropolisHAReserveHosts
acli ha.update reservation_type=kAcropolisHAReserveSegments
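A minimal sketch of one typical order for AHV host maintenance using the commands above; the IDs and IPs are placeholders, and the official Nutanix procedure for your AOS version should always be followed:
ncli host list => Note the host ID and the host/CVM IPs
acli host.enter_maintenance_mode_check <host_ip> => Verify that the host can be evacuated
acli host.enter_maintenance_mode <host_ip> => Live-migrate the VMs off the AHV host
ncli host edit id=<cvm_host_id> enable-maintenance-mode=true => Place the CVM/host in maintenance mode
cvm_shutdown -P now => Gracefully power off the CVM before the host maintenance
(perform the maintenance, power the host and CVM back on, then reverse the steps)
ncli host edit id=<cvm_host_id> enable-maintenance-mode=false
acli host.exit_maintenance_mode <host_ip>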
Storage usage per VM:
acli vm.list > acli_vm_list_tmp.txt && while read uuid; do echo -n "$(grep $uuid acli_vm_list_tmp.txt) " && ~/diagnostics/entity_space_usage_stat.py -i $uuid | sed -n 's/Live usage: \(.*\)$/\1/p'; echo " "; done < <(cat acli_vm_list_tmp.txt | grep -v UUID | awk '{print $NF}') ; rm acli_vm_list_tmp.txt
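The same loop in a more readable multi-line form (a sketch; it assumes the diagnostics script path above is unchanged and uses a hypothetical temp file):
acli vm.list | grep -v UUID > /tmp/vm_list.txt    # one VM per line: name and UUID, header removed
while read -r line; do
  uuid=$(echo "$line" | awk '{print $NF}')        # UUID is the last column
  usage=$(~/diagnostics/entity_space_usage_stat.py -i "$uuid" | grep 'Live usage')
  echo "$line => $usage"
done < /tmp/vm_list.txt
rm /tmp/vm_list.txt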
For Prism Central Docker-related issues, you can use the commands below:
docker exec -it <container_id_or_name> bash
docker plugin inspect pc/nvp:latest | grep DATASERVICES_IP
genesis stop nucalm epsilon
docker plugin disable -f pc/nvp
docker plugin set pc/nvp DATASERVICES_IP=10.xx.xx.xx
docker plugin inspect pc/nvp:latest | grep DATASERVICES_IP
docker plugin set pc/nvp PRISM_IP=10.xx.xx.xx
docker plugin inspect pc/nvp:latest | grep PRISM_IP
docker plugin enable pc/nvp
docker plugin ls
sudo journalctl -u docker-latest
We can use the commands below for log collection:
Upload a default collection directly to the Nutanix FTP server: logbay collect --dst=ftp://nutanix -c <case_number>
Log collection for the last 2.5 hours: logbay collect --duration=-2h30m
Log collection for 6.25 hours after 2 PM (using cluster time and timezone): logbay collect --from=2019/04/09-14:00:00 --duration=+6h15m
Collect and aggregate all individual node log bundles into a single file on the CVM where the command is run: logbay collect --aggregate=true
Upload logs to a Nutanix storage container to prevent local disk usage (NCC 3.10.0 and above): logbay collect --dst=container:/container_name
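These options can be combined; for example, to collect the last 4 hours from all nodes, aggregate them into a single bundle, and upload it to the Nutanix FTP server for a case (a sketch built from the flags above):
logbay collect --duration=-4h --aggregate=true --dst=ftp://nutanix -c <case_number>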
Don't forget to run the OOB (out-of-band) script for any hardware-related problems to get additional insights.
An LCM failure is one of the more challenging scenarios to troubleshoot, and the information and logs below are useful.
Files/info being collected:
Node level information
LCM leader
Foundation version
LCM version
Hypervisor type/version
LCM configuration:
/etc/nutanix/hardware_config.json
/etc/nutanix/factory_config.json
Logs from the CVM:
/home/nutanix/data/logs/genesis.out*
/home/nutanix/data/logs/lcm_ops.out*
/home/nutanix/data/logs/lcm_op.trace
/home/nutanix/data/logs/lcm_wget.log
/home/nutanix/data/logs/foundation*
/home/nutanix/data/logs/catalog.out*
/home/nutanix/data/logs/prism_gateway.log
/home/nutanix/data/logs/hera.out*
/home/nutanix/data/logs/ergon.out*
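A quick first pass over these logs is often just a grep for failures across all CVMs, for example (a generic grep, not an LCM-specific tool; adjust the pattern as needed):
allssh "grep -iE 'error|fail' /home/nutanix/data/logs/lcm_ops.out | tail -20"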
Another common scenario is a failed disk and identifying the underlying hardware issue. Useful commands for deeper troubleshooting:
lsscsi => List the SCSI devices visible to the CVM
df -h => Check the mounted filesystems and their usage
sudo smartctl -a /dev/sdX -T permissive => Pull the SMART data for the disk
fdisk -l /dev/sdX => Show the partition table of the disk
Verify that the device links are good as follows: sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -p 1 -i
The lsiutil command below (sudo lsiutil -a 13,0,0 20) clears the error counters:
sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -a 13,0,0 20
sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -a 12,0,0 20
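To map a device such as /dev/sdX back to a physical slot, model, and serial number, the disk inventory commands on the CVM are also handy (availability and output may vary slightly by AOS version):
list_disks => Show the slot, model, and serial number of each physical disk
ncli disk list => Show the logical disk objects, their mount paths, and their status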
Another interesting topic is reconciling Nutanix storage space usage with the UVMs and other components that consume it. This is not easy when overlapping layers such as DR and Files live in the same cluster. Things to account for:
- Genuine usage of UVM (Guest VMs)
- Local snapshots taken for UVMs by user
- Snapshots taken by a third-party backup solution (Veeam, HYCU, etc.)
- DR snapshots taken by the Nutanix solution (Async DR, Metro, NearSync, etc.)
- DR Snapshots taken by Entity Centric – Prism Central based DR
- Nutanix Files
- DR for Nutanix Files
- SSR based snapshots for Nutanix Files
- Features enabled for the storage containers like Compression, De-duplication, Erasure coding
- What is the replication factor?
- Any VM Migrations planned for this cluster?
- Any node failures or outstanding issues with this cluster?
- Simple math to answer the single-node-failure equation. For example, three 10 TB nodes can store a maximum of roughly 9 TB of user data and still tolerate a single node failure with RF2 (see the worked example after this list)
- Is Prism Central available? The Cluster Runway view helps with capacity planning
- Is the cluster undersized? What is the expected usage vs. the current usage?
- At 90% usage of total capacity we are at risk of a cluster-down scenario (at 95%, the cluster goes read-only)
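A quick worked version of the single-node-failure math above (a back-of-the-envelope sketch, not an official sizing formula):
# 3 nodes x 10 TB raw each, RF2, and enough free space to re-protect data after one node failure
raw_after_failure=$((2 * 10))            # 20 TB raw remaining with one node down
usable_rf2=$((raw_after_failure / 2))    # 10 TB logical once every extent is re-protected
echo "Theoretical max user data: ${usable_rf2} TB; staying under the ~90% threshold gives roughly 9 TB"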
We also get plenty of ESXi/VMware cases, so the commands below are useful.
Log locations to verify network/storage/driver/memory/CPU issues in ESXi:
grep vmnic /var/log/vmkernel.log | grep Link => Check the vmkernel logs to see whether a NIC link went down (ESXi log)
grep vmnic /var/log/vobd.log | grep -i down => Check the VOBD logs to see whether a NIC link went down (ESXi log)
ESXi commands:
vm-support -V => List all the VMs registered on the host and their power state
vim-cmd vmsvc/getallvms => List all the registered VMs and their VMIDs
vim-cmd vmsvc/power.getstate <Vmid> => Get the power state of a VM using its VMID
vmware -vl => Check the version and build number of the ESXi host
esxcfg-nics -l
esxcli network nic list => Both commands list the physical NICs currently installed and loaded on the system
esxcfg-route -l => List the default gateway of the ESXi host
ethtool -i <vmnic name> => List the driver and firmware information of a NIC
vmkchdev -l | grep vmnicX => List the vendor ID, device ID, sub-vendor ID, and sub-device ID of a NIC
esxcli network vswitch standard portgroup list => List all of the port groups currently on the system
esxcfg-vswitch -l => List all of the port groups along with the NIC associated with each port group and vSwitch
vim-cmd vmsvc/unregister <Vmid> => Unregister a VM
vim-cmd solo/registervm <path to .vmx> => Register a VM
/sbin/services.sh restart => Restart all management agents on the host. If you get an error similar to "This ruleset is required and cannot be disabled",
run: services.sh restart & tail -f /var/log/jumpstart-stdout.log
Once all services are started, press Ctrl+C to break out of the tail. This is applicable to ESXi 6.5.
/etc/init.d/hostd restart => Restart the hostd service
/etc/init.d/mgmt-vmware status
service mgmt-vmware restart => Check the status of or restart the management services; after a restart you should see a "BEGIN SERVICES" message in the hostd logs. Note that if LACP is configured on the host, this will briefly disconnect the network.
To verify whether LACP is running: esxcli network vswitch dvs vmware lacp status get
ps -ef | grep hostd => Verify that hostd is running; if the output is blank, hostd is not running
/etc/init.d/sshd restart => Restart the SSH service
vim-cmd hostsvc/hostsummary | grep bootTime => Host boot time
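When trying to catch a link flap or a storage error as it happens, it can also help to watch the vmkernel log live while reproducing the issue (vmnic0 below is just a placeholder for the NIC in question):
tail -f /var/log/vmkernel.log | grep -i vmnic0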
Finally, another common issue is a stuck node/disk removal, and the commands below are useful. You should open a support case and contact Nutanix Support for this scenario, as it involves a deep understanding of the workflow.
Run this command to understand the node/disk status in Zeus: zeus_config_printer | grep -i to_remove
Open the Curator master page (http://0:2010) and look for the node/disk removal progress
Look for the pending ExtentGroups for migration:
allssh "grep ExtentGroupsToMigrateFromDisk /home/nutanix/data/logs/curator.INFO"
allssh 'cat data/logs/curator.* | grep "Egroups for removable disk"'
medusa_printer --lookup egid --egroup_id <EgroupID> | head -n 50
grep -i "data_migration_status" zeus.txt (where zeus.txt is saved zeus_config_printer output)
ncli host get-rm-status
cassandra_status_history
nodetool -h 0 leadership
zeus_config_printer | grep dyn_ring_change_info -A10
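On recent AOS versions, a couple of additional commands are commonly used to watch removal progress (exact availability varies by release, so treat these as a suggestion):
ncli disk get-rm-status => Show the removal status of disks marked for removal
curator_cli get_last_successful_scans => Confirm that Curator scans are completing, since they drive the data migration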
"Be social and share this on social media if you feel it is worth sharing."