Nutanix Support Engineer Diary

It’s a privilege to be part of the global support team at Nutanix. This blog post is purely my own view and does not reflect any company or team policies. We receive calls from customers, and each day brings a different story. The Nutanix product portfolio has grown from a traditional storage stack into a cloud-native-ready platform with the introduction of many new products. Delivering cloud-ready products and simplifying the UI and user experience is Nutanix’s main motive, but there is a lot of hard work and code beneath the surface to make things easy for administrators and users.

 

For example, a company CEO and a Nutanix administrator can both use the LCM/1-click button to upgrade AOS, AHV, or firmware and watch their favorite game while the upgrades happen behind the scenes. Let me start with some basic things that come to mind when a customer calls in, and how we capture details quickly to begin the troubleshooting process. The first goal is to understand the customer’s infrastructure and setup rather than jump straight into fixing the ongoing issues; a better understanding of the infrastructure will help you resolve most open issues quickly. NCC health checks, Prism Alerts/Tasks, and the Prism Central capacity planner can quickly point to problems with a specific Nutanix cluster.
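For instance, a full NCC run from any CVM gives a quick health summary before digging into individual services (the exact set of checks varies by NCC version):

ncc health_checks run_all => Run all NCC health checks and review the PASS/WARN/FAIL summary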

 

ncli cluster info => To show the AOS/Cluster UUID/NCC version details

nodetool -h 0 ring =>  Cassandra ring to show the nodes participating in storage cluster

cluster status | grep -v UP => Check for any services that are not UP

ncli ru list => To understand the hardware details

ncli host list => Additional details about the hosts in the cluster

svmips && hostips && ipmiips && svmips -d  => print the Cluster IPs

AOS Upgrade: allssh cat ~/config/upgrade.history

Hypervisor Upgrade: allssh cat ~/config/hypervisor_upgrade.history

NCC Upgrade: allssh cat ~/config/ncc_upgrade.history

Maintenance Mode: allssh cat ~/config/maintenance_mode.history

CVM Memory: allssh cat ~/config/memory_update.history

Hardware Change: allssh cat ~/config/hardware.history
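To glance at the most recent entries in all of these history files on every CVM in one pass, tail prints a per-file header automatically (a small convenience sketch using the same paths listed above):

allssh 'tail -n 5 ~/config/*.history'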

 

There are multiple service pages you can check from a CVM for more detail, such as Curator (2010), Cerebro (2020), Stargate (2009), Ergon (2090) and Acropolis (2030). Don’t open these pages if you are not sure what you are looking for; let the support engineers drive it further.

links http:0:2009 => Stargate

links http:0:2010 => Curator

links http:0:2012 => Chronos

links http:0:2014 => Alert Manager

links http:0:2016 => Pithos

links http:0:2020 => Cerebro

links http:0:2025 => Arithmos

links http:0:2030 => Acropolis

links http:0:2031 => Hyperint

links http:0:2070 => From PC only

links http:0:2090 => Ergon
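If you only need a quick, read-only snapshot of one of these pages, the links browser can dump it to the terminal instead of opening it interactively (assuming the links build on the CVM supports -dump, which it normally does):

links -dump http:0:2010 | head -40 => Print the top of the Curator page without entering the interactive browser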

 

Important ports:

TCP 2009 Stargate
TCP 2010 Curator
TCP 2012 Chronos
TCP 2016 Pithos
TCP 2020 Cerebro leader CVM
TCP 2030 Acropolis
TCP 2074 Communication between the NGA (guest agent) and the NGT service on the CVM
TCP 5900 VNC port used to connect to a VM console via the AHV host
TCP 9440 Prism / Prism Central web console

Leaders:

( ztop=/appliance/logical/leaders; for z in $(zkls $ztop | egrep -v 'vdisk|shard'); do [[ "${#z}" -gt 40 ]] && continue; leader=$(zkls $ztop/$z | grep -m1 ^n) || continue; echo "$z" $(zkcat $ztop/$z/$leader | strings); done | column -t; )

 

If we need to understand more about Nutanix Files, we use the commands below.

List the FSVMs and their IP addresses (a few alternative commands):

nutanix@CVM:$ ncli file-server list

nutanix@CVM:$ ncli fs ls

nutanix@CVM:$ afs info.nvmips

nutanix@CVM:$ afs info.nvm_info

Find out the Minerva leader:

nutanix@CVM:$ service=/appliance/logical/pyleaders/minerva_service; echo "Minerva ==>" $(zkcat $service/`zkls $service| head -1`)
nutanix@CVM:$ afs info.get_leader

The version of the Nutanix Files product installed:

nutanix@FSVM:$ afs version
nutanix@CVM:$ ncli file-server list | grep Version

List all the File Shares:

nutanix@CVM:$ ncli fs list-all-fs-shares
nutanix@FSVM:$ afs share.list

Details of a specific FSVM: nutanix@FSVM:$ afs fs.info

Check the health of the File Server: nutanix@FSVM:~$ afs smb.health_check

Get the timezone of the File Server: nutanix@FSVM:~$ afs fs.get_timezone

Print the UUIDs of the FSVMs: nutanix@FSVM:~$ afs ha.minerva_ha_print_nvm_uuid_owners

High Availability status: nutanix@FSVM:~$ afs ha.minerva_check_ha_state

SMB protocol version: nutanix@FSVM:~$ afs smb.get_conf "server max protocol" section=global
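Putting a few of these together, a typical first pass on a Files case looks something like this (a sketch; key-based SSH from the CVM to the FSVM internal IP is normally already in place):

nutanix@CVM:$ afs info.nvmips => Find the FSVM IPs
nutanix@CVM:$ ssh <fsvm_ip> => Log in to one of the FSVMs
nutanix@FSVM:~$ afs version
nutanix@FSVM:~$ afs ha.minerva_check_ha_state
nutanix@FSVM:~$ afs smb.health_check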

 

When there is maintenance activity and you need to place the CVM/host in maintenance mode, remember the commands below.

ncli host list

ncli host edit id=<cvm_host_id> enable-maintenance-mode=true

ncli host edit id=<cvm_host_id> enable-maintenance-mode=false

cvm_shutdown -P now  => Power-off CVM gracefully

cvm_shutdown -r 0 => Reboot the CVM

acli host.enter_maintenance_mode_check <host_ip>

acli host.enter_maintenance_mode <host_ip>

acli host.exit_maintenance_mode <host_ip>

acli ha.get

acli ha.update enable_failover=True

acli ha.update enable_failover=False

acli ha.update reservation_type=kAcropolisHANoReservations

acli ha.update reservation_type=kAcropolisHAReserveHosts

acli ha.update reservation_type=kAcropolisHAReserveSegments
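For a single AHV node, the usual order is to check that the host can enter maintenance, move the VMs off, and take care of the CVM last (a sketch using the same commands as above):

acli host.enter_maintenance_mode_check <host_ip>
acli host.enter_maintenance_mode <host_ip>
ncli host edit id=<cvm_host_id> enable-maintenance-mode=true
cvm_shutdown -P now

After the maintenance work, power the CVM back on and reverse the steps:

ncli host edit id=<cvm_host_id> enable-maintenance-mode=false
acli host.exit_maintenance_mode <host_ip>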

Storage usage per VM:

acli vm.list > acli_vm_list_tmp.txt && while read uuid; do echo -n "$(grep $uuid acli_vm_list_tmp.txt) " && ~/diagnostics/entity_space_usage_stat.py -i $uuid | sed -n 's/Live usage: \(.*\)$/\1/p'; echo " "; done < <(cat acli_vm_list_tmp.txt | grep -v UUID | awk '{print $NF}') ; rm acli_vm_list_tmp.txt

 

For Prism Central Docker-related issues, you can use the commands below:

docker exec -it <container_ID_or_name> bash

docker plugin inspect pc/nvp:latest | grep DATASERVICES_IP

genesis stop nucalm epsilon

docker plugin disable -f pc/nvp

docker plugin set pc/nvp DATASERVICES_IP=10.xx.xx.xx

docker plugin inspect pc/nvp:latest | grep DATASERVICES_IP

docker plugin set pc/nvp PRISM_IP=10.xx.xx.xx

docker plugin inspect pc/nvp:latest | grep PRISM_IP

docker plugin enable pc/nvp

docker plugin ls

sudo journalctl -u docker-latest
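When the data services IP has changed on the cluster backing Prism Central, those plugin commands are typically run in roughly this order (a sketch only; cluster start is the standard way to bring the stopped services back up):

genesis stop nucalm epsilon
docker plugin disable -f pc/nvp
docker plugin set pc/nvp DATASERVICES_IP=10.xx.xx.xx
docker plugin enable pc/nvp
docker plugin ls
cluster start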

 

We can use the commands below for log collection:

Upload a default collection directly to the Nutanix FTP server: logbay collect --dst=ftp://nutanix -c <case_number>

Log collection for the last 2.5 hours: logbay collect --duration=-2h30m

Log collection for 6.25 hours after 2 pm (using cluster time and timezone): logbay collect --from=2019/04/09-14:00:00 --duration=+6h15m

Collect and aggregate all individual node log bundles into a single file on the CVM where the command is run: logbay collect --aggregate=true

Upload logs to a Nutanix storage container to avoid local disk usage (NCC 3.10.0 and above): logbay collect --dst=container:/container_name
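These options can be combined; for example, collecting the last four hours, aggregating into one bundle, and uploading straight to the case would look something like this (a sketch, reusing the same flags as above):

logbay collect --duration=-4h --aggregate=true --dst=ftp://nutanix -c <case_number>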

Don’t forget to run the OOB (out-of-band) script for any hardware-related problems to get additional insights.

 

LCM failure is one of the more challenging scenarios to troubleshoot, and the logs below are useful:

Files/info being collected:

Node level information
LCM leader
Foundation version
LCM version
Hypervisor type/version

LCM configuration:

/etc/nutanix/hardware_config.json
/etc/nutanix/factory_config.json

Logs from CVM:

/home/nutanix/data/logs/genesis.out*
/home/nutanix/data/logs/lcm_ops.out*
/home/nutanix/data/logs/lcm_op.trace
/home/nutanix/data/logs/lcm_wget.log
/home/nutanix/data/logs/foundation*
/home/nutanix/data/logs/catalog.out*
/home/nutanix/data/logs/prism_gateway.log
/home/nutanix/data/logs/hera.out*
/home/nutanix/data/logs/ergon.out*
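A quick first pass over those CVM logs is to grep them on every node around the failure timestamp (a sketch using the same paths listed above):

allssh "grep -iE 'error|fail' /home/nutanix/data/logs/lcm_ops.out | tail -20"
allssh "grep -i lcm /home/nutanix/data/logs/genesis.out | tail -20"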

 

One of the common scenarios is a failed disk, where we need to identify the underlying hardware issue. The commands below are useful for deeper troubleshooting.

lsscsi

df -h

sudo smartctl -a /dev/sdX -T permissive

fdisk -l /dev/sdX

Verify that the device links are good as follows: sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -p 1 -i

The lsiutil command (sudo lsiutil -a 13,0,0 20) clears the error counters:

sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -a 13,0,0 20

sudo /home/nutanix/cluster/lib/lsi-sas/lsiutil -a 12,0,0 20
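To sweep every SATA/SAS drive at once instead of checking them one by one, the smartctl call above can be wrapped in a small loop (a sketch; the interesting fields differ between SATA and SAS drives, and NVMe devices use different names):

for d in /dev/sd?; do echo "== $d =="; sudo smartctl -a $d -T permissive | grep -iE 'result|health|reallocated|pending'; done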

 

An interesting topic is matching Nutanix storage space usage against the UVMs and other components that consume it. This is not easy when there are overlapping layers such as DR and Files living in the same cluster. Things to account for:

  • Genuine usage of UVM (Guest VMs)
  • Local snapshots taken for UVMs by user
  • Snapshots taken by 3rd party backup solution (Veeam/HYCU.. etc)
  • DR Snapshots taken by Nutanix solution – Async DR, Metro, Near Sync .. etc
  • DR Snapshots taken by Entity Centric – Prism Central based DR
  • Nutanix Files
  • DR for Nutanix Files
  • SSR based snapshots for Nutanix Files
  • Features enabled for the storage containers like Compression, De-duplication, Erasure coding
  • What is the replication factor?
  • Any VM Migrations planned for this cluster?
  • Any node failures or outstanding issues with this cluster?
  • Simple math to answer the one-node-failure equation: for example, 3x 10 TB nodes can store a maximum of about 9 TB of user data and still tolerate a single node failure with RF2 (see the worked math after this list)
  • Is Prism Central available? Cluster Runway helps with capacity planning
  • Is the cluster undersized? What is the expected usage vs. current usage?
  • At 90% usage of total capacity we are at risk of a cluster-down scenario (at 95% the cluster goes read-only)
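A rough sketch of that arithmetic for the 3x 10 TB example (ignoring CVM and metadata overhead):

3 nodes x 10 TB = 30 TB raw capacity
To rebuild after a single node failure, all data must fit on the remaining 2 nodes = 20 TB raw
RF2 keeps two copies of everything, so 20 TB / 2 = 10 TB of logical capacity
Staying under the ~90% threshold: 10 TB x 0.9 ≈ 9 TB of user data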

 

We also get plenty of ESXi/VMware cases, so the commands below are useful:

Log locations to verify network/storage/driver/memory/CPU issues in ESXi:

grep vmnic /var/log/vmkernel.log | grep Link => Check vmkernel logs to see if there was a link-down event (ESXi log)
grep vmnic /var/log/vobd.log | grep -i down => Check VOBD logs to see if there was a link-down event (ESXi log)

ESXi Commands

vm-support -V => List all the VMs registered on the host and their power state

vim-cmd vmsvc/getallvms => List all the VMs registered on the host and their VMIDs

vim-cmd vmsvc/power.getstate <Vmid> => Get the power state of a VM using its VMID

vmware -vl => Check the version and build number of the ESXi host

esxcfg-nics -l => List the physical NICs currently installed and loaded on the system

esxcli network nic list => List the physical NICs currently installed and loaded on the system (alternative)

esxcfg-route -l => List the default gateway of the ESXi host

ethtool -i <vmnic_name> => List the driver and firmware information of a NIC

vmkchdev -l | grep vmnicX => List the vendor ID, device ID, sub-vendor ID and sub-device ID of a NIC

esxcli network vswitch standard portgroup list => List all of the port groups currently on the system

esxcfg-vswitch -l => List all of the port groups currently on the system along with the NICs associated with each port group and vSwitch

vim-cmd vmsvc/unregister <Vmid> => Unregister a VM

vim-cmd solo/registervm <path_to_vmx> => Register a VM

/sbin/services.sh restart => Restart all management agents on the host. If you get an error similar to "This ruleset is required and cannot be disabled", run:

services.sh restart & tail -f /var/log/jumpstart-stdout.log

Once all services are started, press Ctrl+C to break out of the tail. This applies to ESXi 6.5.

/etc/init.d/hostd restart
/etc/init.d/mgmt-vmware status
service mgmt-vmware restart => Check the status of, or restart, the hostd/management services. If you restart the services, you should see a "BEGIN SERVICES" message in the hostd logs. Keep in mind that if LACP is configured on the host, this will briefly disconnect the network

To verify if LACP is running: esxcli network vswitch dvs vmware lacp status get

ps -ef | grep hostd => Verify whether hostd is running. If the output is blank, hostd is not running

/etc/init.d/sshd restart => Restart the SSH service

vim-cmd hostsvc/hostsummary | grep bootTime => Host boot time
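Tying the hostd commands together, a minimal check-and-restart pass looks like this (a sketch; keep the LACP caveat above in mind before restarting management agents):

ps -ef | grep hostd => Confirm whether hostd is running before and after
/etc/init.d/hostd restart
grep -i "BEGIN SERVICES" /var/log/hostd.log | tail -2 => Confirm hostd logged its startup marker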

 

Finally, another common issue is a stuck node/disk removal; the commands below are useful. You should open a support case and engage Nutanix Support for this scenario, as it requires a deep understanding of the workflow.

Run this command to understand the node/disk status in Zeus: zeus_config_printer | grep -i to_remove

Check the Curator master page (links http:0:2010) and look for the node/disk removal progress

Look for the pending ExtentGroups for migration:

allssh "grep ExtentGroupsToMigrateFromDisk /home/nutanix/data/logs/curator.INFO"

allssh 'cat data/logs/curator.* | grep "Egroups for removable disk"'

medusa_printer --lookup egid --egroup_id <EgroupID> | head -n 50

grep -i "data_migration_status" zeus.txt

ncli host get-rm-status

cassandra_status_history

nodetool -h 0 leadership

zeus_config_printer | grep dyn_ring_change_info -A10
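To watch progress without re-running things by hand, the same Zeus check can be wrapped in watch from any CVM (a sketch; it only observes and changes nothing):

watch -n 30 -d "zeus_config_printer | grep -i to_remove"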

 

"Be social and share this on social media if you feel it is worth sharing."

 
