Exported on 29-Oct-2021 11:41:08
Parameters
1 - Check Distro and Version
Check the distro and version, print the result to Attune's running log, and store it in a temp file.
Later steps use this info to decide how to install missing packages.
Login as user {Linux User} on node {Linux Node}
if [ -f /etc/os-release ]
then
# freedesktop.org and systemd
# all the distros supported by this blueprint have /etc/os-release
. /etc/os-release
DISTRO=$ID
VER=$VERSION_ID
elif type lsb_release >/dev/null 2>&1
then
# linuxbase.org
# UNTESTED
DISTRO=$(lsb_release -si)
VER=$(lsb_release -sr)
elif [ -f /etc/lsb-release ]
then
# For some versions of Debian/Ubuntu without lsb_release command
# UNTESTED
. /etc/lsb-release
DISTRO=$DISTRIB_ID
VER=$DISTRIB_RELEASE
elif [ -f /etc/debian_version ]
then
# Older Debian/Ubuntu/etc.
# UNTESTED
DISTRO=Debian
VER=$(cat /etc/debian_version)
elif [ -f /etc/SuSE-release ]
then
# Older SuSE/etc.
# TODO currently unimplemented
:
elif [ -f /etc/redhat-release ]
then
# Older Red Hat, CentOS, etc.
# TODO currently unimplemented
:
else
# Fall back to uname, e.g. "Linux <version>", also works for BSD, etc.
# UNTESTED
DISTRO=$(uname -s)
VER=$(uname -r)
fi
echo DISTRO=$DISTRO
echo VER=$VER
# write distro checking result to file
cat << EOF > {linuxDistroCheckingResultTempFile}
DISTRO='$DISTRO'
VER='$VER'
EOF
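As a minimal sketch of how a later step can consume this file, the snippet below writes a sample result to a temporary file and loads it back, the same way the later steps in this blueprint do. RESULT_FILE, PKG_TOOL and the sample values are hypothetical stand-ins; in the blueprint the path comes from the {linuxDistroCheckingResultTempFile} parameter.

```shell
# Hypothetical stand-in for {linuxDistroCheckingResultTempFile}
RESULT_FILE=$(mktemp)

# Write the result the same way step 1 does (sample values, not detected ones)
DISTRO=ubuntu
VER=20.04
cat << EOF > "$RESULT_FILE"
DISTRO='$DISTRO'
VER='$VER'
EOF

# A later step runs in a fresh shell, so clear the variables to imitate that
unset DISTRO VER

# Load the stored result and branch on it
. "$RESULT_FILE"
case $DISTRO in
    ubuntu | debian) PKG_TOOL=apt ;;
    centos)          PKG_TOOL=dnf ;;
    *)               PKG_TOOL=unknown ;;
esac
echo "DISTRO=$DISTRO VER=$VER PKG_TOOL=$PKG_TOOL"

rm -f "$RESULT_FILE"
```

Because the file is loaded with ".", it is executed as shell code, so this approach assumes the stored values contain no single quotes.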
2 - Gather General Info
Show general info about the system, such as date and time, hostname, info about CPU, etc.
Login as user {Linux User} on node {Linux Node}
echo "========================================================================="
echo "Display the current date and time of the host(date)"
echo "========================================================================="
date
echo
echo "========================================================================="
echo "Print system information(uname -a)"
echo "========================================================================="
uname -a
echo
echo "========================================================================="
echo "Query the system hostname and related settings(hostnamectl)"
echo "========================================================================="
hostnamectl
echo
echo "========================================================================="
echo "Show who is logged on and what they are doing"
echo "This also includes the output of 'uptime'"
echo "(w)"
echo "========================================================================="
w
echo
echo "========================================================================="
echo "Display information about the CPU architecture(lscpu)"
echo "========================================================================="
lscpu
echo
echo "========================================================================="
echo "Display information about the CPU architecture"
echo "in table view with one CPU per line"
echo "(lscpu -ae)"
echo "========================================================================="
lscpu -ae
echo
echo "========================================================================="
echo "Show the kernel's CPU info(cat /proc/cpuinfo)"
echo "========================================================================="
[ -f /proc/cpuinfo ] && cat /proc/cpuinfo
echo
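The banner-command-blank-line pattern used in this step repeats throughout the blueprint. As a sketch only, it could be wrapped in a small helper. The function name section is hypothetical, and it is an assumption that Attune's per-line exit code check behaves the same way when a command runs inside a function.

```shell
# Hypothetical helper: print a banner with a title, run the command,
# print a blank line, and pass the command's exit code through.
section() {
    title=$1
    shift
    echo "========================================================================="
    echo "$title"
    echo "========================================================================="
    "$@"
    status=$?
    echo
    return "$status"
}

section "Display the current date and time of the host(date)" date
section "Print system information(uname -a)" uname -a
```

Returning the command's own exit code keeps a failing command visible to the caller instead of hiding it behind the trailing echo.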
3 - Gather System Logs
Show the content of the system logs. They are usually lengthy, so each step displays only one.
3.1 - Kernel Ring Buffer - dmesg
Print the kernel ring buffer. The listing is long, so it gets a step of its own.
Debian requires root privileges to run dmesg, so a credential with Sudo To root is needed.
Login as user {Linux User(sudo)} on node {Linux Node}
echo "========================================================================="
echo "Print the kernel ring buffer(dmesg)"
echo "========================================================================="
dmesg
3.2 - System Log File
Print the system log file. The listing is long, so it gets a step of its own.
Debian requires root privileges to show the content of /var/log/syslog, so a credential with Sudo To root is needed.
Login as user {Linux User(sudo)} on node {Linux Node}
# The system log file path differs between distros,
# so we need to handle each one separately
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
ubuntu | debian)
echo "========================================================================="
echo "Print the system log file(cat /var/log/syslog)"
echo "========================================================================="
[ -f /var/log/syslog ] && cat /var/log/syslog
;;
centos)
echo "========================================================================="
echo "Print the system log file(cat /var/log/messages)"
echo "========================================================================="
[ -f /var/log/messages ] && cat /var/log/messages
;;
*)
echo "unsupported distro"
false # exit code 1 will let Attune suspend running the job
;;
esac
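The distro-to-path mapping in this step can also be written as a lookup function, which keeps the banner and the cat in one place. This is a sketch only: syslog_path_for is a hypothetical name, and the paths are the ones the script above already uses.

```shell
# Hypothetical helper: map a distro ID (as stored in DISTRO) to its system log path.
# Returns a non-zero status for distros this blueprint does not support.
syslog_path_for() {
    case $1 in
        ubuntu | debian) echo /var/log/syslog ;;
        centos)          echo /var/log/messages ;;
        *)               return 1 ;;
    esac
}

for d in ubuntu debian centos
do
    echo "$d -> $(syslog_path_for "$d")"
done
```

A caller could then run something like LOG=$(syslog_path_for "$DISTRO") and print "$LOG" once, instead of repeating the banner in every case branch.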
4 - Gather Modules Status
Show kernel module info, covering both software and hardware modules.
A package may need to be installed, so Sudo To root is required.
Login as user {Linux User(sudo)} on node {Linux Node}
# 'lsusb' isn't installed on CentOS by default
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
ubuntu | debian)
# Nothing to be done
;;
centos)
dnf install -y usbutils
;;
*)
echo "unsupported distro"
false # exit code 1 will let Attune suspend running the job
;;
esac
echo "========================================================================="
echo "Show the status of modules in the Linux Kernel(lsmod)"
echo "========================================================================="
lsmod
echo
echo "========================================================================="
echo "List all PCI devices(lspci)"
echo "========================================================================="
lspci
echo
echo "========================================================================="
echo "List all PCI devices(lspci -v)"
echo "========================================================================="
lspci -v
echo
echo "========================================================================="
echo "List USB devices(lsusb)"
echo "========================================================================="
lsusb
echo
echo "========================================================================="
echo "List USB devices(lsusb -v)"
echo "========================================================================="
lsusb -v
echo
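Before installing anything, a step could first check which of its commands are actually missing. The sketch below uses the POSIX "command -v" lookup; missing_commands is a hypothetical helper name and is not part of the blueprint.

```shell
# Hypothetical helper: print the name of every given command that is not on PATH
missing_commands() {
    for cmd in "$@"
    do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Report which of the commands used in this step are unavailable (may print nothing)
missing_commands lsmod lspci lsusb
```

An empty result means every command was found, so the package installation could be skipped on that host.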
5 - Gather Memory Stats
Show memory related info.
Login as user {Linux User} on node {Linux Node}
echo "========================================================================="
echo "Display amount of free and used memory in the system(free -m)"
echo "========================================================================="
free -m
echo
echo "========================================================================="
echo "Display kernel info of memory(cat /proc/meminfo)"
echo "========================================================================="
[ -f /proc/meminfo ] && cat /proc/meminfo
echo
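Since other blueprints are meant to inspect this output, here is a minimal sketch of pulling a single value out of /proc/meminfo-style text. The helper name meminfo_mib is hypothetical; it relies on /proc/meminfo reporting its values in kB.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and
# print the requested field converted from kB to MiB (truncated).
meminfo_mib() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }'
}

# Sample input in the same layout as /proc/meminfo
printf 'MemTotal:        8192000 kB\nMemAvailable:    4096000 kB\n' | meminfo_mib MemTotal

# On a real host, read the live file instead
if [ -r /proc/meminfo ]
then
    meminfo_mib MemAvailable < /proc/meminfo
fi
```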
6 - Gather Network Info
Show network related info.
A package may need to be installed, so Sudo To root is required.
Login as user {Linux User(sudo)} on node {Linux Node}
# The commands used in this step need the package net-tools,
# which is not installed by default, so we install it first
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
ubuntu | debian)
apt update
apt install -y net-tools
;;
centos)
dnf install -y net-tools
;;
*)
echo "unsupported distro"
false # exit code 1 will let Attune suspend running the job
;;
esac
echo "========================================================================="
echo "Show network interfaces(such as IP, subnet, MAC, etc.)"
echo "/usr/sbin/ifconfig"
echo "========================================================================="
# By default, normal users on Debian don't have /usr/sbin in $PATH
/usr/sbin/ifconfig
echo
echo "========================================================================="
echo "Show the routing tables(netstat -r)"
echo "========================================================================="
netstat -r
echo
echo "========================================================================="
echo "Show all sockets(netstat -apn)"
echo "========================================================================="
netstat -apn
echo
echo "========================================================================="
echo "Show content of the resolver configuration file(cat /etc/resolv.conf)"
echo "========================================================================="
[ -f /etc/resolv.conf ] && cat /etc/resolv.conf
echo
echo "========================================================================="
echo "Show statistics of network interfaces(cat /proc/net/dev)"
echo "========================================================================="
[ -f /proc/net/dev ] && cat /proc/net/dev
echo
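The /proc/net/dev listing is easy to post-process. As a sketch, the function below reduces it to received and transmitted byte counts per interface. net_dev_bytes is a hypothetical name; the field positions assume the usual layout of two header lines followed by eight receive counters and then the transmit counters.

```shell
# Hypothetical helper: read /proc/net/dev-style text on stdin and
# print "<interface> rx_bytes=<n> tx_bytes=<n>" for each interface.
net_dev_bytes() {
    awk 'NR > 2 {
        sub(/:/, " ")   # the colon may touch the first counter, so split it off
        print $1 " rx_bytes=" $2 " tx_bytes=" $10
    }'
}

# Sample input; on a real host use: net_dev_bytes < /proc/net/dev
printf '%s\n' \
    'Inter-|   Receive                  |  Transmit' \
    ' face |bytes packets errs drop fifo frame compressed multicast|bytes packets' \
    '    lo: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0' \
    '  eth0:5000 50 0 0 0 0 0 0 7000 70 0 0 0 0 0 0' | net_dev_bytes
```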
7 - Gather Storage Info
Show storage related info.
We add || true after each grep so that the line still exits with code 0 when grep finds nothing. Attune checks the exit code of every line of script and stops the job if it sees a code other than the expected one (0 by default), and grep exits with code 1 when it selects no lines.
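The effect can be seen in a small experiment outside Attune. The sample input and the variable names status_plain and status_guarded are made up for illustration; every input line contains "loop", so grep -v selects nothing.

```shell
# Without the guard: grep selects no lines and exits with status 1
status_plain=0
printf 'loop0\nloop1\n' | grep -v loop || status_plain=$?

# With the guard: the line as a whole exits with status 0
status_guarded=0
{ printf 'loop0\nloop1\n' | grep -v loop || true; } || status_guarded=$?

echo "plain=$status_plain guarded=$status_guarded"
```

The trade-off is that || true also hides a failure of grep itself (exit code 2 or higher), which is acceptable here because the step only prints information.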
Login as user {Linux User} on node {Linux Node}
echo "========================================================================="
echo "Report file system disk space usage(df -aTh | grep -v loop)"
echo "========================================================================="
df -aTh | grep -v loop || true
echo
echo "========================================================================="
echo "List block devices(lsblk -al | grep -v loop)"
echo "========================================================================="
lsblk -al | grep -v loop || true
echo
echo "========================================================================="
echo "Print block device attributes(blkid | grep -v loop)"
echo "========================================================================="
# By default, normal users on Debian don't have /usr/sbin in $PATH
/usr/sbin/blkid | grep -v loop || true
echo
echo "========================================================================="
echo "List active mount points(mount | grep -v loop)"
echo "========================================================================="
mount | grep -v loop || true
echo
8 - Gather GPU Info
Show GPU related info.
A package may need to be installed, so Sudo To root is required.
Login as user {Linux User(sudo)} on node {Linux Node}
echo "========================================================================="
echo "Check status of NVidia GPU(nvidia-smi)"
echo "========================================================================="
# see if there's NVidia GPU installed
if lspci -vnn | grep VGA | grep -qi NVIDIA
then
# check if 'nvidia-smi' is installed
if ! type nvidia-smi >/dev/null 2>&1
then
# nvidia-smi not found, install it
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
ubuntu)
apt update
apt install -y nvidia-340
;;
debian)
# add 'non-free' archive area to sources.list
# if there is already 'non-free', then sources.list is unmodified
sed -i -e '/deb http/!b' -e '/non-free/b' -e 's/$/ non-free/' /etc/apt/sources.list
apt update
apt install -y nvidia-smi
;;
centos)
# consult https://docs.nvidia.com/datacenter/tesla/pdf/NVIDIA_Driver_Installation_Quickstart.pdf
# for installation documentation
dnf config-manager --set-enabled PowerTools
dnf install -y epel-release
dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
dnf clean all
dnf -y module install nvidia-driver:latest-dkms
;;
*)
echo "unsupported distro"
false # exit code 1 will let Attune suspend running the job
;;
esac
fi
if type nvidia-smi >/dev/null 2>&1
then
nvidia-smi || true
else
echo "nvidia-smi command install failed"
fi
else
# 'nvidia-smi' comes with the GPU driver, which is useless without a GPU
# (and huge, and possibly harmful to system stability),
# so we don't install the driver when no GPU is detected
echo "No NVidia GPU found."
fi
echo
echo "========================================================================="
echo "Show OpenCL platforms and devices(clinfo)"
echo "========================================================================="
# check if 'clinfo' is installed
if ! type clinfo >/dev/null 2>&1
then
# clinfo not found, install it
. {linuxDistroCheckingResultTempFile} # load distro checking result
case $DISTRO in
ubuntu | debian)
apt update
apt install -y clinfo
;;
centos)
# no official package for CentOS8
# install with a RHEL7 rpm as a workaround
dnf install -y ocl-icd # prerequisite
wget https://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm
rpm -ihv clinfo-2.1.17.02.09-1.el7.x86_64.rpm
rm -f clinfo-2.1.17.02.09-1.el7.x86_64.rpm
;;
*)
echo "unsupported distro"
false # exit code 1 will let Attune suspend running the job
;;
esac
fi
if type clinfo >/dev/null 2>&1
then
clinfo || true
else
echo "clinfo command install failed"
fi
echo
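The GPU detection at the top of this step is a two-stage grep over the lspci output. As a sketch, it can be isolated into a function and tried against sample text. has_nvidia_vga is a hypothetical name, and the sample lines below are illustrative rather than captured from a real host.

```shell
# Hypothetical helper: read "lspci -vnn"-style text on stdin and succeed
# only if a VGA line mentions NVIDIA (case-insensitive), as the step does.
has_nvidia_vga() {
    grep VGA | grep -qi NVIDIA
}

# Illustrative sample lines
nvidia_line='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation Device [10de:1c82]'
intel_line='00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3e92]'

if printf '%s\n' "$nvidia_line" | has_nvidia_vga
then
    echo "NVidia GPU found."
fi

if ! printf '%s\n' "$intel_line" | has_nvidia_vga
then
    echo "No NVidia GPU found."
fi
```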
9 - Gather Running Processes and Resource Usage
Show running processes and resource usage of the system.
Login as user {Linux User} on node {Linux Node}
echo "========================================================================="
echo "Display running processes, plus memory and CPU usage info(top -b -n 1)"
echo "========================================================================="
top -b -n 1
echo
echo "========================================================================="
echo "Report a snapshot of the current processes(ps -e)"
echo "========================================================================="
ps -e
echo
Using Attune to dump vital health status to Attune's running log on popular Linux distributions
In this blueprint, we use common commands to check the health status of the system and print their output to stdout, which is directed to Attune's running log. The log can be viewed in real time while a job is executing, or afterwards from the Jobs -> history interface. The main purpose of this blueprint is to let other blueprints inspect its results and create health reports etc. accordingly, so it serves as the first stage (data generation / gathering) of a data processing pipeline.
Users can also learn from this blueprint the commands used to check the health status of Linux.
This has been tested on Ubuntu 20.04.2 LTS / Debian 11.0.0 / CentOS 8
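As a sketch of what a downstream consumer might do with the captured log, the function below lists the title of every banner-delimited section. It assumes the log has been saved as plain text with the banners exactly as the steps print them, and that no command output consists solely of "=" characters. list_sections is a hypothetical name.

```shell
# Hypothetical helper: read a captured log on stdin and print the first
# title line found between each pair of "=====" banner lines.
list_sections() {
    awk '
        /^=+$/ { in_banner = !in_banner; first = 1; next }
        in_banner && first { print; first = 0 }
    '
}

# Sample log fragment in the layout the steps produce
printf '%s\n' \
    '=========' \
    'Print system information(uname -a)' \
    '=========' \
    'Linux host 5.4.0' \
    '' \
    '=========' \
    'Show who is logged on and what they are doing' \
    '(w)' \
    '=========' \
    ' 11:41:08 up 1 day' | list_sections
```

A health report generator could use the same banner detection to split the log into sections and then parse each section with a command-specific rule.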
Pre-Blueprint Attune setup
A credential with Sudo To root set, used to connect to the host whose health status you wish to check. This is required for some health check commands to run successfully, and for installing packages when a command is not found.