A CLI tool for users to report and understand suspected issues with their HPC environment.
FOR LEGACY REASONS THIS TOOL CURRENTLY MUST SUPPORT RUBY 2.6+
This is a application meant to be integrated with the Flight Environment. The scripts in this directory correspond to where they should go in relation to $flight_ROOT (e.g. libexec/commands/report should be placed at /opt/flight/libexec/commands/report in a default flight environment setup).
An example of putting the files in place and installing dependencies (as root):
# Clone repository
git clone https://github.com/openflighthpc/flight-report /tmp/flight-report
# Copy files into place
cp -ru /tmp/flight-report/opt $flight_ROOT/
cp -ru /tmp/flight-report/libexec $flight_ROOT/
# Optional: Configure CLI tool (otherwise will use default settings)
## Open and edit the file to change defaults
cp $flight_ROOT/opt/report/etc/config.yaml.example $flight_ROOT/opt/report/etc/config.yaml
# Ensure log directories are writeable by all users
chmod 777 $flight_ROOT/opt/report/var/{accesslogs,check-results,ratings,reports}
# Install ruby dependencies
cd $flight_ROOT/opt/report/
$flight_ROOT/bin/bundle config set --local path vendor
$flight_ROOT/bin/bundle install
# Remove git repo
rm -rf /tmp/flight-report
Note: These example alias the report script to be accessed via the keyword assist
The Flight Environment can include messages in the banner program (when enabled), to add a banner message with a call-to-action for flight assist, create the file $flight_ROOT/etc/banner/banner.d/30-assist.sh containing the following:
(
bold="$(tput bold)"
clr="$(tput sgr0)"
if [[ $TERM =~ "256color" ]]; then
bgblue="$(tput setab 68)"
fi
echo -e "HOW HAS YOUR HPC EXPERIENCE BEEN TODAY? LET US KNOW!"
printf " ${bold}${bgblue}flight assist${clr} to get in touch\n"
echo
)The Flight Environment can have a "Flight Tips" section which is displayed on login to the system. To add flight assist to the tips list create file $flight_ROOT/etc/banner/tips.d/50-assist.rc containing the following:
flight_TIP_command="flight assist"
flight_TIP_synopsis="Let us know how your experience with the HPC environment is"
flight_TIP_root="false"Note: These example alias the summary script to be accessed via the keyword status
The Flight Environment can include messages in the banner program (when enabled), to add a banner message with a call-to-action for flight status, create the file $flight_ROOT/etc/banner/banner.d/25-status.sh containing the following:
(
bold="$(tput bold)"
clr="$(tput sgr0)"
if [[ $TERM =~ "256color" ]]; then
bgblue="$(tput setab 68)"
fi
echo -e "SEE THE STATUS OF YOUR HPC ENVIRONMENT!"
printf " ${bold}${bgblue}flight status${clr} for more information\n"
echo
)The Flight Environment can have a "Flight Tips" section which is displayed on login to the system. To add flight assist to the tips list create file $flight_ROOT/etc/banner/tips.d/45-status.rc containing the following:
flight_TIP_command="flight status"
flight_TIP_synopsis="See the current status of the HPC environment (including any known issues)"
flight_TIP_root="false"To update the CLI tool simply clone the repository and copy the files into place again, overwriting the existing ones.
Can only overwrite files that have changed with:
cp -r /tmp/flight-report/opt $flight_ROOT/
cp -r /tmp/flight-report/libexec $flight_ROOT/opt/report/etc/issues/: Location of content for this command, this is where issues and specific diagnostics are definedopt/report/etc/issues/general/*.bash: These diagnostics are run for every issue (in alphanumeric order)opt/report/etc/issues/general/metrics.yaml: These metrics are checked for every issue
Checks are scripts used to check & identify issues within a HPC environment. These Checks can then be used to create Statuses.
Each check script must:
- Be created in
etc/checks/and end with.sh - Contain a line starting
# Description:followed by a brief description of what the script does - Be a BASH script
For a user to have access to encrypted checks and have their checks automatically saved to a report file they will need to be in the privileged users list in the config file.
A check can ask some questions to a user too which will be exported as environment variables to the shell. A questions file must:
- Be in the same directory as the script
- Be named the same but end with
.sh.questions
The formatting of this file is the same as the questions.yaml file for Issues.
Encrypted checks are useful for allowing only certain users entrusted with a decryption password to be able to execute the tests (without exposing the content of the scripts to unauthorised users).
Each encrypted check must:
- Be created in
etc/checks/and end with.sh.gpg - Must be decrypted by the same password (across an installation of this tool)
This has been tested by creating an encrypted script as follows:
# Create source script
echo -e "# Description: Run an encrypted check\necho 'Encrypted script test'" > /tmp/encrypted_example.sh.source
# Encrypt with password (will prompt for input)
gpg -c --no-symkey-cache --armour -o etc/checks/encrypted_example.sh.gpg /tmp/encrypted_example.sh.sourceA configurable "site" directory can be specified in config.yaml which provides an extra source of check scripts.
These scripts also have to stick to the "musts" specified at the beginning of this section (except the location must be in the location specified by sitechecksdir in config.yaml)
Statuses provide information about the system to a user describing the state of the system.
To add a status:
- Create a file named after the status in
opt/report/etc/statuses, for example,network.yaml:type: warning message: "Network: We are aware of a network performance degradation which we are investigating"
- Type determines the emoji that's displayed next to the message, it can be:
working: ✅warning:⚠️ - Anything else: ℹ️
- Type determines the emoji that's displayed next to the message, it can be:
The tool presents Issues to a user. These Issues can have related Diagnostics and Metrics. This allows for different data to be collected in different circumstances.
To add an issue:
- Create a directory for the issue in
opt/report/etc/issues/ - Create file
metadata.yamltitle: 'Short Issue Name' desc: 'Brief description of the issue'
- Create file
questions.yamlif any additional questions/data are wanted- question: 'Do you want to answer a question?' type: text var: MY_ENV_VAR # exported to environment where diagnostic script is run - question: 'Did you like answering that question?' type: boolean var: MY_OTHER_ENV_VAR - question: 'How many questions would you ideally like to answer?' type: number var: MY_NUMBER - question: 'How would you rate your answering experience?' type: list options: - good - bad - ugly var: MY_FINAL_VAR
- Create file
diagnostics.bashfor any additional diagnostics to be run- The answers to questions for this issue will be available to the script by their corresponding
var:value
- The answers to questions for this issue will be available to the script by their corresponding
- Create file
metrics.yamlto execute commands that are compared to values to determine if metric is okay or not- id: output-less-than name: "Output Less Than" cmd: "uptime |sed 's/.*load averages: //g;s/ .*//g'" value: 1.0 type: 'lt' - id: output-greater-than name: "Output Greater Than" cmd: "uptime |sed 's/.*load averages: //g;s/ .*//g'" value: 1.0 type: 'gt' - id: output-equals name: "Output Equals" cmd: "uptime |sed 's/.*load averages: //g;s/ .*//g'" value: 1.0 type: 'equals' - id: string-check name: "Comparing Strings" cmd: "grep '^VERSION=' /etc/os-release" value: 'VERSION="9.3 (Blue Onyx)"' type: 'matches'
- The answers to questions for this issue will be available to the commands by their corresponding
var:value - Notes on restrictions:
gt,ltandequalsconvert the output to a floating point number to compare withvaluematchescompares the output to a string- Multiline values will almost definitely not work
- The answers to questions for this issue will be available to the commands by their corresponding
To see if anyone accesses the command, when they initially run bin/report a line will be added to a file named after the user under var/accesslogs with the time they launched the command.
The first thing the command asks for is a rating, this is from 0-2 but is represented in smilies. This rating is output to a file under var/ratings.
This tool generates a report which can be used to track, compare and understand issues within the environment. A report is a YAML file structured as follows:
meta:
issue_id: ISSUE_ID
user: USERNAAME
reported_at: DATE_AND_TIME_COMMAND_RUN
issue_at: DATE_AND_TIME_OF_ISSUE
diagnostics:
general:
SCRIPT_NAME:
script_md5: MD5SUM_OF_SCRIPT # to ensure that the outputs can be compared because script structure hasn't changed
output: |
Multiline output
from the script
issue:
script_md5: MD5SUM_OF_SCRIPT # to ensure that the outputs can be compared because script structure hasn't changed
answers:
VAR: ANSWER
VAR: ANSWER
VAR: ANSWER
output: |
Multiline output
from the script
metrics:
METRIC_ID: # Entire metric metadata including name, command, output, comparison type, success true/false
METRIC_ID: # Entire metric metadata including name, command, output, comparison type, success true/false
METRIC_ID: # Entire metric metadata including name, command, output, comparison type, success true/falseThe CLI is interactive so the user simply needs to run flight report and answer the questions in order to complete a report.
- User report management
- See what issues they've reported, how often, etc
- Let them view whatever previous reports of theirs they want (and compare 2 for an issue type)
- Safely handle report saving such that users cannot delete existing reports
- Separate the reporting CLI and the diagnostic execution+saving (pinging to an API server)
- Input validation to ensure text fields can't be used for injection attacks
- Alces & Site Admin tools
- Possibilities for collating and contrasting reports
- Admin command that can summarise what has been reported in past hour / 24 hours / week
- Support 'General' questions for setting env vars
- Other
- Report historic issue (e.g. not experiencing right now)
- Allow questions with assisted answers (e.g. using commands to populate selection)
- Advisory actions for users based on the problems they're experiencing ("sometimes high disk usage can impact performance, especially with network arrays, if the usage is high due to many workloads running try transferring your workload to scratch before executing it")
- AI chatbot??