HPCC – HPC Cluster

Manuals: Introduction

Introduction

This manual provides an introduction to the usage of the HPCC cluster. All servers and compute resources of the HPCC cluster are available to researchers from all departments and colleges at UC Riverside for a minimal recharge fee (see rates). To request an account, please email support@hpcc.ucr.edu. The latest hardware/facility description for grant applications is available here.

Overview

Storage

Four enterprise class HPC storage systems
Approximately 6 PB of total network storage (3,072 TB production and 3,072 TB backup)
GPFS (NFS and SAMBA via GPFS)
Automatic snapshots and archival backups

Network

Ethernet
- 1 Gb/s switch x 5
- 1 Gb/s switch with 10 Gb/s uplink
- 10 Gb/s switch for Campus wide Science DMZ
- redundant, load balanced, robust mesh topology
Interconnect
- 56 Gb/s InfiniBand (FDR)

Head Nodes

All users should access the cluster via ssh through cluster.hpcc.ucr.edu. This address automatically balances traffic across the available head nodes.

Jay
- Resources: 64 cores, 512 GB memory
- Primary function: submitting jobs to the queuing system
- Secondary function: development; code editing and running small (under 50 % CPU and under 1 GB RAM) sample jobs
Lark
- Resources: 64 cores, 512 GB memory
- Primary function: submitting jobs to the queuing system
- Secondary function: development; code editing and running small (under 50 % CPU and under 1 GB RAM) sample jobs

Worker Nodes

Batch
- c01-c48: each with 64 AMD cores and 512 GB memory
Intel
- i01-i40: each with 32 Intel Broadwell cores and 512 GB memory
Epyc
- r21-r38: each with 64 AMD EPYC cores and 1 TB memory
Highmem
- h01-h06: each with 32 Intel cores and 1024 GB memory
GPU
- gpu01-gpu02: each with 32 (HT) cores Intel Haswell CPUs and 2 x NVIDIA Kepler K80 GPUs (12GB and 2496 CUDA cores per GPU) and 128 GB memory
- gpu03-gpu04: each with 48 (HT) cores Intel Broadwell CPUs and 4 x NVIDIA Kepler K80 GPUs (12GB and 2496 CUDA cores per GPU) and 512 GB memory
- gpu05: 64 (HT) cores Intel Broadwell CPUs and 2 x NVIDIA Pascal P100 GPUs (16GB and 3584 CUDA cores per GPU) and 256 GB memory
- gpu06-gpu08: with 64-128 (HT) cores AMD CPUs and 8 x NVIDIA A100 GPUs (80GB and 6912 CUDA cores per GPU) and 1,024 GB memory

Manuals: Getting Started

The initial login brings users into one of the cluster head nodes (i.e. jay, lark). From there, users can submit jobs via srun/sbatch to the compute nodes to perform intensive tasks. Since all machines mount a centralized file system, users will always see the same home directory on all systems. Therefore, there is no need to copy files from one machine to another.

Open the terminal and type

ssh -X username@cluster.hpcc.ucr.edu

Please refer to the login instructions of our Linux Basics manual.

Change Password

Once you have logged in type the following command:

passwd

Enter the old password (the random characters that you were given as your initial password)
Enter your new password

The password minimum requirements are:

Total length at least 8 characters long
Must have at least 3 of the following:
- Lowercase character
- Uppercase character
- Number
- Punctuation character
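
A candidate password can be checked against the requirements above before running passwd. The following is a minimal sketch, not the cluster's actual validation logic: the function name check_password is hypothetical, and it assumes the four character classes map onto the POSIX bracket classes shown.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 if the candidate meets the stated minimums
# (at least 8 characters and at least 3 of the 4 character classes).
check_password() {
    local pw=$1 classes=0
    # Reject anything shorter than 8 characters.
    [[ ${#pw} -ge 8 ]] || return 1
    # Count the character classes that appear at least once.
    if [[ $pw =~ [[:lower:]] ]]; then classes=$((classes + 1)); fi
    if [[ $pw =~ [[:upper:]] ]]; then classes=$((classes + 1)); fi
    if [[ $pw =~ [[:digit:]] ]]; then classes=$((classes + 1)); fi
    if [[ $pw =~ [[:punct:]] ]]; then classes=$((classes + 1)); fi
    [[ $classes -ge 3 ]]
}
```

For example, check_password 'Abcdef12' succeeds (lowercase, uppercase, and number), while check_password 'abcdefgh' fails because only one class is present.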

Modules

All software used on the HPC cluster is managed through a simple module system. You must explicitly load and unload each package as needed. More advanced users may want to load modules within their bashrc, bash_profile, or profile files.

Available Modules

To list all available software modules, execute the following:

module avail

This should output something like:

------------------------ /opt/linux/rocky/8.x/x86_64/modules -------------------------
AAFTF/0.5.0 workspace/scratch <aL>
abyss/2.3.4 wtdbg2/2.5
almabte/1.3.2 xpdf/4.03
alphafold/2.3.0 xsv/0.13.0
amber/22_mpi_cuda yq/4.35.1
amptk/1.6 zoem/21-341
...

Using Modules

To load a module, run:

module load <software name>[/<version>]

For example, to load R version 4.1.2, run:

module load R/4.1.2

To load the default version of the tophat module, run:

module load tophat

Show Loaded Modules

To show what modules you have loaded at any time, you can run:

module list

Depending on what modules you have loaded, it will produce something like this:

Currently Loaded Modulefiles:
1) vim/7.4.1952 3) slurm/16.05.4 5) R/3.3.0 7) less-highlight/1.0 9) python/3.6.0
2) tmux/2.2 4) openmpi/2.0.1-slurm-16.05.4 6) perl/5.20.2 8) iigb_utilities/1

Unloading Software

Sometimes you no longer want a piece of software in your path. To remove it, unload the module by running:

module unload <software name>

Databases

Loading Databases

NCBI, PFAM, and Uniprot do not need to be downloaded by users. They are installed as modules on the cluster.

module load db-ncbi
module load db-pfam
module load db-uniprot

Specific database release numbers can be identified by the version label on the module:

module avail db-ncbi
----------------- /usr/local/Modules/3.2.9/modulefiles -----------------
db-ncbi/20140623(default)

Using Databases

In order to use the loaded database users can simply provide the corresponding environment variable (NCBI_DB, UNIPROT_DB, PFAM_DB, etc…) for the proper path in their executables.

This is the old, deprecated BLAST, which may stop working in the near future. However, if you require it:

blastall -p blastp -i proteins.fasta -d $NCBI_DB/nr -o blastp.out

You can also use this method if you require the old version of BLAST (old BLAST with legacy support):

BLASTBIN=`which legacy_blast.pl | xargs dirname`
legacy_blast.pl blastall -p blastp -i proteins.fasta -d $NCBI_DB/nr -o blast.out --path $BLASTBIN

This is the preferred/recommended method (BLAST+):

blastp -query proteins.fasta -db $NCBI_DB/nr -out proteins_blastp.txt

Usually, we store the most recent release and 2-3 previous releases of each database. This way time consuming projects can use the same database version throughout their lifetime without always updating to the latest releases.

Additional Features

There are additional features and operations that can be done with the module command. Please run the following to get more information:

module help

Quotas

CPU and Memory

Please refer to our Queue Policies page for details regarding CPU and Memory limits.

Data Storage

A standard user account has a storage quota of 20GB. Much more storage space, in the range of many TBs, can be made available in a user account’s bigdata directory. The amount of storage space available in bigdata depends on a user group’s annual subscription. The pricing for extending the storage space in the bigdata directory is available here.

What’s Next?

You should now know the following:

Basic organization of the cluster
How to log in to the cluster
How to use the Module system to gain access to the cluster software
CPU, storage, and memory limitations (quotas and hardware limits)

Now you can start using the cluster.

The HPCC cluster uses the Slurm queuing system, so the recommended way to run your jobs (scripts, pipelines, experiments, etc…) is to submit them to this queuing system by using sbatch. Please DO NOT RUN ANY computationally intensive tasks on any head node (i.e. jay, lark). If this policy is violated, your process will either run very slowly or be killed automatically. The head nodes (login nodes) are a shared resource and should be accessible by all users. Negatively impacting performance would affect all users on the system and will not be tolerated.

Manuals: Managing Jobs

What is a Job?

Submitting and managing jobs is at the heart of using the cluster. A ‘job’ refers to the script, pipeline or experiment that you run on the nodes in the cluster.

Partitions

Jobs are submitted to so-called partitions (or queues). Each partition is a group of nodes, often with similar hardware specifications (e.g. CPU or RAM configurations). The quota policies applying to each partition are outlined on the Queue Policies page. For more detailed hardware info, see the Hardware Details page.

epyc
- Nodes: r21-r38
- CPU: AMD
- Supported Extensions¹: AVX, AVX2, SSE, SSE2, SSE4
- RAM: 1 GB default
- Time (walltime): 168 hours (7 days) default
intel
- Default partition
- Nodes: i01-i02, i17-i40
- CPU: Intel
- Supported Extensions¹: AVX, AVX2, SSE, SSE2, SSE4
- RAM: 1 GB default
- Time (walltime): 168 hours (7 days) default
batch
- Nodes: c01-c48
- CPU: AMD
- Supported Extensions¹: AVX, SSE, SSE2, SSE4
- RAM: 1 GB default
- Time (walltime): 168 hours (7 days) default
highmem
- Nodes: h01-h06
- CPU: Intel
- Supported Extensions¹: AVX, SSE, SSE2, SSE4
- RAM: 100 GB to 1000 GB
- Time (walltime): 48 hours (2 days) default
highclock
- Nodes: hz01-hz04
- CPU: Intel
- Supported Extensions¹: AVX, SSE, SSE2, SSE4
- RAM: 1 GB default
- Time (walltime): 168 hours (7 days) default
gpu
- Nodes: gpu01-gpu08
- CPU: AMD/Intel
- GPUs: NVIDIA k80, p100, a100, h100
- RAM: 1 GB default
- Time (walltime): 48 hours (2 days) default
short
- Nodes: Mixed set of nodes from batch, intel, and group partitions
- Cores: AMD/Intel
- RAM: 1 GB default
- Time (walltime): 2 hours Maximum
short_gpu
- Nodes: gpu01-gpu10
- CPU: AMD/Intel
- GPUs: NVIDIA k80, p100, a100, h100, ada6000
- RAM: 1 GB default
- Time (walltime): 2 hours Maximum
Lab Partitions
- If your lab has purchased nodes then you will have a priority partition with the same name as your group (e.g. girkelab).

In order to submit a job to a different partition, add the optional ‘-p’ parameter with the name of the partition you want to use:

sbatch -p batch SBATCH_SCRIPT.sh
sbatch -p highmem SBATCH_SCRIPT.sh
sbatch -p epyc SBATCH_SCRIPT.sh
sbatch -p gpu SBATCH_SCRIPT.sh
sbatch -p intel SBATCH_SCRIPT.sh
sbatch -p highclock SBATCH_SCRIPT.sh
sbatch -p mygroup SBATCH_SCRIPT.sh

Slurm

Slurm is used as a queuing system across all head nodes. SSH directly into the cluster and your connection will be automatically load balanced to a head node:

ssh -XY cluster.hpcc.ucr.edu

Resources and Limits

To see your limits you can do the following:

slurm_limits

Check the total number of cores used by your group across all partitions:

group_cpus

However, this does not tell you when your job will start, since that depends on the duration of each job. The best way to estimate the start time is with the “--start” flag of the squeue command:

squeue --start -u $USER

Submitting Jobs

There are 2 basic ways to submit jobs: non-interactive and interactive. Slurm automatically starts the job in the directory from which you submitted it, so keep that in mind when you use relative file paths.

Non-interactive Submission

Non-interactive jobs are submitted as SBATCH scripts. An example is as follows:

sbatch SBATCH_SCRIPT.sh

Here is an example of an SBATCH script:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=10G
#SBATCH --time=1-00:15:00 # 1 day and 15 minutes
#SBATCH --mail-user=useremail@address.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="just_a_test"
#SBATCH -p epyc # You can use any of the following; epyc, intel, batch, highmem, gpu
# Print current date
date
# Load samtools
module load samtools
# Concatenate BAMs
samtools cat -h header.sam -o out.bam in1.bam in2.bam
# Print name of node
hostname

The above job will request 1 node, 10 cores (parallel threads), 10GB of memory, for 1 day and 15 minutes. An email will be sent to the user when the status of the job changes (Start, Failed, Completed). For more information regarding parallel/multi core jobs refer to Parallelization.
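
To compare a requested --time value against a partition's time limit, it can help to convert it to seconds. The following is a minimal sketch, assuming only the [D-]HH:MM:SS form used in the script above. The helper name time_to_seconds is hypothetical, and other forms that Slurm accepts (such as MM:SS) are not handled.

```shell
#!/bin/bash
# Hypothetical helper: convert a Slurm time spec of the form [D-]HH:MM:SS
# into seconds. Other forms accepted by Slurm (e.g. MM:SS) are not handled.
time_to_seconds() {
    local spec=$1 days=0 h m s
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}   # part before the dash (days)
        spec=${spec#*-}    # part after the dash (HH:MM:SS)
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

time_to_seconds 1-00:15:00   # prints 87300 (1 day and 15 minutes)
```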

Interactive Submission

Interactive jobs are submitted using srun. An example is as follows:

srun --pty bash -l

If you do not specify a partition then the “epyc” partition is used by default.

Here is a more complete example:

srun --mem=1gb --cpus-per-task 1 --ntasks 1 --time 10:00:00 --x11 --pty bash -l

The above example enables X11 forwarding and requests 1GB of memory and 1 core for 10 hours within an interactive session.

Feature Constraints

Using the --constraint (or -C) flag allows you to fine-tune what type of machine your job can run on, which is mainly useful on the “short” partitions. Our Node List contains all of the different nodes that we have (both public and private) as well as any feature constraints they have.

Node List

For more info on hardware details, see our Hardware Details page.

Constraint Examples

Since jobs on the “short” partition can run on any node, jobs can be narrowed down using constraints.

If you require an Intel node of any generation:

srun -p short -t 2:00:00 -c 8 --mem 8GB --constraint intel --pty bash -l

If you require an AMD node, but want it to be Rome or Milan generation (i.e. not Abu Dhabi):

srun -p short -t 2:00:00 -c 8 --mem 8GB --constraint "amd&(rome|milan)" --pty bash -l

If you want to run on a modern GPU machine, requesting 1 GPU:

srun -p short_gpu -t 2:00:00 -c 8 --mem 8GB --gpus=1 --constraint "gpu_latest" --pty bash -l

When using constraints with GPUs, make sure to request a generic GPU (a GPU count without a type), as in the example above.

Monitoring Jobs

To check on the states of your jobs, run the following:

squeue -u $USER --start

To list all the details of a specific job (the JOBID can be found using squeue), run the following:

scontrol show job JOBID

To view past jobs and their details, run the following:

sacct -u $USER -l

You can also restrict the time window with the start (-S) and/or end (-E) time, using the YYYY-MM-DD format. For example, the following command uses start and end times:

sacct -u $USER -S 2018-01-01 -E 2018-08-30 -l | less -S # Type 'q' to quit

Custom command for summarizing the activity of all users on the cluster:

jobMonitor # or qstatMonitor

Canceling Jobs

To cancel/stop your job, run the following:

scancel JOBID

You can also cancel multiple jobs:

scancel JOBID1 JOBID2 JOBID3

If you want to cancel/stop/kill ALL your jobs it is possible with the following:

# Be very careful when running this, it will kill all your jobs.
squeue --user $USER --noheader --format '%i' | xargs scancel

For more information please refer to Slurm scancel documentation.

Optimizing Jobs

After a job has been completed, you can use seff ## ("##" being your Slurm Job ID) to check how many resources your job consumed during its run. seff is only useful after a job has completed, and will not give useful information on currently-running jobs.

For example:

$ seff 123123
Job ID: 123123
Cluster: hpcc
User/Group: your_username/yourlab
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 20
CPU Utilized: 03:26:14
CPU Efficiency: 95.04% of 03:37:00 core-walltime
Job Wall-clock time: 00:10:51
Memory Utilized: 81.20 GB
Memory Efficiency: 81.20% of 100.00 GB

In the above example, we can see good utilization of the CPU cores (95%) as well as good utilization of memory usage (81%).

If CPU Efficiency is low, make sure that the program(s) you are running makes use of multi-threading correctly. Requesting more cores for a job will not make your program run faster if it does not properly take advantage of them.

If Memory Efficiency is low, then you can try reducing the requested memory for a job. Note: Just because you see your job uses 81.20GB of memory does not mean that next time you should request exactly 81.20GB of memory. Variations in input data will cause different memory usage characteristics. You should aim to request about 20% more memory than will actually be used, to account for any spikes in memory usage. Slurm might miss some quick spikes of memory usage, but the Operating System will not. In this regard it’s better to overestimate on initial runs, and scale back once you find a good limit.
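
The 20% headroom rule can be turned into a quick calculation. This is a minimal sketch, assuming the input is the "Memory Utilized" figure from seff expressed in GB. The helper name mem_request is hypothetical, and rounding up to a whole GB is a choice made here rather than a cluster requirement.

```shell
#!/bin/bash
# Hypothetical helper: add 20% headroom to the "Memory Utilized" value
# reported by seff (in GB) and round up to a whole number of GB.
mem_request() {
    awk -v used="$1" 'BEGIN {
        want = used * 1.2
        gb = int(want)
        if (gb < want) gb++      # round up to the next whole GB
        print gb "G"
    }'
}

mem_request 81.20   # prints 98G
```

The printed value can be passed to the --mem option on the next run, for example --mem=98G.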

Slurm Job Reason/Error Codes

If a job is stuck in the queue or fails to start, there are typically Slurm error codes assigned that explain the reason. Typically these are a bit hard to parse, so below is a table of common error codes and how to work around them.

Error Code	Reason	Fix
Resources	This isn’t an error, but rather why your job can’t start immediately.	Once requested resources are available, then your job will start.
Priority	A job with a higher priority than yours is pending and needs to run first.	You have likely submitted many jobs in a short period of time and Slurm’s Fair-Share algorithm is allowing other higher priority jobs to run first.
QOSMaxWallDurationPerJobLimit	The time limit requested on the selected partition goes over the limits. For example, requesting 3 days on the “short” partition.	Make sure that you are within the partition’s time limit. Please refer to the Queue Policies page for the per-partition time limits.
AssocGrpCpuLimit	You are exceeding the Per-User CPU limit on a specific partition.	You must wait until jobs finish within a partition to free up resources to allow additional jobs to run.
AssocGrpMemLimit	You are exceeding the Per-User Memory limit on a specific partition.	You must wait until jobs finish within a partition to free up resources to allow additional jobs to run.
AssocGrpGRES	You are exceeding the Per-User GRES (GPU) limit	You must wait until your GPU jobs finish to free up resources to allow additional jobs to run.
MaxSubmitJobLimit	You are trying to submit more than 5000 jobs. There is a 5000 job limit per-user for queued and running jobs.	Wait until some of your jobs finish, then you can continue submitting jobs.
ReqNodeNotAvail, Reserved for maintenance	The time limit of your job would cause it to overlap with an upcoming maintenance.	You can either reduce your job’s runtime or wait for the maintenance to complete.
PartitionConfig	The job has been queued to the wrong partition under the wrong account.	Some partitions require that you queue under a specific account. eg. preempt jobs need to use the preempt account (`-A preempt`)
QOSMinGRES	The job has not requested the minimum resources required for the specified partition.	Some partitions require that you request a minimum number of resources. For example, for “highmem” you must request >= 100GB, and for “gpu” you must request a GPU using the `--gres` flag.

This is only a small number of the most common reasons. For a full list please see Slurm’s Job Reason Codes page. If you are confused as to why you’re getting a specific reason, please reach out to support.

Advanced Jobs

There is a third way of submitting jobs by using steps. Single Step submission:

srun <command>

Under a single step job your command will hang until appropriate resources are found and when the step command is finished the results will be sent back on STDOUT. This may take some time depending on the job load of the cluster. Multi Step submission:

salloc -N 4 bash -l
srun <command>
...
srun <command>
exit

Under a multi step job the salloc command will request resources and then your parent shell will be running on the head node. This means that all commands will be executed on the head node unless preceded by the srun command. You will also need to exit this shell in order to terminate your job.

Array Jobs

If a large batch of fairly similar jobs needs to be submitted, an Array Job might be a good option. For an array job, include the --array parameter in your sbatch script, similar to the following:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 # This will be the number of CPUs per individual array job
#SBATCH --mem=1G # This will be the memory per individual array job
#SBATCH --time=0-00:15:00 # 15 minutes
#SBATCH --array=1-2500
#SBATCH --job-name="just_a_test"
echo "I have array ID ${SLURM_ARRAY_TASK_ID}"

Within each job, the SLURM_ARRAY_TASK_ID environment variable is set and can be used to slightly change how each job is run.
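
One common way to use SLURM_ARRAY_TASK_ID is to pick one line from a list of inputs per array task. The following is a minimal sketch: the helper name pick_input and the idea of a list file with one input path per line are assumptions for illustration, not part of the cluster setup.

```shell
#!/bin/bash
# Hypothetical helper: print the line of a list file that corresponds to an
# array task ID. The ID defaults to SLURM_ARRAY_TASK_ID, and to 1 when that
# variable is not set (e.g. when testing outside of a job).
pick_input() {
    local list=$1
    local id=${2:-${SLURM_ARRAY_TASK_ID:-1}}
    # sed -n "Np" prints only line N of the file.
    sed -n "${id}p" "$list"
}
```

Inside an array job script, something like input=$(pick_input inputs.txt) would then give each task its own input, where inputs.txt is a placeholder file name.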

Note that there is a 2500 job limit for array jobs.

More information can be found on the Slurm Documentation including other Environment Variables that are set per-job.

Highmem Jobs

The highmem partition does not have a default amount of memory set; however, it does have a minimum limit of 100GB per job. This means that you need to explicitly request at least 100GB of memory.

Non-Interactive:

sbatch -p highmem --mem=100g --time=24:00:00 SBATCH_SCRIPT.sh

Interactive

srun -p highmem --mem=100g --time=24:00:00 --pty bash -l

Of course you should adjust the time argument according to your job requirements.

GPU Jobs

GPU nodes have multiple GPUs, and vary in type (K80, P100, A100, or H100). This means you need to specify how many GPUs, and optionally of what type, you would like to use.

To request a gpu of any type, only indicate how many GPUs you would like to use.

Non-Interactive:

sbatch -p gpu --gres=gpu:1 --mem=100g --time=1:00:00 SBATCH_SCRIPT.sh

Interactive

srun -p gpu --gres=gpu:4 --mem=100g --time=1:00:00 --pty bash -l

Since the HPCC Cluster has many different types of GPUs installed (eg. K80, P100, A100, H100), GPUs can be requested explicitly by type. More info on what GPUs are available can be found in the Worker Node section of our Hardware Details page.

Non-Interactive:

sbatch -p gpu --gres=gpu:k80:1 --mem=100g --time=1:00:00 SBATCH_SCRIPT.sh
sbatch -p gpu --gres=gpu:p100:1 --mem=100g --time=1:00:00 SBATCH_SCRIPT.sh
sbatch -p gpu --gres=gpu:a100:1 --mem=100g --time=1:00:00 SBATCH_SCRIPT.sh

Interactive

srun -p gpu --gres=gpu:k80:1 --mem=100g --time=1:00:00 --pty bash -l
srun -p gpu --gres=gpu:p100:1 --mem=100g --time=1:00:00 --pty bash -l
srun -p gpu --gres=gpu:a100:1 --mem=100g --time=1:00:00 --pty bash -l

Of course you should adjust the time argument according to your job requirements.

Once your job starts your code must reference the environment variable “CUDA_VISIBLE_DEVICES” which will indicate which GPUs have been assigned to your job. Most CUDA enabled software, like MegaHIT, will check this environment variable and automatically limit accordingly.

For example, after reserving 4 GPUs for a NAMD2 job:

echo $CUDA_VISIBLE_DEVICES
0,1,2,3
namd2 +idlepoll +devices $CUDA_VISIBLE_DEVICES MD1.namd
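
A job script can also derive the number of assigned GPUs from this variable, for instance to pass it to software that expects a GPU count. This is a minimal sketch, assuming the variable holds a comma-separated list as in the output above. The helper name count_gpus is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count the entries in a comma-separated device list.
# With no argument, the list is read from CUDA_VISIBLE_DEVICES.
count_gpus() {
    local devices=${1-${CUDA_VISIBLE_DEVICES-}}
    local -a ids=()
    if [[ -n $devices ]]; then
        # Split the list on commas into the ids array.
        IFS=, read -ra ids <<< "$devices"
    fi
    echo "${#ids[@]}"
}

count_gpus "0,1,2,3"   # prints 4
```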

Each group is limited to a maximum of 8 GPUs on the gpu partition. Please be respectful of others and keep in mind that the GPU nodes are a limited shared resource. Since the CUDA libraries will only run with GPU hardware, development and compiling of code must be done within a job session on a GPU node.

Here are a few more examples of jobs that utilize more complex features (e.g. array, dependency, MPI etc): Slurm Examples

Web Browser Access

Ports

Some jobs require web browser access in order to utilize the software effectively. These kinds of jobs typically use (bind) ports in order to provide a graphical user interface (GUI) through a web browser. Users are able to run jobs that use (bind) ports on a compute node. Any port can be used on any compute node, as long as the port number is greater than 1000 and it is not already in use (bound).

Tunneling

Once a job is running on a compute node and bound to a port, you may access this compute node via a web browser. This is accomplished by using 2 chained SSH tunnels to route traffic through our firewall. This acts much like 2 runners in a relay race, handing the baton to the next runner, to get past a security checkpoint.

Running the following command on your local machine will create a tunnel that goes through a head node and connects to a compute node on a particular port.

ssh -NL 8888:NodeName:8888 username@cluster.hpcc.ucr.edu

Port 8888 (first) is the local port you will be using on your local machine. NodeName is the compute node where your job is running, which can be found by using the squeue -u $USER command. Port 8888 (second) is the remote port on the compute node. Again, the NodeName and ports will be different depending on where your job runs and what port your job uses.
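
Because the node name and ports change from job to job, it can be convenient to assemble the tunnel command from its parts. The following is a minimal sketch that only prints the command rather than running it. The helper name tunnel_cmd and the example user name are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) the SSH tunnel command for a node/port.
# Arguments: NodeName, remote port, optional local port, optional user name.
tunnel_cmd() {
    local node=$1 remote_port=$2
    local local_port=${3:-$remote_port}      # default: same as remote port
    local user=${4:-${USER:-username}}       # default: current user
    echo "ssh -NL ${local_port}:${node}:${remote_port} ${user}@cluster.hpcc.ucr.edu"
}

tunnel_cmd i54 8888 9999 alice
# prints: ssh -NL 9999:i54:8888 alice@cluster.hpcc.ucr.edu
```

Using a different local port, as in this example, is one way around a local port that is already taken; the browser address then becomes http://localhost:9999.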

At this point you may need to provide a password to make the SSH tunnel. Once this has succeeded, the command will hang (this is normal). Leave this session connected; if you close it, your tunnel will be closed.

Then open a browser on your local computer (PC/laptop) and point it to:

http://localhost:8888

If your job uses TLS/SSL, you may need to try https if the above does not work:

https://localhost:8888

Examples

A perfect example of this method is used for Jupyter Lab/Notebook. For more details please refer to the JupyterLab Usage page.
RStudio Server instances can also be started directly on a compute node and accessed via an SSH tunnel. For details see here.

Desktop Environments

VNC Server (cluster)

Start VNC Server

Log into the cluster:

ssh username@cluster.hpcc.ucr.edu

The VNC programs are only available on Compute Nodes, and additionally the first time you run the vncserver it will need to be configured:

srun -p epyc -c 2 --mem 4GB -t 10:00 --pty bash -l # Start compute session
vncserver -fg # Configure VNC
exit # Leave compute session

You should set a password for yourself, and the read-only password is optional.

After your vncserver is configured, submit a vncserver job to get it started:

sbatch -p epyc --cpus-per-task=4 --mem=10g --time=2:00:00 --wrap='vncserver -fg' --output='vncserver-%j.out'

Note: Appropriate job resources should be requested based on the processes you will be running from within the VNC session.

Check the contents of your job log to determine the NodeName and Port you were assigned:

cat vncserver-*.out

The contents of your slurm job log should be similar to the following:

vncserver
New 'i54:1' desktop is i54:1
Creating default startup script /rhome/username/.vnc/xstartup
Starting applications specified in /rhome/username/.vnc/xstartup
Log file is /rhome/username/.vnc/i54:1.log

The VNC Port used should be 5900+N, N being the display number mentioned above in the format NodeName:DisplayNumber (e.g. i54:1). In this example (default), the port is 5901. If this Port were already in use, the vncserver would automatically increment the DisplayNumber and you might find something like i54:2 or i54:3 and so on.
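
The port arithmetic can be scripted. The following is a minimal sketch, assuming the job log contains a line in the "New 'NodeName:DisplayNumber' desktop is ..." form shown above. Both helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: extract "NodeName:DisplayNumber" from a log line like
#   New 'i54:1' desktop is i54:1
# Reads the files given as arguments, or standard input if there are none.
desktop_from_log() {
    sed -n "s/^New '\([^']*\)' desktop.*/\1/p" "$@" | head -n 1
}

# Hypothetical helper: compute the VNC port (5900 + display number)
# from a "NodeName:DisplayNumber" string.
vnc_port() {
    local display=${1##*:}     # keep only the part after the last colon
    echo $(( 5900 + 10#$display ))
}

vnc_port i54:1   # prints 5901
```

For example, vnc_port "$(desktop_from_log vncserver-123.out)" would print the port for a job log named vncserver-123.out, where that file name is a placeholder.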

Stop VNC Server

To stop the vncserver, you can click on the logout option from the upper right hand menu from within your VNC desktop environment. If you want to kill your vncserver manually, then you will need to do the following:

ssh NodeName 'vncserver -kill :DisplayNumber'

You will need to replace NodeName with the name of the node where your job is running, and DisplayNumber with the display number from your slurm job log.

VNC Client (Desktop/Laptop)

After you know the NodeName and VNC Port you should be able to create an SSH tunnel to your vncserver, like so:

ssh -N -L Port:NodeName:Port cluster.hpcc.ucr.edu

Now let us create an SSH tunnel on your local machine (desktop/laptop) using the NodeName and VNC Port from above:

ssh -L 5901:i54:5901 cluster.hpcc.ucr.edu

After you have logged into the cluster with this shell, log into the node where your VNC server is running:

ssh NodeName

After you have logged into the correct NodeName, just let this terminal sit here, do not close it.

Then launch vncviewer on your local system (laptop/workstation), like so:

vncviewer localhost:5901

After launching the vncviewer, and providing your VNC password (not your cluster password), you should be able to see a Linux desktop environment.

For more information regarding tunnels and VNC in MS Windows, please refer to More VNC Info.

Parallelization

There are 3 major ways to parallelize work on the cluster:

Batch
Thread
MPI

Parallel Methods

For batch jobs, all that is required is that you have a way to split up the data and submit multiple jobs running with the different chunks. Some data sets, for example FASTA files, are very easy to split up (e.g. with fasta-splitter). This can also be more easily achieved by submitting an array job. For more details please refer to Advanced Jobs.
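
As an illustration of splitting data into chunks, here is a minimal awk-based sketch. It is an alternative to fasta-splitter and not a description of how that tool works. The helper name split_fasta and the output naming scheme are hypothetical, and the input is assumed to start with a ">" header line.

```shell
#!/bin/bash
# Hypothetical helper: split a FASTA file into chunks of N records each.
# Output files are named <prefix>_001.fa, <prefix>_002.fa, and so on.
# Assumes the input starts with a ">" header line.
split_fasta() {
    local input=$1 per_chunk=$2 prefix=${3:-chunk}
    awk -v n="$per_chunk" -v prefix="$prefix" '
        /^>/ {
            # Start a new output file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%03d.fa", prefix, ++chunk)
            }
            count++
        }
        { print > out }
    ' "$input"
}
```

Each resulting chunk can then be processed by its own job.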

For threaded jobs, your software must have an option referring to “number of threads” or “number of processors”. Once the thread/processor option is identified in the software (e.g. blastn flag -num_threads 4), you can use it as long as you also request the same number of CPU cores (e.g. slurm flag --cpus-per-task=4).
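
To keep the two numbers in sync, a job script can read the thread count from the environment instead of hard-coding it. This is a minimal sketch: SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is requested, while the helper name thread_count and the file and database names in the printed command are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: use the core count that Slurm allocated, falling back
# to 1 when SLURM_CPUS_PER_TASK is not set (e.g. outside of a job).
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

# Print the command instead of running it; query.fasta and mydb are placeholders.
echo "blastn -num_threads $(thread_count) -query query.fasta -db mydb"
```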

For MPI jobs, your software must be MPI enabled. This generally means that it was compiled with MPI libraries. Please refer to the user manual of the software you wish to use as well as our documentation regarding MPI. It is important that the number of cores used is equal to the number requested.

In Slurm you will need 2 different flags to request cores, which may seem similar, however they have different purposes:

The --cpus-per-task=N will provide N number of virtual cores with locality as a factor. Closer virtual cores can be faster, assuming there is a need for rapid communication between threads. Generally, this is good for threading, however not so good for independent subprocesses nor for MPI.
The --ntasks=N flag will provide N number of physical cores on a single or even multiple nodes. These cores can be further away, since the need for physical CPUs and dedicated memory is more important. Generally this is good for independent subprocesses, and MPI, however not so good for threading.

Here is a table to better explain when to use these Slurm options:

Slurm Flag	Single Threaded	Multi Threaded (OpenMP)	MPI only	MPI + Multi Threaded (hybrid)
`--cpus-per-task`		X		X
`--ntasks`			X	X

As you can see:

A single threaded job would use neither Slurm option, since Slurm already assumes at least a single core.
A multi threaded OpenMP job would use --cpus-per-task.
An MPI job would use --ntasks.
A Hybrid job would use both.
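
The mapping above can be written down as a small lookup. The following is a minimal sketch that only prints the flags. The helper name core_flags and the job kind labels (single, threaded, mpi, hybrid) are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the Slurm core flags for a given kind of job.
# Arguments: job kind, then one count (threaded, mpi) or two counts
# (hybrid: number of tasks, then threads per task).
core_flags() {
    local kind=$1 n=${2:-1} m=${3:-1}
    case $kind in
        single)   echo "" ;;                                # neither flag needed
        threaded) echo "--cpus-per-task=$n" ;;
        mpi)      echo "--ntasks=$n" ;;
        hybrid)   echo "--ntasks=$n --cpus-per-task=$m" ;;
        *)        echo "unknown job kind: $kind" >&2; return 1 ;;
    esac
}

core_flags hybrid 4 8   # prints: --ntasks=4 --cpus-per-task=8
```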

For more details on how these Slurm options work please review Slurm Multi-core/Multi-thread Support.

MPI

MPI stands for the Message Passing Interface. MPI is a standardized API typically used for parallel and/or distributed computing. The HPCC cluster has custom compiled versions of MPI that allow users to run MPI jobs across multiple nodes. These types of jobs have the ability to take advantage of hundreds of CPU cores simultaneously, thus reducing compute time.

Many implementations of MPI exist, however we only support the following:

For general information on MPI under Slurm look here. If you need to compile an MPI application then please email support@hpcc.ucr.edu for assistance.

When submitting MPI jobs it is best to ensure that the nodes are identical, since MPI is sensitive to differences in CPU and/or memory speeds. The batch and intel partitions are designed to be homogeneous, however, the short partition is a mixed set of nodes. When using the short partition for MPI append the constraint flag for Slurm.

Short Example

Here is an example that shows how to ensure that your job will only run on intel nodes from the short partition:

sbatch -p short --constraint=intel myJobScript.sh

NAMD Example

To run a NAMD2 process as an OpenMPI job on the cluster:

Log-in to the cluster

Create SBATCH script

#!/bin/bash -l
#SBATCH -J c3d_cr2_md
#SBATCH -p epyc
#SBATCH --ntasks=32
#SBATCH --mem=16gb
#SBATCH --time=01:00:00
# Load needed modules
# You could also load frequently used modules from within your ~/.bashrc
module load slurm # Should already be loaded
module load openmpi # Should already be loaded
module load namd
# Run job utilizing all requested processors
# Please visit the namd site for usage details: http://www.ks.uiuc.edu/Research/namd/
mpirun --mca btl ^tcp namd2 run.conf &> run_namd.log

Submit SBATCH script to Slurm queuing system

sbatch run_namd.sh

Maker Example

OpenMPI does not function properly with Maker; you must use MPICH. Our version of MPICH does not use the mpirun/mpiexec wrappers. Instead, use srun:

#!/bin/bash -l
#SBATCH -p epyc
#SBATCH --ntasks=32
#SBATCH --mem=16gb
#SBATCH --time=01:00:00
# Load maker
module load maker/2.31.11
srun maker # Provide appropriate maker options here

More examples

The range of differing jobs and how to submit them is endless:

1. Singularity containers
2. Database services
3. Graphical user interfaces
4. Etc ...

For a growing list of examples please visit HPCC Slurm Examples.

¹ These only list the most common CPU Extensions for each platform. A full list of supported extensions can be found using the lscpu command on the respective node type.

Manuals: Queue Policies

Start Times

Start times are a great way to track your jobs:

squeue -u $USER --start

Start times are rough estimates based on the current state of the queue.

Partition Quotas

Each partition has a specific use case. The table below outlines each partition, its use case, and any job, user, or group limits that are in place. An empty cell means the partition sets no limit of that kind, but the job is still bound by the next higher limit. Job limits are capped by user limits, and user limits are capped by group limits.

Partition Name	Usecase	Per-User Limit	Per-Job Limit	Max Job Time
epyc (2021 CPU), intel (2016 CPU), batch (2012 CPU)	CPU Intensive Workloads, Multithreaded, MPI, OpenMP	384 Cores, 1TB memory ¹	64GB memory per Core ²,³	30 Days
short	Short CPU Intensive Workloads, Multithreaded, MPI, OpenMP	384 Cores, 1TB memory	64GB memory per Core, 2-hour time limit	2 Hours
highmem	Memory Intensive Workloads	32 Cores, 2TB memory		30 Days
highclock	Low Parallelism Workloads	32 Cores, 256GB memory		30 Days
gpu	GPU-Enabled Workloads	4 GPUs ⁴, 48 Cores, 512GB memory	16 Cores, 256GB memory ²,⁵	7 Days

In addition to the above limits, there are:

A 768 core group limit that spans all users in a group across all partitions.
An 8 GPU group limit that spans all users in a group across all GPU-enabled partitions.

For more detailed information on node hardware, see our Node List spreadsheet.

Attempting to allocate more memory than a node can support, e.g. 500GB on an Intel node, will cause the job to fail immediately. Limits apply to actively running jobs, and any newly queued job that exceeds a limit will be queued until resources become available. If you require additional resources beyond the listed limits, please see the “Additional Resource Request” section below.

Partition quotas can also be viewed on the cluster using the slurm_limits command.

Additionally, users can have up to 5000 jobs queued or running at the same time. Attempting to queue more than 5000 jobs will cause job submissions to fail with the reason “MaxSubmitJobLimit”.
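The job cap above lends itself to a quick check before a large batch of submissions. The following is a minimal sketch, not an HPCC-provided tool: the function name remaining_job_slots is hypothetical, and the squeue call shown in the comment is only one possible way to count your queued and running jobs.

```shell
# Sketch: how many more jobs can be submitted before reaching the 5000-job cap.
# MAX_JOBS is taken from the text above; remaining_job_slots is a hypothetical helper.
MAX_JOBS=5000

# remaining_job_slots CURRENT_JOB_COUNT -> prints the number of free slots (never below 0)
remaining_job_slots() {
    current=$1
    remaining=$((MAX_JOBS - current))
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
}

# On the cluster, the current count could be obtained with something like:
#   squeue -u "$USER" -h -r | wc -l
# (-h drops the header line, -r lists job array elements one per line)
# A fixed count is used here so the sketch runs anywhere:
remaining_job_slots 4200   # prints 800
```

If the result is smaller than the number of jobs you plan to submit, consider submitting in smaller batches.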

External Labs

Labs external to UCR will have reduced resource limits as follows:

Labs will have a CPU quota of 256 cores across all lab users across all partitions
Per user CPU quotas on epyc, intel, batch, and short will be 128 cores
Per user CPU quotas on highmem will be 16 cores
GPU quotas on the gpu partition will be 4 per-lab, and 2 per-user
CPU quotas on the gpu partition will be 24 per-user and 8 per-job

Private Node Ownership

Labs have the ability to purchase nodes and connect them to the cluster for increased quotas. More information can be found in the Ownership Model section of our Access page.

Additional Resource Request

Sometimes, whether it be due to deadlines or technical limitations, more resources might be needed than are supplied by default. If you require a temporary increase in quotas, please reach out to support@hpcc.ucr.edu with a completed “Justification for Quota Exception” form. The following are typical circumstances that could justify increased quotas:

Urgent Deadlines: e.g. grant submissions, conference presentations, paper deadlines
Special Technical Needs: The limits do not meet the technical requirements of the program(s) you are trying to run.

The amount of additional resources, the length of time that the resources are needed, and the frequency of the requests are all factors that determine whether your request will be accepted. The request must also be within the capacity of the HPCC’s infrastructure and cause minimal disruption to other users. The final decision on approving exception requests, and on how many extra resources to provide, rests with the HPCC Staff, the Director, and in exceptional cases the HPCC Oversight Committee.

Requests limited by unoptimized code/datasets or strictly for the sake of convenience will be denied.

Additionally, at this time we are unable to grant additional resource requests for external labs because our cluster is partially subsidized by our campus. We apologize for this, and suggest looking into national computing facilities or cloud offerings to fill the gap in compute.

Example Scenarios

Per-Job Limit

A job is submitted on the gpu partition. The job requests 32 cores.

This job cannot be submitted, as 32 cores is above the partition’s 16 core per-job limit.

Per-User Limit

You submit a job to the highmem partition, requesting 32 cores.

This job will start successfully, as it is within the partition’s core limit.

You submit a second job while the first job is still running. The new job is requesting 32 cores.

Because you are at your per-user core limit on the highmem partition, the second job will be queued until the first job finishes.

Per-Lab Limit

User A submits a job requesting 384 cores. User B submits a job requesting 384 cores.

Because each user is within their per-user limits and the lab is within their limit, the jobs will run in parallel.

User C submits a job, requesting 16 cores.

Because User A and User B are using all 768 cores within the lab, User C’s job will be queued until either User A’s or User B’s job finishes.
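The scenarios above can be sketched as a small decision function. This only illustrates the rules as described in this section, it is not how Slurm implements them. The function name job_outcome and its argument order are hypothetical, and a per-job limit of 0 is used here to mean that the partition sets no per-job limit.

```shell
# job_outcome REQUESTED PER_JOB_LIMIT USER_IN_USE PER_USER_LIMIT GROUP_IN_USE GROUP_LIMIT
# Prints "rejected", "queued" or "runs" (all values are core counts).
job_outcome() {
    requested=$1; per_job=$2; user_used=$3; per_user=$4; group_used=$5; group_limit=$6
    if [ "$per_job" -gt 0 ] && [ "$requested" -gt "$per_job" ]; then
        # A single job larger than the per-job limit is refused at submission
        echo "rejected"
    elif [ $((user_used + requested)) -gt "$per_user" ] ||
         [ $((group_used + requested)) -gt "$group_limit" ]; then
        # Within the per-job limit, but a user or group limit is currently exhausted
        echo "queued"
    else
        echo "runs"
    fi
}

job_outcome 32 16 0 48 0 768     # gpu, 32 cores vs. 16 core per-job limit: rejected
job_outcome 32 0 0 32 0 768      # highmem, first 32 core job: runs
job_outcome 32 0 32 32 32 768    # highmem, second 32 core job: queued
job_outcome 16 0 0 384 768 768   # User C while the lab uses all 768 cores: queued
```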

Changing Partitions

In srun commands and sbatch scripts, the -p or --partition flag controls which partition/queue a job will run on. For example, using -p epyc will have your job queued and run on the epyc partition. For more examples and information on running jobs, see the Managing Jobs page of our documentation.

Users that have not submitted any jobs in a long time usually have a higher priority than others that have run jobs recently. Thus the estimated start times can be extended to allow everyone their fair share of the system. This prevents a few large groups from dominating the queuing system for long periods of time.

You can see with the sqmore command what priority your job has (list is sorted from lowest to highest priority). You can also check to see how your group’s priority is compared to other groups on the cluster with the “sshare” command.

For example:

sshare

It may also be useful to see your entire group’s fairshare score and who has used the most shares:

sshare -A $GROUP --all

Lastly, if you only want to see your own fairshare score:

sshare -A $GROUP -u $USER

The fairshare score is a number between 0 and 1, where 1 is the best score and 0 the worst. The fairshare score approaches zero the more resources you (or your group) consume. Your individual consumption of resources (usage) does affect your entire group’s fairshare score. The effects of your running/completed jobs on your fairshare score are halved each day (half-life). Thus, after waiting several days without running any jobs, you should see an improvement in your fairshare score.

Here is a very good explanation of fairshare.
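To get a feel for the half-life described above, the remaining weight of past usage can be computed as 0.5 raised to the number of days since the usage occurred. This is a minimal sketch assuming the one-day half-life stated above; the function name usage_weight is hypothetical, and the scheduler's actual fairshare calculation involves more factors than this single decay term.

```shell
# usage_weight DAYS -> fraction of a job's original fairshare impact left after DAYS days,
# assuming the impact is halved once per day
usage_weight() {
    awk -v d="$1" 'BEGIN { printf "%.4f\n", 0.5 ^ d }'
}

usage_weight 0   # 1.0000 (usage from right now counts in full)
usage_weight 1   # 0.5000
usage_weight 3   # 0.1250
usage_weight 7   # 0.0078
```

After a week without running jobs, less than 1% of the original impact remains under this assumption.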

Priority

The fairshare score and the job’s queue wait time are used to calculate your job’s priority. You can use the sprio command to check the priority of your jobs:

sprio -u $USER

Even if your group has a lower fairshare score, your job may still have a very high priority. This would be likely due to the job’s queue wait time, and it should start as soon as possible regardless of fairshare score. You can use the sqmore command to see a list of all jobs sorted by priority.

Backfill

Some small jobs may start before yours, but only if they can complete before your job is due to start, so that they do not delay your start time.

Priority Partition

Some groups on our system have purchased additional hardware. These nodes will not be affected by the fairshare score. This is because jobs submitted to the group’s partition will be evaluated first before any other jobs that have been submitted to those nodes from a different partition.

Using the Preempt Partitions (TENTATIVE)

NOTE: The full release of the preempt partitions is planned for the future and is not yet available!

This guide assumes that you know how to run Interactive and Batch jobs through Slurm. This is a more advanced job that expands on other jobs, so if you are not familiar with running jobs then please see the Managing Jobs page of our documentation.

There are two partitions that will have preemption enabled: “preempt” for CPU jobs, and “preempt_gpu” for GPU jobs.

To fully take advantage of preemption, your jobs must be able to tolerate being cancelled at a random time and restarted at some later point in the future. When your job is preempted, it will be cancelled and requeued. When the job is eligible to start again, it will start from the beginning of the sbatch script as if it were newly run. Jobs run under the preemption-enabled partitions run at a lower priority than the other public partitions, as preemption’s main purpose is to fill capacity that would otherwise be idle.

Your job is only guaranteed 1 minute of uninterrupted runtime after it starts before it is eligible to be preempted by higher priority jobs.

Job Limitations

Time

As mentioned above, jobs can be killed at any time after the 1 minute grace period. Jobs should be set up such that any initialization steps that cannot tolerate being randomly killed happen within that first minute. The max walltime of a job is currently set to 1 day (24 hours). The time limit may be changed in the future depending on how the community utilizes the partitions.

Resources

Currently, users are allowed to use an equal number of CPU cores as their current CPU limit. If you’re currently allowed to use 384 cores on the public partitions, then you can use 384 cores on the preempt partition. The same applies to memory. For the GPU partition, users are currently allowed to use 1 GPU on the “preempt_gpu” partition. Resource limits might be changed in the future depending on how the community utilizes the partitions.

Starting a job

Similar to other partitions, you must specifically queue jobs to the preempt partition. One special thing that is required is to also specify the preempt account using -A preempt. Jobs started on the preempt partition do not count against your lab’s CPU quota.

Interactive Example

To start a CPU preemptable interactive job, you can build off of the following command:

srun -A preempt -p preempt -c 8 --mem 8GB --pty bash -l

This will start a job with 8 cores and 8GB of memory on the preempt partition under the preempt account. Jobs that do not explicitly state -A preempt will fail to start. Note that because this is a preemptable job, your session can be terminated at any moment without notice after the 1 minute grace period.

To start a GPU preemptable interactive job, you can build off of the following command:

srun -A preempt -p preempt_gpu --gres=gpu:1 -c 8 --mem 8GB --pty bash -l

Non-interactive (batch) Example

As with all preemptable jobs, batch jobs can be cancelled at any time without notice and your programs/scripts must be able to tolerate this. Jobs that have been preempted will automatically be requeued to resume running at a later time when resources become available. The $SLURM_RESTART_COUNT environment variable can be used to check if the job has been preempted and restarted to allow you to recover and resume running.

To start a batch job, you can build off of the following sbatch file:

#!/bin/bash -l
#SBATCH -A preempt # Remember to use the "preempt" account!
#SBATCH -p preempt
#SBATCH -c 8
#SBATCH --mem 8GB
#SBATCH --time 1-00:00:00
# Check if this is the first run or a resumed job
# SLURM_RESTART_COUNT is unset on the first run, so default it to 0
if [ "${SLURM_RESTART_COUNT:-0}" -eq 0 ]; then
echo "This is the first time running the job"
# Put the code for the first run here
# Example: initializing data or setting up environment
# Remember that a job only has 1 minute of guaranteed runtime. Keep
# any initialization/recovery short otherwise it might be interrupted
else
echo "The job is being resumed after a preemption"
# Put the code for a resumed job here
# Example: resuming from a checkpoint or continuing work
# Remember that a job only has 1 minute of guaranteed runtime. Keep
# any initialization/recovery short otherwise it might be interrupted
fi
# Common job code that runs regardless of first run or resume
echo "Running main job tasks..."
# Put your main job code here

Jobs that do not explicitly state #SBATCH -A preempt will fail to start. Note that because this is a preemptable job, your job can be cancelled at any moment without notice.
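What "resuming from a checkpoint" looks like depends entirely on your program. As a rough sketch of the idea, the loop below records the last completed step in a file and skips already finished steps when it is started again. The names CHECKPOINT, TOTAL_STEPS, last_completed_step and run_steps are hypothetical, and the echo stands in for real work.

```shell
# Sketch of a restartable loop; CHECKPOINT and TOTAL_STEPS are hypothetical settings.
CHECKPOINT=${CHECKPOINT:-./checkpoint.txt}
TOTAL_STEPS=${TOTAL_STEPS:-5}

# last_completed_step -> prints the last finished step, or 0 if no checkpoint exists
last_completed_step() {
    if [ -s "$CHECKPOINT" ]; then
        cat "$CHECKPOINT"
    else
        echo 0
    fi
}

# run_steps -> runs every step after the last completed one and records progress
run_steps() {
    step=$(last_completed_step)
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        step=$((step + 1))
        echo "working on step $step"   # replace with the real work for this step
        # Write to a temporary file and rename it, so that being killed
        # mid-write cannot leave a half-written checkpoint behind
        echo "$step" > "$CHECKPOINT.tmp" && mv "$CHECKPOINT.tmp" "$CHECKPOINT"
    done
}
```

Calling run_steps in the main part of the sbatch script would then continue at step 4 if a previous run was preempted after finishing step 3. Keeping each step short limits how much work is lost when a preemption happens.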

Selecting Resources

Similar to the “short” partition, the “preempt” partition is a union of all public and private machines, excluding specialty partitions like highmem, highclock, GPU, etc. This means that if you do not specify any restrictions, your job can run on nodes in the batch, intel, or epyc partition. If a certain architecture is required for your job, then you can use the --constraint flag. Similarly, the “preempt_gpu” partition is a union of all public and private GPU machines, similar to “short_gpu”. Constraints can be used on the “preempt_gpu” partition as well to use specific resources.

For example, if you want your job to run on an Intel machine, you can include #SBATCH --constraint=intel in your sbatch script, or --constraint=intel in your srun command. If you want either an Intel or Epyc Rome machine, then you could use #SBATCH --constraint="intel|rome" in your sbatch script, or --constraint="intel|rome" in your srun command (the quotes keep the shell from treating | as a pipe). More information on constraints is available in the Slurm Documentation.
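When the list of acceptable node types is built in a script, the OR expression can be assembled and quoted programmatically. The helper below is a hypothetical sketch (join_features is not a Slurm or HPCC command); it only prints the resulting srun command line rather than running it.

```shell
# join_features FEATURE... -> prints the features joined with "|" (OR in Slurm constraints)
join_features() {
    result=""
    for feature in "$@"; do
        if [ -z "$result" ]; then
            result=$feature
        else
            result="$result|$feature"
        fi
    done
    echo "$result"
}

constraint=$(join_features intel rome)
# Quote the value when passing it on: an unquoted "|" would be read as a shell pipe.
echo "srun -A preempt -p preempt --constraint=\"$constraint\" -c 8 --mem 8GB --pty bash -l"
```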

To view which nodes provide which features, see the Feature Constraints page.

The 384 core/1TB limit is per-user across all CPU compute partitions (epyc, intel, and batch). Jobs that would take you above 384 cores, even if spread across multiple CPU compute partitions, will be queued until resources become available. Specialized partitions (e.g. short, highmem, gpu) will not count against the CPU compute partitions’ quotas. ↩︎
A 64GB-per-core limit is placed to prevent over allocating memory compared to CPUs. If more than a 64GB-per-core ratio is requested, the core count will be increased to match. ↩︎
Allocatable memory per-node in the epyc partition is limited to ~950GB to allow for system overhead. ↩︎
If a user needs more than 4 GPUs, please contact support@hpcc.ucr.edu with a short justification for a temporary increase. ↩︎
Allocatable memory per-node in the gpu partition is dependent on the node. 115GB for gpu[01-02], 500GB for gpu[03-04], 200GB for gpu05, 922GB for gpu06, 950GB for gpu[07-08] ↩︎
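The 64GB-per-core rule in the footnotes above amounts to a ceiling division of the requested memory by 64GB. The sketch below illustrates that arithmetic; the function name effective_cores is hypothetical, and the exact rounding the scheduler applies is an assumption.

```shell
MAX_GB_PER_CORE=64

# effective_cores REQUESTED_CORES REQUESTED_MEM_GB -> prints the core count after
# the request has been raised (if needed) to keep memory at or below 64GB per core
effective_cores() {
    cores=$1
    mem_gb=$2
    # Smallest core count that satisfies the ratio (integer ceiling division)
    needed=$(( (mem_gb + MAX_GB_PER_CORE - 1) / MAX_GB_PER_CORE ))
    if [ "$needed" -gt "$cores" ]; then
        cores=$needed
    fi
    echo "$cores"
}

effective_cores 1 200   # 4: 200GB on 1 core exceeds the ratio, so 4 cores are counted
effective_cores 8 256   # 8: 32GB per core is already within the ratio
```

Because the raised core count is what counts against your quotas, a memory-heavy request can use up a core limit faster than the requested core count suggests.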

Manuals: Package Management

Python

The scope of this manual is a brief introduction on how to manage Python packages.

Python Versions

Different Python versions do not play nice with each other. It is best to only load one Python module at any given time. The miniconda3 module for Python is the default version. This will enable users to leverage the conda installer, but with as few Python packages pre-installed as possible. This is to avoid conflicts with future needs of individuals.

Conda

We have several Conda software modules:

miniconda3 - Basic Python 3 install (Default)
anaconda - Full Python 3 install

For more information regarding our module system please refer to Environment Modules.

The miniconda modules are very basic installs; however, users can choose to unload this basic install in favor of a fuller one (anaconda), like so:

module unload miniconda3
module load anaconda

After loading anaconda, you will see that there are many more Python packages installed (e.g. numpy, scipy, pandas, jupyter, etc.). For a list of installed Python packages try the following:

pip list

Virtual Environments

Sometimes it is best to create your own environment in which you have full control over package installs. Conda allows you to do this through virtual environments.

Initialize

Conda now initializes automatically when you load the corresponding module. There is no need to run conda init or to make any modifications to your ~/.bashrc file.

Configure

Installing many packages can consume a large amount of disk space (e.g. >20GB), thus it is recommended to store conda environments under your bigdata space. If you have bigdata, create the .condarc file (otherwise conda environments will be created under your home directory).

Create the file .condarc in your home, with the following content:

channels:
- defaults
pkgs_dirs:
- ~/bigdata/.conda/pkgs
envs_dirs:
- ~/bigdata/.conda/envs
auto_activate_base: false

After changing the configuration, environments can be moved to the new bigdata location using conda rename -n NAME NAME_tmp, then conda rename -n NAME_tmp NAME to return it to its original name, replacing NAME with the name of the environment you wish to move. If you receive an error while trying to rename, try activating the base conda environment using conda activate base and running the conda rename commands again.

It’s also recommended to clean your old conda packages using conda clean -a. Note that this command can take a while (well over 1 hour) if there are a lot of downloaded packages.
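The configuration step above can be scripted so that an existing .condarc is never overwritten by accident. This is a minimal sketch; the function name write_condarc is hypothetical, and the file content is the configuration shown above.

```shell
# write_condarc HOME_DIR -> creates HOME_DIR/.condarc unless one already exists
write_condarc() {
    target="$1/.condarc"
    if [ -e "$target" ]; then
        echo "keeping existing $target"
        return 0
    fi
    cat > "$target" <<'EOF'
channels:
  - defaults
pkgs_dirs:
  - ~/bigdata/.conda/pkgs
envs_dirs:
  - ~/bigdata/.conda/envs
auto_activate_base: false
EOF
    echo "wrote $target"
}

# Usage on the cluster (not run here):
#   write_condarc "$HOME"
```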

Create a Python 3.10 conda environment, like so:

module load miniconda3 # Should already be auto-loaded during login
conda create -n NameForNewEnv python=3.10 # Many Python versions are available

Activating

Once your virtual environment has been created, you need to activate it before you can use it:

conda activate NameForNewEnv

With more modules being added as conda environments, it’s sometimes required to “stack” user environments on top of module-provided environments. By default, running conda activate deactivates the current environment before activating the new one. To counter this, the --stack flag can be used to effectively “combine” environments, for example conda activate --stack NameForNewEnv. Please see the conda page on Nested Activation for more details.

Deactivating

In order to exit from your virtual environment, do the following:

conda deactivate

Installing packages

Before installing your packages, make sure you are on a compute node. This lets your downloads finish quickly and with less chance of running out of memory. You can start a session on a compute node with the following command:

srun -p short -c 4 --mem=10g --pty bash -l # Adjust the resource request as needed

Here is a simple example for installing packages under your Python virtual environment via conda:

conda install -n NameForNewEnv PackageName

You may need to enable an additional channel to install the package (refer to your package’s documentation):

conda install -n NameForNewEnv -c ChannelName PackageName

Cloning

It is possible for you to copy an existing environment into a new environment:

conda create --name AnotherNameForNewEnv --clone NameForNewEnv

Listing Environments

Run the following to get a list of currently installed conda environments:

conda env list

Removing

If you wish to remove a conda environment run the following:

conda env remove --name myenv

More Info

For more information regarding conda please visit Conda Docs.

Jupyter

You can run jupyter as an interactive job or you can use the web instance, see Jupyter Usage for details.

Virtual Environments (Kernels)

In order to use a custom Python/Conda virtual environment within Jupyter, it must be configured as a kernel. You will need to do the following:

# Create a virtual environment named "ipykernel_py3", if you don't already have one
# It can be named whatever you like, "ipykernel_py3" is just an example.
# You can also indicate a more specific version of Python here. Otherwise you'll get
# the latest version provided by Anaconda.
conda create -n ipykernel_py3 python=3 ipykernel
# Load the new environment
conda activate ipykernel_py3
# Install kernel
# --name is used to define the internal name used by Jupyter, and should not contain spaces.
# --display-name is the name you will see in the Jupyter web interface, should be descriptive.
python -m ipykernel install --user --name ipykernel_py3 --display-name "IPyKernel (Python 3)"

Now when you visit the notebook you should see the option “IPyKernel (Python 3)” (the display name chosen above) when you click the “New” dropdown menu on the home page.

To remove an unwanted kernel, use the following commands:

jupyter kernelspec list # List available kernels
jupyter kernelspec uninstall UNWANTEDKERNEL

Replace UNWANTEDKERNEL with the name of the kernel you wish to remove.

Further reading: Installing the IPython kernel

R

For instructions on how to configure your R environment please visit IRkernel. Since IRkernel should already be installed in the latest version of R, you only need to do the following within R:

IRkernel::installspec(name = 'ir44', displayname = 'R 4.0.1')

R

This section is regarding how to manage R packages.

Current R Version

NOTE: Please be aware that this version of R is built with GCC/8.3.0, which means that previously compiled modules may be incompatible.

Currently the default version of R is R/4.3.0 and is loaded automatically for you.

When a new release of R is available, you should reinstall any local R packages; however, keep the following in mind:

Remove redundantly installed local R packages with the RdupCheck command.
Newer versions of R packages are not backward compatible; once installed, they only work with that specific version of R.

Older R Versions

You can load other versions of R with the following:

module unload R
module avail R
module load R/VERSION

Installing R Packages

The default version of R has many of the most popular R packages already installed and available. It is also possible for you to install additional R packages in your local environment.

Only install packages if they are not already available; this will minimize issues later. You can check whether a package is already available from the command line, like so:

Rscript -e "library('some-package-name')"

Or you can check from within R, like so:

library('some-package-name')

If the package is not available, then proceed with installation.

Bioconductor Packages

To install from Bioconductor you can use the following method:

BiocManager::install(c("package-to-install", "another-packages-to-install"))
Update all/some/none? [a/s/n]: n

For more information please visit Bioconductor Install Page.

GitHub Packages

# Load devtools
library(devtools)
# Replace name with the GitHub account/repo
install_github("duncantl/RGoogleDocs")

Local Packages

# Replace URL with your URL or local path to your .tar.gz file
install.packages("http://hartleys.github.io/QoRTs/QoRTs_LATEST.tar.gz",repos=NULL,type="source")

Manuals: Selected Research Software Usage

The links below point to usage manuals for selected research software that requires more complex usage instructions.

Manuals: Data Storage

Dashboard

HPCC cluster users are able to check on their home and bigdata storage usage from the Dashboard Portal. Note that there is a known issue with the dashboard’s display of usage when users are a part of multiple labs.

Home

Home directories are where you start each session on the HPC cluster and where your jobs start when running on the cluster. This is usually where you place the scripts and various things you are working on. This space is very limited. Please remember that the home storage space quota per user account is 20 GB.

Path	/rhome/`username`
User Availability	All Users
Node Availability	All Nodes
Quota Responsibility	User

Bigdata

Bigdata is an area where large amounts of storage can be made available to users. A lab purchases bigdata space separately from access to the cluster. This space is then made available to the lab via a shared directory and individual directories for each user.

Lab Shared Space This directory can be accessed by the lab as a whole.

Path	/bigdata/`labname`/shared
User Availability	Labs that have purchased space.
Node Availability	All Nodes
Quota Responsibility	Lab

Individual User Space This directory can be accessed by specific lab members.

Path	/bigdata/`labname`/`username`
User Availability	Labs that have purchased space.
Node Availability	All Nodes
Quota Responsibility	Lab

Non-Persistent Space

Frequently, there is a need for faster temporary storage. For example activities like the following would fall under this category:

Output a significant amount of intermediate data during a job
Access a dataset from a faster medium than bigdata or the home directories
Write out lock files

These types of activities are well suited to fast non-persistent spaces. Below are the filesystems available on the HPC cluster that are best suited for them.

SSD Backed Scratch Space This space is much faster than the persistent space (/rhome, /bigdata), but slower than RAM based storage. The scratch space should be used through the $SCRATCH environment variable, which is automatically set for each job. This location is local to the node the job runs on and is automatically deleted once the job has finished.

Path	/scratch
User Availability	All Users
Node Availability	All Nodes
Quota Responsibility	N/A

Temporary Space This is a standard space available on all Linux systems. Please be aware that it is limited to the amount of free disk space on the node you are running on.

Path	/tmp
User Availability	All Users
Node Availability	All Nodes
Quota Responsibility	N/A

RAM Space This type of space takes away from physical memory but allows extremely fast access to the files located on it. When submitting a job you will need to factor in the space your job is using in RAM as well. For example, if you have a dataset that is 1G in size and use this space, it will take at least 1G of RAM.

Path	/dev/shm
User Availability	All Users
Node Availability	All Nodes
Quota Responsibility	N/A
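A common pattern with these spaces is to stage input data into the fast location, work there, and copy the results back to persistent storage before the job ends. The sketch below shows that pattern under stated assumptions: the function name run_in_scratch is hypothetical, the line count stands in for real work, and the fallback to a temporary directory exists only so the sketch also runs outside of a Slurm job, where $SCRATCH is not set.

```shell
# run_in_scratch INPUT_FILE OUTPUT_DIR
# Stages INPUT_FILE into scratch, works on the copy, and copies the result back.
run_in_scratch() {
    input=$1
    output_dir=$2
    # $SCRATCH is set for each job on the cluster; fall back to a temp dir elsewhere
    work_dir=${SCRATCH:-$(mktemp -d)}
    cp "$input" "$work_dir/" || return 1
    name=$(basename "$input")
    # Placeholder for real work: count the lines of the staged copy
    wc -l < "$work_dir/$name" | tr -d ' ' > "$work_dir/result.txt"
    # Scratch is deleted when the job finishes, so copy results back first
    mkdir -p "$output_dir" && cp "$work_dir/result.txt" "$output_dir/"
}

# Usage inside a job script (paths are placeholders):
#   run_in_scratch ~/bigdata/input.txt ~/bigdata/results
```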

Usage and Quotas

To quickly check your usage and quota limits:

check_quota home
check_quota bigdata

To get the usage of your current directory, run the following command:

du -sh .

To calculate the sizes of each separate sub directory, run:

du -sch *
du -sch .[!.]* * | sort -h # includes hidden files and directories

This may take some time to complete; please be patient.

For more information on your home directory, please see the Linux Basics Orientation.

Automatic Backups and Snapshots

The cluster creates monthly backups, however it is still advantageous for users to periodically make copies of their critical data to a separate storage device. The cluster is a production system for research computations with a very expensive high-performance storage infrastructure.

Home snapshots are created daily and kept for one week. Bigdata snapshots are created weekly and kept for one month.

Home and bigdata backups are located under the following respective directories:

/rhome/.snapshots/
/bigdata/.snapshots/

The individual snapshot directories have names with numerical values in epoch time format. The higher the value the more recent the snapshot.

To view the exact time of when each snapshot was taken execute the following commands:

mmlssnapshot home
mmlssnapshot bigdata
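If you only have a snapshot directory name at hand, the epoch value can also be converted to a readable date directly. This sketch assumes that the directory name is a plain number of seconds since 1970-01-01 UTC (the text above does not show an example name) and that GNU date is available, as it is on most Linux systems; the function name snapshot_time is hypothetical.

```shell
# snapshot_time EPOCH_SECONDS -> prints the corresponding UTC date and time
# (relies on the "-d @SECONDS" syntax of GNU date)
snapshot_time() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

snapshot_time 0            # 1970-01-01 00:00:00 UTC
snapshot_time 1700000000   # 2023-11-14 22:13:20 UTC
```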

Manuals: Sharing Data

Permissions

It is useful to share data and results with other users on the cluster, and we encourage collaboration. The easiest way to share a file is to place it in a location that both users can access. Then the second user can simply copy it to a location of their choice. However, this requires that the file permissions permit the second user to read the file. Basic file permissions on Linux and other Unix-like systems are composed of three groups: owner, group, and other. Each one of these represents the permissions for a different group of people: the user who owns the file, all the members of the group that owns the file, and everyone else, respectively. Each group has 3 permissions: read, write, and execute, represented as r, w, and x. For example, the following file is owned by the user username (with read, write, and execute), owned by the group groupname (with read and execute), and everyone else cannot access it.

username@pigeon:~$ ls -l myFile
-rwxr-x--- 1 username groupname 1.6K Nov 19 12:32 myFile

If you wanted to share this file with someone outside the groupname group, read permissions must be added to the file for ‘other’:

username@pigeon:~$ chmod o+r myFile

To learn more about ownership, permissions, and groups please visit Linux Basics Permissions.

Set Default Permissions

In Linux, it is possible to set the default file permissions for new files. This is useful if you are collaborating on a project, or frequently share files and do not want to be constantly adjusting permissions. The command responsible for this is called ‘umask’. You should first check what your default permissions currently are by running ‘umask -S’.

username@pigeon:~$ umask -S
u=rwx,g=rx,o=rx

To set your default permissions, simply run umask with the correct options. Please note, that this does not change permissions on any existing files, only new files created after you update the default permissions. For instance, if you wanted to set your default permissions to you having full control, your group being able to read and execute your files, and no one else to have access, you would run:

username@pigeon:~$ umask u=rwx,g=rx,o=

It is also important to note that these settings only affect your current session. If you log out and log back in, these settings will be reset. To make your changes permanent you need to add them to your .bashrc file, which is a hidden file in your home directory (if you do not have a .bashrc file, you will need to create an empty file called .bashrc in your home directory). Adding umask to your .bashrc file is as simple as adding your umask command (such as umask u=rwx,g=rx,o=r) to the end of the file. Then simply log out and back in for the changes to take effect. You can double check that the settings have taken effect by running umask -S.
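To see what a given umask does before committing it to your .bashrc, you can try it in a subshell, which leaves the umask of your current session untouched. The helper below is a hypothetical sketch (mode_with_umask is not a standard command) and assumes GNU stat, as found on Linux. Note that new regular files are created without execute permission regardless of the umask, which is why the results are 640 and 644 rather than 750 and 755.

```shell
# mode_with_umask UMASK_SPEC -> prints the octal mode a newly created file
# receives under that umask; the parentheses run everything in a subshell
mode_with_umask() {
    (
        umask "$1"
        tmp_dir=$(mktemp -d)
        touch "$tmp_dir/new_file"
        stat -c '%a' "$tmp_dir/new_file"
        rm -rf "$tmp_dir"
    )
}

mode_with_umask u=rwx,g=rx,o=     # 640: owner read/write, group read, others nothing
mode_with_umask u=rwx,g=rx,o=rx   # 644: others may read as well
```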

To learn more about umask please visit What is Umask and How To Setup Default umask Under Linux?.

File Transfers

For file transfers and data sharing, both command-line and GUI applications can be used. For beginners we recommend the FileZilla GUI application (download/install from here) since it is available for most OSs including macOS, Windows, Linux and ChromeOS. A basic user manual for FileZilla is here and a video tutorial is here. Alternative user-friendly SCP/SFTP GUI applications include Cyberduck and WinSCP for Mac and Windows OSs, respectively.

FileZilla Usage

FileZilla supports both Password+DUO and SSH Key based authentication methods. Both options are described below.

Authentication with Password+DUO

When using FileZilla a new site can be created by selecting File -> Site Manager. In the subsequent window, New Site should be selected. Next the following information should be provided in the right pane of the General tab.

Protocol: SFTP
Host: cluster.hpcc.ucr.edu
Logon Type: Interactive
User: <username>

Under <username> the actual username of an HPCC account should be provided. The Logon Type can be Interactive or Key File for Password+DUO or SSH Keys authentication, respectively. When choosing Password+DUO authentication, the max connections should be configured. To do so, navigate to the Transfer Settings tab and make the following settings.

Limit number of simultaneous connections: checked
Maximum number of connections: 1

By clicking “OK” the new site will be saved. Subsequently, one can select the new site from the main window by clicking the arrow next to the site list, or just reopen the Site Manager and clicking the “connect” button from the new site window.

Authentication with SSH Key

For SSH key based access, users should make the selections shown in the figure below. For this access method it is important to use the Site Manager, as FileZilla’s Quick Access method will not work here.

FileZilla settings with an SSH key. For generating SSH keys see here.

Command-line SCP

Advantages of this method include: batch up/downloads and ease of automation. A detailed manual is available here.

To copy files to the server, run the following on your workstation or laptop:

scp -r <path_to_directory> <your_username>@<host_name>:

To copy files from the server, run the following on your workstation or laptop:

scp -r <your_username>@<host_name>:<path_to_directory> .

Copying bigdata

Rsync can:

Copy (transfer) directories between different locations
Perform transfers over the network via SSH
Compare large data sets (-n, --dry-run option)
Resume interrupted transfers

Rsync Notes:

Rsync can be used on Windows, but you must install Cygwin. Most Mac and Linux systems already have rsync installed by default.
Always put a / after both folder names, e.g. FOLDER_A/. Failing to do so will nest the folders every time you try to resume: without the trailing / you will get a second FOLDER_A inside FOLDER_A (FOLDER_A/FOLDER_A/).
Rsync only copies by default.
Once the rsync command is done, run it again. The second run will be shorter and can be used as a double check. If there was no output from the second run then nothing changed.
To learn more try man rsync
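
The trailing-slash rule above can be tried out locally before starting a large transfer. The following is a minimal sketch that works only on throwaway directories created with mktemp; the directory layout and file name are made up for illustration, and the demo is skipped if rsync is not installed.

```shell
# Illustrative only: all paths live in a temporary directory
work=$(mktemp -d)
mkdir -p "$work/src/FOLDER_A" "$work/good" "$work/bad"
echo "data" > "$work/src/FOLDER_A/file.txt"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash on the source: copies the contents of FOLDER_A
    rsync -a "$work/src/FOLDER_A/" "$work/good/FOLDER_A/"
    # No trailing slash on the source: copies the folder itself, so it nests
    rsync -a "$work/src/FOLDER_A" "$work/bad/FOLDER_A/"
    # Second run with -i (itemize changes) as a double check;
    # if it prints nothing, nothing was left to transfer
    rsync -ai "$work/src/FOLDER_A/" "$work/good/FOLDER_A/"
    # Show where the file ended up in each case
    find "$work/good" "$work/bad" -type f | sort
else
    echo "rsync not found; skipping demo"
fi
```

Comparing the two paths printed by find shows the extra FOLDER_A level created by the command without the trailing slash.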

If you are transferring to, or from, your laptop/workstation, you must run the rsync command locally from your laptop/workstation.

To transfer to the cluster:

rsync -av --progress FOLDER_A/ cluster.hpcc.ucr.edu:FOLDER_A/

To transfer from the cluster:

rsync -av --progress cluster.hpcc.ucr.edu:FOLDER_A/ FOLDER_A/

Rsync will use SSH and will ask you for your cluster password, the same way SSH or SCP does.

If your rsync transfer was interrupted, rsync can continue where it left off. Simply run the same command again to resume.

Copying large folders on the cluster between Directories

If you want to synchronize the contents from one directory to another, then use the following:

rsync -av --progress PATH_A/FOLDER_A/ PATH_B/FOLDER_A/

Rsync does not move but only copies. Thus, you need to delete the original once you have confirmed that everything has been transferred.
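
Before deleting the original, it can help to confirm that both copies really match. One possible check, sketched below, is diff -r, which exits with status 0 only if the two directory trees have identical content. The directories are temporary stand-ins for PATH_A/FOLDER_A and PATH_B/FOLDER_A, and cp stands in for the rsync step so that the sketch runs without extra tools.

```shell
# Temporary stand-ins for PATH_A/FOLDER_A and PATH_B/FOLDER_A (illustrative)
work=$(mktemp -d)
src="$work/PATH_A/FOLDER_A"
dst="$work/PATH_B/FOLDER_A"
mkdir -p "$src" "$dst"
echo "result" > "$src/out.txt"
cp -a "$src/." "$dst/"          # stands in for the rsync copy step

# Delete the original only if both trees have identical content
if diff -r "$src" "$dst" >/dev/null; then
    rm -r "$src"
    echo "verified, original removed"
else
    echo "differences found, original kept" >&2
fi
```

For very large data sets a full content comparison can take a long time; rerunning rsync as described above is a cheaper, though less strict, check.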

Copying large folders between the cluster and other servers

If you want to copy data from the cluster to your own server, or another remote system, use the following:

rsync -ai FOLDER_A/ server2.xyz.edu:FOLDER_A/

The server2.xyz.edu machine must be a server that accepts Rsync connections via SSH.

Note: This is not intended to be used as a long-term solution or referenced in publications. It should be used for internal project purposes only. If a long-term solution is required, please use a web or cloud-based installation.

Simply create a symbolic link or move the files into your html directory when you want to share them. For example, log into the HPC cluster and run the following:

# Make sure you have an html directory
mkdir -p ~/.html
# Make sure permissions are set correctly
chmod a+x ~/
chmod a+rx ~/.html
# Make a new web project directory in your home directory
mkdir -p ~/www-project
chmod a+rx ~/www-project
# Create a default test file
echo '<h1>Hello!</h1>' > ~/www-project/index.html
# Create shortcut/link for new web project in html directory
ln -s ~/www-project ~/.html/

Now, test it out by pointing your web-browser to https://cluster.hpcc.ucr.edu/~username/www-project/ Be sure to replace username with your actual user name. The forward slash at the end is important.

Common Problems

“403 Forbidden” / You don’t have permissions

If using a symbolic link to data stored elsewhere on the cluster, every folder in the tree leading up to the shared folder must, at a minimum, have the execute permission (chmod a+x folder_name). For example, if you have a symbolic link to /bigdata/mylab/myuser/data/web-content, then myuser, data, and web-content must all have the execute permission (bigdata and mylab should already have them).
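
To find the folder that is blocking access, one can walk up the directory tree and report every directory that lacks the execute bit for others. The sketch below is one way to do this; check_web_path is a made-up helper name, and it assumes GNU stat as found on Linux systems.

```shell
# Print every directory between a shared folder and a top directory
# that lacks the execute (search) bit for "others".
# check_web_path is a hypothetical helper; assumes GNU stat (Linux).
check_web_path() {
    dir=$1
    top=$2
    while :; do
        perms=$(stat -c '%A' "$dir")        # e.g. drwxr-x--x
        case $perms in
            *[xt]) ;;                       # others may traverse this directory
            *) echo "missing a+x: $dir" ;;
        esac
        [ "$dir" = "$top" ] && break        # stop at the chosen top directory
        [ "$dir" = "/" ] && break           # never walk past the root
        dir=$(dirname "$dir")
    done
}

# Example call using the path from the text (second argument: where to stop)
# check_web_path /bigdata/mylab/myuser/data/web-content /bigdata/mylab/myuser
```

Each directory printed by the function can then be fixed with chmod a+x as described above.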

Password Protect Web Pages

Files in web directories can be password protected. First create a password file and then create a new user:

touch ~/.html/.htpasswd
htpasswd ~/.html/.htpasswd newwebuser

This will prompt you to enter a password for the new user ‘newwebuser’. Create a new directory, or go to an existing directory, that you want to password protect:

mkdir ~/.html/locked_dir
cd ~/.html/locked_dir

For the above commands you can choose any directory name you want.

Then place the following content within a file called .htaccess:

AuthName 'Please login'
AuthType Basic
AuthUserFile /rhome/username/.html/.htpasswd
require user newwebuser

Now, test it out by pointing your web-browser to http://cluster.hpcc.ucr.edu/~username/locked_dir Be sure to replace username with your actual user name for the above code and URL.

Google Drive

There are several tools for transferring files from Google Drive to the cluster; however, RClone may be the easiest to set up.

Create an SSH tunnel to the cluster (MS Windows users should use MobaXterm):
```
ssh -L 53682:localhost:53682 username@cluster.hpcc.ucr.edu
```
Once you have logged into the cluster with the above command, then load rclone via the module system and run it, like so:
```
module load rclone
rclone config
```
After that, follow this RClone Walkthrough to complete your setup.

Globus

See Globus page here.

Manuals: Security

Protection Levels and Classification

UCR protection levels and data classifications are outlined by UCOP as a UC-wide policy: UCOP Institutional Information and IT Resource Classification. According to this documentation, there are 4 levels of protection for 4 classifications of data:

Protection Level	Policy	Examples
P1 - Minimal	IS-1	Internet facing websites, press releases, anything intended for public use
P2 - Low	IS-2	Unpublished research work, intellectual property NOT classified as P3 or P4
P3 - Moderate	IS-3	Research information classified by an Institutional Review Board as P3 (ie. dbGaP from NIH)
P4 - High	IS-4	Protected Health Information (PHI/HIPAA), patient records, sensitive identifiable human subject research data, Social Security Numbers

The HPC cluster could be compliant with other security policies (e.g. NIH); however, the policy must be reviewed by our security team.

At this time the HPC cluster is not an IS-4 (P4) compliant cluster. If you need to work with very sensitive data, it may be best to work with UCSD and their Sherlock service. Our cluster is IS-3 compliant; however, there are several responsibilities that users need to adhere to.

General Guidelines

First, please contact us (support@hpcc.ucr.edu) before transferring any data to the cluster. After we have reviewed your needs, data classification and appropriate protection level, then it may be possible to proceed to use the HPCC.

Here are a few basic rules to keep in mind:

Always be aware of access control methods (Unix permissions and ACLs) and do not allow others to view the data (e.g. chmod 400 filename)
Do not make unnecessary copies of the data
Do not transfer the data to insecure locations
Encrypt data when/where possible
Delete all data when it is no longer needed
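
As a small illustration of the first rule, the sketch below creates a file that only its owner can read and then checks the resulting mode. It uses a temporary directory and a made-up file name, and assumes GNU stat as found on Linux systems.

```shell
# Illustrative file in a private temporary directory
secure_dir=$(mktemp -d)              # mktemp -d creates the directory with mode 700
umask 077                            # files created from now on get no group/other access
echo "sensitive" > "$secure_dir/data.txt"
chmod 400 "$secure_dir/data.txt"     # owner read-only
# Verify the permissions (prints the octal mode and the file name)
stat -c '%a %n' "$secure_dir/data.txt"
```

Note that umask only affects the current shell session and the files created in it afterwards; existing files keep their permissions until changed with chmod.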

Access Controls

When sharing files with others, it is imperative that proper permissions are used. However, basic Unix permissions (user, group, other) may not be adequate. It is better to use ACLs in order to allow fine-grained access to sensitive files.

GPFS ACLs

GPFS is used for most of our filesystems (/rhome and /bigdata) and it uses nfsv4 style ACLs. Users are able to explicitly allow many individuals, or groups, access to specific files or directories.

# Get current permissions and store in acls file
mmgetacl /path/to/file > ~/acls.txt
# Edit acls file containing permissions
vim ~/acls.txt
# Apply new permissions to file
mmputacl -i ~/acls.txt /path/to/file
# Delete acls file
rm ~/acls.txt

For more information regarding GPFS ACLs refer to the following: GPFS ACLs. An example for granting another user access to a file is given here.

XFS ACLs

The XFS filesystem is used for the CentOS operating system and typical unix locations (/,/var,/tmp,etc). For more information on how to use ACLs under XFS, please refer to the following: CentOS 7 XFS

Note: ACLs are not applicable to gocryptfs, which is a FUSE filesystem, not GPFS nor XFS.

Encryption

Under the IS-3 policy, P3 data encryption is mandatory. It is best if you get into the habit of doing encryption in transit, as well as encryption at rest. This means, when you move the data (transit) or when the data is not in use (rest), it should be encrypted.

In Transit

When transferring files make sure that files are encrypted in flight with one of the following transfer protocols:

SCP
SFTP
RSYNC (via SSH)

The destination for sensitive data on the cluster must also be encrypted at rest under one of the following secure locations:

/dev/shm/ - This location is in RAM, so it does not exist at rest (ensure proper ACLs)
/run/user/$EUID/unencrypted - This location is manually managed, and should be created for access to unencrypted files.

It is also possible to encrypt your files with GPG (GPG Example), before they are transferred. Thus, during transfer they will be GPG encrypted. However, decryption must occur in one of the secure locations mentioned above.

Note: Never store passphrases/passwords/masterkeys in an insecure location (e.g. a plain text script under /rhome).

At Rest

The following methods are available on the cluster for encryption at rest:

GPG encryption of files via the command line (see GPG Example); however, you must ensure proper ACLs, and decryption must occur in a secure location.
Create your own location with gocryptfs.

GocryptfsMgr

You can use gocryptfs directly or use the gocryptfsmgr, which automates a few steps in order to simplify things.

Here are the basics when using gocryptfsmgr:

# Load the gocryptfs module. Not strictly required, but sets a handful of useful environment variables
module load gocryptfs
# Create new encrypted data directory
gocryptfsmgr create bigdata privatedata1
# List all encrypted and unencrypted (access point) directories
gocryptfsmgr list
# Unencrypt privatedata1 (create access point)
gocryptfsmgr open bigdata privatedata1 rw
# Transfer files (ie. SCP,SFTP,RSYNC)
scp user@remote-server:sensitive_file.txt $UNENCRYPTED/privatedata1/sensitive_file.txt
# Remove access point (re-encrypt) privatedata1
gocryptfsmgr close privatedata1
# Remove all access points (re-encrypt all)
gocryptfsmgr quit

For subsequent access to the encrypted space (e.g. computation or analysis), the following procedure is recommended:

# Request a 2hr interactive job on an exclusive node, resources can be adjusted as needed
srun -p short --exclusive=user --pty bash -l
# Unencrypt privatedata1 in read-only mode (create access point)
gocryptfsmgr open bigdata privatedata1 ro
# Read file contents from privatedata1 (simulating work or analysis)
cat $UNENCRYPTED/privatedata1/sensitive_file.txt
# List all encrypted and unencrypted (access points) directories
gocryptfsmgr list
# Make sure we re-encrypt (close access point) for privatedata1
gocryptfsmgr close privatedata1
# Exit from interactive job
exit

With the above methods you can create multiple encrypted directories and access points and move between them.

Gocryptfs

When using gocryptfs directly, you will need to know a bit more about how it works. The gocryptfs module on the HPCC cluster uses these predefined variables:

HOME_ENCRYPTED = /rhome/$USER/encrypted - Very small encrypted space, not recommended to use
BIGDATA_ENCRYPTED = /rhome/$USER/bigdata/encrypted - Best encrypted space for private data sets
SHARED_ENCRYPTED = /rhome/$USER/shared/encrypted - Encrypted space when intending to share data sets with group
UNENCRYPTED = /run/user/$UID/unencrypted - Access directory where encrypted data will be viewed as unencrypted

Here is an example how to create an encrypted directory under the BIGDATA_ENCRYPTED location using gocryptfs:

# Load gocryptfs software
module load gocryptfs
# Create empty data directory
mkdir -p $BIGDATA_ENCRYPTED/privatedata1
# Then initialize empty directory and encrypt it
gocryptfs -aessiv -init $BIGDATA_ENCRYPTED/privatedata1
# Create access point directory where encrypted files will be viewed as unencrypted
mkdir -p $UNENCRYPTED/privatedata1
# After that mount the encrypted directory on the access point and open a new shell within it
gocryptfssh $BIGDATA_ENCRYPTED/privatedata1
# Transfer files into the access point (e.g. SCP, SFTP, RSYNC)
scp user@remote-server:sensitive_file.txt $UNENCRYPTED/privatedata1/sensitive_file.txt
# Exiting the `gocryptfssh` shell will automatically close the mount
exit

For subsequent access to the encrypted space (e.g. computation or analysis), the following procedure is recommended:

# Request a 2hr interactive job on an exclusive node, resources can be adjusted as needed
srun -p short --exclusive=user --pty bash -l
# Load gocryptfs software
module load gocryptfs
# Create unencrypted directory
mkdir -p $UNENCRYPTED/privatedata1
# Mount encrypted filesystem as read-only and unmount after idling for 1 hour
gocryptfs -ro -i 1h -sharedstorage $BIGDATA_ENCRYPTED/privatedata1 $UNENCRYPTED/privatedata1
# Read file contents (simulating work or analysis)
cat $UNENCRYPTED/privatedata1/sensitive_file.txt
# Manually close access point when analysis has completed
fusermount -u $UNENCRYPTED/privatedata1
# Delete old empty access point
rmdir $UNENCRYPTED/privatedata1

WARNING: Avoid writing to the same file at the same time from different nodes. The encrypted file system cannot handle simultaneous writes and will corrupt the file. If simultaneous jobs are necessary, then using write mode from a head node and read-only mode from compute nodes may be the best solution here. Also, be mindful of the remaining job time and make sure that you have unmounted the unencrypted directories before your job ends.
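
One way to reduce the risk of leaving an access point open is to register a cleanup handler with the shell's trap mechanism. The following is only a sketch under stated assumptions: close_access_point is a made-up helper name, and the access point falls back to a temporary directory when $UNENCRYPTED is not set, so that the snippet can be tried outside the cluster.

```shell
# Sketch: release an access point whenever the shell or job script exits.
# close_access_point is a hypothetical helper; $UNENCRYPTED is set by the
# gocryptfs module, with a temporary directory as fallback for testing.
access_root="${UNENCRYPTED:-$(mktemp -d)}"
access_point="$access_root/privatedata1"
mkdir -p "$access_point"

close_access_point() {
    # Unmount only if something is actually mounted there
    if command -v mountpoint >/dev/null 2>&1 && mountpoint -q "$1"; then
        fusermount -u "$1"
    fi
    # Remove the access point directory if it still exists (must be empty)
    if [ -d "$1" ]; then
        rmdir "$1"
    fi
}

# Run the cleanup on exit; common signals are routed through exit
trap 'close_access_point "$access_point"' EXIT
trap 'exit 1' HUP INT TERM

# ... mount with gocryptfs and run the analysis here ...
```

Whether the handler gets a chance to run when a job reaches its time limit depends on how the scheduler signals the job, so this complements rather than replaces keeping an eye on the remaining job time.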

For another example on how to use gocryptfs on an HPC cluster: Luxembourg HPC gocryptfs Example

Deletion

To ensure the complete removal of data, it is best to shred files instead of removing them with rm. The shred program will overwrite the contents of a file with randomized data such that recovery of this file will be very difficult, if not impossible.

Instead of using the common rm command to delete something, please use the shred command, like so:

shred -u somefile

The above command will overwrite the file with random data, and then remove (unlink) it.

If we want to be even more secure, we can pass over the file seven times (six random passes plus a final pass of zeros) to ensure that reconstruction is nearly impossible, then remove it:

shred -v -n 6 -z -u somefile
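
Since shred operates on files rather than directories, a whole directory tree can be handled by combining it with find. The sketch below uses a temporary directory with made-up file names; it shreds every regular file and then removes the empty directories that remain.

```shell
# Illustrative directory tree in a temporary location
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/project/sub"
echo "a" > "$tmpdir/project/a.txt"
echo "b" > "$tmpdir/project/sub/b.txt"

# Overwrite and unlink every regular file below the directory
find "$tmpdir/project" -type f -exec shred -u {} +
# Remove the now-empty directories, deepest first
find "$tmpdir/project" -depth -type d -exec rmdir {} +
```

The options from the example above (-n 6 -z) can be added to the shred call in the same way.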

Manuals: Communicating

Communicating with others

The cluster is a shared resource, and communicating with other users can help to schedule large computations.

Looking-Up Specific Users

A convenient overview of all users and their lab affiliations can be retrieved with the following command:

user_details.sh

You can search for specific users by running:

MATCH1='username1' # Searches by real name, and username, and email address and PI name
MATCH2='username2'
user_details.sh | grep -P "$MATCH1|$MATCH2"

Listing Users with Active Jobs on the Cluster

To get a list of usernames:

squeue --format '%u' --noheader | sort | uniq

To get the list of real names:

grep <(user_details.sh | awk '{print $2,$3,$4}') -f <(squeue --format '%u' --noheader | sort | uniq) | awk '{print $1,$2}'

To get the list of emails:

grep <(user_details.sh | awk '{print $4,$5}') -f <(squeue --format '%u' --noheader | sort | uniq) | awk '{print $2}'

Manuals: Terminal-based Working Environments

Terminal IDEs

This page introduces tmux and Neovim as terminal-based working environments for working efficiently on remote systems like HPC clusters or cloud systems. They can be used independently or in combination, and provide many useful functionalities for working in local or remote terminal environments. For users who prefer a more graphical environment, VSCode might be a good alternative.

Tmux: virtual terminal multiplexer

Tmux is a virtual terminal multiplexer providing persistent terminal sessions that are de- and re-attachable. It is an incredibly useful tool for terminal-based work on remote systems. Major advantages of tmux are:

Work on remote system cannot get lost due to disconnects. One can always re-attach to a session from the same or different computers.
Many useful functionalities ‘power charge’ terminals.

Screen is a related virtual terminal multiplexer tool that can be used as an alternative (not covered here). It has similar functionalities as tmux.

Tmux can be downloaded and installed from here. A custom tmux (and nvim) environment with extensions can be installed by HPCC users with a single command (here Install_Nvim-R_Tmux). The same script also installs several useful Nvim plugins (see below). Alternatively, the install script can be downloaded from here. After installing the provided tmux environment in a user account, one needs to log out and in again to activate the environment. Note, installing the custom environment is optional and not required for any of the following examples. Users also need to be aware that the install script will make changes to their .bashrc and .tmux.conf files. If this is not desirable, then one can install the components stepwise, or run the install, and then undo any configuration changes by following the instructions printed to the screen during the install.

Tmux: Window Split into Several Panes

Important considerations for virtual tmux sessions

Both tmux and screen sessions run on the system where they were initialized.
To reattach to a specific session on a remote system, like the HPCC cluster, one needs to first log in to the same node (here headnode) and then re-attach to the corresponding tmux session.
It is important not to run tmux (or screen) sessions on compute nodes since tmux sessions are persistent. Instead tmux sessions should be run on a headnode. From an open tmux session one can then log in to a compute node via srun, or just submit jobs from a tmux session with sbatch.

Start Tmux

module load tmux: only required on systems that use environment modules, and the tmux load command is not specified in a user’s .bashrc file
tmux: starts a new tmux session
tmux a: attaches to an existing session, or a default session of a system, e.g. specified under ~/.tmux.conf
tmux attach -t <id>: attaches to a running session selected under <id>
tmux ls: lists existing tmux sessions

Prefix

The prefix for controlling tmux depends on a user’s settings in their ~/.tmux.conf file.

Ctrl-b: default is hard to type, and thus often not preferred
Ctrl-a: more commonly used, also on HPCC

The prefix can be changed by placing the following lines into ~/.tmux.conf.

unbind C-b
set -g prefix C-a

Mouse Support

Mouse support in tmux can be enabled with the following command.

Ctrl-a : set -g mouse on

To turn mouse support on by default, include on a separate line of ~/.tmux.conf this command: set -g mouse on

Important keybindings for tmux

Tmux sessions are organized in panes, windows and sessions themselves, where a window can have a single or several panes, and a session a single or several windows. The following commands for controlling tmux are organized by pane-, window- and session-level commands.

Pane-level commands

Ctrl-a %: splits pane vertically
Ctrl-a ": splits pane horizontally
Ctrl-a o or Ctrl-a <arrow keys>: jumps cursor to next pane
Ctrl-a Ctrl-o: swaps panes
Ctrl-a <space bar>: rotates pane arrangement
Ctrl-a Alt <left or right>: resizes pane to the left or right
Ctrl-a Esc <up or down>: resizes pane up or down
Ctrl-a z: zoom into split pane (full window view); press again to zoom out

Window-level commands

Ctrl-a n: switches to next tmux window
Ctrl-a Ctrl-a: switches to previous tmux window
Ctrl-a c: creates a new tmux window; any tmux window can be closed by typing exit on the command prompt
Ctrl-a 1: switches to specific tmux window selected by number

Session-level commands

Ctrl-a d: detaches from current session
Ctrl-a s: switch between available tmux sessions
$ tmux new -s <name>: starts new session with a specific name
$ tmux ls: lists available tmux session(s)
$ tmux attach -t <id>: attaches to specific tmux session
$ tmux attach: reattaches to session
$ tmux kill-session -t <id>: kills a specific tmux session
Ctrl-a : kill-session: kills a session from tmux command mode that can be initiated with Ctrl-a :
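
The session-level commands above can be combined into a small attach-or-create pattern, which is handy after reconnecting to a head node. The following is a minimal sketch; the session name analysis is arbitrary, and the commands are skipped if tmux is not available.

```shell
# Attach-or-create pattern; "analysis" is an arbitrary session name
session="analysis"
if command -v tmux >/dev/null 2>&1; then
    # has-session exits non-zero if the session does not exist yet
    if ! tmux has-session -t "$session" 2>/dev/null; then
        tmux new-session -d -s "$session"     # -d: create it detached
    fi
    tmux ls                                   # list available sessions
    # tmux attach -t "$session"               # run interactively to attach
fi
```

When working interactively, tmux new-session -A -s analysis achieves the same in one step: it attaches to the session if it exists and creates it otherwise.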

Vim/Nvim overview

Vim is a widely used, extremely powerful and versatile text editor for coding that is available by default on most Linux, Unix and macOS systems, and can also be installed on Windows. The newer version is called Neovim or Nvim. The main advantages of Nvim compared to Vim are its better performance and its built-in terminal emulator, which facilitates the communication between Nvim and interactive programming environments, such as command-lines, octave, R, etc. Since Vim and Nvim are managed independently, one can easily install and use them in parallel on the same system without interfering with each other. The usage of Nvim is almost identical to Vim. Emacs is a powerful alternative to Nvim.

Neovim Example with Autocompletion

Nvim introduction

The following opens a file (here myfile) with nvim (or vim). If nvim is not found then one might need to load it with module load neovim first. A custom nvim/tmux environment with extensions can be installed by HPCC users with the Install_Nvim-R_Tmux command. For details about this install script, see the corresponding tmux section above.

Open file with Nvim

nvim myfile.txt # for neovim (or 'vim myfile.txt' for vim)

Tip: to always load Nvim with the standard vim command, one can add alias vim=nvim to ~/.bashrc.

Three main modes

Within Vim/Nvim, there are three main modes: normal, insert and command mode. The most important commands for navigating between the three modes are:

i: The i key switches from the normal mode to the insert mode. The latter is used for typing.
Esc: The Esc key switches from the insert mode back to the normal mode.
:: The : key starts the command mode at the bottom of the screen.

Most important modifier keys

The arrow keys can be used to move the cursor in the text. Using the Fn Up/Down keys allows paging through the text more quickly (more on this below). In the following command overview, all commands starting with : need to be typed in the command mode. All other commands are typed in the normal mode after pressing the Esc key.

:w: save changes to file. If you are in editing mode you have to hit Esc first.
:q: quit file that has not been changed
:wq: save and quit file
:q!: quit file without saving any changes

Mouse support

When enabled, one can position the cursor anywhere with the mouse as well as resize Nvim split windows, and switch the scope from one window split to another.

:set mouse=n # enables mouse support, also try the a option
:set mouse-=n # disables mouse support

To enable mouse support by default, add set mouse=n to Nvim’s config file located in a user’s home under ~/.config/nvim/init.vim. The corresponding config file for the older Vim version is ~/.vimrc.
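
Settings like this can also be added from the command line without opening an editor. The sketch below defines a small helper that appends a line to a config file only if the identical line is not already present, so running it repeatedly does not create duplicates. The helper name add_line_once is made up for illustration, and the example calls are commented out so that nothing is changed unintentionally.

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists
# (hypothetical helper name)
add_line_once() {
    mkdir -p "$(dirname "$1")"      # create the config directory if needed
    touch "$1"                      # create the file if it does not exist
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example calls (uncomment to change your own configuration):
# add_line_once "$HOME/.config/nvim/init.vim" "set mouse=n"
# add_line_once "$HOME/.tmux.conf" "set -g mouse on"
```

The grep options -x and -F make the check match whole lines literally, so a similar but different setting in the file does not count as a match.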

Moving around

arrow_keys: move cursor in the text
Fn Up/Down: faster scrolling via paging.
$ or 0: jump to end or beginning of line
G or gg: jump to end of document and back to beginning
w or b: move forward and backward by word
) or (: move forward and backward by sentence

Important keybindings

:split or :vsplit: splits viewport (similar to pane split in tmux)
gz: maximizes size of viewport in normal mode (similar to Tmux’s Ctrl-a z zoom utility)
Ctrl-w w: jumps cursor to other viewport and back
Ctrl-w r: swaps viewports
Ctrl-w =: resizes splits to equal size
:resize <+5 or -5>: resizes height by specified value
:vertical resize <+5 or -5>: resizes width by specified value
Ctrl-w H or Ctrl-w K: toggles between horizontal/vertical splits
:vsplit term://bash or :terminal: opens terminal in split mode or in a separate window, respectively.
Ctrl-s and Ctrl-q: freezes/unfreezes vim (some systems)

Powerful features of command mode

For example, search and replace with regular expression support. A detailed overview for using regular expressions in vim is here.

/ or ?: search in text forward and backward
:%s/search_pattern/replace_pattern/cg: replacement syntax

Set command

The set command typed in the command mode provides access to a large number of additional functions. Only a small number of examples is given here. For a more complete listing type :set all or consult the vim help with :help.

:set wrap or :set nowrap: toggle for turning line wrapping on/off
:set number or :set nonumber: toggle for turning line numbers on/off
:set syntax=bash: toggle syntax highlighting for different languages (e.g. python, perl, bash, etc) or turn off with set syntax=off

Visual mode

Initialized from normal mode with v, V or Ctrl + v.
Delete and copy selected text with d and y, respectively. For paste use p from normal mode. The copied (yanked) text is stored in a separate vim clipboard.

Copy and delete lines

yy: copies line where cursor is or those that are selected via visual mode. Paste works with p as above.
dd: deletes line where cursor is or those that are selected via visual mode.

Indentation guides

Vertical indentation lines (guides) are useful for tracking context in code. To enable indentation lines in nvim, one can use the indent-blankline.nvim plugin. Installation and configuration instructions for this plugin are here.

Indentation Guides with `indent-blankline.nvim` Plugin

Help

Vim has a comprehensive built-in help system. To access and navigate it, here are some important commands. For a more detailed overview, visit this Built-in Vim Help page.

:help: opens vim help system (:q closes it)
Ctrl-] or Ctrl-t: use in help to jump to tagged topic and back
:help helphelp: opens help as a file
:help quickhelp or :help index: short help overview

File browser built into vim: `NERDtree`

NERDtree provides file browser functionality for Vim. To enable it, the NERDtree plugin needs to be installed. It is included in the account configuration with Install_Nvim-R_Tmux mentioned above. To use NERDtree, open a file with vim/nvim and then type in normal mode zz. The same command closes NERDtree. Note the default for opening NERDtree is :NERDtree which has been remapped here to zz for quicker access. The basic NERDtree usage is explained here.

NERDtree in action

Useful resources for learning vim/nvim

Nvim for R users with `nvim-R` plugin

Basics

The Nvim-R plugin provides a powerful command-line working environment for R where users can send code from an R/Rmd script opened in Nvim to the R console. Essentially, this provides an RStudio-like working environment within a terminal, which is often more flexible when working on remote systems than a GUI solution. It can also be combined with tmux to support ‘persistent’ R sessions that can be de- and re-attached (see tmux section above).

Nvim-R IDE for R

Quick configuration in user accounts

The following steps 1-3 can be skipped if Nvim, Tmux and nvimR are already configured on a user’s system or account. One can also follow the detailed instructions for installing Nvim-R-Tmux from scratch.

Log in to your user account on HPCC and execute Install_Nvim-R_Tmux (old: install_nvimRtmux). Additional details on this install are given in the tmux section above. Alternatively, one can use the step-by-step install here.
To enable the nvim-R-tmux environment, log out and in again.
Follow usage instructions of next section.

Basic usage of Nvim-R-Tmux

The official and much more detailed user manual for Nvim-R is available here. The following gives a short introduction into the basic usage of Nvim-R-Tmux:

1. Start tmux session (optional)

Note, running Nvim from within a tmux session is optional. Skip this step if tmux functionality is not required (e.g. re-attaching to sessions on remote systems).

tmux # starts a new tmux session
tmux a # attaches to an existing session

2. Open nvim-connected R session

Open a *.R or *.Rmd file with nvim and initialize a connected R session with \rf. This command can be remapped to other key combinations, e.g. uncommenting lines 10-12 in .config/nvim/init.vim will remap it to the F2 key. Note, the resulting split window between Nvim and R behaves like a split viewport in nvim, meaning the usage of Ctrl-w w followed by i and Esc is important for navigation (for details see vim usage above).

nvim myscript.R # or *.Rmd file

3. Send R code from nvim to the R pane

Single lines of code can be sent from nvim to the R console by pressing the space bar. To send several lines at once, one can select them in nvim’s visual mode and then press the space bar. Please note, the default command for sending code lines in the nvim-r-plugin is \l. This key binding has been remapped in the provided .config/nvim/init.vim file to the space bar. Most other key bindings (shortcuts) still start with the \ as LocalLeader, e.g. \rh opens the help for a function/object where the cursor is located in nvim. More details on this are given below.

Important keybindings for tmux

See corresponding tmux section above.

`Nvim-R`-like solutions for Bash, Python and other languages

Basics

For languages other than R one can use the vimcmdline plugin for nvim (or vim). Supported languages include Bash, Python, Golang, Haskell, JavaScript, Julia, Jupyter, Lisp, Macaulay2, Matlab, Prolog, Ruby, and Sage. The nvim terminal also colorizes the output, as in the screenshot below, where different colors are used for general output, positive and negative numbers, and the prompt line.

vimcmdline

Install

To install it, one needs to copy the directories ftplugin, plugin and syntax and their files from the vimcmdline repository to ~/.config/nvim/. For user accounts of UCR’s HPCC, the above install script Install_Nvim-R_Tmux (old: install_nvimRtmux) includes the install of vimcmdline (since 09-Jun-18).

Usage

The usage of vimcmdline is very similar to nvim-R. To start a connected terminal session, one opens with nvim a code file with the extension of a given language (e.g. *.sh for Bash or *.py for Python), while the corresponding interactive interpreter session is initiated by pressing the key sequence \s (corresponds to \rf under nvim-R). Subsequently, code lines can be sent with the space bar. More details are available here.

Manuals: Parallel Evaluations in R

Overview

R provides a variety of packages for parallel computations. One of the most comprehensive parallel computing environments for R is batchtools (formerly BatchJobs). It supports both multi-core and multi-node computations with and without schedulers. By making use of cluster template files, most schedulers and queueing systems are also supported (e.g. Torque, Sun Grid Engine, Slurm).

R code of this section

To simplify the evaluation of the R code of this page, the corresponding text version is available for download from here.

Parallelization with batchtools

The following introduces the usage of batchtools for a computer cluster using SLURM as scheduler (workload manager).

Set up working directory for SLURM

First log in to your cluster account, open R and execute the following lines. This will create a test directory (here mytestdir), redirect R into this directory and then download the required files:

dir.create("mytestdir")
setwd("mytestdir")
download.file("https://bit.ly/3Oh9dRO", "slurm.tmpl")
download.file("https://bit.ly/3KPBwou", ".batchtools.conf.R")

Load package and define some custom function

This is the test function (here toy example) that will be run on the cluster for demonstration purposes. It subsets the iris data frame by rows, and appends the host name and R version of each node where the function was executed. The R version to be used on each node can be specified in the slurm.tmpl file (under module load).

library('RenvModule')
module('load','slurm') # Loads slurm among other modules
library(batchtools)
myFct <- function(x) {
    # Subset iris by rows and append host name and R version of the executing node
    result <- cbind(iris[x, 1:4],
                    Node=system("hostname", intern=TRUE),
                    Rversion=paste(R.Version()[6:7], collapse="."))
    result
}

Submit jobs from R to cluster

The following creates a batchtools registry, defines the number of jobs and resource requests, and then submits the jobs to the cluster via SLURM.

reg <- makeRegistry(file.dir="myregdir", conf.file=".batchtools.conf.R")
Njobs <- 1:4 # Define number of jobs (here 4)
ids <- batchMap(fun=myFct, x=Njobs)
done <- submitJobs(ids, reg=reg, resources=list(partition="short", walltime=60, ntasks=1, ncpus=1, memory=1024))
waitForJobs() # Wait until jobs are completed

Summarize job status

After the jobs are completed, one can inspect their status as follows.

getStatus() # Summarize job status
showLog(Njobs[1])
# killJobs(Njobs) # Possible from within R or outside with scancel
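
The comment above mentions cancelling jobs outside of R with scancel. A minimal shell sketch of that route follows; the function name cancel_my_jobs is hypothetical. Note that it cancels all of your jobs in the given partition, not only those submitted by batchtools, so treat it as an illustration rather than a recommended workflow.

```shell
# Hypothetical helper: cancel all of your own jobs in one partition.
# squeue -h drops the header line and -o %i prints only the job ID.
cancel_my_jobs() {
  local id
  for id in $(squeue -h -u "$USER" -p "$1" -o %i); do
    echo "Cancelling job $id"
    scancel "$id"
  done
}

# Usage (from a cluster shell, not from R):
# cancel_my_jobs short
```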

Access/assemble results

The results are stored as .rds files in the registry directory (here myregdir). One can access them manually via readRDS or use various convenience utilities provided by the batchtools package.

readRDS("myregdir/results/1.rds") # reads from rds file first result chunk
loadResult(1)
lapply(Njobs, loadResult)
reduceResults(rbind) # Assemble result chunks in single data.frame
do.call("rbind", lapply(Njobs, loadResult))

Remove registry directory from file system

By default, existing registries will not be overwritten. If required, one can explicitly clean and delete them with the following functions.

clearRegistry() # Clear registry in R session
removeRegistry(wait=0, reg=reg) # Delete registry directory
# unlink("myregdir", recursive=TRUE) # Same as previous line

Load registry into R

Loading a registry can be useful when accessing the results at a later stage or after moving them to a local system.

from_file <- loadRegistry("myregdir", conf.file=".batchtools.conf.R")
reduceResults(rbind)

Manuals: SSH Keys

The links below lead to detailed instructions. A shorter but more comprehensive summary for all major OSs is available here.

Manuals: Visualization

Compute Node

We support running graphical programs on the cluster using VNC. For more information refer to Desktop Environments.

GPU Workstation

If a remote compute node does not fit your needs then we also have a GPU workstation specifically designed for rendering high resolution 3D graphics.

Hardware

Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
DDR4 256GB @ 2400 MHz
NVIDIA Corporation GM204GL [Quadro M5000]
1TB RAID 1 HDD

Software

The GPU workstation is uniquely configured to be an extension of the HPCC cluster. Thus, all software available to the cluster is also available on the GPU workstation through Environment Modules.

Access

The GPU workstation is currently located in the Genomics building, room 1208. Please check ahead of time that the machine is available by emailing support@hpcc.ucr.edu. Once you have access to the GPU workstation, log in with your cluster credentials. If your username does not appear in the list, you may need to click Not listed? at the bottom of the screen so that you can type in your username.

Usage

There are 2 ways to use the GPU workstation:

Local - Run processes directly on the GPU workstation hardware
Remote - Run processes remotely on the GPU cluster hardware

Local

Local usage is very simple. Open a terminal and use the Environment Modules to load the desired software, then run your software from the terminal. For example:

module load amira
Amira

Remote

Open a terminal and submit a job to reserve time on the remote GPU node. Once your job has started, connect to the remote GPU node via ssh and forward the graphics back to the GPU workstation. For example:

Submit a job for March 28th, 2018 at 9:30am for a duration of 24 hours, 4 cpus, 100GB memory:

sbatch --begin=2018-03-28T09:30:00 --time=24:00:00 -p gpu --gres=gpu:1 --mem=100g --cpus-per-task=4 --wrap='echo ${CUDA_VISIBLE_DEVICES} > ~/.CUDA_VISIBLE_DEVICES; sleep infinity'

Read about GPU jobs for more information regarding the above.
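
The --begin value above is written as YYYY-MM-DDTHH:MM:SS. The following is a minimal sketch for producing that form from a more readable date, assuming GNU date is available (as on most Linux systems). The function name begin_stamp is hypothetical and not part of Slurm.

```shell
# Hypothetical helper: turn a human-readable date into the
# YYYY-MM-DDTHH:MM:SS form used for --begin (requires GNU date).
begin_stamp() {
  date -d "$1" '+%Y-%m-%dT%H:%M:%S'
}

# Print the submission command instead of running it, so it can be reviewed first.
echo sbatch --begin="$(begin_stamp '2018-03-28 09:30')" --time=24:00:00 -p gpu --gres=gpu:1
```

Printing the command first makes it easy to check the computed start time before actually reserving the node.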

Run the VirtualGL client in order to receive 3D graphics from the remote GPU node:
```
vglclient &
```
Wait for the job to start, and then check where your job is running:
```
GPU_NODE=$(squeue -h -p gpu -u $USER -o '%N'); echo $GPU_NODE
```
The above command should result in a GPU node name, which you then need to SSH directly into with the following:
```
ssh -XY $GPU_NODE
```

Once you have SSH’ed into the remote GPU node, set up the environment and run your software:

export NO_AT_BRIDGE=1
module load amira
vglrun -display :$(head -1 ~/.CUDA_VISIBLE_DEVICES) Amira
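
The "wait for the job to start" step above can be scripted by polling squeue until a node name appears. This is a sketch under stated assumptions: the function name wait_for_gpu_node and the 30 second default interval are hypothetical, and the -t RUNNING filter is an addition to the squeue call shown above so that pending jobs (which have no node yet) are ignored.

```shell
# Hypothetical helper: poll until the query prints a node name, then echo it.
# $1 = query command (default: the squeue call from above, limited to running jobs)
# $2 = seconds between polls (default: 30)
wait_for_gpu_node() {
  local query interval node
  query=${1:-"squeue -h -p gpu -u $USER -t RUNNING -o %N"}
  interval=${2:-30}
  while :; do
    node=$($query 2>/dev/null | head -n 1)
    [ -n "$node" ] && break
    sleep "$interval"
  done
  printf '%s\n' "$node"
}

# Usage on the GPU workstation (blocks until the job is running):
# GPU_NODE=$(wait_for_gpu_node) && ssh -XY "$GPU_NODE"
```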

Manuals: Singularity Jobs

What is Singularity

In short, Singularity is a program that allows a user to run code or commands within a customized environment. We will refer to this customized environment as a container. This type of container system is common, the most popular one being Docker. Since Docker requires root access and HPC users are not typically granted these permissions, we use Singularity instead. Docker containers can be used via Singularity, with varying compatibility.

Singularity is forking into 2 branches:

Singularity-CE - Community Edition from Sylabs.io
Apptainer - Original Singularity open source project

We will be using Apptainer when it is ready for production use. In the meantime, singularity-ce is available on the cluster.

Limitations

Currently we do not support submitting Slurm jobs from within a container. If you load the container centos/7.9 and try to submit a job from within it, the submission will fail. Please contact support in order to work around this issue.

Additionally, building Singularity containers on the cluster is not possible because the build steps require elevated permissions. If custom containers are required, you will need to build them on a machine that you have root/sudo access on (such as your local machine) or use a Remote Builder.

How to use Singularity

You can use Singularity by running module load singularity. You can run singularity in an interactive mode by calling a shell, or you can run singularity in a non-interactive mode and just pass it a script/program to run. These 2 modes are very similar to job submission on the cluster; srun is used for interactive, while sbatch is used for non-interactive.

Pulling Container Images

The first step in using Singularity is to get a container to run inside of. Containers can be custom built, pulled from Docker Hub or from SyLabs Container Library.

For example, if you wanted to run your program within an Ubuntu environment, you could use one of the following commands to pull the Ubuntu 22.04 image:

# From Singularity Library:
singularity pull library://library/default/ubuntu:22.04
# From Docker Hub:
singularity pull docker://ubuntu:22.04
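
Both pull commands above write an image file named after the image and its tag (ubuntu_22.04.sif, as used in the later sections). The sketch below predicts that file name and skips the download when the file already exists. The function names sif_name and pull_once are hypothetical, and the "latest" fallback for untagged images is an assumption about the default naming.

```shell
# Hypothetical helper: predict the file name that "singularity pull" creates.
sif_name() {
  local ref
  ref=${1#*://}      # drop the scheme (library:// or docker://)
  ref=${ref##*/}     # keep the last path component, e.g. ubuntu:22.04
  case $ref in
    *:*) ;;                   # a tag was given
    *)   ref="$ref:latest" ;; # assumption: an untagged pull resolves to "latest"
  esac
  printf '%s_%s.sif\n' "${ref%%:*}" "${ref#*:}"
}

# Hypothetical helper: pull only if the image file is not already present.
pull_once() {
  local image
  image=$(sif_name "$1")
  if [ -f "$image" ]; then
    echo "Reusing $image"
  else
    singularity pull "$1"
  fi
}

# Usage (after "module load singularity"):
# pull_once docker://ubuntu:22.04
```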

Note that the environment within these containers will be limited; mainly, you lose the ability to use the module system. This is expected, as the environment (and the operating system) within the container will be different from the one running on our nodes. Even if you are able to get the modules mounted within the container, compatibility cannot be guaranteed, as the container might contain different library versions and packages than those the modules were compiled with.

NOTE: If you get an error similar to “unexpected HTTP status: 401”, make sure your project on the Container Builder website is set to “Public”.

HPCC Provided Images

In an attempt to preserve some legacy software, we created a CentOS 7 image that integrates with the old CentOS 7 modules. Access to the CentOS 7 container can be granted by running module load centos/7.9. This will set the CENTOS7_SING environment variable, which is the location of the CentOS 7 container image. Usage examples of the CentOS 7 image are in the below sections.

Building Container Images

In order to build a custom image, you must use a machine you have sudo access on or use a Remote Builder.

Local/Sudo Machine

Installing Singularity is outside of the scope for this tutorial. Please see the Installing SingularityCE steps.

Once Singularity is installed, you must create a definition (def) file. More details on creating a definition file can be found in the Singularity documentation under "The Definition File", but a simple definition file for a Debian container that installs Python3 is the following:

BootStrap: docker
From: debian:12
%post
apt-get update -y
apt-get install -y python3

If the above file was named “debian.def”, then an image could be built using singularity build debian.sif debian.def. This will create an image called debian.sif that can be run as described in the sections below.

Remote Builder

After signing up for the remote builder, log in using the steps from singularity remote login.

After logging in, you must create a definition file. We can use the same “debian.def” file from the “Local/Sudo Machine” section. With the definition file, build the container image using singularity build --remote debian.sif debian.def. After the image has been built and downloaded, you can run it as described in the sections below.

Interactive Singularity

When running singularity you need to provide the path to a singularity image file. For example, this would be the most basic way to get a shell within your container:

module load singularity
singularity pull library://library/default/ubuntu:22.04
singularity shell ubuntu_22.04.sif
cat /etc/os-release # Inside Container
> PRETTY_NAME="Ubuntu 22.04.4 LTS"
> NAME="Ubuntu"
> VERSION_ID="22.04"

To run the CentOS 7 container:

module load centos
singularity shell $CENTOS7_SING

Additionally, there is a special shortcut for the centos module that allows us to run the above more simply, as:

module load centos
centos.sh

While running containers on a head node is technically possible, compute resources are still limited. You can use the following commands to run a job on a compute node:

Ubuntu:

module load singularity
singularity pull library://library/default/ubuntu:22.04
srun -p epyc --mem=1g -c 4 --time=2:00:00 --pty singularity shell ubuntu_22.04.sif
hostname # Inside container
> r21
cat /etc/os-release # Inside container
> PRETTY_NAME="Ubuntu 22.04.4 LTS"
> NAME="Ubuntu"
> VERSION_ID="22.04"

CentOS 7:

module load centos
srun -p epyc --mem=1g -c 4 --time=2:00:00 --pty centos.sh
cat /etc/os-release # Inside container
> PRETTY_NAME="CentOS Linux 7 (Core)"
> NAME="CentOS Linux"
> VERSION_ID="7"

Non-Interactive Singularity

When running singularity non-interactively, the same basic rules apply. We need a path to our singularity image file as well as a command to run.

Basics

For example, here is the basic syntax:

module load singularity
singularity exec /path/to/singularity/image someCommand

Using ubuntu_22.04.sif as an example, you can execute an arbitrary command like so:

module load singularity
singularity pull library://library/default/ubuntu:22.04
singularity exec ubuntu_22.04.sif cat /etc/os-release

And using our CentOS 7 image:

module load centos
singularity exec $CENTOS7_SING cat /etc/redhat-release

Shortcuts

Using the centos.sh shortcut that we provide for CentOS 7:

module load centos
centos.sh "cat /etc/redhat-release"

Here is a more complex example with modules:

module load centos
centos.sh "module load samtools; samtools --help"

Jobs

Here is an example job submitted using an Ubuntu container:

module load singularity
singularity pull library://library/default/ubuntu:22.04
sbatch -p epyc --wrap="singularity exec ubuntu_22.04.sif cat /etc/os-release; whoami; date"

Here is an example submitted as a job using the CentOS 7 container:

module load centos
sbatch -p epyc --wrap="centos.sh 'module load samtools; samtools --help'"
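
In the Ubuntu --wrap example above, only cat runs inside the container; whoami and date follow a semicolon and therefore run on the compute node itself. The sketch below writes the same work as a job script and shows both variants. The file name ubuntu_job.sh, the resource values (taken from the interactive srun example) and the login shell shebang used to make the module command available are assumptions.

```shell
# Write a hypothetical job script; the file name is an assumption.
cat > ubuntu_job.sh <<'EOF'
#!/bin/bash -l
#SBATCH -p epyc
#SBATCH --time=2:00:00
#SBATCH --mem=1g
#SBATCH -c 4

module load singularity

# Runs inside the container
singularity exec ubuntu_22.04.sif cat /etc/os-release

# Also inside the container: pass several commands to one shell
singularity exec ubuntu_22.04.sif sh -c 'whoami; date'

# Runs on the compute node itself, outside the container
hostname
EOF

echo "Submit with: sbatch ubuntu_job.sh"
```

Wrapping several commands in sh -c keeps them all inside the container, which matters as soon as the commands depend on software that only exists in the image.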

Variables

Here is an example with passing environment variables:

export SINGULARITYENV_SOMETHING='stuff'
centos.sh 'echo $SOMETHING'

Notice: Just add the SINGULARITYENV_ prefix to pass any variables to the centos container.
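
When several variables need to be passed, the prefix can be applied in a loop. This is a minimal sketch; the function name pass_to_container and the THREADS variable are hypothetical and only serve as illustration.

```shell
# Hypothetical helper: re-export each named variable with the SINGULARITYENV_ prefix.
# Only pass trusted variable names, since the names are expanded through eval.
pass_to_container() {
  local name
  for name in "$@"; do
    eval "export SINGULARITYENV_${name}=\"\${${name}}\""
  done
}

SOMETHING='stuff'
THREADS=4
pass_to_container SOMETHING THREADS
echo "$SINGULARITYENV_SOMETHING $SINGULARITYENV_THREADS"   # prints: stuff 4
```

After this, centos.sh 'echo $SOMETHING $THREADS' would see both values inside the container, following the same mechanism as the single-variable example above.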

Enable GPUs

First review how to submit a GPU job from here. Then request an interactive GPU job, or embed one of the following within your submission script.

In order to enable GPUs within your container you need to add the --nv option to the singularity command:

module load centos
singularity exec --nv $CENTOS7_SING cat /etc/redhat-release

However, when using the centos shortcut it is easier to just set the following environment variable then run centos.sh as usual:

module load centos
export SINGULARITY_NV=1
centos.sh

Singularity Usecases

In addition to using Singularity to run operating system containers (Debian, Ubuntu, CentOS, etc), it can also be used to run certain software on the cluster.

The most prominent example of this is AlphaFold. If you are interested in using AlphaFold on the cluster, see the AlphaFold Usage on HPCC page of our documentation. In addition to AlphaFold, we also offer freefem and prymetime through Singularity, available by using module load freefem and module load prymetime respectively, and runnable using singularity shell $FREEFEM_SING and singularity shell $PRYMETIME_SING.

Manuals: Data Transfer

These pages describe how to use common data transfer software on the UCR HPCC cluster.
