UCR High-Performance Computing Center

with applications in parallel computing

Isaac Quintanilla Salinas

5/20/2022

Introduction

You can follow this presentation at https://www.inqs.info/files/GSS/HPCC/hpcc_5-20.html

Terminology

  • job: A task a computer must complete

  • bash script: a text file of commands that tells the computer what to do

My Cluster Usage

  • Computationally Intensive

  • Use R and C++ (via Rcpp)

HPCC Cluster

How do I view the cluster?

  • Virtual Desktop Computer
    • Set number of cores (max 32 CPUs)
    • Set RAM (max 128 GB)
    • Set storage (20 GB)

How to access the cluster?

RStudio Server

  • Edit any text document

  • Submit jobs

  • Upload/Download Data or Documents

  • Do not run analyses in RStudio Server; submit them as cluster jobs instead

Linux Commands

  • sbatch: submit a job to the cluster

  • scancel: cancel jobs

  • squeue: view jobs status

  • slurm_limits: view what is available to you

How I use the cluster

  • Code on my computer at a smaller scale

    • Parallelize using my computer’s cores

  • Scale it up for the cluster

  • Have an R Script do everything for me

  • Save all results in an RData file

  • Use the try() function to catch errors (see the sketch below)

  • Use RProjects
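A minimal sketch of this workflow; run_one() is a hypothetical stand-in for one unit of the analysis:

library(parallel)

run_one <- function(i) { # Hypothetical function: one independent replicate
  try({ # try() keeps a single failure from killing the whole job
    x <- rnorm(100)
    mean(x)
  })
}

results <- mclapply(1:1000, run_one, mc.cores = 8) # Parallelize across cores
failed <- sapply(results, inherits, "try-error") # Flag the replicates that errored
save(results, file = "Results.RData", version = 2) # Save everything for later inspection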

Submitting Jobs

  • Use a bash script to specify the parameters for the cluster

  • Add the following line at the end of a bash script:

Rscript NAME_OF_FILE.R

  • Submit the job with the line below:

sbatch NAME_OF_BASH.sh

Anatomy of Bash Script

#!/bin/bash

#SBATCH --nodes=1 # Setting the number of nodes, usually 1 node is used
#SBATCH --ntasks=1 # Setting the number of tasks, usually 1 task is used
#SBATCH --cpus-per-task=8 # Setting the number of cpus per task
#SBATCH --mem-per-cpu=1G # How much RAM per CPU, usually 1 GB
#SBATCH --time=0-01:00:00    # Time to run task, changes based on predicted time of task
#SBATCH --output=my.stdout # File where the job's standard output is written
#SBATCH --mail-user=NETID@ucr.edu # Where to email information about job
#SBATCH --mail-type=ALL # Which job events trigger an email (ALL: begin, end, fail)
#SBATCH --job-name="Cluster Job 1" # Name of the job, can be anything
#SBATCH -p statsdept # statsdept is the only partition Department of Statistics students can use

Rscript Parallel_Job.R # The command that tells Linux to process your R Script

Anatomy of R Script

# Obtain System Date and Time
date_time <- format(Sys.time(),"%Y-%m-%d-%H-%M")

# Set Working Directory
setwd("~/rwork")

# Load libraries and functions
library(parallel)
source("Fxs.R")

# Pre-parallel analysis
# Parallel analysis
results <- mclapply(data, FUN, mc.cores = number_of_cores)
# Post-parallel analysis

# Save Results
file_name <- paste("Results_", date_time, ".RData", sep = "")
save(results, file = file_name, version = 2)

Parallel Processing in R

For more information:

https://ucrgradstat.github.io/stat_comp/hpcc/parallel_notebook.nb.html

How to parallelize your R Code

From the parallel R package:

  • mclapply()

    • Recommended for the cluster

    • Has a built-in try(), so one failed iteration does not stop the rest

    • Replace any lapply() with mclapply() and add the mc.cores = argument

  • parLapply()

    • Use if multiple nodes are involved

    • Use if on a Windows PC, since mclapply() relies on forking, which Windows lacks (see the sketch below)
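A minimal sketch of both functions; my_fun() is a hypothetical stand-in for one independent iteration:

library(parallel)

my_fun <- function(i) sqrt(i) # Hypothetical task: one independent iteration

## Forked workers (Linux/macOS, including the cluster)
res_fork <- mclapply(1:100, my_fun, mc.cores = 4)

## Socket cluster (works on Windows and across nodes)
cl <- makeCluster(4) # Start 4 worker processes
clusterExport(cl, "my_fun") # Workers start empty; export anything they need
res_sock <- parLapply(cl, 1:100, my_fun)
stopCluster(cl) # Always shut the workers down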

Where to parallelize?

  • Identify loops or *apply functions

    • Iterations must be independent of each other

  • Identify bottlenecks

    • Use benchmark R packages (example below)
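For example, the microbenchmark package (one option among several benchmarking packages) compares candidate implementations directly:

library(microbenchmark)

x <- matrix(rnorm(1e5), ncol = 100)
microbenchmark(
  loop    = {m <- numeric(100); for (j in 1:100) m[j] <- mean(x[, j])}, # Explicit loop
  apply   = apply(x, 2, mean), # *apply version
  builtin = colMeans(x), # Optimized built-in
  times   = 10 # Repetitions of each expression
)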

How to speed up your R code?

  • Vectorize your R code

  • Minimize loops

    • Use *apply functions

  • Use optimized functions

    • colMeans() and rowMeans()

  • Implement C++ via Rcpp (toy example below)

  • More Information: Advanced R (adv-r.hadley.nz)
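A toy sketch of the Rcpp route, assuming Rcpp and a C++ compiler are installed; cppFunction() compiles inline C++ and exposes it as an R function:

library(Rcpp)

cppFunction('
double sum_cpp(NumericVector x) {
  double total = 0; // Compiled loop: runs much faster than an R-level loop
  for (int i = 0; i < x.size(); i++) total += x[i];
  return total;
}')

sum_cpp(rnorm(1e6)) # Called like any other R function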

Simulation Study

Simulation Study

  • Show that Ordinary Least Squares provides consistent estimates

  • Model: Y = X^T β + ε

    • β = (β_0, β_1, β_2, β_3)^T = (5, 4, −5, −3)^T

    • X = (1, X_1, X_2, X_3)^T

    • ε ∼ N(0, 3^2), i.e., standard deviation 3

Simulation Parameters

  • Number of Data sets: 10000

  • Number of Observations: 200

  • (X_1, X_2, X_3)^T ∼ N((−2, 0, 2)^T, I_3)

  • I_3: 3 × 3 identity matrix

Parallelization of Simulation

R Script

## Date-Time ####
Date_Time <- format(Sys.time(),"%Y-%m-%d-%H-%M") # Used as a unique identifier

## Setting WD ####
setwd("~/rwork") # Setting working directory to rwork, where all the data is saved

## Loading R Packages ####
library(parallel)


## Functions ####

data_sim <- function(seed, nobs, beta, sigma, xmeans, xsigs){ # Simulates the data set
  set.seed(seed) # Sets a seed
  xrn <- cbind(rnorm(nobs, mean = xmeans[1], sd = xsigs[1,1]),
               rnorm(nobs, mean = xmeans[2], sd = xsigs[2,2]),
               rnorm(nobs, mean = xmeans[3], sd = xsigs[3,3])) # Simulates Predictors
  xped <- cbind(rep(1,nobs),xrn) # Creating Design Matrix
  y <- xped %*% beta + rnorm(nobs, 0, sigma) # Simulating Y
  df <- data.frame(x=xrn, y=y) # Creating Data Frame
  return(df)
}


parallel_lm <- function(data){ # Fits ordinary least squares to a data frame
  lm_res <- lm(y ~ x.1 + x.2 + x.3, data = data) # Find OLS Estimates
  return(list(coef=coef(lm_res), lm_results=lm_res))
}


## Parallel Parameters ####
ncores <- 8 # Number of cpus to be used

## Simulation Parameters ####
N <- 10000 # Number of Data sets
nobs <- 200 # Number of observations
beta <- c(5, 4, -5, -3) # beta parameters
xmeans <- c(-2, 0, 2) # Means for predictors
xsigs <- diag(rep(1, 3)) # Standard deviations for predictors (diagonal entries)
sig <- 3 # Standard deviation for error term
  
## Simulating Data ####

standard_data <- lapply(c(1:N), data_sim, # Using data_sim function to simulate N data sets
                        nobs = nobs, beta = beta, sigma = sig, # Model Parameters
                        xmeans = xmeans, xsigs = xsigs) # Predictor Parameters for simulation


## Obtaining Estimates ####
start <- Sys.time() # Used for timing process
standard_results <- lapply(standard_data, parallel_lm) # Using 1 core to process the data
print("Standard lapply")
Sys.time() - start # Time it took

start <- Sys.time() # Used for timing process
parallel_results <- mclapply(standard_data, parallel_lm, # Using multiple cores to process the data
                             mc.cores = ncores) # Setting the number of cores to use
print("mclapply")
Sys.time() - start # Time it took

## Extracting Betas ####

standard_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values 
parallel_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values 
for (i in 1:N){
  standard_beta[i, ] <- standard_results[[i]]$coef # Extracting coefficients from lapply
}
for (i in 1:N){
  parallel_beta[i, ] <- parallel_results[[i]]$coef # Extracting coefficients from mclapply
}

## Average results
print("From Standard lapply")
colMeans(standard_beta)
print("From mclapply")
colMeans(parallel_beta)


## Saving Results ####
standard_save <- list(lm_res = standard_results, betas = standard_beta) # Creating a list of results from lapply
parallel_save <- list(lm_res = parallel_results, betas = parallel_beta) # Creating a list of results from mclapply
params <- list(N = N, # Creating a list of simulation parameters
                nobs = nobs,
                beta = beta,
                xmeans = xmeans,
                xsigs = xsigs,
                sig = sig) 

results <- list(standard = standard_save, parallel = parallel_save, data = standard_data, # Combining lists
                parameters = params, Date_Time = Date_Time)

save_dir <- paste("Results_", Date_Time, ".RData", sep="") # Creating file name, contains date-time
save(results, file = save_dir, version = 2) # Saving RData file, recommend using version 2
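Once the job finishes, the saved file can be loaded back into a fresh R session; the file name below is hypothetical, since it depends on the date-time stamp:

load("Results_2022-05-20-10-00.RData") # Restores the object named results
str(results$parameters) # Inspect the simulation parameters
colMeans(results$parallel$betas) # Averages should be near (5, 4, -5, -3)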

Executing Simulation Study

download.file("https://ucrgradstat.github.io/presentations/Parallel_Job.R", "Parallel_Job.R")
download.file("https://ucrgradstat.github.io/presentations/Cluster_Script.sh", "Cluster_Script.sh")

  • Bash Script

    • Line 9: Change to your email

  • Submit Job

    • Console Pane

    • Terminal Tab

    • Type: sbatch Cluster_Script.sh

Advanced Topics

Linux Commands

Resources

Read the UCR HPCC Handbook for more information.

https://hpcc.ucr.edu/

Terminal

Use any terminal app and connect with:

ssh netid@cluster.hpcc.ucr.edu

Directory

  • When logging on, you will be in /rhome/netid

  • In Linux, ~ = /rhome/netid (your home directory)

  • ls: List all visible files

    • ls -a: List all files

    • ls -l: Provides information on files

    • ls -t: List files by modification time, newest first

    • ls -R: List files recursively, including subdirectories

  • pwd: Print working directory

  • cd: Change working directory

    • cd ~: Return to Home directory

    • cd ..: Move one directory above

    • cd ../../: Move two directories above

  • mkdir: Creates a directory

  • rmdir: Deletes an empty directory

  • rm -r: Deletes a nonempty directory

File Manipulation

  • Neovim is a text editor that lets you manipulate documents

  • 3 Modes

    • Insert Mode

      • Change text

      • Enter with i

    • Normal Mode

      • Navigate the file

      • Enter with ESC

    • Command Mode

      • Execute commands from Normal Mode by typing :

      • :wq : Write and quit

      • :q : Quit; :q! : Quit without saving

  • Delete a file

rm FILE_NAME

  • Move a file

mv FILE_NAME DIR/

  • Copy a file to a directory

cp FILE_NAME DIR/

  • Copy a file to the current directory

    • Use a period to indicate the current directory

cp FILE_PATH/FILE_NAME .

  • Copy all files from one directory to another

cp -r DIR_1/* DIR_2/

scp

  • Similar to the cp command, scp lets you copy files between the cluster and your personal computer

  • Run scp from your personal computer; it will not work from the cluster

  • The scp command takes 2 arguments: the file's location and where to paste it

netid@cluster.hpcc.ucr.edu:~/FILE_PATH

# OR

netid@cluster.hpcc.ucr.edu:~/FILE_PATH/FILE_NAME
  • The remote location has 2 components

    • User and cluster address

    • File path/location

  • The two are separated by a :

Download 1 File

scp netid@cluster.hpcc.ucr.edu:~/FILE_PATH/FILE_NAME .

Downloading all files in a directory

scp -r netid@cluster.hpcc.ucr.edu:~/FILE_PATH/* .

Upload 1 File

scp FILE_NAME netid@cluster.hpcc.ucr.edu:~/FILE_PATH/

Upload all files in a directory

scp -r DIR/* netid@cluster.hpcc.ucr.edu:~/FILE_PATH/

rsync

  • rsync is similar to scp; however, it syncs directories instead of just copying and pasting

  • The options -a and -v stand for archive and verbose, respectively

    • archive syncs all files recursively while preserving their attributes

    • verbose prints information about the transfer

  • You need the trailing / on directory names; otherwise, you will get strange results

To sync to the cluster

rsync -av DIR/ netid@cluster.hpcc.ucr.edu:~/DIR/

To sync to local computer

rsync -av netid@cluster.hpcc.ucr.edu:~/DIR/ DIR/

ssh Keys

Thank You!