Overview

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the award-winning S system which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. It can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. It provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, ...).

R is designed as a true computer language with control-flow constructions for iteration and alternation, and it allows users to add additional functionality by defining new functions. For computationally intensive tasks, C, C++ and Fortran code can be linked and called at run time.

There is an Australian AARNet mirror of the main R web site.

More information: https://www.r-project.org/about.html

Usage

You can check which versions are installed on Gadi with a module query:

$ module avail R

We normally recommend using the latest version available, and always recommend specifying the version number in the module command:

$ module load R/4.3.1

For more details on using modules see our modules help guide at https://opus.nci.org.au/display/Help/Environment+Modules.
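After loading the module, it can help to confirm that the R on your PATH is the version you asked for. The following is a minimal bash sketch of one way to do that. The function name check_r_version is hypothetical, and the sketch assumes that the first line printed by `R --version` has the form `R version X.Y.Z ...`.

```shell
# Hypothetical helper: compare the version reported by `R --version`
# with the version requested via `module load R/<version>`.
check_r_version() {
    local expected="$1"
    local actual

    # Fail early if no R executable is visible (module probably not loaded).
    if ! command -v R >/dev/null 2>&1; then
        echo "R not found; did you run 'module load R/${expected}'?" >&2
        return 1
    fi

    # Assumption: the first line looks like
    #   R version 4.3.1 (2023-06-16) -- "..."
    # so the third field is the version number.
    actual=$(R --version | awk 'NR == 1 {print $3}')

    if [ "$actual" = "$expected" ]; then
        echo "R ${actual} is loaded"
    else
        echo "Expected R ${expected} but found R ${actual}" >&2
        return 1
    fi
}
```

A typical call would be `module load R/4.3.1 && check_r_version 4.3.1`.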

An example PBS job submission script named r_job.sh is provided below. It requests 1 CPU core, 2 GiB of memory, and 8 GiB of local disk on a Gadi compute node from the normal queue, for exclusive access for 30 minutes, charged against project a00. It also asks the system to enter the working directory once the job starts. Save this script in the working directory from which the analysis will be run.


#!/bin/bash

#PBS -P a00
#PBS -q normal
#PBS -l ncpus=1
#PBS -l mem=2GB
#PBS -l jobfs=8GB
#PBS -l walltime=00:30:00
#PBS -l wd

# Load module, always specify version number.
module load R/4.3.1

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`. Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained

# Run R application
R --vanilla < input.r > output

For more information about the R command's options, see https://cran.r-project.org/doc/manuals/r-release/R-intro.html

To submit the job, use the PBS command:

$ qsub r_job.sh

This starts R and executes the instructions in input.r. The output that you would expect to see on screen during interactive execution is written to the file output. Check the files r_job.sh.e**** and r_job.sh.o**** for any errors and to see the time consumed. Note the jobfs request in the script, which is needed because R uses TMPDIR.
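Checking these files by eye becomes tedious once several jobs have run. As a rough sketch, the bash function below lists lines that look like R errors or warnings. The name scan_r_output is hypothetical, and the pattern assumes R's default English messages, which begin with "Error", "Warning" or "Execution halted".

```shell
# Hypothetical helper: list lines that look like R errors or warnings in
# the given files, for example the redirected `output` file and the PBS
# error file of the job.
scan_r_output() {
    local status=0
    local file

    for file in "$@"; do
        if [ ! -f "$file" ]; then
            echo "skipping missing file: $file" >&2
            continue
        fi
        # -H prefixes each match with the file name, -n with the line number.
        if grep -H -n -E '^(Error|Warning|Execution halted)' "$file"; then
            status=1
        fi
    done

    # Returns 1 if any matching line was found, 0 otherwise.
    return "$status"
}
```

For example, `scan_r_output output r_job.sh.e*` prints any matching lines and returns a non-zero status if it finds one.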

R commands can also be run interactively. For details, see https://opus.nci.org.au/display/Help/0.+Welcome+to+Gadi#id-0.WelcometoGadi-InteractiveJobs.

This version of R has been built with the Intel MKL library for the dense linear algebra routines in BLAS and LAPACK. If your algorithm depends heavily on LAPACK routines, you may benefit from running in parallel. An example job script with 2 CPU cores is provided below. Note that if your application does not run in parallel, you must set the number of CPU cores to 1 and adjust the memory and jobfs requests according to the information available at https://opus.nci.org.au/display/Help/Queue+Structure to avoid wasting compute resources.

#!/bin/bash

#PBS -q normal
#PBS -l ncpus=2
#PBS -l mem=4GB
#PBS -l jobfs=16GB
#PBS -l walltime=00:15:00
#PBS -l wd

# Load module, always specify version number.
module load R/4.3.1

# Set number of OMP threads
export OMP_NUM_THREADS=$PBS_NCPUS

# Must include `#PBS -l storage=scratch/ab12+gdata/yz98` if the job
# needs access to `/scratch/ab12/` and `/g/data/yz98/`. Details on:
# https://opus.nci.org.au/display/Help/PBS+Directives+Explained

# Run R application
R --vanilla -f input.r > output

To see if it is worth using multiple CPU cores, run some timing tests with 1, 2, 4 and so on, up to no more than 16 CPU cores, and compare the walltime used. Your problems need to be fairly large to benefit from parallelism.
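One way to organise such timing tests is to submit the same job script several times with different resource requests. The bash sketch below is only an illustration: the function name submit_timing_jobs and the job names are hypothetical, and the scaling of 2 GB of memory and 8 GB of jobfs per core is an assumption taken from the two example scripts above. It also relies on qsub command-line options taking precedence over the #PBS directives inside the script.

```shell
# Hypothetical helper: print (or run) one qsub command per core count.
# Assumptions: memory scales at 2 GB per core and jobfs at 8 GB per core,
# matching the ratios in the example scripts above.
# By default the commands are only printed; set DRY_RUN=0 to submit them.
submit_timing_jobs() {
    local script="${1:-r_job.sh}"
    local n mem jobfs

    for n in 1 2 4 8 16; do
        mem="$((2 * n))GB"
        jobfs="$((8 * n))GB"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "qsub -N r_timing_${n} -l ncpus=${n} -l mem=${mem} -l jobfs=${jobfs} ${script}"
        else
            qsub -N "r_timing_${n}" -l ncpus="${n}" -l mem="${mem}" \
                 -l jobfs="${jobfs}" "$script"
        fi
    done
}
```

Running `submit_timing_jobs r_job.sh` prints the five commands for review, and `DRY_RUN=0 submit_timing_jobs r_job.sh` submits them. The job script should set OMP_NUM_THREADS from $PBS_NCPUS, as in the 2-core example, so that the thread count follows the requested cores.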

If you wish to add extra packages such as randomForest, you need to load the appropriate Intel modules. We recommend using the same Intel compiler version that was used to build R.

The list of modules that were loaded during the R build is in the /apps/R/<version>/README.nci file. For example, for R/4.3.1, the file is /apps/R/4.3.1/README.nci. There you can see that intel-compiler/2021.10.0 was used, so this is the version that needs to be loaded, as shown below:

# Unload modules
$ module unload R intel-compiler

# Load modules, always specify version number.
$ module load R/4.3.1
$ module load intel-compiler/2021.10.0

$ R
....
> install.packages("randomForest",repos="https://mirror.aarnet.edu.au/pub/CRAN/")
Warning in install.packages("randomForest") :
  'lib = "/apps/R/4.3.1/lib64/R/library"' is not writeable
Would you like to create a personal library
'~/R/x86_64-unknown-linux-gnu-library/4.3.1'
to install packages into?  (y/n) y
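The README.nci lookup described above can also be scripted. The bash sketch below is a guess at how that might look: the function name module_version_from_readme is hypothetical, and it assumes that the README lists modules as name/version strings such as intel-compiler/2021.10.0. If the file is laid out differently, the pattern will need adjusting.

```shell
# Hypothetical helper: print the first <module>/<version> string for a
# given module name found in a README.nci-style file.
# Assumption: modules appear in the file as name/version tokens,
# for example intel-compiler/2021.10.0.
module_version_from_readme() {
    local readme="$1"
    local name="$2"

    # -o prints only the matching part of each line; keep the first match.
    grep -o -E "${name}/[A-Za-z0-9._-]+" "$readme" | head -n 1
}
```

For example, `module_version_from_readme /apps/R/4.3.1/README.nci intel-compiler` would print the string to pass to `module load`, provided the assumption about the file layout holds.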

If you wish to install packages in a directory other than the default ~/R/x86_64-unknown-linux-gnu-library/4.3.1, set the environment variable R_LIBS to the new directory. In bash, you can set it with the following command:

$ export R_LIBS=/path/to/your/new/directory:$R_LIBS

This variable needs to be set every time you use R.
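Two details are easy to miss when setting R_LIBS by hand: R only uses library directories that already exist, and prepending to an empty R_LIBS leaves a trailing colon. The bash sketch below handles both. The function name prepend_r_libs is hypothetical, and any path passed to it is a placeholder for a directory of your choice.

```shell
# Hypothetical helper: put a directory at the front of R_LIBS.
prepend_r_libs() {
    local dir="$1"

    # Create the library directory if it does not exist yet.
    mkdir -p "$dir" || return 1

    case ":${R_LIBS:-}:" in
        *":${dir}:"*)
            # Already listed; leave R_LIBS unchanged.
            ;;
        *)
            # ${R_LIBS:+:$R_LIBS} expands to nothing when R_LIBS is empty,
            # which avoids leaving a trailing colon.
            export R_LIBS="${dir}${R_LIBS:+:$R_LIBS}"
            ;;
    esac
}
```

Defining the function in a shell startup file such as ~/.bashrc and calling `prepend_r_libs /path/to/your/new/directory` there is one way to have the variable set in every session.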

Note that some packages cannot be built with the Intel compilers. The problem usually occurs when a package uses complex variables. In such cases, you need to switch to the GNU compilers. This is done by editing the file ~/.R/Makevars in your home directory. Putting the following lines in this file:

CXX=g++
CXX11=g++
CXX14=g++
CXX17=g++
CC=gcc

will force R to use gcc/g++ instead of icc. Remember to comment out these lines (i.e. add a # symbol at the start of each line) after installing the problematic package.
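Switching the override on and off by hand is easy to forget, so it can be scripted. The bash sketch below shows one possible approach. The function names enable_gnu_compilers and disable_gnu_compilers and the MAKEVARS variable are hypothetical; the five lines written are exactly those listed above.

```shell
# Hypothetical helpers for switching the GNU compiler override on and off.
# MAKEVARS defaults to the per-user file described above.
MAKEVARS="${MAKEVARS:-$HOME/.R/Makevars}"

enable_gnu_compilers() {
    mkdir -p "$(dirname "$MAKEVARS")" || return 1
    # Append the override lines unless an active CC=gcc line already exists.
    if ! grep -q -x 'CC=gcc' "$MAKEVARS" 2>/dev/null; then
        printf '%s\n' 'CXX=g++' 'CXX11=g++' 'CXX14=g++' 'CXX17=g++' 'CC=gcc' \
            >> "$MAKEVARS"
    fi
}

disable_gnu_compilers() {
    # Prefix every active CC= or CXX*= line with '#', via a temporary file.
    sed -E 's/^(CC|CXX[0-9]*)=/#&/' "$MAKEVARS" > "$MAKEVARS.tmp" &&
        mv "$MAKEVARS.tmp" "$MAKEVARS"
}
```

Call `enable_gnu_compilers` before installing the problematic package and `disable_gnu_compilers` afterwards. Note that the disable step comments out every CC= and CXX*= line in the file, including any you added yourself.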

Installation Guides for Frequently Requested Packages

# Unload modules
$ module unload R intel-compiler intel-mkl glpk

# Load modules, always specify version number.
$ module load R/4.3.1
$ module load intel-compiler/2021.10.0
$ module load intel-mkl/2023.2.0
$ module load glpk/5.0

# Install Biobase, NMF and irlba packages.
$ R
> if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
> BiocManager::install("Biobase")
> install.packages("NMF",repos="http://cran.ms.unimelb.edu.au/")
> install.packages("irlba",repos="http://cran.ms.unimelb.edu.au/")

# Install igraph package
> install.packages("igraph",repos="http://cran.ms.unimelb.edu.au/")


# Install RcppTOML package
> q()
Save workspace image? [y/n/c]: n
$ cd /g/data/z00/mma900/build
$ wget https://cran.r-project.org/src/contrib/Archive/RcppTOML/RcppTOML_0.1.7.tar.gz
$ R CMD INSTALL RcppTOML_0.1.7.tar.gz


# Now you can install the Seurat package
$ R
> install.packages("Seurat",repos="http://cran.ms.unimelb.edu.au/")


# Test Seurat
> library(Seurat)
Loading required package: SeuratObject
Loading required package: sp

Attaching package: ‘SeuratObject’

The following object is masked from ‘package:base’:

    intersect

>


# All of the above packages, except Seurat, must be installed before
# installing the following packages.

# Install phangorn package
> install.packages("phangorn",repos="http://cran.ms.unimelb.edu.au/")

# Install SNPRelate package
> BiocManager::install("SNPRelate")

# Install muscle
$ cd /g/data/z00/mma900/build
$ wget https://github.com/rcedgar/muscle/archive/refs/tags/5.1.0.tar.gz
$ tar -zxvf 5.1.0.tar.gz
$ cd muscle-5.1.0/src
$ make

# Install phylip, which provides the dnaml program
$ cd /g/data/z00/mma900/build
$ wget https://phylipweb.github.io/phylip/download/phylip-3.697.tar.gz
$ tar -zxvf phylip-3.697.tar.gz
$ cd phylip-3.697/src
$ make -f Makefile.unx install

# Install snphylo
$ cd /g/data/z00/mma900/build
$ wget https://github.com/thlee/SNPhylo/archive/refs/tags/20180901.tar.gz
$ tar -zxvf 20180901.tar.gz
$ cd SNPhylo-20180901
$ chmod 755 setup.sh
$ export R_LIBS=/g/data/z00/mma900/build/SNPhylo-20180901/R_LIBS:$R_LIBS
$ ./setup.sh
Version: 20141127

START TO SET UP FOR SNPHYLO!!!

The detected path of R is /apps/R/4.3.1/bin/R. Is it correct? [Y/n] Y
The detected path of python is /bin/python. Is it correct? [Y/n] Y
muscle is not found. Is the program already installed? [y/N] y
Please enter the path of muscle program (ex: /home/foo/bin/muscle): /g/data/z00/mma900/build/muscle-5.1.0/src/Linux/muscle

dnaml is not found. Is the program already installed? [y/N] y
Please enter the path of dnaml program (ex: /home/foo/bin/dnaml): /g/data/z00/mma900/build/phylip-3.697/exe/dnaml

At least one R package (gdsfmt, SNPRelate, getopt or phangorn) to run this pipeline is not found. Are the packages already installed? [y/N] N
Do you want to install the packages by this script? [y/N] y
...
* DONE (getopt)
...
* DONE (phangorn)

The downloaded source packages are in
	‘/scratch/z00/mma900/tmp/RtmpjiNQkB/downloaded_packages’
Error: With R version 3.5 or greater, install Bioconductor packages using BiocManager; see https://bioconductor.org/install
Execution halted

SNPHYLO is successfully installed!!!

# Test snphylo
$ ./snphylo.sh

# Test phangorn
$ R
> library(ape)
> library(phangorn)
>
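
The setup.sh transcript above contains an R error about Bioconductor packages but still ends with a success message, so it may be worth confirming that the R packages named by the script can actually be found. The bash sketch below is one way to check this from the shell. The function name check_r_packages is hypothetical, and it assumes that Rscript is on your PATH (that is, the R module is loaded).

```shell
# Hypothetical helper: report which of the named R packages can be found
# in the current library paths. Requires Rscript on PATH.
check_r_packages() {
    local missing=0
    local pkg

    for pkg in "$@"; do
        # requireNamespace() returns FALSE instead of raising an error
        # when the package is not installed.
        if Rscript -e "if (!requireNamespace('${pkg}', quietly = TRUE)) quit(status = 1)" \
            >/dev/null 2>&1; then
            echo "ok       ${pkg}"
        else
            echo "MISSING  ${pkg}"
            missing=1
        fi
    done

    # Returns 1 if any package is missing, 0 otherwise.
    return "$missing"
}
```

For the packages named in the setup.sh prompt, the call would be `check_r_packages gdsfmt SNPRelate getopt phangorn`. Set R_LIBS as it was during installation before running the check, so that the same library directories are searched.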