WRF


There are two distinct WRF development trees and versions: one for production forecasting and another for research and development. NCAR's experimental, advanced research version, called ARW (Advanced Research WRF), features very high resolution and is being used to explore ways of improving the accuracy of hurricane track, hurricane intensity, and rainfall forecasts, among a host of other meteorological questions. It is ARW, along with its pre- and post-processing modules (WPS and WPP) and the MET and GrADS display tools, that is supported here at the CUNY HPC Center. ARW is supported on both the CUNY HPC Center SGI (ANDY) and Cray (SALK) systems. The CUNY HPC Center build includes the NCAR Command Language (NCL) tools on both SALK and ANDY.

A complete start-to-finish use of ARW requires a significant number of steps in pre-processing, parallel production modeling, and post-processing and display. There are several alternative paths that can be taken through each stage. In particular, ARW itself offers users the ability to process either real or idealized weather data. Completing one type of simulation or the other requires different steps and even different user-compiled versions of the ARW executable. To help our users familiarize themselves with running ARW at the CUNY HPC Center, the steps required to complete a start-to-finish, real-case forecast are presented below. For more complete coverage, the CUNY HPC Center recommends that new users study the detailed description of the ARW package and how to use it at the University Corporation for Atmospheric Research (UCAR) website here [1].

WRF Pre-Processing with WPS

The WPS part of the WRF package is responsible for mapping time-equals-zero simulation input data onto the simulation domain's terrain. This process involves the execution of the preprocessing applications geogrid.exe, ungrib.exe, and metgrid.exe. Each of these applications reads its input parameters from the 'namelist.wps' input specifications file.

NOTE: In general, these steps do not take much processing time; however, in some cases they may. When users discover that pre-processing steps are running longer than five minutes as interactive jobs on the head node of either ANDY or SALK, they should instead be run as batch jobs. HPC Center staff may kill long-running interactive pre-processing steps if they are slowing head node performance.

In the example presented here, we will run a weather simulation based on input data provided from January of 2000 for the eastern United States. These steps should work on both ANDY and SALK, with minor differences as noted. To begin this example, create a working WPS directory and copy the test case namelist file into it.

mkdir -p $HOME/wrftest/wps
cd $HOME/wrftest/wps
cp /share/apps/wrf/default/WPS/namelist.wps .

Next, edit 'namelist.wps' to point to the sample data made available in the WRF installation tree. This involves making sure that the 'geog_data_path' assignment in the geogrid section of the namelist file points to the sample data tree. In an editor, make the following assignment:

geog_data_path = '/share/apps/wrf/default/WPS_DATA/geog'
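
For orientation, this assignment lives in the &geogrid section of 'namelist.wps'. The abbreviated excerpt below is only an illustration of where the line sits; apart from 'geog_data_path', the entries shown are representative of the defaults distributed with this test case and should not need to be changed.

&geogrid
 parent_id         = 1, 1,
 parent_grid_ratio = 1, 3,
 geog_data_res     = '10m', '2m',
 map_proj          = 'lambert',
 geog_data_path    = '/share/apps/wrf/default/WPS_DATA/geog'
/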

Once this is completed, you must symbolically link or copy the geogrid data table directory to your working directory ($HOME/wrftest/wps here).

ln -sf /share/apps/wrf/default/WPS/geogrid ./geogrid

Now, you can run 'geogrid.exe', the geogrid executable, which defines the simulation domains and interpolates the various terrestrial data sets between the model's grid lines. The global environment on ANDY has been set to include the path to all the WRF-related executables including 'geogrid.exe'. On SALK, you must load the WRF module ('module load wrf') first to set the environment. The geogrid executable is an MPI parallel program which could be run in parallel as part of a SLURM batch script to complete the combined WRF preprocessing and execution steps, but often it runs only a short while and can be run interactively on ANDY's head node before submitting a full WRF batch job.

First, you will have to load the WRF module with:

module load wrf

Once this is done from the $HOME/wrftest/wps working directory run:

geogrid.exe > geogrid.out

On SALK (the Cray system) you will have to run:

 aprun -n 1 geogrid.exe > geogrid.out

Note that 'geogrid.exe' is an MPI program and can be run in parallel. Long-running WRF pre-processing jobs should be run either interactively with more cores as above (with -n 8 or -n 16) or as complete SLURM batch jobs, so that SALK's interactive nodes are not held by long-running work; a sketch of such a batch job follows.
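
For pre-processing runs that exceed these interactive limits, a simple SLURM batch job can be used on SALK instead. The sketch below is illustrative rather than a site-mandated template: the 'production' partition matches the SALK script shown later on this page, while the job name, task count, and output file name are assumptions chosen for this example.

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name wps_geogrid
#SBATCH --ntasks=8
#SBATCH -o geogrid_batch.out

# Run from the WPS working directory that holds namelist.wps and ./geogrid
cd $SLURM_SUBMIT_DIR

# Set up the WRF environment and run geogrid on 8 MPI ranks
module load wrf
aprun -n 8 geogrid.exe > geogrid.out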

Two domain files (geo_em.d01.nc and geo_em.d02.nc) should be produced for this basic test case, along with log and output files that indicate success at the end with:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of geogrid.        !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The next required preprocessing step is to run 'ungrib.exe', the ungrib executable. The purpose of ungrib is to unpack 'GRIB' ('GRIB1' and 'GRIB2') meteorological data and pack it into an intermediate file format usable by 'metgrid.exe' in the final preprocessing step.

The data for the January 2000 simulation being documented here has already been downloaded and placed in the WRF installation tree in /share/apps/wrf/default/WPS_DATA. Before running 'ungrib.exe', the WRF installation 'Vtable' file must first be symbolically linked into the working directory with:

$ ln -sf /share/apps/wrf/default/WPS/ungrib/Variable_Tables/Vtable.AWIP Vtable
$ ls
geo_em.d01.nc  geo_em.d02.nc  geogrid  geogrid.log  namelist.wps  Vtable

The Vtable file specifies which fields to unpack from the GRIB files by listing the field names and the GRIB codes under which they are stored. For this test case the required Vtable file has already been defined, but users may have to construct a custom Vtable file for their own data.

Next, the GRIB files themselves must also be symbolically linked into the working directory. WRF provides a script to do this.

$ link_grib.csh /share/apps/wrf/default/WPS_DATA/JAN00/2000012
$ ls
geo_em.d01.nc  geogrid      GRIBFILE.AAA  GRIBFILE.AAC  GRIBFILE.AAE  GRIBFILE.AAG  GRIBFILE.AAI  GRIBFILE.AAK  GRIBFILE.AAM  namelist.wps
geo_em.d02.nc  geogrid.log  GRIBFILE.AAB  GRIBFILE.AAD  GRIBFILE.AAF  GRIBFILE.AAH  GRIBFILE.AAJ  GRIBFILE.AAL  GRIBFILE.AAN  Vtable

Note 'ls' shows that the 'GRIB' files are now present.

Next, two more edits to the 'namelist.wps' file are required: one to set the start and end dates of the simulation to our January 2000 time frame, and another to set 'interval_seconds', the interval in seconds between the incoming meteorological data files (21600 seconds = 6 hours here). Edit the 'namelist.wps' file by setting the following in the shared section of the file:

 start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
 end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
 interval_seconds = 21600
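
These settings all live in the &share section at the top of 'namelist.wps'. As a rough guide, after the edits that section should read something like the sketch below; 'wrf_core', 'max_dom', and 'io_form_geogrid' are the values distributed with this test case and are shown here only for context.

&share
 wrf_core = 'ARW',
 max_dom  = 2,
 start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
 end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
 interval_seconds = 21600,
 io_form_geogrid  = 2,
/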

Now you can run 'ungrib.exe' to create the intermediate files required by 'metgrid.exe':

$ ungrib.exe > ungrib.out
$ ls
FILE:2000-01-24_12  FILE:2000-01-25_06  geo_em.d02.nc  GRIBFILE.AAA  GRIBFILE.AAD  GRIBFILE.AAG  GRIBFILE.AAJ  GRIBFILE.AAM  ungrib.log
FILE:2000-01-24_18  FILE:2000-01-25_12  geogrid        GRIBFILE.AAB  GRIBFILE.AAE  GRIBFILE.AAH  GRIBFILE.AAK  GRIBFILE.AAN  ungrib.out
FILE:2000-01-25_00  geo_em.d01.nc       geogrid.log    GRIBFILE.AAC  GRIBFILE.AAF  GRIBFILE.AAI  GRIBFILE.AAL  namelist.wps  Vtable

Note that 'ungrib.exe', unlike the other pre-processing tools mentioned here, is NOT an MPI parallel program, and for larger WRF jobs it can run for a fairly long time. Long-running 'ungrib.exe' pre-processing jobs should be run as complete SLURM batch jobs, so that SALK's interactive nodes are not held for hours at a time.
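
If batch mode is needed, the SLURM sketch shown above for geogrid works here as well; because ungrib is serial, a single task is all that is required, for example on SALK:

 aprun -n 1 ungrib.exe > ungrib.out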

After a successful 'ungrib.exe' run you should get the familiar message at the end of the output file:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Successful completion of ungrib.!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Like geogrid, the metgrid executable, 'metgrid.exe', needs to be able to find its table directory in the preprocessing working directory. The metgrid table directory may be either copied or symbolically linked into the working directory.

ln -sf /share/apps/wrf/default/WPS/metgrid ./metgrid

At this point, all the files required for a successful run of 'metgrid.exe' are in place. Like 'geogrid.exe', 'metgrid.exe' is an MPI parallel program that could be run in SLURM batch mode, but it often runs for only a short time and can be run on ANDY's head node, as follows:

$ metgrid.exe > metgrid.out
$ ls
FILE:2000-01-24_12  geogrid       GRIBFILE.AAF  GRIBFILE.AAM                       met_em.d02.2000-01-24_12:00:00.nc  metgrid.out
FILE:2000-01-24_18  geogrid.log   GRIBFILE.AAG  GRIBFILE.AAN                       met_em.d02.2000-01-24_18:00:00.nc  namelist.wps
FILE:2000-01-25_00  GRIBFILE.AAA  GRIBFILE.AAH  met_em.d01.2000-01-24_12:00:00.nc  met_em.d02.2000-01-25_00:00:00.nc  ungrib.log
FILE:2000-01-25_06  GRIBFILE.AAB  GRIBFILE.AAI  met_em.d01.2000-01-24_18:00:00.nc  met_em.d02.2000-01-25_06:00:00.nc  ungrib.out
FILE:2000-01-25_12  GRIBFILE.AAC  GRIBFILE.AAJ  met_em.d01.2000-01-25_00:00:00.nc  met_em.d02.2000-01-25_12:00:00.nc  Vtable
geo_em.d01.nc       GRIBFILE.AAD  GRIBFILE.AAK  met_em.d01.2000-01-25_06:00:00.nc  metgrid
geo_em.d02.nc       GRIBFILE.AAE  GRIBFILE.AAL  met_em.d01.2000-01-25_12:00:00.nc  metgrid.log

If you are on SALK (Cray XE6), you will have to run:

 aprun -n 1 metgrid.exe > metgrid.out

Note that 'metgrid.exe' is an MPI program and can be run in parallel. Long-running WRF pre-processing jobs should be run either interactively with more cores as above (with -n 8 or -n 16) or as complete SLURM batch jobs, so that SALK's interactive nodes are not held by long-running work.

Successful runs will produce an output file that includes:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of metgrid.  !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Note that the met files required by WRF are now present (see the 'ls' output above). At this point, the preprocessing phase of this WRF sample run is complete. We can move on to actually running this real (not ideal) WRF test case in MPI parallel mode under the SLURM batch scheduler.

Running a WRF Real Case in Parallel Using SLURM

Our attention now turns to running 'real.exe' and 'wrf.exe' in parallel on ANDY or SALK via SLURM. As you perhaps noticed in walking through the preprocessing steps above, the preprocessing files are all installed in their own subdirectory (WPS) under the WRF installation tree root (/share/apps/wrf/default). The same is true for the files needed to run WRF itself; they reside under the WRF install root in the 'WRFV3' subdirectory.

Within this 'WRFV3' directory, the 'run' subdirectory contains all the common files needed for a 'wrf.exe' run except the 'met' files that were just created in the preprocessing section above and those produced by 'real.exe', which is run before 'wrf.exe' in real-data weather forecasts.

Note that the ARW version of WRF allows one to produce a number of different executables depending on the type of run that is needed. Here, we are relying on the fact that the 'em_real' version of the code has already been built. Currently, the CUNY HPC Center has only compiled this version of WRF. Other versions can be compiled upon request. The subdirectory 'test' underneath the 'WRFV3' directory contains additional subdirectories for each type of WRF build (em_real, em_fire, em_hill2d_x, etc.).

To complete an MPI parallel run of this WRF real data case, a 'wrfv3/run' working directory for your run should be created, and it must be filled with the required files from the installation root's 'run' directory, as follows:

$ cd $HOME/wrftest
$ mkdir -p wrfv3/run
$ cd wrfv3/run
$ cp /share/apps/wrf/default/WRFV3/run/* .
$ rm *.exe
$
$ ls
CAM_ABS_DATA       ETAMPNEW_DATA.expanded_rain      LANDUSE.TBL            ozone_lat.formatted   RRTM_DATA_DBL      SOILPARM.TBL  URBPARM_UZE.TBL
CAM_AEROPT_DATA    ETAMPNEW_DATA.expanded_rain_DBL  MPTABLE.TBL            ozone_plev.formatted  RRTMG_LW_DATA      tr49t67       VEGPARM.TBL
co2_trans          GENPARM.TBL                      namelist.input         README.namelist       RRTMG_LW_DATA_DBL  tr49t85
ETAMPNEW_DATA      grib2map.tbl                     namelist.input.backup  README.tslist         RRTMG_SW_DATA      tr67t85
ETAMPNEW_DATA_DBL  gribmap.txt                      ozone.formatted        RRTM_DATA             RRTMG_SW_DATA_DBL  URBPARM.TBL
$

Note that the '*.exe' files were removed after the copy because the installed executables are already found through ANDY's and SALK's system PATH variable.

Next, the 'met' files produced during the preprocessing phase above need to be copied or symbolically linked into the 'wrfv3/run' directory.

$
$ pwd
/home/guest/wrftest/wrfv3/run
$
$ cp ../../wps/met_em* .
$ ls
CAM_ABS_DATA                     grib2map.tbl                       namelist.input         RRTM_DATA_DBL      tr67t85
CAM_AEROPT_DATA                  gribmap.txt                        namelist.input.backup  RRTMG_LW_DATA      URBPARM.TBL
co2_trans                        LANDUSE.TBL                        ozone.formatted        RRTMG_LW_DATA_DBL  URBPARM_UZE.TBL
ETAMPNEW_DATA                    met_em.d01.2000-01-24_12:00:00.nc  ozone_lat.formatted    RRTMG_SW_DATA      VEGPARM.TBL
ETAMPNEW_DATA_DBL                met_em.d01.2000-01-25_12:00:00.nc  ozone_plev.formatted   RRTMG_SW_DATA_DBL
ETAMPNEW_DATA.expanded_rain      met_em.d02.2000-01-24_12:00:00.nc  README.namelist        SOILPARM.TBL
ETAMPNEW_DATA.expanded_rain_DBL  met_em.d02.2000-01-25_12:00:00.nc  README.tslist          tr49t67
GENPARM.TBL                      MPTABLE.TBL                        RRTM_DATA              tr49t85
$

The user may need to edit the WRF 'namelist.input' file to craft the exact job they wish to run. The default namelist file copied into our working directory is largely what is needed for this test run, but we will reduce the total simulated time (for the weather model, not the job's wall-clock limit) from 12 hours to 1 hour by setting the 'run_hours' variable to 1.
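
The 'run_hours' setting lives in the &time_control section of 'namelist.input'. The abbreviated sketch below shows roughly how the top of that section should read after the edit; the start and end dates and 'interval_seconds' mirror the WPS settings and are already correct in the distributed file, and the remaining entries are left unchanged.

&time_control
 run_days    = 0,
 run_hours   = 1,
 run_minutes = 0,
 run_seconds = 0,
 start_year  = 2000, 2000,
 start_month = 01,   01,
 start_day   = 24,   24,
 start_hour  = 12,   12,
 end_year    = 2000, 2000,
 end_month   = 01,   01,
 end_day     = 25,   25,
 end_hour    = 12,   12,
 interval_seconds = 21600,
 ! (remaining &time_control entries unchanged)
/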

At this point we are ready to submit a SLURM job. The SLURM batch script below first runs 'real.exe', which creates the WRF input files 'wrfbdy_d01' and 'wrfinput_d01', and then runs 'wrf.exe' itself. Both executables are MPI parallel programs, and here each is run on 16 processors. Here is the 'wrftest.job' SLURM script that will run on ANDY:

#!/bin/bash
#SBATCH --partition production_gdr
#SBATCH --job-name wrf_realem 
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=2880

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
echo ""
hostname
echo ""

# Find out the contents of the SLURM node list, which names the nodes
# allocated by SLURM
echo -n ">>>> SLURM Node list contains: "
echo ""
echo $SLURM_JOB_NODELIST
echo ""

# You must explicitly change to the working directory in SLURM
cd $SLURM_SUBMIT_DIR

# Just point to the pre-processing executable to run
echo ">>>> Runnning REAL.exe executable ..."
mpirun -np 16 /share/apps/wrf/default/WRFV3/run/real.exe
echo ">>>> Running WRF.exe executable ..."
mpirun -np 16 /share/apps/wrf/default/WRFV3/run/wrf.exe
echo ">>>> Finished WRF test run ..."

The full path to each executable is used for illustrative purposes, but both binaries (real.exe and wrf.exe) are in the WRF install tree's run directory and would be picked up from the system PATH environment variable without the full path. This job requests 16 MPI tasks, each with 2880 MB of memory, and asks to be run on the QDR InfiniBand (faster interconnect) side of the ANDY system. Details on the use and meaning of the SLURM options section of the job are available elsewhere in the CUNY HPC Wiki.

To submit the job type:

sbatch wrftest.job
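
Once the job has been queued, its state and progress can be followed with standard tools, for example:

 squeue -u $USER          # show the job's state in the SLURM queue
 tail -f rsl.out.0000     # follow the WRF log written by MPI rank 0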

A slightly different version of the script is required to run the same job on SALK (the Cray):

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name wrf_realem 
#SBATCH --ntasks=16
#SBATCH -o wrf_test16_O1.out

# Find out name of master execution host (compute node)
echo -n ">>>> SLURM Master compute node is: "
echo ""
hostname
echo ""

# Find out the contents of the SLURM node list, which names the nodes
# allocated by SLURM
echo -n ">>>> SLURM Node list contains: "
echo ""
echo $SLURM_JOB_NODELIST
echo ""

# You must explicitly change to the working directory in SLURM
cd $SLURM_SUBMIT_DIR

# Tune some MPICH parameters on the Cray
export MALLOC_MMAP_MAX=0
export MALLOC_TRIM_THRESHOLD=536870912
export MPICH_RANK_ORDER=3

# Just point to the pre-processing executable to run
echo ">>>> Runnning REAL.exe executable ..."
aprun -n 16  /share/apps/wrf/default/WRFV3/run/real.exe
echo ">>>> Running WRF.exe executable ..."
aprun -n 16  /share/apps/wrf/default/WRFV3/run/wrf.exe
echo ">>>> Finished WRF test run ..."

A successful run on either ANDY or SALK will produce an 'rsl.out' and an 'rsl.error' file for each MPI rank on which the job ran (named 'rsl.out.0000', 'rsl.out.0001', and so on), so for this test case there will be 16 of each. The 'rsl.out' files record the run settings requested in the namelist file and then time-stamp the progress the job is making until the total simulation time is completed. The tail end of an 'rsl.out' file for a successful run should look like this:

:
:
Timing for main: time 2000-01-24_12:45:00 on domain   1:    0.06060 elapsed seconds.
Timing for main: time 2000-01-24_12:48:00 on domain   1:    0.06300 elapsed seconds.
Timing for main: time 2000-01-24_12:51:00 on domain   1:    0.06090 elapsed seconds.
Timing for main: time 2000-01-24_12:54:00 on domain   1:    0.06340 elapsed seconds.
Timing for main: time 2000-01-24_12:57:00 on domain   1:    0.06120 elapsed seconds.
Timing for main: time 2000-01-24_13:00:00 on domain   1:    0.06330 elapsed seconds.
 d01 2000-01-24_13:00:00 wrf: SUCCESS COMPLETE WRF
taskid: 0 hostname: gpute-2
taskid: 0 hostname: gpute-2

Post-Processing and Displaying WRF Results