WRF
There are two distinct WRF development trees and versions, one for production forecasting and another for research and development. NCAR's experimental, advanced research version, called ARW (Advanced Research WRF), supports very high resolution and is being used to explore ways of improving the accuracy of hurricane tracking, hurricane intensity, and rainfall forecasts, among a host of other meteorological questions. It is ARW, along with its pre- and post-processing modules (WPS and WPP) and the MET and GrADS display tools, that is supported here at the CUNY HPC Center. ARW is supported on both the CUNY HPC Center SGI (ANDY) and Cray (SALK). The CUNY HPC Center build includes the NCAR Command Language (NCL) tools on both SALK and ANDY.
A complete start-to-finish use of ARW requires a significant number of steps in pre-processing, parallel production modeling, and post-processing and display. There are several alternative paths that can be taken through each stage. In particular, ARW itself offers users the ability to process either real or idealized weather data. Completing one type of simulation or the other requires different steps and even different user-compiled versions of the ARW executable. To help our users familiarize themselves with running ARW at the CUNY HPC Center, the steps required to complete a start-to-finish, real-case forecast are presented below. For more complete coverage, the CUNY HPC Center recommends that new users study the detailed description of the ARW package and how to use it at the University Corporation for Atmospheric Research (UCAR) website here [1].
WRF Pre-Processing with WPS
The WPS part of the WRF package is responsible for mapping time-equals-zero simulation input data onto the simulation domain's terrain. This process involves the execution of the preprocessing applications geogrid.exe, ungrib.exe, and metgrid.exe. Each of these applications reads its input parameters from the 'namelist.wps' input specifications file.
NOTE: In general, these pre-processing steps do not take much processing time, although in some cases they can. When users discover that pre-processing steps are running for longer than five minutes as interactive jobs on the head node of either ANDY or SALK, they should instead be run as batch jobs. HPC Center staff may kill long-running interactive pre-processing steps if they are slowing head node performance.
In the example presented here, we will run a weather simulation based on input data provided from January of 2000 for the eastern United States. These steps should work both on ANDY and SALK with minor differences as noted. To begin this example, create a working WPS directory and copy the test case namelist file into it.
mkdir -p $HOME/wrftest/wps
cd $HOME/wrftest/wps
cp /share/apps/wrf/default/WPS/namelist.wps .
Next, you should edit the 'namelist.wps' to point to the sample data made available in the WRF installation tree. This involves making sure that the 'geog_data_path' assignment in the geogrid section of the namelist file points to the sample data tree. From an editor make the following assignment:
geog_data_path = '/share/apps/wrf/default/WPS_DATA/geog'
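If you prefer to make this change from the command line rather than in an editor, a one-line sed command will do it. This is just a sketch: it assumes the copied 'namelist.wps' already contains a geog_data_path line to overwrite (back the file up first if in doubt).

# Overwrite the existing geog_data_path entry in place (sketch only)
sed -i "s|geog_data_path.*|geog_data_path = '/share/apps/wrf/default/WPS_DATA/geog'|" namelist.wps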
Once this is completed, you must symbolically link or copy the geogrid data table directory to your working directory ($HOME/wrftest/wps here).
ln -sf /share/apps/wrf/default/WPS/geogrid ./geogrid
Now you can run 'geogrid.exe', the geogrid executable, which defines the simulation domains and interpolates the various terrestrial data sets onto the model grids. The global environment on ANDY has been set to include the path to all of the WRF-related executables, including 'geogrid.exe'. On SALK, you must load the WRF module ('module load wrf') first to set the environment. The geogrid executable is an MPI parallel program that could be run in parallel as part of a SLURM batch script covering the combined WRF preprocessing and execution steps, but it often runs only a short while and can be run interactively on ANDY's head node before submitting a full WRF batch job.
If you are on SALK, first load the WRF module:
module load wrf
Then, from the $HOME/wrftest/wps working directory, run:
geogrid.exe > geogrid.out
On SALK (the Cray system), you will instead have to run:
aprun -n 1 geogrid.exe > geogrid.out
Note that 'geogrid.exe' is an MPI program and can be run in parallel. Long-running WRF pre-processing jobs should be run either with more cores interactively as above (with -n 8 or -n 16) or as complete SLURM batch jobs, so that SALK's interactive nodes are not tied up by long-running jobs.
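As an illustration, a minimal SLURM batch script for running 'geogrid.exe' on 8 cores of a SALK compute node might look like the sketch below. The partition name, core count, and output file name are assumptions to adapt to your own job, and the module environment is assumed to be available in the batch shell:

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name geogrid
#SBATCH --ntasks=8
#SBATCH -o geogrid_batch.out

# Sketch only: run geogrid.exe on a compute node rather than the head node
cd $HOME/wrftest/wps
module load wrf
aprun -n 8 geogrid.exe > geogrid.out

The same pattern applies to the 'metgrid.exe' step described later; only the executable name and output file change.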
Two domain files (geo_em.d01.nc and geo_em.d02.nc) should be produced for this basic test case, along with a log file and an output file that indicate success at the end with:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of geogrid.       !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The next required preprocessing step is to run 'ungrib.exe', the ungrib executable. The purpose of ungrib is to unpack 'GRIB' ('GRIB1' and 'GRIB2') meteorological data and pack it into an intermediate file format usable by 'metgrid.exe' in the final preprocessing step.
The data for the January 2000 simulation being documented here has already been downloaded and placed in the WRF installation tree in /share/apps/wrf/default/WPS_DATA. Before running 'ungrib.exe', the WRF installation 'Vtable' file must first be symbolically linked into the working directory with:
$ ln -sf /share/apps/wrf/default/WPS/ungrib/Variable_Tables/Vtable.AWIP Vtable
$ ls
geo_em.d01.nc  geo_em.d02.nc  geogrid  geogrid.log  namelist.wps  Vtable
The Vtable file specifies which fields to unpack from the GRIB files. The Vtables list the fields and their GRIB codes that must be unpacked. For this test case the required Vtable file has already been defined, but users may have to construct a custom Vtable file for their data.
Next, the GRIB files themselves must also be symbolically linked into the working directory. WRF provides a script to do this.
$ link_grib.csh /share/apps/wrf/default/WPS_DATA/JAN00/2000012
$ ls
geo_em.d01.nc  geogrid      GRIBFILE.AAA  GRIBFILE.AAC  GRIBFILE.AAE  GRIBFILE.AAG  GRIBFILE.AAI  GRIBFILE.AAK  GRIBFILE.AAM  namelist.wps
geo_em.d02.nc  geogrid.log  GRIBFILE.AAB  GRIBFILE.AAD  GRIBFILE.AAF  GRIBFILE.AAH  GRIBFILE.AAJ  GRIBFILE.AAL  GRIBFILE.AAN  Vtable
Note 'ls' shows that the 'GRIB' files are now present.
Next, two more edits to the 'namelist.wps' file are required: one to set the start and end dates of the simulation to our January 2000 time frame, and another to set the interval in seconds between the input meteorological data files (21600 seconds = 6 hours here). Edit the 'namelist.wps' file by setting the following in the &share section of the file:
start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
interval_seconds = 21600
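For orientation, these assignments live in the &share section at the top of 'namelist.wps'. A sketch of that section for this two-domain case is shown below; the 'wrf_core', 'max_dom', and 'io_form_geogrid' values shown are typical defaults and are assumed here, so check them against your copied file rather than pasting this in verbatim:

&share
 wrf_core = 'ARW',
 max_dom = 2,
 start_date = '2000-01-24_12:00:00','2000-01-24_12:00:00',
 end_date   = '2000-01-25_12:00:00','2000-01-25_12:00:00',
 interval_seconds = 21600,
 io_form_geogrid = 2,
/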
Now you can run 'ungrib.exe' to create the intermediate files required by 'metgrid.exe':
$ ungrib.exe > ungrib.out
$ ls
FILE:2000-01-24_12  FILE:2000-01-25_06  geo_em.d02.nc  GRIBFILE.AAA  GRIBFILE.AAD  GRIBFILE.AAG  GRIBFILE.AAJ  GRIBFILE.AAM  ungrib.log
FILE:2000-01-24_18  FILE:2000-01-25_12  geogrid        GRIBFILE.AAB  GRIBFILE.AAE  GRIBFILE.AAH  GRIBFILE.AAK  GRIBFILE.AAN  ungrib.out
FILE:2000-01-25_00  geo_em.d01.nc       geogrid.log    GRIBFILE.AAC  GRIBFILE.AAF  GRIBFILE.AAI  GRIBFILE.AAL  namelist.wps  Vtable
Note that 'ungrib.exe', unlike the other pre-processing tools mentioned here, is NOT an MPI parallel program, and for larger WRF jobs it can run for a fairly long time. Long-running 'ungrib.exe' pre-processing jobs should be run as complete SLURM batch jobs, so that SALK's interactive nodes are not held for hours at a time.
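A batch run of 'ungrib.exe' follows the same pattern as the geogrid sketch above, but with a single task since the program is serial (again a sketch, with an assumed partition name and output file):

#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name ungrib
#SBATCH --ntasks=1
#SBATCH -o ungrib_batch.out

# Sketch only: ungrib.exe is not MPI parallel, so launch a single task
cd $HOME/wrftest/wps
module load wrf
aprun -n 1 ungrib.exe > ungrib.out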
After a successful 'ungrib.exe' run you should get the familiar message at the end of the output file:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of ungrib.        !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Like geogrid, the metgrid executable, 'metgrid.exe' needs to be able to find its table directory in the preprocessing working directory. The metgrid table directory may either be copied or symbolically linked into the working directory location.
ln -sf /share/apps/wrf/default/WPS/metgrid ./metgrid
At this point, all the files required for a successful run of 'metgrid.exe' are in place. Like 'geogrid.exe', 'metgrid.exe' is an MPI parallel program that could be run in SLURM batch mode, but it often runs for only a short time and can be run on ANDY's head node, as follows:
$ metgrid.exe > metgrid.out
$ ls
FILE:2000-01-24_12  geogrid       GRIBFILE.AAF  GRIBFILE.AAM                       met_em.d02.2000-01-24_12:00:00.nc  metgrid.out
FILE:2000-01-24_18  geogrid.log   GRIBFILE.AAG  GRIBFILE.AAN                       met_em.d02.2000-01-24_18:00:00.nc  namelist.wps
FILE:2000-01-25_00  GRIBFILE.AAA  GRIBFILE.AAH  met_em.d01.2000-01-24_12:00:00.nc  met_em.d02.2000-01-25_00:00:00.nc  ungrib.log
FILE:2000-01-25_06  GRIBFILE.AAB  GRIBFILE.AAI  met_em.d01.2000-01-24_18:00:00.nc  met_em.d02.2000-01-25_06:00:00.nc  ungrib.out
FILE:2000-01-25_12  GRIBFILE.AAC  GRIBFILE.AAJ  met_em.d01.2000-01-25_00:00:00.nc  met_em.d02.2000-01-25_12:00:00.nc  Vtable
geo_em.d01.nc       GRIBFILE.AAD  GRIBFILE.AAK  met_em.d01.2000-01-25_06:00:00.nc  metgrid
geo_em.d02.nc       GRIBFILE.AAE  GRIBFILE.AAL  met_em.d01.2000-01-25_12:00:00.nc  metgrid.log
If you are on SALK (Cray XE6), you will have to run:
aprun -n 1 metgrid.exe > metgrid.out
Note that 'metgrid.exe' is an MPI program and can be run in parallel. Long-running WRF pre-processing jobs should be run either with more cores interactively as above (with -n 8 or -n 16) or as complete SLURM batch jobs, so that SALK's interactive nodes are not tied up by long-running jobs.
Successful runs will produce an output file that includes:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!  Successful completion of metgrid. !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Note that the met files required by WRF are now present (see the 'ls' output above). At this point, the preprocessing phase of this WRF sample run is complete. We can move on to actually running this real (not ideal) WRF test case using the SLURM batch scheduler in MPI parallel mode.
Running a WRF Real Case in Parallel Using SLURM
Our focus now turns to running 'real.exe' and 'wrf.exe' in parallel on ANDY or SALK via SLURM. As you perhaps noticed in walking through the preprocessing steps above, the preprocessing files are all installed in their own subdirectory (WPS) under the WRF installation tree root (/share/apps/wrf/default). The same is true for the files needed to run WRF itself; they reside under the WRF install root in the 'WRFV3' subdirectory.
Within this 'WRFV3' directory, the 'run' subdirectory contains all the common files needed for a 'wrf.exe' run except the 'met' files that were just created in the preprocessing section above and those that are produced by 'real.exe', which is run before 'wrf.exe' in real-data weather forecasts.
Note that the ARW version of WRF allows one to produce a number of different executables depending on the type of run that is needed. Here, we are relying on the fact that the 'em_real' version of the code has already been built. Currently, the CUNY HPC Center has only compiled this version of WRF. Other versions can be compiled upon request. The subdirectory 'test' underneath the 'WRFV3' directory contains additional subdirectories for each type of WRF build (em_real, em_fire, em_hill2d_x, etc.).
To complete an MPI parallel run of this WRF real data case, a 'wrfv3/run' working directory for your run should be created, and it must be filled with the required files from the installation root's 'run' directory, as follows:
$ cd $HOME/wrftest
$ mkdir -p wrfv3/run
$ cd wrfv3/run
$ cp /share/apps/wrf/default/WRFV3/run/* .
$ rm *.exe
$ ls
CAM_ABS_DATA       ETAMPNEW_DATA.expanded_rain      LANDUSE.TBL            ozone_lat.formatted   RRTM_DATA_DBL      SOILPARM.TBL  URBPARM_UZE.TBL
CAM_AEROPT_DATA    ETAMPNEW_DATA.expanded_rain_DBL  MPTABLE.TBL            ozone_plev.formatted  RRTMG_LW_DATA      tr49t67       VEGPARM.TBL
co2_trans          GENPARM.TBL                      namelist.input         README.namelist       RRTMG_LW_DATA_DBL  tr49t85
ETAMPNEW_DATA      grib2map.tbl                     namelist.input.backup  README.tslist         RRTMG_SW_DATA      tr67t85
ETAMPNEW_DATA_DBL  gribmap.txt                      ozone.formatted        RRTM_DATA             RRTMG_SW_DATA_DBL  URBPARM.TBL
Note that the '*.exe' files were removed in the sequence above after the copy because those executables are already on ANDY's and SALK's system PATH.
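Before submitting the batch job, a quick sanity check is to confirm that the shell (with the WRF environment in place) can actually find the WRF binaries on the PATH, for example:

which geogrid.exe metgrid.exe real.exe wrf.exe

If they are not found, load the wrf module first as described earlier.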
Next, the 'met' files produced during the preprocessing phase above need to be copied or symbolically linked into the 'wrfv3/run' directory.
$ pwd
/home/guest/wrftest/wrfv3/run
$ cp ../../wps/met_em* .
$ ls
CAM_ABS_DATA                     grib2map.tbl                       namelist.input         RRTM_DATA_DBL      tr67t85
CAM_AEROPT_DATA                  gribmap.txt                        namelist.input.backup  RRTMG_LW_DATA      URBPARM.TBL
co2_trans                        LANDUSE.TBL                        ozone.formatted        RRTMG_LW_DATA_DBL  URBPARM_UZE.TBL
ETAMPNEW_DATA                    met_em.d01.2000-01-24_12:00:00.nc  ozone_lat.formatted    RRTMG_SW_DATA      VEGPARM.TBL
ETAMPNEW_DATA_DBL                met_em.d01.2000-01-25_12:00:00.nc  ozone_plev.formatted   RRTMG_SW_DATA_DBL
ETAMPNEW_DATA.expanded_rain      met_em.d02.2000-01-24_12:00:00.nc  README.namelist        SOILPARM.TBL
ETAMPNEW_DATA.expanded_rain_DBL  met_em.d02.2000-01-25_12:00:00.nc  README.tslist          tr49t67
GENPARM.TBL                      MPTABLE.TBL                        RRTM_DATA              tr49t85
The user may need to edit the WRF 'namelist.input' file listed above to craft the exact job they wish to run. The default namelist file copied into our working directory is largely what is needed for this test run, but we will reduce the total simulation time (for the weather model, not the job) from 12 hours to 1 hour by setting the 'run_hours' variable to 1.
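This change amounts to editing one line near the top of the &time_control section of 'namelist.input'. A minimal sketch of the affected lines is shown below; leave the rest of the &time_control section (start and end dates, output intervals, and so on) as it is in the distributed file:

&time_control
 run_days    = 0,
 run_hours   = 1,
 run_minutes = 0,
 run_seconds = 0,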
At this point we are ready to submit a SLURM job. The SLURM batch script below first runs 'real.exe', which creates the WRF input files 'wrfbdy_d01' and 'wrfinput_d01', and then runs 'wrf.exe' itself. Both executables are MPI parallel programs, and here they are both run on 16 processors. Here is the 'wrftest.job' SLURM script that will run on ANDY:
#!/bin/bash
#SBATCH --partition production_gdr
#SBATCH --job-name wrf_realem
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2880

# Report the name of the master execution host (compute node)
echo ">>>> SLURM Master compute node is: "
hostname
echo ""

# Report the list of nodes allocated to this job by SLURM
echo ">>>> SLURM node list contains: "
echo $SLURM_JOB_NODELIST
echo ""

# Change to the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Run real.exe to create the WRF input files, then wrf.exe itself
echo ">>>> Running REAL.exe executable ..."
mpirun -np 16 /share/apps/wrf/default/WRFV3/run/real.exe

echo ">>>> Running WRF.exe executable ..."
mpirun -np 16 /share/apps/wrf/default/WRFV3/run/wrf.exe

echo ">>>> Finished WRF test run ..."
The full path to each executable is used for illustrative purposes, but both binaries (real.exe and wrf.exe) are in the WRF install tree's run directory and would be picked up from the system PATH environment variable without the full path. This job requests 16 MPI tasks, each with 1 core and 2880 MB of memory, and it asks to be run on the QDR InfiniBand (faster interconnect) side of the ANDY system. Details on the use and meaning of the SLURM options section of the job are available elsewhere in the CUNY HPC Wiki.
To submit the job type:
sbatch wrftest.job
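Once the job has been submitted, its progress through the queue can be checked with the standard SLURM tools, for example:

squeue -u $USER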
A slightly different version of the script is required to run the same job on SALK (the Cray):
#!/bin/bash
#SBATCH --partition production
#SBATCH --job-name wrf_realem
#SBATCH --ntasks=16
#SBATCH -o wrf_test16_O1.out

# Report the name of the master execution host (compute node)
echo ">>>> SLURM Master compute node is: "
hostname
echo ""

# Report the list of nodes allocated to this job by SLURM
echo ">>>> SLURM node list contains: "
echo $SLURM_JOB_NODELIST
echo ""

# Change to the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR

# Tune some MPICH parameters on the Cray
export MALLOC_MMAP_MAX=0
export MALLOC_TRIM_THRESHOLD=536870912
export MPICH_RANK_ORDER=3

# Run real.exe to create the WRF input files, then wrf.exe itself
echo ">>>> Running REAL.exe executable ..."
aprun -n 16 /share/apps/wrf/default/WRFV3/run/real.exe

echo ">>>> Running WRF.exe executable ..."
aprun -n 16 /share/apps/wrf/default/WRFV3/run/wrf.exe

echo ">>>> Finished WRF test run ..."
A successful run on either ANDY or SALK will produce an 'rsl.out' and an 'rsl.error' file for each processor on which the job ran, so for this test case there will be 16 of each. The 'rsl.out' files reflect the run settings requested in the namelist file and then time-stamp the progress the job makes until the total simulation time is completed. The tail end of an 'rsl.out' file for a successful run should look like this:
:
:
Timing for main: time 2000-01-24_12:45:00 on domain 1: 0.06060 elapsed seconds.
Timing for main: time 2000-01-24_12:48:00 on domain 1: 0.06300 elapsed seconds.
Timing for main: time 2000-01-24_12:51:00 on domain 1: 0.06090 elapsed seconds.
Timing for main: time 2000-01-24_12:54:00 on domain 1: 0.06340 elapsed seconds.
Timing for main: time 2000-01-24_12:57:00 on domain 1: 0.06120 elapsed seconds.
Timing for main: time 2000-01-24_13:00:00 on domain 1: 0.06330 elapsed seconds.
d01 2000-01-24_13:00:00 wrf: SUCCESS COMPLETE WRF
taskid: 0 hostname: gpute-2
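Because this success message is written to each rank's 'rsl.out' file, a quick way to confirm that all 16 tasks finished cleanly is to search for it across all of them:

grep "SUCCESS COMPLETE WRF" rsl.out.*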