Motivation

We are three second-year graduate students in the astronomy department. For our research, we perform simulations with massively parallel codes (e.g. the moving-mesh fluid dynamics code Arepo and the particle-in-cell code tristan-mp) to study various astrophysical systems.

Efficient analysis and visualization of the simulation data is crucial to our work. With 3D simulations, visualizing the data is a challenge even when the simulation box has evenly-spaced grid points, simply because the screen is 2D. One common simplification technique is to just plot a couple of 2D slices through the 3D box, which inevitably misses a lot of structures in the simulation. A further challenge arises when the simulation grid is not uniform, which is the case for many cosmological simulations, including simulations generated by Arepo.

A lot of recent effort in the community has gone into developing volume-rendering software to visualize full 3D volumetric data sets. In particular, yt is a widely used open-source package, primarily written in Python with MPI, for this purpose. However, it lacks GPU support, which could otherwise provide a substantial speedup.

In addition, yt and other visualization software only support output files from a limited number of simulation codes and data structures, which unfortunately do not include ours.

For our final project, we developed an algorithm that combines GPUs with MPI to render 3D data of arbitrary grid type at lightning-fast speed.

We would like to visualize 3D scalar fields from large simulation outputs on both structured and unstructured grids. The goal is to create a fast method to render the full 3D data set using raycasting with GPUs and MPI, so that we can visualize the outputs of the simulations we use in our research projects.

Our Data

The data comes from the astrophysical fluid dynamics simulations used in our research. These simulations are run with a variety of codes, including the moving-mesh code Arepo and the particle-in-cell code tristan-mp. As mentioned above, the two codes produce different kinds of output: Arepo data lives on an unstructured Voronoi mesh, while tristan-mp generates structured data. The file size of the outputs depends strongly on the size of the grid/mesh and on the resolution employed in the simulations, ranging from a few MB up to a few GB.

Program Design

Raycasting is highly parallelizable, so we use GPUs and have each thread calculate a single pixel in our image. In addition, we further boost our method by using MPI to distribute the calculation of different frames in the movie across a GPU cluster. This is the most efficient way to use a cluster for generating movies, as the frame calculations are independent of each other and the algorithm scales strongly. Our code, Pavoreal, can be run by importing the pavoreal package, writing a short driver script, and setting up a configuration file, making it easy to use for a variety of applications.
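The frame-level parallelism is simple to express with mpi4py. Below is a minimal sketch of the idea, assuming a placeholder render_frame function rather than pavoreal's actual interface: each MPI rank renders every size-th frame on its own GPU, with no communication between ranks.

    # Minimal sketch of distributing frames over MPI ranks with mpi4py.
    # render_frame() is a placeholder, not pavoreal's actual interface.
    from mpi4py import MPI

    def render_frame(i):
        """Placeholder: raycast frame i on this rank's GPU and write it to disk."""
        pass

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_frames = 360  # e.g. one frame per degree of a camera orbit

    # Frames are independent, so a static round-robin assignment needs
    # no communication and the code scales strongly.
    for i in range(rank, n_frames, size):
        render_frame(i)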

Usage

The code is used by setting up a configuration file that specifies the camera path and orientation, transfer functions, input data, output directory, and gridding/smoothing parameters, then importing the pavoreal package and running a simple driver script (see the README or the download page for details and examples).
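As an illustration only, a configuration collecting the quantities listed above might look like the sketch below; the key names are hypothetical, not pavoreal's actual configuration format (see the README for that).

    # Hypothetical configuration sketch -- illustrative key names only,
    # not pavoreal's actual configuration format.
    config = {
        "data_file":   "snapshot_100.hdf5",   # simulation output to render
        "output_dir":  "frames/",             # where rendered images go
        "grid_size":   (256, 256, 256),       # resampling grid for unstructured data
        "smoothing":   1.5,                   # smoothing length in grid cells
        "camera_path": "orbit",               # e.g. orbit the box once
        "n_frames":    360,
        "transfer_function": {                # maps field values to colour and opacity
            "field": "density",
            "log":   True,
        },
    }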

Performance

The MPI version of our code scales strongly, since there is no communication between processes. Using a GPU instead of a CPU on each process gives a large advantage, making the total runtime of the code 100-1000 times faster for the various data sets. Some optimizations we implemented include:

  • Rewriting all the kernels as a single kernel, which uses less memory and makes the algorithm more computation-dominated
  • Casting the 3D data as a texture, which increased our speed by over an order of magnitude (a minimal sketch follows this list)
  • Making sure all memory reads are coalesced
  • Casting the transfer function as a texture for rapid lookup, which increased the speed by a factor of 1.4
  • Computing the cubic spline interpolation as a weighted sum of 8 trilinear texture lookups rather than using a full 64-point stencil (illustrated further below)
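As a rough illustration of the texture casting (a simplified stand-in, not the actual pavoreal kernel), binding a uniform grid to a 3D texture with PyCUDA's texture-reference API looks like the sketch below; every tex3D fetch then gets hardware trilinear interpolation for free.

    # Simplified sketch: cast gridded data to a 3D texture with PyCUDA.
    # The real rendering kernel marches rays; this one just samples one point.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    texture<float, 3, cudaReadModeElementType> data_tex;

    __global__ void sample(float *out, float x, float y, float z)
    {
        // Hardware trilinear interpolation at an arbitrary point (voxel units).
        out[0] = tex3D(data_tex, x, y, z);
    }
    """)

    volume = np.random.rand(64, 64, 64).astype(np.float32)  # stand-in for simulation data

    tex = mod.get_texref("data_tex")
    tex.set_array(drv.np_to_array(volume, order="C"))
    tex.set_filter_mode(drv.filter_mode.LINEAR)  # interpolated reads

    out = np.zeros(1, dtype=np.float32)
    mod.get_function("sample")(drv.Out(out),
                               np.float32(32.5), np.float32(10.2), np.float32(3.7),
                               block=(1, 1, 1), texrefs=[tex])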

Our first implementation (Approach 1) used several separate kernels, relying on coalesced memory reads and massively parallel texture lookups; combining everything into a single, computation-dominated kernel (Approach 2) made the rendering about 5x faster.
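The spline trick above relies on the fact that a linear texture fetch already returns a weighted average of two neighbouring texels. The NumPy sketch below demonstrates the one-dimensional identity we use: a 4-point cubic B-spline is rewritten as a weighted sum of two linear lookups, and applying the same decomposition along each axis turns a 64-point stencil into 8 trilinear fetches.

    # 1D illustration of the cubic-B-spline-from-linear-lookups identity.
    # In 3D, applying this per axis turns a 64-point stencil into 8 trilinear fetches.
    import numpy as np

    def linear_fetch(c, x):
        """Emulate a hardware linear texture fetch at coordinate x (array-index units)."""
        i = int(np.floor(x))
        f = x - i
        return (1.0 - f) * c[i] + f * c[i + 1]

    def bspline_weights(f):
        """The four uniform cubic B-spline weights for fractional offset f."""
        w0 = (1 - f) ** 3 / 6
        w1 = (3 * f**3 - 6 * f**2 + 4) / 6
        w2 = (-3 * f**3 + 3 * f**2 + 3 * f + 1) / 6
        w3 = f**3 / 6
        return w0, w1, w2, w3

    def cubic_direct(c, x):
        """Cubic B-spline interpolation using the full 4-point stencil."""
        i = int(np.floor(x))
        w0, w1, w2, w3 = bspline_weights(x - i)
        return w0 * c[i - 1] + w1 * c[i] + w2 * c[i + 1] + w3 * c[i + 2]

    def cubic_two_fetches(c, x):
        """Same result from only two linear fetches at shifted coordinates."""
        i = int(np.floor(x))
        w0, w1, w2, w3 = bspline_weights(x - i)
        g0, g1 = w0 + w1, w2 + w3
        h0 = (i - 1) + w1 / g0   # fetch positions chosen so the hardware
        h1 = (i + 1) + w3 / g1   # lerp reproduces the spline weights
        return g0 * linear_fetch(c, h0) + g1 * linear_fetch(c, h1)

    c = np.random.rand(16)
    assert np.isclose(cubic_direct(c, 7.3), cubic_two_fetches(c, 7.3))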
Insights

We took away several insights from this project:

  • Textures can be used in more general ways than a simple lookup: a cubic spline interpolation can be constructed from a stencil of 8 trilinear texture lookups. The method is inexpensive and a good way to get smoother images.
  • Casting the transfer function as a texture (which also makes sampling it smoother than using a fixed number of bins) further optimizes the code.
  • In PyCUDA, GPU memory has to be deallocated manually once it is no longer being used; otherwise the available memory can run out (a short sketch follows this list).
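A minimal sketch of that last point (illustrative only, not pavoreal itself): large per-frame device buffers allocated with PyCUDA's gpuarray are best released explicitly.

    # Minimal sketch: explicitly freeing GPU memory in PyCUDA between frames.
    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    for frame in range(10):
        # Large per-frame buffer on the device (e.g. the rendered image).
        image = gpuarray.zeros((1024, 1024, 4), dtype=np.float32)

        # ... launch rendering kernels that fill `image`, then copy it back ...
        result = image.get()

        # Relying on garbage collection to release the buffer was not prompt
        # enough in our runs; freeing explicitly avoids exhausting device memory.
        image.gpudata.free()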
Future Work

We would like to extend this project by using PyOpenGL, or the GL module included in PyCUDA, to create an interactive version of pavoreal in which GPU memory feeds directly into the video output. Unfortunately, PyCUDA was compiled without GL support on resonance, so our current method generates images for every frame (with matplotlib) and assembles movies on the CPU. The data transfer between GPU and CPU and the file writing actually take a significant amount of time (on the order of a second per frame) compared to the rendering calculations, so an interactive mode would also help visualize the data even faster.

Reflections

We definitely enjoyed harnessing the raw power of GPUs, and viewing the final movies we created was a nice reward. Some of the simulations had never been visualized before, and it was exciting to see what they looked like. One of the challenging aspects of the project was dealing with the unstructured data sets: binning data on GPUs is not straightforward, since naive CPU-like implementations can lead to bank conflicts. Working out the geometry of the camera grid was also frustrating, especially after we changed the data to be cast to a texture: initially we fixed the camera and rotated the object, but textures cannot be rotated, so we had to rotate the camera instead and rework all of the code and conventions.