Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code

Bibliographic Details
Published in: arXiv.org (Jul 14, 2005), p. n/a
Main Author: Yahagi, Hideki
Published: Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2083133968
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1093/pasj/57.5.779  |2 doi 
035 |a 2083133968 
045 0 |b d20050714 
100 1 |a Yahagi, Hideki 
245 1 |a Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code 
260 |b Cornell University Library, arXiv.org  |c Jul 14, 2005 
513 |a Working Paper 
520 3 |a In this paper, we describe our vectorized and parallelized adaptive mesh refinement (AMR) N-body code with shared time steps, and report its performance on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code places hierarchical meshes recursively where higher resolution is required, and all particles share the same time step. The parts that are the most difficult to vectorize are the loops that access the mesh data and the particle data. We vectorized such parts by changing the loop structure so that the innermost loop steps through the cells instead of the particles in each cell; in other words, by changing the loop order from depth-first to breadth-first. Mass assignment is also vectorizable using this loop-order exchange and splitting the loop into \(2^{N_{dim}}\) loops, if the cloud-in-cell scheme is adopted. Here, \(N_{dim}\) is the number of dimensions. These vectorization schemes, which eliminate the unvectorized loops, are also applicable to parallelizing loops for shared-memory multiprocessors. We also parallelized our code for distributed-memory machines. The key part of the parallelization is data decomposition. We sorted the hierarchical mesh data in the Morton order, or the recursive N-shaped order, level by level, then split and allocated the mesh data to the processors. Particles are allocated to the processor to which the finest refined cells containing them are assigned. Our timing analysis using \(\Lambda\)-dominated cold dark matter simulations shows that our parallel code speeds up almost ideally up to 32 processors, the largest number of processors in our test. 
653 |a Distributed shared memory 
653 |a Grid refinement (mathematics) 
653 |a Finite element method 
653 |a Cold dark matter 
653 |a Dark matter 
653 |a Processors 
653 |a Microprocessors 
653 |a Distributed memory 
653 |a Computer simulation 
653 |a Parallel computers 
773 0 |t arXiv.org  |g (Jul 14, 2005), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2083133968/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/astro-ph/0507339
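
The loop-order exchange described in the abstract (520 field) can be illustrated with a small sketch. The listing below is a minimal, hypothetical C example, not the authors' actual code: it assumes particles are chained per cell through head[]/next_p[] linked lists, and contrasts the depth-first form (short inner loops over each cell's particles, which a vector compiler cannot exploit) with the breadth-first form, in which the innermost loop steps through the cells of a level and each sweep advances every non-empty list by one particle, giving long, independent iterations suited to a vector pipeline.

#include <stdio.h>

#define NCELL 4
#define NPART 8

/* Toy cell<->particle linkage (illustrative data only):
 * head[c] is the first particle in cell c, next_p[p] the next particle
 * in the same cell, and -1 terminates the list. */
static const int head[NCELL]     = { 0,  2, -1,  5 };
static const int next_p[NPART]   = { 1, -1,  3,  4, -1,  6,  7, -1 };
static const double pmass[NPART] = { 1, 1, 1, 1, 1, 1, 1, 1 };
static double rho[NCELL];

/* stand-in for the cloud-in-cell mass assignment of one particle */
static void assign_mass(int c, int p) { rho[c] += pmass[p]; }

int main(void)
{
    /* Depth-first order: for each cell, walk its own short particle list. */
    for (int c = 0; c < NCELL; c++)
        for (int p = head[c]; p >= 0; p = next_p[p])
            assign_mass(c, p);

    /* Breadth-first order: the innermost loop steps through the cells;
     * each outer sweep advances every non-empty particle list by one. */
    int cur[NCELL];
    for (int c = 0; c < NCELL; c++) { rho[c] = 0.0; cur[c] = head[c]; }
    for (int active = 1; active; ) {
        active = 0;
        for (int c = 0; c < NCELL; c++) {   /* long, vectorizable loop */
            int p = cur[c];
            if (p < 0) continue;
            assign_mass(c, p);
            cur[c] = next_p[p];
            active = 1;
        }
    }

    for (int c = 0; c < NCELL; c++)
        printf("cell %d: mass %.1f\n", c, rho[c]);
    return 0;
}

Both orderings assign exactly the same mass; only the loop structure differs.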
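The Morton-order ("recursive N-shaped order") data decomposition mentioned in the abstract can likewise be sketched. The following is an assumed, generic illustration of such a decomposition rather than the paper's implementation: the cells of one refinement level are keyed by bit-interleaving their integer coordinates, sorted by that key, and the sorted list is cut into contiguous chunks, one per processor; particles would follow the finest refined cell that contains them.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* spread the low 21 bits of v so that two zero bits separate each bit */
static uint64_t spread_bits(uint64_t v)
{
    v &= 0x1fffff;                            /* 21 bits -> 63-bit key */
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

/* 3-D Morton key: bits of x, y, z interleaved */
static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return spread_bits(x) | spread_bits(y) << 1 | spread_bits(z) << 2;
}

typedef struct { uint32_t ix, iy, iz; uint64_t key; } cell_t;

static int cmp_key(const void *a, const void *b)
{
    uint64_t ka = ((const cell_t *)a)->key, kb = ((const cell_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

int main(void)
{
    enum { NCELL = 16, NPROC = 4 };
    cell_t cells[NCELL];

    /* toy cells on a 4x4 slab at one refinement level */
    for (int i = 0; i < NCELL; i++) {
        cells[i].ix  = (uint32_t)(i % 4);
        cells[i].iy  = (uint32_t)(i / 4);
        cells[i].iz  = 0;
        cells[i].key = morton3(cells[i].ix, cells[i].iy, cells[i].iz);
    }

    /* sort by Morton key, then hand out contiguous chunks to processors */
    qsort(cells, NCELL, sizeof(cell_t), cmp_key);
    for (int i = 0; i < NCELL; i++)
        printf("cell (%u,%u,%u) -> proc %d\n",
               cells[i].ix, cells[i].iy, cells[i].iz, i / (NCELL / NPROC));
    return 0;
}

Because Morton-sorted cells that are close in the key are also close in space, each contiguous chunk stays spatially compact, which keeps inter-processor communication low; applying the sort level by level matches the hierarchical mesh structure described in the abstract.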