Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code

Bibliographic Details
Published in: arXiv.org (Jul 14, 2005), p. n/a
Main Author: Yahagi, Hideki
Published: Cornell University Library, arXiv.org
Subjects:
Online Access: Citation/Abstract
Full text outside of ProQuest

MARC

LEADER 00000nab a2200000uu 4500
001 2083133968
003 UK-CbPIL
022 |a 2331-8422 
024 7 |a 10.1093/pasj/57.5.779  |2 doi 
035 |a 2083133968 
045 0 |b d20050714 
100 1 |a Yahagi, Hideki 
245 1 |a Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code 
260 |b Cornell University Library, arXiv.org  |c Jul 14, 2005 
513 |a Working Paper 
520 3 |a In this paper, we describe our vectorized and parallelized adaptive mesh refinement (AMR) N-body code with shared time steps, and report its performance on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code places hierarchical meshes recursively where higher resolution is required, and all particles share the same time step. The parts that are the most difficult to vectorize are the loops that access the mesh data and the particle data. We vectorized such parts by changing the loop structure so that the innermost loop steps through the cells instead of the particles in each cell; in other words, by changing the loop order from depth-first to breadth-first. Mass assignment is also vectorizable using this loop-order exchange and splitting the loop into \(2^{N_{dim}}\) loops, if the cloud-in-cell scheme is adopted. Here, \(N_{dim}\) is the number of dimensions. These vectorization schemes, which eliminate the unvectorized loops, are also applicable to parallelizing loops for shared-memory multiprocessors. We also parallelized our code for distributed-memory machines. The key part of the parallelization is data decomposition. We sorted the hierarchical mesh data in the Morton order, or the recursive N-shaped order, level by level, then split and allocated the mesh data to the processors. Particles are allocated to the processor to which the finest refined cells containing them are assigned. Our timing analysis using \(\Lambda\)-dominated cold dark matter simulations shows that our parallel code speeds up almost ideally up to 32 processors, the largest number of processors in our test. 
653 |a Distributed shared memory 
653 |a Grid refinement (mathematics) 
653 |a Finite element method 
653 |a Cold dark matter 
653 |a Dark matter 
653 |a Processors 
653 |a Microprocessors 
653 |a Distributed memory 
653 |a Computer simulation 
653 |a Parallel computers 
773 0 |t arXiv.org  |g (Jul 14, 2005), p. n/a 
786 0 |d ProQuest  |t Engineering Database 
856 4 1 |3 Citation/Abstract  |u https://www.proquest.com/docview/2083133968/abstract/embedded/L8HZQI7Z43R0LA5T?source=fedsrch 
856 4 0 |3 Full text outside of ProQuest  |u http://arxiv.org/abs/astro-ph/0507339
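
The loop-order exchange described in the abstract (520 field) can be illustrated with a small sketch. The listing below is a minimal, hypothetical C example, not the authors' actual code: it assumes particles are chained per cell through head[]/next_p[] linked lists, and contrasts the depth-first form (short inner loops over each cell's particles, which a vector compiler cannot exploit) with the breadth-first form, in which the innermost loop steps through the cells of a level and each sweep advances every non-empty list by one particle, giving long, independent iterations suited to a vector pipeline.

#include <stdio.h>

#define NCELL 4
#define NPART 8

/* Toy cell<->particle linkage (illustrative data only):
 * head[c] is the first particle in cell c, next_p[p] the next particle
 * in the same cell, and -1 terminates the list. */
static const int head[NCELL]     = { 0,  2, -1,  5 };
static const int next_p[NPART]   = { 1, -1,  3,  4, -1,  6,  7, -1 };
static const double pmass[NPART] = { 1, 1, 1, 1, 1, 1, 1, 1 };
static double rho[NCELL];

/* stand-in for the cloud-in-cell mass assignment of one particle */
static void assign_mass(int c, int p) { rho[c] += pmass[p]; }

int main(void)
{
    /* Depth-first order: for each cell, walk its own short particle list. */
    for (int c = 0; c < NCELL; c++)
        for (int p = head[c]; p >= 0; p = next_p[p])
            assign_mass(c, p);

    /* Breadth-first order: the innermost loop steps through the cells;
     * each outer sweep advances every non-empty particle list by one. */
    int cur[NCELL];
    for (int c = 0; c < NCELL; c++) { rho[c] = 0.0; cur[c] = head[c]; }
    for (int active = 1; active; ) {
        active = 0;
        for (int c = 0; c < NCELL; c++) {   /* long, vectorizable loop */
            int p = cur[c];
            if (p < 0) continue;
            assign_mass(c, p);
            cur[c] = next_p[p];
            active = 1;
        }
    }

    for (int c = 0; c < NCELL; c++)
        printf("cell %d: mass %.1f\n", c, rho[c]);
    return 0;
}

Both orderings assign exactly the same mass; only the loop structure differs.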
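The Morton-order ("recursive N-shaped order") data decomposition mentioned in the abstract can likewise be sketched. The following is an assumed, generic illustration of such a decomposition rather than the paper's implementation: the cells of one refinement level are keyed by bit-interleaving their integer coordinates, sorted by that key, and the sorted list is cut into contiguous chunks, one per processor; particles would follow the finest refined cell that contains them.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* spread the low 21 bits of v so that two zero bits separate each bit */
static uint64_t spread_bits(uint64_t v)
{
    v &= 0x1fffff;                            /* 21 bits -> 63-bit key */
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

/* 3-D Morton key: bits of x, y, z interleaved */
static uint64_t morton3(uint32_t x, uint32_t y, uint32_t z)
{
    return spread_bits(x) | spread_bits(y) << 1 | spread_bits(z) << 2;
}

typedef struct { uint32_t ix, iy, iz; uint64_t key; } cell_t;

static int cmp_key(const void *a, const void *b)
{
    uint64_t ka = ((const cell_t *)a)->key, kb = ((const cell_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

int main(void)
{
    enum { NCELL = 16, NPROC = 4 };
    cell_t cells[NCELL];

    /* toy cells on a 4x4 slab at one refinement level */
    for (int i = 0; i < NCELL; i++) {
        cells[i].ix  = (uint32_t)(i % 4);
        cells[i].iy  = (uint32_t)(i / 4);
        cells[i].iz  = 0;
        cells[i].key = morton3(cells[i].ix, cells[i].iy, cells[i].iz);
    }

    /* sort by Morton key, then hand out contiguous chunks to processors */
    qsort(cells, NCELL, sizeof(cell_t), cmp_key);
    for (int i = 0; i < NCELL; i++)
        printf("cell (%u,%u,%u) -> proc %d\n",
               cells[i].ix, cells[i].iy, cells[i].iz, i / (NCELL / NPROC));
    return 0;
}

Because Morton-sorted cells that are close in the key are also close in space, each contiguous chunk stays spatially compact, which keeps inter-processor communication low; applying the sort level by level matches the hierarchical mesh structure described in the abstract.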