As a modified-gravity proposal to handle the dark matter problem on galactic scales, Modified Newtonian Dynamics (MOND) has shown great success. However, the
Modified Newtonian Dynamics (MOND) is an alternative to the popular dark matter (DM) theory, accounting for the missing mass problem in astrophysics. To study the outskirts of disk galaxies,
Gravitational
Hardware accelerators for
During the simulation, parameters of these
In this paper, by utilizing the FPGA-SoC, we propose a highly integrated accelerating solution for
The rest of this paper is organized as follows. The background of MOND and
Modified Newtonian Dynamics (MOND) can be interpreted as a modification to the law of gravity. It is an alternative to the popular dark matter (DM) theory, accounting for the missing mass problem in astrophysics. Both MOND and DM elegantly fit the rotation curves of spiral galaxies. However, there exist some challenges for the DM-based model; the biggest one is that the tight scaling relations cannot be understood [
Modified Newtonian Dynamics (MOND) was first proposed by Milgrom in 1983 [
Here
In conventional Newtonian dynamics, we have the classical Poisson equation:
Equation (
Therefore, in the weak-field limit, we have the so-called quasi-linear MOND (QUMOND) [
For simplicity, denote
Analogous to dark matter, the so-called phantom dark matter (PDM) is introduced, and
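For completeness, the QUMOND system sketched above can be summarized as follows. This is a standard statement of QUMOND, with Φ_N the Newtonian potential, ρ_b the baryonic density, ν the interpolating function, and a_0 Milgrom's acceleration constant; the symbols may differ from those in the omitted equations.

```latex
% Newtonian step: solve the classical Poisson equation
\nabla^{2}\Phi_{N} = 4\pi G\,\rho_{b}
% Phantom dark matter density: the extra source term playing the role of DM
\rho_{\mathrm{ph}} = \frac{1}{4\pi G}\,
  \nabla\cdot\!\left[\left(\nu\!\left(\tfrac{|\nabla\Phi_{N}|}{a_{0}}\right)-1\right)\nabla\Phi_{N}\right]
% MOND step: a second linear Poisson solve with the total density
\nabla^{2}\Phi = 4\pi G\left(\rho_{b} + \rho_{\mathrm{ph}}\right)
```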
An
with the known baryonic matter distribution
calculating the PDM distribution with (
solving the modified Poisson equation (
The
Table
Time cost of MOND calculation on Intel i5 processor.
Number of particles  Time cost of Newtonian potential (s)  Time cost of PDM distribution (s)  Time cost of MOND potential (s) 

10000  2.062  0.001  2.065 
20000  8.551  0.003  8.536 
50000  55.09  0.012  55.68 
Besides pursuing a high calculation speed, there also exist other essential considerations, such as power consumption and economic cost. Most existing accelerating solutions utilize a card-host scheme: the accelerator is implemented as an add-in card, relying on an external host processor to handle the data dispatching. However, as mentioned in Section
The calculation speed of an FPGA-based accelerator is mainly determined by two factors: the number of pipelines the FPGA integrates and the throughput of each pipeline; the pipeline throughput is in turn limited by the data bandwidth and the operating frequency of the pipelines. To obtain a high calculation speed, we analyze the physical model of the potential calculation and optimize the pipeline design, pursuing lower logic resource occupation, lower data bandwidth demands, and higher operating frequencies.
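As a back-of-the-envelope check of this twofold dependence (our arithmetic, using numbers reported in the experiments): assuming each pipeline evaluates one particle pair per clock cycle, the peak pair rate is simply the pipeline count times the clock frequency.

```c
/* Peak pair-interaction rate, under the assumption of one pair evaluated
 * per pipeline per clock cycle. */
double peak_mpairs_per_s(int pipelines, double freq_mhz)
{
    return pipelines * freq_mhz; /* one pair per MHz-cycle -> Mpairs/s */
}
/* Example: 9 pipelines at 142.8 MHz give a 1285.2 Mpairs/s peak; the
 * measured 1200.5 Mpairs/s in the experiments is about 93% of this bound. */
```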
In this paper, we focus on a highly integrated accelerating solution for
The utilization of the FPGA-SoC makes the solution highly integrated.
We propose optimized summation pipelines for the calculation of the Newtonian and MOND potentials, in which the square term is computed with three DSP48E1s in Xilinx 7-series FPGAs, so that the logic resource occupation of each pipeline is reduced and more pipelines can be implemented.
Based on the particle-mesh scheme, the data flow from memory to the pipelines is optimized: the space coordinates of each object are computed on the fly rather than transferred, which reduces the required data bandwidth.
We conduct extensive experiments to test the proposed solution. The results show that 9 optimized pipelines can be implemented in a Zynq-7020 FPGA; with the typical pipeline, only 7 pipelines fit in the same FPGA, and the HLS (high-level synthesis) tool produces only 4.
We choose the nearest grid point (NGP) scheme, one of the particle-mesh strategies, as our basic physical model. In the NGP scheme, particles in the galaxy are interpolated onto a mesh, each particle being approximated as located at the closest mesh point, as depicted in Figure
Illustration of the discretization scheme in the
In the particle-mesh scheme, the baryonic matter distribution
With the particle-mesh scheme described in Figure
In
Illustration of the multi-stage particle-mesh scheme.
Higher resolution ratio
Lower resolution ratio
In this paper, for simplicity, we use a fixed resolution. Technically, the computation flow is as follows:
building a mesh covering the galaxy and its outskirts as shown in Figure
solving Newtonian potential, PDM distribution, and the final MOND potential through the three steps in Section
Figure
System architecture.
Different from conventional accelerating solutions, we utilize the embedded ARM processor to handle the data dispatching. The ARM processor also performs all the lightweight computing tasks, including initializing the particle-mesh scheme, monitoring the status of the DMA and the pipelines, and calculating the PDM distribution. To maximize the bandwidth between memory and the accelerating pipelines, the control stream and the data stream are split: a 32-bit AXI bus transfers commands and states, while the data is transferred through a 64-bit AXI high-performance bus and two DMA-FIFO groups. Each pipeline is dedicated to one fixed particle. The potential of the fixed particle is actually the summation of
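The bandwidth-saving coordinate generation can be sketched in C (the hardware does this with counters; the row-major ordering and names here are our assumptions): since mesh cells stream in a fixed order, only the mass of each cell needs to be transferred, and its coordinates are recovered from the running cell index.

```c
typedef struct { double x, y, z; } coord_t;

/* Recover cell-center coordinates from a row-major linear index, so that
 * only masses need to be streamed; nx, ny and the spacing h are assumed
 * mesh parameters. */
coord_t coords_from_index(long idx, long nx, long ny, double h)
{
    coord_t c;
    long i = idx % nx;            /* fastest-varying axis */
    long j = (idx / nx) % ny;
    long k = idx / (nx * ny);
    c.x = i * h;
    c.y = j * h;
    c.z = k * h;
    return c;
}
```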
Partial diagram of controller module.
The calculation bottleneck arises mainly in the potential summation. To handle this problem, we make efforts in three directions: pipeline stage splitting, pipeline simplification, and bandwidth reduction.
Top: typical potential summation pipeline with stage optimization; the dotted lines mark the division of pipeline stages. Bottom: implementing the logic in the shaded box with DSP48E1s.
The working flow of
Flow chart of MOND simulation.
In our solution, we define a structure, named grid, to store the information of every particle; it is described in Algorithm
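The exact fields of grid are given in the referenced algorithm listing; a plausible C sketch (field names and types are our assumptions, not the paper's) is:

```c
/* Hypothetical sketch of the per-cell record; the actual fields are
 * defined in the paper's Algorithm listing. */
typedef struct grid {
    float x, y, z;  /* cell-center coordinates (may be derived, not stored) */
    float mass;     /* baryonic mass assigned to this cell (NGP) */
    float phi_n;    /* accumulated Newtonian potential */
    float phi;      /* accumulated MOND potential */
} grid;
```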
Illustration of memory copy operation.
To test our accelerating solution, we choose the ZedBoard as the experiment platform. The ZedBoard utilizes a Zynq-7020 FPGA-SoC, which consists of a dual-core ARM Cortex-A9 processor as the processing system (PS) and a Xilinx 7-series FPGA with 220 DSP48E1s as the programmable logic (PL). The maximum frequency of the AXI bus between PS and PL is 250 MHz. In addition, the ZedBoard carries 512 MB of DDR3 with a 32-bit interface.
Table
Zynq-7020 resource utilization.
Entity  LUTs  Registers  DSP48E1s 

Pipeline (simplified)  3826 (7%)  7467 (7%)  15 (7%) 
Pipeline (typical)  4417 (8%)  7761 (7%)  24 (11%) 
Pipeline (HLS)  7014 (13%)  12132 (11%)  24 (11%) 
Accelerator (9 simplified pipelines)  45014 (85%)  80403 (76%)  171 (78%) 
Accelerator (7 typical pipelines)  39616 (74%)  64777 (61%)  196 (89%) 
From Table
Suppose that an astronomical system contains
The same C source code runs on the different CPUs without multi-core parallelization, so the results reflect single-core performance only. We also run a well-optimized CUDA code on both an embedded GPU and a high-performance GPU. For the accelerator proposed in this paper, we test two schemes, both running at 142 MHz with 9 pipelines integrated; the difference is that one scheme utilizes the bandwidth-reducing method mentioned in Section
Calculating the potential in
Computing unit  Type  Frequency (MHz)  Processor/CUDA cores or pipelines  s/frame  Mpairs/s  Power (watt)  Speedup 

ARM Cortex-A9  Embedded CPU  667  2  345.7  3.106  1.5  1 
ARM Cortex-A15  Embedded CPU  2300  4  32.24  33.30  ≈5  10.7 
Intel i5-2476M  CPU  1600  2  26.97  39.81  17  12.8 
Intel Xeon E5-2660  CPU  2200  8  15.75  68.17  95  21.9 
Tegra Kepler  Embedded GPU  950  192  5.592  192.0  <2  61.8 
Tesla K80  GPU  562  —  0.1701  6312  ≈100  2032.2 
Zynq-7020 (normal)  FPGA  142.8  9  1.107  970.0  1.3  312.3 
Zynq-7020 (bandwidth reduced)  FPGA  142.8  9  0.853  1200.5  1.3  386.5 
Comparison on performance per watt.
From Table
Moreover, by analyzing Table
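The per-watt comparison can be reproduced directly from the table entries (our arithmetic):

```c
/* Performance per watt, derived from measured throughput and power. */
double mpairs_per_watt(double mpairs_per_s, double watts)
{
    return mpairs_per_s / watts;
}
/* From the table: Zynq-7020 (bandwidth reduced): 1200.5 / 1.3 ~ 923 Mpairs/s/W;
 * Tesla K80: 6312 / 100 ~ 63 Mpairs/s/W, roughly 15x lower. */
```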
In this paper, we propose a highly integrated accelerating solution based on the FPGA-SoC for
The authors declare that they have no competing interests.