FPGA setup and building
Latest revision as of 08:27, 30 September 2022
This page contains instructions on how to set up, compile, and use the FPGA bitstream required to run the Flow simulator when compiled with the fpgaSolver enabled (ILU0-BiCGSTAB accelerated solver).
Prerequisites
Using the Flow simulator with an FPGA requires the (offline) generation of a bitstream file, which is loaded into the FPGA at runtime and configures it to make available a specific function, in this case the ILU0-BiCGSTAB solver.
This generation process requires, for the kernel currently present in the FPGA repository, the installation of some Xilinx tools and an additional package to target the Xilinx Alveo U280 board.
Note that no FPGA board needs to be installed on the host used to generate the bitstream.
The bitstream generation has only been tested on Linux systems with CentOS 7 (7.4-7.9).
It may however be possible to use other OS configurations (e.g. Ubuntu 16.04/18.04 as officially supported by Xilinx), but this has not been tested.
Please refer to this page for documentation about:
- the list of supported OSes and the minimum hardware requirements for compiling bitstreams for the Xilinx Alveo boards (Application Acceleration Development Flow)
- additional packages that may be required on the host (e.g. OpenCL packages, kernel headers, etc.)
The main tools required are:
- Xilinx Vitis, version 2019.2
To download the Xilinx Vitis package, a (free) registration to the Xilinx website is required.
Choose a file under "Vitis Core Development Kit", not the "Update 1" version; for example "Xilinx Unified Installer 2019.2: Linux Self Extracting Web Installer".
- Xilinx XRT, version 2.3
Note that the XRT can be used for all the Alveo board versions (not just the U280).
In the page "Alveo U280 Package File Downloads", click on the button "Archive" in the "Tool Version" line.
This will expand showing the "2019.2" version button: click on it and leave selected "XDMA" as "Platform Type" and "x86_64" as "Architecture".
Finally, select the "Operating System" for which you want to download the package (currently either "RHEL/CentOS" or "Ubuntu").
For example, when selecting "RHEL/CentOS", "OS Version" will appear: select "7.x" and a frame will appear below, named "Download Installer for Alveo U280".
Select the choice number 1 (Download the Xilinx Runtime): the file to download will be named xrt_201920.2.3.1301_7.4.1708-xrt.rpm
The additional package needed to target the Xilinx Alveo U280 board is:
- Development Target Platform, version 201920.1
In the page "Alveo U280 Package File Downloads", follow the same navigation as for the XRT download above ("Archive" → "2019.2", "XDMA" as "Platform Type", "x86_64" as "Architecture", then your "Operating System" and "OS Version").
Select choice number 3 (Download the Development Target Platform): the file to download will be named xilinx-u280-xdma-dev-201920.1-2699728.x86_64.rpm
Warning: the Xilinx FPGA shell used to generate the bitstream must be version 201920.1 (XDMA).
Therefore, this shell version must also be the one installed on the Xilinx Alveo U280 FPGA board which is used to run the accelerated flow application.
In principle, the shell version can be changed to a different one (typically a newer one), but the bitstream generation process may fail or generate sub-performing bitstreams (e.g. in case the shell allocates the static resources differently than the one used currently), and changes may be required to the makefiles and other code in the FPGA repository.
Warning: the last opm-simulators master branch verified to work with the FPGA accelerator dates from 2021-01-15 09:00, commit hash 57d158b. Later opm-simulators releases and master-branch commits broke the FPGA API due to incompatibilities with the GPU code in how the data is organized (i.e., reordered) and transferred to the accelerator.
Downloading
Download the FPGA repository (read-only) by doing:
git clone git://github.com/OPM/FPGA.git
Building
After having installed the Xilinx tools and downloaded the FPGA repository, the following steps are needed to generate the bitstream:
- setup the Xilinx Vitis environment, e.g.:
source <path to the Xilinx Vitis base directory>/Vitis/2019.2/settings64.sh
- setup the Xilinx XRT environment, e.g.:
source <path to the Xilinx XRT base directory>/source.sh
- change to the implementation directory and run:
cd <path to the FPGA repository base directory>/linearalgebra/ilu0bicgstab/xilinx/alveo_u280/vitis_20192/HW-implementation
make bitstream
Important: in case the Xilinx board package has been installed in a path different from the default one (/opt/xilinx/platforms/xilinx_u280_xdma_201920_1/xilinx_u280_xdma_201920_1.xsa), use the variable PLATFORM_XSA to point to its actual location, e.g.:
make PLATFORM_XSA=/path/to/xilinx_u280_xdma_201920_1.xsa bitstream
After the last command finishes successfully (it may take several hours), the following files, among others, will be present in the implementation directory:
- bicgstab_kernel.xclbin (the bitstream file)
- bicgstab_kernel.xclbin.info (a text file with a summary of the current bitstream implementation results)
Typically, the bitstream generation process is constrained mainly by the maximum frequency of the CPU cores (the tools do not always use multi-threading), the local storage speed, and the amount of RAM.
As an example, the bitstream generation flow has been tested on a host running CentOS 7.9 (x86_64) with an AMD Ryzen 9 3950X CPU, 64 GB of RAM, and a fast RAID5 disk configuration.
The whole generation process took around 3 hours and used around 4 GB of disk space.
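The build steps above can be collected into a small wrapper script. This is only a sketch under assumptions: the Vitis, XRT, and repository paths below are placeholders for your own installation, and the script prints the commands for review instead of executing them (drop the "echo" prefixes to run them for real).

```shell
#!/bin/sh
# Sketch of a wrapper for the bitstream generation steps above.
# All paths are placeholders (assumptions) -- adjust them to your setup.
# The script only prints the commands; remove "echo" to execute them.

VITIS_DIR=${VITIS_DIR:-/opt/Xilinx}    # Xilinx Vitis base directory (assumed)
XRT_DIR=${XRT_DIR:-/opt/xilinx/xrt}    # Xilinx XRT base directory (assumed)
FPGA_REPO=${FPGA_REPO:-$HOME/FPGA}     # cloned FPGA repository (assumed)
XSA=${XSA:-/opt/xilinx/platforms/xilinx_u280_xdma_201920_1/xilinx_u280_xdma_201920_1.xsa}

# Implementation directory inside the FPGA repository (from the steps above).
IMPL_DIR="$FPGA_REPO/linearalgebra/ilu0bicgstab/xilinx/alveo_u280/vitis_20192/HW-implementation"

echo ". $VITIS_DIR/Vitis/2019.2/settings64.sh"   # Vitis environment
echo ". $XRT_DIR/source.sh"                      # XRT environment
echo "cd $IMPL_DIR"
echo "make PLATFORM_XSA=$XSA bitstream"          # may take several hours
```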
Finally, this command can be used to completely clean up the results of a previous implementation; note that it will also delete the generated bitstream, so use it with care!
make clean_all
Using the bitstream
To use the bitstream, copy the file bicgstab_kernel.xclbin to the host where the FPGA board is installed.
Please refer to the "Install Guide" in this page for documentation regarding the installation and setup of the Xilinx Alveo U280 board.
The XSA shell installed on the board must be the same specified in the #Prerequisites section.
It may also be possible to use a cloud service provider that makes available nodes with the Alveo U280 board instead of an on-premises host. Please check the Alveo pages for more info about this option.
Moreover, you'll need to use a Flow simulator binary file compiled with FPGA support enabled.
The current source code for the Flow simulator (still under review at the moment of writing) contains modifications to enable the Xilinx Alveo board (see PR #2998, opm-simulators).
The host containing the FPGA board must already be properly set up (e.g., the proper drivers and the Xilinx XRT installed and configured).
Then, the following command line parameters can be used to run flow with the FPGA:
--accelerator-mode=fpga
--fpga-bitstream=/path/to/bicgstab_kernel.xclbin
--matrix-add-well-contributions=true                     # required for FPGA
--opencl-ilu-reorder=[level_scheduling|graph_coloring]   # optional: the default will be "level_scheduling"
An example command line to run flow with the FPGA accelerator is (set the proper paths before running it!):
/path/to/flow /path/to/opm-tests/norne/NORNE_ATW2013.DATA \
  --accelerator-mode=fpga \
  --fpga-bitstream=/path/to/bicgstab_kernel.xclbin \
  --matrix-add-well-contributions=true \
  --threads-per-process=8 \
  --output-dir=./output_norne_fpga
Notes about performance measurements
In order to work with flow, the FPGA must be programmed with the proper bitstream before the solver kernel functionality becomes available.
This is done automatically by flow when it's run with the proper command line parameters as shown in the previous section.
However, when running flow for the first time after the host is started (or if the FPGA has been previously programmed with a different bitstream than the solver), the bitstream will be programmed into the FPGA, and this may take around 6 seconds, thus adding overhead to the total execution time.
The FPGA programming is not needed for successive executions of flow, because the Xilinx runtime (XRT) will recognize that the FPGA is already programmed with the proper bitstream, and there will be no overhead on the total execution time.
Hence, when interested in checking FPGA performance, it is suggested to use one of the following options before executing the tests:
- run flow in FPGA mode with a small dataset (e.g., SPE1CASE1.DATA found in opm-tests/spe1/): this will load the bitstream and make it available;
- run the Xilinx programming tool to pre-load the bitstream (xbutil is part of the Xilinx XRT package):
xbutil program -p /path/to/bicgstab_kernel.xclbin
and then proceed to run with the interesting datasets.
Note: if you have more than one FPGA board in the host, add "-d<device number>" to the xbutil command line to specify which FPGA must be programmed. Flow will always use only the first Alveo U280 board in the host (i.e., the one with the lowest device number).
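The pre-loading step can be wrapped in a short helper script. This is a sketch under assumptions: the bitstream path is a placeholder, the device number defaults to 0, and the xbutil invocation is the one shown above; the command is printed for review instead of being executed.

```shell
#!/bin/sh
# Sketch: pre-load the solver bitstream once before benchmarking, so the
# one-time FPGA programming overhead (around 6 s) is not charged to the runs.
# XCLBIN is a placeholder path; DEVICE matters only with multiple boards.
# The command is printed for review; remove the "echo" to program the FPGA.

XCLBIN=${XCLBIN:-/path/to/bicgstab_kernel.xclbin}
DEVICE=${DEVICE:-0}

CMD="xbutil program -d$DEVICE -p $XCLBIN"
echo "$CMD"
```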
Warnings
The FPGA acceleration is still experimental and the solver kernel may not work when the input case exceeds some internal limitations, which may not be directly connected with the input case size.
When this happens, flow will exit with an error.
For example, when running with spe1/SPE10_MODEL1.DATA:
[...]
ERROR: decode_debuginfo_bicgstab: HW kernel was aborted because it ran for more than 2000000000 clock cycles.
ERROR: detected unrecoverable FPGA error (ABRT=1,SIG=0,OVF=0).
[...]
or, when running with norne/NORNE_ATW2013.DATA and using graph coloring instead of the default level scheduling reordering (level scheduling would work fine):
[...]
ERROR: Current reordering exceeds maximum number of columns per color limit: 9736/8192.
ERROR: findPartitionColumns failed (-1).
[...]
(note that the format of the error messages above may change with the evolution of the source code).
The FPGA acceleration does not support ensemble runs, because currently there is only one solver instance on the FPGA which cannot be shared among different processes.
Hence, the first flow process that starts using the FPGA will not release it until it finishes, preventing other processes from using the FPGA for the whole run time.
Additional documentation
For further information, please refer to this paper, currently on arXiv: Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs
This paper provides an in-depth description of the ILU0-BiCGSTAB solver kernel, including FPGA performance figures gathered with a slightly older version of the Flow simulator (2020.10-rc4) and a comparison with software-only and GPU implementations.
Notes on HBM/DDR4 memory banks usage
This section explains the rationale behind the memory bank usage on the current target platform.
The kernel needs some memory buffers to store the ILU matrices as provided by the decomposition pre-processing step performed on the host, along with the initial values for the input system and some additional data.
We chose to store most of these data in the DDR4 memory banks for two reasons. First, they offer a larger capacity (16 GB per bank, two banks available) than the HBM memory stacks (256 MB per port, 32 ports available, because the Xilinx shell does not use the crossbar). Second, on the machine we used, the data bandwidth between the host and the DDR4 memory was higher than between the host and the HBM banks (this may have been a configuration issue specific to that machine, but we could not test the board on another host to confirm it).
Moreover, the HBM memory is used by the solver mainly to transfer data between the various units of its pipeline (and to hold the final results), hence the lower latency offered by the HBM ports was beneficial for the overall performance.
Re-targeting the solver for a platform without HBM memory stacks (e.g., the Xilinx Alveo U200/U250 boards) should be possible.
However, the limited number of DDR4 memory ports available on those boards (up to 4) and the additional routing complexity (the DDR4 ports would most probably be implemented in different SLRs than the solver kernel, incurring additional routing delays) would make it difficult to reach a clock speed comparable to that of the current design on the Xilinx Alveo U280 (around 280 MHz).
Reporting issues
Issues specific to the FPGA bitstream can be reported in the git issue tracker at:
http://github.com/OPM/FPGA/issues