GPU acceleration feature in QLM¶
Computationally intensive simulations involving the LinAlg simulator can be run on a GPU (if your license permits). In this notebook we show how to offload simulations to a GPU and present the key points of this feature. Typical cases where a performance improvement can be expected are also discussed.
Limitation: Depending on the free memory available on the GPU, there is an upper limit on the number of qubits that can be simulated. On an NVIDIA V100 with 32 GB of memory, one can simulate 29 qubits in double precision (30 qubits in single precision); the limit can be lower if multiple users are using the same GPU simultaneously.
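As a back-of-the-envelope check, a state vector of n qubits holds 2^n amplitudes of 16 bytes each in double precision (8 bytes in single precision). The sketch below computes this raw estimate only; the simulator may allocate additional workspace on top of the bare state vector, which may explain why the quoted limits are lower than the raw capacity would suggest.
# Rough state-vector memory estimate; actual GPU usage can be higher because
# the simulator may need extra workspace beyond the bare amplitudes.
def state_vector_gib(n_qubits, bytes_per_amplitude=16):  # 16 B: double-precision complex
    return 2 ** n_qubits * bytes_per_amplitude / 2 ** 30

print(state_vector_gib(29))     # 8.0 GiB in double precision
print(state_vector_gib(30, 8))  # 8.0 GiB in single precision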
Executing simulations on a GPU¶
To execute a simulation on a GPU, it suffices to set "use_GPU" to "True" in the constructor of LinAlg, as shown below. By default, "use_GPU" is set to "False" and the simulation is executed on a CPU. Once the QPU is initialized, jobs (in sampling or observable-evaluation mode) can be submitted as usual, so one can seamlessly toggle between GPU and CPU execution with minimal changes to the code.
from qlmaas.qpus import LinAlg
linalg_gpu = LinAlg(use_GPU=True)  # simulations run on a GPU
linalg_cpu = LinAlg()              # default: simulations run on a CPU
The GPU used by LinAlg to simulate a quantum circuit can be selected with the gpu_index argument in the constructor of LinAlg. By default, the GPU with index 0 is used:
linalg_selected_gpu = LinAlg(use_GPU=True, gpu_index=1)
Supported gateset¶
Currently, we support any arbitrary gate of arity up to 2. We provide optimized implementations of the standard gates, i.e., H, X, Y, Z, RX, RY, RZ, PH, T, SWAP and CNOT. We also support any controlled version of arbitrary one- and two-qubit gates, i.e., of the form H.ctrl().ctrl()...ctrl(). The GPU-based LinAlg simulator also accepts any user-defined AbstractGate as long as its arity is 1 or 2. However, for performance reasons, we recommend rewriting circuits using only the standard gates rather than custom AbstractGates.
In addition, the GPU simulator accepts intermediate measurements, resets, and gates controlled by classical bits.
Submitting jobs involving any unsupported gate raises a QPUException.
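As an illustration, the sketch below (the gate name "MYGATE" and its matrix are made up for this example) builds a circuit mixing a controlled standard gate with a user-defined single-qubit AbstractGate, both of which the GPU simulator accepts:
import numpy as np
from qat.lang.AQASM import Program, H, AbstractGate

# A made-up single-qubit gate, defined by its matrix (arity 1, one float parameter)
my_gate = AbstractGate(
    "MYGATE", [float], arity=1,
    matrix_generator=lambda theta: np.array([[1, 0], [0, np.exp(1j * theta)]]),
)

prog = Program()
qreg = prog.qalloc(2)
prog.apply(H.ctrl(), qreg)         # controlled version of a standard gate
prog.apply(my_gate(0.5), qreg[0])  # custom AbstractGate of arity 1
circ = prog.to_circ()              # can be submitted to the GPU-based LinAlg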
Example¶
In the following example, we demonstrate the usage and compare the results of a simulation run on a GPU with those obtained on a CPU.
from qat.lang.AQASM import Program, H, CNOT
from qat.core import Observable, Term
prog = Program()
qreg = prog.qalloc(2)
prog.apply(H, qreg[0])
prog.apply(CNOT, qreg)
## Sampling
print("\n####Sampling states####\n")
job_sampling = prog.to_circ().to_job(nbshots=1000)
result = linalg_gpu.submit(job_sampling)
print(result)
result = linalg_cpu.submit(job_sampling)
print(result)
## Observable evaluation
print("\n####Observable evaluation####\n")
job_obs = prog.to_circ().to_job("OBS", nbshots=1000, observable=Observable(2, pauli_terms=[Term(1.0, "ZZ", [0, 1])]))
result = linalg_gpu.submit(job_obs)
print(result)
result = linalg_cpu.submit(job_obs)
print(result)
####Sampling states####

Submitted a new batch: Job143
<qat.qlmaas.result.AsyncResult object at 0x7f1bdd2325a0>
Submitted a new batch: Job144
<qat.qlmaas.result.AsyncResult object at 0x7f1bf81ca030>

####Observable evaluation####

Submitted a new batch: Job146
<qat.qlmaas.result.AsyncResult object at 0x7f1bdd1d6c30>
Submitted a new batch: Job147
<qat.qlmaas.result.AsyncResult object at 0x7f1bdd251310>
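Note that on a remote connection, submit returns an AsyncResult (as visible in the output above) rather than the result itself. A minimal sketch, assuming the qlmaas AsyncResult API, to block until the actual result is available:
# On a remote (qlmaas) connection, submit() returns an AsyncResult;
# join() waits for the remote job to finish and returns the actual Result.
async_result = linalg_gpu.submit(job_sampling)
result = async_result.join()
for sample in result:
    print(sample.state, sample.probability)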
Cases where we can expect an acceleration¶
As a rule of thumb, one should resort to a GPU only when the circuits are large and the overall simulation time is long enough to amortize the overhead (the example above is provided for illustration only).
Sampling mode: a gain can be observed when "nbshots" is nonzero and much smaller than the total number of states. Querying the full state vector by setting "nbshots=0" can hamper performance, as it requires transferring the state vector from the GPU to the CPU; it should be used sparingly, and only for sufficiently deep circuits.
Observable evaluation mode: a gain can be observed both for "nbshots=0" (exact evaluation) and for a finite number of shots, since only a few scalar values are transferred from the GPU to the CPU at the end of the simulation.
In most cases, a performance gain can be obtained when simulating a variational algorithm involving several calls to the QPU, as sketched below.
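A minimal variational-loop sketch (the ansatz and the naive parameter scan are made up for illustration); the state vector stays on the GPU throughout, and each call returns only a scalar expectation value:
import numpy as np
from qat.lang.AQASM import Program, RY, CNOT
from qat.core import Observable, Term

def energy(theta, qpu):
    # One variational step: build the (made-up) one-parameter ansatz
    # and evaluate <ZZ> on the given QPU.
    prog = Program()
    qreg = prog.qalloc(2)
    prog.apply(RY(theta), qreg[0])
    prog.apply(CNOT, qreg)
    obs = Observable(2, pauli_terms=[Term(1.0, "ZZ", [0, 1])])
    job = prog.to_circ().to_job("OBS", observable=obs)
    return qpu.submit(job).join().value  # only a scalar crosses the GPU-CPU boundary

# Naive scan over the single parameter; every call hits the GPU QPU.
best_energy, best_theta = min((energy(t, linalg_gpu), t) for t in np.linspace(0, np.pi, 10))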
For benchmarks comparing the performance on random circuits, see the benchmarks section.
Simulation precision¶
The GPU simulator supports simulations in single and double precision; by default, all calculations are done in double precision. The precision can be chosen explicitly when the QPU is initialized, as shown below.
Using single precision halves the memory usage and (asymptotically) the simulation time. However, if circuits are sufficiently deep, errors accumulate in single precision. An indicator of the simulation accuracy is the norm of the state vector at the end of the simulation, which can be accessed from the "meta_data" field of the result object returned by the QPU. Depending on the final norm and the desired accuracy, one can decide, for any particular simulation, whether single precision is enough or double precision is needed.
linalg_gpu_float = LinAlg(use_GPU=True, precision=1)  # precision=1 selects single precision
result = linalg_gpu_float.submit(job_obs).join()  # wait for the remote job to finish
print(result.meta_data['final_norm'])
Submitted a new batch: Job148
1.000000
Usually, single-precision calculations are enough for a preliminary assessment of how well a variational algorithm works; one can then switch to double precision for an accurate result. Alternatively, in a variational algorithm, one can redirect the first few calls to a single-precision QPU and the rest to a double-precision QPU, as sketched below. Check MixedPrecisionQPU for more details.
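A hand-rolled sketch of this idea (MixedPrecisionQPU is the supported way to do this; the "switch_at" threshold below is arbitrary and for illustration only):
linalg_single = LinAlg(use_GPU=True, precision=1)  # fast, less accurate
linalg_double = LinAlg(use_GPU=True)               # double precision by default

def evaluate(job, iteration, switch_at=20):
    # Arbitrary threshold: early exploratory calls in single precision,
    # later refinement calls in double precision.
    qpu = linalg_single if iteration < switch_at else linalg_double
    return qpu.submit(job).join().value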
Noisy simulations¶
In addition to ideal circuits, noisy simulations can also be run on a GPU; check the noisy simulations notebook for examples.