On a B200, the nvjet_tst_16x64_64x16_4x1_v_bz_TNN kernel is used, and it takes roughly 8.1 microseconds. On an H200, the nvjet_tst_64x8_64x16_4x1_v_bz_TNT kernel is ...
Abstract: A mixed-precision analog compute-in-memory (Mix-ACIM) is presented for mixed-precision vector-matrix multiplication (VMM). The design features an all-analog current-domain fixed-point (FxP) ...
This code accompanies the blog post "Matrix Multiplication Faster Than Nvidia, Sometimes". It provides a CUDA kernel for single-precision matrix-matrix multiplication, with two notable features: use of ...
Abstract: The demand for high-speed matrix multiplication continues to grow due to recent developments in image processing, graphics processing, digital signal processing and communication via ...