OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653)

* Use events instead of clFinish, where possible
* OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel
* Reduce queueing overhead for contiguous tensors by using single mul kernel call
* Adapt to #1612 cl_mem malloc changes
* Reduce code duplication between cuda and opencl branches
* Improve implementation
author: 0cc4m <picard12@live.de> 2023-06-04 08:12:05 +0200
committer: GitHub <noreply@github.com> 2023-06-04 08:12:05 +0200
commit: dcb2ed48268e421baf25adc00d602dad0f415564 (patch)
tree: 261ef84fe660d06fce90c58fc01a16ae0e69be52 /ggml.c
parent: d8bd0013e8768aaa3dc9cfc1ff01499419d5348e (diff)
1 files changed, 7 insertions, 0 deletions
diff --git a/ggml.c b/ggml.c
index 4cd0d72..91552c9 100644
--- a/ggml.c
+++ b/ggml.c
@@ -8134,6 +8134,13 @@ static void ggml_compute_forward_mul_f32(
         }
         return;
     }
+#elif defined(GGML_USE_CLBLAST)
+    if (src1->backend == GGML_BACKEND_CL) {
+        if (ith == 0) {
+            ggml_cl_mul(src0, src1, dst);
+        }
+        return;
+    }
 #endif
 
     const int64_t nr = ggml_nrows(src0);
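
The hunk above follows a simple dispatch pattern: every worker thread enters the forward function, but when `src1` lives on the OpenCL backend only thread 0 (`ith == 0`) issues the GPU call, and all threads return before reaching the CPU loop. The sketch below illustrates that pattern in isolation. The names `struct tensor`, `fake_cl_mul`, `mul_f32` and `offload_calls` are hypothetical stand-ins and not part of ggml, the stand-in "GPU" function simply multiplies on the host, and both inputs are assumed to have the same length.

```c
#include <assert.h>

/* Hypothetical stand-ins; the real ggml types carry many more fields. */
enum backend { BACKEND_CPU, BACKEND_CL };

struct tensor {
    enum backend backend; /* where the tensor data lives */
    int          n;       /* number of elements */
    float       *data;
};

/* Counts how often the offload path was taken (for illustration only). */
static int offload_calls = 0;

/* Stand-in for ggml_cl_mul: a host loop in place of a GPU kernel launch. */
static void fake_cl_mul(const struct tensor *a, const struct tensor *b,
                        struct tensor *dst) {
    offload_calls++;
    for (int i = 0; i < dst->n; i++) {
        dst->data[i] = a->data[i] * b->data[i];
    }
}

/* ith: index of the calling worker, nth: total number of workers. */
static void mul_f32(int ith, int nth, const struct tensor *a,
                    const struct tensor *b, struct tensor *dst) {
    assert(nth > 0 && ith >= 0 && ith < nth);

    if (b->backend == BACKEND_CL) {
        /* Only one worker submits the offloaded multiply ... */
        if (ith == 0) {
            fake_cl_mul(a, b, dst);
        }
        /* ... and every worker skips the CPU path. */
        return;
    }

    /* CPU path: each worker handles an interleaved slice of the elements. */
    for (int i = ith; i < dst->n; i += nth) {
        dst->data[i] = a->data[i] * b->data[i];
    }
}
```

Calling `mul_f32` once per worker index with `b->backend == BACKEND_CL` runs the stand-in offload exactly once, whatever the worker count. With `BACKEND_CPU` the same calls split the elements across workers instead.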