Hi @TerryT9, I encountered the same issue and I'm not sure how to resolve it. Could you share how you solved it?
Hi @Gianthard-cyh, would you mind sharing which model you were using?
I'm using Llama-3.2-1B-Instruct-f16.gguf. By the way, I got the model running successfully after increasing the size limit of the MUL_MAT op and changing the precision option of the NPU (the Convert op fails to execute without this).
However, as stated in previous issues, the NPU backend achieves only around 1/3 of the CPU backend's performance, so I think more profiling and optimization work could be done. I'm happy to help with that.
My device is a OnePlus Ace 3 with Snapdragon 8 Gen 2.
```diff
--- a/ggml/src/ggml-qnn/graph.cpp
+++ b/ggml/src/ggml-qnn/graph.cpp
@@ -192,8 +192,15 @@ qnn_graph::qnn_graph(const std::string &graph_name, QNNBackend device, std::shar
 graph_vtcm_config.option = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
 graph_vtcm_config.customConfig = &vtcm_config;
+ QnnHtpGraph_CustomConfig_t precision_config;
+ precision_config.option = QNN_HTP_GRAPH_CONFIG_OPTION_PRECISION;
+ precision_config.precision = QNN_PRECISION_FLOAT16;
+ QnnGraph_Config_t graph_precision_config;
+ graph_precision_config.option = QNN_GRAPH_CONFIG_OPTION_CUSTOM;
+ graph_precision_config.customConfig = &precision_config;
+
 const QnnGraph_Config_t *graph_configs[] = {&graph_hvx_config, &graph_dlbc_config, &graph_vtcm_config,
-                                            &graph_opt_config, nullptr};
+                                            &graph_opt_config, &graph_precision_config, nullptr};
 error = qnn_interface->qnn_graph_create(qnn_context, graph_name.c_str(), graph_configs, &graph_handle);
 } else {
 error = qnn_interface->qnn_graph_create(qnn_context, graph_name.c_str(), nullptr, &graph_handle);
```
Originally posted by @Gianthard-cyh in #20