Abstract: Large language models (LLMs), exemplified by ChatGPT and DeepSeek, are developing rapidly and are widely deployed in tasks such as text generation and intelligent assistants. However, these models also face severe privacy and security risks. In high-security domains such as healthcare and finance in particular, threats including model theft and data privacy leakage are often key obstacles to deploying large models. Existing security solutions for protecting large-model inference typically suffer from limitations: they either lack runtime protection for the inference computation itself, or incur prohibitive computation and communication costs in practice. Confidential computing, which builds a secure inference environment on trusted execution environment (TEE) hardware, is a practical and effective technology for realizing secure LLM inference. This study therefore proposes a confidential-computing-based secure inference scheme for large language models. The scheme verifies the integrity of the inference environment, the model weight parameters, and the model image files through remote attestation; encrypts inference traffic via confidential interconnection built on TEE hardware; and protects the privacy of user prompts in multi-user scenarios by isolating the inference contexts of different users. It thus provides end-to-end, full-chain security protection for LLM inference while verifying the integrity of the execution environment, achieving efficient and secure confidential inference. Furthermore, a prototype system is implemented on a heterogeneous TEE server platform (SEV and CSV), and its security and performance are evaluated. The results show that, while meeting the stated security goals, the scheme introduces a performance overhead of theoretically no more than 1% relative to native AI model inference, which is negligible in practical applications.