Course Outline

Tencent Hunyuan Production Fundamentals

  • Overview of Tencent Hunyuan model serving scenarios.
  • Production characteristics of large and MoE models.
  • Common latency, throughput, and cost bottlenecks.
  • Defining service-level objectives for inference workloads.

Deployment Architecture and Serving Flow

  • Core components of a production inference stack.
  • Choosing between containerized, on-premise, and cloud deployment models.
  • Model loading, request routing, and GPU allocation basics.
  • Designing for reliability and operational simplicity.

Latency Optimization in Practice

  • Using optimized inference engines such as TensorRT where applicable.
  • KV-cache concepts and practical cache tuning.
  • Reducing startup, warmup, and response overhead.
  • Measuring time to first token and token generation speed.
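
As an illustration of the measurement topic above, the following is a minimal sketch of timing a streamed response. It assumes the serving client exposes the response as a Python iterable of tokens; the names `time_stream` and `StreamTiming` are hypothetical and not part of any Tencent Hunyuan API. The clock is injectable so the logic can be checked without a live endpoint.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float            # seconds from request start to the first token
    tokens: int              # total tokens received
    decode_tok_per_s: float  # tokens after the first, per second of decode time


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and report time to first token and decode rate."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        last = clock()          # timestamp when this token arrived
        if first is None:
            first = last        # arrival time of the first token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first  # time spent after the first token
    rate = (count - 1) / decode_time if decode_time > 0 else 0.0
    return StreamTiming(first - start, count, rate)
```

In practice the iterable would come from a streaming client call, and the start time should be taken just before the request is sent so that network and queueing delay are included in the time to first token.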

Throughput, Batching, and GPU Efficiency

  • Continuous batching and request batching strategies.
  • Managing concurrency and queue behavior.
  • Improving GPU utilization without harming user experience.
  • Handling long-context and mixed-workload requests.

Quantization and Cost Control

  • Why quantization matters for production serving.
  • Practical trade-offs of FP16, INT8, and other common precision options.
  • Balancing model quality, latency, and infrastructure cost.
  • Building a simple cost optimization checklist.
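
To make the precision trade-off concrete, here is a rough sketch that estimates weight memory at different precisions. It assumes the usual storage sizes (2 bytes per parameter for FP16, 1 for INT8, 0.5 for INT4) and counts weights only; KV cache, activations, and quantization metadata are deliberately ignored. The function name `weight_memory_gib` is hypothetical.

```python
# Assumed storage cost per parameter, in bytes, for each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, precision: str) -> float:
    """Estimate memory needed for model weights alone, in GiB."""
    try:
        bytes_per_param = BYTES_PER_PARAM[precision.lower()]
    except KeyError:
        raise ValueError(f"unknown precision: {precision!r}") from None
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Illustrative parameter count only, not a specific Hunyuan model.
    params = 7_000_000_000
    for precision in ("fp16", "int8", "int4"):
        print(f"{precision}: {weight_memory_gib(params, precision):.1f} GiB")
```

An estimate like this is only a starting point for a cost checklist: it shows how many GPUs the weights alone require, after which quality and latency still need to be measured at each precision.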

Operations, Monitoring, and Readiness Review

  • Autoscaling triggers for inference services.
  • Monitoring latency, throughput, cache usage, and GPU health.
  • Logging, alerting, and incident response basics.
  • Reviewing a reference deployment and creating an improvement plan.

Requirements

  • Fundamental understanding of large language model deployment and inference workflows.
  • Experience with containers, cloud or on-premise infrastructure, and API-based services.
  • Proficiency in Python or systems engineering.

Audience

  • ML engineers deploying LLMs into production environments.
  • Platform engineers responsible for managing GPU-based inference services.
  • Solution architects designing scalable AI serving platforms.

Duration

  • 14 hours
