Nonfiction
Book
Details
PUBLISHED
EDITION
DESCRIPTION
xxvi, 1031 pages : color illustrations ; 24 cm
ISBN/ISSN
LANGUAGE
NOTES
Includes index
Introduction and AI system overview -- AI system hardware overview -- OS, Docker, and Kubernetes tuning for GPU-based environments -- Tuning distributed networking communication -- GPU-based storage I/O optimizations -- GPU architecture, CUDA programming, and maximizing occupancy -- Profiling and tuning GPU memory access patterns -- Occupancy tuning, warp efficiency, and instruction-level parallelism -- Increasing CUDA kernel efficiency and arithmetic intensity -- Intra-kernel pipelining, warp specialization, and cooperative thread block clusters -- Inter-kernel pipelining, synchronization, and CUDA stream-ordered memory allocations -- Dynamic scheduling, CUDA graphs, and device-initiated kernel orchestration -- Profiling, tuning, and scaling PyTorch -- PyTorch compiler, OpenAI Triton, and XLA backends -- Multinode inference, parallelism, decoding, and routing optimizations -- Profiling, debugging, and tuning inference at scale -- Scaling disaggregated prefill and decode for inference -- Advanced prefill-decode and KV cache tuning -- Dynamic and adaptive inference engine optimizations -- AI-assisted performance optimizations and scaling toward multimillion GPU clusters -- Appendix: AI systems performance checklist (175+ items)