Campus Recruitment: AI Software Application Engineer (Model Inference/Optimization)
Shanghai
Master's degree or above
Any major
Job Description
Responsibilities
THE PERSON:
Success in this role requires deep knowledge of data center AI workloads such as LLM, Generative AI, recommendation, NLP, video analytics, and/or transformer … AI across cloud, client, and edge… The candidate needs hands-on experience with various AI models, end-to-end pipelines, and industry frameworks/SDKs and solutions.
KEY RESPONSIBILITIES:
• High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization.
• Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence.
• Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups.
• Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory hierarchy inefficiencies (a minimal profiling sketch follows this list).
• Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics).
• Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines.
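As a rough illustration of the profiling workflow referenced in the Profiling & Tuning bullet above, the sketch below is not part of the original posting; the workload, tensor sizes, and variable names are hypothetical, and it assumes a PyTorch build with CUDA support. It uses torch.profiler to rank the GPU kernels that dominate a toy matmul:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Hypothetical toy workload: a single large matmul on the GPU.
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        c = a @ b
        torch.cuda.synchronize()  # ensure the kernel has finished before reporting

    # Rank kernels by total GPU time to see where the cycles actually go.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

The same table, sorted by memory or CPU time instead, is a common first step before reaching for lower-level tools such as ROCm Profiler or Nsight.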
PREFERRED EXPERIENCE:
• GPU Architecture Mastery: In-depth understanding of modern GPU architectures, including streaming multiprocessors (SMs/CUs), the memory hierarchy (registers, shared memory, L1/L2 cache, HBM), and warp/wavefront execution models.
• Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming.
• Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads, interpreting low-level metrics to drive architecture-aware optimizations.
• Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers.
• Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs for agile kernel development and auto-tuning (see the sketch after this list).
• Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM.
• Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions).
• Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries.
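For the Python DSLs & Triton bullet above, the following minimal sketch shows the kind of kernel meant. It is an illustration, not company code; names such as add_kernel and the BLOCK_SIZE value are arbitrary choices. It implements an element-wise addition in Triton and launches it from PyTorch:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the tail block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

Because Triton compiles to both CUDA and ROCm backends, kernels written this way are also one common path to the hardware portability mentioned in the Hardware Agnosticism bullet.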
ACADEMIC CREDENTIALS:
• Master's (MS) candidates graduating in 2025/2026

