Updated 2026-04-04

Campus Recruitment: AI Software Application Engineer (Model Inference / Optimization)

Shanghai
Master's degree or above
Major: unrestricted
Job Description
THE PERSON: Success in this role requires deep knowledge of data-center AI workloads such as LLM, Generative AI, Recommendation, NLP, Video Analytics, and/or transformer … AI across cloud, client, edge … The candidate needs hands-on experience with various AI models, end-to-end pipelines, and industry frameworks / SDKs and solutions.

KEY RESPONSIBILITIES:
• High-Performance Kernel Development: Design, implement, and optimize high-performance GPU kernels for AI/ML workloads to maximize hardware utilization.
• Performance Optimization: Analyze and optimize kernel execution for latency and throughput, addressing bottlenecks in memory bandwidth, instruction latency, and thread divergence.
• Workload Analysis: Evaluate the end-to-end performance impact of individual kernels on full-stack AI models, ensuring that micro-optimizations translate to application-level speedups.
• Profiling & Tuning: Utilize advanced GPU profiling tools (e.g., ROCm Profiler, PyTorch Profiler) to identify performance cliffs, pipeline stalls, and memory-hierarchy inefficiencies.
• Architecture Adaptation: Tailor implementation strategies to leverage specific features of modern GPU architectures (e.g., Matrix Cores, HBM characteristics).
• Framework Integration: Collaborate with software stack teams to expose optimized kernels within high-level frameworks and inference engines.

PREFERRED EXPERIENCE:
• GPU Architecture Mastery: In-depth understanding of modern GPU architectures, including streaming multiprocessors (SMs/CUs), the memory hierarchy (registers, shared memory, L1/L2 cache, HBM), and warp/wavefront execution models.
• Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming.
• Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads, interpreting low-level metrics to drive architecture-aware optimizations.
• Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers.
• Python DSLs & Triton: Experience implementing kernels in OpenAI Triton or other Python-based DSLs for agile kernel development and auto-tuning.
• Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM.
• Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions).
• Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures, or working with cross-platform HPC libraries.

ACADEMIC CREDENTIALS:
• MS candidates graduating in 2025/2026
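The "Performance Optimization" responsibility above centers on telling memory-bound kernels from compute-bound ones. A minimal roofline-style sketch of that analysis in plain Python (the peak-FLOPS and bandwidth numbers are illustrative assumptions, not the specs of any particular GPU):

```python
# Roofline-style check: is a kernel memory-bound or compute-bound?
# Hardware numbers below are illustrative assumptions, not real specs.
PEAK_FLOPS = 100e12  # assumed peak compute: 100 TFLOP/s
PEAK_BW = 2e12       # assumed peak HBM bandwidth: 2 TB/s

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved

def attainable_flops(intensity):
    """Roofline model: min(peak compute, bandwidth * intensity)."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

# Example: fp32 vector add c = a + b over n elements.
# Per element: reads 8 bytes, writes 4 bytes, performs 1 FLOP.
n = 1 << 20
ai = arithmetic_intensity(flops=n, bytes_moved=12 * n)
print(f"intensity  = {ai:.3f} FLOP/byte")
print(f"attainable = {attainable_flops(ai) / 1e9:.1f} GFLOP/s")
```

With these assumed numbers the vector add sits far below the compute roof, so its attainable throughput is capped by bandwidth; the optimization levers are then data movement (fusion, data types, layout), not instruction count.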