Zeyu Zhang

Email: qxc4fh@virginia.edu / zhang@zeyu.tw

INTRODUCTION

I am a PhD student at the University of Virginia (UVA), focusing on systems for training, inference, and evaluation of Large Language Models (LLMs) and recommendation models. My research centers on optimizing long-context models and improving the communication, computation, and memory efficiency of the KV cache. I also work on mitigating straggler issues in large-scale Machine Learning (ML) training. Prior to my PhD, I worked on network communication optimization, including user-space networking stacks and Network Function Virtualization (NFV). Earlier in my career, I also conducted research on recommender systems and algorithms.

EXPERIENCES

Meta (Formerly Facebook), Sunnyvale, California, USA (06/2025-12/2025)

Research Scientist Intern (AI Systems Machine Learning)
Worked on systems for large language and recommendation models.

Harvard University, Boston, Massachusetts, USA (03/2024-08/2024)

Visiting Researcher
Worked on systems for Large Language Models (LLMs). The main topic was LLM KV cache quantization, and we proposed homomorphic quantization for the LLM KV cache to address communication, computation, and memory issues in disaggregated LLM serving.

Microsoft, Seattle, Washington, USA (05/2023-08/2023, 09/2024-12/2024)

Visiting Researcher
Worked with DeepSpeed on LLM training, especially on long-context-model training.
Worked with Azure on multi-modality model serving.

University of Virginia, Charlottesville, Virginia, USA (08/2021-Present)

PhD in Computer Science
Working on systems for AI.
Worked on LLM KV cache optimization and proposed ZACK to reduce KV cache size along the hidden-size dimension, an approach orthogonal to quantization-based and token-eviction-based methods. I also enhanced the self-attention kernel used for ZACK.
Worked on long-context-model inference and proposed CSPS and PecSched for efficient long-context-model serving.
Also worked on straggler problems in Machine Learning (ML) training.

Intel, Shanghai, China (06/2019-11/2019)

Intern in Network and Custom Logic Group (NCLG)
Worked on a user-space networking stack (in collaboration with Cisco). I optimized NGINX on top of VPP, an open-source high-performance packet processing framework, to increase throughput, improve scalability, and reduce latency and CPU usage.

Shanghai Jiao Tong University, Shanghai, China (09/2017-03/2020)

Master in Software Engineering
Worked on Network Function Virtualization (NFV).

Wuhan University, Wuhan, China (09/2013-06/2017)

Bachelor in Software Engineering
Identified consumer groups in shopping malls and predicted their group behaviors to help merchants maximize profits.
Designed an improved Apriori algorithm that efficiently predicts consumer trajectories in large shopping malls.

RESEARCH PAPERS

Zeyu Zhang and Haiying Shen. Straggler tolerant and resilient DL training on homogeneous GPUs. In 2026 35th International Conference on Computer Communications and Networks (ICCCN), Honolulu, Hawaii, USA, July 2026.
Zhaoyuan Su, Zeyu Zhang, Tingfeng Lan, Zirui Wang, Haiying Shen, Juncheng Yang, and Yue Cheng. MorphServe: Efficient and workload-aware LLM serving via runtime quantized layer swapping and KV cache resizing. In Proceedings of the Ninth Conference on Machine Learning and Systems (MLSys 2026), Bellevue, WA, USA, May 2026.
Zeyu Zhang and Haiying Shen. PEACE: Preemptive and efficient cluster scheduling for LLM inference with mixed prompts. In Proceedings of the 40th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2026), New Orleans, USA, May 2026.
Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing, SoCC '25, Online, USA, pages 817–830, November 2025. Association for Computing Machinery.
Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan Yu. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference. In Proceedings of the ACM SIGCOMM 2025 Conference, SIGCOMM '25, São Francisco Convent, Coimbra, Portugal, pages 1245–1247, September 2025. Association for Computing Machinery. https://doi.org/10.1145/3718958.3750481
Zeyu Zhang and Haiying Shen. FDC: Fast KV dimensionality compression for efficient LLM inference. arXiv: 2408.04107v3, 2025.
Suraiya Tairin, Zeyu Zhang, and Haiying Shen. Revisiting the straggling problem in GPU-based distributed deep learning training. In 2025 34th International Conference on Computer Communications and Networks (ICCCN), Tokyo, Japan, pages 1–9, August 2025.
Haiying Shen and Zeyu Zhang. Deep learning training job scheduling for proactive straggler reduction. In 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Tromsø, Norway, pages 1–12, May 2025.
Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan Yu. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference. arXiv: 2502.03589v1, 2025.
Zeyu Zhang and Haiying Shen. ZACK: Zero-overhead LLM inference acceleration via dimensionality compression of the key-value cache. arXiv: 2408.04107v2, 2024.
Zeyu Zhang and Haiying Shen. CSPS: A communication-efficient sequence-parallelism based serving system for transformer based models with long prompts. arXiv: 2409.15104v1, 2024.
Suraiya Tairin, Haiying Shen, and Zeyu Zhang. Embracing uncertainty for equity in resource allocation in ML training. In Proceedings of the 52nd International Conference on Parallel Processing, ICPP '23, Salt Lake City, UT, USA, pages 423–432, August 2023. Association for Computing Machinery.
Zeyu Zhang and Weiping Zhu. Location and motion prediction of consumers in a large shopping mall. In 2017 Fifth International Conference on Advanced Cloud and Big Data (CBD), Shanghai, China, pages 250–255, August 2017.

PATENTS

A Network Request Processing System and Method (Jian Li, Zeyu Zhang, Haibing Guan) (Chinese Patent Number: 202010059255.0)