Zeyu Zhang

E-mail: zeyuzhang@meta.com / qxc4fh@virginia.edu / zhang@zeyu.tw

INTRODUCTION

I am a PhD student at the University of Virginia (UVA), focusing on systems for training, inference, and evaluation of Large Language Models (LLMs) and recommendation models. My research centers on optimizing long-context models and improving the communication, computation, and memory efficiency of the KV cache. I also work on mitigating straggler issues in large-scale Machine Learning (ML) training. Prior to my PhD, I worked on network communication optimization, including user-space networking stacks and Network Function Virtualization (NFV). Earlier in my career, I also conducted research in recommender systems and algorithms.

EXPERIENCE

Meta (Formerly Facebook), Sunnyvale, California, USA (06/2025-Now)

Research Scientist Intern (AI Systems Machine Learning)
Working on systems for large language and recommendation models.

Harvard University, Boston, Massachusetts, USA (03/2024-08/2024)

Visiting Researcher
Worked on systems for Large Language Models (LLMs). The main topic was LLM KV cache quantization: we proposed homomorphic quantization for the LLM KV cache to address communication, computation, and memory issues in disaggregated LLM serving.

Microsoft, Seattle, Washington, USA (05/2023-08/2023, 09/2024-12/2024)

Visiting Researcher
Worked with DeepSpeed on LLM training, especially on long-context-model training.
Worked with Azure on multi-modality model serving.

University of Virginia, Charlottesville, Virginia, USA (08/2021-Now)

PhD in Computer Science
Working on systems for AI.
Worked on LLM KV cache optimization and proposed ZACK, which reduces KV cache size along the hidden-size dimension and is orthogonal to quantization-based and token-eviction-based methods. I also enhanced the self-attention kernel used for ZACK.
Worked on long-context-model inference and proposed CSPS and PecSched for efficient long-context-model serving.
Also worked on straggler problems in Machine Learning (ML) training.

Intel, Shanghai, China (06/2019-11/2019)

Intern in Network and Custom Logic Group (NCLG)
Worked on a user-space networking stack (in collaboration with Cisco). I optimized NGINX on top of VPP, an open-source high-performance packet-processing framework, to increase its throughput, improve its scalability, and reduce its latency and CPU usage.

Shanghai Jiao Tong University, Shanghai, China (09/2017-03/2020)

Master's in Software Engineering
Worked on network function virtualization (NFV).

Wuhan University, Wuhan, China (09/2013-06/2017)

Bachelor's in Software Engineering
Identified consumer groups in shopping malls and predicted their group behaviors to help maximize merchants' profits.
Designed an improved Apriori algorithm that efficiently performs trajectory prediction in large shopping malls.

RESEARCH PAPERS

Zeyu Zhang and Haiying Shen. PecSched: Preemptive and efficient cluster scheduling for LLM inference. arXiv: 2409.15104v2, 2025. [Link]
Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan Yu. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference. In ACM SIGCOMM 2025 Conference (SIGCOMM '25), Coimbra, Portugal, September 8-11, 2025. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3718958.3750481 [Link]
Zeyu Zhang and Haiying Shen. FDC: Fast KV dimensionality compression for efficient LLM inference. arXiv: 2408.04107v3, 2025. [Link]
Suraiya Tairin, Zeyu Zhang, and Haiying Shen. Revisiting the straggling problem in GPU-based distributed deep learning training. In 2025 34th International Conference on Computer Communications and Networks (ICCCN), pages 1-9, 2025.
Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. ModServe: Scalable and resource-efficient large multimodal model serving. arXiv: 2502.00937v2, 2025. [Link]
Haiying Shen and Zeyu Zhang. Deep learning training job scheduling for proactive straggler reduction. In 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid), pages 1-12, 2025.
Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan Yu. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference. arXiv: 2502.03589v1, 2025. [Link]
Zeyu Zhang and Haiying Shen. ZACK: Zero-overhead LLM inference acceleration via dimensionality compression of the key-value cache. arXiv: 2408.04107v2, 2024. [Link]
Zeyu Zhang and Haiying Shen. CSPS: A communication-efficient sequence-parallelism based serving system for transformer based models with long prompts. arXiv: 2409.15104v1, 2024. [Link]
Suraiya Tairin, Haiying Shen, and Zeyu Zhang. Embracing uncertainty for equity in resource allocation in ML training. In Proceedings of the 52nd International Conference on Parallel Processing, ICPP '23, pages 423-432, New York, NY, USA, 2023. Association for Computing Machinery.
Zeyu Zhang and Weiping Zhu. Location and motion prediction of consumers in a large shopping mall. In 2017 Fifth International Conference on Advanced Cloud and Big Data (CBD), pages 250-255, 2017.

PATENTS

A Network Request Processing System and Method (Jian Li, Zeyu Zhang, Haibing Guan) (Chinese Patent Number: 202010059255.0)