
Publications by category in reverse chronological order. * denotes co-first authors.


  1. ICML
    Accelerating retrieval-augmented language model serving with speculation
    Zhihao Zhang, Alan Zhu, Lijie Yang, and 4 more authors
    To appear at ICML 2024, 2024
  2. NeurIPS
    Communication Bounds for the Distributed Experts Problem
    Zhihao Jia, Qi Pang, Trung Tran, and 3 more authors (in alphabetical order)
    In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
  3. Preprint
    TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
    Lijie Yang*, Zhihao Zhang*, Zhuofu Chen, and 2 more authors


  1. ASPLOS
    SpecInfer: Accelerating generative LLM serving with speculative inference and token tree verification
    Xupeng Miao*, Gabriele Oliaro*, Zhihao Zhang*, and 7 more authors
    To appear at ASPLOS 2024, 2023
  2. arXiv
    Towards efficient generative large language model serving: A survey from algorithms to systems
    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, and 4 more authors
    arXiv preprint arXiv:2312.15234, 2023


  1. ICLR
    GradSign: Model Performance Inference with Theoretical Insights
    Zhihao Zhang and Zhihao Jia
    In International Conference on Learning Representations, 2021
  2. TITS
    Spatio-temporal graph dual-attention network for multi-agent prediction and tracking
    Jiachen Li, Hengbo Ma, Zhihao Zhang, and 2 more authors
    IEEE Transactions on Intelligent Transportation Systems, 2021


  1. arXiv
    Social-WAGDAT: Interaction-aware trajectory prediction via Wasserstein graph double-attention network
    Jiachen Li, Hengbo Ma, Zhihao Zhang, and 1 more author
    arXiv preprint arXiv:2002.06241, 2020