Publications

* Equal contribution, ✉ Corresponding author

2024

  1. FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
    Pengxiang Li*, Zhi Gao*, Bofei Zhang*, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, and Qing Li
    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
  2. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
    Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang
    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
  3. OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
    Zihao Wang, Shaofei Cai, Zhancun Mu, Haowei Lin, Ceyao Zhang, Xuejie Liu, Qing Li, Anji Liu, Xiaojian Ma, and Yitao Liang
    Neural Information Processing Systems (NeurIPS), 2024
  4. Task-oriented Sequential Grounding in 3D Scenes
    Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, and Qing Li
    arXiv preprint arXiv:2408.04034, 2024
  5. End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations (Spotlight, top 3.5%)
    Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, and Qing Li
    International Conference on Machine Learning (ICML), 2024
  6. Unifying 3D Vision-Language Understanding Via Promptable Queries
    Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, and Qing Li
    European Conference on Computer Vision (ECCV), 2024
  7. VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
    Yue Fan*, Xiaojian Ma*, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li
    European Conference on Computer Vision (ECCV), 2024
  8. Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting
    Jun Guo*, Xiaojian Ma*, Yue Fan, Huaping Liu, and Qing Li
    arXiv preprint arXiv:2403.15624, 2024
  9. Parameter-Efficient Fine-Tuning for Pre-Trained Vision Models: A Survey
    Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du
    arXiv preprint arXiv:2402.02242, 2024
  10. SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
    Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang
    European Conference on Computer Vision (ECCV), 2024
  11. CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
    Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  12. An Embodied Generalist Agent in 3D World
    Jiangyong Huang*, Silong Yong*, Xiaojian Ma*, Xiongkun Linghu*, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang
    International Conference on Machine Learning (ICML), 2024
  13. Neural-Symbolic Recursive Machine for Systematic Generalization
    International Conference on Learning Representations (ICLR), 2024
  14. Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World
    Rujie Wu*, Xiaojian Ma*, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, and Yizhou Wang
    International Conference on Learning Representations (ICLR), 2024

2023

  1. Learning Non-Markovian Decision-Making from State-Only Sequences
    Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, and Sirui Xie
    Neural Information Processing Systems (NeurIPS), 2023
  2. A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics (Notable-top-25%)
    International Conference on Learning Representations (ICLR), 2023
  3. 3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment
    Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li
    International Conference on Computer Vision (ICCV), 2023
  4. SQA3D: Situated Question Answering in 3D Scenes
    International Conference on Learning Representations (ICLR), 2023

2022

  1. Close the Loop of Neural Perception, Grammar Parsing, and Symbolic Reasoning
    Qing Li
    University of California, Los Angeles, 2022

2021

  1. Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
    Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, and Matthew Brown
    arXiv preprint arXiv:2112.07074, 2021
  2. SMART: A Situation Model for Algebra Story Problems via Attributed Grammar
    Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu
    AAAI Conference on Artificial Intelligence (AAAI), 2021
  3. Learning by Fixing: Solving Math Word Problems with Weak Supervision
    Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu
    AAAI Conference on Artificial Intelligence (AAAI), 2021
  4. YouRefIt: Embodied Reference Understanding with Language and Gesture (Oral)
    Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Song-Chun Zhu, Tao Gao, Yixin Zhu, and Siyuan Huang
    International Conference on Computer Vision (ICCV), 2021
  5. VLGrammar: Grounded Grammar Induction of Vision and Language
    Yining Hong, Qing Li, Song-Chun Zhu, and Siyuan Huang
    International Conference on Computer Vision (ICCV), 2021

2020

  1. A Competence-Aware Curriculum for Visual Concepts Learning Via Question Answering (Oral)
    Qing Li, Siyuan Huang, Yining Hong, and Song-Chun Zhu
    European Conference on Computer Vision (ECCV), 2020
  2. Closed Loop Neural-Symbolic Learning Via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning (Best Paper in ICML Workshop)
    International Conference on Machine Learning (ICML), 2020

2019

  1. Why Does a Visual Question Have Different Answers?
    Nilavra Bhattacharya, Qing Li, and Danna Gurari
    International Conference on Computer Vision (ICCV), 2019
  2. VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People
    Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P Bigham
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019

2018

  1. Tell-and-Answer: Towards Explainable Visual Question Answering Using Attributes and Captions (Oral)
    Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo
    Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018
  2. VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions
    Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo
    European Conference on Computer Vision (ECCV), 2018
  3. VizWiz Grand Challenge: Answering Visual Questions from Blind People (Spotlight)
    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018

2017

  1. Learning Hierarchical Video Representation for Action Recognition
    Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo
    International Journal of Multimedia Information Retrieval, 2017

2016

  1. Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation (Best Paper Finalist)
    Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo
    International Conference on Multimedia Retrieval, 2016