
Qing Li 李庆

Email: dylan.liqing[at]gmail[dot]com

I am a research scientist and team lead at the Beijing Institute for General Artificial Intelligence (BIGAI), China. I received my Ph.D. in 2022 from the University of California, Los Angeles (UCLA), where I was advised by Professor Song-Chun Zhu. During my Ph.D., I interned at Google Research, Microsoft Azure AI, and Amazon Alexa. Before UCLA, I received my Bachelor's degree in 2015 and my Master's degree in 2018 from the University of Science and Technology of China (USTC).

My long-term research goal is to develop a generalist agent that can perceive the real world, communicate with humans, and learn from feedback. To achieve this goal, I currently focus on:

  • AGI Agents: LLM Agents, Vision-Language-Action (VLA), Embodied Agents
  • Multimodal Understanding: Vision-Language Modeling (VLM), 3D Visual Grounding, Long-term Video Understanding
  • Machine Learning: Neural-Symbolic Learning, Continual Learning, In-Context Learning

Our team is actively recruiting full-time research scientists, engineers, and self-motivated interns. We are also seeking prospective Ph.D. students and long-term collaborators for the TongProgram (通计划). Feel free to contact me if you are interested!


News

2024-08 🔥🔥🔥 Three papers accepted at NeurIPS 2024! Check out these awesome works: FIRE, a dataset for feedback integration and refinement of large multimodal models; UltraEdit, a large-scale (~4M) high-quality dataset for instruction-based image editing; and OmniJARVIS, a novel Vision-Language-Action (VLA) model for instruction following in Minecraft.
2024-08 🔥🔥🔥 Call for papers for the 1st Workshop on Open-World Agents (NeurIPS 2024).
2024-07 Together with Xiaojian Ma and Zhi Gao, I gave a joint tutorial on “Multimodal Generalist Agents: Reasoning, Reflecting, and Learning like Humans” at the TongProgram Summer School 2024.
2024-07 🔥🔥🔥 Three papers accepted at ECCV 2024! Check out these awesome works: PQ3D, a unified model for 3D vision-language understanding; SceneVerse, the first million-scale 3D vision-language dataset; and VideoAgent, an LLM agent that understands videos using a structured memory and four tools.
2024-06 Call for papers for IJCLR 2024, which will take place on 20-22 September 2024 in Nanjing! I will serve as an area chair for neuro-symbolic learning and reasoning. If you would like to serve as a PC member, please contact me!

Selected Publications

* Equal contribution, ✉ Corresponding author

  1. FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
    Pengxiang Li*, Zhi Gao*, Bofei Zhang*, Tao Yuan, Yuwei Wu, Mehrtash Harandi, Yunde Jia, Song-Chun Zhu, and Qing Li
    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
  2. UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
    Haozhe Zhao*, Xiaojian Ma*, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang
    Neural Information Processing Systems: Datasets and Benchmarks (NeurIPS D&B), 2024
  3. Task-oriented Sequential Grounding in 3D Scenes
    Zhuofan Zhang, Ziyu Zhu, Pengxiang Li, Tengyu Liu, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Siyuan Huang, and Qing Li
    arXiv preprint arXiv:2408.04034, 2024
  4. End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations (Spotlight, top 3.5%)
    Lirui Luo, Guoxi Zhang, Hongming Xu, Yaodong Yang, Cong Fang, and Qing Li
    International Conference on Machine Learning (ICML), 2024
  5. Unifying 3D Vision-Language Understanding Via Promptable Queries
    Ziyu Zhu, Zhuofan Zhang, Xiaojian Ma, Xuesong Niu, Yixin Chen, Baoxiong Jia, Zhidong Deng, Siyuan Huang, and Qing Li
    European Conference on Computer Vision (ECCV), 2024
  6. VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
    Yue Fan*, Xiaojian Ma*, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li
    European Conference on Computer Vision (ECCV), 2024
  7. SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
    Baoxiong Jia*, Yixin Chen*, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, and Siyuan Huang
    European Conference on Computer Vision (ECCV), 2024
  8. CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update
    Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
  9. An Embodied Generalist Agent in 3D World
    Jiangyong Huang*, Silong Yong*, Xiaojian Ma*, Xiongkun Linghu*, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang
    International Conference on Machine Learning (ICML), 2024
  10. Neural-Symbolic Recursive Machine for Systematic Generalization
    International Conference on Learning Representations (ICLR), 2024
  11. Bongard-OpenWorld: Few-Shot Reasoning for Free-Form Visual Concepts in the Real World
    Rujie Wu*, Xiaojian Ma*, Zhenliang Zhang, Wei Wang, Qing Li, Song-Chun Zhu, and Yizhou Wang
    International Conference on Learning Representations (ICLR), 2024
  12. Learning Non-Markovian Decision-Making from State-Only Sequences
    Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, and Sirui Xie
    Neural Information Processing Systems (NeurIPS), 2023
  13. A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics (Notable, top 25%)
    International Conference on Learning Representations (ICLR), 2023
  14. 3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment
    Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, and Qing Li
    International Conference on Computer Vision (ICCV), 2023
  15. SQA3D: Situated Question Answering in 3D Scenes
    International Conference on Learning Representations (ICLR), 2023
  16. SMART: A Situation Model for Algebra Story Problems via Attributed Grammar
    Yining Hong, Qing Li, Ran Gong, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu
    AAAI Conference on Artificial Intelligence (AAAI), 2021
  17. Learning by Fixing: Solving Math Word Problems with Weak Supervision
    Yining Hong, Qing Li, Daniel Ciao, Siyuan Huang, and Song-Chun Zhu
    AAAI Conference on Artificial Intelligence (AAAI), 2021
  18. YouRefIt: Embodied Reference Understanding with Language and Gesture (Oral)
    Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Song-Chun Zhu, Tao Gao, Yixin Zhu, and Siyuan Huang
    International Conference on Computer Vision (ICCV), 2021
  19. VLGrammar: Grounded Grammar Induction of Vision and Language
    Yining Hong, Qing Li, Song-Chun Zhu, and Siyuan Huang
    International Conference on Computer Vision (ICCV), 2021
  20. A Competence-Aware Curriculum for Visual Concepts Learning Via Question Answering (Oral)
    Qing Li, Siyuan Huang, Yining Hong, and Song-Chun Zhu
    European Conference on Computer Vision (ECCV), 2020
  21. Closed Loop Neural-Symbolic Learning Via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning (Best Paper in ICML Workshop)
    International Conference on Machine Learning (ICML), 2020
  22. Why Does a Visual Question Have Different Answers?
    Nilavra Bhattacharya, Qing Li, and Danna Gurari
    International Conference on Computer Vision (ICCV), 2019
  23. VizWiz-Priv: A Dataset for Recognizing the Presence and Purpose of Private Visual Information in Images Taken by Blind People
    Danna Gurari, Qing Li, Chi Lin, Yinan Zhao, Anhong Guo, Abigale Stangl, and Jeffrey P. Bigham
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019
  24. Tell-and-Answer: Towards Explainable Visual Question Answering Using Attributes and Captions (Oral)
    Qing Li, Jianlong Fu, Dongfei Yu, Tao Mei, and Jiebo Luo
    Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018
  25. VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions
    Qing Li, Qingyi Tao, Shafiq Joty, Jianfei Cai, and Jiebo Luo
    European Conference on Computer Vision (ECCV), 2018
  26. VizWiz Grand Challenge: Answering Visual Questions from Blind People (Spotlight)
    Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018
  27. Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation (Best Paper Finalist)
    Qing Li, Zhaofan Qiu, Ting Yao, Tao Mei, Yong Rui, and Jiebo Luo
    ACM International Conference on Multimedia Retrieval (ICMR), 2016