617 Hits in 4.2 sec

Contextual Modeling for 3D Dense Captioning on Point Clouds [article]

Yufeng Zhong, Long Xu, Jiebo Luo, Lin Ma
2022 arXiv   pre-print
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and generate a distinctive natural language sentence for describing each located  ...  Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that our proposed method sets a new record on the 3D dense captioning task, and verify the effectiveness of our raised contextual modeling  ...  We make the first attempt to incorporate the contextual information of the point clouds into 3D dense captioning task. 2.  ... 
arXiv:2210.03925v1 fatcat:ar7ucs6mprdd7jib2grsdr6ote

End-to-End 3D Dense Captioning with Vote2Cap-DETR [article]

Sijin Chen, Hongyuan Zhu, Xin Chen, Yinjie Lei, Tao Chen, Gang Yu
2023 arXiv   pre-print
3D dense captioning aims to generate multiple captions localized with their associated object regions.  ...  with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection  ...  Vote2Cap-DETR is a one-stage transformer model that takes a 3D point cloud as its input, and generates a set of box predictions and sentences localizing and describing each object in the point cloud.  ... 
arXiv:2301.02508v1 fatcat:e6nvic32czg2hbqs6fmk3toseu

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes [article]

Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
2022 arXiv   pre-print
3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart.  ...  In this paper, aiming at improving 3D dense captioning via capturing and utilizing the complex relations in the 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more  ...  Recently, the 3D dense captioning task has been proposed by Chen et al. [11], where pure point clouds are adopted as the visual representation to perform object localization and captioning on.  ... 
arXiv:2203.05203v2 fatcat:dudhhbfuyrelznkjzajtp4ltsa

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning [article]

Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen
2023 arXiv   pre-print
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions.  ...  We also insert additional spatial information to the caption head for more accurate descriptions.  ...  This problem is challenging, given 1) the sparsity of a point cloud and 2) the cluttered 3D scene. Prior works have achieved great success in 3D dense captioning.  ... 
arXiv:2309.02999v1 fatcat:fo34votyojdyxgtbpdd2lkride

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding [article]

Dave Zhenyu Chen, Ronghang Hu, Xinlei Chen, Matthias Nießner, Angel X. Chang
2022 arXiv   pre-print
Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding.  ...  In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning.  ...  Static and Dynamic 3D Data Practical.  ... 
arXiv:2212.00836v1 fatcat:5n6pfscpovbvdhpebtke4bh7py

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans [article]

Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang
2020 arXiv   pre-print
As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects.  ...  We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors.  ...  Method We propose an end-to-end architecture on the input point clouds to address the 3D dense description generation task.  ... 
arXiv:2012.02206v1 fatcat:xedwiq2j3bc2hh3nns5n2czhoe

Fully-Convolutional Point Networks for Large-Scale Point Clouds [article]

Dario Rethage, Johanna Wald, Jürgen Sturm, Nassir Navab, Federico Tombari
2018 arXiv   pre-print
semantic part segmentation and 3D scene captioning.  ...  One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed  ...  The point cloud in the first row is the input of our captioning model.  ... 
arXiv:1808.06840v1 fatcat:ofm65e5wjzaaxphqgb7nyxnhna

Fully-Convolutional Point Networks for Large-Scale Point Clouds [chapter]

Dario Rethage, Johanna Wald, Jürgen Sturm, Nassir Navab, Federico Tombari
2018 Lecture Notes in Computer Science  
semantic part segmentation and 3D scene captioning.  ...  One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed  ...  The point cloud in the first row is the input of our captioning model.  ... 
doi:10.1007/978-3-030-01225-0_37 fatcat:fwuiwtsmzzhmhmpev657ls3f6a

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [article]

Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai
2022 arXiv   pre-print
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding.  ...  Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural  ...  To directly tackle 3D world, Scan2Cap proposed 3D dense captioning on point cloud data.  ... 
arXiv:2204.10688v1 fatcat:fs666df6fndujo6igo4wnwj3ja

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding [article]

Dave Zhenyu Chen, Qirui Wu, Matthias Nießner, Angel X. Chang
2022 arXiv   pre-print
Recent studies on dense captioning and visual grounding in 3D have achieved impressive results.  ...  Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods.  ...  Static and Dynamic 3D Data Practical.  ... 
arXiv:2112.01551v2 fatcat:7dcyeual3rgbnpv5f7dkak5x6a

Say What You Are Looking At: An Attention-Based Interactive System for Autistic Children

Furong Deng, Yu Zhou, Sifan Song, Zijian Jiang, Lifu Chen, Jionglong Su, Zhenglong Sun, Jiaming Zhang
2021 Applied Sciences  
On the basis of this method, we propose a novel gaze-based image caption system, which has been studied for the first time.  ...  To address this problem, we propose a method of gaze following that utilizes a geometric map for better estimation. With the help of the map, this method is competitive for cross-frame estimation.  ...  Acknowledgments: We thank Lifu Chen and his colleagues in DoGoodly International Education Center and Smart Children Education Center for providing facilitation and assistance to our facilitators in collecting  ... 
doi:10.3390/app11167426 fatcat:gtavacta7fawjb2yud7kn6ue2i

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [article]

Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
2023 arXiv   pre-print
Our main contribution is three-fold: 1) We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.  ...  Recent works on multi-modal large language models, such as GPT-4V and Bard, have demonstrated their effectiveness in handling visual modalities.  ...  C.3 LAMM-Benchmark on point cloud tasks For LAMM-benchmark on point cloud tasks, we focus on three tasks of scene perception, including 3D object detection, visual grounding, and 3D visual question answering  ... 
arXiv:2306.06687v3 fatcat:qjzrzjt44bdzbj2birzgryegm4

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding [article]

Senqiao Yang, Jiaming Liu, Ray Zhang, Mingjie Pan, Zoey Guo, Xiaoqi Li, Zehui Chen, Peng Gao, Yandong Guo, Shanghang Zhang
2023 arXiv   pre-print
LiDAR-LLM attains a 40.9 BLEU-1 on the 3D captioning task and achieves a 63.1% classification accuracy and a 14.3% BEV mIoU on the 3D grounding task.  ...  The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc  ...  They focus on the 3D point cloud with a single object or an indoor scene.  ... 
arXiv:2312.14074v1 fatcat:zv6yjk4e2jfetk6ik6flq2mwya

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers [article]

Haifeng Huang, Zehan Wang, Rongjie Huang, Luping Liu, Xize Cheng, Yang Zhao, Tao Jin, Zhou Zhao
2023 arXiv   pre-print
Additionally, we create a 3D scene captioning dataset annotated with rich object identifiers, with the assistance of GPT-4.  ...  Recent research has evidenced the significant potentials of Large Language Models (LLMs) in handling challenging tasks within 3D scenes.  ...  proposed method. 3D Representation Learning 3D point cloud is a fundamental visual modality.  ... 
arXiv:2312.08168v2 fatcat:o64h36od3nhezjaeufqi7yyn4q

A survey of Object Classification and Detection based on 2D/3D data [article]

Xiaoke Shen
2022 arXiv   pre-print
Compared with 2D image-based systems, 3D-based systems are more complicated due to the following five reasons: 1) Data representation itself is more complicated. 3D images can be represented by point clouds  ...  However, one challenge for 2D image-based systems is that they cannot provide accurate 3D location information.  ...  These results are based on PointNet++ [66] models, running at 5 fps and achieving test set 3D AP of 70.39, 44.89 and 56.77 for car, pedestrian and cyclist, respectively. 3D instance masks on point cloud  ... 
arXiv:1905.12683v2 fatcat:e5ladbmkzjg53c3o6uxzlup3ky