Contextual Modeling for 3D Dense Captioning on Point Clouds.

3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and generate a distinctive natural language sentence for describing each located ... Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that our proposed method sets a new record on the 3D dense captioning task, and verify the effectiveness of our proposed contextual modeling ... We make the first attempt to incorporate the contextual information of the point clouds into the 3D dense captioning task. 2. ...

arXiv:2210.03925v1 fatcat:ar7ucs6mprdd7jib2grsdr6ote

3D dense captioning aims to generate multiple captions localized with their associated object regions. ... with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection ... Vote2Cap-DETR is a one-stage transformer model that takes a 3D point cloud as its input, and generates a set of box predictions and sentences localizing and describing each object in the point cloud. ...

arXiv:2301.02508v1 fatcat:e6nvic32czg2hbqs6fmk3toseu

3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart. ... In this paper, aiming at improving 3D dense captioning via capturing and utilizing the complex relations in the 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more ... Recently, the 3D dense captioning task has been proposed by Chen et al. [11], where pure point clouds are adopted as the visual representation to perform object localization and captioning on. ...

arXiv:2203.05203v2 fatcat:dudhhbfuyrelznkjzajtp4ltsa

3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. ... We also insert additional spatial information to the caption head for more accurate descriptions. ... This problem is challenging, given 1) the sparsity of a point cloud and 2) the cluttered 3D scene. Prior works have achieved great success in 3D dense captioning. ...

arXiv:2309.02999v1 fatcat:fo34votyojdyxgtbpdd2lkride

Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding. ... In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. ... Static and Dynamic 3D Data Practical. ...

arXiv:2212.00836v1 fatcat:5n6pfscpovbvdhpebtke4bh7py

As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. ... We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. ... Method We propose an end-to-end architecture on the input point clouds to address the 3D dense description generation task. ...

arXiv:2012.02206v1 fatcat:xedwiq2j3bc2hh3nns5n2czhoe

semantic part segmentation and 3D scene captioning. ... One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed ... The point cloud in the first row is the input of our captioning model. ...

arXiv:1808.06840v1 fatcat:ofm65e5wjzaaxphqgb7nyxnhna

semantic part segmentation and 3D scene captioning. ... One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed ... The point cloud in the first row is the input of our captioning model. ...

doi:10.1007/978-3-030-01225-0_37 fatcat:fwuiwtsmzzhmhmpev657ls3f6a

Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. ... Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural ... To directly tackle 3D world, Scan2Cap proposed 3D dense captioning on point cloud data. ...

arXiv:2204.10688v1 fatcat:fs666df6fndujo6igo4wnwj3ja

Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. ... Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. ... Static and Dynamic 3D Data Practical. ...

arXiv:2112.01551v2 fatcat:7dcyeual3rgbnpv5f7dkak5x6a

On the basis of this method, we propose a novel gaze-based image caption system, which has been studied for the first time. ... To address this problem, we propose a method of gaze following that utilizes a geometric map for better estimation. With the help of the map, this method is competitive for cross-frame estimation. ... Acknowledgments: We thank Lifu Chen and his colleagues in DoGoodly International Education Center and Smart Children Education Center for providing facilitation and assistance to our facilitators in collecting ...

doi:10.3390/app11167426 fatcat:gtavacta7fawjb2yud7kn6ue2i

Our main contribution is three-fold: 1) We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision. ... Recent works on multi-modal large language models, such as GPT-4V and Bard, have demonstrated their effectiveness in handling visual modalities. ... C.3 LAMM-Benchmark on point cloud tasks For LAMM-benchmark on point cloud tasks, we focus on three tasks of scene perception, including 3D object detection, visual grounding, and 3D visual question answering ...

arXiv:2306.06687v3 fatcat:qjzrzjt44bdzbj2birzgryegm4

LiDAR-LLM attains a 40.9 BLEU-1 on the 3D captioning task and achieves a 63.1% classification accuracy and a 14.3% BEV mIoU on the 3D grounding task. ... The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc ... They focus on the 3D point cloud with a single object or an indoor scene. ...

arXiv:2312.14074v1 fatcat:zv6yjk4e2jfetk6ik6flq2mwya

Additionally, we create a 3D scene captioning dataset annotated with rich object identifiers, with the assistance of GPT-4. ... Recent research has evidenced the significant potential of Large Language Models (LLMs) in handling challenging tasks within 3D scenes. ... proposed method. 3D Representation Learning 3D point cloud is a fundamental visual modality. ...

arXiv:2312.08168v2 fatcat:o64h36od3nhezjaeufqi7yyn4q

Compared with 2D image-based systems, 3D-based systems are more complicated due to the following five reasons: 1) Data representation itself is more complicated. 3D images can be represented by point clouds ... However, one challenge for 2D image-based systems is that they cannot provide accurate 3D location information. ... These results are based on PointNet++ [66] models, running at 5 fps and achieving test set 3D AP of 70.39, 44.89 and 56.77 for car, pedestrian and cyclist, respectively. 3D instance masks on point cloud ...

arXiv:1905.12683v2 fatcat:e5ladbmkzjg53c3o6uxzlup3ky

Contextual Modeling for 3D Dense Captioning on Point Clouds [article]

End-to-End 3D Dense Captioning with Vote2Cap-DETR [article]

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes [article]

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning [article]

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding [article]

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans [article]

Fully-Convolutional Point Networks for Large-Scale Point Clouds [article]

Fully-Convolutional Point Networks for Large-Scale Point Clouds [chapter]

Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds [article]

D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding [article]

Say What You Are Looking At: An Attention-Based Interactive System for Autistic Children

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [article]

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding [article]

Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers [article]

A survey of Object Classification and Detection based on 2D/3D data [article]
