Contextual Modeling for 3D Dense Captioning on Point Clouds
[article]
2022
arXiv
pre-print
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds and generate a distinctive natural language sentence for describing each located ...
Extensive experiments on the ScanRefer and Nr3D datasets demonstrate that our proposed method sets a new record on the 3D dense captioning task, and verify the effectiveness of our raised contextual modeling ...
We make the first attempt to incorporate the contextual information of the point clouds into the 3D dense captioning task. 2. ...
arXiv:2210.03925v1
fatcat:ar7ucs6mprdd7jib2grsdr6ote
End-to-End 3D Dense Captioning with Vote2Cap-DETR
[article]
2023
arXiv
pre-print
3D dense captioning aims to generate multiple captions localized with their associated object regions. ...
with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection ...
Vote2Cap-DETR is a one-stage transformer model that takes a 3D point cloud as its input, and generates a set of box predictions and sentences localizing and describing each object in the point cloud. ...
arXiv:2301.02508v1
fatcat:e6nvic32czg2hbqs6fmk3toseu
MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes
[article]
2022
arXiv
pre-print
3D dense captioning is a recently-proposed novel task, where point clouds contain more geometric information than the 2D counterpart. ...
In this paper, aiming at improving 3D dense captioning via capturing and utilizing the complex relations in the 3D scene, we propose MORE, a Multi-Order RElation mining model, to support generating more ...
Recently, the 3D dense captioning task has been proposed by Chen et al. [11], where pure point clouds are adopted as the visual representation to perform object localization and captioning on. ...
arXiv:2203.05203v2
fatcat:dudhhbfuyrelznkjzajtp4ltsa
Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning
[article]
2023
arXiv
pre-print
3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. ...
We also insert additional spatial information to the caption head for more accurate descriptions. ...
This problem is challenging, given 1) the sparsity of a point cloud and 2) the cluttered 3D scene. Prior works have achieved great success in 3D dense captioning. ...
arXiv:2309.02999v1
fatcat:fo34votyojdyxgtbpdd2lkride
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
[article]
2022
arXiv
pre-print
Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding. ...
In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. ...
Static and Dynamic 3D Data Practical. ...
arXiv:2212.00836v1
fatcat:5n6pfscpovbvdhpebtke4bh7py
Scan2Cap: Context-aware Dense Captioning in RGB-D Scans
[article]
2020
arXiv
pre-print
As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. ...
We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. ...
Method We propose an end-to-end architecture on the input point clouds to address the 3D dense description generation task. ...
arXiv:2012.02206v1
fatcat:xedwiq2j3bc2hh3nns5n2czhoe
Fully-Convolutional Point Networks for Large-Scale Point Clouds
[article]
2018
arXiv
pre-print
semantic part segmentation and 3D scene captioning. ...
One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed ...
The point cloud in the first row is the input of our captioning model. ...
arXiv:1808.06840v1
fatcat:ofm65e5wjzaaxphqgb7nyxnhna
Fully-Convolutional Point Networks for Large-Scale Point Clouds
[chapter]
2018
Lecture Notes in Computer Science
semantic part segmentation and 3D scene captioning. ...
One striking characteristic of our approach is its ability to process unorganized 3D representations such as point clouds as input, then transforming them internally to ordered structures to be processed ...
The point cloud in the first row is the input of our captioning model. ...
doi:10.1007/978-3-030-01225-0_37
fatcat:fwuiwtsmzzhmhmpev657ls3f6a
Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
[article]
2022
arXiv
pre-print
Dense captioning in 3D point clouds is an emerging vision-and-language task involving object-level 3D scene understanding. ...
Apart from coarse semantic class prediction and bounding box regression as in traditional 3D object detection, 3D dense captioning aims at producing a further and finer instance-level label of natural ...
To directly tackle the 3D world, Scan2Cap proposed 3D dense captioning on point cloud data. ...
arXiv:2204.10688v1
fatcat:fs666df6fndujo6igo4wnwj3ja
D3Net: A Unified Speaker-Listener Architecture for 3D Dense Captioning and Visual Grounding
[article]
2022
arXiv
pre-print
Recent studies on dense captioning and visual grounding in 3D have achieved impressive results. ...
Despite developments in both areas, the limited amount of available 3D vision-language data causes overfitting issues for 3D visual grounding and 3D dense captioning methods. ...
Static and Dynamic 3D Data Practical. ...
arXiv:2112.01551v2
fatcat:7dcyeual3rgbnpv5f7dkak5x6a
Say What You Are Looking At: An Attention-Based Interactive System for Autistic Children
2021
Applied Sciences
On the basis of this method, we propose a novel gaze-based image caption system, which has been studied for the first time. ...
To address this problem, we propose a method of gaze following that utilizes a geometric map for better estimation. With the help of the map, this method is competitive for cross-frame estimation. ...
Acknowledgments: We thank Lifu Chen and his colleagues in DoGoodly International Education Center and Smart Children Education Center for providing facilitation and assistance to our facilitators in collecting ...
doi:10.3390/app11167426
fatcat:gtavacta7fawjb2yud7kn6ue2i
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
[article]
2023
arXiv
pre-print
Our main contribution is three-fold: 1) We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision. ...
Recent works on multi-modal large language models, such as GPT-4V and Bard, have demonstrated their effectiveness in handling visual modalities. ...
C.3 LAMM-Benchmark on point cloud tasks For LAMM-benchmark on point cloud tasks, we focus on three tasks of scene perception, including 3D object detection, visual grounding, and 3D visual question answering ...
arXiv:2306.06687v3
fatcat:qjzrzjt44bdzbj2birzgryegm4
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
[article]
2023
arXiv
pre-print
LiDAR-LLM attains a 40.9 BLEU-1 on the 3D captioning task and achieves a 63.1% classification accuracy and a 14.3% BEV mIoU on the 3D grounding task. ...
The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc ...
They focus on the 3D point cloud with a single object or an indoor scene. ...
arXiv:2312.14074v1
fatcat:zv6yjk4e2jfetk6ik6flq2mwya
Chat-3D v2: Bridging 3D Scene and Large Language Models with Object Identifiers
[article]
2023
arXiv
pre-print
Additionally, we create a 3D scene captioning dataset annotated with rich object identifiers, with the assistance of GPT-4. ...
Recent research has evidenced the significant potentials of Large Language Models (LLMs) in handling challenging tasks within 3D scenes. ...
proposed method. 3D Representation Learning 3D point cloud is a fundamental visual modality. ...
arXiv:2312.08168v2
fatcat:o64h36od3nhezjaeufqi7yyn4q
A survey of Object Classification and Detection based on 2D/3D data
[article]
2022
arXiv
pre-print
Compared with 2D image-based systems, 3D-based systems are more complicated due to the following five reasons: 1) Data representation itself is more complicated. 3D images can be represented by point clouds ...
However, one challenge for 2D image-based systems is that they cannot provide accurate 3D location information. ...
These results are based on PointNet++ [66] models, running at 5 fps and achieving test set 3D AP of 70.39, 44.89 and 56.77 for car, pedestrian and cyclist, respectively. 3D instance masks on point cloud ...
arXiv:1905.12683v2
fatcat:e5ladbmkzjg53c3o6uxzlup3ky