Dense Captioning with Joint Inference and Visual Context
2017
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. ...
., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase. ...
Introduction The computer vision community has recently witnessed the success of deep neural networks for image captioning, in which a sentence is generated to describe a given image. ...
doi:10.1109/cvpr.2017.214
dblp:conf/cvpr/YangTYL17
fatcat:7mcgtr3oinag5pnwob5bqviglq
Dense Captioning with Joint Inference and Visual Context
[article]
2017
arXiv
pre-print
Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. ...
., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase. ...
Introduction The computer vision community has recently witnessed the success of deep neural networks for image captioning, in which a sentence is generated to describe a given image. ...
arXiv:1611.06949v2
fatcat:gm2xmjbq65edhpxn5qrlpdnswy
Long-term recurrent convolutional networks for visual recognition and description
2015
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving ...
tasks, image description and retrieval problems, and video narration challenges. ...
Acknowledgements The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. ...
doi:10.1109/cvpr.2015.7298878
dblp:conf/cvpr/DonahueHGRVDS15
fatcat:5w4eeyesm5hipiieav2nzc4et4
Long-Term Recurrent Convolutional Networks for Visual Recognition and Description
2017
IEEE Transactions on Pattern Analysis and Machine Intelligence
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving ...
tasks, image description and retrieval problems, and video narration challenges. ...
Acknowledgements The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. ...
doi:10.1109/tpami.2016.2599174
pmid:27608449
fatcat:xoluu2wo6jfbxddalikzfbrgk4
Robust Visual-Textual Sentiment Analysis
2016
Proceedings of the 2016 ACM on Multimedia Conference - MM '16
Our system first builds a semantic tree structure based on sentence parsing, aimed at aligning textual words and image regions for accurate analysis. ...
Sentiment analysis is crucial for extracting social signals from social media content. ...
Acknowledgment This work was generously supported in part by Adobe Research and New York State through the Goergen Institute for Data Science at the University of Rochester. ...
doi:10.1145/2964284.2964288
dblp:conf/mm/YouCJL16
fatcat:jwftcdellngpblor5mpbhooq6i
Long-term Recurrent Convolutional Networks for Visual Recognition and Description
[article]
2016
arXiv
pre-print
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving ...
tasks, image description and retrieval problems, and video narration challenges. ...
ACKNOWLEDGMENTS The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. ...
arXiv:1411.4389v4
fatcat:24egskhcdjcsfet4onvzhwfskq
Learning Multimodal Attention LSTM Networks for Video Captioning
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
Multimodal Attention Long-Short Term Memory networks (MA-LSTM). ...
Different from existing approaches that employ the same LSTM structure for different modalities, we train modality-specific LSTM to capture the intrinsic representations of individual modalities. ...
In [24], Venugopalan et al. design an encoder-decoder neural network to generate descriptions. ...
doi:10.1145/3123266.3123448
dblp:conf/mm/XuYZM17
fatcat:walf26k3tve45idna47khhtsq4
Deep-Temporal LSTM for Daily Living Action Recognition
[article]
2018
arXiv
pre-print
In this work, we propose a deep-temporal LSTM architecture which extends standard LSTM and allows better encoding of temporal information. ...
In addition, we propose to fuse 3D skeleton geometry with deep static appearance. ...
The complementary nature of the LSTM- and CNN-based networks is evident from the boosted performance on MSRDailyActivity3D and NTU-RGB+D after fusion. ...
arXiv:1802.00421v2
fatcat:th4clkmfzvejrlmhs3n7z5smvu
A Review of Deep Learning-based Human Activity Recognition on Benchmark Video Datasets
2022
Applied Artificial Intelligence
A short comparison is also made with the handcrafted feature-based approach and its fusion with deep learning to show the evolution of HAR methods. ...
We propose a new taxonomy for categorizing the literature as CNN and RNN-based approaches. ...
State-of-the-art deep learning approaches are categorized as CNN-based methods (multi-stream networks and sequential networks) and RNN-based methods (LSTM with CNN and fusion LSTM). ...
doi:10.1080/08839514.2022.2093705
fatcat:6on4g3sp3vaktnyyrk72k4mqta
Deep fusion of gray level co-occurrence matrices for lung nodule classification
[article]
2022
arXiv
pre-print
LSTM fusion structure. ...
A new long short-term memory (LSTM) based deep fusion structure is introduced, where the texture features computed from lung nodules through new volumetric grey-level co-occurrence matrix (GLCM) computations ...
This LSTM-based deep network fusion of adjacent VS features is applied to classify lung nodules by exploiting the spatially correlated information between sequenced VSs. ...
arXiv:2205.05123v1
fatcat:mpgnycr7mnddpju6o7df2eknh4
Deep Auto-Encoders with Sequential Learning for Multimodal Dimensional Emotion Recognition
[article]
2020
arXiv
pre-print
To address these challenges, in this paper, we propose a novel deep neural network architecture consisting of a two-stream auto-encoder and a long short-term memory for effectively integrating visual and audio signal streams for emotion recognition. ...
Those features are then combined via a fusion layer before being fed to the LSTM for sequential learning of the features (for every 0.2s long sequence) from input streams. ...
arXiv:2004.13236v1
fatcat:qgoybjaf3jgphcoed7qablxqlm
Sequential Deep Trajectory Descriptor for Action Recognition With Three-Stream CNN
2017
IEEE transactions on multimedia
To address this problem, this paper proposes a long-term motion descriptor called sequential Deep Trajectory Descriptor (sDTD). ...
Specifically, we project dense trajectories into two-dimensional planes, and subsequently a CNN-RNN network is employed to learn an effective representation for long-term motion. ...
[4] proposed their own recurrent networks respectively by connecting LSTMs to CNNs. Donahue et al. tested their model on activity recognition, image description and video description. Wu et al. ...
doi:10.1109/tmm.2017.2666540
fatcat:lddmr4wecnesbcua2tphb7qvxy
Deep-Temporal LSTM for Daily Living Action Recognition
2018
2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)
In this work, we propose a deep-temporal LSTM architecture which extends standard LSTM and allows better encoding of temporal information. ...
In addition, we propose to fuse 3D skeleton geometry with deep static appearance. ...
The complementary nature of the LSTM- and CNN-based networks is evident from the boosted performance on MSRDailyActivity3D and NTU-RGB+D after fusion. ...
doi:10.1109/avss.2018.8639122
dblp:conf/avss/DasKBF18
fatcat:qzl7akjymfeijjjv6mbpolrm5m
Long Short-Term Memory with Gate and State Level Fusion for Light Field-Based Face Recognition
[article]
2020
arXiv
pre-print
Long Short-Term Memory (LSTM) is a prominent recurrent neural network for extracting dependencies from sequential data such as time-series and multi-view data, having achieved impressive results for different ...
The efficacy of the novel LSTM cell architectures is assessed by integrating them into deep learning-based methods for face recognition with multi-view, light field images. ...
Nowadays, due to their superior representation and prediction performance, deep Convolutional Neural Networks (CNNs) are increasingly adopted for visual recognition and description tasks [4]. ...
arXiv:1905.04421v2
fatcat:xi7fxu2qrjde7cnp6su4pujlpe
Fine-Grained Recognition via Attribute-Guided Attentive Feature Aggregation
2017
Proceedings of the 2017 ACM on Multimedia Conference - MM '17
This could be considered as a discriminant aggregation network and informative patch-level features are propagated and accumulated to the deeper nodes of the recurrent network for final classification. ...
First, we develop a novel attribute-guided attentive network to sequentially discover informative parts/regions, by seeking a good registration between attentive regions and predefined object attributes ...
description generation [29]. ...
doi:10.1145/3123266.3123358
dblp:conf/mm/YanNY17
fatcat:q5dlco5zrbfdfenmuhhpwzcp2a