Visually Exploring Multi-Purpose Audio Data
[article]
2021
arXiv
pre-print
We analyse multi-purpose audio using tools to visualise similarities within the data that may be observed via unsupervised methods. ...
We use the visual assessment of cluster tendency (VAT) technique on a well known data set to observe how the samples naturally cluster, and we make comparisons to the labels used for audio geotagging and ...
Fig. 2. A series of ordered dissimilarity matrices produced by VAT and SpecVAT on multi-purpose audio data. ...
arXiv:2110.04584v1
fatcat:aa6apxylabdirmifi4luxttd4i
Learning in Audio-visual Context: A Review, Analysis, and New Perspective
[article]
2022
arXiv
pre-print
Through our analysis, we discover that the consistency of audio-visual data across the semantic, spatial, and temporal dimensions supports the above studies. ...
Then, we systematically review the recent audio-visual learning studies and divide them into three categories: audio-visual boosting, cross-modal perception and audio-visual collaboration. ...
Hence, audio-visual learning is essential to our pursuit of human-like machine perception ability. Its purpose is to explore computational approaches that learn from both audio and visual data. ...
arXiv:2208.09579v1
fatcat:xrjedf2ezbhbzbkysw2z2jsm7e
Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines
2020
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips ...
Both are examples which are correctly classified by our multi-modality approach. ...
We explore visual and audio in this experiment due to accessibility, since there is a lot of audio-visual video data available to researchers. ...
doi:10.1109/iros45743.2020.9341557
fatcat:yusylnuo2nfthag7ipnezfjgcu
Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines
[article]
2020
arXiv
pre-print
The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. ...
This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips ...
We explore visual and audio in this experiment due to accessibility, since there is a lot of audio-visual video data available to researchers. ...
arXiv:2007.10175v1
fatcat:wsufgbkxbfhujb2kra7zmb3yky
Cloud Bridge: A Data-Driven Immersive Audio-Visual Software Interface
2013
Zenodo
It explores how information can be sonified and visualized to facilitate findings, and eventually become interactive musical compositions. Cloud Bridge functions as a multi-user, multimodal instrument. ...
Cloud Bridge leads to a new media interactive interface utilizing audio synthesis, visualization and real-time interaction. ...
This project is a proof of concept for an interactive multi-user software instrument that utilizes data as the driver for visual/audio content. ...
doi:10.5281/zenodo.1178596
fatcat:zxkacyryujawxohsn5chmsg3sy
Mediation Exploring Multi-Sensory Elements Through the Use of Songs and its Effects to Pupils with Learning Disabilities
2019
International Journal of Academic Research in Progressive Education and Development
The purpose of this study is to explore multi-sensory elements through songs in learning and the impact on pupils with learning disabilities (PLD). ...
Data of the study were analyzed by using constant comparison techniques of multi-sensory elements, the use of songs and the effects on pupils using the Nvivo software. ...
DATA FREQUENCY (table excerpt): total counts across observations (recorded) and interviews (triangulation) for pupils P1, P2, P3. Perceptions, Audio Songs (EA): S1 = 3, S2 = 3; Visual Track (EV): S1 ...
doi:10.6007/ijarped/v8-i4/6909
fatcat:ntzq56jtyjdcfh7lcaxg4cetky
Improving acoustic event detection using generalizable visual features and multi-modality modeling
2011
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Hidden Markov models (HMMs) are used for audio-only modeling, and multi-stream HMMs or coupled HMMs (CHMM) are used for audio-visual joint modeling. ...
To allow the flexibility of audio-visual state asynchrony, we explore effective CHMM training via HMM state-space mapping, parameter tying and different initialization schemes. ...
Different fusion methods have been explored for the audio and visual modalities. ...
doi:10.1109/icassp.2011.5946412
dblp:conf/icassp/HuangZH11
fatcat:tvzxgu5ls5dmjgc2bej3tvqdlq
Sensate abstraction: hybrid strategies for multi-dimensional data in expressive virtual reality contexts
2009
The Engineering Reality of Virtual Reality 2009
The installation utilizes a combination of infrared motion tracking, custom computer vision, multi-channel (10.1) spatialized interactive audio, 3D graphics, data sonification, audio design, networking ...
Here we describe the physical and audio display systems for the installation and a hybrid strategy for multi-channel spatialized interactive audio rendering in immersive virtual reality that combines amplitude ...
The installation explores the potential interplay between artistic and data-driven strategies, based on visual and auditory pattern, in working with massive multidimensional multi-scale, multi-resolution ...
doi:10.1117/12.806928
fatcat:ssz77u5g2ngo5fx5sq54f4s76i
Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations
[article]
2020
arXiv
pre-print
The deliberation model outperforms the multi-stream model and achieves a relative WER improvement of 6% and 8.7% for the clean and masked data, respectively, compared to an audio-only model. ...
Firstly, we propose a multi-stream attention architecture to leverage signals from both audio and video modalities. ...
For this purpose, we augment the data by masking out specific words in the audio stream. ...
arXiv:2011.04084v1
fatcat:4ibujf7lc5cerf3v2h74rpcfum
Audio-Visual LLM for Video Understanding
[article]
2023
arXiv
pre-print
This mechanism is pivotal in enabling end-to-end joint training with video data at different modalities, including visual-only, audio-only, and audio-visual formats. ...
This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. ...
AudioGPT [33] leverages various audio foundation models to process audio data, where LLMs are regarded as the general-purpose interface. ...
arXiv:2312.06720v2
fatcat:vjwjvxyrbvg57eddyjwk7ytvka
CNN-Based Multi-Modal Camera Model Identification on Video Sequences
2021
Journal of Imaging
Differently from mono-modal methods, which use only the visual or audio information from the investigated video to tackle the identification task, the proposed multi-modal methods jointly exploit audio ...
To this purpose, we develop two different CNN-based camera model identification methods, working in a novel multi-modal scenario. ...
Given a query video, we extract and pre-process its visual and audio content. Then, we feed these data to one multi-input CNN, composed of two CNNs whose last fully-connected layers are concatenated. ...
doi:10.3390/jimaging7080135
pmid:34460771
fatcat:7v5nxgk225akffydyiq3ojtd24
Noisy Agents: Self-supervised Exploration by Predicting Auditory Events
[article]
2020
arXiv
pre-print
Humans integrate multiple sensory modalities (e.g. visual and audio) to build a causal understanding of the physical world. ...
First, we allow the agent to collect a small amount of acoustic data and use K-means to discover underlying auditory event clusters. ...
Active exploration or random explorations? We propose an online clustering-based intrinsic module for active audio data collections. ...
arXiv:2007.13729v1
fatcat:yvlz2stdxnckbnacgyc7rjmrui
Wav2CLIP: Learning Robust Audio Representations From CLIP
[article]
2022
arXiv
pre-print
as it does not require learning a visual model in concert with an auditory model. ...
Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods ...
data, or via audio-visual correspondence, we plot the confusion matrices of YamNet and Wav2CLIP in Figure 3 on TAU, an audio-visual scene classification dataset. ...
arXiv:2110.11499v2
fatcat:uq6dxnke6ne5nissu6yividmsu
PathoSonic: Performing Sound In Virtual Reality Feature Space
2020
Proceedings of the International Conference on New Interfaces for Musical Expression
Through implementation of a multi-sensory experience, including visual aesthetics, sound, and haptic feedback, we explore inclusive approaches to sound visualization, making it more accessible to a wider ...
The name comes from the different paths the participant can create through their sonic explorations. ...
Through implementation of a multi-sensory experience, including visual aesthetics, sound, and haptic feedback, this investigation seeks to explore inclusive approaches to sound visualization, making it ...
doi:10.5281/zenodo.4813510
fatcat:byfxpchd45airm5l3dnxgjxapa
Using Machine Learning to Classify Music Genre
2021
International Journal for Research in Applied Science and Engineering Technology
In this work, we aimed to build a machine learning model to classify the genre of an input audio file using 8 machine learning algorithms and determine which algorithm is best suited for genre ...
First, we performed data visualization to get familiar with our data. For visualization purposes, we considered one instance of our dataset: the 11th audio file, belonging to the Pop genre. ...
Such extensive exploration and visualization are necessary due to the input files being audios. Now, we are fully equipped to begin building our model. ...
doi:10.22214/ijraset.2021.38365
fatcat:ub2nn2xb3fedjgalqve6ryfgdq
Showing results 1–15 of 66,850 results