Research
Audio-Visual Learning for Understanding and Generation
- Proposed a unified model that learns both feature representation and generation.
- Evaluated the trained audio and video encoders on classification and retrieval tasks.
- Introduced generative models that capture the conditional probability distribution of each modality.
Adapting LLMs for Audio Understanding
- Proposed an audio language model that ingests multiple audio clips and generates text tokens by interleaving the acoustic embeddings with text embeddings in a single sequence.
- Proposed a training recipe that combines curriculum learning and multi-task learning.
- Evaluated the proposed audio large language model on various downstream tasks.
- Extended large language models and large vision models to the audio domain.
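
The interleaving of acoustic and text embeddings mentioned in this list can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the segment layout, the `interleave` helper, and the toy `embed_token` and `encode_audio` functions are hypothetical stand-ins, not the actual model components.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# A segment is ("text", [token ids]) or ("audio", clip samples). Hypothetical layout.
Segment = Tuple[str, Sequence]


def interleave(
    segments: Sequence[Segment],
    embed_token: Callable[[int], Vector],
    encode_audio: Callable[[Sequence], List[Vector]],
) -> List[Vector]:
    """Flatten mixed text/audio segments into one sequence of embeddings."""
    sequence: List[Vector] = []
    for kind, payload in segments:
        if kind == "text":
            # One embedding per text token.
            sequence.extend(embed_token(token) for token in payload)
        elif kind == "audio":
            # An audio clip may expand to several acoustic embeddings.
            sequence.extend(encode_audio(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence


# Toy stand-ins (assumptions): 2-dimensional embeddings, one vector per sample.
def toy_embed_token(token: int) -> Vector:
    return [float(token), 0.0]


def toy_encode_audio(clip: Sequence) -> List[Vector]:
    return [[0.0, float(sample)] for sample in clip]


prompt = [
    ("text", [1, 2]),
    ("audio", [0.5]),
    ("text", [3]),
    ("audio", [0.1, 0.2]),
]
mixed = interleave(prompt, toy_embed_token, toy_encode_audio)
```

The resulting list keeps the original order of the segments, so a decoder reading it sees each clip at the position where the text refers to it.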
LLM-Based Agents for Audio Creation and Editing
- Applied LLMs, such as ChatGPT, to create/edit audio content based on user instructions and available recordings.
- Produced audio content in a controllable manner by coordinating various generative models.
- Evaluated the proposed system's ability on audio drama, where models must manipulate audio content without explicit user commands.
Bootstrapping Audio Language Models in Few-Shot Learning
- Improved the performance of Contrastive Language-Audio Pretrained networks (CLAPs) with a few examples while preserving their zero-shot classification ability.
- Proposed a new module to retrieve labels of the test examples by measuring the affinity between test and support embeddings.
- Devised a cosine initialisation strategy so that the proposed method benefits from the few-shot setting even without training.
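
The label-retrieval idea in this list can be sketched as a training-free, cache-style classifier. This is an illustration under stated assumptions, not the proposed module itself: the function names, the `alpha` and `beta` parameters, and the exponential affinity are hypothetical choices, and the support embeddings are assumed to serve directly as retrieval keys so that no gradient step is needed.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def few_shot_logits(
    test_emb: Sequence[float],
    support_embs: Sequence[Sequence[float]],
    support_labels: Sequence[int],
    text_embs: Sequence[Sequence[float]],
    alpha: float = 1.0,  # weight of the retrieval branch (assumed)
    beta: float = 5.0,   # sharpness of the affinity (assumed)
) -> List[float]:
    # Zero-shot branch: similarity between the audio and each class text embedding.
    logits = [cosine(test_emb, text) for text in text_embs]
    # Retrieval branch: each support example votes for its own label,
    # weighted by its affinity to the test embedding.
    for emb, label in zip(support_embs, support_labels):
        affinity = math.exp(-beta * (1.0 - cosine(test_emb, emb)))
        logits[label] += alpha * affinity
    return logits


def predict(logits: Sequence[float]) -> int:
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy 2-class example with hypothetical embeddings.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
test_emb = [0.6, 0.8]
support_embs = [[0.6, 0.8]]
support_labels = [0]

zero_shot_class = predict(
    few_shot_logits(test_emb, support_embs, support_labels, text_embs, alpha=0.0)
)
few_shot_class = predict(
    few_shot_logits(test_emb, support_embs, support_labels, text_embs, alpha=1.0)
)
```

Setting `alpha` to zero recovers the zero-shot prediction, which is one way such a design can keep the zero-shot ability intact while still using the support examples.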
Efficient Convolutional Neural Networks (CNNs) for Mobile Applications
- Researched ways to reduce computational complexity and resource cost without much performance degradation.
- Investigated how to reduce feature redundancy in CNN architectures.
- Extracted feature representations to solve multimedia tasks, including acoustic scene classification, sound event detection and image classification.
Detection and Classification of Acoustic Scenes and Events
- Recognized audio scenes and sound events from recordings or online streams using pattern recognition and signal processing.
- Illustrated the relationship between the receptive field of a CNN and the time-frequency resolution of mel energy spectrogram features.
- Proposed a deep CNN architecture using fine-resolution features.
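
The link between receptive field and spectrogram resolution can be made concrete with the standard receptive-field recurrence for stacked convolutions along the time axis. The sketch below is illustrative only: the layer configurations and the hop, window and sample-rate values are hypothetical, not those of the proposed architecture.

```python
from typing import Sequence, Tuple


def receptive_field(layers: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive field, cumulative stride) in input frames.

    `layers` lists (kernel_size, stride) per layer along one axis.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        field += (kernel - 1) * jump
        jump *= stride
    return field, jump


def frames_to_seconds(frames: int, hop: int, window: int, sample_rate: int) -> float:
    """Duration of audio covered by `frames` consecutive spectrogram frames."""
    return ((frames - 1) * hop + window) / sample_rate


# Three 3-wide, stride-1 layers versus a strided first layer (hypothetical stacks).
plain = receptive_field([(3, 1), (3, 1), (3, 1)])
strided = receptive_field([(3, 2), (3, 1)])

# Hypothetical analysis settings chosen for easy arithmetic.
span = frames_to_seconds(plain[0], hop=500, window=1000, sample_rate=1000)
```

With a smaller hop length the same receptive field in frames covers less audio, so the time resolution of the input features and the depth of the network have to be chosen together.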