Audio-Visual Learning for Understanding and Generation

  • Proposed a unified model that jointly learns feature representations and performs generation.
  • Evaluated both audio-trained and video-trained encoders on classification and retrieval tasks.
  • Introduced generative models to model the conditional probability distribution for each modality.

Adapting LLMs for Audio Understanding

  • Proposed an audio language model that ingests multiple audio clips and generates text tokens by interleaving the acoustic embeddings with text embeddings in a sequence (sketched below).
  • Proposed a training recipe that combines curriculum learning and multi-task learning.
  • Evaluated the proposed audio large language model on various downstream tasks.
  • Extended large language models and large vision models to the audio domain.
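
A minimal sketch of the interleaving step, assuming a learned linear projection maps acoustic embeddings into the LLM's embedding space; the module name, dimensions, and segment format are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class InterleavedAudioTextInput(nn.Module):
    """Builds one embedding sequence by interleaving text and audio segments.
    Dimensions and the projection layer are assumptions for illustration."""

    def __init__(self, audio_dim=768, llm_dim=4096):
        super().__init__()
        # Map acoustic embeddings into the LLM's embedding space.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, segments, text_embedder):
        # segments: list of ("text", token_ids) or ("audio", acoustic_embs) pairs.
        # text_embedder: the LLM's token embedding layer (an nn.Embedding).
        parts = []
        for kind, value in segments:
            if kind == "text":
                parts.append(text_embedder(value))        # (T_text, llm_dim)
            else:
                parts.append(self.audio_proj(value))      # (T_audio, llm_dim)
        # One interleaved sequence [text][audio][text]... fed to the LLM.
        return torch.cat(parts, dim=0)
```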

LLM-Based Agents for Audio Creation and Editing

  • Applied LLMs, such as ChatGPT, to create and edit audio content based on user instructions and available recordings.
  • Produced audio content in a controllable manner by coordinating various generative models (see the sketch after this list).
  • Evaluated the proposed system's ability on audio drama, where models must manipulate audio content without explicit user commands.
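
A minimal sketch of the coordination loop, assuming the LLM emits a step-by-step tool-call plan; the tool names (generate_speech, mix_tracks, etc.) and the plan format are hypothetical placeholders rather than the system's actual API.

```python
# Hypothetical tool names and plan format; not the actual system's API.
def run_audio_agent(llm, instruction, tools):
    """Ask the LLM for a plan, then call the matching generative model
    (TTS, music, sound effects, ...) for each step and mix the results."""
    plan = llm.plan(instruction)         # e.g. [{"tool": "generate_speech", "args": {...}}, ...]
    stems = []
    for step in plan:
        generator = tools[step["tool"]]  # look up the generative model to call
        stems.append(generator(**step["args"]))
    return tools["mix_tracks"](stems)    # combine the stems into the final audio
```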

Bootstrapping Audio Language Models in Few-Shot Learning

  • Improved the performance of Contrastive Language-Audio Pretraining (CLAP) models with a few labelled examples while preserving their zero-shot classification ability.
  • Proposed a new module that predicts the labels of test examples by measuring the affinity between test and support embeddings (sketched below).
  • Devised a cosine initialisation strategy so that the proposed method benefits from the few-shot examples even without training.
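
A minimal sketch of the affinity module and the cosine initialisation, assuming L2-normalised CLAP embeddings, a cached support set with one-hot labels, and a blending weight alpha; the names, shapes, and the softmax blending are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FewShotAffinityHead(nn.Module):
    """Predicts labels for test clips from their affinity to support embeddings;
    shapes, the softmax, and the blending weight alpha are illustrative."""

    def __init__(self, support_embs, support_labels, num_classes, alpha=1.0):
        super().__init__()
        # Cosine initialisation: the layer's weights start as the normalised
        # support embeddings, so affinities are meaningful before any training.
        self.affinity = nn.Linear(support_embs.shape[1], support_embs.shape[0], bias=False)
        self.affinity.weight.data = F.normalize(support_embs, dim=-1)
        self.register_buffer("support_onehot", F.one_hot(support_labels, num_classes).float())
        self.alpha = alpha

    def forward(self, test_embs, zero_shot_logits):
        test_embs = F.normalize(test_embs, dim=-1)
        aff = self.affinity(test_embs)                   # (N_test, N_support) cosine affinities
        few_shot_logits = aff.softmax(dim=-1) @ self.support_onehot
        # Blend few-shot evidence with CLAP's zero-shot text-audio logits.
        return zero_shot_logits + self.alpha * few_shot_logits
```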

Efficient Convolutional Neural Networks (CNNs) for Mobile Applications

  • Researched reducing computational complexity and resource cost with minimal performance degradation.
  • Investigated how to reduce feature redundancy in CNN architectures (see the sketch after this list).
  • Extracted feature representations to solve multimedia tasks, including acoustic scene classification, sound event detection and image classification.
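
A generic sketch of one way to cut feature redundancy, assuming half of the output channels are produced by a standard convolution and the other half by cheap depthwise convolutions; this illustrates the redundancy-reduction idea rather than the exact proposed architecture.

```python
import torch
import torch.nn as nn

class CheapFeatureBlock(nn.Module):
    """Illustrative block: half of the output channels come from a standard
    convolution, the other half are generated from them with a cheap depthwise
    convolution (far fewer FLOPs). Assumes out_ch is even."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        # Depthwise convolution generating the remaining "cheap" feature maps.
        self.cheap_conv = nn.Sequential(
            nn.Conv2d(primary, out_ch - primary, kernel_size=3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_conv(y)], dim=1)
```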

Detection and Classification of Acoustic Scenes and Events

  • Recognized audio scenes and sound events from recordings or online streams through pattern recognition and signal processing.
  • Illustrated the relationship between the receptive field of a CNN and the time-frequency resolution of mel energy spectrogram features (see the sketch after this list).
  • Proposed a deep CNN architecture that exploits fine-resolution features.
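
A small sketch of the receptive-field arithmetic behind this analysis: each layer grows the receptive field by its kernel size times the product of the preceding strides, which bounds how many time-frequency bins of the mel spectrogram a single CNN unit sees. The example layer stack is an assumption for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) for each conv/pool layer.
    Returns the receptive field (in input bins) of one output unit."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scales with the accumulated stride
        jump *= stride
    return rf

# Hypothetical stack: three 3x3 convolutions, each followed by 2x2 pooling.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))  # -> 22 mel/time bins per output unit
```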