Audio-Visual Learning for Understanding and Generation

  • Proposed a unified model that jointly learns feature representations and performs generation.
  • Evaluated both audio-trained and video-trained encoders on classification and retrieval tasks.
  • Introduced generative models to model the conditional probability distribution for each modality.

Adapting LLMs for Audio Understanding

  • Proposed an audio language model that ingests multiple audio clips and generates text tokens by interleaving the acoustic embeddings with text embeddings in a sequence (sketched below).
  • Proposed a training recipe that combines curriculum learning and multi-task learning.
  • Evaluated the proposed audio large language model on various downstream tasks.
  • Extended large language models and large vision models to the audio domain.
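
A minimal sketch of the interleaving step, assuming a learned linear projection maps acoustic embeddings into the LLM's embedding space; the module name, dimensions, and segment format are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class InterleavedAudioTextInput(nn.Module):
    """Builds one embedding sequence by interleaving text and audio segments.
    Dimensions and the projection layer are assumptions for illustration."""

    def __init__(self, audio_dim=768, llm_dim=4096):
        super().__init__()
        # Map acoustic embeddings into the LLM's embedding space.
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, segments, text_embedder):
        # segments: list of ("text", token_ids) or ("audio", acoustic_embs) pairs.
        # text_embedder: the LLM's token embedding layer (an nn.Embedding).
        parts = []
        for kind, value in segments:
            if kind == "text":
                parts.append(text_embedder(value))        # (T_text, llm_dim)
            else:
                parts.append(self.audio_proj(value))      # (T_audio, llm_dim)
        # One interleaved sequence [text][audio][text]... fed to the LLM.
        return torch.cat(parts, dim=0)
```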

LLM-Based Agents for Audio Creation and Editing

  • Applied LLMs, such as ChatGPT, to create and edit audio content based on user instructions and available recordings.
  • Produced audio content in a controllable manner by coordinating various generative models (see the sketch after this list).
  • Evaluated the proposed system's ability on audio drama, where models must manipulate audio content without explicit user commands.
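
A minimal sketch of the coordination loop, assuming the LLM emits a step-by-step tool-call plan; the tool names (generate_speech, mix_tracks, etc.) and the plan format are hypothetical placeholders rather than the system's actual API.

```python
# Hypothetical tool names and plan format; not the actual system's API.
def run_audio_agent(llm, instruction, tools):
    """Ask the LLM for a plan, then call the matching generative model
    (TTS, music, sound effects, ...) for each step and mix the results."""
    plan = llm.plan(instruction)         # e.g. [{"tool": "generate_speech", "args": {...}}, ...]
    stems = []
    for step in plan:
        generator = tools[step["tool"]]  # look up the generative model to call
        stems.append(generator(**step["args"]))
    return tools["mix_tracks"](stems)    # combine the stems into the final audio
```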

Bootstrapping Audio Language Models in Few-Shot Learning

  • Improved the performance of Contrastive Language-Audio Pretraining (CLAP) models with a few labelled examples while preserving their zero-shot classification ability.
  • Proposed a new module that predicts the labels of test examples by measuring the affinity between test and support embeddings (sketched below).
  • Devised a cosine initialisation strategy so that the proposed method benefits from the few-shot examples even without training.
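
A minimal sketch of the affinity module and the cosine initialisation, assuming L2-normalised CLAP embeddings, a cached support set with one-hot labels, and a blending weight alpha; the names, shapes, and the softmax blending are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FewShotAffinityHead(nn.Module):
    """Predicts labels for test clips from their affinity to support embeddings;
    shapes, the softmax, and the blending weight alpha are illustrative."""

    def __init__(self, support_embs, support_labels, num_classes, alpha=1.0):
        super().__init__()
        # Cosine initialisation: the layer's weights start as the normalised
        # support embeddings, so affinities are meaningful before any training.
        self.affinity = nn.Linear(support_embs.shape[1], support_embs.shape[0], bias=False)
        self.affinity.weight.data = F.normalize(support_embs, dim=-1)
        self.register_buffer("support_onehot", F.one_hot(support_labels, num_classes).float())
        self.alpha = alpha

    def forward(self, test_embs, zero_shot_logits):
        test_embs = F.normalize(test_embs, dim=-1)
        aff = self.affinity(test_embs)                   # (N_test, N_support) cosine affinities
        few_shot_logits = aff.softmax(dim=-1) @ self.support_onehot
        # Blend few-shot evidence with CLAP's zero-shot text-audio logits.
        return zero_shot_logits + self.alpha * few_shot_logits
```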

Efficient Convolutional Neural Networks (CNNs) for Mobile Applications

  • Researched reducing computational complexity and resource cost with minimal performance degradation.
  • Investigated how to reduce feature redundancy in CNN architectures (see the sketch after this list).
  • Extracted feature representations to solve multimedia tasks, including acoustic scene classification, sound event detection and image classification.
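
A generic sketch of one way to cut feature redundancy, assuming half of the output channels are produced by a standard convolution and the other half by cheap depthwise convolutions; this illustrates the redundancy-reduction idea rather than the exact proposed architecture.

```python
import torch
import torch.nn as nn

class CheapFeatureBlock(nn.Module):
    """Illustrative block: half of the output channels come from a standard
    convolution, the other half are generated from them with a cheap depthwise
    convolution (far fewer FLOPs). Assumes out_ch is even."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary = out_ch // 2
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_ch, primary, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        # Depthwise convolution generating the remaining "cheap" feature maps.
        self.cheap_conv = nn.Sequential(
            nn.Conv2d(primary, out_ch - primary, kernel_size=3, padding=1,
                      groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary_conv(x)
        return torch.cat([y, self.cheap_conv(y)], dim=1)
```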

Detection and Classification of Acoustic Scenes and Events

  • Recognized audio scenes and sound events from recordings or online streams through pattern recognition and signal processing.
  • Illustrated the relationship between the receptive field of a CNN and the time-frequency resolution of mel energy spectrogram features (see the sketch after this list).
  • Proposed a deep CNN architecture that exploits fine-resolution features.
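
A small sketch of the receptive-field arithmetic behind this analysis: each layer grows the receptive field by its kernel size times the product of the preceding strides, which bounds how many time-frequency bins of the mel spectrogram a single CNN unit sees. The example layer stack is an assumption for illustration.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) for each conv/pool layer.
    Returns the receptive field (in input bins) of one output unit."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth scales with the accumulated stride
        jump *= stride
    return rf

# Hypothetical stack: three 3x3 convolutions, each followed by 2x2 pooling.
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))  # -> 22 mel/time bins per output unit
```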