Open Source Competitions

Time

TBD

Authors

Siyi Hu (Monash University)*; Fengda Zhu (Monash); Xiaojun Chang (Monash University); Xiaodan Liang (Sun Yat-sen University)

Abstract

We present a transformer-based agent that learns policies for multi-agent cooperation tasks, a breakthrough over traditional RNN-based multi-agent models that must be retrained for each task. Our model handles varying inputs and outputs with strong transfer ability and can tackle different tasks in parallel. In addition, we are the first to successfully integrate a transformer into a recurrent architecture, in particular for POMDPs, providing insight into stabilizing transformers in recurrent RL tasks.
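As a rough illustration of the core idea (not the authors' exact architecture), the sketch below shows how a transformer policy can embed a variable number of per-entity observations and output per-entity action logits, so the same weights apply to tasks of different sizes:

# Illustrative sketch only: a transformer policy over a variable number of
# entity observations, producing per-entity action logits.
import torch
import torch.nn as nn

class TransformerPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, entity_obs):
        # entity_obs: (batch, n_entities, obs_dim); n_entities may vary per task
        h = self.encoder(self.embed(entity_obs))
        return self.action_head(h)  # (batch, n_entities, n_actions)

# The same policy runs on tasks with different numbers of entities.
policy = TransformerPolicy(obs_dim=16, n_actions=5)
logits_small = policy(torch.randn(8, 3, 16))   # 3 entities
logits_large = policy(torch.randn(8, 10, 16))  # 10 entities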

Time

TBD

Authors

Liu Liu (Shanghai Jiao Tong University); Wenzhe Wang (Zhejiang University); Zhijie Zhang (Shanghai Jiao Tong University); Mengdan Zhang (Youtu, Tencent)*; Pai Peng (Tencent Youtu Lab); Xing Sun (Tencent)

Abstract

Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid growth of video on the web. Recent research addresses different aspects of this task, such as exploiting multi-modal video cues, hierarchical reasoning, and learning pre-trained models. The implementations of these approaches vary considerably, which hinders further research. In this paper, we therefore provide a unified video-text retrieval framework with the following features: 1) a modular design that allows easy modification of different deep learning model structures; 2) training and test pipelines for state-of-the-art (SOTA) models that leverage hierarchical cues and interactions between different levels of granularity and different video modalities; 3) support for various benchmark datasets; and 4) demo exhibitions, thorough testing, and documentation. We hope our unified framework proves useful and efficient for further research.
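For orientation, a minimal sketch of the common dual-encoder retrieval pattern is given below; the framework's SOTA models build hierarchical cues and multiple video modalities on top of this skeleton, so the names and structure here are purely illustrative:

# Minimal dual-encoder sketch: project both modalities into a shared space and
# rank video-text pairs by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, video_dim, text_dim, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v @ t.T  # cosine-similarity matrix used for ranking

model = DualEncoder(video_dim=2048, text_dim=768)
scores = model(torch.randn(4, 2048), torch.randn(4, 768))  # 4 videos x 4 texts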

Time

TBD

Authors

Dalu Feng (Institute of Computing Technology, Chinese Academy of Sciences)*; Shuang Yang (ICT, CAS); Shiguang Shan (Institute of Computing Technology, Chinese Academy of Sciences)

Abstract

Lip reading is an impressive human skill with many potential applications. However, building an automatic lip-reading system remains a challenging task. Previous work on lip-reading models often relies on implicit operations and undocumented training details, which makes re-implementation difficult for most researchers. In this work, we aim to establish a clear and efficient pipeline that provides a convenient implementation tool and an easy starting point for researchers and anyone else who wants to study lip reading. We empirically study and introduce several useful training strategies within a clear and unified implementation procedure and compare their effects. Although built on a basic model, our pipeline achieves 88.4%/56.0% on two popular large-scale lip-reading datasets, comparable to or even higher than state-of-the-art results. Our code and models are available at https://github.com/Fengdalu/learn-an-effective-lip-reading-model-without-pains.
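As an illustration of the kind of training strategy the pipeline compares, the sketch below combines label smoothing with a cosine learning-rate schedule on a placeholder model; the exact strategies, model, and hyperparameters in the repository may differ:

# Hedged sketch of two common training strategies (label smoothing + cosine LR
# schedule) on a placeholder classifier standing in for a lip-reading model.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

loader = DataLoader(TensorDataset(torch.randn(64, 1, 88, 88),          # dummy mouth crops
                                  torch.randint(0, 500, (64,))),        # dummy word labels
                    batch_size=32)
model = nn.Sequential(nn.Flatten(), nn.Linear(88 * 88, 500))            # placeholder model
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                    # label smoothing
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80)

for epoch in range(80):
    for frames, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                                     # cosine LR decay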

Time

TBD

Authors

Ran Yi (Tsinghua University); Yong-Jin Liu (Tsinghua University)*

Abstract

The software transforms face photos into artistic portrait drawings in multiple styles using a GAN-based model. The source code is available at https://github.com/yiranran/Unpaired-Portrait-Drawing. The software is the implementation of the CVPR 2020 paper "Unpaired Portrait Drawing Generation via Asymmetric Cycle Mapping", which is also an improved version of our previous work published at TPAMI and CVPR 2019. A preliminary version of this code was integrated into a mini-program on WeChat, the most popular free messaging and calling app in China. The mini-program became popular, receiving about 400K user clicks in only two weeks.

Time

TBD

Authors

Yezhi Shu (Tsinghua University)*; Ran Yi (Tsinghua University); Yong-Jin Liu (Tsinghua University)

Abstract

The software transforms real-world photos into multi-style cartoons based on a generative adversarial network (GAN). The source code is available at https://github.com/syz825211943/Multi-Style-Photo-Cartoonization and implements our TVCG paper "GAN-based Multi-Style Photo Cartoonization". This code is also an improved version of our preliminary work at CVPR 2018, which has drawn considerable attention in the image style translation area and has so far been cited 124 times on Google Scholar. A variant of this code has been deployed on the Huawei Ascend chip processor and incorporated into the Atlas 200 DK Developer Kit and Atlas 300 Inference Server. On average, 27K users per month use our code to generate their own cartoons in the Huawei Ascend community.

Time

TBD

Authors

Yaohua Liu (Dalian University of Technology)*; Risheng Liu (Dalian University of Technology)

Abstract

Meta-learning (a.k.a. learning to learn) has recently emerged as a promising paradigm for a variety of applications. There are now many meta-learning methods, each focusing on different modeling aspects of base and meta learners, but all can be (re)formulated as specific bilevel optimization problems. This work presents BOML, a modularized optimization library that unifies several meta-learning algorithms into a common bilevel optimization framework. It provides a hierarchical optimization pipeline together with a variety of iteration modules, which can be used to solve the mainstream categories of meta-learning methods, such as meta-feature-based and meta-initialization-based formulations. The library is written in Python and is available at https://github.com/dut-media-lab/BOML.
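To make the bilevel structure concrete, the sketch below shows the generic pattern behind meta-initialization methods (a MAML-style inner adaptation step nested inside an outer meta-update); this is not the BOML API, only the optimization pattern that such a library wraps into reusable inner/outer iteration modules:

# Generic bilevel sketch: inner (lower-level) adaptation on a task's support set,
# outer (upper-level) update of the meta-initialization from the query loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

meta_model = nn.Linear(10, 2)                         # meta-initialization (outer variable)
meta_opt = torch.optim.Adam(meta_model.parameters(), lr=1e-3)

def sample_task():
    # Hypothetical task sampler returning (support_x, support_y, query_x, query_y).
    x = torch.randn(20, 10)
    y = torch.randint(0, 2, (20,))
    return x[:10], y[:10], x[10:], y[10:]

for step in range(100):
    sx, sy, qx, qy = sample_task()
    # Inner problem: adapt a copy of the parameters on the support set.
    fast_weights = [p.clone() for p in meta_model.parameters()]
    inner_loss = F.cross_entropy(F.linear(sx, fast_weights[0], fast_weights[1]), sy)
    grads = torch.autograd.grad(inner_loss, fast_weights, create_graph=True)
    fast_weights = [w - 0.1 * g for w, g in zip(fast_weights, grads)]
    # Outer problem: update the meta-initialization from the query-set loss.
    meta_loss = F.cross_entropy(F.linear(qx, fast_weights[0], fast_weights[1]), qy)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()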

Time

TBD

Authors

Xiyue Sun (Beijing Jiaotong University)*; Feng Li (Beijing Jiaotong University); Huihui Bai (Beijing Jiaotong University); Yao Zhao (Beijing Jiaotong University)

Abstract

We present a deep dual attention network (DDAN) for video super-resolution, which cascades a motion compensation network (MCNet) and an SR reconstruction network (ReconNet). MCNet uses a pyramid framework and progressively learns optical flow representations to synthesize motion information across adjacent frames; it also extracts detail components of the LR neighboring frames for more accurate motion compensation. In ReconNet, we combine dual attention mechanisms and a residual learning strategy to recover high-frequency details. DDAN performs effectively and generalizes well on video super-resolution tasks. The project has been released on GitHub.
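For intuition, the sketch below shows one common instantiation of a dual (channel plus spatial) attention block that re-weights feature maps; the exact modules used in DDAN's ReconNet may differ:

# Minimal dual-attention sketch: channel attention followed by spatial attention.
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight channels.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # Spatial attention: produce a per-pixel gate from the feature map.
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)       # re-weight channels
        return x * self.spatial(x)    # re-weight spatial positions

feat = torch.randn(1, 64, 32, 32)
out = DualAttention(64)(feat)         # same shape, attention-weighted features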

Time

TBD

Authors

Jiarong Han (PCL); Cheng Lai (Peng Cheng Lab)*; Hao Dong (Peking University)

Abstract

TensorLayer 3.0 is a deep learning library refactored from TensorLayer 2.0. It is compatible with multiple deep learning frameworks and designed for researchers and engineers. The library provides an extensive collection of customised neural layers that enable users to build advanced AI models quickly. TensorLayer 3.0 introduces backend layers that unify the low-level API operators of multiple frameworks, and model abstraction layers that support model building with different frameworks. As a result, the library combines low coupling, easy extensibility, and strong compatibility with other frameworks. The TensorLayer community has accumulated a large number of users, and there are many open-source examples to help users get started quickly.
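As a rough illustration of the backend-unification design, the sketch below assumes the backend is selected through a TL_BACKEND environment variable and that layer construction follows the TensorLayer 2.x-style API; the exact names and signatures in the 3.0 release may differ, so consult the project documentation:

# Assumed usage pattern (check the repository for exact names): pick a backend
# once, then build layers with the same code regardless of the framework underneath.
import os
os.environ['TL_BACKEND'] = 'tensorflow'      # assumed mechanism for backend selection

import tensorlayer as tl

dense = tl.layers.Dense(n_units=64, in_channels=784)   # same layer code on any backend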

Time

TBD

Authors

Aaro Altonen (Tampere University)*; Joni Räsänen (Tampere University); Alexandre MERCAT (Tampere University); Jarno Vanne (Tampere University)

Abstract

Real-time video transport plays a central role in various interactive and streaming media applications. This paper presents the new release of our open-source Real-time Transport Protocol (RTP) library called uvgRTP (github.com/ultravideo/uvgRTP) that is designed for economic video and audio transmission in real time. It is the first public library that comes with built-in support for modern VVC, HEVC, and AVC video codecs and Opus audio codec. It can also be tailored to diversified media formats with an easy-to-use generic API. According to our experiments, uvgRTP can stream 8K VVC video at 300 fps with an average round-trip latency of 4.9 ms over a 10 Gbit link. This cross-platform library can be run on Windows and Linux operating systems and the permissive BSD 2-Clause license makes it accessible to a broad range of commercial and academic streaming media applications.

Time

TBD

Authors

Adam Wieckowski (HHI)*; Jens Brandenburg (HHI); Tobias Hinz (HHI); Christian Bartnik (Fraunhofer HHI); Valeri George (Fraunhofer HHI); Gabriel Hege (HHI); Christian R. Helmrich (Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut, HHI); Anastasia Henkel (Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute); Christian Lehmann (HHI); Christian Stoffers (HHI); Ivan Zupancic (Fraunhofer HHI); Benjamin Bross (HHI); Detlev Marpe (HHI)

Abstract

The recently finalized Versatile Video Coding (VVC) standard promises to reduce the video bitrate by 50% compared to its predecessor, High Efficiency Video Coding (HEVC). The increased efficiency comes at a cost of increased computational burden. The Fraunhofer Versatile Video Encoder VVenC is the first openly available optimized implementation providing access to VVC’s efficiency at only 46% of the runtime of the VVC test model VTM, without multi-threading. An alternative operating point allows 30× faster encoding for the price of around 12% bitrate increase, while still providing around 38% bitrate reduction compared to HEVC test model HM. In the fastest configuration, VVenC runs over 140× faster than VTM while still providing over 10% bitrate reduction compared to HM. Even faster encoding is possible with multi-threading.

Time

TBD

Authors

Marc Gorriz Blanch (BBC)*; Saverio Blasi (BBC); Noel O'Connor (DCU); Alan Smeaton (Insight Centre for Data Analytics, Dublin City University); Marta Mrak (BBC)

Abstract

Neural networks can be successfully used to improve several modules of advanced video coding schemes. In particular, compression of colour components was shown to greatly benefit from usage of machine learning models, thanks to the design of appropriate attention-based architectures that allow the prediction to exploit specific samples in the reference region. However, such architectures tend to be complex and computationally intense, and may be difficult to deploy in a practical video coding pipeline. The software presented in this paper introduces a collection of simplifications to reduce the complexity overhead of the attention-based architectures. The simplified models are integrated into the Versatile Video Coding (VVC) prediction pipeline, retaining compression efficiency of previous chroma intra-prediction methods based on neural networks, while offering different directions for significantly reducing coding complexity.
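A schematic sketch of the underlying attention idea is given below (not the simplified models shipped in the software): features of the collocated luma block act as queries that attend over features of the boundary reference samples, so each predicted chroma sample can exploit specific samples in the reference region:

# Schematic cross-attention sketch for chroma intra prediction; all dimensions
# and module names are illustrative.
import torch
import torch.nn as nn

class ChromaCrossAttention(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_chroma = nn.Linear(d_model, 2)           # Cb and Cr components

    def forward(self, luma_feats, ref_feats):
        # luma_feats: (batch, block_pixels, d_model) queries from the collocated luma block
        # ref_feats:  (batch, ref_samples,  d_model) keys/values from the reference region
        ctx, _ = self.attn(luma_feats, ref_feats, ref_feats)
        return self.to_chroma(ctx)                        # per-pixel chroma prediction

pred = ChromaCrossAttention()(torch.randn(1, 64, 32), torch.randn(1, 17, 32))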

Time

TBD

Authors

Maurice Quach (L2S, CNRS, CentraleSupelec)*; Giuseppe Valenzise (CNRS); Frederic Dufaux (CNRS)

Abstract

This short paper describes a TensorFlow toolbox for point cloud geometry coding based on deep neural networks. This coding method employs a deep auto-encoder trained with a focal loss to learn good representations for voxel occupancy. The software provides several coding parameters to achieve different rate-distortion trade-offs, and comes with pre-trained models to reproduce the results of the published paper. It also offers a number of utility functions for evaluating and comparing the codec. To our knowledge, this is the first publicly available open-source toolbox for deep-learning-based point cloud coding.
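To illustrate the training objective, the standalone sketch below applies the focal loss to binary voxel occupancy; the actual toolbox is implemented in TensorFlow, so this PyTorch snippet only demonstrates the loss formulation, with hyperparameters chosen arbitrarily:

# Focal loss on voxel occupancy: down-weight easy voxels so training focuses on
# hard, sparsely occupied regions.
import torch

def focal_loss(logits, occupancy, alpha=0.75, gamma=2.0):
    # occupancy: 1 where a voxel is occupied, 0 otherwise.
    p = torch.sigmoid(logits)
    pt = torch.where(occupancy == 1, p, 1 - p)                       # prob. of the true class
    at = torch.where(occupancy == 1, torch.full_like(p, alpha),
                     torch.full_like(p, 1 - alpha))                  # class-balance weight
    return (-at * (1 - pt) ** gamma * torch.log(pt.clamp_min(1e-6))).mean()

logits = torch.randn(2, 1, 64, 64, 64)                               # predicted occupancy logits
occupancy = torch.randint(0, 2, (2, 1, 64, 64, 64)).float()
loss = focal_loss(logits, occupancy)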

Open Source Chairs

Shiguang Shan
Institute of Computing Technology, China
Wei Hu
Peking University, China