ICCV19 Experiences

With the fast-paced advancements in the machine learning community, conferences have become the frontier of the field. Unlike many other mature scientific disciplines, most machine learning research is presented at conferences, which makes attending them extremely valuable. In fact, it is hard to get a ticket if you do not book early: NIPS18 sold out within minutes, and almost every other conference in the field is witnessing enormous growth in both submissions and attendees [36, 37]. The International Conference on Computer Vision (ICCV) is a prestigious venue for deep learning and computer vision. Fortunately, ICCV19 was held last week in Seoul, a fantastic opportunity for me since my university is located nearby. Thanks to my supervisor, Dr. Sung-Ho Bae, I was able to attend. I had an incredible seven days attending workshops, tutorials, oral and poster presentations, mingling with researchers, hearing and digesting great ideas of the field, and much more. In this post, I share my first-ever experience of attending a major scientific conference.

Workshops and Tutorials

The conference scheduled workshops and tutorials on the first two and last days, with all the paper presentations (oral + poster) sandwiched in between. On the first day, I attended two talks from the workshop on statistical deep learning and robust subspace learning. The first talk, by Prof. Ying Nian Wu, was on matrix and vector representations in neuroscience.

The central theme of the talk was the place and grid cells in the mammalian brain that are responsible for navigation [1]. Place cells fire only when we are at a specific location, while grid cells fire in characteristic hexagonal patterns. These neurons support path planning and motion and are even called our internal GPS because they assist navigation even in the absence of visual cues. The paper [1] shows that an agent's self-position can be represented by a high-dimensional vector and its self-displacement by a matrix acting on that vector, so that path planning, displacement, and many other activities can be expressed as transformations on these representations. The authors even show that the hexagonal patterns characteristic of grid cells emerge from these learned higher-dimensional representations.
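
To make this concrete, here is a minimal numpy sketch of the algebraic picture only, not the learned representations from [1]: self-position is a high-dimensional vector and self-motion is a matrix acting on it, so path integration becomes matrix-vector multiplication. The dimension and the motion matrix below are purely illustrative.

```python
import numpy as np

# Illustrative dimensions only; the paper learns these representations from data.
d = 16                               # dimension of the position embedding v(x)
rng = np.random.default_rng(0)

v_x = rng.normal(size=d)             # vector representing the current self-position x
v_x /= np.linalg.norm(v_x)

# A rotation-like matrix standing in for the learned motion operator M(dx).
A = rng.normal(size=(d, d))
M_dx = np.linalg.qr(A)[0]            # orthogonal matrix, so the norm is preserved

# Self-displacement is modeled as a matrix acting on the position vector:
# v(x + dx) ~= M(dx) v(x)
v_x_plus_dx = M_dx @ v_x

# Path integration: composing two displacements is just a matrix product.
v_after_two_steps = M_dx @ (M_dx @ v_x)
print(np.allclose(v_after_two_steps, (M_dx @ M_dx) @ v_x))   # True
```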

The second talk, by Jean Kossaifi, was on tensor factorization for robust learning, based on [2]. The idea is to use tensor decomposition inside deep neural networks, which decreases computational complexity and increases robustness. They showed that parametrizing the whole network with a single high-order tensor and then factorizing it makes the network lightweight and more robust while maintaining accuracy. Jean Kossaifi is also an author of a well-known Python library for this purpose called TensorLy.
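
Since TensorLy was mentioned, here is a small, hedged example of what tensor factorization looks like in code: Tucker-decomposing a made-up convolutional weight tensor into a small core plus factor matrices. This is only the decomposition step, not the full T-Net parametrization of [2], and the shapes and ranks are arbitrary.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

tl.set_backend('numpy')

# A made-up 4D convolution weight tensor: (out_channels, in_channels, k, k).
weights = np.random.randn(64, 32, 3, 3)

# Tucker decomposition: a small core tensor plus one factor matrix per mode.
core, factors = tucker(tl.tensor(weights), rank=[16, 8, 3, 3])

full_params = weights.size
compressed = core.size + sum(f.size for f in factors)
print(f"original: {full_params} params, factorized: {compressed} params")

# The factorized form approximately reconstructs the original weights
# (random weights compress poorly; trained weights are far more compressible).
approx = tl.tucker_to_tensor((core, factors))
print("relative error:", np.linalg.norm(approx - weights) / np.linalg.norm(weights))
```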

In the second half of the day, I attended the tutorial on interpreting deep learning, which was jam-packed, much like its predecessor at CVPR19.

The first talk of this tutorial was on understanding models via visualizations and attribution, by Andrea Vedaldi. Despite their impressive performance, deep neural networks are black boxes, and understanding them is crucial. This understanding can be divided into three parts: what a network does, how it does it, and how it learns. This talk was about the second aspect, i.e., how a network does a specific task, and revolved around two central questions: what does an intermediate layer of a DNN know about the input, and what does a neural network see in the input image?

Question 1: What does a model know about an input image \(x\) at an intermediate layer whose representation is \(y\), as shown in the following figure?

One way to answer this question is to project \(y\) back onto input space. This can be done by simple reconstruction: start from a random initialization of \(x_0\) and minimize the distance between its features and \(y\), i.e., \(\min_{x_0} \lVert \Phi(x_0) - y \rVert^2\), to obtain the so-called pre-image \(x_0\). This reconstruction requires no training data. There are several ways to reconstruct the pre-image, such as those presented in [3, 4, 5]. Interestingly, the reconstruction from any layer shows a striking similarity to the input. Even the last fully connected layer, despite containing only information about output probabilities, yields a reconstruction that resembles the input image.
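
Here is a rough PyTorch sketch of that reconstruction idea, in the spirit of [3]: freeze a pretrained network, record the features \(y\) of an image at some layer, and optimize a randomly initialized \(x_0\) so that its features match \(y\). The layer choice and hyperparameters are arbitrary, and the natural-image regularizers used in the actual papers are omitted.

```python
import torch
import torchvision.models as models

device = "cpu"
model = models.vgg16(pretrained=True).features[:16].to(device).eval()  # up to some conv layer
for p in model.parameters():
    p.requires_grad_(False)

x = torch.rand(1, 3, 224, 224, device=device)      # stand-in for a real, preprocessed image
with torch.no_grad():
    y = model(x)                                    # target features at the chosen layer

# Start from random noise and minimize || phi(x0) - y ||^2 to obtain the pre-image x0.
x0 = torch.rand_like(x, requires_grad=True)
opt = torch.optim.Adam([x0], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = (model(x0) - y).pow(2).mean()
    loss.backward()
    opt.step()
# x0 is now the "pre-image": an input whose layer features resemble those of x.
```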

The deep image prior [4] had previously shown a fascinating result: we can fit a neural network to a single image and perform many restoration tasks with it. If we fit a neural network to a single image and then ask the above question of this network, it still gives a very clear reconstruction of the input image. This is interesting because the network has not been trained on any dataset. For an interesting discussion on this topic, see this.

Question 2: What does a neural network see in the input image? One straightforward way to answer this is to visualize the gradients of the network's output with respect to its input image [6, 7, 8].

These visualizations do give some hints, but they are noisy. This can be mitigated by smoothing or averaging the gradients [9]. However, all of these visualization methods suffer from a lack of class sensitivity, i.e., changing the output class barely changes the visualization.
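
A minimal sketch of both steps: vanilla gradient saliency, followed by SmoothGrad-style averaging of gradients over noisy copies of the input. The model, class index, and noise level below are arbitrary stand-ins.

```python
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)           # stand-in for a preprocessed input image
target_class = 243                        # arbitrary ImageNet class index

def saliency(inp):
    """Gradient of the target class score w.r.t. the input (vanilla saliency)."""
    inp = inp.clone().requires_grad_(True)
    score = model(inp)[0, target_class]
    score.backward()
    return inp.grad.detach()

# Plain gradient map: tends to be noisy.
grad = saliency(x)

# SmoothGrad-style fix: average gradients over several noisy copies of the input.
n_samples, sigma = 25, 0.1
smooth = torch.zeros_like(x)
for _ in range(n_samples):
    smooth += saliency(x + sigma * torch.randn_like(x))
smooth /= n_samples
```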

This problem can be addressed with CAM and Grad-CAM [10, 11], which use an intermediate layer's activations, weighted by class-specific information, to highlight the prominent parts of the input image.
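
For completeness, here is a condensed Grad-CAM-style sketch along the lines of [11]: take the activations of the last convolutional block, weight each channel by its average gradient with respect to the class score, and keep the positive part. The layer and class index are illustrative.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)            # stand-in for a preprocessed image
target_class = 243                         # arbitrary ImageNet class index

# Capture the activations of the last convolutional block with a forward hook.
acts = {}
model.layer4.register_forward_hook(lambda module, inp, out: acts.update(a=out))

score = model(x)[0, target_class]
grads = torch.autograd.grad(score, acts["a"])[0]            # d(score)/d(activations)

# Weight each channel by its spatially averaged gradient, sum, and keep the positive part.
weights = grads.mean(dim=(2, 3), keepdim=True)              # shape (1, C, 1, 1)
cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # class-specific heatmap
```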

One question worth asking is what these visualizations actually mean. Andrea argued that gradients can be considered local linear approximations of the model, so gradient-based visualizations can be thought of as sensitivity analyses. This is where the new work on perturbation analysis [12] by him and his co-authors comes in: understand neural networks via meaningful perturbations, i.e., by injecting noise, rotating, or erasing parts of the input and observing the network's behavior.
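
The extremal-perturbation method of [12] optimizes for the most informative mask; as a much simpler stand-in that conveys the same spirit, here is a plain occlusion sweep: slide a gray patch over the image and record how much the class score drops.

```python
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
x = torch.rand(1, 3, 224, 224)               # stand-in for a preprocessed image
target_class = 243                            # arbitrary class index

with torch.no_grad():
    base = model(x).softmax(1)[0, target_class].item()

# Slide a gray patch over the image and record how much the class score drops.
patch, stride = 32, 32
H, W = x.shape[2:]
drop = torch.zeros(H // stride, W // stride)
with torch.no_grad():
    for i in range(0, H, stride):
        for j in range(0, W, stride):
            xp = x.clone()
            xp[:, :, i:i + patch, j:j + patch] = 0.5    # erase a region
            p = model(xp).softmax(1)[0, target_class].item()
            drop[i // stride, j // stride] = base - p   # large drop => important region
```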

This analysis yields several interesting results: for instance, the foreground alone might be sufficient for a correct prediction, yet removing the background can still hurt the network, among many other findings.

The second talk, on understanding the latent semantics of GANs, was delivered by Bolei Zhou. In recent years, GANs have become much better at producing high-quality, realistic, and diverse images, and this talk was about understanding that generation process. It revolved around three things: the role of individual neurons, the semantics of the latent space, and biases and limitations.

The first question was about the role of individual neurons in image generation. The GAN dissection paper [13] tackled this riddle by turning off individual neurons and then using semantic segmentation to measure the effect on the generated images. Interestingly, the authors found that different neurons are responsible for different objects and scene elements. For instance, one can find a neuron accountable for generating trees in a scene, and by turning this neuron off or on, one can remove or insert those trees. The following figure shows the effect of different units. You can try a demo here.
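
A toy sketch of the dissection-style intervention: zero out one channel ("unit") of an intermediate generator layer with a forward hook and compare the outputs. The generator and the unit index below are placeholders, not the trained GANs and segmentation machinery used in [13].

```python
import torch
import torch.nn as nn

# A toy stand-in for a pretrained generator; GAN dissection [13] uses real, trained GANs.
generator = nn.Sequential(
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
)

unit_to_ablate = 12   # hypothetical "tree" unit found via segmentation-based dissection

def ablate(module, inputs, output):
    output = output.clone()
    output[:, unit_to_ablate] = 0.0       # turn the unit off at every spatial location
    return output

handle = generator[0].register_forward_hook(ablate)

z = torch.randn(1, 128, 4, 4)
with torch.no_grad():
    img_without_unit = generator(z)       # compare against generator(z) with the hook removed
handle.remove()
```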

GANs generate images from a random vector drawn from a latent space, and a walk through this latent space changes the generated image in meaningful ways, which reveals the relationship between the latent vector and the semantics of the generated image. In ongoing work, Bolei Zhou et al. showed that it is, in fact, possible to classify latent codes according to semantic attributes and find separating boundaries in latent space; for example, one can find a boundary that controls whether indoor lights are on or off and move a latent code across it to toggle them.
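
A minimal sketch of that idea, with hypothetical data: fit a linear classifier on latent codes labeled by an attribute, take the normal direction of its decision boundary, and move a code along it to edit the attribute. In practice the labels would come from an attribute classifier run on the generated images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: latent codes z and binary attribute labels (e.g. lights on/off).
rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 128))
labels = (z[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)    # toy attribute

# Fit a linear boundary in latent space that separates the two attribute values.
clf = LogisticRegression(max_iter=1000).fit(z, labels)
normal = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# Editing: move a latent code across the boundary to flip the attribute,
# then feed the edited code back into the generator.
z_sample = rng.normal(size=128)
alpha = 3.0
z_edited = z_sample + alpha * normal
```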

The third question was whether we can use a GAN to reconstruct an image, i.e., given an image \(x\), can we find a latent code \(z\) such that the GAN reproduces the original image \(x\)? While GANs can roughly reconstruct images, the generated image turns out to be different from the original, as shown in the figure.
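
A bare-bones sketch of such GAN inversion: keep the generator frozen and optimize the latent code \(z\) so that \(G(z)\) matches the target image. The toy generator and pixel-only loss below are simplifications; [14] additionally uses layer-wise inversion and stronger losses.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained generator G(z).
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32), nn.Tanh())
for p in G.parameters():
    p.requires_grad_(False)

x = torch.rand(3 * 32 * 32) * 2 - 1       # the image we want the GAN to reproduce

# Search for a latent code z whose generated image matches x.
z = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.01)
for step in range(500):
    opt.zero_grad()
    loss = (G(z) - x).pow(2).mean()       # pixel loss; perceptual losses are also common
    loss.backward()
    opt.step()
# G(z) is the GAN's best reconstruction of x; whatever it fails to reproduce is telling.
```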

As shown in the paper [14], GANs tend to avoid generating several types of objects, such as people, vehicles, and signs. Similarly, given an Asian face, GANs tend to reconstruct it as a white face. Does this reveal the limitations and biases of the GAN?

After the talk, I was able to ask Zhou a few questions. One question that came to my mind from the Asian-face-to-white-face phenomenon was about the root of these reconstruction issues: is it the data or the GAN? To me, it seems like a data problem rather than a GAN problem.

The third and last talk of the session and the day was on explaining deep learning for identifying structures and biases in computer vision, by Prof. Alexander Binder. The talk was built around the concept of layer-wise relevance propagation (LRP) [15] for interpreting neural networks. LRP computes a relevance score for each input dimension, and these scores serve as the explanation. The method has successfully uncovered the game strategies a DNN learned for playing Atari Breakout. He also discussed different ways of finding biases in datasets.
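
As a flavor of how LRP redistributes relevance, here is a minimal numpy sketch of the epsilon rule for a single linear layer, roughly as described in [15]; real implementations apply such rules layer by layer from the output back to the input.

```python
import numpy as np

def lrp_epsilon_linear(a, W, b, R_out, eps=1e-6):
    """Redistribute relevance R_out from a linear layer's outputs back to its inputs.

    a: input activations (n_in,), W: weights (n_in, n_out), b: bias (n_out,),
    R_out: relevance of each output neuron (n_out,).
    """
    z = a @ W + b                        # pre-activations of the layer
    z = z + eps * np.sign(z)             # epsilon term stabilizes small denominators
    s = R_out / z                        # relevance per unit of pre-activation
    return a * (W @ s)                   # relevance of each input dimension

# Toy example: relevance is (approximately) conserved across the layer.
rng = np.random.default_rng(0)
a, W, b = rng.random(5), rng.normal(size=(5, 3)), np.zeros(3)
R_out = np.array([0.2, 0.5, 0.3])
R_in = lrp_epsilon_linear(a, W, b, R_out)
print(R_in, R_in.sum())                  # sums roughly to R_out.sum()
```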

On the second day of workshops and tutorials, I attended the Neural Architects workshop, organized by the famous VGG group of Oxford University. The first talk, Neural Architecture and Beyond, was given by Barret Zoph and offered a holistic view of progress in neural network architectures and NAS. He divided AI into four periods: first, good old-fashioned AI, where handcrafted features were used without any learning; second, handcrafted features combined with a learning algorithm; third, the recent deep learning wave, where most of the feature engineering happens at the architectural level; and fourth, learning to learn, i.e., neural architecture search, a recent phenomenon.

While it can be argued that most of the recent gains in deep learning came from improvements in architecture (AlexNet [16], VGG [17], ResNet [18], DenseNet [19], SENet [20], etc.), architecture design is also one of the most difficult, time-consuming, and almost artistic jobs, requiring a lot of human effort, intuition, and experimentation. This makes it an ideal task for machine learning itself, i.e., making a neural network learn the architecture of neural networks. In the modern deep learning era, this line of work was initiated by [21, 22]. Specifically, these papers introduced a reinforcement learning approach to finding good architectures: a controller proposes an architecture from a search space, the proposed network is trained on a subset of the dataset, and the trained network's accuracy (and other relevant metrics) is fed back to the controller as a reward. In this way, the controller learns to output better-performing architectures.
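
A self-contained toy version of this loop is sketched below: a tiny "controller" (just a table of logits) samples one operation per layer, receives a made-up reward that stands in for validation accuracy, and is updated with REINFORCE. The search space and reward here are purely illustrative; real NAS [21] trains every sampled child network.

```python
import torch

# Toy RL-style NAS loop: a tiny controller picks one op per layer and is rewarded
# by a stand-in "accuracy" function.
ops = ["conv3x3", "conv5x5", "maxpool", "identity"]
n_layers = 4

logits = torch.zeros(n_layers, len(ops), requires_grad=True)    # the "controller"
opt = torch.optim.Adam([logits], lr=0.1)

def fake_reward(arch):
    # Stand-in for: build the child network, train it briefly, return validation accuracy.
    return sum(op == "conv3x3" for op in arch) / n_layers

baseline = 0.0
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    sample = dist.sample()                          # one op index per layer
    arch = [ops[i] for i in sample.tolist()]
    reward = fake_reward(arch)

    # REINFORCE update: make high-reward architectures more likely.
    loss = -(reward - baseline) * dist.log_prob(sample).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    baseline = 0.9 * baseline + 0.1 * reward

print([ops[i] for i in logits.argmax(dim=1).tolist()])   # should converge towards conv3x3
```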

An RNN controller is typically used, while the search space consists of different architectural elements such as convolution layers, batch-normalization layers, connections, and so on. In this way, efficient architectures can be found for almost any domain; NAS has been shown to work for classification, segmentation, video processing, machine translation, and many other tasks.

It is also interesting to inspect decisions made by these controllers. While most of the decisions are not very understandable, a few do give interesting insights. For instance, in many of the proposed architectures, more convolution layers are used at the start than at the end.

In recent times, many variants of NAS have been introduced. Platform-aware NAS [23] finds architectures that are better in terms of both accuracy and latency; to achieve this, the reward combines accuracy with latency. Efficient NAS (ENAS) [24] tackles the huge computational cost of NAS by treating the search space as one giant graph in which each step has multiple options, so that NAS's job is to find a good path through it. The network selected by the controller is trained for only a few steps, and weights are shared with previously trained candidates.
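
One common way to fold latency into the reward is the multiplicative form reward = accuracy * (latency / target)^w with w < 0, i.e., the MnasNet-style formulation; the talk may have used a different variant, so treat the constants below as illustrative.

```python
def latency_aware_reward(accuracy, latency_ms, target_ms=80.0, w=-0.07):
    """Combine accuracy and latency into a single NAS reward (MnasNet-style form).

    reward = accuracy * (latency / target) ** w, where w < 0 penalizes slow models.
    The target latency and exponent here are illustrative, not taken from the talk.
    """
    return accuracy * (latency_ms / target_ms) ** w

print(latency_aware_reward(0.75, 60.0))    # faster than target -> reward above 0.75
print(latency_aware_reward(0.75, 120.0))   # slower than target -> reward below 0.75
```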

Similarly, NAS has also been used to find the best data augmentation policy [25]. Here we try to find three things: which operation to perform (rescaling, rotation, etc.), with what probability, and at what magnitude. In [26], the authors found that drastically shrinking this search space, down to just the number of operations applied and a single global magnitude, works just as well.
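
A hedged sketch of that reduced policy, RandAugment-style: no search over which operations to use or their probabilities, just N randomly chosen operations applied at a single global magnitude M. The operation list and the magnitude-to-parameter mappings below are illustrative.

```python
import random
import torchvision.transforms.functional as TF
from PIL import Image

# RandAugment-style policy: pick N random ops per image, all at one global magnitude M.
def make_ops(M):
    # M in [0, 10] is mapped to op-specific parameter ranges; these mappings are illustrative.
    return [
        lambda img: TF.rotate(img, angle=3.0 * M),
        lambda img: TF.adjust_brightness(img, 1.0 + 0.05 * M),
        lambda img: TF.adjust_contrast(img, 1.0 + 0.05 * M),
        lambda img: TF.adjust_saturation(img, 1.0 + 0.05 * M),
    ]

def rand_augment(img, N=2, M=9):
    for op in random.choices(make_ops(M), k=N):
        img = op(img)
    return img

img = Image.new("RGB", (224, 224), color=(128, 128, 128))   # placeholder image
augmented = rand_augment(img, N=2, M=9)
```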

My question for this talk: human-devised architectures are very generalizable, i.e., an architecture designed for classification (AlexNet, ResNet) also works well for many other tasks. Is the same true for architectures found by NAS? And can we define some notion of overfitting in NAS, i.e., an architecture that works well on the task it was searched on but fails on different tasks? You can hear the question and answer here.

The second talk was Towards General Vision Architectures: Attentive Single-Tasking of Multiple Tasks by Iasonas Kokkinos, about building multi-tasking networks. To motivate one-network-multiple-tasks, just consider how many different things we can extract from a single image.

We get a performance boost if we use one network for a few tasks, but performance decreases as the number of tasks grows. One way to counter this is to increase the amount of task-specific processing in the network, but that makes multi-tasking less attractive, since the computational savings diminish.

We know that multi-tasking does work for some combinations of tasks, but a few things need to be made clear to get the full picture. First, some tasks cannot be handled by the same network due to task interference [27, 28].

Similarly, multi-tasking is difficult even for humans, and an amusing example was shown to demonstrate this point. According to him, the solution is to give each task its own space, i.e., to divide the network into shared and task-specific parts.

One way to do this is first to define a general network and then, for each task, find relevant blocks and their connections [29, 30, 31, 32]. The following picture shows one such setting.

However, these strategies suffer from a combinatorial search problem. This is where the presenter's work on attentive single-tasking of multiple tasks [33] comes in. The main aim is to make the network perform one task at a time instead of several in the same forward pass, encouraging task-relevant features and suppressing irrelevant ones. They start with an encoder-decoder setup, making the encoder task-specific while the decoder learns a universal representation. To make the encoder task-specific, the same backbone is combined with task-specific attention mechanisms [34] that focus on task-relevant features, task-specific residual adapters [35], and, on top of the network, an adversarial task discriminator. These additions keep about 95% of the network shared while still creating task-specific space.
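
A small PyTorch sketch of the core mechanism described above: a shared convolutional block whose output is re-weighted by a per-task squeeze-and-excitation attention module [34]. The residual adapters and the adversarial task discriminator from [33] are omitted for brevity, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TaskConditionedBlock(nn.Module):
    """Shared conv block plus one tiny SE attention module per task.

    Only the SE modules are task-specific; the conv weights are shared by all tasks,
    which mirrors the "mostly shared, small task-specific space" idea of [33].
    """
    def __init__(self, channels, n_tasks, reduction=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.se = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
            ) for _ in range(n_tasks)
        ])

    def forward(self, x, task_id):
        feats = self.shared(x)
        return feats * self.se[task_id](feats)   # task-specific channel re-weighting

block = TaskConditionedBlock(channels=64, n_tasks=3)
x = torch.randn(2, 64, 32, 32)
seg_feats = block(x, task_id=0)    # e.g. segmentation
edge_feats = block(x, task_id=1)   # e.g. edge detection; same input, different emphasis
```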

Poster Presentation of Our Paper

On the last day of the conference, I presented our paper on computational imaging in the workshop. This paper was a collaboration with my former colleagues from Spider Lab.

Interesting Papers

  • SinGAN: Learning a Generative Model from a Single Natural Image
  • Why Does Data-Driven Beat Theory-Driven Computer Vision?
  • GeoStyle: Discovering Fashion Trends and Events
  • Symmetric Cross-Entropy for Robust Learning with Noisy Labels
  • NLNL: Negative Learning for Noisy Labels
  • Adversarial Robustness vs. Model Compression, or Both?
  • What Else Can Fool Deep Learning? Addressing Color Constancy Errors on Deep Neural Network Performance
  • Exploring Randomly Wired Neural Networks for Image Recognition
  • Transferability and Hardness of Supervised Classification Tasks
  • Switchable Whitening for Deep Representation Learning
  • Image Generation From Small Datasets via Batch Statistics Adaptation
  • Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks
  • EvalNorm: Estimating Batch Normalization Statistics for Evaluation
  • Seeing What a GAN Cannot Generate
  • Sparse and Imperceivable Adversarial Attacks
  • Improving Adversarial Robustness via Guided Complement Entropy
  • CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features
  • No Fear of the Dark: Image Retrieval Under Varying Illumination Conditions

Socializing with other Researchers

One of the main attractions of ICCV for me was the chance to meet the bigwigs of our community, and I used this opportunity to the fullest. Apart from arranging a meetup for Pakistanis attending ICCV, I also had lengthy discussions with many researchers.

  • Prof. Ping Luo: I had been following Prof. Ping Luo (The University of Hong Kong, ex-research director at SenseTime) because of his work on batch normalization. Upon my request, he agreed to meet me during the conference to discuss possible internship opportunities with him. It was great talking with him about his research.

  • Prof. Mubarak Shah: Dr. Shah is a very famous professor of computer vision. Since he has been working in this field for a very long time, knows conventional CV, and has watched neural networks rise, one of the major topics of discussion with him was his view of this change. We also discussed Pakistan, since he is a fellow Pakistani.

  • Prof. Nasir Rajpoot: He leads the Tissue Image Analytics Lab at the University of Warwick. Since he has also witnessed great changes in the field, a major topic of discussion was conventional machine learning vs. deep learning for medicine.

  • Dr. Salman Khan: Dr. Salman is a very active computer vision researcher, a lecturer at the Australian National University (ANU), and a Senior Scientist at the Inception Institute of AI. I asked him to spare some time, and he was kind enough to invite me for lunch, where we talked for 2-3 hours. This was probably one of the best discussions I have ever had on artificial intelligence, computer vision, and machine learning. I was impressed by his knowledge of the practical side of deep learning, and he gave me excellent tips for sharpening my run-experiments-quickly skills. We covered many topics, ranging from research to Pakistan, the AI community, and much more. Amazingly, I was able to learn so much from him even in this short discussion, so special thanks to him for sparing time for young researchers.

  • Dr. Arslan Basharat: Dr. Basharat is an amazing person to have a discussion with. One of the major topics with him was the hardships of graduate student life and how to overcome them. He also told us about his work on deepfake detection.

  • Dr. Rao Anwer: Dr. Anwer is a research scientist at the Inception Institute of AI. He is originally from Pakistan, so most of the discussion was around Pakistan.

  • Dr. John K. Tsotsos: We discussed his views on deep learning and its limitations.

  • Dr. Iasonas Kokkinos: We discussed multi-tasking neural networks.

  • Dr. Ross Girshick: We talked about many general topics, such as how research is conducted, how things suddenly come together out of a mess, the stress of the research process, and his work on randomly wired networks.

  • Dr. Bolei Zhou: We discussed his talk on GANs and his broader work on the interpretability of deep learning.

  • Dr. Seong Joon: I discussed his research on the interpretation of deep learning with him.

Meetup of Pakistanis@ICCV

One of the exciting things at the conference was seeing so many Pakistanis from around the globe attending it. While I ran into a few of them, it was not possible to meet everyone individually. Therefore, together with some other young researchers, I proposed arranging a meetup. It was an amazing event attended by around 30 Pakistanis, ranging from early-career researchers to famous professors and researchers such as Prof. Mubarak Shah, Prof. Nasir Rajpoot, Dr. Arslan Basharat, Dr. Salman Khan, and many more. It was an amazing feeling to meet so many accomplished Pakistanis. We ended up holding two meetups to accommodate people who could not attend the first one.

Other Moments

References

[1] Gao, Ruiqi, et al. “Learning Grid Cells as Vector Representation of Self-Position Coupled with Matrix Representation of Self-Motion.” (2018).

[2] Kossaifi, Jean, et al. “T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[3] Mahendran, Aravindh, and Andrea Vedaldi. “Understanding deep image representations by inverting them.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[4] Ulyanov, Dmitry, Andrea Vedaldi, and Victor Lempitsky. “Deep image prior.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[5] Nguyen, Anh, et al. “Plug & play generative networks: Conditional iterative generation of images in latent space.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

[6] Zeiler, Matthew D., and Rob Fergus. “Visualizing and understanding convolutional networks.” European conference on computer vision. Springer, Cham, 2014.

[7] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. “Deep inside convolutional networks: Visualising image classification models and saliency maps.” arXiv preprint arXiv:1312.6034 (2013).

[8] Springenberg, Jost Tobias, et al. “Striving for simplicity: The all convolutional net.” arXiv preprint arXiv:1412.6806 (2014).

[9] Sundararajan, Mukund, Ankur Taly, and Qiqi Yan. “Axiomatic attribution for deep networks.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[10] Zhou, Bolei, et al. “Learning deep features for discriminative localization.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[11] Selvaraju, Ramprasaath R., et al. “Grad-cam: Visual explanations from deep networks via gradient-based localization.” Proceedings of the IEEE International Conference on Computer Vision. 2017.

[12] Fong, Ruth, Mandela Patrick, and Andrea Vedaldi. “Understanding Deep Networks via Extremal Perturbations and Smooth Masks.” Proceedings of the IEEE International Conference on Computer Vision. 2019.

[13] Bau, David, et al. “Gan dissection: Visualizing and understanding generative adversarial networks.” arXiv preprint arXiv:1811.10597 (2018).

[14] Bau, David, et al. “Seeing What a GAN Cannot Generate.” Proceedings of the IEEE International Conference on Computer Vision. 2019.

[15] Montavon, Grégoire, et al. “Layer-wise relevance propagation: an overview.” Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Springer, Cham, 2019. 193-209.

[16] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

[17] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

[18] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[19] Huang, Gao, et al. “Densely connected convolutional networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

[20] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[21] Zoph, Barret, et al. “Learning transferable architectures for scalable image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[22] Real, Esteban, et al. “Large-scale evolution of image classifiers.” Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017.

[23] Wu, Bichen, et al. “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[24] Pham, Hieu, et al. “Efficient neural architecture search via parameter sharing.” arXiv preprint arXiv:1802.03268 (2018).

[25] Cubuk, Ekin D., et al. “Autoaugment: Learning augmentation policies from data.” arXiv preprint arXiv:1805.09501 (2018).

[26] Cubuk, Ekin D., et al. “RandAugment: Practical data augmentation with no separate search.” arXiv preprint arXiv:1909.13719 (2019).

[27] Kang, Zhuoliang, Kristen Grauman, and Fei Sha. “Learning with Whom to Share in Multi-task Feature Learning.” ICML. Vol. 2. No. 3. 2011.

[28] Romera-Paredes, Bernardino, et al. “Exploiting unrelated tasks in multi-task learning.” International conference on artificial intelligence and statistics. 2012.

[29] Murdock, Calvin, et al. “Blockout: Dynamic model selection for hierarchical deep networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[30] Ahmed, Karim, and Lorenzo Torresani. “Maskconnect: Connectivity learning by gradient descent.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.

[31] Veniat, Tom, and Ludovic Denoyer. “Learning time/memory-efficient deep architectures with budgeted super networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[32] Fernando, Chrisantha, et al. “Pathnet: Evolution channels gradient descent in super neural networks.” arXiv preprint arXiv:1701.08734 (2017).

[33] Maninis, Kevis-Kokitsi, Ilija Radosavovic, and Iasonas Kokkinos. “Attentive Single-Tasking of Multiple Tasks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

[34] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[35] Rebuffi, Sylvestre-Alvise, Hakan Bilen, and Andrea Vedaldi. “Learning multiple visual domains with residual adapters.” Advances in Neural Information Processing Systems. 2017.

[36] https://medium.com/syncedreview/aaai-2020-reports-record-high-paper-submissions-67f88080c93c

[37] https://medium.com/syncedreview/cvpr-2019-accepts-record-1300-papers-91b9e3b315f5