Alexandre Falcão, PhD, is a Professor of Image Processing at the University of Campinas, in Brazil
Daniel Aliaga, PhD, is a Professor of Computer Science at Purdue University, in the USA
François Brémond, PhD, is a Research Director at INRIA Sophia Antipolis, in France
Larry Davis, PhD, is a Professor of Computer Science at the University of Maryland, in the USA
Rogério Schmidt Feris, PhD, is a Research Scientist at IBM T. J. Watson Research Center, New York, USA
As the population of older persons grows rapidly, improving their quality of life at home is of great importance. This can be achieved through the development of technologies for monitoring their activities at home. In this context, we propose activity monitoring approaches that analyse the behaviors of older persons by combining heterogeneous sensor data to recognize critical activities at home. In particular, these approaches combine data provided by video cameras with data provided by environmental sensors attached to house furnishings.
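To make the fusion idea concrete, the sketch below combines per-interval scores from a hypothetical video-based fall detector with binary events from environmental sensors. The sensor names, thresholds, and decision rule are illustrative assumptions, not the actual system presented in the talk.

```python
# Minimal sketch of heterogeneous sensor fusion for critical-activity detection.
# All sensor names, thresholds, and the decision rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Observation:
    video_fall_score: float   # confidence of a "fall" from the vision pipeline (0..1)
    chair_pressure_on: bool   # pressure sensor attached to an armchair
    door_contact_open: bool   # contact sensor on a door

def is_critical(obs: Observation, score_thresh: float = 0.7) -> bool:
    """Flag a critical event when the camera is confident on its own, or when a
    weaker visual cue is corroborated by the environmental sensors (the person
    is not seated and has not simply left the room)."""
    if obs.video_fall_score >= score_thresh:
        return True
    weak_visual_cue = obs.video_fall_score >= 0.4
    corroboration = (not obs.chair_pressure_on) and (not obs.door_contact_open)
    return weak_visual_cue and corroboration

# Example: a moderately confident visual cue backed by the environmental sensors.
print(is_critical(Observation(0.55, chair_pressure_on=False, door_contact_open=False)))
```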
There are three categories of critical human activities:
In this talk, we will present several techniques for the detection of people and for the recognition of human activities, using in particular 2D or 3D video cameras. More specifically, there are three categories of algorithms for recognizing human activities:
We will illustrate the proposed activity monitoring approaches through several home care application datasets:
Dataset 1, Dataset 2, Dataset 3, and Dataset 4.

The field of computer vision has advanced remarkably during the past 10-15 years. This is due to a variety of factors, including the availability of the large annotated data sets needed to train deep learning models, crowdsourcing platforms like AMT that enable the collection of these data sets at reasonable cost, important engineering improvements to the training methodologies of deep networks, dramatic decreases in the price/performance ratios of computing systems (especially GPUs) and memory systems, the widespread availability of source code that researchers make available to one another worldwide, and inexpensive sensors and robotic platforms such as the Kinect, GoPros, and UAVs. So, while the fundamental vision problems of detection and recognition of objects and human movements are not solved, they have improved to the point where it is important to ask: What's next? A workshop was held in the US late last year to address exactly that question (chaired by me, Fei-Fei Li, and Devi Parikh), and this talk will discuss the conclusions of that workshop and illustrate research in some of those future directions with work from the University of Maryland, in particular research on object detection and recognition and visual search.
Designing, simulating, and visualizing urban regions is a task of critical importance today. In the year 1900, approximately 14 percent of the world's population of 1.6 billion people lived in cities. Today, more than half of the world's population lives in cities, and the population has grown to over 7 billion people. Moreover, over the next 30 years population growth and urbanization will only increase. Our research has focused on multi-disciplinary efforts to create interactive what-if visual design tools for urban modeling and planning. In particular, we have focused on creating digital models of large-scale urban structures and on an inverse modeling framework. Rather than designing a 3D urban model and simulating the resulting behavior, we create tools that suggest how to alter the existing geometric structure of a city, or propose a new city, so that it yields a desired behavior. This inverse modeling mentality requires the use of optimization, machine learning, and stochastic techniques. We have used such inverse models in the areas of urban socio-economic planning, controlling vehicular traffic and pollution emissions, and predicting (and altering) urban weather (e.g., temperature, precipitation, wind). While our work focuses on the geometric modeling and interactive visualization aspects, we have implemented a variety of scientific simulation frameworks in collaboration with researchers in earth and atmospheric sciences, urban planning, urban design and architecture, civil engineering, and more. We will present a summary of several projects that have appeared recently at major conferences.
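As a toy illustration of this inverse-modeling mentality (not the authors' actual framework), the sketch below uses a simple stochastic search to find procedural parameters whose simulated behavior matches a target. The forward "simulator" and the parameter names are stand-ins for a real traffic, pollution, or weather model.

```python
# Toy inverse modeling: instead of fixing city parameters and inspecting the
# simulated behavior, search for parameters whose behavior matches a target.
# The forward model and parameter ranges below are purely illustrative.

import random

def simulate_congestion(road_density: float, block_size: float) -> float:
    # Hypothetical forward model: congestion drops with more roads,
    # rises with larger blocks.
    return 1.0 / (0.1 + road_density) + 0.5 * block_size

def inverse_design(target_congestion: float, iters: int = 5000):
    best, best_err = None, float("inf")
    for _ in range(iters):                       # simple stochastic search
        params = (random.uniform(0.05, 1.0),     # road density
                  random.uniform(0.5, 3.0))      # block size
        err = abs(simulate_congestion(*params) - target_congestion)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

params, err = inverse_design(target_congestion=2.0)
print("suggested (road_density, block_size):", params, "error:", err)
```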
In machine learning, the behavior of a computer agent is expected to improve over time in order to increase its usefulness to end users. Traditional supervised techniques have made considerable progress by inducing a generalized function from examples that are annotated by specialists (or by end users) before the learning process. While this approach has succeeded in some applications, the absence of interaction between machines and specialists during the learning process leaves many important questions unanswered, compromising the usefulness of the solutions in many applications: How can human effort be minimized with maximum efficacy in machine learning? Can machines learn from their errors? Can specialists understand the behavior of the machines, explain their actions, and trust their decisions? What can machines and specialists learn from their interaction? This lecture is concerned with techniques to address such questions in the context of image annotation.
Image annotation consists of assigning one or multiple labels per image in order to make a decision, or to support a human decision, about a problem (e.g., a medical diagnosis). The pipeline for image annotation involves extraction, characterization, and classification of the content of interest, called samples. Samples may be pixels, regions of connected pixels with similar color and texture patterns (superpixels), connected components with known shapes (objects), or regions around objects (subimages). In any case, sample extraction is a fundamental problem that often requires object (semantic) segmentation. Nevertheless, interactive segmentation methods are rarely designed to improve from their errors. Sample characterization aims at learning image features, usually based on the knowledge of specialists (handcrafted features) or on a reasonable amount of previously extracted and annotated samples. The second strategy is not feasible when specialists are required to manually extract and annotate samples, raising two important questions: Can feature learning techniques succeed from small labeled training sets? Can specialists interact in feature learning to cope with the absence of labeled data, to improve the process, and to better understand the correlation between features and the problem? Once the feature space is defined, the choice of key samples for label supervision is paramount in the design of the classifier. However, active learning techniques usually simulate user interaction during the process (see the sketch below), disregarding the need for efficiency and interactive response times.
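The sketch below shows the kind of uncertainty-based active learning loop being discussed, with the specialist stubbed by an oracle function; in an interactive system that call would be a user interface with interactive response times. The data, model, and query strategy are illustrative choices, not the lecture's method.

```python
# Illustrative active-learning loop: the model queries its most uncertain sample
# and a "specialist" (here a stub) provides the label. All choices are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                 # features of previously extracted samples
true_w = rng.normal(size=16)
y_true = (X @ true_w > 0).astype(int)          # hidden ground truth

def ask_specialist(i: int) -> int:             # stand-in for the human in the loop
    return int(y_true[i])

# Small initial annotation with both classes represented.
seed = list(np.flatnonzero(y_true == 1)[:5]) + list(np.flatnonzero(y_true == 0)[:5])
labels = {i: ask_specialist(i) for i in seed}
pool = [i for i in range(len(X)) if i not in labels]

clf = LogisticRegression(max_iter=1000)
for _ in range(20):                            # 20 interaction rounds
    idx = list(labels)
    clf.fit(X[idx], [labels[i] for i in idx])
    proba = clf.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]   # most uncertain sample
    labels[query] = ask_specialist(query)      # label supervision by the specialist
    pool.remove(query)

print("accuracy after 20 queries:", clf.score(X, y_true))
```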
Sample extraction has been investigated as a task separate from characterization and classification, and the last two have also been investigated as a single operation. Indeed, their separation is important, with the specialist being part of the learning loop in all three steps, and the integration of their results in the same system is paramount for effective and efficient interactive machine learning.
The lecture proposes a methodology to address the problem, presents previous work and work under development, and concludes with our still modest experience of what specialists and machines can learn from each other.
Deep convolutional neural networks have recently achieved breakthrough results in the field of computer vision. However, existing approaches require a large number (usually hundreds of thousands or millions) of annotated training examples in order to learn a high-performance network model. In this talk, I will present a different approach to learning rich feature representations without costly manual annotation, with a focus on the problem of estimating facial attributes (gender, age, hair style, ...) and clothing attributes (color, pattern, sleeve length, ...). Rather than relying on manually annotated images from the web, we learn a discriminative attribute representation from egocentric videos captured by a person walking across different neighborhoods of a city, while leveraging geo-location and weather information readily available in wearable devices as a free source of supervision. By tracking the faces of casual walkers in more than 40 hours of egocentric video, we are able to cover tens of thousands of different identities and automatically extract nearly 5 million pairs of images, connected by the same face track or drawn from different face tracks, with weather and location context, under pose and lighting variation. These image pairs are then fed into a deep network that preserves similarity of images connected by the same track, in order to capture identity-related attribute features, and simultaneously optimizes for geo-location and weather prediction to capture additional person attribute features. Finally, the network is fine-tuned with a few manually annotated samples. Our method outperforms other state-of-the-art approaches on standard public benchmarks. I will conclude the talk by covering other strategies for learning deep feature representations without costly manual annotation, with applications in fashion retrieval (Chen et al., 2015) and smart video surveillance (Huang et al., 2015).
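As a rough schematic of this kind of multi-task training (the actual architecture, losses, and weights from the talk are not reproduced here), the PyTorch sketch below combines a contrastive term over face-track pairs with geo-location and weather classification heads; the layer sizes and class counts are assumptions.

```python
# Schematic multi-task training: pull together embeddings of images from the same
# face track, push apart pairs from different tracks, and jointly predict
# geo-location and weather. Architecture and hyperparameters are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeNet(nn.Module):
    def __init__(self, n_locations=10, n_weather=5, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a deep CNN trunk
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        self.loc_head = nn.Linear(emb_dim, n_locations)   # geo-location prediction
        self.wea_head = nn.Linear(emb_dim, n_weather)     # weather prediction

    def forward(self, x):
        z = self.backbone(x)
        return z, self.loc_head(z), self.wea_head(z)

def pair_loss(z1, z2, same_track, margin=1.0):
    """Contrastive term: embeddings from the same face track are pulled
    together, pairs from different tracks are pushed apart."""
    d = F.pairwise_distance(z1, z2)
    return torch.mean(same_track * d.pow(2) +
                      (1 - same_track) * F.relu(margin - d).pow(2))

model = AttributeNet()
img_a, img_b = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)  # dummy pairs
same_track = torch.randint(0, 2, (8,)).float()
loc, wea = torch.randint(0, 10, (8,)), torch.randint(0, 5, (8,))

za, la, wa = model(img_a)
zb, _, _ = model(img_b)
loss = (pair_loss(za, zb, same_track)
        + F.cross_entropy(la, loc) + F.cross_entropy(wa, wea))
loss.backward()
print(float(loss))
```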