Alexandre Falcão, PhD, is a Professor of Image Processing at the University of Campinas, in Brazil
Daniel Aliaga, PhD, is a Professor of Computer Science at Purdue University, in the USA
François Brémond, PhD, is a Research Director at INRIA Sophia Antipolis, in France
Larry Davis, PhD, is a Professor of Computer Science at the University of Maryland, in the USA
Rogério Schmidt Feris, PhD, is a Research Scientist at IBM T. J. Watson Research Center, New York, USA
As the population of older persons grows rapidly, improving their quality of life at home is of great importance. This can be achieved through the development of technologies for monitoring their activities at home. In this context, we propose activity monitoring approaches that analyse the behaviors of older persons by combining heterogeneous sensor data to recognize critical activities at home. In particular, these approaches combine data provided by video cameras with data provided by environmental sensors attached to house furnishings.
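To make the fusion idea concrete, the sketch below combines per-interval scores from a hypothetical video-based fall detector with binary events from environmental sensors. The sensor names, thresholds, and decision rule are illustrative assumptions, not the actual system presented in the talk.

```python
# Minimal sketch of heterogeneous sensor fusion for critical-activity detection.
# All sensor names, thresholds, and the decision rule are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Observation:
    video_fall_score: float   # confidence of a "fall" from the vision pipeline (0..1)
    chair_pressure_on: bool   # pressure sensor attached to an armchair
    door_contact_open: bool   # contact sensor on a door

def is_critical(obs: Observation, score_thresh: float = 0.7) -> bool:
    """Flag a critical event when the camera is confident on its own, or when a
    weaker visual cue is corroborated by the environmental sensors (the person
    is not seated and has not simply left the room)."""
    if obs.video_fall_score >= score_thresh:
        return True
    weak_visual_cue = obs.video_fall_score >= 0.4
    corroboration = (not obs.chair_pressure_on) and (not obs.door_contact_open)
    return weak_visual_cue and corroboration

# Example: a moderately confident visual cue backed by the environmental sensors.
print(is_critical(Observation(0.55, chair_pressure_on=False, door_contact_open=False)))
```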
There are three categories of critical human activities:
In this talk, we will present several techniques for the detection of people and for the recognition of human activities, using in particular 2D or 3D video cameras. More specifically, there are three categories of algorithms for recognizing human activities:
We will illustrate the proposed activity monitoring approaches through several home care application datasets:
Dataset 1, Dataset 2, Dataset 3, and Dataset 4.

The field of computer vision has advanced remarkably during the past 10-15 years. This is due to a variety of factors, including the availability of the large annotated data sets needed to train deep learning models, crowdsourcing platforms like AMT that enable the collection of these data sets at reasonable cost, important engineering improvements to the training methodologies of deep networks, dramatic decreases in the price/performance ratios of computing systems (especially GPUs) and memory systems, the widespread availability of source code that researchers make available to one another worldwide, and inexpensive sensors and robotic platforms such as the Kinect, GoPros, and UAVs. So, while the fundamental vision problems of detection and recognition of objects and human movements are not solved, they have improved to the point where it is important to ask: What's next? A workshop was held in the US late last year to address exactly that question (chaired by me, Fei-Fei Li, and Devi Parikh), and this talk will discuss the conclusions of that workshop and illustrate research in some of those future directions with work from the University of Maryland, in particular research on object detection and recognition and visual search.
Designing, simulating, and visualizing urban regions is a task of critical importance today. In the year 1900, approximately 14 percent of the world's population of 1.6 billion people lived in cities. Today, more than half of the world's population lives in cities, and the population has grown to over 7 billion people. Moreover, over the next 30 years population growth and urbanization will only increase. Our research has focused on multi-disciplinary efforts to create interactive what-if visual design tools for urban modeling and planning. In particular, we have focused on creating digital models of large-scale urban structures and on an inverse modeling framework. Rather than designing a 3D urban model and simulating the resulting behavior, we create tools that suggest how to alter the existing geometric structure of a city, or propose a new city, so that it yields a desired behavior. This inverse modeling mentality requires the use of optimization, machine learning, and stochastic techniques. We have used such inverse models in the areas of urban socio-economic planning, controlling vehicular traffic and pollution emissions, and predicting (and altering) urban weather (e.g., temperature, precipitation, wind). While our work focuses on the geometric modeling and interactive visualization aspects, we have implemented a variety of scientific simulation frameworks in collaboration with researchers in earth and atmospheric sciences, urban planning, urban design and architecture, civil engineering, and more. We will present a summary of several projects that have appeared recently at major conferences.
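As a toy illustration of this inverse-modeling mentality (not the authors' actual framework), the sketch below uses a simple stochastic search to find procedural parameters whose simulated behavior matches a target. The forward "simulator" and the parameter names are stand-ins for a real traffic, pollution, or weather model.

```python
# Toy inverse modeling: instead of fixing city parameters and inspecting the
# simulated behavior, search for parameters whose behavior matches a target.
# The forward model and parameter ranges below are purely illustrative.

import random

def simulate_congestion(road_density: float, block_size: float) -> float:
    # Hypothetical forward model: congestion drops with more roads,
    # rises with larger blocks.
    return 1.0 / (0.1 + road_density) + 0.5 * block_size

def inverse_design(target_congestion: float, iters: int = 5000):
    best, best_err = None, float("inf")
    for _ in range(iters):                       # simple stochastic search
        params = (random.uniform(0.05, 1.0),     # road density
                  random.uniform(0.5, 3.0))      # block size
        err = abs(simulate_congestion(*params) - target_congestion)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

params, err = inverse_design(target_congestion=2.0)
print("suggested (road_density, block_size):", params, "error:", err)
```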
In machine learning, the behavior of a computer agent is expected to improve over time in order to increase its usefulness to end users. Traditional supervised techniques have made considerable progress by inducing a generalized function from examples that are annotated by specialists (or by end users) before the learning process. While this approach has succeeded in some applications, the absence of interaction between machines and specialists during the learning process leaves many important questions unanswered, compromising the usefulness of the solutions in many applications: How can human effort be minimized with maximum efficacy in machine learning? Can machines learn from their errors? Can specialists understand the behavior of the machines, explain their actions, and trust their decisions? What can machines and specialists learn from their interaction? This lecture is concerned with techniques to address such questions in the context of image annotation.
Image annotation consists of assigning one or multiple labels per image in order to make a decision, or to support a human decision, about a problem (e.g., a medical diagnosis). The pipeline for image annotation involves extraction, characterization, and classification of the content of interest, called samples. Samples may be pixels, regions of connected pixels with similar color and texture patterns (superpixels), connected components with known shapes (objects), or regions around objects (subimages). In any case, sample extraction is a fundamental problem that often requires object (semantic) segmentation. Nevertheless, interactive segmentation methods are rarely designed to improve from their errors. Sample characterization aims at learning image features, usually based on the knowledge of specialists (handcrafted features) or on a reasonable amount of previously extracted and annotated samples. The second strategy is not feasible when specialists are required to manually extract and annotate samples, raising two important questions: Can feature learning techniques succeed from small labeled training sets? Can specialists interact in feature learning to cope with the absence of labeled data, to improve the process, and to better understand the correlation between features and the problem? Once the feature space is defined, the choice of key samples for label supervision is paramount in the design of the classifier. However, active learning techniques usually simulate user interaction during the process (see the sketch below), disregarding the need for efficiency and interactive response times.
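The sketch below shows the kind of uncertainty-based active learning loop being discussed, with the specialist stubbed by an oracle function; in an interactive system that call would be a user interface with interactive response times. The data, model, and query strategy are illustrative choices, not the lecture's method.

```python
# Illustrative active-learning loop: the model queries its most uncertain sample
# and a "specialist" (here a stub) provides the label. All choices are assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))                 # features of previously extracted samples
true_w = rng.normal(size=16)
y_true = (X @ true_w > 0).astype(int)          # hidden ground truth

def ask_specialist(i: int) -> int:             # stand-in for the human in the loop
    return int(y_true[i])

# Small initial annotation with both classes represented.
seed = list(np.flatnonzero(y_true == 1)[:5]) + list(np.flatnonzero(y_true == 0)[:5])
labels = {i: ask_specialist(i) for i in seed}
pool = [i for i in range(len(X)) if i not in labels]

clf = LogisticRegression(max_iter=1000)
for _ in range(20):                            # 20 interaction rounds
    idx = list(labels)
    clf.fit(X[idx], [labels[i] for i in idx])
    proba = clf.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]   # most uncertain sample
    labels[query] = ask_specialist(query)      # label supervision by the specialist
    pool.remove(query)

print("accuracy after 20 queries:", clf.score(X, y_true))
```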
Sample extraction has been investigated as a task separate from characterization and classification, and the last two have also been investigated as a single operation. Indeed, their separation is important, with the specialist being part of the learning loop in all three steps, and the integration of their results in the same system is paramount for effective and efficient interactive machine learning.
The lecture proposes a methodology to address the problem, presents previous work and work under development, and concludes with our still modest experience of what specialists and machines can learn from each other.
Deep convolutional neural networks have recently achieved breakthrough results in the field of computer vision. However, existing approaches require a large number (usually hundreds of thousands or millions) of annotated training examples in order to learn a high-performance network model. In this talk, I will present a different approach to learning rich feature representations without costly manual annotation, with a focus on the problem of estimating facial attributes (gender, age, hair style, ...) and clothing attributes (color, pattern, sleeve length, ...). Rather than relying on manually annotated images from the web, we learn a discriminative attribute representation from egocentric videos captured by a person walking across different neighborhoods of a city, while leveraging geo-location and weather information readily available in wearable devices as a free source of supervision. By tracking the faces of casual walkers in more than 40 hours of egocentric video, we are able to cover tens of thousands of different identities and automatically extract nearly 5 million pairs of images, connected by the same face track or drawn from different face tracks, with weather and location context, under pose and lighting variation. These image pairs are then fed into a deep network that preserves similarity of images connected by the same track, in order to capture identity-related attribute features, and simultaneously optimizes for geo-location and weather prediction to capture additional person attribute features. Finally, the network is fine-tuned with a few manually annotated samples. Our method outperforms other state-of-the-art approaches on standard public benchmarks. I will conclude the talk by covering other strategies for learning deep feature representations without costly manual annotation, with applications in fashion retrieval (Chen et al., 2015) and smart video surveillance (Huang et al., 2015).
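As a rough schematic of this kind of multi-task training (the actual architecture, losses, and weights from the talk are not reproduced here), the PyTorch sketch below combines a contrastive term over face-track pairs with geo-location and weather classification heads; the layer sizes and class counts are assumptions.

```python
# Schematic multi-task training: pull together embeddings of images from the same
# face track, push apart pairs from different tracks, and jointly predict
# geo-location and weather. Architecture and hyperparameters are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeNet(nn.Module):
    def __init__(self, n_locations=10, n_weather=5, emb_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a deep CNN trunk
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )
        self.loc_head = nn.Linear(emb_dim, n_locations)   # geo-location prediction
        self.wea_head = nn.Linear(emb_dim, n_weather)     # weather prediction

    def forward(self, x):
        z = self.backbone(x)
        return z, self.loc_head(z), self.wea_head(z)

def pair_loss(z1, z2, same_track, margin=1.0):
    """Contrastive term: embeddings from the same face track are pulled
    together, pairs from different tracks are pushed apart."""
    d = F.pairwise_distance(z1, z2)
    return torch.mean(same_track * d.pow(2) +
                      (1 - same_track) * F.relu(margin - d).pow(2))

model = AttributeNet()
img_a, img_b = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)  # dummy pairs
same_track = torch.randint(0, 2, (8,)).float()
loc, wea = torch.randint(0, 10, (8,)), torch.randint(0, 5, (8,))

za, la, wa = model(img_a)
zb, _, _ = model(img_b)
loss = (pair_loss(za, zb, same_track)
        + F.cross_entropy(la, loc) + F.cross_entropy(wa, wea))
loss.backward()
print(float(loss))
```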