1.1 The Study of Groups and Crowds
Understanding activities and human behavior from images and videos is an active research area in computer vision and has a large impact on many real-world applications, including surveillance, assistive robotics, autonomous driving, and data analytics, to name a few. The research community has put significant focus on analyzing the behavior of individuals and has proposed methods that can understand and predict the behavior of humans considered in isolation. More recently, however, attention has shifted to the new issues of analyzing and modeling gatherings of people, commonly referred to as groups or crowds, depending on the number of people involved. The research done on these two topics has brought about many diverse ad-hoc methodologies and algorithms, and has led to a growing interest in this topic. This has been supported by multiple factors. Firstly, the advancement of detection and filtering strategies running on powerful hardware has encouraged the development of algorithms able to deal with hundreds of different individuals, providing results that were unthinkable just a few years ago. Concomitantly, there has been a broader availability of new types of sensors, and the possibility of mounting these sensors on cutting-edge devices, from glasses to drones. Such sensory devices have made it possible to observe people from radically different points of view, in a genuinely ecological, noninvasive manner, and for long durations, ranging from ego-vision settings to bird's-eye views of people. Moreover, the advancement of social signal processing [1,2] has brought into the computer vision and pattern recognition community new models imported from the social sciences, able to read between the lines of the simple locations and velocities assumed by individuals, using advanced notions of proxemics and kinesics [1,3].
Finally, industry, governments, and small companies are asking our community for methods to understand and model groups and crowds, for public order and safety, social robotics, advanced profiling, and many other applications.
The study of groups and crowds has generally been considered as having its roots in sociology and psychology. Human behavior, in general, has been extensively studied by sociologists to understand social interactions and crowd dynamics. It has been argued that the characteristics that dictate human motion constitute a complex interplay between physical, environmental, and psychosocial factors. It is a common observation that people, whenever free to move about in an environment, tend to respect certain patterns of movement. More often than not, these patterns of movement are dominated by social mechanisms [3]. The study of groups and crowds from the computer vision perspective has typically been modeled as a three-level approach. At the low level, given a video, humans are detected [4,5], then tracked [6,7], and then tracklets are grouped to form trajectories [8]. At the mid level, machine learning techniques are used to identify groups by clustering trajectories [9]. At the high level, a semantic understanding of the group behavior is obtained, such as classifying actions like "walking in groups", "protesting", "group vandalism", etc.
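The mid-level step of this pipeline can be illustrated with a minimal sketch: grouping time-aligned trajectories whose mean mutual distance stays small. The function name, distance criterion, and threshold below are illustrative assumptions, not a specific published method; real systems typically rely on learned clustering [9].

```python
import numpy as np

def group_trajectories(trajs, dist_thresh=1.0):
    """Toy mid-level group detection: two pedestrians are grouped
    when their trajectories stay close on average over time.
    trajs: list of (T, 2) arrays of time-aligned positions.
    The threshold value is illustrative, not calibrated."""
    n = len(trajs)
    # Pairwise mean distance between time-aligned trajectories.
    close = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            d = np.linalg.norm(trajs[i] - trajs[j], axis=1).mean()
            close[i, j] = d < dist_thresh
    # Groups = connected components of the "close" relation.
    groups, seen = [], set()
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:
            k = stack.pop()
            if k in comp:
                continue
            comp.add(k)
            stack.extend(j for j in range(n) if close[k, j] and j not in comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups
```

For instance, two pedestrians walking side by side fall into one group, while a distant third pedestrian forms a singleton.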
Low-level algorithms have been widely studied in computer vision [10–12] with promising results. However, algorithms at the middle and high levels have only been explored in recent times.
Algorithms at the higher level can either explicitly model human behavior and the interactions within the group and with the environment, or a model can be created through observations by assuming that the human behavior is encapsulated in the learning process. Khan and Shah [13] observed and learned a group's rigid formation structure to classify the activity and successfully applied it to parades. Ryoo and Aggarwal [14,15] represented and learned various types of complex group activities with a programming-language-like representation, and then recognized the displayed activities based on the recognition of the activities of individual group members. On the other hand, human behavioral models can be used to predict how humans interact with each other. Helbing and Molnar [16] proposed the social force model, which treats humans as particles and models the influence of other humans and the environment as forces. Furthermore, Pellegrini et al. [17] and Choi et al. [18], as well as [19,20], proposed models that anticipate and avoid collisions of a human with other humans and with the physical structure of the scene. These models assume that the humans partaking in the group follow existing social norms and hence can be used to model specific categories of people, and even crowds.
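As a hedged illustration of the social force idea, a single Euler step of a simplified model in the spirit of [16] might look as follows. The parameter values (relaxation time `tau`, desired speed `v0`, repulsion constants `A` and `B`) are illustrative, not calibrated, and the repulsion term omits the obstacle and anisotropic sight-line effects of the full model.

```python
import numpy as np

def social_force_step(pos, vel, goals, dt=0.1, tau=0.5,
                      v0=1.3, A=2.0, B=0.3):
    """One Euler step of a simplified social force model.
    pos, vel: (N, 2) arrays; goals: (N, 2) desired destinations.
    A, B: illustrative repulsion strength/range (not calibrated)."""
    n = len(pos)
    # Driving force: relax toward speed v0 in the direction of the goal.
    direction = goals - pos
    direction = direction / np.linalg.norm(direction, axis=1, keepdims=True)
    f_drive = (v0 * direction - vel) / tau
    # Pairwise repulsive "social" forces between pedestrians,
    # decaying exponentially with distance.
    f_rep = np.zeros_like(pos)
    for i in range(n):
        d = pos[i] - pos                      # vectors from others to i
        dist = np.linalg.norm(d, axis=1)
        dist[i] = np.inf                      # ignore self-interaction
        f_rep[i] = np.sum(A * np.exp(-dist / B)[:, None]
                          * d / dist[:, None], axis=0)
    vel_new = vel + dt * (f_drive + f_rep)
    pos_new = pos + dt * vel_new
    return pos_new, vel_new
```

In this sketch a pedestrian walking directly behind another is slowed relative to the one in front, which is the qualitative behavior the model is designed to capture.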
Typically, in crowded scenes, people are engaged in multiple activities resulting from inter- and intra-group interactions. This poses a rather challenging problem in analyzing group events due to variations in the number of people involved and, more specifically, the different human actions and social interactions exhibited within and between groups [21–24]. Understanding groups and their activities is not limited to analyzing the movements of individuals in a group. The environment in which these groups exist provides important contextual information that can be invaluable in recognizing activities in crowded scenes [25,26]. Perspectives from sociology and psychology embedded into computer vision algorithms show that human activities can be effectively understood by considering implicit cognitive processes as latent variables that drive positioning, proximity to other people, movement, gesturing, etc. [16,27–30]. For example, exploring the spatial and directional relationships between people can facilitate the detection of social interactions in a group. Thus, activity analysis in low-density crowded scenes can often be considered a multistep process, one that involves individual person activity, individuals forming meaningful groups, interactions between individuals, and interactions between groups [28]. In general, the approaches to group activity analysis can be classified into two categories: bottom-up and top-down. The bottom-up (BU) approaches rely on recognizing the activity of each individual in a group. In contrast, top-down (TD) methods recognize group activity by analyzing at the group level rather than at the individual level. Since BU algorithms address the understanding of activities at the individual level, they are limited in recognizing activities at the group level. Conversely, TD approaches show better contextual understanding of the activities of a group as a whole, but they are not robust enough to recognize activities at the individual level.
On the other hand, when the density of people becomes too high, depending also on the camera's perspective, individuals and even groups can no longer be distinguished, and a more holistic analysis should be performed to characterize the behavior of the crowd. Analyzing crowd scenes can be categorized into three main topics, i.e., (i) crowd density estimation and people counting, (ii) tracking in crowds, and (iii) modeling crowd behaviors [31]. Recently, some works on pedestrian path prediction in crowded scenes have also been proposed [32]. The goal of these methods is to predict the pedestrian's pathway in advance, given the past walking history and the surrounding environment (obstacles, scene geometry, etc.). This is yet another interesting application in crowd scenarios, aiming, for instance, to estimate entry/exit points in a specific area or to find the main walking pathways or standing areas, information that can then be used to design open spaces.
Estimating the number of people in a crowd is a cardinal stage for several real-world applications, such as safety control, monitoring public transportation, crowd simulation and rendering for animation, and urban planning. Many interesting works in the literature address this problem [33–35]; however, automated crowd density estimation still remains an open problem in computer vision due to extreme occlusion conditions and the visual ambiguity of human appearance in such scenarios [36].
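A common ingredient of density-based counting approaches is a ground-truth density map that integrates to the person count, obtained by placing a normalized Gaussian at each annotated head position; a model trained to regress this map yields a count by summation. A minimal sketch is given below, with an illustrative fixed `sigma` (many methods instead use geometry-adaptive kernels).

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Build a density map from annotated head positions so that
    the map sums to the number of people. Truncated Gaussian kernel;
    the sigma value is illustrative.
    points: list of (row, col) head positions; shape: (H, W)."""
    h, w = shape
    dmap = np.zeros(shape)
    r = int(3 * sigma)                       # truncate kernel at 3*sigma
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    kernel /= kernel.sum()                   # each person adds exactly 1
    for (py, px) in points:
        # Clip the kernel window to the image borders.
        y0, y1 = max(0, py - r), min(h, py + r + 1)
        x0, x1 = max(0, px - r), min(w, px + r + 1)
        ky0, kx0 = y0 - (py - r), x0 - (px - r)
        dmap[y0:y1, x0:x1] += kernel[ky0:ky0 + (y1 - y0),
                                     kx0:kx0 + (x1 - x0)]
    return dmap
```

Summing the resulting map over the image recovers the annotated count, which is exactly the property counting networks are trained to reproduce.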
Tracking individuals (or objects) in crowded scenes is another challenging task [37,38], which involves, in addition to severe occlusions, cluttered backgrounds and pattern deformations, all common complexities in visual tracking. In practice, the efficiency and effectiveness of crowd trackers largely depend on crowd density and dynamics, people's social interactions, as well as the crowd's psychological characteristics [36,39,40].
Typically, the primary goal of modeling crowd behaviors is to allow the identification of abnormal events such as, for instance, riots, panic, and acts of violence [41]. Despite recent works in this direction, detecting crowd abnormalities still remains an open and challenging problem, mainly because of the "loose" definition of abnormality, which is strongly context dependent [42,43]. For example, riding a bike in a street is a normal action, whereas it may be considered abnormal in another scene with a different context, such as a park or a sidewalk. Similarly, people gathering for a social event is typically a normal situation, while a similar gathering to "protest against something" can be an abnormal event, which may deserve attention and needs to be detected. Several methods have been devised to analyze crowd behavior. One of the most influential works, proposed by Mehran et al. [44], derived from the Social Force Model (SFM) [16]. It adopted the SFM and a particle advection scheme for detecting and localizing abnormal behavior in crowd videos. To this end, it considered the entire crowd as a set of moving particles whose interaction forces were computed using the SFM. The interaction force mapped onto the video frames identifies the force flow of each particle and serves as the basis for extracting features which, along with a bag-of-words strategy, are used to classify each frame as either normal or abnormal.
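A rough sketch of a per-pixel interaction-force map in the spirit of [44] follows, treating pixels as particles advected by optical flow and approximating the desired velocity by a local spatial average of the flow. The sign convention, the box-filter window, and the `tau` value are simplifying assumptions of this sketch, not the exact formulation of the original paper.

```python
import numpy as np

def interaction_force(flow_prev, flow_curr, tau=0.5, win=5):
    """Per-pixel interaction-force magnitude, SFM-style:
    F_int = dv/dt - (v_des - v) / tau, with particles advected by
    optical flow and v_des approximated by a local average of the
    flow (box filter of size `win`). Parameter values illustrative.
    flow_prev, flow_curr: (H, W, 2) optical-flow fields."""
    dv_dt = flow_curr - flow_prev            # temporal flow derivative
    # Desired velocity: local spatial average of the current flow.
    pad = win // 2
    padded = np.pad(flow_curr, ((pad, pad), (pad, pad), (0, 0)),
                    mode='edge')
    h, w, _ = flow_curr.shape
    v_des = np.zeros_like(flow_curr)
    for dy in range(win):
        for dx in range(win):
            v_des += padded[dy:dy + h, dx:dx + w]
    v_des /= win * win
    f_int = dv_dt - (v_des - flow_curr) / tau
    return np.linalg.norm(f_int, axis=2)     # force magnitude per pixel
```

A uniform, steady flow field yields zero interaction force everywhere, consistent with the intuition that abnormal events show up as regions where actual motion deviates from the locally desired motion.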