Marco Cristani

Papers from this author

SIMCO: SIMilarity-Based Object COunting

Marco Godi, Christian Joppi, Andrea Giachetti, Marco Cristani

Auto-TLDR; SIMCO: An Unsupervised Multi-class Object Counting Approach on InShape

We present SIMCO, a completely agnostic multi-class object counting approach. SIMCO starts by detecting foreground objects through a novel Mask RCNN-based architecture trained beforehand (just once) on a brand-new synthetic 2D shape dataset, InShape; the idea is to highlight every object resembling a primitive 2D shape (circle, square, rectangle, etc.). Each detected object is described by a low-dimensional embedding, obtained from a novel similarity-based head branch; the latter implements a triplet loss, encouraging similar objects (same 2D shape, color and scale) to map close together. Subsequently, SIMCO uses this embedding for clustering, so that different 'classes' of similar objects can emerge and be counted, making SIMCO the very first multi-class unsupervised counter. The only required assumption is that repeated objects are present in the image. Experiments show that SIMCO provides state-of-the-art scores on counting benchmarks and that it can also help in many challenging image understanding tasks.
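The two ideas named in the abstract, a triplet loss on embeddings and counting by clustering, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the margin, the distance, or the clustering algorithm, so the Euclidean distance, the `margin` and `radius` values, and the greedy clustering below are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np
from collections import Counter

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on Euclidean distances.
    The margin value is an assumption, not taken from the paper."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to a similar object
    d_neg = np.linalg.norm(anchor - negative)  # distance to a dissimilar object
    return max(0.0, d_pos - d_neg + margin)

def count_by_clustering(embeddings, radius=0.5):
    """Greedy clustering (an assumed stand-in for the paper's clustering step).
    Each embedding joins the first cluster whose representative (its first
    member) lies within `radius`; otherwise it opens a new cluster.
    Returns a Counter mapping cluster index -> number of objects."""
    representatives = []
    labels = []
    for e in embeddings:
        for i, rep in enumerate(representatives):
            if np.linalg.norm(e - rep) <= radius:
                labels.append(i)
                break
        else:
            representatives.append(e)
            labels.append(len(representatives) - 1)
    return Counter(labels)

# Toy 2D embeddings: three objects of one kind, two of another.
toy_embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [0.0, 0.1]])
toy_counts = count_by_clustering(toy_embeddings)
```

On the toy data, `toy_counts` holds one entry per emerged 'class', with the number of objects assigned to it. A real pipeline would obtain the embeddings from the trained network rather than from hand-written vectors.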

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Hasan Irtiza, Marco Cristani, Fabio Galasso

Auto-TLDR; Transformer Networks for Trajectory Prediction of People Interactions

Most recent successes on forecasting the people motion are based on LSTM models and all most recent progress has been achieved by modelling the social interaction among people and the people interaction with the scene. We question the use of the LSTM models and propose the novel use of Transformer Networks for trajectory forecasting. This is a fundamental switch from the sequential step-by-step processing of LSTMs to the only-attention-based memory mechanisms of Transformers. In particular, we consider both the original Transformer Network (TF) and the larger Bidirectional Transformer (BERT), state-of-the-art on all natural language processing tasks. Our proposed Transformers predict the trajectories of the individual people in the scene. These are “simple” models because each person is modelled separately without any complex human-human nor scene interaction terms. In particular, the TF model without bells and whistles yields the best score on the largest and most challenging trajectory forecasting benchmark of TrajNet [1]. Additionally, its extension which predicts multiple plausible future trajectories performs on par with more engineered techniques on the 5 datasets of ETH [2]+UCY [3]. Finally, we show that Transformers may deal with missing observations, as it may be the case with real sensor data. Code is available at github.com/FGiuliari/Trajectory-Transformer
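The attention-based memory mentioned above rests on scaled dot-product attention, the core operation of the Transformer. The sketch below shows that operation alone on a toy sequence of 2D positions. It is not the model from the linked repository: there are no learned projections, multiple heads, or positional encodings, and the `key_mask` argument is only one assumed way of ignoring a missing time step, since the abstract does not say how missing observations are handled.

```python
import numpy as np

def attention(q, k, v, key_mask=None):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    key_mask is an optional boolean vector; False marks a time step that
    should receive zero weight (assumes at least one entry is True)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if key_mask is not None:
        # Masked keys get -inf so their softmax weight becomes exactly 0.
        scores = np.where(key_mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy trajectory: four observed (x, y) positions of one person;
# the third observation is treated as missing.
observed = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
visible = np.array([True, True, False, True])
context, weights = attention(observed, observed, observed, key_mask=visible)
```

Every position attends to all visible positions in a single step, in contrast to the step-by-step state update of an LSTM. In `weights`, the column of the masked time step is zero and each row sums to one.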