Elsevier

Computers & Graphics

Volume 35, Issue 1, February 2011, Pages 54-66

Extended papers from NPAR 2010
Stylized ambient displays of digital media collections

https://doi.org/10.1016/j.cag.2010.11.004

Abstract

The falling cost of digital cameras and camcorders has encouraged the creation of massive collections of personal digital media. However, once captured, this media is infrequently accessed and often lies dormant on users' PCs. We present a system to breathe life into home digital media collections, drawing upon artistic stylization to create a “Digital Ambient Display” that automatically selects, stylizes and transitions between digital contents in a semantically meaningful sequence. We present a novel algorithm based on multi-label graph cut for segmenting video into temporally coherent region maps. These maps are used to both stylize video into cartoons and paintings, and measure visual similarity between frames for smooth sequence transitions. The system automatically structures the media collection into a hierarchical representation based on visual content and semantics. Graph optimization is applied to adaptively sequence content for display in a coarse-to-fine manner, driven by user attention level (detected in real-time by a webcam). Our system is deployed on embedded hardware in the form of a compact digital photo frame. We demonstrate coherent segmentation and stylization over a variety of home videos and photos. We evaluate our media sequencing algorithm via a small-scale user study, indicating that our adaptive display conveys a more compelling media consumption experience than simple linear “slide-shows”.

Research Highlights

► New approach to structuring user media collections, and sequencing content for display from those collections. We have developed a hierarchical approach to clustering and navigating visual media collections (images, videos). The Digital Ambient Display (DAD) now transitions between content by moving through a tree; each node contains a cluster of media items that share semantics and/or visual similarity. The walk through the tree is stochastic, and partly guided by user interest measured using a built-in camera on the DAD. This passive interaction for guiding transitions is also novel with respect to our NPAR paper. Novel material in Sections 3 and 5. ► User study exploring the efficacy of the content sequencing. We measure user engagement with the DAD using attention monitoring technology, and compare the level of engagement with a random slideshow. A paired t-test indicates that a more engaging display results from our content sequencing approach. Novel material in Section 6.2.

Introduction

Traditional approaches to linearly browsing personal media collections (e.g. photo albums, slideshows) are becoming impractical due to the explosive growth of media repositories. Furthermore, although digital media is intrinsically more accessible than physical archives, the focus on the PC as the main portal to these collections poses a convenience barrier to realizing their value. The proliferation of video and image data in digital form creates demand for an effective means to browse large volumes of digital media in a structured, accessible and intuitive manner.

This paper proposes a novel approach to the consumption of home digital media collections, centred upon ambient experiences. Ambient experiences are distinguished from compelling or intense experiences in that they are able to co-exist harmoniously with other activities such as conversations, shared meals and so forth. An ambient experience does not demand the full attention of the user but is able to play out in a pleasing, unobtrusive way such that fresh and interesting content is available in the attention spaces of everyday life.

People have been creating ambient displays of media for decades; by placing photographs on the mantelpiece, or leaving the television or radio on in the background. These established behaviors reflect commonly appreciated value in the passive consumption of media. Yet, beyond updates to accommodate new capture technology (e.g. digital photo frames) methods of passive content delivery and consumption have remained largely unchanged. Our work aims to create a platform for ambient delivery of home digital media content. Such content often encodes householders' memories and life experiences. We seek to emulate, for digital media, the serendipitous process of rediscovery often experienced whilst browsing physical media archives (e.g. a box of photos in the attic) that can trigger enjoyable reminiscence over past memories and events.

Although considerable research has been devoted to direct interactive approaches for browsing digital media collections, there is little previous work addressing the problem of displaying digital content in an ambient manner (Section 1.2). Typical approaches to browsing small or medium scale photo sets project thumbnails onto either a planar or a spherical surface, so that images that are visually similar are located in close proximity in the visualization [1], [2], [3], [4]. Large photo sets are often handled by clustering content into subsets, sometimes arranged hierarchically for visualization and manual navigation [5], [6], [7]. With the expected proliferation of large format video displays around the home, recent work explores the specific domain of household digital media interaction [8], [9]. Yet, the ambient dissemination of home visual media, and the associated issues of interaction in an ambient context, remain sparsely researched.

An earlier version of this extended paper [10] presented a video-only digital ambient display with simple media sequencing. The novel contribution of this paper over [10] is a more sophisticated approach to sequencing, involving unsupervised hierarchical clustering of media and content navigation influenced by user attention. In addition, an implementation on embedded hardware has enabled an in situ user evaluation of the digital ambient display, also presented in this paper.

The digital ambient display (DAD) is an always-on display for living spaces that enables users to effortlessly visualize and rediscover their personal digital media collections. DADs address the paradoxical requirement of an autonomous technology to passively disseminate media collections, whilst also enabling minimal interaction to actively navigate routes through content that may trigger interest and user reminiscence. By transitioning between selected media items, the DAD passively presents a global summary visualizing the essential structure of the collection. This results in an evolving temporal composition of media, the sequencing of which considers both media semantics and visual appearance, as well as adaptively responding to user attention (sensed via gaze detection). Rather than simply stitching digital content together, we harness artistic stylization to depict image and video in a more abstract form. In contrast to photorealism, which often proves distracting in the ambient setting (e.g. a television in the corner of a café), artistic stylization provides an aesthetically pleasing and unobtrusive means of disseminating content in the ambient setting; creating a flowing, temporal composition that conveys the essence of users' experiences through an artistic representation of their digital media collection (Fig. 1).

Creating a DAD requires that the media be automatically parsed into a structured representation that enables semantically meaningful routes to be navigated through the collection. This process is dependent on the user-assigned meta-data tags on each media item. Furthermore, the visual content within individual media items must also be parsed into a mid-level visual scene representation that enables both:

  1. Artistic rendering of media into aesthetically pleasing forms.

  2. Generation of appropriate transition effects and sequencing decisions, to create an appealing temporal composition.

While artistic rendering of still images has been extensively explored, temporally coherent stylization of video remains a challenging task that requires a stable and consistent description of scene structure. Following DeCarlo and Santella [11] as well as Collomosse et al. [12], we identify a color region segmentation as an appropriate “mid-level” scene abstraction, and in Section 4 contribute a novel algorithm for segmenting video frames into a deforming set of temporally coherent regions. We demonstrate how these regions may be stylized via either shading or stroke-based rendering, to produce coherent cartoon and painterly video styles (Section 4.5). We describe our hierarchical approach to structuring the media collection in Section 3, and describe how content is sequenced at run-time using that representation in Section 5. Qualitative evaluations of segmentation coherence, and a quantitative user evaluation of the DAD, are presented in Section 6.

Video temporal composition was first proposed by Schodl et al. [13], within the scope of a single video and based upon visual similarity only. Much as motion graphs [14] construct a directed graph that encapsulates connections among motion capture fragments, the Video Textures of Schodl et al. create a graph of video fragments that may be walked in perpetuity to create a temporal composition. Our proposed approach to composition borrows from these graph representations [13], [14], but uses a hierarchical representation, spans multiple videos, and measures similarity both visually and semantically.
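The graph-walk idea can be sketched as follows. This is an illustrative simplification rather than the paper's implementation: fragments are reduced to feature vectors, similarity to a Gaussian affinity, and the perpetual walk to a plain Markov chain (all function names and the `sigma` parameter are our own assumptions).

```python
import numpy as np

def build_transition_matrix(features, sigma=1.0):
    """Turn pairwise fragment distances into a stochastic transition
    matrix (a video-textures-style graph over media fragments)."""
    features = np.asarray(features, dtype=float)
    # Euclidean distance between every pair of fragment descriptors.
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=2)
    p = np.exp(-d / sigma)       # similar fragments -> high affinity
    np.fill_diagonal(p, 0.0)     # forbid self-transitions
    return p / p.sum(axis=1, keepdims=True)

def random_walk(p, start, steps, rng=None):
    """Walk the directed graph, preferring transitions to similar fragments."""
    rng = np.random.default_rng(rng)
    node, path = start, [start]
    for _ in range(steps):
        node = int(rng.choice(len(p), p=p[node]))
        path.append(node)
    return path
```

In the full system each node additionally carries semantic similarity and the walk is biased by user attention; this sketch captures only the stochastic-walk skeleton.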

Hierarchical structuring of media is common to many contemporary approaches for interactively navigating collections. For example, Krishnamachari et al. form tree structures from an image collection, imposing a coarse-to-fine representation of image content within clusters and enabling users to navigate up and down the tree levels via representative images from each cluster. This approach was later adopted by Chen et al. [5] and Goldberger et al. [6]. Chen et al. propose a fast search algorithm and a fast-sparse clustering method for building hierarchical tree structures from large image collections. Goldberger et al. combine discrete and continuous image models with information-theoretic criteria for unsupervised hierarchical clustering. Images are clustered such that the mutual information between the clusters and the image content is maximally preserved. Our approach to structuring collections combines hierarchical clustering with a graph optimization approach [14] to navigate and visualize large media collections. The resulting system differs from existing hierarchical clustering approaches [5], [6], [7] in several ways. Rather than exploiting only low-level visual features in the clustering, we incorporate high-level semantic similarity when constructing the top levels of the tree, and global image feature descriptors via a bag of visual words (BoW) framework [15] when constructing the lower levels. Consequently, this tree structure not only enables a global semantic summary of the collection, but also encodes visual similarities at various levels. Furthermore, in our system each node in the hierarchy encodes a directed graph that encapsulates connections among the digital items assigned to that node, rather than an unstructured subset of media as typified by previous work.
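A minimal sketch of such a two-level hierarchy, assuming each item carries one semantic tag and a bag-of-visual-words histogram (the helper names and the tiny k-means below are our own, not the paper's code):

```python
import numpy as np
from collections import defaultdict

def kmeans(x, k, iters=20, seed=0):
    """Tiny k-means used to split one tag's items by visual similarity."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), size=k, replace=False)]  # initial centroids
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - c[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                c[j] = x[labels == j].mean(axis=0)
    return labels

def build_hierarchy(items, k=2):
    """items: list of (tag, bow_histogram) pairs.  The top level groups
    items by semantic tag; within each tag, k-means on the visual
    histograms forms the child clusters."""
    by_tag = defaultdict(list)
    for i, (tag, _) in enumerate(items):
        by_tag[tag].append(i)
    tree = {}
    for tag, idx in by_tag.items():
        h = np.array([items[i][1] for i in idx], dtype=float)
        labels = kmeans(h, k) if len(idx) > k else np.zeros(len(idx), dtype=int)
        children = defaultdict(list)
        for i, lab in zip(idx, labels):
            children[int(lab)].append(i)
        tree[tag] = dict(children)
    return tree
```

In the paper each leaf cluster additionally stores a directed similarity graph over its members; the sketch stops at the clustering itself.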

Video stylization was first addressed by Litwinowicz [16], who produces painterly video by pushing brush strokes from frame to frame in the direction of optical flow motion vectors. This approach was later extended by Hays and Essa [17], who similarly move strokes but within independent motion layers. Complementary work by Hertzmann and Perlin [18] uses differences between consecutive frames of video, painting over areas of the new frame that differ significantly from the previous frame. While these methods can produce impressive painterly video, errors in the estimated per-pixel motion field can quickly accumulate and propagate to subsequent frames, resulting in increasing temporal incoherence. This can lead to a distracting scintillation or “flicker” when strokes of the stylized output no longer match object motion [19].

More recently, image segmentation techniques have been applied to yield mid-level models of scene structure [20], [21] that can be rendered in artistic styles. Extending the mean-shift based stylization of images [11], Collomosse et al. [12] create spatio-temporal volumes from video by associating 2D segmentations over time and fitting stroke surfaces to voxel objects. Although this geometric smoothing improves stability, temporal coherence is not ensured because the region map for each frame is formed independently, without knowledge of adjacent frames. Furthermore, association is confounded by the poor repeatability of 2D segmentation algorithms between similar frames, causing variations in the shape and photometric properties of regions that require manual correction. Wang et al. [21] also transform video into spatio-temporal volumes, by clustering space-time pixels using a mean-shift operator. However, this approach becomes computationally infeasible for the pixel counts of even moderately sized videos, and often under-segments small or fast-moving objects, which form disconnected volumes. These artifacts also require manual correction, including frequent grouping of space-time volumes.

Winnemoeller et al. [22] present a method to abstract video using a bilateral filter, attenuating detail in low-contrast regions via an approximation to anisotropic diffusion while artificially increasing contrast in higher contrast regions with difference-of-Gaussian edges. A variant of the bilateral filter presented by Kyprianidis et al. [23] is aligned to the local structure of the image, avoiding block artifacts and creating smooth boundaries between color regions. Another variant, presented by Kang et al. [24], is guided by a feature flow field, which improves feature preservation, noise reduction, and stylization. A further technique, based on an anisotropic generalization of the Kuwahara filter, has been presented by Kyprianidis et al. [25]. Bhat et al. [26] provide a solution for applying gradient-domain filters to videos and video streams in a coherent manner. Such approaches do not seek to parse a description of scene structure, making them useful for scenes that are difficult to segment, but limiting them to a characteristic soft-shaded artistic style.

We adopt a scene segmentation approach, convenient both for diverse artistic rendering and for creating the structural correspondences between frames needed for transition animations. We propose a new video segmentation algorithm in which the segmentation of each frame is guided by priors propagated by motion flow from the region labels of past frames. In doing so we combine the automation of early optical flow stylization algorithms with the robustness and coherence of region segmentation approaches; propagating labels with flow, and resolving ambiguities using a graph-cut optimization to create coherent region maps. Some recent interactive “video cut-out” systems are similar in spirit [27], [28], tracking key-points on region boundaries over time for matte segmentation. However, we differ in several ways. First, we propagate label priors and data forward with motion flow within regions, rather than tracking 2D windows on region boundaries that contain clutter from adjacent regions. Second, we are more general, producing a multi-label (region) map rather than a binary matte. Third, both interactive systems [27], [28] require regular manual correction, typically every five frames. Our algorithm requires no user interaction, beyond (optional) modification of the initial frame for aesthetics.
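The per-frame labeling step can be illustrated with a heavily simplified sketch: a single diagonal Gaussian per region stands in for the incrementally built GMM colour models, and an independent per-pixel unary minimisation stands in for the multi-label graph cut (which additionally enforces pairwise smoothness between neighbouring pixels). All function and parameter names here are hypothetical.

```python
import numpy as np

def fit_region_models(pixels, labels, n_regions, eps=1e-3):
    """One diagonal Gaussian per region: a stand-in for the paper's
    incrementally built per-region GMM colour models."""
    models = []
    for r in range(n_regions):
        p = pixels[labels == r]
        models.append((p.mean(axis=0), p.var(axis=0) + eps))
    return models

def label_frame(pixels, models, prior_labels, prior_weight=2.0):
    """Assign each pixel the label minimising a unary cost: negative
    Gaussian log-likelihood minus a bonus for agreeing with the
    motion-propagated prior label.  A full implementation would solve
    this with a multi-label graph cut, adding pairwise smoothness."""
    n = len(models)
    cost = np.empty((len(pixels), n))
    for r, (mu, var) in enumerate(models):
        # Negative log-likelihood of a diagonal Gaussian (up to a constant).
        cost[:, r] = 0.5 * (((pixels - mu) ** 2 / var) + np.log(var)).sum(axis=1)
    cost[np.arange(len(pixels)), prior_labels] -= prior_weight
    return np.argmin(cost, axis=1)
```

The `prior_weight` term plays the role of the propagated label priors: where colour evidence is ambiguous, the labeling falls back to the labels carried forward by optical flow.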

Section snippets

System overview

The Digital Ambient Display (DAD) visualizes home media collections comprising photos and videos. Videos are ingested as short, visually interesting clips that form the atomic unit of composition. Obtaining such clips differs from classical shot detection as raw home footage tends to consist of a few lengthy shots. An existing algorithm [29] performs this pre-processing. The ingested media collection is clustered into a hierarchical representation according to semantic content (derived from

Structuring the media collection

We represent the media collection as a hierarchy of pointers to media items. Each node in the tree represents a subset of the media collection sharing a common semantic theme or visual appearance.

Video stylization

We next describe a coherent video segmentation algorithm which performs a multi-label graph cut on successive video frames, using both photometric properties of the current frame and prior information propagated forward from previous frames. This information comprises:

  1. an incrementally built Gaussian Mixture Model (GMM) encoding the color distribution of each region over past frames;

  2. a subset of pixel-to-region labels from the previous frame.

We check for region under-segmentation (e.g. the

Content sequencing

Finally, we explain the algorithms for sequencing stylized content to create the temporal composition of media items from the user's collection, and for creating the animated transitions between displayed items.

Results and user study

We present a qualitative comparison of the proposed video segmentation algorithm with two existing techniques [33], [49] and present a gallery of stills from videos stylized into cartoons and paintings. We also present a small-scale study exploring user engagement with the DAD.

Conclusion

We have presented a digital ambient display (DAD) that harnesses artistic stylization to create an abstraction of users' experiences through their home digital media collections. The DAD automatically selects, stylizes and transitions between media contents, enabling users to passively or actively consume their digital media collections and rediscover past memories.

We contributed a novel algorithm for coherent video segmentation based on multi-label graph cut, and applied this algorithm to

Acknowledgement

This work was funded by Hewlett Packard under the IRP studentship programme (Grant #477).

References (50)

  • T.T.A. Combs et al.

    Does zooming improve image browsing?

  • Heath K, Gelfand N, Ovsjanikov M, Aanjaneya M, Guibas LJ. Image webs: computing and exploiting connectivity in image...
  • K. Rodden et al.

    Does organisation by similarity assist image browsing?

  • G. Schaefer

    A next generation browsing environment for large image repositories

    Multimedia Tools and Applications

    (2010)
  • J. Chen et al.

    Hierarchical browsing and search of large image databases

    IEEE Transactions on Image Processing

    (2000)
  • J. Goldberger et al.

    Unsupervised image-set clustering using an information theoretic framework

    IEEE Transactions on Image Processing

    (2006)
  • Krishnamachari S, Abdel-Mottaleb M. Image browsing using hierarchical clustering. In: IEEE symposium on computers and...
  • Arksey N. Exploring the design space for concurrent use of personal and large displays for in-home collaboration....
  • You W, Feis S, Lea R. Studying vision-based multiple-user interaction with in-home large displays. In: Proceedings of...
  • Wang T, Collomosse J, Slatter D, Cheatle P, Greig D. Video stylization for digital ambient displays of home movies. In:...
  • D. DeCarlo et al.

    Stylization and abstraction of photographs

  • J. Collomosse et al.

    Stroke surfaces: temporally coherent artistic animations from video

    Transactions on Visualization and Computer Graphics

    (2005)
  • Schodl A, Szeliski R, Salesin D, Essa I. Video textures. In: Proceedings of the ACM SIGGRAPH, 2000. p....
  • Kovar L, Gleicher M, Pighin F. Motion graphs. In: Proceedings of the ACM SIGGRAPH, 2002. p....
  • Sivic J, Zisserman A. Video google: a text retrieval approach to object matching in videos. In: Proceedings of the...
  • Litwinowicz P. Processing images and video for an impressionist effect. In: SIGGRAPH, 1997. p....
  • Hays J, Essa IA. Image and video based painterly animation. In: NPAR, 2004. p....
  • Hertzmann A, Perlin K. Painterly rendering for video and interaction. In: NPAR, 2000. p....
  • Meier BJ. Painterly rendering for animation. In: Proceedings of the ACM SIGGRAPH, 1996. p....
  • Collomosse J. Higher level techniques for the artistic rendering of images and video. PhD thesis, University of Bath;...
  • Wang J, Xu Y, Shum H, Cohen M. Video tooning. In: SIGGRAPH, vol. 23, 2004. p....
  • Winnemoller H, Olsen S, Gooch B. Real-time video abstraction. In: ACM SIGGRAPH, 2006. p....
  • Kyprianidis JE, Döllner J. Image abstraction by structure adaptive filtering. In: Proceedings of the EG UK theory and...
  • H. Kang et al.

    Flow-based image abstraction

    IEEE Transactions on Visualization and Computer Graphics

    (2009)
  • Kyprianidis J-E, Kang H, Doellner J. Image and video abstraction by anisotropic Kuwahara filtering. In: Proceedings of...