Tuesday, October 6, 2009

User Guided Music Editing



Today, when we edit photos we have a variety of tools, from removing unwanted tourists to copying the kids from another photo. However, when editing music we are far less equipped. Ideally, we would like to interact with our music with the same richness that we can with photos: add/remove/edit tracks. Unfortunately, current audio editing tools (e.g. using waveforms and spectral plots) are prohibitive for everyday people.

Paris Smaragdis, a researcher at Adobe, has created a system where users can select audio tracks (like the singer or the snare drums) from songs simply by humming or singing (or even coughing!) them.

The key idea is to use human examples (such as singing or humming) to guide a probabilistic model for factorizing a song into its frequency components and time-varying magnitudes.

First, a factorization is performed on the user samples (e.g. the humming of the tune). Then, the resulting frequency components are used as weights in factoring the original song. This produces a factorization that emphasizes the frequencies in the original song that are close to the user's example. For example, singing along to Sting's voice will produce a factored song where Sting's voice is separated out.
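To make the two-step idea concrete, here is a minimal sketch. It is not the author's probabilistic model: I am swapping in plain non-negative matrix factorization with multiplicative updates, and I assume the user's frequency components (`w_user`) have already been learned from the humming. The names `guided_factorize` and `user_part` are made up for illustration. The user's components stay fixed, while the extra "free" components soak up the rest of the song.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def sq_error(v, w, h):
    """Squared reconstruction error between V and W*H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, wh) for a, b in zip(ra, rb))

def guided_factorize(v, w_user, n_free=1, iters=300, eps=1e-9, seed=0):
    """Factor spectrogram V (freq x time) as W*H, keeping the user's columns of W fixed."""
    rng = random.Random(seed)
    f, t = len(v), len(v[0])
    k_user = len(w_user[0])
    k = k_user + n_free
    # Columns 0..k_user-1 come from the user's example; the rest start random.
    w = [list(w_user[i]) + [rng.uniform(0.1, 1.0) for _ in range(n_free)] for i in range(f)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(t)] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update for the time-varying magnitudes H.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(t)] for i in range(k)]
        # Multiplicative update for W, applied to the free columns only.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        for i in range(f):
            for c in range(k_user, k):
                w[i][c] *= num[i][c] / (den[i][c] + eps)
    return w, h

def user_part(w, h, k_user):
    """Rebuild only the part of the song explained by the user's components."""
    return matmul([row[:k_user] for row in w], h[:k_user])
```

Calling `user_part(w, h, 1)` after the factorization gives a spectrogram for the hummed source alone. Turning that back into audio (e.g. by masking the original spectrogram) is left out of this sketch.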

Once separated out, one can modify it or remove it. Very cool!

This work was presented at UIST 2009.

Unfortunately, the paper has not been put on the web (yet). I will update as it becomes available.

Monday, October 5, 2009

Microsoft's New Multi-touch Mice



Microsoft combines the power of multi-touch (think Apple's iPhone, Microsoft Surface, etc.) with the everyday mouse. Multi-touch enables a host of applications ranging from pinch-to-zoom photo manipulation to easy web-page navigation via swipes. Combine that with a mouse and suddenly you can perform not only conventional mouse activities but also multi-touch gestures.

Some novel applications that derive directly from this form factor include multi-selection, gestures for cut-and-paste, and a single-handed navigation technique for first-person shooters.

The key idea is really the implementations used to realize such a combination. The paper discusses five implementations: 1) a capacitance-based mouse, 2) a mouse based on FTIR (Frustrated Total Internal Reflection), 3) an optical mouse, 4) a table-top mouse (also using IR), and 5) an articulated mouse.

This work was presented at UIST 2009.

The project webpage, which contains more videos and the paper, can be found here.

Sunday, August 30, 2009

Mobile Phone Localization via the Ambient Environment

Location is playing an increasingly important role on mobile devices. In the near future, our phones may be notifying us of nearby sales, interesting tourist attractions, and of course nearby friends.

However, current localization techniques (i.e. techniques for finding the phone's position), such as GPS, are error prone. Outdoors, GPS has about 10 meter accuracy. Indoors, it fails completely. Indoor localization techniques have been proposed (such as SkyHook), but these are still not ubiquitous. Furthermore, such techniques won't help differentiate which side of a wall you're on. This is important when the wall separates two distinctly different stores.

Well, researchers at Duke University have devised a clever way to localize your mobile phone, using sensing information from the surrounding ambient environment.

The key idea is that the local context (or environment) provides strong cues to your location. For example, audio plays an important role: at Starbucks, you may hear sounds of coffee grinding. Local thematic colors also play a role: at Target, you may see lots of red objects. Also, lighting styles are an important cue: consider the difference in lighting in a bar vs. at Blockbuster. Finally, even the way the phone travels (carried by a person of course!) is a cue to its location. At Safeway, you may walk up and down aisles. In a restaurant, you may wait in line, then sit down at a table. All of this sensory information (audio, thematic colors, lighting, and acceleration) provides important cues to where you (or your phone, rather) are.

The system operates by having a mobile device measure sensory information and by using this information to filter candidate locations and to match the remaining ones. First, to gather ground truth data, 4 sets of measurements are acquired at each location: audio, acceleration, photos (for colors and lighting), and nearby wifi hotspots. Then, to identify an unknown location, the same 4 measurements are made. These are used to filter the known candidate locations (by measuring similarity in the 4 measurements) and to rank the final candidates.
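The filter-then-match step might look something like the sketch below. The details are my own assumptions, not the paper's: I filter on wifi overlap first and then rank the survivors by a plain Euclidean distance over the other measurements, and the feature vectors, the `min_overlap` threshold, and the store names are all invented for illustration.

```python
from math import sqrt

def wifi_overlap(a, b):
    """Jaccard overlap between two sets of visible hotspot identifiers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def distance(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def localize(query, database, min_overlap=0.3):
    """Return candidate location names, best match first."""
    # Filter: drop known locations whose hotspots barely overlap the query's.
    candidates = [name for name, fp in database.items()
                  if wifi_overlap(query["wifi"], fp["wifi"]) >= min_overlap]

    # Match: rank the survivors by summed distance over the other measurements.
    def score(name):
        fp = database[name]
        return sum(distance(query[key], fp[key]) for key in ("audio", "accel", "color"))

    return sorted(candidates, key=score)

# Hypothetical ground-truth fingerprints (all numbers are made up).
database = {
    "coffee shop": {"wifi": {"a", "b"}, "audio": [0.9, 0.1], "accel": [0.1], "color": [0.4, 0.3, 0.2]},
    "retail store": {"wifi": {"a", "b", "c"}, "audio": [0.2, 0.3], "accel": [0.6], "color": [0.9, 0.1, 0.1]},
    "bar": {"wifi": {"x", "y"}, "audio": [0.7, 0.8], "accel": [0.05], "color": [0.1, 0.1, 0.1]},
}
query = {"wifi": {"a", "b"}, "audio": [0.85, 0.15], "accel": [0.15], "color": [0.45, 0.3, 0.2]}
ranked = localize(query, database)
```

Here the bar is filtered out because it shares no hotspots with the query, and the two stores that share a wall are told apart by their ambient fingerprints.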

One interesting point is that color can be very different, depending on what photos you take in the store. So to disambiguate, photos are taken of the floor only (detectable by the phone's accelerometer). It turns out that floor photos are good cues for store disambiguation!

The results are promising. On a test run over 51 distinct stores, the system localizes successfully 85% of the time on average. Of course this data can be used to reinforce the localization. I imagine that as more data is observed, this localization rate should increase.

Very clever idea!

This work was presented in late September at MobiCom 2009.

The paper can be found here.

Tuesday, August 18, 2009

Local Layering



Standard digital image compositing operates on entire layers, a convention borrowed from the film industry. For example, in Adobe Photoshop you can composite two layers together via a variety of operators: blending, additive, burn, etc. However, in reality, objects can locally overlap each other in different ways. For example, if you have a pile of clothes laid flat on your bed, the arm-portion of the shirt may cover the pants. The left pant leg could cover the dress, which partially covers the shirt. In short, layers should also be able to locally change their overlap ordering, just like in reality.

Researchers at Carnegie Mellon University have developed a simple yet elegant system for allowing a user to perform local layering.

The key idea is the list graph formulation. In a standard layering scheme (like in Adobe Photoshop), a single list can describe the layer ordering. However, with local layering, one needs multiple lists, one for each overlap region. This intuition exactly describes the list graph. The list graph is a graph that has a single list at each vertex. A vertex represents an overlap region, so the associated list is the local ordering of the layers! Edges in the graph represent adjacent regions of overlap.

While the graph formulation is simple, one needs to be careful to preserve consistency between adjacent overlap regions. For example, two layers that appear in both of two adjacent regions cannot flip their order from one region to the other. Intuitively, such a flip would mean the two layers intersect each other (as opposed to one lying on top of the other in a physical sense).
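Here is a small sketch of the list graph and the consistency check as I understand them from the description. The representation (a dict of bottom-to-top lists plus a list of edges) and the function name `is_consistent` are my own choices, not the paper's.

```python
from itertools import combinations

def is_consistent(lists, edges):
    """Check that no pair of layers flips its order across adjacent overlap regions.

    lists: region name -> layer names ordered bottom to top
    edges: pairs of adjacent region names
    """
    for a, b in edges:
        # Layers present in both regions, kept in region a's order.
        shared = [layer for layer in lists[a] if layer in lists[b]]
        position_in_b = {layer: i for i, layer in enumerate(lists[b])}
        for lower, upper in combinations(shared, 2):
            # lower is below upper in region a, so the same must hold in region b.
            if position_in_b[lower] > position_in_b[upper]:
                return False
    return True

# The pile of clothes: shirt over pants, pants over dress, dress over shirt.
clothes = {
    "sleeve region": ["pants", "shirt"],
    "leg region": ["dress", "pants"],
    "hem region": ["shirt", "dress"],
}
clothes_edges = [("sleeve region", "leg region"), ("leg region", "hem region")]

# Two adjacent regions where the same two layers swap order.
flipped = {"left": ["A", "B"], "right": ["B", "A"]}
flipped_edges = [("left", "right")]
```

The cyclic clothes arrangement passes the check, since no single pair of layers swaps between neighboring regions, while the `flipped` example fails it.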

Given this notion of consistency, two user operators are defined, Flip-Up and Flip-Down, which preserve consistency while allowing the user to change the local layer ordering.

More results are shown for creating impossible images (like the Necker Cube) and for animation. This work is simple yet very clever!

This work was presented at SIGGRAPH 2009.

More details (including the paper and videos) can be found on the project webpage.

Monday, August 17, 2009

Content-aware Resizing using Multiple Operators



Content-aware resizing is a resizing operation, performed on an image or video, that preserves the embedded content. For example, in a photograph of a skier in the Swiss Alps, one can resize it for display on a mobile device by carving out image seams of the Alps, while retaining the subimage of the skier. Naive methods for resizing would simply crop the Alps or scale the entire image, which would make the skier smaller. Content-aware resizing is also important for resizing video for mobile display; imagine watching the Super Bowl on your mobile device, while still being able to see the football players!

However, it turns out that in some cases scaling/cropping can produce more visually-pleasing results than a content-preserving operator. How does one choose between the three operators: scaling, cropping, and content-preserving resizing? Well, researchers at the Interdisciplinary Center in Herzliya and Adobe have developed an algorithm for selecting resizing operators to produce visually-pleasing images and videos.

The key idea is to combine several operators together. In other words, to resize an image, one might first scale and then perform a content-aware resizing. In order to find the best combination of operators, a resizing space is defined. Each axis represents a resizing operator, such as scale and crop. A point in this space is a set of operators applied to the original image. Hence, the problem of finding the best set of operators can be reduced to a dynamic programming problem, given a metric for image similarity.
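As a rough sketch of searching the resizing space with dynamic programming: each point is a tuple counting how many times each operator has been applied, and the table keeps the cheapest way to reach every point. I am assuming the cost is additive over steps and given by a hypothetical `step_cost(op, point)` function; the real system scores images with a similarity metric, so treat this only as an illustration of the search.

```python
def best_operator_mix(total_steps, operators, step_cost):
    """Find the cheapest sequence of total_steps operator applications.

    A point in the resizing space is a tuple with one count per operator.
    step_cost(op, point) is the (assumed additive) cost of applying op at point.
    """
    origin = (0,) * len(operators)
    frontier = {origin: (0.0, ())}  # point -> (best cost so far, operator sequence)
    for _ in range(total_steps):
        next_frontier = {}
        for point, (cost, path) in frontier.items():
            for axis, op in enumerate(operators):
                # Move one unit along this operator's axis.
                new_point = point[:axis] + (point[axis] + 1,) + point[axis + 1:]
                new_cost = cost + step_cost(op, point)
                if new_point not in next_frontier or new_cost < next_frontier[new_point][0]:
                    next_frontier[new_point] = (new_cost, path + (op,))
        frontier = next_frontier
    point, (cost, path) = min(frontier.items(), key=lambda item: item[1][0])
    return point, cost, path

# Toy cost (made up): every repeated use of the same operator hurts more.
OPS = ("scale", "crop", "seam")
BASE = {"scale": 1.0, "crop": 2.0, "seam": 3.0}

def toy_cost(op, point):
    already_used = point[OPS.index(op)]
    return BASE[op] * (already_used + 1)

point, cost, path = best_operator_mix(6, OPS, toy_cost)
```

With this toy cost the search ends up mixing all three operators rather than using one of them six times, which is the behavior the multi-operator idea is after.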

Intuitively, we would like the operators to preserve as much of the original image as possible. To measure the similarity between the resized image and the original, a new image-similarity metric is proposed, called Bi-Directional Warping. This metric measures the similarity between two images as the maximum misalignment error between rows or columns.
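To give a feel for a "maximum misalignment error between rows" metric, here is one plausible reading in code. I am assuming each row is aligned with an asymmetric dynamic-time-warping pass (every sample of the source row must map, in order, to some sample of the target row), with squared difference as the per-sample error. The names `adtw` and `bdw` and these details are my assumptions, not taken from the paper, and real images would use patches rather than single gray values.

```python
def adtw(source, target):
    """Asymmetric alignment cost of one row onto another.

    Every source sample is matched, in order, to some target sample.
    Target samples may be skipped or reused.
    """
    inf = float("inf")
    prev = [0.0] * (len(target) + 1)      # aligning zero source samples costs nothing
    for s in source:
        cur = [inf] * (len(target) + 1)   # cur[0]: no target samples available
        for j, t in enumerate(target, start=1):
            # Either leave target sample j unused, or match s to it.
            cur[j] = min(cur[j - 1], prev[j] + (s - t) ** 2)
        prev = cur
    return prev[-1]

def bdw(image_a, image_b):
    """Worst row-wise misalignment, measured in both directions."""
    return max(max(adtw(row_a, row_b), adtw(row_b, row_a))
               for row_a, row_b in zip(image_a, image_b))
```

Cropping `[1, 2, 3, 4]` down to `[1, 2]` is free in one direction (everything in the small row still exists in the big one) but costly in the other (the 3 and 4 have nowhere good to go), which is why both directions are checked.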

Given the image similarity metric and an optimization for finding the best set of resizing operators, resized images are produced that agree well with users' intuition of the best resized images. This agreement is verified with a user study. Very solid work!

This work was presented at SIGGRAPH 2009.

More details can be found on the project page.

Saturday, August 15, 2009

Manufacturing Custom Reflections

Left shows a user-created reflectance. Middle-left is the microfacet geometry. Center-left shows the milling machine used to carve the surface. Center-right shows a view of the surface. Right shows the reflection of the surface, when illuminated by a light (not shaped as a teapot).

Up until recently, computer graphics has been concerned with the measurement of reflectance. That is, how do materials look visually, under different illumination and viewing angles? Understanding a material's reflectance enables us to photorealistically render it in a synthetic scene. How about the reverse process? Can we create surfaces with custom reflective properties in the real world?

You might wonder why one would ever want to create custom reflectance in the real world. Honestly, I did too. But it turns out that custom reflectance could be useful in architecture (see Pope's revenge), military (as camouflage), and commercial applications (such as security markers on credit cards or your personal bling bling on your rice rocket!).

To generate custom reflections, researchers at the University College London, University of Southern California, Adobe, and Princeton have developed a system enabling a user to control a milling machine to generate a surface with a custom reflection.

The key idea is to use a microfacet model for representing the surface and its reflection. A microfacet model is a surface comprising a distribution of tiny, oriented mirrors.

The user first designs a reflection, like a teapot. This reflection is actually a convolution of the material properties of the surface (i.e. the BRDF) and the microfacet distribution (i.e. the normal distribution). Hence, the system deconvolves the user-defined reflection with the BRDF to obtain the normal distribution.

This normal distribution is sampled, and the placement and order of the samples are computed by an optimization (e.g. simulated annealing) that seeks an arrangement of normals that is smooth, tileable, and amenable to the milling machine.

Finally, the normal distribution is converted to a height field by solving a Poisson equation. This solution is a height field whose derivatives best match the given normal distribution.
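For the curious, here is a minimal sketch of the normals-to-height-field step. It minimizes the squared mismatch between the height differences and the slopes implied by the normals, which leads to a discrete Poisson equation. I solve it with simple Gauss-Seidel sweeps, which is my own choice of solver. I also assume unit grid spacing, x running along columns and y along rows, and normals stored as `(nx, ny, nz)` tuples with `nz > 0`.

```python
def gradients_from_normals(normals):
    """Turn per-pixel normals (nx, ny, nz) into slopes dz/dx and dz/dy."""
    p = [[-nx / nz for (nx, ny, nz) in row] for row in normals]  # slope along columns
    q = [[-ny / nz for (nx, ny, nz) in row] for row in normals]  # slope along rows
    return p, q

def heightfield(p, q, sweeps=500):
    """Least-squares height field whose forward differences match p and q."""
    rows, cols = len(p), len(p[0])
    z = [[0.0] * cols for _ in range(rows)]
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                # Each neighbor predicts this height from its own height and the slope between them.
                estimates = []
                if j + 1 < cols:
                    estimates.append(z[i][j + 1] - p[i][j])
                if j > 0:
                    estimates.append(z[i][j - 1] + p[i][j - 1])
                if i + 1 < rows:
                    estimates.append(z[i + 1][j] - q[i][j])
                if i > 0:
                    estimates.append(z[i - 1][j] + q[i - 1][j])
                if estimates:
                    z[i][j] = sum(estimates) / len(estimates)
    # Heights are only defined up to a constant, so pin the first corner to zero.
    base = z[0][0]
    return [[value - base for value in row] for row in z]
```

A grid of identical tilted normals should come back as a tilted plane, which makes for an easy sanity check. A real pipeline would want a faster solver than Gauss-Seidel for large grids.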

This height field is passed to the milling machine for manufacturing. Now, when a light is shone on this surface, you will see the reflection that the user designed. Cool!

It would be interesting to investigate how one could fabricate surfaces for light field capture and display. This could have applications in 3D TV.

This work was presented at SIGGRAPH 2009.

More details can be found on the project page.

Friday, August 14, 2009

3D Teleconferencing



3D teleconferencing has been one of the most popular science-fiction fantasies of computer graphics. Many of us remember the scene in Star Wars: A New Hope where R2D2 projects a 3D rendition of Princess Leia, crying out "Help me Obi-Wan Kenobi, you're my only hope!"

Unfortunately, due to the physics of light (i.e. light must scatter in a medium for us to observe it), we probably won't be seeing 3D Princess Leia in that form anytime soon. However, researchers at the University of Southern California and Fakespace Labs have constructed a 3D display for teleconferencing, which is similar in spirit to R2D2's historic 3D projection. The system displays a 3D face that tracks your gaze and holds eye contact. The overall experience provides a deeper communication connection for teleconferencing.

The key idea is a combination of 1) a fast 3D scanning system, 2) a spinning surface for 3D display, and 3) a 2D video feed for face tracking and simulating vertical parallax.

First, the remote participant (i.e. the dude in the box) is scanned at 30 Hz. This means that every 1/30th of a second, the remote participant's 3D geometry and texture are acquired. This is accomplished by an active-illumination system consisting of a projector and a synchronized camera. The projector shines phase-shifted sinusoidal patterns, which are measured by the camera. The output is a face texture map and a coarse 3D mesh.
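To show why phase-shifted sinusoids are handy, here is a sketch of the classic three-step phase recovery at a single camera pixel. The post does not say how many patterns the system uses, so the three shifts of -120, 0, and +120 degrees are my assumption, as are the function names and the ambient and contrast values.

```python
import math

# Assumed pattern shifts: -120, 0 and +120 degrees.
SHIFTS = (-2 * math.pi / 3, 0.0, 2 * math.pi / 3)

def observe(phase, ambient=0.5, contrast=0.4):
    """Intensities a camera pixel would see under the three shifted patterns."""
    return [ambient + contrast * math.cos(phase + shift) for shift in SHIFTS]

def recover_phase(i1, i2, i3):
    """Wrapped phase in (-pi, pi], independent of ambient light and contrast."""
    return math.atan2(math.sqrt(3) * (i1 - i3), 2 * i2 - i1 - i3)
```

The recovered phase tells you which part of the projected pattern hit the pixel. Unwrapping the phase and triangulating against the projector to get depth are separate steps not shown here.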

Next, this 3D mesh and face texture are displayed in 3D. This is accomplished by projecting onto a spinning, two-sided tent made of brushed aluminum. The projector is synchronized to the rotation to display the correct light field for the given orientation of the tent surface.

Finally, a camera is mounted in the 3D display to enable the remote participant to observe the audience. From this video feed, face tracking is also performed so that vertical parallax can be simulated. In other words, the visual information is adjusted to the position of the audience's face, so that if you were to crouch down, you would see the lower half of the 3D face, if it wasn't set to follow your gaze.

Cool stuff! In the future, it would be great to scale up this system to display a full human body. Anyone remember the scene in Minority Report where Tom Cruise browses his old videos by watching them in 3D? :)

This work was presented at SIGGRAPH 2009.

More details (and videos!) can be found on the project webpage.