Notes on the human visual system and perceptual image analysis and segmentation
Notes on the human visual system and perceptual image analysis and segmentation This is intended to become a loose collection of links and notes to gather my thoughts, ideas and resources on the topic in one place. Usually I have an unorganized mess of scribbles on various post-its, hyperlinks sitting in emails, browser favourites and text files. Also, this collection is intended to provoke feedback and get your neurons going, maybe we could spark some new ideas. Human perception, especially vision, is still an active research area and many things are yet unknown. On the other hand it is of tremendous interest to decipher nature's tricks. With today's computing power and camera systems lots of highly useful and practical applications could be built - if only we had the right algorithms to deal with the data. One of the most important and relatively low level vision tasks is segmenting an image into it's semantic parts and components. Human's do that tremendously efficient and fast, without even concious thought. Computers on the other hand are still relatively inept at this task and only perform well for certain small subsets of the general problem. So the first thing should be trying to understand how humans do image segmentation. Theories about the human visual system (some proven and true, others not) and ideas of how to emulate them algorithmically: - different light wavelengths provoke reactions of varying intensity on the retina. For example we can distinguish different shades of green far more easily and in greater fidelity than any other color. Evolutionary this makes perfect sense, considering that the great outdoors living habitat presents itself primarily in greens. Conclusion: model the colorspace accordingly, i.e. vary the dynamic range (bitdepth) used for representing signal strength in certain wavelengths. In the simplest case this means having more bits available for representing green than for any other color. - Humans perceive certain colors to be similar to each other while others appear dissimilar. Conclusion: model the colorspace accordingly, i.e. find a color space such that some mathematical metric represents human perception in color differences. One such colorspace is the CieLAB colorspace which has been devised with several hundred test candidates such that the euclidian distance metric corresponds to perceived color similarity. It still has quite a lot shortcomings though (see appendix). Problems: As opposed to this model human color perception is not absolute, but spatially and contextually varying. Our perception of color depends on the colors directly surrounding it and the number and type of colors visible simultaniously. It is easier to see color variation in a local color gradient than to count the number of different shades of a color randomly dotted all over an image. Also, we can only distinguish so many colors at once (a relatively small number). - The human eye is not modelled on a regular grid. Rather the perceptors are irregularly spaced on a rounded surface (the inner surface of the eye). On the other hand all artificial sensors adhere to a strict, regular and flat lattice. Question: does the irregularity of signal sampling in the human eye have any advantages over a fixed grid? Is it more robust dealing with aliasing problems? Is it more robust against over saturation with a certain spatial frequency? What purpose does it serfe? - Physical light intensity does not correspond 1 to 1 to perceived light intensity. Research indicates that perceived intensity approximately follows a logarithmic function of physical, measured, intensity. Computers already try to capture this effect for example with the nonlinear gamma curve for display devices. When modelling a color space or normalizing one this should be taken into account. - Humans seem to be more sensitive to changes in brightness than in color. The human eye has far more intensity than color receptors. This indicates that a color space that models brightness or intensity seperately from color would be desirable. Color could be represented with fewer bits than brightness. When building gaussian or laplacian image pyramids brightness could be sampled with a smaller (higher frequency) filter kernel while color could be smoothed much more with a larger filter producing a "perceptual image pyramid". BTW, why are all the usual color spaces 3 dimensional? There is nothing inherently 3 dimensional about color. Would it make sense to find another representation? (Update: there actually seems to be a theory saying color is at least 3 dimensional) - Humans only focus on a very small part of the whole visible field of view at a time. The spatial and color resolution of the eye falls off towards the edges. Color can only be perceived in the center of the field of view. Both of these facts would imply a non-linear, non-kartesian, coordinate system to represent images. The log-polar coordinate system might be one such choice, giving high spatial resolution at it's center and falling off towards the edges. Thus one could model a sharp and crisp "center of attention" while suppressing noise and uninteresting detail in the surrounding area. This would only deal with the spatial issue though. A similar approach should be used for color, so that an image fades to gray towards the eges and color detail is reduced. - The human eye and associated nervous system is organized higly hierarchical. Processing and understanding of the received data happen at each level of the hierarchy. Some simple metrics and information seem to be gathered very early in the process such as "it's dark/it's bright". Vertical and horizontal breakage lines are detected pretty early (probably because they are needed for navigating our three dimensional surrounding) as are foreground and background objects. Foreground segmentation is a particularly interesting subject since a lot of information seems to be used here. Examples would be: is the object in motion; a tendancy to place objects "on the ground", i.e. the lower part of an image where gravity would have them; a light vs. dark object segmentation; stereoscoping 3D segmentation, i.e. how close is the object. The last part of the system is the actual pattern and heuristics matching happening in the visual cortex. Pattern matching meaning matching the brain's database of actual known objects to the newly encountered ones. Heuristics matching means matching known physical truths to a probably unknown object (This last point is especially interesting since it implies that things that "cannot be true" won't easily be perceived. Supposedly this is anectotedly true for the first sightings of Columbus' ships by the natives - they didn't know such constructs and thus had problems "seeing" them - although they were huge). The hierarchical layout implies that an artificial perception system should probably be modelled the same way. Basic receptors, followed by ever more complex and complicated feature detectors followed by logic combining those features into semantic objects. Maybe some form of neural net with different types of neurons and different layers serving the individual detection tasks could be constructed for this. - Humans have two eyes. Stereo vision is useful for 3D navigation, but is also used for detecting redundancies and suppressing noise in 2D images. Also, the focus area is in the field of view of both eyes giving extra detail and resolution, while the edges of the image are only ever in the field of view of one eye. This again hints at a non-cartesian image representation with detail fall-off towards the edges. Maybe sensors could be built or refitted to produce stereo vision images as well. - Similar to the center of attention mentioned earlier humans seem to have something like locally dominant colors. We can only sense so many colors at once (a relatively small number) so we see certain areas as one homogenous color even if they are not. Similarily our ability to discertain colors directly depends on how many different colors are visible at once. It is easier to spot a smooth color gradient than to discertain shades of the same color in a chaotic patch of different colors. This could probably be modelled by a clustering algorithm (k-means, meanshift, self organizing maps, ...) in color space, finding the locally dominant colors. - While computer image analysis usually only works on one image at a time humans naturally blend several impressions of the same visual stimulus over time. I don't know whether the human eye has a concept of "frame rate/frames per second" or whether that would be constant for all receptors or different (it would be possible for example, that the gray receptors have faster circuit times than the color receptors, thus giving light intensity higher time resolution). All this is important in analysing motion of course, but even for static images it makes sense to view it over time. A human seeing an image for some time will gradually notice more and more detail. Due to the imperfect nature of our sensors (eyes) still frames will have a certain amount of noise and inaccuracies in them. Interpolating over time will eliminate most of these. Also, it allows one to focus on different or more details once the basics have been analyzed. Putting all these factors together it seems humans have a hiararchic system over time, reducing noise and redundancies and analyzing in increasing amount of detail. - Texture. Texture is an exceedingly complicated and overloaded term. There is not even a standard way to define what texture means. Yet at the same time humans have no trouble whatsoever recognizing objects of same texture. Why is that? Texture describes some spatial relationship between colors and intensities, a pattern. It is not necessarily exactly repeating (the texture of a plant leaf is always the same, yet at the same time it is unique for each leaf and from one square centimetre to the next). It is not necessarily of uniform scale. It is not easy to describe in simple geometric terms (clouds). It does not necessarily appear uniform in color or intensity. Nevertheless, there must be at least some characteristics appearing constant and non-varying to the human observer, so he can group parts of an image by their texture. Texture is a hierarchical term again. Viewed from a distance at coarse detail a patch of lawn will appear textured uniformely. Viewed in close detail the individual blades of gras will appear totally chaotic. There are multitudes of approaches to capturing texture in mathematical terms, most of them deriving some form of statistical features from gray value intensities. There are fourier descriptors, wavelet transforms, coocurance matrices, color clusterings, bayesian predictors, neural networks, genetic algorithms, self organizing maps, active contours/snakes, ... Each one of these is a valid approach and fit at capturing one aspect of texture. However none of them is adequat for the general problem. To sum up: obviously the human visual system is a highly sophisticated, highly dynamic and hierarchically organized system. It seems difficult if not impossible to model all of it's capabilities with only one algorithm. It seems more logical to start at the bottom of the hierarchy, with the simple or even trivial transformations and interpretations and work the way up. Ideally keeping the interfaces open such that at each step each algorithm is pluggable and replacable by a different, possibly better, one. Another very important preprocessing step is determing what will be the basic inputs to the whole system. These inputs should be modelled as close to the outputs of the receptors in the human eye as possible. I'm a big fan of emergent behaviour and believe a lot of nature's power comes from systems organized in such a way. Very basic and simple basic blocks in large quantities work together to archieve more than the sum of their parts. Examples for systems like these are all of a human's senses, each of which is built from tiny receptors, the human nervous system and brain and, on a larger scale, ant colonies. These systems exhibit a lot of very desirable properties: they scale well, they are robust to noise, they are robust to defects in individual agents, they do not have a single point of failure or a single mastermind, they are easily parallelizable and they are easy to construct (since each individual agent is trivial). Artificial examples for algorithms with these qualities would include neural nets, self organizing maps, autonomous agent systems (ants) and genetic algorithms. The idea is to build a hierarchy of lots of very simple building blocks and hope that something useful will evolve from that. Ok, now I'll start working on my genetically evolving, self organizing, hierarchical neural network based on fuzzy logic maths - it will call you and tell you when it's done and ready to take over the world ;-) Information about the eye and vision: http://webvision.med.utah.edu/ Irregular grid sampling: http://www.ee.surrey.ac.uk/Basic/ Perceptual image pyramids: http://www.cs.ucla.edu/~siavosh/iccv05.pdf Log-Polar image representation: http://users.isr.ist.utl.pt/~alex/Projects/TemplateTracking/logpolar.htm Locally dominant color adaptive segmentation: http://www.ece.northwestern.edu/~pappas/papers/jqchen_tip05.pdf Experimental Determination of Visual Color and Texture Statistics for Image Segmentation: http://commnet.ece.northwestern.edu/~jqchen/docs/spie05.pdf Human visual system foreground background detection: http://www.apa.org/journals/releases/xge1312194.pdf Stereo vision correspondence comparison: http://cat.middlebury.edu/stereo Comparison and benchmarking datasets for different segmentation strategies: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench Autonomous agents and pheromone maps for digital image segmentation (this one is just soo cool ;-) ): http://alfa.ist.utl.pt/~cvrm/staff/vramos/Vramos-WCLC05b.pdf Problems with the CIE Lab color space: http://www.aim-dtp.net/aim/evaluation/cie_lab/index.htm plus a gazillion more in my bookmarks and littered all over the place. I hope I'll come around to organizing them.