The "Vuphonics" system is an experimental sensory-substitution system for use by blind and deafblind people, and this website describes "work-in-progress". Please note that the Vuphonics system is at the "detailed spec" stage, and that no finished software is currently available.
The Vuphonics system highlights the features of visual images that are normally perceived categorically, by substituting with coded sound effects and their tactile equivalents. It simulates the "instant recognition" of properties and objects that occurs in visual perception, by using the near-instantaneous recognition of phoneme sounds that occurs in speech. By listening to coded phonetic sounds (and feeling corresponding tactile/braille effects), the user can instantly understand the colours, textures, distances and "entities" that are present in an image. The system also conveys shape, location, "fine texture" and "change".
For beginners, the system can "speak" actual words, which directly describe the properties and entities being conveyed. The words can be "moved" in "sound space" to convey the shape of an item, for example a red circle :-
"Direct-description" sounds of the red circle (MP3-compressed, 28 KB).
Volume fluctuations can be added "on top" of the basic sounds, to convey the "fine texture" of an area or entity :-
"Direct-description" sounds of the red circle, with "fine texture" effects (MP3, 28 KB).
These volume-variations combine the effect of small variations in brightness, colour, and distance, to produce a "texture" effect.
Instead of speaking actual words, the system can also output "coded phonetics" : for example, if the consonant sound "NN" represents the colour purple, and the vowel sound "EE" represents white, then a purple and white zigzag can be represented by the sound "NNEE", repeated if necessary, and moved in "sound-space" :-
Coded phonetic sounds of the purple and white zigzag (MP3, 28 KB).
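This consonant-for-one-property, vowel-for-another convention can be sketched as a simple lookup. Only the "NN" = purple and "EE" = white pairs come from the text; the table and function names here are illustrative assumptions, not part of the Vuphonics specification.

```python
# Minimal sketch of the coded-phonetics convention (assumed lookup);
# only the "NN" = purple and "EE" = white pairs appear in the text.
CONSONANT_FOR = {"purple": "NN"}   # consonant sounds carry one property
VOWEL_FOR = {"white": "EE"}        # vowel sounds carry a second property

def coded_phonetics(c_property, v_property, repeats=1):
    """Build a repeated consonant-vowel string such as 'NNEE NNEE'."""
    syllable = CONSONANT_FOR[c_property] + VOWEL_FOR[v_property]
    return " ".join([syllable] * repeats)

print(coded_phonetics("purple", "white", repeats=2))  # NNEE NNEE
```

Repeating the syllable while changing its pitch and stereo position is what turns the code into a moving "tracer".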
Several of these moving audio "tracers" can be combined to convey 2-dimensional "composite audio graphics" :-
Sounds of the purple and white parallelogram (alternating direction) (MP3, 60 KB).
The coded phonetics used for the examples are taken from the table below, which gives the consonant (column 2) and vowel (column 3) sounds (shown in 2-letter phonetic format) used to convey properties (column 4 shows example words in which the parts in capitals sound like the corresponding consonant and vowel sounds).
The default property type is called "DInCoTex", as it combines the properties of Distance, INteger, COlour and TEXture onto two scales : the system selects one of the properties shown in column 5 (DInCoTex1) and one of the properties shown in column 6 (DInCoTex2), and presents them via their corresponding consonant and vowel sounds (it selects the two properties that best describe the area or entity being presented). Other property types are also available, for example Layout (shown in column 7 and described below).
(C = consonant sound, V = vowel sound. Numerical categories are for objects.)

| No. | C  | V  | Example word | DInCoTex1                          | DInCoTex2      | Layout |
|-----|----|----|--------------|------------------------------------|----------------|--------|
| 9   | YY | AX | YOUng        | Text-like, characters and symbols  | Polychrome     | LDLD   |
| 10  | LL | ER | LEARn        | Radial or loops / Few: 0-4         | Close: <70cm   | LDDL   |
| 11  | MM | YR | MERE         | Wavy or wiry / Some: 5-16          | Near: 0.7-1.5m | DLLD   |
| 12  | ZH | AR | JARdin       | Mainly horizontal / Several: 17-64 | Medium: 1.5-7m | DDDL   |
| 13  | VV | OR | VORtex       | Mainly vertical / Many: 65-256     | Far: 7-50m     | DDLD   |
| 14  | ZZ | OO | ZOO          | Tessellated / Lots: >256           | Remote: >50m   | DDLL   |
By using sequencing rules, the system can convey several property types at once. For example, the coded phonetics can (optionally) also describe the "layout" of properties within an area or entity, as well as conveying the DInCoTex properties : the consonant sound "DH" can represent the sequence of light levels "dark-light-dark-light", and the vowel sound "OO" can represent "dark-dark-light-light" (see column 7 of the table above). The DInCoTex colours Orange and Yellow are represented by the coded phonetics "HHAY", and these sounds and the Layout sounds can be output in DInCoTex-Layout order ("HHAY-DHOO"), the sounds being formed into the shape of the entity concerned, for example a triangle :-
Coded colour and layout sounds of the orange and yellow triangle (MP3, 28 KB).
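The layout convention can be sketched as a lookup from a layout sound to its dark/light sequence. The two table entries below are exactly the examples given in the text ("DH" and "OO"); the function name is hypothetical.

```python
# Layout codes use 'D' = dark and 'L' = light, as in column 7 of the
# table above; only the two sounds mentioned in the text are listed.
LAYOUT_SOUNDS = {"DH": "DLDL", "OO": "DDLL"}

def layout_sequence(sound):
    """Expand a layout sound into its sequence of light levels."""
    names = {"D": "dark", "L": "light"}
    return [names[ch] for ch in LAYOUT_SOUNDS[sound]]

print(layout_sequence("DH"))  # ['dark', 'light', 'dark', 'light']
```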
Recognised entities can be conveyed by replacing the C&V (consonant & vowel) pairs that represent "layouts" with more complex consonant groups, which represent the objects. The example opposite shows how a small area of an image (known as a "viewport"), containing layouts and an object, can be conveyed via coded phonetics and (in the tactile modality) via braille.
The audio effects have tactile equivalents which can be produced by using : a "moving powered pointer" to convey the location within an image; a tactile pad (or moving powered pointer) to convey shape; and braille or other touch-based methods to convey the categorical properties.
As there are 16 basic categorical speech consonants and 16 vowels, a "C&V pair" conveys one of 256 possible combinations (16 x 16), so the combination of two DInCoTex properties can be displayed on a single 8-dot (i.e. "8-bit") braille cell, with each half of the cell (i.e. 4 dots) conveying one property. Layouts and objects can also be presented in braille format, as illustrated in the example above.
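The 16 x 16 arithmetic maps neatly onto one byte, with each nibble driving one half of the braille cell. The sketch below assumes the upper four dots carry the consonant index and the lower four the vowel index; the page does not specify the actual dot assignment.

```python
def braille_cell(c_index, v_index):
    """Pack a consonant index and a vowel index (0-15 each) into one
    8-dot braille cell: upper four dots carry one property, lower four
    the other. The exact dot assignment is an assumption."""
    if not (0 <= c_index < 16 and 0 <= v_index < 16):
        raise ValueError("indices must be in 0-15")
    return (c_index << 4) | v_index    # one byte, 16 x 16 = 256 values

def unpack_cell(cell):
    """Recover the two 4-dot halves from a cell byte."""
    return cell >> 4, cell & 0x0F

assert unpack_cell(braille_cell(3, 10)) == (3, 10)
```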
Using both modalities allows the user to spread the information load to suit their needs and abilities, and makes the system usable by deafblind people.
The Vuphonics system aims to simulate the way that sighted people perceive visual features, rather than conveying raw optical measurements : "scanning" methods can convey straightforward images effectively, but "information overload" may occur if too much detail is conveyed in a short period of time. The Vuphonics approach uses speech-like sounds, consisting of specific coded phonetics that can be rapidly interpreted in a categorical and linguistic way. By continuously changing the pitch and binaural positioning of the sounds, they can be made to "move", whether following a systematic path or conveying a specific shape.
Visual perception is complicated, and a degree of complication is inevitable if several aspects of vision are to be substituted via audiotactile means. Although the system may initially appear to consist of several unconnected features, the features can generally operate together, with the user controlling the effect of each feature, as well as the resolution, speed of presentation etc.
The rest of this page describes the Vuphonics system in more detail.
"Audiotactile tracers" are apparently-moving audio and tactile effects that can be in the form of "shape-tracers", which trace out the significant shapes of features and identified objects within an image, by continuously changing the pitch and binaural positioning of the sounds (the "sound space" uses a high = high-pitch / low = low-pitch convention, with a frequency range of 200 to 400 Hz and a "musical" scale).
Alternatively the tracers can systematically move round an area while outputting the properties of the parts that they are conveying at any moment (these are known as "area-tracers").
(N.B. "Area-tracer" scanning patterns similar to some of these have been used in systems developed by others - see links below.)
In the tactile modality, tracer location and movement can be conveyed via a "moving powered pointer" (see below). Moving effects are generally easier to "mentally position" than stationary ones.
People can easily recognise speech-like sounds and rapidly assign meanings to them. Speech is a natural and efficient method of conveying information: it is perceived in a classified/coded way, and its information content is not greatly affected by distortion. Most people are able to retain several spoken words in their short-term memory, including "nonsense" words. The use of natural-language words to describe shapes in an image has been investigated before, but the Vuphonics system uses new "words" assembled from the component sounds of English, which convey information in a coded format. These "coded phonetics" allow a lot of information to be conveyed in a short period of time, and can convey additional information by being modified in pitch and binaural positioning, so that they become moving "tracers". The effort needed to learn the "coded phonetics" is low.
Visual properties are presented to the user via combinations of 16 consonant (C) and 16 vowel (V) sounds, which are assembled to produce "CV CV ..." strings, that convey the properties via a convention. The user can recognise the sounds instantaneously, in the same way as people recognise language. Certain visual properties, such as colour, tend to be perceived in a categorical way, but properties which are not "naturally" categorical, for example distance, can be assigned to bands of values.
See the table of "DInCoTex" property assignments above. It shows that, for example, if an area or entity is mainly green and blue, the sounds "FFAH" would be conveyed, while if the system wants to convey the texture "Wavy or wiry" and the distance band "Close" then it outputs the sounds "MMER". Certain special consonants, not shown in the table above, are used to temporarily override the default property type, and to convey more detailed or additional information, for example special colours, more precise distances and numbers, or the presence of recognised entities.
The fine detail of an area or entity is conveyed by small, rapid fluctuations in the volume of the speech-sounds. These are referred to as "Combotex" effects, as they combine the effects of small changes in brightness, colour, and distance, to give a single volume-conveyed "texture" effect. This simulates the effect found in vision whereby the overall properties of an area are perceived categorically, and the minor variations in properties across it are perceived as a general texture. The user need not follow the precise detail conveyed by the Combotex effects, but gets an impression of the general level of fine change occurring in an area.
"Direct-description" sounds of the red circle, with "Combotex" effects added (MP3, 28 KB).
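One way to sketch the Combotex idea: merge the three local variations into a single value, then use that value to set the depth of a rapid volume flutter. The combining rule (root-sum-square), the flutter rate and the depth constant are all illustrative assumptions.

```python
import math

def combined_variation(brightness_var, colour_var, distance_var):
    """Merge three small local variations (each normalised to 0-1)
    into one value; the root-sum-square rule is an assumption."""
    v = math.sqrt(brightness_var**2 + colour_var**2 + distance_var**2)
    return min(v / math.sqrt(3), 1.0)

def combotex_volume(t, variation, base=1.0, rate_hz=20.0, max_depth=0.3):
    """Volume at time t (seconds): a rapid flutter whose depth grows
    with the combined variation. Rate and depth values are illustrative."""
    depth = max_depth * min(max(variation, 0.0), 1.0)
    return base * (1.0 + depth * math.sin(2 * math.pi * rate_hz * t))
```

A perfectly uniform area gives zero variation and hence a steady volume; a busy area produces a deeper flutter.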
Sections of the full image can be selected via a pointer, so that only those parts are conveyed, and at a higher resolution. These sections are known as "viewports", and the user can instruct a viewport to "zoom in" to any level of detail, as well as "zoom out" to convey a low-resolution representation of the whole image.
Viewports can be rectangular, hexagonal or "rounded" (circular or elliptical) and several viewports can be active at any moment. Viewports can be nested so that a "child" viewport moves within a "parent" viewport. One possible configuration would be to define nested viewports to simulate an eye's macula, fovea and/or areas of focal attention, that can move within a simulated visual field :-
However viewports would usually be rectangular, as these are easier to work with, and more straightforward to implement.
There are several possible ways of positioning and moving viewports :-
The coded phonetic sound tracers (with Combotex effects added) "travel" in binaural stereophonic "sound space", systematically covering a viewport, to sequentially represent the properties of adjacent parts of the viewport.
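A systematic coverage pattern can be sketched as a cell ordering over the viewport. The text does not fix one particular scanning pattern, so the boustrophedon raster below (reversing alternate rows) is just one plausible choice.

```python
def area_tracer_path(cols, rows):
    """Cell order for an 'area-tracer' that systematically covers a
    viewport. A raster that reverses alternate rows is assumed; the
    text does not specify the actual scanning pattern."""
    path = []
    for r in range(rows):
        row = [(c, r) for c in range(cols)]
        if r % 2 == 1:
            row.reverse()       # sweep back on odd rows
        path.extend(row)
    return path

print(area_tracer_path(3, 2))
# [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```

At each cell along the path, the tracer would output that cell's coded properties, positioned in sound space accordingly.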
The methods by which coded tracers convey the properties in a viewport are described as being :-
"Layout" sounds allow the layout of the image to be "calculated", while "Averaged" sounds allow a more intuitive interpretation. Averaged properties and recognised entities can also be conveyed in the form of actual descriptive words ("Direct-description" effects).
Additionally, "Audiotactile Entities" can convey identified objects, unidentified objects, areas with common characteristics, or other features that are to be highlighted within a viewport. Combinations of "shape-tracers" can convey "composite graphics" :-
Coded sounds of the half-textured purple and white parallelogram (alternating direction) (MP3, 60 KB).
The "System Pulse" is a user-controlled period of time (typically between one and four seconds) that specifies the time allowed for conveying the contents of the viewports (the "scan time"). It can be thought of as analogous to musical "bar" timings. It acts as a "conductor" to maintain the timing of different viewports and keep them synchronised.
The System Pulse must be easy for the user to quickly set and change, so that they can slow down the output when the conveyed image content becomes complex, and speed it up again later. This allows the user to feel in control, and they can set the general output speed to a rate that suits them.
The System Pulse affects the "frame rate" of moving images, the resolution of the conveyed information, and the "stepping" rate of automatically-moved viewports.
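The link between the System Pulse and resolution can be made concrete: one pulse must cover the whole viewport, so the time available per cell falls as the cell count rises. The equal-share assumption below is illustrative.

```python
def dwell_time_ms(system_pulse_s, cols, rows):
    """Milliseconds available per viewport cell when one System Pulse
    (the 'scan time') must cover the whole viewport; assumes the cells
    share the pulse equally."""
    cells = cols * rows
    if cells <= 0:
        raise ValueError("viewport must contain at least one cell")
    return 1000.0 * system_pulse_s / cells

# e.g. a 2-second pulse over an 8 x 4 viewport:
print(dwell_time_ms(2.0, 8, 4))  # 62.5
```

Lengthening the pulse, or dropping the resolution, both buy more time per cell, which is why the user is given direct control over it.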
"Change" output is user-controlled and optional. The user can set the sound level to be relatively quiet when there is little change occurring. The volume rises when the amount of change in an area increases, so drawing the user's attention to it. The volume then gently declines. Items or effects moving around an otherwise stationary viewport will cause the volume to increase in the effected areas.
A viewport can be defined as being change-controlled : the area of maximum change can be indicated by the position of a "powered pointer"; and when sudden change is detected in one part of the image, the system can move the viewport and centre it on the area of change, with the zoom level set to encompass the change.
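The rise-then-gentle-decline volume behaviour can be sketched as a simple per-step update. The quiet level, decay factor and gain are illustrative assumptions; the page only describes the qualitative behaviour.

```python
def change_volume(prev, change, quiet=0.2, decay=0.95, gain=1.0):
    """One update of the change-controlled volume: rise at once with
    the amount of change (0-1), then decline gently toward the quiet
    level. All constants are illustrative assumptions."""
    target = min(quiet + gain * change, 1.0)
    if target > prev:
        return target                    # change draws attention
    return max(prev * decay, quiet)      # gentle decline afterwards
```

For example, a burst of change pushes the volume up immediately; once the scene is still again, each step multiplies the volume by the decay factor until it settles back at the quiet level.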
As well as conveying general visual features, the system attempts to simulate the way in which features and "objects" are perceived in vision. Conveying basic properties does not do much to identify "entities", separate "figures" from the background, or assist with the other processes that occur naturally when people see things.
The simplest features are conveyed via shape-tracers, but "composite graphics" can also be used. These consist of several shape-tracers which together (either simultaneously or in sequence) convey a single entity (whether recognised or unrecognised).
Coded sounds of the purple and white parallelogram (alternating direction) (MP3, 60 KB).
There are three main types of Audiotactile Entities :-
The Audiotactile Entity types can be joined together as Audiotactile Structures, which link up related objects and features.
Audiotactile Objects are items in an image that have been identified to the extent that they can be described as specific entities rather than being described in terms of their properties, shapes and features. They are signified by the presence of a complex consonant in audio mode, and by a special "Object dot" in braille mode. Standard audiotactile objects could include common everyday items, standard shapes that are otherwise difficult to convey, items commonly conveyed by signs and symbols etc. At present audiotactile objects will mainly be used within pre-processed images, but in the future there is some scope for automatic recognition of certain objects.
If the shape of an object is available, then an audiotactile shape-tracer can present the coded object description. As an option it may be better to convey the distinctive "classic" shapes of objects, rather than the outline that happens to be formed by the object at its current distance and orientation, allowing "shape constancy" to be simulated. "Shape constancy" and "size constancy" are the perceptual effects whereby the shapes and sizes of objects are often perceived in a constant way once they are recognised, despite objects changing in distance and orientation within a scene : the shape and size of the tracers can be left constant (but their positions changed) when objects move about.
Identified entities in a scene are often related to one another, either in a hierarchical ("parent"-"child") manner (e.g. wheels fitted to a cart) or via a more general linkage (e.g. horse and cart). With prepared material, complex structures may need to be conveyed, and the components of complex entities can be linked together, sometimes with several "child" sub-components linked to a single "parent" component.
When an entity is present in a viewport, the user can select the special "Structure" mode, whereupon the system ceases conveying the viewport, and only conveys the entity, along with some or all of the entities to which it is linked. (If several structures were present in the viewport, then the system "locks on" to the entity that was being conveyed when Structure mode was selected.)
Most of the audio features have tactile equivalents :-
A "force-feedback" joystick makes an effective pointing device with which to indicate areas of the image, as it can also be programmed to tend to position the viewport in one of a number of set adjacent positions, so that a "notchy" effect is experienced as the viewport is moved. A force-feedback joystick can also be moved by the system, pushing and pulling the user's hand and arm both to convey shapes (by tracing them out), and to indicate a position in space. A force-feedback joystick can convey tactile effects equivalent to the Combotex volume "flutter", and can move a viewport to an area of "change", so drawing the user's attention to it. "Conducted" sequences can be developed, where the user is "lead" round a prepared image or movie sequence.
A possible design for a device which combines these tactile facilities is the multi-purpose Tactile Output and Input Device ("TactOID") illustrated below :-
The device is shown in desktop form, but could be attached to a body-worn framework for mobile use. The TactOID has a multi-functional hand-set which contains control buttons, a braille display, and a tactile palm-pad.
The system could convey a prepared programme of material. Pre-processed images allow the best methods for conveying an image to be pre-selected. A sighted designer, with the help of appropriate software, could define features and areas of an image (perhaps by selecting and modifying areas indicated by edge-detection software, which is readily available), and specify the most appropriate methods of conveying them. The designer could assemble "conducted sequences" which specify the order in which the image features are presented.
The entity and conducted sequence information could be embedded in the image pixels, using "steganography", so that the images can also be viewed normally using standard equipment and software. Images and movie sequences prepared in this way could be transmitted through currently available media, for example via compact discs, the Internet or broadcasts, enabling pre-processed sequences to be combined with otherwise standard video material. (For broadcast television, entity and conducted sequence data could be included in the lines often used for "teletext" data.)
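Least-significant-bit embedding is one common steganographic technique that fits this description; the page does not specify the actual scheme, so the sketch below is purely illustrative.

```python
def embed_bits(pixels, bits):
    """Hide a bit string in the least-significant bits of pixel values.
    LSB embedding is one common steganographic technique; the page does
    not specify the actual scheme, so this is an illustrative sketch."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels for the payload")
    out = list(pixels)
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | int(b)  # alter only the lowest bit
    return out

def extract_bits(pixels, n):
    """Read back the first n hidden bits."""
    return "".join(str(p & 1) for p in pixels[:n])

stego = embed_bits([120, 121, 122, 123], "1010")
print(extract_bits(stego, 4))  # 1010
```

Because only the lowest bit of each value changes, the image remains visually identical on standard equipment, which is the property the paragraph relies on.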
If implemented, the Vuphonics system would allow a continuum of features, from basic visual properties, to fully-recognised objects, to be conveyed to blind (and deafblind) users. The system could be implemented as a dedicated portable electronic device, or in the form of hardware and software installed on a personal computer.
There are other people researching the use of sound and touch to convey images. Peter Meijer has already developed "The vOICe", which conveys images through sound. A different approach is used by "KASPA" (developed by Prof. Leslie Kay), which uses ultrasonics to convey the location and texture of objects. Prof. Phil Picton is developing a real-time "Optophone". Dan Jacobson's "Haptic Soundscapes" site describes a project to develop a tool to allow access to spatial information without vision.
I'll be making some minor changes to the conventions described on this page, but the general approach remains the same. Please email me if you want to know more.
I'm designing software for converting images into sound and tactile effects similar to those described on this page, with the resolution and presentation options controlled by the user. I'll make it available for downloading when it's in a usable state.
The Vuphonics System website is maintained by David Dewhurst. Any enquiries or feedback should be sent to firstname.lastname@example.org.