Spoken like a 1980s chip
In the beginning
In the beginning there was light. Of course there was, but a bang must have followed shortly after. It is not unexpected that communication between organisms surrounded by a fluid, whether air or water, is primarily acoustic rather than visual. While vision remains the king of senses and chemical signals are ubiquitous, interaction using an oscillating pressure producer and a corresponding sensor remains the best compromise for creating a versatile, rich, abstracted, linear data transfer that requires neither line of sight nor any complex infrastructure apart from a medium that envelops the organism. Thus, speech. I recall that in the early days of my youth my ZX Spectrum had a Currah Microspeech module which, with only a 2K ROM, was able to produce an infinite number of raspy but mostly recognisable words. Fast-forward to today, and we have far more powerful utilities producing sounds almost indistinguishable from another human, and Perl allows access to these speech engines. These are powerful, non-trivial utilities, with superior abilities to change intonation, speed and other forms of expressive audio manipulation. For those interested, please explore PerlSpeak, Festival and eSpeak. Perl is certainly not short of vocalising options.
Pure Perl Implementation
As an old simpleton, however, I cannot drag myself away from the genius of the people who managed so much with such limited resources, and into the modern world where memory and processor power are no real limitation. The SP0256-AL2 is the centre of this remarkable, primitive utility, and I took it upon myself to explore how it worked and to transform it into a simple module that can be imported into any Perl program, with no dependencies apart from a means to transfer data to a speaker. There have indeed been many attempts to emulate this little chip. Though I have not come across any that specifically use Perl, I did find resources, including this one from Greg Kennedy, that allow translation to the allophones used here.
Re-inventing the wheel?
Why do again things that far cleverer people have already done, with far more advanced methods? For several reasons. It helps me learn how things are done. Old methods are not going to be as good as new ones, but they use far fewer resources, which would allow quite a lightweight model for speech synthesis that could be embedded in simple applications and games. It would not require installing any heavyweight back-ends that pollute the system with more and more infrequently used libraries. So the "plan" is: 1) produce a stand-alone speech synthesis module; 2) progressively make the module more accurate and more recognisable; 3) eventually have something anyone can use for fun or for real-life needs.
Now, I did make a piano-like monophonic player (enable the sound in the embedded video sample to hear it), based on a memory of a similar utility I saw many years ago. Most web examples use /dev/dsp to transfer audio to the output device, but this virtual interface no longer exists on modern systems. A pipe to padsp does allow emulation of /dev/dsp, and this is what I have used on Linux; Win32::Sound offers a Windows equivalent of a sink for the raw audio data, which I have yet to figure out. Thus we have the beginnings of a module.
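For readers who want to try the padsp route, here is a minimal sketch of the idea; it is not the module itself. The sub names tone_bytes and play_raw are made up for illustration. The sketch assumes that padsp and tee are installed and that the emulated /dev/dsp keeps the usual OSS defaults of 8000 Hz, 8-bit unsigned, mono.

```perl
use strict;
use warnings;

# Build a sine tone as raw 8-bit unsigned mono PCM.
# Assumption: the emulated /dev/dsp uses the OSS defaults (8000 Hz, 8-bit, mono).
sub tone_bytes {
    my ($freq, $secs, $rate) = @_;
    $rate //= 8000;
    my $pi    = 4 * atan2(1, 1);
    my $count = int($secs * $rate);
    # Centre the wave on 127.5 so every sample stays within 0..254.
    return pack 'C*',
        map { int(127.5 + 127 * sin(2 * $pi * $freq * $_ / $rate)) } 0 .. $count - 1;
}

# Pipe raw bytes through padsp, which emulates /dev/dsp for the wrapped command.
# Assumption: both padsp and tee are available on the PATH.
sub play_raw {
    my ($bytes) = @_;
    open(my $dsp, '|-', 'padsp tee /dev/dsp > /dev/null')
        or die "Cannot start padsp: $!";
    binmode $dsp;
    print {$dsp} $bytes;
    close $dsp;
    return;
}

# Only make a noise when explicitly asked to, e.g. "perl tone.pl --play".
play_raw(tone_bytes(440, 0.5)) if @ARGV && $ARGV[0] eq '--play';
```

Keeping the sample generation separate from the playback means the same bytes could later be handed to a different sink, such as a Windows one, without touching the synthesis code.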
Emulation
The classic method for emulating this chip appears (as far as I can tell) to be to record the sounds output by the chip, sampling at much higher frequencies than the chip outputs (effectively adding a high pass filter). This results in errors and distortions. Such converted data is available as WAV files, but these lack the analog filtering performed on the original chip's output, so the result is actually much lower in quality than the original. Still, borrowing this data from various sources enables some semblance of a speech synthesiser, and my experimental modulino represents an early attempt at using unprocessed waveforms. But these waveforms are not only inaccurate and noisy compared to the sounds that came from my 40-year-old ZX Spectrum, they also take significantly more memory. The unpacked data in a script took 1/2 MB and, packed into a Storable file, still takes just under 100 kB. I can't see my Speccy being too impressed.
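To illustrate the packing step, here is a rough sketch of how a table of sampled allophones might be squeezed into a Storable file. The sub names and the table layout (allophone name mapped to a list of 8-bit sample values) are assumptions made for this example, not the layout used in the modulino.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);   # only needed for the usage example below

# Pack each list of 8-bit samples into a byte string: one byte per sample
# instead of one Perl scalar per sample.
sub pack_allophones {
    my ($table) = @_;
    my %packed;
    $packed{$_} = pack 'C*', @{ $table->{$_} } for keys %$table;
    return \%packed;
}

# Write the packed table to disk in Storable's portable (network order) format.
sub save_allophones {
    my ($table, $file) = @_;
    nstore(pack_allophones($table), $file);
    return;
}

# Read the table back and unpack each byte string into a list of samples.
sub load_allophones {
    my ($file) = @_;
    my $packed = retrieve($file);
    my %table;
    $table{$_} = [ unpack 'C*', $packed->{$_} ] for keys %$packed;
    return \%table;
}

# Usage, with made-up sample data for a single pause allophone:
#   my (undef, $file) = tempfile(UNLINK => 1);
#   save_allophones({ PA1 => [ (128) x 80 ] }, $file);
#   my $table = load_allophones($file);
```

Packing helps because a Perl list spends a full scalar on every sample, whereas a packed string spends a single byte; the sampled data itself is still far larger than anything the original chip carried.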
Missing the point
BUT this simple way belies the cleverness of the original chip's mode of operation. To understand how this is managed in 2 KB of ROM, one has to explore the disassembly of this ingenious device. Unlike modern speech synthesisers with banks of pre-recorded audio allophones, the SP0256 actually appears to have built the output waveform on the fly algorithmically, much the same way as piano.pl generates the raw tone data for each of its 96 notes. I do need to pore through this code, not that I am familiar with the ancient microprocessor (or with assembly, for that matter), to see how this actually works. It should be entirely possible to create an ultra-lightweight module capable of generating recognisable speech in Pure Perl without any external speech libraries.
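As a rough illustration of generating tone data on the fly rather than storing it, here is a sketch of a 96-note table. It is not the code from piano.pl: the sub names are hypothetical, and it assumes equal temperament, note 0 at C0, A4 (note 57) at 440 Hz, and a plain square wave at 8000 Hz.

```perl
use strict;
use warnings;

# Frequency of note $n in equal temperament.
# Assumption: note 0 is C0 and note 57 is A4 = 440 Hz.
sub note_freq {
    my ($n) = @_;
    return 440 * 2 ** (($n - 57) / 12);
}

# All 96 frequencies (8 octaves), computed rather than stored.
sub note_table {
    return [ map { note_freq($_) } 0 .. 95 ];
}

# Raw 8-bit unsigned square wave for one note, built sample by sample.
sub note_samples {
    my ($n, $secs, $rate) = @_;
    $rate //= 8000;
    my $freq  = note_freq($n);
    my $count = int($secs * $rate);
    return pack 'C*', map {
        my $phase = $_ * $freq / $rate;           # position in cycles
        ($phase - int($phase)) < 0.5 ? 255 : 0;   # high for the first half of each cycle
    } 0 .. $count - 1;
}
```

Under these assumptions the highest note comes out just under 4 kHz, which only just fits below the Nyquist limit of an 8000 Hz sample rate. The whole table costs a few lines of arithmetic instead of kilobytes of stored samples, which is the trade the original chip seems to have made.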