Lomoji: A New Tool for Polyglots

I often make the analogy that trying to be a polyglot in an English-speaking country is like trying to maintain a hobby for keeping igloos somewhere non-polar. Without year-round cold, you either have to keep adding snow, or watch in resignation as your bespoke collection of icy enclosures melts away.

Language atrophy happens especially quickly for us English speakers because everywhere, it’s hot. It’s far more likely that the person you’re talking to speaks English better than you speak their native language, or that they underestimate your abilities. So chances are that unless you’re already super fluent in another language, whenever you try to practice it, you’ll very likely get a, “Wow! Your [insert language] is very good!” and the interaction will resume in English. (Ugh.) And even though I live in New York City, it’s still an effort to keep all of my igloos from turning into puddles of mush.

So I keep adding snow every day (i.e. artificially force exposure to all my languages constantly). Like a good polyglot, I have Google Translate on the first page of my iPhone’s home screen. Whenever I have random moments of pause throughout the day when I realize, “Wait. How would you say [word/phrase] in [language A]?” I’ll stick it into Translate for a quick answer.

It's sometimes a pain in the butt using Google Translate as a polyglot, because you often end up having to repeat the same translation manually, language by language.
One translation at a time?! Ugh.

But wait. I’ve also just realized I don’t know how to say that in B, C, D, or E! Okay. Hit X. Select B. Cool. Hit X. Select C. Cool. Hit X. Select D… this quickly gets tiresome because each time, you have to translate manually. And the Google Translate app only keeps a history of your five last-translated languages.

Gif of Kim Kardashian going, "Like, I'm so annoyed."
Okay. Maybe not that annoying.

As we polyglots tend to, I “dabble” quite a bit — enough to notice how frequently I was having to scroll back to that huge list, even for my most-used languages. On a random day when I wanted a reminder of how to say “shelf” in Spanish, I had to manually select it from the master list because it had gotten bumped off the last-used list by random forays into Danish and Portuguese.

WHERE'S MY SPANISH?
WHERE’S MY SPANISH?

I wanted something where I could select ALL the languages I’m currently “dabbling” in so I could view translations for all of them concurrently. Well, I had never implemented the Google Translate API in anything before, so I took this as a learning opportunity. I went ahead and created myself an account on Google Cloud Console, got myself an API key, and started playing.

The result is Lomoji.

Screenshot of Lomoji, an iOS app that shows a text box and a green button with a back and forth arrow to indicate translate action.
Lomoji main screen

Lomoji is a small React Native app that lets you select multiple target languages and persists them in AsyncStorage, so that whenever you perform a translate action, with just a quick glance you’re instantly reminded of how to say something in all your languages. The languages only disappear when you uncheck them, not because you happened to translate a few small things in five other languages prior.
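For a sense of what that persistence looks like, here’s a minimal sketch of storing the checked languages in AsyncStorage (the key name and helper functions are my own illustration rather than Lomoji’s actual code; newer React Native projects would import AsyncStorage from @react-native-async-storage/async-storage instead):

import { AsyncStorage } from 'react-native';

// Hypothetical key and helpers, for illustration only.
const LANGS_KEY = 'targetLanguages';

// Save the currently-checked target languages, e.g. ['es', 'ko', 'da', 'pt'].
async function saveTargetLanguages(langCodes) {
    await AsyncStorage.setItem(LANGS_KEY, JSON.stringify(langCodes));
}

// Load them on startup so the checklist survives app restarts.
async function loadTargetLanguages() {
    const stored = await AsyncStorage.getItem(LANGS_KEY);
    return stored ? JSON.parse(stored) : [];
}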

First select the language you want to translate from…

 

Selecting languages in Lomoji
Selecting a “from” language

Then select your destination language(s)…

Dropdown menu of languages in Lomoji
Dropdown menu of “to” languages

After hitting “confirm,” we have our nice list of igloos!

Selected from and to languages in settings panel
Completed settings panel

Enter the string to be translated and… voilà!

Panel showing final translations in all destination languages
Panel showing final translations in all destination languages

THAT’S WHUT I’M TALKIN’ ABOUT.

Beyoncé throwing up two sign of the horns.
WHOOP, WHOOOP!

Google Translate makes it easy to get an updated list of all available languages. As noted in their documentation, you just make an API request and Translate responds with a JSON object containing the most up-to-date list. So I made the component asynchronously fetch the list of available languages and update the internal list, falling back to a built-in store of languages if that request fails. This makes it future-friendly as support for languages is added (or perhaps removed) by Google.

async componentDidMount() {
    try {
        // Get the most recent list of languages from Google.
        const result = await axios.get(
            `https://translation.googleapis.com/language/translate/v2/languages?key=${config.apiKey}&target=en`
        );
        // If that was successful, set our internal state.
        this.setState({ langList: result.data.languages });
    } catch (err) {
        // If that fails, use the default built-in list of languages.
        this.setState({ langList: languages.data.languages });
        alert(`Unable to retrieve list of languages from Google Translate.\nLocal list used instead.\n\n${err}`);
    }
    this.setState({ assetsLoaded: true });
}

But that’s it for now! I think in a future iteration, it would definitely make for a more fluid user experience if the input language were detected automatically. Still a few loose ends to tie up, but I was too excited not to post about it. I’ll update as I move toward publishing it!
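For what it’s worth, the Translate API already exposes a detect endpoint, so that auto-detection could plausibly be a small addition along these lines. This is a sketch only, reusing the axios/config.apiKey setup from above; the function name and response handling are mine, not code that exists in Lomoji yet:

// Hypothetical sketch: not in Lomoji yet.
async function detectSourceLanguage(text) {
    // The v2 API's /detect endpoint guesses the language of a string.
    const res = await axios.post(
        `https://translation.googleapis.com/language/translate/v2/detect?key=${config.apiKey}`,
        { q: text }
    );
    // Detections come back as an array of arrays, one per input string.
    return res.data.data.detections[0][0].language; // e.g. 'es'
}

Alternatively, the translate endpoint itself will auto-detect the source language when no source parameter is supplied, which might make this even simpler.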

Oh, and if you’re wondering why I decided to call it “Lomoji,” it’s because in Korean, “X (eu)ro muoji” (X[으]로 뭐지?), where X is a language, is a phrasal suffix that roughly translates to, “How would you say it in X?” I simplified it and anglicized it to make it catchy to Western ears.

So “Yeongeo ro muoji?” (영어로 뭐지?),  where “yeongeo (영어)” is “English,” would translate to, “How would you say that in English?” Clever?

My Top Essential Plover Commands

Plover is the world’s first free and open-source stenography software. But unlike traditional computer-aided (machine-shorthand) translation systems, which typically resemble sandboxed word processors, Plover lets you use your steno machine as an OS controller/input device. Each has its own user base and use cases, and I’m not saying one is inherently better or worse than the other; they fundamentally serve different audiences. However, one of the benefits of Plover is that it unlocks a whole new world of single-stroke commands you can use to control your computer. Here’s my list of the top commands that I use nearly every day. Keep in mind that I use macOS. If you use Windows, for most commands you can just substitute “ctrl” wherever I use “command” and you’ll be good to go. An exhaustive list of all the commands Plover has to offer is available here.
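Each entry below is written as stroke = translation. In an actual Plover setup, these live in a user dictionary, which is typically just a JSON file mapping strokes to translations; for example, the first workspace entry and one of the git entries below would look something like this in the dictionary file:

{
    "SKR-R": "{#control(left)}",
    "SKR-G": "{#control(right)}",
    "TKPWEUPLT": "{>}git commit"
}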

1. Switching workspaces

When you have several workspaces on macOS and you want to switch between them quickly.

SKR-R = {#control(left)}

SKR-G = {#control(right)}

“switch to the workspace to the left” (mnemonic hook: “screen-⬅”)

“switch to the workspace to the right” (mnemonic hook: “screen-➡”)

*Think of the RPBG keys as standard keyboard arrow keys.

2. Go to the end/beginning of the document

Useful for when you’re making edits while captioning and the speaker starts talking again, or you just want to snap to the bottom of the document you’re working on.

SR-RZ = {#command(down)}

SR-FD = {#command(up)}

3. Git commands

TKPWHRA*UL (git add [all]) = {>}git add .

TKPWEUPLT = {>}git commit

TKPW*EUPLT = {>}git commit -m "{^}{-|}

TKPWHRURP = {>}git pull origin master

TKPWHRURB = {>}git push origin master

TKPWAUTS = {>}git status

4. One-stroke money formatters

When people say sums of money in ways that make you stop and think where to put the commas and periods, do these instead.

TKHR-RS = {*($c)}

“Turn the previous number into a correctly-formatted sum of money.”

3 + TKHR-RS becomes $3

4.5 + TKHR-RS becomes $4.50

.33 + TKHR-RS becomes $0.33

THO*UDZ = {^}000{*($c)}

“Turn the previous number into thousands of dollars.”

4 + THO*UDZ becomes $4,000

200 + THO*UDZ becomes $200,000

PH*LDZ = {^}{#option(left)}${#option(right)}million

“Turn the previous number into millions of dollars.”

7 + PH*LDZ = $7 million

*Use the same paradigm to do trillions of dollars (TR*LDZ) and billions of dollars (PW*LDZ).

5. Start upper with no space, start lower with no space.

When you want to suppress automatic spaces and want to control the case of the first letter of the output.

KPA* = {^}{-|}

“Start text where my cursor is currently located and capitalize the first letter.”

STHROER = {^}{>}

“Start text where my cursor is currently located and make sure the first letter is lowercased.”

6. Special characters that need to be escaped.

Certain characters are part of Plover’s dictionary syntax, so they need to be escaped when defining them so that Plover doesn’t get confused and misinterpret the entry.

KHRURB (curly brace open) = \{{^}

{

KHRURBS (curly brace close) = {^}\}

}

PWHR*RB (backslash) = {^\^}

\

7. Comment slashes

KPHERBS (comment slashes) = {^}//

// (hi)

KPH*ERBS (comment slashes + cap next) = {^}//{-|}

// (Hi)

TAEUBL/TPHR-P = (╯°□{#Shift_L}°)╯︵ ┻━┻

 (╯°□ °)╯︵ ┻━┻

*Note the required {#Shift_L} to prevent special-character encoding weirdness.

SKWHR*UG = ¯\_(ツ{#Shift_L}{^})_/¯

¯\_(ツ)_/¯

*Note the required {#Shift_L} to prevent special-character encoding weirdness.

8. Brightness up/down

PWR*P = {#MonBrightnessUp}

PWR*B = {#MonBrightnessDown}

*Again, think of the P and B as up and down on a standard keyboard arrow set.

9. Sound controls

SRAO*UP (volume up) = {#AudioRaiseVolume}

SRAO*UB (volume down) = {#AudioLowerVolume}

SRAO*UPLT (volume mute) = {#AudioMute}

10. On-the-fly speaker definition

For when you didn’t create a speaker definition for someone beforehand but you still want to mark them by their name in the transcript rather than using chevrons (>>).

SPOEU = {*<}{^:}{#Alt_L(Left)}{^\n\n^}{#command(right)}{^ ^}{-|}

Inserts a new line, puts the last name (or any word that comes prior to the stroke) in all caps, adds a colon, and capitalizes the next character.

“Josh” becomes \nJOSH:(capitalize next)

11. Go to end of line & add a semicolon

TKHRAO*EUPB (go to end of line) = {#command(right)}

SKHROEUPB (add semicolon to end of line — used frequently in writing JavaScript) = {#command(right)}{;}

12. CamelCase

KPW-PBG (cap next word and attach it) = {^}{-|}

set KPW-PBG attribute becomes setAttribute

KPHAO*EL (cap the first word and the next word, attach them) = {*-|}{^}{-|}

word KPHAO*EL press becomes WordPress

13. Add comma at end of line, make new line

Handy for writing JavaScript objects or working with JSON.

KPHAPBL (comma + new line) = {#right}{,}{^\n^}{^}

14. Indented parentheticals

Used for captioning to mark things like laughter or applause.

KHRAFT (laughter) = {^\n^}{^\n^}{^\t^}{^}[ Laughter ]{-|}

Adds two new lines, indents one tab, and adds [ Laughter ].

15. Dot-slash

Used for pointing to the current working directory in bash.

TKHR*RB (dot slash) = ./{^}

16. "S" is for "sticky" (parentheses & brackets)

Useful for writing function declarations and array pointers.

SPR-PB = {^}({^}

SPR-PBD (paren, double) = {^}()

SPWR*BG = {^}[{^}

So these are a few of the most important ones that I use for now. I’ll keep adding to this list as I think of more!

Hacking Music Accessibility Via Realtime Stenography (and Code): Recap

This March, I got to present at the NYC Monthly Music Hackathon about my experiments using stenography and coding to enhance accessibility of auditory information for those with hearing loss who use real-time captioning as a form of accommodation.

Photo of Stanley in front of an audience. On the left side there's a screen with the title slide of his talk and on the right is Stanley standing in front of another screen, to which he's transcribing himself.
If you think that looks like I’m transcribing myself, you would be correct!

Problem

While sign language interpreters can use their body and expressive, animated signing to convey the tone, rhythm, and various other qualities of music, real-time captioners are comparatively limited in their expressive capabilities. Deaf attendees who primarily use spoken language typically must rely on a combination of their residual hearing, vibrations, cues from the performer or audience members, and parenthetical descriptions inserted by the captioner to piece together a mental representation of what’s being heard.

Here’s a short clip of what live captioning looks like in action:

Despite having provided real-time captioning services for almost six years now, this was something I hadn’t considered at all until I was hired by Spotify to caption their February Monthly Music Hackathon. A deaf attendee, Jay Zimmerman, approached me and requested that I describe the audio segments during the demo portion as best I could. As a musician who had lost his hearing later in life, Jay wanted and appreciated a greater level of access than the simple music-note symbol often seen in TV closed captions, or simply the lyrics on screen. So I did the best that I could:

 On screen, it shows a bunch of dialogue and then parentheticals for [ Big Room House/EDM ] and [ Tiësto, Martin Garrix - "The Only Way Is Up" ].
Stanley’s captioning screen showing captions from Spotify’s Monthly Music Hackathon event.
For me, this wasn’t good enough. Even when the captioner knows a song well enough to nail the title, it still feels lacking. It just isn’t a suitable accommodation for someone who truly wants to experience the acoustic information in a more direct way. There’s simply too much information to sum up in short parentheticals. Additionally, when I’m captioning, I’ve found it difficult to come up with accurate and succinct descriptions on the spot, since things often happen very quickly in live settings.

Brainstorming and Thought Process

A while back, I wrote a post on my project, Aloft. In short, it’s a cloud-based caption delivery app which enables participants to access stenographers’ live transcription output on any Internet-enabled device with a modern browser. Its main use case was originally for those participating in events or lectures remotely:

Screenshot of a remote participant watching the keynote address at WordCamp US Nashville depicting a PowerPoint slide and the realtime captions underneath.
Example of Aloft being used by a remote conference participant.

But I actually use Aloft in all of my captioning work, including on-site jobs where I hook up to a projector, drag the participant view page over to the monitor, and display it full-screen, like so:

Photo depicts a man in a wheelchair giving a talk while a full-screen view of live transcription is visible behind him.
Example of Aloft being used on-site at the March NYC Monthly Music Hackathon. Speaker: Patrick Anderson.
Image of a large audience at a tech conference watching the keynote with live captions above the presentation slides.
Example of using Aloft for large-event captioning at WordCamp US 2017.

Well, one of the great benefits of working with software you wrote yourself is that, as you encounter different needs, you are free to modify your tools to meet them. In fact, one of the reasons I created Aloft was how annoyingly restrictive and unintuitive commercial legal-transcription viewing platforms typically are with respect to seemingly simple things like being able to change the font size or the color palette.

So that got me thinking…

What if we could simultaneously convey the transcription of spoken content in addition to some visual representation of the audio?

Initial Research

At first, I didn’t know where to start since, at the time, I hadn’t really dealt with computer audio processing at all, so I began by researching different JavaScript packages (it needed to be web-compatible, and I didn’t really want to deal with having to run code through an interpreter). I came across a package called p5.js, which a lot of the folks at the hackathon seemed to have either heard about already or used in the past. The Frequency Spectrum add-on seemed pretty promising.

 

Screenshot depicting the demo page for the Frequency Spectrum add-on. A frequency spectrogram (squiggly lines on a light gray background) and some JavaScript code beneath are visible.
P5’s Frequency Spectrum add-on

Implementation

So I created a special route in Express so the soundbar would only live on a page that I could point the browser to during the occasions I need it. We don’t want all those extra scripts loading when I’m doing just normal captioning!

// Soundbar-specific route: look up the event, then render the soundbar version of the watch page.
app.get('/soundbar/:user/:event', function(req, res) {
	Event.findOne({'user': req.params.user, 'event': req.params.event},
	function(err, event) {
		if(err) {
			throw err;
		} else {
			res.render('watch-soundbar', {
				'user': req.params.user,
				'event': req.params.event,
				'marker': req.params.marker
			});
		}
	});
});

So now, I would load the libraries only if I went to localhost:4000/soundbar/stanley/monthly-music-hackathon.

Then, it was just a matter of importing the libraries into the view and initializing p5 via the code provided on the demo page, with some minor modifications.

var mic, fft;

function setup() {
	// Size the canvas relative to the parent window so it stays proportional to the captioning screen.
	let canvas = createCanvas(windowWidth, windowWidth/5);
	canvas.parent('waveform');

	noFill();

	// Listen to the microphone and feed it into an FFT analyzer.
	mic = new p5.AudioIn();
	mic.start();
	mic.amp(10);
	fft = new p5.FFT();
	fft.setInput(mic);
}

function draw() {
	// Background and stroke colors are interpolated into the view from the captioner's settings.
	background('#{bgColor}');
	stroke('#{fontColor}');
	strokeWeight(3);

	var spectrum = fft.analyze();

	beginShape();

	// Draw one vertex per frequency bin.
	for (let i = 0; i < spectrum.length; i++) {
		vertex(i, map(spectrum[i], 0, 255, height, 100));
	}

	endShape();
}

function windowResized() {
	resizeCanvas(windowWidth, 250);
}

A few changes I made were: I modified the canvas setup so that it automatically calculates the height and width based on the parent window, so it would always be proportional to the size of the captioning screen. Since we don’t know ahead of time what the foreground and background colors are, I passed those in as variables that are pulled from local storage via an Angular service and sent to the controller. Finally, I passed in the div in which it should display, and we have a working audio waveform at the bottom of the screen!

 Screenshot of a web page displaying real-time transcriptions with a squiggly-line audio frequency spectrum displayed at the bottom.
View of frequency spectrum at the bottom of the Aloft viewer page.
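If you’re wondering how those colors get into the view: purely as an illustration (and assuming, for the sake of the sketch, that the client sends its saved colors along as query parameters), a stripped-down variant of the soundbar route from earlier could pass them straight through to the Pug template, where they’re interpolated as '#{bgColor}' and '#{fontColor}':

// Illustrative only: Event lookup omitted; query-parameter transport is an assumption.
app.get('/soundbar/:user/:event', function(req, res) {
	res.render('watch-soundbar', {
		'user': req.params.user,
		'event': req.params.event,
		'bgColor': req.query.bg || '#000000',   // made-up fallback values
		'fontColor': req.query.fg || '#ffffff'
	});
});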

But, wait a minute. What if I don’t want to see the “soundbar” all the time? What if I only want it to be displayed when appropriate, as in during a live musical demo or performance? Well, why don’t I create a command that I can invoke from the steno machine that would turn it on and off? So I arbitrarily decided that &SOUNDBAR would be the flag I would use to turn the soundbar on and implemented it like so:

// If the number of &SOUNDBARs in the transcript is even, hide the soundbar; if it's odd, show it.

if ((text.match(/&SOUNDBAR/g) || []).length % 2 === 0)  {
	soundbar.style.display = 'none';
	captionArea.style.paddingBottom = '4rem';
} else {
	soundbar.style.display = 'inherit';
	captionArea.style.paddingBottom = '0rem';
}

So essentially, every time the viewer page receives new text from the backend, before it modifies the DOM, it uses a regular expression to count how many times &SOUNDBAR appears in the transcript: if the count is odd, the soundbar is displayed; if it’s even, it’s hidden. Probably not the most optimal solution, in that the microphone stays active and the calculations keep running in the background whether or not the soundbar is visible, but it works for now.

Now, the soundbar remains hidden until I choose to show it by writing SPWA*R on my steno machine, which spits out &SOUNDBAR in the transcript. As soon as I don’t need it anymore, I write SPWA*R again, the number of soundbar flags on the page is now even, and it hides the div.
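In Plover-dictionary terms (tying back to the earlier post), that flag is presumably just a one-stroke entry along these lines; the exact formatting metacharacters in my actual entry may differ:

{
    "SPWA*R": "&SOUNDBAR"
}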

 Screenshot of two screens, one of the captioner's screen where you can see the captioner wrote "&SOUNDBAR" to flag Aloft to open the soundbar. On the second screenshot, you can see the squiggly frequency spectrum "soundbar" displayed.
How the &SOUNDBAR flag is used by the captioner to display the frequency spectrum.

An obvious drawback is that if the consumer wants a copy of the transcript, I have to go back and delete all the occurrences of &SOUNDBAR, but that’s a problem to solve for another day. Another problem is that p5 is always running in the background even when the div is hidden. This is a limitation of the package itself, unfortunately, since I’ve had trouble loading it via an event handler as opposed to how it was designed to work: on page load.
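One possible mitigation (not something I’ve implemented here) would be to pause the processing rather than try to defer loading the library itself: p5 sketches can be paused with noLoop() and resumed with loop(), and p5.AudioIn has start()/stop() methods, so the toggle logic above could hypothetically be extended like this:

// Hypothetical extension of the toggle above; 'mic' is the global from the p5 sketch.
if ((text.match(/&SOUNDBAR/g) || []).length % 2 === 0) {
	soundbar.style.display = 'none';
	captionArea.style.paddingBottom = '4rem';
	noLoop();      // stop calling draw() while the soundbar is hidden
	mic.stop();    // release the microphone input
} else {
	soundbar.style.display = 'inherit';
	captionArea.style.paddingBottom = '0rem';
	mic.start();
	loop();        // resume the draw loop
}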

A limitation that Jay pointed out was that, while the soundbar solution is great for visualizing instantaneous peaks (what’s loud or quiet at the current moment), it’s relatively poor at visualizing what’s going on temporally. So I played around with another possible solution, this time using a different Web Audio-based package called Awesome Realtime Audio Visualizer in JS, HTML, created by Will Brickner. His audio visualization spectrogram conveys frequency on the vertical axis, time on the horizontal axis, and intensity through pixel brightness. When the visualizer is activated, a yellow line scans across the screen from left to right, drawing a spectrogram in real time, giving one more of a temporal idea of the auditory content:

Screenshot of a captioning window with a dynamic audio spectrogram as the complete background.
Demo of spectrogram with “Crowd Control” – Fisher playing in the background.

In contrast to the soundbar, this visualizer takes up the whole screen and becomes the entire page’s background. In the view, it lives as an HTML canvas object. As with the soundbar example, this implementation has its own route (/spectrogram), and it also allows the captioner to turn the audio visualizer on and off depending on need.
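For the curious, here’s a rough, generic sketch of how a scanning spectrogram like this can be drawn with the Web Audio API directly. This is not Will Brickner’s code or Aloft’s actual implementation; the canvas id and the yellow-on-black color mapping are made up for illustration:

// Generic Web Audio spectrogram sketch (illustrative only).
const canvas = document.getElementById('spectrogram'); // hypothetical canvas id
const ctx = canvas.getContext('2d');

navigator.mediaDevices.getUserMedia({ audio: true }).then(function(stream) {
	const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
	const analyser = audioCtx.createAnalyser();
	analyser.fftSize = 1024; // 512 frequency bins
	audioCtx.createMediaStreamSource(stream).connect(analyser);

	const bins = new Uint8Array(analyser.frequencyBinCount);
	const binHeight = canvas.height / bins.length;
	let x = 0; // current column; scans left to right, then wraps

	function drawColumn() {
		analyser.getByteFrequencyData(bins);
		for (let i = 0; i < bins.length; i++) {
			// Frequency on the vertical axis (low at the bottom),
			// intensity mapped to pixel brightness.
			const intensity = bins[i]; // 0-255
			ctx.fillStyle = 'rgb(' + intensity + ', ' + intensity + ', 0)';
			ctx.fillRect(x, canvas.height - (i + 1) * binHeight, 1, binHeight);
		}
		x = (x + 1) % canvas.width; // time on the horizontal axis
		requestAnimationFrame(drawColumn);
	}
	drawColumn();
});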

Final Thoughts and Next Steps

The biggest challenge is really being able to convey enough depth to make the experience as immersive and inclusive as possible. While these two solutions are a great start, perhaps combining them somehow into one page would let the user benefit from both formats. Or better yet, incorporate a more-complex visualizer to allow for more bandwidth. As Jay puts it:

Both give a sense of amplitude and pitch, with amplitude being easier to perceive so we can see when sound is happening and whether or not it is soft or loud.

The issue with frequency/pitch information is how to make it clear and easy to decipher, which leads to linear vs exponential or logarithmic graphing.

Easiest way to comprehend pitch/frequency quickly is on the vertical, so high pitches are high and low pitches are low.

So what if we could harness a different dimension, like color, to indicate volume, leaving the y-axis free for pitch and the x-axis to denote time? What if instead of a line or a sequence of bars, we had a 3D object that morphs and changes in size in response to different volumes or frequencies? But then that leads to the question: at what level of complexity would the visuals negatively affect the legibility of the captioning?