Listening-based language learning

  • By OmniLingo
  • Last update: Dec 9, 2022
  • Comments: 12

OmniLingo

What is this?

The goal of the project is to help you practice listening comprehension.

It works by giving you random sentences in the language you're learning and asking you to fill in the gaps. The sentences were submitted by contributors to the Mozilla Common Voice platform.

The project aims to not require any knowledge of a meta language in order to start learning. If you are interested in a more traditional course creation project, check out LibreLingo.

The game orders the questions by difficulty and gives you batches of five, with a random task for each question. When you successfully answer a batch of five in less time than the audio takes to play, you advance a level and get a new batch of five.
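
The progression rule above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the `Question` fields, the `difficulty` score and the reading of "less time than the audio takes to play" as the total audio length of the batch are all assumptions.

```python
from dataclasses import dataclass
from typing import List

BATCH_SIZE = 5  # the README describes batches of five


@dataclass
class Question:
    sentence: str
    difficulty: float     # hypothetical score; lower means easier
    audio_seconds: float  # length of the recording


def make_batches(questions: List[Question]) -> List[List[Question]]:
    """Sort questions by difficulty and cut them into consecutive batches."""
    ordered = sorted(questions, key=lambda q: q.difficulty)
    return [ordered[i:i + BATCH_SIZE] for i in range(0, len(ordered), BATCH_SIZE)]


def next_level(level: int, batch: List[Question],
               all_correct: bool, seconds_taken: float) -> int:
    """Advance one level only if the whole batch was answered correctly
    and faster than the combined playing time of its recordings."""
    audio_total = sum(q.audio_seconds for q in batch)
    if all_correct and seconds_taken < audio_total:
        return level + 1
    return level
```

A caller would keep the current level, serve `make_batches(...)[level]`, and update the level with `next_level` after each batch.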

Tasks

  • Fill in the blanks: A cloze-style task
  • Drag and drop: Get a set of tiles and click on them to build a word or sentence
  • Pick the right one: Get two options and choose the right one
  • Spot the word: Get a set of six tiles and click on the ones that appear in the audio
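
As an illustration of the first task type, a cloze prompt could be built along these lines. This is a minimal sketch that assumes words are separated by whitespace (which does not hold for every language in the list); `make_cloze` and `check_answer` are hypothetical names, not functions from the repository.

```python
import random
from typing import Optional, Tuple


def make_cloze(sentence: str, rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Blank out one whitespace-separated word and return (prompt, answer)."""
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        raise ValueError("sentence has no words")
    gap = rng.randrange(len(words))
    answer = words[gap]
    words[gap] = "_" * len(answer)  # the gap keeps the length of the hidden word
    return " ".join(words), answer


def check_answer(given: str, answer: str) -> bool:
    """Compare ignoring case and surrounding whitespace."""
    return given.strip().casefold() == answer.casefold()
```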

Keys

  • Space: Play the recording
  • Enter:
    1. Submit and check if you got it right
    2. If already submitted, move to the next recording

Data

The data comes from the Common Voice dataset releases.

Target audience

This system is designed with two main user groups in mind:

  • People who want to learn a new language
  • People who want to learn how to write their native language

The system endeavours to be audio first, with knowledge of writing built up by hearing.

Contact

Talk to us!

  • IRC: irc.freenode.net #OmniLingo
  • Matrix: #OmniLingo:matrix.org (access via Element)
  • Telegram: OmniLingo

Available languages

All of the languages available in the Common Voice 6.1 dataset.

Abkhaz · Arabic · Assamese · Breton · Catalan · Hakha Chin · Czech · Chuvash · Welsh · German · Dhivehi · Greek · English · Esperanto · Spanish · Estonian · Basque · Persian · Finnish · French · Frisian · Irish · Hindi · Upper Sorbian · Hungarian · Interlingua · Indonesian · Italian · Japanese · Georgian · Kabyle · Kyrgyz · Luganda · Lithuanian · Latvian · Mongolian · Maltese · Dutch · Odia · Punjabi · Polish · Portuguese · Romansh Sursilvan · Romansh Vallader · Romanian · Russian · Kinyarwanda · Sakha · Slovenian · Swedish · Tamil · Thai · Turkish · Tatar · Ukrainian · Vietnamese · Votic · Chinese (China) · Chinese (Hong Kong) · Chinese (Taiwan)

If you want to work with a language not yet in Common Voice, we highly recommend that you get set up in Common Voice, but in the meantime, you can check out the format guidelines.

Releases

  • 0.1.0 Functional proof of concept
  • 0.2.0 Partial prototype with level progression

Deployment

To bootstrap the project for Finnish, git clone the repository, then run the following commands:

pip install poetry
pip install -r requirements.txt
make
poetry install
poetry run omnilingo serve

The project should be accessible through http://localhost:5001/index.html

To add more languages, download a dataset from Common Voice and put it in cv-corpus-6.1-2020-12-11/.

Happy hacking! :)

Dependencies

For those who prefer to install their dependencies through their package manager in Debian/Ubuntu, the following dependencies are available there:

python3-mutagen - audio metadata editing library (Python 3)
python3-jieba - Jieba Chinese text segmenter (Python 3)
python3-flask - micro web framework based on Werkzeug and Jinja2 - Python 3.x

Acknowledgements

Logo by Fabi Yamada! Licensed under CC-BY.

Github

https://github.com/omnilingo/omnilingo

Comments(12)

  • 1

    Remove Japanese tokenizer - it's crashing.

    We might need to use a different tokenizer library. Here's the crash report:

    [15:20:19] git:(main*) [email protected]:/home/d33tah/workspace/omnilingo(0) > gdb python                    
    GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
    Copyright (C) 2020 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.
    Type "show copying" and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    Type "show configuration" for configuration details.
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>.
    Find the GDB manual and other documentation resources online at:
        <http://www.gnu.org/software/gdb/documentation/>.
    
    For help, type "help".
    Type "apropos word" to search for commands related to "word"...
    Reading symbols from python...
    (No debugging symbols found in python)
    (gdb) r question_loader.py
    Starting program: /home/d33tah/virtualenv/bin/python question_loader.py
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    [Detaching after fork from child process 14658]
    [New Thread 0x7ffff321e700 (LWP 14684)]
    [New Thread 0x7ffff2a1d700 (LWP 14685)]
    [New Thread 0x7ffff021c700 (LWP 14686)]
    [New Thread 0x7fffeba1b700 (LWP 14687)]
    [New Thread 0x7fffe921a700 (LWP 14688)]
    [New Thread 0x7fffe6a19700 (LWP 14689)]
    [New Thread 0x7fffe4218700 (LWP 14690)]
    [dynet] random seed: 1234
    [dynet] allocating memory: 32MB
    
    Thread 1 "python" received signal SIGILL, Illegal instruction.
    0x00007fffdf082476 in dynet::DeviceMempoolSizes::DeviceMempoolSizes(std::string const&) () from /home/d33tah/virtualenv/lib/python3.8/site-packages/dyNET38.libs/libdynet-dbf8d59b.so
    (gdb) bt
    #0  0x00007fffdf082476 in dynet::DeviceMempoolSizes::DeviceMempoolSizes(std::string const&) () from /home/d33tah/virtualenv/lib/python3.8/site-packages/dyNET38.libs/libdynet-dbf8d59b.so
    #1  0x00007fffdf0d32e3 in dynet::initialize(dynet::DynetParams&) () from /home/d33tah/virtualenv/lib/python3.8/site-packages/dyNET38.libs/libdynet-dbf8d59b.so
    #2  0x00007fffdf77f3ad in __pyx_f_6_dynet_11DynetParams_init (__pyx_skip_dispatch=1, __pyx_v_self=<optimized out>) at /tmp/pip-req-build-5x6eyi0z/python/_dynet.cpp:8373
    #3  __pyx_pf_6_dynet_11DynetParams_6init (__pyx_v_self=<optimized out>) at /tmp/pip-req-build-5x6eyi0z/python/_dynet.cpp:8422
    #4  __pyx_pw_6_dynet_11DynetParams_7init (__pyx_v_self=<optimized out>, unused=<optimized out>) at /tmp/pip-req-build-5x6eyi0z/python/_dynet.cpp:8406
    #5  0x000000000050425b in ?? ()
    #6  0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #7  0x000000000056822a in _PyEval_EvalCodeWithName ()
    #8  0x000000000068c1e7 in PyEval_EvalCode ()
    #9  0x00000000005ff1f4 in ?? ()
    #10 0x00000000005c3cb0 in ?? ()
    #11 0x00000000005f257d in PyVectorcall_Call ()
    #12 0x000000000056fcb6 in _PyEval_EvalFrameDefault ()
    #13 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #14 0x00000000005f6033 in _PyFunction_Vectorcall ()
    #15 0x000000000056ef97 in _PyEval_EvalFrameDefault ()
    #16 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #17 0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #18 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #19 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #20 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #21 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #22 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #23 0x00000000005f3a11 in ?? ()
    #24 0x00000000005f3e98 in _PyObject_CallMethodIdObjArgs ()
    #25 0x0000000000551493 in PyImport_ImportModuleLevelObject ()
    #26 0x000000000056c1e3 in _PyEval_EvalFrameDefault ()
    #27 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #28 0x000000000068c1e7 in PyEval_EvalCode ()
    #29 0x00000000005ff1f4 in ?? ()
    #30 0x00000000005c3cb0 in ?? ()
    #31 0x00000000005f257d in PyVectorcall_Call ()
    #32 0x000000000056fcb6 in _PyEval_EvalFrameDefault ()
    #33 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #34 0x00000000005f6033 in _PyFunction_Vectorcall ()
    #35 0x000000000056ef97 in _PyEval_EvalFrameDefault ()
    #36 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #37 0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #38 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #39 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #40 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #41 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #42 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #43 0x00000000005f3a11 in ?? ()
    #44 0x00000000005f3e98 in _PyObject_CallMethodIdObjArgs ()
    #45 0x0000000000551493 in PyImport_ImportModuleLevelObject ()
    #46 0x000000000056c1e3 in _PyEval_EvalFrameDefault ()
    #47 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #48 0x000000000068c1e7 in PyEval_EvalCode ()
    #49 0x00000000005ff1f4 in ?? ()
    #50 0x00000000005c3cb0 in ?? ()
    #51 0x00000000005f257d in PyVectorcall_Call ()
    #52 0x000000000056fcb6 in _PyEval_EvalFrameDefault ()
    #53 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #54 0x00000000005f6033 in _PyFunction_Vectorcall ()
    #55 0x000000000056ef97 in _PyEval_EvalFrameDefault ()
    #56 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    --Type <RET> for more, q to quit, c to continue without paging--
    #57 0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #58 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #59 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #60 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #61 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #62 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #63 0x00000000005f3a11 in ?? ()
    #64 0x00000000005f3e98 in _PyObject_CallMethodIdObjArgs ()
    #65 0x0000000000551493 in PyImport_ImportModuleLevelObject ()
    #66 0x000000000056c1e3 in _PyEval_EvalFrameDefault ()
    #67 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #68 0x000000000068c1e7 in PyEval_EvalCode ()
    #69 0x00000000005ff1f4 in ?? ()
    #70 0x00000000005c3cb0 in ?? ()
    #71 0x00000000005f257d in PyVectorcall_Call ()
    #72 0x000000000056fcb6 in _PyEval_EvalFrameDefault ()
    #73 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #74 0x00000000005f6033 in _PyFunction_Vectorcall ()
    #75 0x000000000056ef97 in _PyEval_EvalFrameDefault ()
    #76 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #77 0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #78 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #79 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #80 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #81 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #82 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #83 0x00000000005f3a11 in ?? ()
    #84 0x00000000005f3e98 in _PyObject_CallMethodIdObjArgs ()
    #85 0x0000000000551493 in PyImport_ImportModuleLevelObject ()
    #86 0x000000000056c1e3 in _PyEval_EvalFrameDefault ()
    #87 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #88 0x000000000068c1e7 in PyEval_EvalCode ()
    #89 0x00000000005ff1f4 in ?? ()
    #90 0x00000000005c3cb0 in ?? ()
    #91 0x00000000005f257d in PyVectorcall_Call ()
    #92 0x000000000056fcb6 in _PyEval_EvalFrameDefault ()
    #93 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #94 0x00000000005f6033 in _PyFunction_Vectorcall ()
    #95 0x000000000056ef97 in _PyEval_EvalFrameDefault ()
    #96 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #97 0x000000000056a136 in _PyEval_EvalFrameDefault ()
    #98 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #99 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #100 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #101 0x0000000000569f5e in _PyEval_EvalFrameDefault ()
    #102 0x00000000005f5e56 in _PyFunction_Vectorcall ()
    #103 0x00000000005f3a11 in ?? ()
    #104 0x00000000005f3e98 in _PyObject_CallMethodIdObjArgs ()
    #105 0x0000000000551493 in PyImport_ImportModuleLevelObject ()
    #106 0x000000000056c1e3 in _PyEval_EvalFrameDefault ()
    #107 0x000000000056822a in _PyEval_EvalCodeWithName ()
    #108 0x000000000068c1e7 in PyEval_EvalCode ()
    #109 0x000000000067d5a1 in ?? ()
    #110 0x000000000067d61f in ?? ()
    #111 0x000000000067d6db in PyRun_FileExFlags ()
    #112 0x000000000067da6e in PyRun_SimpleFileExFlags ()
    #113 0x00000000006b6132 in Py_RunMain ()
    --Type <RET> for more, q to quit, c to continue without paging--
    #114 0x00000000006b64bd in Py_BytesMain ()
    #115 0x00007ffff7dcc0b3 in __libc_start_main (main=0x4eec80 <main>, argc=2, argv=0x7fffffffdab8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdaa8) at ../csu/libc-start.c:308
    #116 0x00000000005f927e in _start ()
    (gdb) quit
    A debugging session is active.
    
            Inferior 1 [process 14630] will be killed.
    
    Quit anyway? (y or n) 
    Please answer y or n.
    A debugging session is active.
    
            Inferior 1 [process 14630] will be killed.
    
    Quit anyway? (y or n) y
    
  • 2

    Tokenisers multi issue

    Here is a list of languages that might require specific tokenisation for a first release:

    • [x] as: specific punctuation characters, e.g. ।
    • [x] br: apostrophes and hyphens
    • [x] ca: apostrophes and hyphens
    • [x] cy: apostrophes
    • [x] dv: non-Latin script
    • [x] en: apostrophes
    • [x] fa: non-Latin script
    • [x] fr: apostrophes
    • [x] fy-NL: apostrophes
    • [x] ga-IE: apostrophes and hyphens
    • [x] hi: non-Latin script
    • ~~it~~: apostrophes #84
    • [x] ja: language without spaces
    • [x] ka: non-Latin script
    • [x] kab: hyphens and apostrophes
    • [x] lg: apostrophes
    • [x] mt: hyphens
    • [x] or: non-Latin script
    • [x] pa-IN: non-Latin script
    • [x] pl: ?
    • [x] pt: hyphens
    • [x] rm-sursilv: apostrophes
    • [x] rm-vallader: apostrophes
    • [x] ta: non-Latin script
    • [x] th: non-Latin script
    • [x] tr: apostrophes
    • [x] uk: apostrophes
    • [x] zh-CN: language without spaces
    • [x] zh-HK: language without spaces
    • [x] zh-TW: language without spaces

    Click when checked and implemented. Each tokeniser should have a test set.
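
A per-language tokeniser for the apostrophe and hyphen cases in the list above might look something like this. It is a sketch under assumptions: the `JOINERS` table only covers three of the listed codes, and treating a joiner as word-internal whenever it sits between two letters is one possible policy, not necessarily the one the project uses.

```python
import re
from typing import List

# Characters kept inside a word when they occur between letters.
# Entries follow the checklist above (br: apostrophes and hyphens,
# cy: apostrophes, mt: hyphens); the exact character sets are assumptions.
JOINERS = {
    "br": "'\u2019-",
    "cy": "'\u2019",
    "mt": "-",
}


def tokenise(text: str, lang: str) -> List[str]:
    """Return the words of text, keeping language-specific joiners inside words."""
    letters = r"[^\W\d_]+"  # one or more Unicode letters
    joiners = JOINERS.get(lang, "")
    if joiners:
        pattern = rf"{letters}(?:[{re.escape(joiners)}]{letters})*"
    else:
        pattern = letters
    return re.findall(pattern, text)
```

A test set per tokeniser, as requested above, could then be a list of (sentence, expected tokens) pairs checked against `tokenise`.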

  • 3

    Issue #11 fix?

    The app isn't running properly on my system (the page isn't updating when I reload the server), but surely this fixes the issue? Sorry if this doesn't help at all, but surely the onBlur event means that the input is checked every time the input element loses focus, such as when the audio is played? I don't know, I tried lol. Also, I added the encoding option because I was getting a UnicodeDecodeError otherwise. This might just be an issue for me, though. Possibly fixes #11.

  • 4

    Static Site Commands

    The following are steps towards having an omnilingo command that can perform the steps for build, deploy, etc. that are automatically tested in a production-like manner. These are subject to change, but this issue will be a tracking point for those changes as the pull requests that implement this change continue to land.

    • [x] Enable poetry2nix (poetry and nix with direnv support)
    • [x] Set up pre-commit for developer workflow checks
    • [x] Set up GitHub workflows for pull request automated verifications
    • [x] Set up pytest for automated checks of the python commands
    • [ ] Change the Makefile to a python command for building the static site as an output directory
    • [ ] Set up the following subcommands for omnilingo
      • build
      • serve
    • [ ] Determine how to use build and GitHub workflows to deploy the site automatically
  • 5

    Incorporation of Pictures

    Add an option where an audio file is played and the user selects pictures of the things that are said in the audio. This could help users start to actually learn new words from the app.

  • 6

    add timer

    as a first pass at gamification it would be good to have a timer that starts when the user presses play and stops when the user completes the task. At the end the interface can report the score for the task, as user time / audio time x.
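
A rough sketch of such a timer, for illustration only. `TaskTimer` is a hypothetical name, and the score is taken to be the ratio user time / audio time, so 1.5 would mean the user needed one and a half times the length of the recording.

```python
import time


class TaskTimer:
    """Starts when the user presses play and stops when the task is completed."""

    def __init__(self, audio_seconds: float, clock=time.monotonic):
        if audio_seconds <= 0:
            raise ValueError("audio_seconds must be positive")
        self.audio_seconds = audio_seconds
        self._clock = clock  # injectable so the timer can be tested
        self._started = None
        self._elapsed = None

    def start(self) -> None:
        self._started = self._clock()

    def stop(self) -> float:
        if self._started is None:
            raise RuntimeError("timer was never started")
        self._elapsed = self._clock() - self._started
        return self._elapsed

    def score(self) -> float:
        """User time divided by audio time; lower is better."""
        if self._elapsed is None:
            raise RuntimeError("timer was never stopped")
        return self._elapsed / self.audio_seconds
```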

  • 7

    Create a tokeniser for Italian

    The Italian corpus has a lot of non-alphabetic characters in it:

    $ cat ~/source/common-voice/server/data/it/* | grep -o "\W" | sort -u | tr -d '\n'
    ੝’­​!"#%&'()+,-./:;<=>?[]`{|}~¡«¬®°´·»¿΄״‐‑‒–—―‘’‛“”„†…′″‹›←→↵−≡☆♡♭♯⟨⟩。・,़̨̩̣̥̱̓́̀͡  $£€्્
    

    This probably requires a bit of a different approach, e.g. stripping some of the characters, or filtering some of the sentences.
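
One way the two ideas (stripping characters and filtering sentences) could be combined is sketched below. The allowed punctuation set and the replacement table are assumptions chosen for illustration, and letters of any script are accepted, so a real Italian tokeniser would probably want a stricter alphabet check.

```python
import unicodedata
from typing import Optional

# Assumption: punctuation we are willing to keep in a sentence.
ALLOWED_PUNCT = set(" '\".,;:!?()-\u00ab\u00bb")

# Assumption: typographic variants folded into plain equivalents.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2010": "-", "\u2011": "-",   # hyphen, non-breaking hyphen
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
}


def clean_sentence(sentence: str) -> Optional[str]:
    """Strip invisible characters and normalise quotes and dashes.

    Returns None when the sentence contains anything else unexpected,
    so the caller can filter it out.
    """
    out = []
    for ch in unicodedata.normalize("NFC", sentence):
        ch = REPLACEMENTS.get(ch, ch)
        if unicodedata.category(ch) in ("Cf", "Cc"):
            continue  # soft hyphens, zero-width spaces, control characters
        if ch.isalpha() or ch.isdigit() or ch in ALLOWED_PUNCT:
            out.append(ch)
        else:
            return None  # symbol, stray combining mark, etc.
    return "".join(out)
```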

  • 8

    add tokenisers with different levels

    we have words, but also useful would be, e.g.

    • character, some languages have di/tri-graphs
    • syllable, could be useful
    • morphs (for a word construction task)

    Example: kinkʼowinik kimbʼek

    • chars: k i n kʼ o w i n i k k i m bʼ e k
    • sylls: kin kʼow in ik kim bʼek
    • morphs: k in kʼow in ik k im bʼe k
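
The character level could be sketched as a greedy longest-match over a per-language list of di/tri-graphs. The function name and the idea of passing the inventory as a plain list are assumptions; syllable and morph levels would need language-specific rules or data that are not covered here.

```python
from typing import Iterable, List


def tokenise_chars(text: str, multigraphs: Iterable[str]) -> List[str]:
    """Split text into characters, keeping listed multigraphs together."""
    # Try longer graphs first so a trigraph wins over a digraph prefix.
    ordered = sorted({g for g in multigraphs if g}, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # word boundaries are dropped at this level
            continue
        for graph in ordered:
            if text.startswith(graph, i):
                tokens.append(graph)
                i += len(graph)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```

Called with the example sentence above and the inventory `["kʼ", "bʼ"]`, this yields the `chars` line of the example.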
  • 9

    Create an announcement board

    We'll want to be able to announce progress to our users and developers to keep them engaged. Signing up should be as simple as entering your e-mail and clicking "subscribe". Remember to set up SOME privacy policy.

  • 10

    Modal for settings

    Initially two settings:

    • Location of userdata storage
    • Enabled tasks
      • To choose the enabled tasks a user should just click/unclick on screenshots of what the tasks look like.
  • 11

    slowdown version

    other apps often have an option to slow down the audio, this seems to work by people putting pauses in between the words. it would be cool to support something like this, but chopping the audio is hard. some other ideas:

    • for a given clip find the longest piece of audio and if it is longer than the current one, use that as the "slow down" option
    • play the recording at 0.9x or something like that, but this will distort the audio
    • use josh's forced alignments to provide clipped versions for those clips where the segmentation is reliable
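
The first idea could be sketched as below, reading "the longest piece of audio" as the longest recording of the same sentence. The `Clip` record and its fields are hypothetical; the durations could come from whatever audio metadata the loader already reads.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Clip:
    path: str
    sentence: str
    seconds: float


def slow_alternative(current: Clip, clips: Iterable[Clip]) -> Optional[Clip]:
    """Return the longest recording of the same sentence, if it is
    longer than the current clip; otherwise None."""
    same_sentence = (c for c in clips if c.sentence == current.sentence)
    longest = max(same_sentence, key=lambda c: c.seconds, default=None)
    if longest is not None and longest.seconds > current.seconds:
        return longest
    return None
```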
  • 12

    Sometimes spacebar is needed for the answer, but its use reloads the page

    15:55 Alix Maybe a bug, space bar always moves to next, but sometimes you need to use it in the answer
    15:55 Alix And the ones with the buttons for letter selection don't seem to work